EP1668628A1 - Method for synthesizing speech - Google Patents
Method for synthesizing speech
- Publication number
- EP1668628A1 (application EP04784355A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- match
- speech
- pitch
- speech segment
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- the present invention relates generally to Text-To-Speech (TTS) synthesis.
- TTS Text-To-Speech
- the invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
- TTS Text to Speech
- concatenated text to speech synthesis allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech.
- a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the beginning of a sentence
- (input text string) may be drawn out or lengthened.
- the pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.
- the pronunciation of a word depends on at least tone (pitch), volume and duration.
- many languages include numerous possible pronunciations of individual syllables.
- a single syllable represented by a Chinese character (or other similar character based script) may have up to 6 different pronunciations.
- a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved.
- an utterance waveform corpus of all pronunciations for every character would be prohibitively large.
- the size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity such as a radio telephone or a personal digital assistant.
- the algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide a natural sounding synthesized speech whilst using a non-exhaustive utterance corpus.
- the present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus.
- a contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group.
- the prosodic feature group is redefined by deleting one of the implicit prosodic features from the group.
- the prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample is used to generate concatenative speech.
- Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- Fig. 1 there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented.
- the device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60.
- the processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70.
- An output of the speech synthesizer 70 is operatively coupled to a speaker 80.
- the text memory module is a store for storing text obtained by any receiving means possible such as by radio reception, internet, or plug in portable memory cards etc.
- the ROM stores operating code for performing the invention as described in Figures 2 and 3.
- the corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and the speaker 80, and the touch screen display 90 is a user interface that allows for display of text stored in the text memory module 20.
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language.
- the text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120).
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc., to S170).
- the matched speech sample 140 which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
- a basic phonetic match is performed matching only pinyin (Step S180).
- the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150.
- the invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110.
- each layer of the method includes a redefined prosodic feature group 220.
- a text segment 120 means any type of input text string or segment of coded language. It should not be limited to only visible text that is scanned or otherwise entered into a TTS system.
- the utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130.
- the speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140.
- Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140.
- the implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence.
- a speech sample 140 in the Chinese language may include the following implicit prosodic features 210: Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
- Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Co-articulation: the phonetic level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables.
- Syllable position: the position of a syllable in a prosodic phrase.
- Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final.
- Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140.
- Length of phrase: the number of Chinese characters included in a prosodic phrase.
- each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210.
- the character 国 (pinyin guo2), as found in a recorded sentence beginning with the prosodic phrase 中国, could be annotated as follows: Text context: the character 中 immediately preceding 国 and the character immediately following it; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: the character code for 国; and Length of phrase: 2.
- step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140.
- a contextual best match is generally defined as the closest, or an exact, match of both 1 ) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140.
- a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60.
- only when both the letters or characters and the implicit prosodic features 210 match exactly is a speech sample 140 selected immediately as an element for use in the concatenating algorithm 150.
- the method of the present invention determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig.
- one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed and thus broadened to include other speech samples 140 that possess the next most preferable features 210 found in the input text segment 120.
- the matching rules are broadened by deleting the one feature 210 found in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120.
- the next most preferable features 210 found in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210.
- the order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size.
- the variable BestPitch may be determined based on a statistical analysis of the utterance waveform corpus 130.
- a corpus 130 may include five tones, each having an average pitch.
- Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available.
- the best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empiricalValue (Eq. 2)
- pitch_tone: the average pitch, including tone, of the utterance waveform corpus
- nIndex: the index of the text segment 120 in a prosody phrase
- empiricalValue: an empirical value based on the utterance waveform corpus.
- the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130.
- the invention is suitable for many languages.
- the implicit prosodic features 210 would need to be deleted or redefined from the examples given hereinabove.
- the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language.
- the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English.
- the present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130.
- when searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match.
- the method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A method of performing speech synthesis that includes comparing a text segment (120) with an utterance waveform corpus (60) that contains numerous speech samples (140). The method determines whether there is a contextual best match between the text segment (120) and one speech sample (140). If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment (120) and a speech sample (140). A contextual phonetic hybrid match requires a match of all implicit prosodic features (210) in a defined prosodic feature group (220). If a match is still not found, the prosodic feature group (220) is redefined by deleting one of the implicit prosodic features (210) from the prosodic feature group (220). The prosodic feature group (220) is successively redefined by deleting one implicit prosodic feature (210) from the group (220) until a match is found between the input text segment (120) and a speech sample (140). When a match is found, the matched speech sample (140) is used to generate concatenative speech (110).
Description
METHOD FOR SYNTHESIZING SPEECH
FIELD OF THE INVENTION The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
BACKGROUND OF THE INVENTION Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the beginning of a sentence
(input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required. In most languages the pronunciation of a word depends on at least tone (pitch), volume and duration. Furthermore many languages include numerous possible pronunciations of individual syllables. Typically a single syllable represented by a Chinese character (or other similar character based script) may have up to 6 different pronunciations. Furthermore, in order to provide a realistic synthesized utterance of each pronunciation, a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved. Thus an
utterance waveform corpus of all pronunciations for every character would be prohibitively large. In most TTS systems there is a need to determine the appropriate pronunciation of an input text string based on comparisons with a limited size utterance waveform corpus. The size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity such as a radio telephone or a personal digital assistant. The algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide natural sounding synthesized speech whilst using a non-exhaustive utterance corpus.
SUMMARY OF THE INVENTION The present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus. A contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group. If a match is still not found, the prosodic feature group is redefined by deleting one of the implicit prosodic features from the prosodic feature group. The prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample
is used to generate concatenative speech.
BRIEF DESCRIPTION OF THE DRAWINGS Other aspects of the present invention will become apparent from the following detailed description taken together with the drawings, wherein like reference characters designate like or corresponding elements or steps throughout the drawings, in which: Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented; Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language; and Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION Referring to Fig. 1 there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented. The device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read Only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60. The processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70. An output of the speech synthesizer 70 is operatively coupled to a speaker 80. As will be apparent to a person skilled in the art, the text memory module is a store for storing text obtained by any receiving means possible such as by radio reception, internet, or plug in portable memory cards etc. The ROM stores operating code for performing the invention as described in Figures 2 and 3. Also the corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and
speaker 80, and the touch screen display 90 is a user interface that allows for display of text stored in the text memory module 20. Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language. The text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120). Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match. A contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc. to S170). When a contextual phonetic hybrid match is found, the matched speech sample 140, which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
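The layered search of Figs. 2 and 3 can be pictured as a short loop over progressively smaller prosodic feature groups. The following is a minimal illustrative sketch only, assuming that corpus entries and the target syllable are plain dictionaries keyed by the annotation features described further below; the helper and constant names are hypothetical, and the deletion order beyond "length of phrase" (which the text deletes first) is assumed.

```python
# Illustrative sketch of the hybrid-match layers (Steps S120 to S180).
# Feature names mirror the annotations described below; the relaxation order
# is an assumption except for length_of_phrase, which the text deletes first.

ALL_FEATURES = ["pinyin", "tone_context", "co_articulation", "syllable_position",
                "phrase_position", "character_symbol", "length_of_phrase"]

RELAXATION_ORDER = ["length_of_phrase", "character_symbol", "phrase_position",
                    "syllable_position", "co_articulation", "tone_context"]

def hybrid_matches(target, corpus, feature_group):
    """Corpus entries whose annotations equal the target on every feature in the group."""
    return [entry for entry in corpus
            if all(entry.get(f) == target.get(f) for f in feature_group)]

def select_sample(target, corpus):
    """Try the full prosodic feature group, then relax it one feature at a time."""
    feature_group = list(ALL_FEATURES)
    candidates = hybrid_matches(target, corpus, feature_group)   # Step S120
    for feature in RELAXATION_ORDER:
        if candidates:
            # With several candidates, Eq. 1 (below) would pick the optimal one;
            # the first candidate stands in for that choice here.
            return candidates[0]
        feature_group.remove(feature)                            # Steps S130 to S170
        candidates = hybrid_matches(target, corpus, feature_group)
    # feature_group is now ["pinyin"] only: the basic phonetic match (Step S180),
    # which the corpus is said to be designed never to fail.
    return candidates[0] if candidates else None
```

A real embedded implementation would index the corpus by pinyin rather than scan it linearly, and would fold the Step S110 contextual best match in as a preliminary layer, but the control flow would be the same.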
As shown in Fig. 3, if all of the implicit prosodic features 210 except pinyin are successively deleted from the prosodic feature group 220 and still no match is found, then a basic phonetic match is performed matching only pinyin (Step S180). In one embodiment of the present invention the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150. The invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110. Each layer of the method includes a redefined prosodic feature group 220. For purposes of the present invention a text segment 120 means any type of input text string or segment of coded language. It should not be limited to only visible text that is scanned or otherwise entered into a TTS system. The utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130. The speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140. Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140. The implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence. For example, a speech sample 140 in the Chinese language may include the following implicit prosodic features 210: Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140. Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140. Co-articulation: the phonetic level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables. Syllable position: the position of a syllable in a prosodic phrase. Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final. Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140. Length of phrase: the number of Chinese characters included in a prosodic phrase. For an example of the specific values of the above implicit prosodic features 210, consider a sentence in Chinese that begins with the two-character prosodic phrase 中国. If a spoken audio recording of that sentence were stored in an utterance waveform corpus 130, each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210. For example, the character 国 as found in that sentence could be annotated as follows: Text context: the character 中 immediately preceding 国 and the character immediately following it; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: the character code for 国; and Length of phrase: 2.
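For illustration only, such an annotation record could be represented as a small data structure like the sketch below. The field names are inventions of this note rather than the patent's, and the example values mirror the 国 annotation above; the character following 国 and the exact character-code convention are not given here, so they appear as flagged placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SpeechSampleAnnotation:
    character: str                    # the character whose waveform this sample is
    text_context: Tuple[str, str]     # characters immediately preceding / following
    pinyin: str                       # phonetic (pinyin) representation, e.g. "guo2"
    tone_context: Tuple[int, int]     # tones of the preceding / following characters
    co_articulation: Tuple[str, str]  # phonetic units immediately before / after
    syllable_position: int            # position of the syllable in its prosodic phrase
    phrase_position: int              # 1 = sentence initial in the example;
                                      # medial/final encodings are assumed
    character_symbol: int             # character code for the annotated character
    length_of_phrase: int             # number of characters in the prosodic phrase

# The 国 example above, with a placeholder for the unspecified following character.
guo_sample = SpeechSampleAnnotation(
    character="国",
    text_context=("中", "?"),         # "?" marks the character not given in the text
    pinyin="guo2",
    tone_context=(1, 3),
    co_articulation=("ong", "h"),
    syllable_position=2,
    phrase_position=1,
    character_symbol=ord("国"),       # a Unicode code point stands in for the code
    length_of_phrase=2,
)
```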
In Fig. 2, step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140. A contextual best match is generally defined as the closest, or an exact, match of both 1) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140. In more general terms a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60. Only when both the letters or characters and the implicit prosodic features 210 match exactly is a speech sample 140 selected immediately as an element for use in the concatenating algorithm 150. When a contextual best match is not found, the method of the present invention then determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140. As described above, a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig. 3, one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed and thus broadened to include other speech samples 140 that possess the next most
preferable features 210 found in the input text segment 120. In other words, the matching rules are broadened by deleting the one feature 210 found in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120. For example, as shown in Step S130 in both Fig. 2 and Fig. 3, the next most preferable features 210 found in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210. The order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size. According to the present invention, after the utterance waveform corpus 130 has been compared with a text segment 120 using a particular defined prosodic feature group 220, it is possible that the annotations of multiple speech samples 140 will be found to match the analyzed text segment 120. In such a case, an optimal contextual phonetic hybrid match may be selected by using the following equation: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 (Eq. 1) where Wp = weight of the pitch of the text segment 120; Wd = weight of the duration of the text segment 120; diff = differential value for selecting an optimal contextual phonetic hybrid match; pitch = pitch of the text segment 120; BestPitch = pitch of an ideal text segment 120; dur = duration of the text segment 120; and BestDur = duration of the ideal text segment 120.
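As a hedged illustration, Eq. 1 maps onto a small scoring function; the default weights below are placeholders, since no particular values for Wp and Wd are stated in the text.

```python
def diff_segment(pitch, dur, best_pitch, best_dur, w_p=1.0, w_d=1.0):
    """Eq. 1: weighted squared relative deviation of a candidate sample's pitch
    and duration from the ideal BestPitch and BestDur values for this context."""
    return (w_p * ((pitch - best_pitch) / best_pitch) ** 2
            + w_d * ((dur - best_dur) / best_dur) ** 2)

# Example: a candidate at 210 Hz / 0.18 s scored against ideals of 200 Hz / 0.20 s.
score = diff_segment(210.0, 0.18, best_pitch=200.0, best_dur=0.20)
```

Lower values of diff indicate candidates whose prosody deviates less from the ideal, so the candidate with the smallest value is preferred.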
In the above Equation 1, the variable BestPitch may be determined
based on a statistical analysis of the utterance waveform corpus 130. For example, a corpus 130 may include five tones, each having an average pitch. Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available. The best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empiricalValue (Eq. 2) where pitch_tone = the average pitch including tone of the utterance waveform corpus; nIndex = the index of the text segment 120 in a prosody phrase; and empiricalValue = an empirical value based on the utterance waveform corpus. The empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130. Similarly, the duration of an ideal text segment 120 may be determined using the following equation: BestDur = dur_s × f_s - nIndex × empiricalValue (Eq. 3) where dur_s = the average duration of the text segment 120 without tone; nIndex = the index of the text segment 120 in a prosody phrase; f_s = a coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus. Again, the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130. The differential value for a word, diffW, may be the summation of the differential values for each syllable in the word. That may be represented in mathematical terms by the following equation: diffW = Σ_k diff_k (Eq. 4)
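A compact sketch of Eqs. 2 through 4 follows. The empirical value of 4 comes from the text above, while the argument names are illustrative assumptions rather than the patent's own identifiers.

```python
def best_pitch(pitch_tone, n_index, empirical_value=4.0):
    """Eq. 2: BestPitch = pitch_tone - nIndex x empiricalValue."""
    return pitch_tone - n_index * empirical_value

def best_dur(dur_s, f_s, n_index, empirical_value=4.0):
    """Eq. 3: BestDur = dur_s x f_s - nIndex x empiricalValue."""
    return dur_s * f_s - n_index * empirical_value

def diff_word(syllable_diffs):
    """Eq. 4: the word-level differential is the sum of its per-syllable
    differentials (each computed with Eq. 1)."""
    return sum(syllable_diffs)
```

As described next, when several candidate words match, the one whose word-level differential is lowest (Eq. 5) is chosen, optionally subject to a preset threshold.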
As described above, if several speech samples 140 are found to match a particular text segment 120, the system will choose the speech sample 140 whose differential value is lowest. That may be represented in mathematical terms by the following equation: diffW_min = min_i diffW_i (Eq. 5). Further, the method of the present invention may include the use of preset thresholds for the differential value diffW. If the differential value for a matched speech sample 140 is below a particular threshold, the method will route the matched speech sample 140 to the concatenating algorithm 150 for generating the concatenative speech 110. Otherwise, the method may require relaxing the constraints on the contextual phonetic hybrid match by deleting one of the required implicit prosodic features 210 and continuing to search for a match. Although the above description concerns a specific example of the method of the present invention for the Chinese language, the invention is suitable for many languages. For some languages the implicit prosodic features 210 would need to be deleted or redefined from the examples given hereinabove. For example, the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language. Also, the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English. The present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130. When searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively
relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match. The method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited. Although exemplary embodiments of a method of the present invention have been illustrated in the accompanying drawings and described in the foregoing description, it is to be understood that the invention is not limited to the embodiments disclosed; rather the invention can be varied in numerous ways, particularly concerning applications in languages other than
Chinese. It should, therefore, be recognized that the invention should be limited only by the scope of the following claims.
Claims
1. A method for performing speech synthesis on a text segment, the method being performed on an electronic device, the method comprising: comparing a text segment with an utterance waveform corpus, said utterance waveform corpus comprising a plurality of speech waveform samples; determining a best match between consecutive syllables in the text segment and attributes associated with sampled speech waveform utterances, the best match being determined by identifying the greatest number of consecutive syllables that are identical to the attributes and attribute positions in each of the waveform utterances; ascertaining a suitable match for each unmatched syllable in the text segment, each unmatched syllable being a syllable that is not one of the consecutive syllables and the suitable match being determined from a comparison of prosodic features in a prosodic feature group with the attributes associated with sampled speech waveform utterances, wherein the ascertaining is characterized by successively removing the prosodic features from the prosodic feature group until there is said suitable match; and generating concatenated synthesized speech for the text segment by using the speech waveform samples in the corpus, the speech waveform samples being selected from the best match between consecutive syllables and the suitable match for each unmatched syllable.
2. The method of claim 1, wherein the prosodic features include features selected from the group consisting of text context, pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase.
3. The method of claim 1, wherein the prosodic features comprise tone context, co-articulation, syllable position, phrase position, and character symbol.
4. The method of claim 1, further comprising the step of performing a basic phonetic match based on only pinyin after all of said other prosodic features have been successively removed.
5. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous best matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
6. The method of claim 5, wherein the BestPitch is determined using the formula: BestPitch = pitch_tone - nIndex × empiricalValue where pitch_tone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalValue = an empirical value based on said utterance waveform corpus.
7. The method of claim 5, wherein the BestDur is determined using the formula: BestDur = dur_s × f_s - nIndex × empiricalValue where dur_s = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; f_s = the coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus.
8. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous suitable matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
9. The method of claim 8, wherein said optimal contextual phonetic hybrid match is the match having the lowest differential value (diff).
10. The method of claim 8, wherein said differential value (diff) for selecting said optimal contextual phonetic hybrid match is compared with a preset threshold.
11. The method of claim 8, wherein the BestPitch is determined using the formula: BestPitch = pitch_tone - nIndex × empiricalValue where pitch_tone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalValue = an empirical value based on said utterance waveform corpus.
12. The method of claim 8, wherein the BestDur is determined using the formula:
BestDur = dur_s × f_s - nIndex × empiricalValue where dur_s = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; f_s = the coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB031326986A CN1260704C (en) | 2003-09-29 | 2003-09-29 | Method for voice synthesizing |
PCT/US2004/030467 WO2005034082A1 (en) | 2003-09-29 | 2004-09-17 | Method for synthesizing speech |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1668628A1 true EP1668628A1 (en) | 2006-06-14 |
EP1668628A4 EP1668628A4 (en) | 2007-01-10 |
Family
ID=34398359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP04784355A Withdrawn EP1668628A4 (en) | 2003-09-29 | 2004-09-17 | Method for synthesizing speech |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1668628A4 (en) |
KR (1) | KR100769033B1 (en) |
CN (1) | CN1260704C (en) |
MX (1) | MXPA06003431A (en) |
WO (1) | WO2005034082A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
Families Citing this family (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
TWI421857B (en) * | 2009-12-29 | 2014-01-01 | Ind Tech Res Inst | Apparatus and method for generating a threshold for utterance verification and speech recognition system and utterance verification system |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
KR20140008870A (en) * | 2012-07-12 | 2014-01-22 | 삼성전자주식회사 | Method for providing contents information and broadcasting receiving apparatus thereof |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
CN105989833B (en) * | 2015-02-28 | 2019-11-15 | 讯飞智元信息科技有限公司 | Multilingual mixed this making character fonts of Chinese language method and system |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
CN106157948B (en) * | 2015-04-22 | 2019-10-18 | 科大讯飞股份有限公司 | A kind of fundamental frequency modeling method and system |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
CN105096934B (en) * | 2015-06-30 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
CN106534528A (en) * | 2016-11-04 | 2017-03-22 | 广东欧珀移动通信有限公司 | A text information processing method, device and mobile terminal |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
CN107481713B (en) * | 2017-07-17 | 2020-06-02 | 清华大学 | A kind of mixed language speech synthesis method and apparatus |
CN109948124B (en) * | 2019-03-15 | 2022-12-23 | 腾讯科技(深圳)有限公司 | Voice file segmentation method and device and computer equipment |
CN110942765B (en) * | 2019-11-11 | 2022-05-27 | 珠海格力电器股份有限公司 | Method, device, server and storage medium for constructing corpus |
CN111128116B (en) * | 2019-12-20 | 2021-07-23 | 珠海格力电器股份有限公司 | Voice processing method and device, computing equipment and storage medium |
US20210350788A1 (en) * | 2020-05-06 | 2021-11-11 | Samsung Electronics Co., Ltd. | Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device |
CN113393829B (en) * | 2021-06-16 | 2023-08-29 | 哈尔滨工业大学(深圳) | Chinese speech synthesis method integrating rhythm and personal information |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6449622A (en) * | 1987-08-19 | 1989-02-27 | Jsp Corp | Resin foaming particle containing crosslinked polyolefin-based resin and manufacture thereof |
US5704007A (en) * | 1994-03-11 | 1997-12-30 | Apple Computer, Inc. | Utilization of multiple voice sources in a speech synthesizer |
US6134528A (en) * | 1997-06-13 | 2000-10-17 | Motorola, Inc. | Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations |
KR100259777B1 (en) * | 1997-10-24 | 2000-06-15 | 정선종 | Optimal synthesis unit selection method in text-to-speech system |
US7283964B1 (en) * | 1999-05-21 | 2007-10-16 | Winbond Electronics Corporation | Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition |
DE60215296T2 (en) * | 2002-03-15 | 2007-04-05 | Sony France S.A. | Method and apparatus for the speech synthesis program, recording medium, method and apparatus for generating a forced information and robotic device |
JP2003295882A (en) * | 2002-04-02 | 2003-10-15 | Canon Inc | Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor |
KR100883649B1 (en) * | 2002-04-04 | 2009-02-18 | 삼성전자주식회사 | Text-to-speech device and method |
GB2388286A (en) * | 2002-05-01 | 2003-11-05 | Seiko Epson Corp | Enhanced speech data for use in a text to speech system |
CN1320482C (en) * | 2003-09-29 | 2007-06-06 | 摩托罗拉公司 | Natural voice pause in identification text strings |
-
2003
- 2003-09-29 CN CNB031326986A patent/CN1260704C/en not_active Expired - Lifetime
-
2004
- 2004-09-17 MX MXPA06003431A patent/MXPA06003431A/en not_active Application Discontinuation
- 2004-09-17 EP EP04784355A patent/EP1668628A4/en not_active Withdrawn
- 2004-09-17 KR KR1020067006170A patent/KR100769033B1/en not_active Expired - Lifetime
- 2004-09-17 WO PCT/US2004/030467 patent/WO2005034082A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Non-Patent Citations (6)
Title |
---|
HELEN M MENG ET AL: "CU VOCAL: CORPUS-BASED SYLLABLE CONCATENATION FOR CHINESE SPEECH SYNTHESIS ACROSS DOMAINS AND DIALECTS" ICSLP 2002 : 7TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. DENVER, COLORADO, SEPT. 16 - 20, 2002, vol. 4 OF 4, 16 September 2002 (2002-09-16), pages 2373-2376, XP007011576 ISBN: 1-876346-40-X * |
HIROKAWA T ET AL: "HIGH QUALITY SPEECH SYNTHESIS SYSTEM BASED ON WAVEFORM CONCATENATION OF PHONEME SEGMENT" IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES, ENGINEERING SCIENCES SOCIETY, TOKYO, JP, vol. 76A, no. 11, 1 November 1993 (1993-11-01), pages 1964-1970, XP000420615 ISSN: 0916-8508 * |
REN-HUA WANG ET AL.: "A CORPUS-BASED CHINESE SPEECH SYNTHESIS WITH CONTEXTUAL DEPENDENT UNIT SELECTION" IEEE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), vol. 2, 16 October 2000 (2000-10-16), pages 391-394, XP007010255 * |
See also references of WO2005034082A1 * |
WEIBIN ZHU ET AL: "Corpus building for data-driven tts systems" SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 199-202, XP010653645 ISBN: 0-7803-7395-2 * |
WOEI-LUEN PERNG ET AL: "Image Talk: a real time synthetic talking head using one single image with Chinese text-to-speech capability" COMPUTER GRAPHICS AND APPLICATIONS, 1998. PACIFIC GRAPHICS '98. SIXTH PACIFIC CONFERENCE ON SINGAPORE 26-29 OCT. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 26 October 1998 (1998-10-26), pages 140-148, XP010315487 ISBN: 0-8186-8620-0 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
Also Published As
Publication number | Publication date |
---|---|
EP1668628A4 (en) | 2007-01-10 |
MXPA06003431A (en) | 2006-06-20 |
KR20060066121A (en) | 2006-06-15 |
CN1604182A (en) | 2005-04-06 |
KR100769033B1 (en) | 2007-10-22 |
WO2005034082A1 (en) | 2005-04-14 |
CN1260704C (en) | 2006-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100769033B1 (en) | Method for synthesizing speech | |
US5949961A (en) | Word syllabification in speech synthesis system | |
US6823309B1 (en) | Speech synthesizing system and method for modifying prosody based on match to database | |
US6684187B1 (en) | Method and system for preselection of suitable units for concatenative speech | |
US6029132A (en) | Method for letter-to-sound in text-to-speech synthesis | |
US6505158B1 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US6243680B1 (en) | Method and apparatus for obtaining a transcription of phrases through text and spoken utterances | |
US6910012B2 (en) | Method and system for speech recognition using phonetically similar word alternatives | |
EP0833304B1 (en) | Prosodic databases holding fundamental frequency templates for use in speech synthesis | |
KR100403293B1 (en) | Speech synthesizing method, speech synthesis apparatus, and computer-readable medium recording speech synthesis program | |
JP3481497B2 (en) | Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words | |
EP1213705A2 (en) | Method and apparatus for speech synthesis without prosody modification | |
WO1996023298A2 (en) | System amd method for generating and using context dependent sub-syllable models to recognize a tonal language | |
JP5198046B2 (en) | Voice processing apparatus and program thereof | |
JPH0916602A (en) | Translation system and its method | |
WO2006106182A1 (en) | Improving memory usage in text-to-speech system | |
JP3576066B2 (en) | Speech synthesis system and speech synthesis method | |
Akinwonmi | Development of a prosodic read speech syllabic corpus of the Yoruba language | |
CN114999447A (en) | Speech synthesis model based on confrontation generation network and training method | |
Hendessi et al. | A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM | |
Kaur et al. | BUILDING AText-TO-SPEECH SYSTEM FOR PUNJABI LANGUAGE | |
JP2005534968A (en) | Deciding to read kanji | |
GB2292235A (en) | Word syllabification. | |
Bharthi et al. | Unit selection based speech synthesis for converting short text message into voice message in mobile phones | |
JP2003345372A (en) | Method and device for synthesizing voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20060323 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): DE FR GB IT |
|
DAX | Request for extension of the european patent (deleted) | ||
RBV | Designated contracting states (corrected) |
Designated state(s): DE FR GB IT |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 20061208 |
|
17Q | First examination report despatched |
Effective date: 20070907 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20080118 |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230520 |