
GB2393369A - A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system - Google Patents

A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system

Info

Publication number
GB2393369A
GB2393369A GB0221945A GB0221945A GB2393369A GB 2393369 A GB2393369 A GB 2393369A GB 0221945 A GB0221945 A GB 0221945A GB 0221945 A GB0221945 A GB 0221945A GB 2393369 A GB2393369 A GB 2393369A
Authority
GB
United Kingdom
Prior art keywords
pronunciation
dictionary
word
words
implementing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0221945A
Other versions
GB0221945D0 (en)
Inventor
John Anderton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seiko Epson Corp
Original Assignee
Seiko Epson Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seiko Epson Corp filed Critical Seiko Epson Corp
Priority to GB0221945A priority Critical patent/GB2393369A/en
Publication of GB0221945D0 publication Critical patent/GB0221945D0/en
Priority to AU2003267582A priority patent/AU2003267582A1/en
Priority to PCT/GB2003/004037 priority patent/WO2004027757A1/en
Priority to EP03748274A priority patent/EP1454313B1/en
Priority to DE60309131T priority patent/DE60309131T2/en
Publication of GB2393369A publication Critical patent/GB2393369A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A typical TTS system converts text to speech by analysing normalised text morphologically and syntactically to determine the words or morphological tokens and their syntax, and by converting these into a notation which represents the pronunciation. In the prior art, a pronunciation module 6 of the TTS system determines whether the word or token is present in a dictionary 2 and, if so, obtains the pronunciation; if not, it applies pronunciation rules to determine the pronunciation and stores the word in an exceptions log 4. The present invention relates to adapting the contents of the dictionary 2 by adding from the log 4 those words most frequently used and deleting from the dictionary 2 those words which are less frequently used. The present invention is particularly advantageous where memory size is restricted, such as in mobile phones, mobile computers and digital still cameras. Pronunciation of words added to the dictionary may be obtained from a user or from a remote dictionary.

Description

A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system
The present invention relates to a method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system.
A text to speech (TTS) system converts text to speech and involves determining the correct pronunciation. Figure 1 illustrates a typical TTS system incorporating four typical processing steps. The input text is analysed, segmented and normalised in the first processing step. In the second step, the normalised text is analysed morphologically and syntactically to determine the words or morphological tokens and their syntax, and to convert these into a notation which represents the pronunciation. That notation, or linguistic text, is combined with prosodic parameters in step three. Finally, the linguistic text together with the prosody is synthesised using the pronunciation notation to output speech corresponding to the input text.
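By way of illustration only, the four steps might be organised as in the following Python sketch; the stage functions and their trivial placeholder bodies are assumptions made for the sake of a runnable example and are not part of any particular TTS system.

    def step1_normalise(text):
        # Step 1: analyse, segment and normalise the raw input text.
        return text.strip().lower().split()

    def step2_pronounce(tokens):
        # Step 2: map each token to a pronunciation notation
        # (placeholder: spell the token out letter by letter).
        return [" ".join(token) for token in tokens]

    def step3_prosody(notations):
        # Step 3: combine the linguistic text with prosodic parameters
        # (placeholder: a flat pitch contour).
        return [(notation, {"pitch": 1.0}) for notation in notations]

    def step4_synthesise(units):
        # Step 4: synthesise speech (placeholder: return the notation
        # that would be spoken).
        return " | ".join(notation for notation, _ in units)

    def text_to_speech(text):
        return step4_synthesise(step3_prosody(step2_pronounce(step1_normalise(text))))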
The present invention is directed towards such a TTS system and involves an improvement of at least step 2 in determining the pronunciation.
Hitherto, there have been proposed two methods for determining the pronunciation.
The first method is to use pronunciation rules. These rules are typically developed (either manually or automatically) based on extensive knowledge of and exposure to the language used in the TTS system. Examples of the construction of such rules can be found in the following references:
[1] The CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
[2] "Phonemic transcription by analogy in text to speech synthesis: novel word pronunciation and lexicon compression" by Bagshaw, Computer Speech and Language (1998), volume 12, pages 119-142.
Historically, most early TTS systems used such rules extensively. The primary advantage of this method is its low memory requirement. As the quality of TTS systems improved, pronunciation errors from the rule-based method became more apparent. The art then developed the second method, which uses a dictionary. The dictionary or lexicon stores a potentially large number of input words together with the associated pronunciation,
preferably for all syntactic variations of the word. A typical example of two entries in such a dictionary is as follows:

Orthographic Representation    Syntactic Category    Pronunciation
record                         verb                  R AH0 K AO1 R D
record                         noun                  R EH1 K ER0 D

Each entry comprises the word in the form of an orthographic representation, a field
defining the syntactic category of the word and the pronunciation, in this case using phonetic notation. Thus, the above two entries show the two pronunciations for the word "record" for the two different syntactic categories. Entries for many other words can be found in reference [1] noted above.
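A minimal sketch of how such entries might be held in memory, assuming a Python mapping keyed by the orthographic representation and the syntactic category; the names used here are illustrative rather than taken from any actual system.

    # Hypothetical in-memory representation of the two example entries above.
    pronunciation_dictionary = {
        ("record", "verb"): "R AH0 K AO1 R D",
        ("record", "noun"): "R EH1 K ER0 D",
    }

    def look_up(word, category):
        # Return the stored pronunciation, or None if the word is absent,
        # in which case the TTS system falls back to pronunciation rules.
        return pronunciation_dictionary.get((word.lower(), category))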
If a word is not found in the dictionary, then the TTS reverts to determining the pronunciation using the pronunciation rules.
It has been found that the dictionary method produces higher quality speech than the pronunciation rule method. However, the dictionary method requires a large memory to store the dictionary. In certain applications, a trade-off must be made between the speech quality and the size of the memory which can be used. Previously, the most common way of addressing this trade-off has been to design the dictionary so as to include only those words which are most likely to be used in that application. The most frequently used words must then be analysed for the particular application. Such a trade-off, and alternative ways of addressing it, are described in reference [2] above and in the following references:
[3] "Speech Technology for Communications" by Westall, Johnston & Lewis, 1998, ISBN 0 412 79080 7, Chapter 6.
[4] "Letter to sound rules for accented lexicon compression" Pagel, Lenzo & Black ESCA98, 3rd International Workshop on Speech Syntheses, November 1998.
Thus, the present invention aims to address this trade-off and improve the balance between speech quality and size of memory. In essence, the present invention relates to adapting the contents of the memory by adding to or maintaining those words most frequently used and deleting those words which are less frequently used.
Accordingly, the present invention relates to a method of implementing a text to speech (TTS) system including a pronunciation dictionary and a memory containing pronunciation rules, said method comprising: comparing each word in said text to words in said pronunciation dictionary to identify if the identical word is present; if the word is identified, then obtaining the pronunciation of said word; if the word is not identified, then applying said pronunciation rules to said word to obtain the pronunciation of said word; wherein the improvement lies in storing each word which is not identified in an exceptions log; and adding the most frequently referred words from the exceptions log to the pronunciation dictionary. In many instances, the size of the memory is restricted. Thus, the method enables the dictionary stored in the memory to be adapted so as to retain only those most frequently used words. This obviates the need for an initial analysis of the application of the TTS system.
Moreover, in many applications where the memory size is severely limited such as in a mobile telephone, the speech quality can be improved considerably.
Accordingly, the present invention also relates to a mobile telephone incorporating a TTS system as defined in the attached claims.
Embodiments of the present invention will now be described by way of further example only and with reference to the accompanying drawings, in which:
Figure 1 is an illustration of a TTS system to which the present invention may be applied;
Figure 2 is an illustration of the present invention in determining pronunciation;
Figure 3 is an illustration of the interaction between the pronunciation determining stage and the dictionary update stage;
Figure 4 is an illustration of the dictionary update stage;
Figure 5 is a schematic view of a mobile telephone incorporating a TTS system according to the present invention;
Figure 6 is a schematic view of a mobile personal computer incorporating a TTS system according to the present invention; and
Figure 7 is a schematic view of a digital camera incorporating a TTS system according to the present invention.
As discussed above, a typical TTS system as shown in Figure 1 converts text to speech and, in step 2, analyses the normalised text morphologically and syntactically to determine the words or morphological tokens and their syntax, converting these into a notation which represents the pronunciation. The TTS system determines whether the word or token is present in the dictionary and, if so, obtains the pronunciation; if not, it applies pronunciation rules to determine the pronunciation. The present invention provides an improvement as shown in Figure 2.
In the first instance, in the embodiment, the memory 2 containing the dictionary is full and an exceptions log 4 is empty. The memory 2 contains for each entry at least an orthographic representation of the word, the syntactic category and the pronunciation in accordance with the prior art. In addition, the memory 2 contains two further fields for each
entry, namely a long term occurrence count and a current occurrence count. The long term occurrence count stores the average number of tokens analysed between each occurrence of a token. The current occurrence count stores the current number of tokens analysed since the last occurrence of a token.
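One possible entry structure covering these fields is sketched below, assuming a Python dataclass; the field names are illustrative assumptions rather than terms used by the invention.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        orthography: str         # orthographic representation of the word/token
        syntactic_category: str  # e.g. "noun" or "verb"
        pronunciation: str       # phonetic notation
        long_term_count: float = 0.0  # long term occurrence statistic
        current_count: int = 0        # occurrence count for the current period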
The exceptions log 4 similarly contains for each entry at least the orthographic representation of the word or token, the syntactic category and the pronunciation together with the long term occurrence count and the current occurrence count.
A pronunciation module 6 receives the word or morphological token. The module 6 takes each token and searches the dictionary 2 to identify if the token is present in the dictionary 2. If the token is present, then the module obtains the pronunciation. A dictionary search and statistics module 8 increments the current occurrence count for that entry if the token is present in the dictionary. Either module 6 or 8 may search the dictionary 2.
If the token is not present in the dictionary 2, then the pronunciation module 6 searches the exceptions log 4 to identify if the token is present in the exceptions log.
If the token is present, then the module 6 obtains the pronunciation from the exceptions log 4. In addition, if the token is present in the exceptions log, then the current occurrence count for that entry is incremented.
If the token is not present, then the module applies the pronunciation rules to obtain the pronunciation for the token. The pronunciation rules are stored in a memory (not shown) integral with or coupled to the pronunciation module 6. In addition, if the token is not present and there is space in the exceptions log, then the token is added to the exceptions log. If there is no space in the exceptions log, then the module 6 identifies the token already stored in the exceptions log with the lowest weighted occurrence statistic, calculated as follows:

k1 * current occurrence count + k2 * long term occurrence count

If the lowest weighted occurrence statistic is below a threshold k3, then the new token replaces that stored in the exceptions log. If the lowest weighted occurrence statistic is above the threshold k3, then no action is taken with regard to the new token.
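The lookup and exceptions log handling described above might be sketched as follows, assuming the Entry structure given earlier, dictionaries keyed by (token, syntactic category), and illustrative values for the constants k1, k2 and k3 and for the log capacity; the letter-to-sound placeholder simply spells the token out.

    K1, K2, K3 = 1.0, 0.5, 10.0  # assumed weights and threshold, not from the source
    MAX_LOG_SIZE = 100           # assumed capacity of the exceptions log 4

    def apply_pronunciation_rules(token):
        # Placeholder letter-to-sound rule: spell the token out.
        return " ".join(token.upper())

    def pronounce(token, category, dictionary, exceptions_log):
        entry = dictionary.get((token, category))
        if entry is not None:
            entry.current_count += 1  # dictionary search and statistics module 8
            return entry.pronunciation
        entry = exceptions_log.get((token, category))
        if entry is not None:
            entry.current_count += 1
            return entry.pronunciation
        # Token not found anywhere: fall back to the pronunciation rules.
        pron = apply_pronunciation_rules(token)
        new_entry = Entry(token, category, pron, long_term_count=0.0, current_count=1)
        if len(exceptions_log) < MAX_LOG_SIZE:
            exceptions_log[(token, category)] = new_entry
        else:
            # Find the entry with the lowest weighted occurrence statistic
            # k1 * current count + k2 * long term count, and replace it
            # only if that statistic falls below the threshold k3.
            key, weakest = min(
                exceptions_log.items(),
                key=lambda kv: K1 * kv[1].current_count + K2 * kv[1].long_term_count,
            )
            if K1 * weakest.current_count + K2 * weakest.long_term_count < K3:
                del exceptions_log[key]
                exceptions_log[(token, category)] = new_entry
        return pron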
Thus, the exceptions log 4 will be limited to only those entries which occur most frequently and which have not been found in the dictionary 2.
In the present invention, the dictionary 2 is updated periodically with entries from the exceptions log 4. The updating may be effected at regular intervals in time, following certain events, a combination of both, or even at exponentially changing time periods so as to update the dictionary very frequently when first initiated and less frequently thereafter, once an optimal content of the dictionary has been achieved for the application in which the TTS system is embodied.
Such events could include the pronunciation module 6 having processed a total number of tokens which exceeds a predetermined threshold, the exceptions log becoming full, the user operating the system, or the application in which the TTS system is embodied gaining access to a remote pronunciation dictionary, to be discussed in more detail below.
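One way an exponentially changing update schedule might be realised is sketched below; the doubling factor and the ceiling are assumptions, not values given in the description.

    def next_update_interval(previous_interval, factor=2.0, ceiling=86400.0):
        # Double the interval each time, up to a ceiling, so the dictionary
        # is updated very frequently when first initiated and less
        # frequently once its contents have settled.
        return min(previous_interval * factor, ceiling)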
Figure 3 illustrates the interaction between the pronunciation determining stage and the dictionary update stage. The dictionary update stage is only effected when the TTS system is not processing text and outputting speech, and vice versa. When the TTS system is to update the dictionary, following one of the events or at a specific time as discussed above, a switch 12 disables the pronunciation stage and enables the dictionary update stage.
In the first instance, as shown in Figure 4, a dictionary update control module 14 calculates the mean occurrence statistics and updates the long term occurrence count in both the pronunciation dictionary and the exceptions log. The mean occurrence statistics are updated as a weighted function as follows:

long term occurrence count = k4 * long term occurrence count + k5 * current occurrence count

The constants k4 and k5 may be different for the exceptions log and for the pronunciation dictionary, and they may be fixed or variable. For example, the constants may change depending on the number of tokens processed since the dictionary was last updated.
The current occurrence count for each entry is reset to zero.
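A sketch of this statistics update, assuming the Entry structure given earlier and illustrative values for the constants k4 and k5.

    K4, K5 = 0.9, 0.1  # assumed weights; they may differ for log and dictionary

    def update_statistics(entries):
        for entry in entries:
            # long term count = k4 * long term count + k5 * current count
            entry.long_term_count = K4 * entry.long_term_count + K5 * entry.current_count
            entry.current_count = 0  # reset for the next period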
Having updated the long term occurrence count and reset the current occurrence count, the dictionary update control module 14 collates a deletion candidate list from the pronunciation dictionary 2 comprising all entries whose long term occurrence count is below a deletion candidate threshold, and an addition candidate list from the exceptions log comprising all entries whose long term occurrence count is above an addition candidate threshold. The addition candidate threshold may or may not be set to zero.
Thus, tokens in the deletion list are those which are rarely encountered by the TTS system. Hence, their deletion from the pronunciation dictionary causes no significant speech quality degradation.
The addition candidate list is sorted in descending order according to the long term occurrence count, that is to say with the tokens most likely to be added first. The deletion candidate list is sorted in ascending order according to the long term occurrence count, that is to say with the most likely deletion candidates first.
The dictionary update control module 14 analyses the addition candidate list and the deletion candidate list entry by entry, with the most likely addition candidate replacing the next deletion candidate if the addition candidate's long term occurrence count is greater than the deletion candidate's long term occurrence count.
The number of entries in the addition list may or may not equal the number of entries in the deletion list. The various thresholds and constants may be adapted depending upon previous dictionary update stages.
Whenever a token from the addition list replaces a deletion candidate, then the addition candidate in the exceptions log is deleted and the long term occurrence count is reset to zero.
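The collation and pairwise replacement described above might be sketched as follows; the thresholds are illustrative assumptions and the verification step is a placeholder for the pronunciation acquisition discussed next.

    DELETION_THRESHOLD = 0.5  # assumed deletion candidate threshold
    ADDITION_THRESHOLD = 0.0  # assumed addition candidate threshold

    def verify_pronunciation(entry):
        # Placeholder for the pronunciation acquisition module 16
        # (user dialogue or remote dictionary, described below).
        return entry.pronunciation

    def update_dictionary(dictionary, exceptions_log):
        deletions = sorted(
            (e for e in dictionary.values() if e.long_term_count < DELETION_THRESHOLD),
            key=lambda e: e.long_term_count,                # most likely deletions first
        )
        additions = sorted(
            (e for e in exceptions_log.values() if e.long_term_count > ADDITION_THRESHOLD),
            key=lambda e: e.long_term_count, reverse=True,  # most likely additions first
        )
        for new, old in zip(additions, deletions):
            if new.long_term_count > old.long_term_count:
                new.pronunciation = verify_pronunciation(new)
                del dictionary[(old.orthography, old.syntactic_category)]
                del exceptions_log[(new.orthography, new.syntactic_category)]
                new.long_term_count = 0.0
                dictionary[(new.orthography, new.syntactic_category)] = new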
In addition, when a token from the addition list is to replace a deletion candidate, verification of the pronunciation of the token is required. Verification may be achieved in one of two ways, or both, and is effected by a pronunciation acquisition module 16.
The first way is to request the user to define the correct pronunciation for the token. In this case, the pronunciation acquisition module 16 requests a user dialogue module 18 to commence a verification dialogue with the user. The dialogue comprises interaction with the user via a user interface 20. The user interface comprises a screen, keyboard and speech dialogue. In this embodiment, the user dialogue module 18 announces that a verification dialogue with the user is to commence and requests confirmation from the user. If no confirmation is received, then the verification dialogue is deferred for a period of time.
Having received confirmation, the user dialogue module 18 displays tokens on the screen together with the syntax category and pronunciation, usually in the form of phonetic notation.
The user is requested to input, via the keyboard, confirmation that the token definition is correct. If the user is not conversant with phonetic notation, then the user may confirm or correct the token using an approximate phonetic spelling in plain text. There is also the option for the user to indicate that the TTS system should output the token with the defined pronunciation so that the user can check aurally that the correct pronunciation is defined. Finally, if the token is an abbreviation or acronym, then the user may input the token in its full form.
The second way of verification comprises interaction with a remote pronunciation dictionary 22. In this case, the remote pronunciation dictionary 22 provides the verification of the pronunciation of the token. If the token is absent from the remote pronunciation dictionary, then the correct pronunciation must be verified with the user in accordance with the first way.
The remote pronunciation dictionary 22 is typically quite large and would therefore be unlikely to be stored within the application embodying the TTS system. Access to the remote pronunciation dictionary depends upon the availability of a connection, which is usually subject to physical or cost considerations.
The pronunciation acquisition module 16 instructs a remote dictionary access module 24 to ascertain whether the remote pronunciation dictionary 22 can be accessed. This includes the remote dictionary access module 24 activating a communications link control 26.
The communications link control 26 initiates a communications link to another communications link control 28. The communications link may comprise any of those known in the art, such as an infrared link; a wireless link such as the GSM air interface, GSM data channels or GSM short message service channels; or a cable link connecting the application in which the TTS is embodied to a computer or host PC. The communications links 26, 28 access the remote pronunciation dictionary 22 via a remote dictionary server 30.
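The pronunciation acquisition step might be sketched as below, assuming a plain mapping standing in for the remote pronunciation dictionary 22 and a simple keyboard dialogue with the user as the fall-back; the function and parameter names are illustrative.

    def acquire_pronunciation(entry, remote_dictionary=None):
        if remote_dictionary is not None:
            # Second way: verify against the remote pronunciation dictionary 22.
            remote = remote_dictionary.get((entry.orthography, entry.syntactic_category))
            if remote is not None:
                return remote
        # First way (or fall-back): ask the user to confirm or correct the
        # pronunciation via the user dialogue module 18 and user interface 20.
        answer = input(
            "Pronounce '%s' (%s) as '%s'? Press Enter to accept or type a correction: "
            % (entry.orthography, entry.syntactic_category, entry.pronunciation)
        )
        return answer.strip() or entry.pronunciation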
The addition candidate list and the verified pronunciations are passed to a dictionary update module 32. The dictionary update module 32 updates the pronunciation dictionary 2 by overwriting those entries in the pronunciation dictionary listed in the deletion list with the corresponding entries to be added from the addition candidate list.
Thus, the present invention enables the size of the pronunciation dictionary to be minimised; hence reducing the memory size of the TTS system. To ensure that this does not result in an unacceptable loss of speech quality, the method provides steps for adapting the contents of the pronunciation dictionary to contain only those entries that are most frequently encountered by the TTS system.
The present invention is advantageous for use in small, mobile electronic products such as mobile phones, computers, CD players, DVD players and the like, although it is not limited thereto.
Several electronic apparatuses using the TTS system will now be described.
<1: Portable Phone>
An example in which the TTS system is applied to a portable or mobile phone will be described. Fig. 5 is an isometric view illustrating the configuration of the portable phone. In the drawing, the portable phone 1200 is provided with a plurality of operation keys 1202, an ear piece 1204, a mouthpiece 1206, and a display panel 100. The mouthpiece 1206 or ear piece 1204 may be used for outputting speech.
<2: Mobile Computer>
An example in which the TTS system according to one of the above embodiments is applied to a mobile personal computer will now be described.
Figure 6 is an isometric view illustrating the configuration of this personal computer. In the drawing, the personal computer 1100 is provided with a body 1104 including a keyboard 1102 and a display unit 1106. The TTS system may use the display unit 1106 or keyboard 1102 to provide the user interface according to the present invention, as described above.
<3: Digital Still Camera>
Next, a digital still camera using a TTS system will be described. Fig. 7 is an isometric view briefly illustrating the configuration of the digital still camera and its connection to external devices.
Typical cameras sensitise films based on optical images from objects, whereas the digital still camera 1300 generates imaging signals from the optical image of an object by photoelectric conversion using, for example, a charge coupled device
(CCD). The digital still camera 1300 is provided with an OEL element 100 at the back face of a case 1302 to perform display based on the imaging signals from the CCD. Thus, the display panel 100 functions as a finder for displaying the object. A photo acceptance unit 1304 including optical lenses and the CCD is provided at the front side (behind in the drawing) of the case 1302. The TTS system may be embodied in the digital still camera.
Further examples of electronic apparatuses, other than the personal computer shown in Fig. 6, the portable phone shown in Fig. 5, and the digital still camera shown in Fig. 7, include television sets, viewfinder-type and monitoring-type video tape recorders, car navigation systems, pagers, electronic notebooks, portable calculators, word processors, workstations, TV telephones, point-of-sales system (POS) terminals, and devices provided with touch panels. Of course, the TTS system of the present invention can be applied to any of these electronic apparatuses.
The foregoing description has been given by way of example only and it will be appreciated by a person skilled in the art that modifications can be made without departing from the scope of the present invention.
For example, the exceptions log may not always store the pronunciation of a word which has not been found in the dictionary, but merely the word itself together with the average and current occurrence counts. The dictionary may not always initially be full; rather, the dictionary may initially be empty, with all words referred to in the first instance being stored in the dictionary.
The pronunciation module and the dictionary update control module may not set the thresholds and constants but rather these are predetermined. In some circumstances, the number of entries in the deletion list may not equal the number in the addition list.
Moreover, the dictionary could be continuously updated rather than waiting until the pronunciation stage has paused and then switching between the pronunciation stage and the dictionary updating stage.
The TTS system according to the present invention may be disposed on a single semiconductor chip, with or without memory for the dictionary and the exceptions log, and with or without a speech encoder-decoder (CODEC) for outputting speech.

Claims (13)

Claims
1. A method of implementing a text to speech (TTS) system including a pronunciation dictionary and a memory containing pronunciation rules, said method comprising: comparing each word in said text to words in said pronunciation dictionary to identify if the identical word is present; if the word is identified, then obtaining the pronunciation of said word; if the word is not identified, then applying said pronunciation rules to said word to obtain the pronunciation of said word; wherein the improvement lies in storing each word which is not identified in an exceptions log; and adding the most frequently referred words from the exceptions log to the pronunciation dictionary.
2. A method of implementing a TTS system as claimed in claim 1, in which if the word is not identified in the pronunciation dictionary, then comparing the word with those in the exceptions log to identify whether the word is already stored in the exceptions log.
3. A method of implementing a TTS system as claimed in either claim 1 or 2, in which said pronunciation is also stored in the exceptions log.
4. A method of implementing a TTS system as claimed in any one of claims 1 to 3, in which the syntax of the word is also stored in the exceptions log.
5. A method of implementing a TTS system as claimed in any one of the previous claims, further comprising logging in the exceptions log the number of times a word has not been identified in the dictionary.
6. A method of implementing a TTS system as claimed in any one of the previous claims, further comprising logging in the dictionary the number of times a word has been identified.
7. A method of implementing a TTS system as claimed in claims 5 and 6, further comprising analysing the most frequently logged words in the exceptions log with the least frequently logged words in the dictionary and deleting the least logged words in the dictionary whilst adding the most frequently logged words from the exceptions log.
8. A method of implementing a TTS system as claimed in any one of the previous claims, further comprising adding the most frequently used words periodically.
9. A method of implementing a TTS system as claimed in any one of claims 1 to 7, further comprising adding the most frequently used words when any one of the following events occurs: when the number of words compared exceeds a predetermined number, when the exceptions log is full, following receipt of a command from a user of the TTS system, or exponentially throughout time following initiation of the TTS system.
10. A method of implementing a TTS system as claimed in any one of the previous claims, further comprising verifying the pronunciation of the words stored in the exceptions log prior to adding the words to the dictionary.
11. A method of implementing a TTS system as claimed in claim 10, wherein verifying the pronunciation comprises verifying the pronunciation with a user of the TTS system.
12. A method of implementing a TTS system as claimed in claim 10, wherein verifying the pronunciation comprises verifying the pronunciation with a remote pronunciation dictionary.
13. A mobile telephone including a text to speech system operated in accordance with the method as claimed in any one of the preceding claims.
GB0221945A 2002-09-20 2002-09-20 A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system Withdrawn GB2393369A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
GB0221945A GB2393369A (en) 2002-09-20 2002-09-20 A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system
AU2003267582A AU2003267582A1 (en) 2002-09-20 2003-09-22 Method for adapting a pronunciation dictionary used for speech synthesis
PCT/GB2003/004037 WO2004027757A1 (en) 2002-09-20 2003-09-22 Method for adapting a pronunciation dictionary used for speech synthesis
EP03748274A EP1454313B1 (en) 2002-09-20 2003-09-22 Method for adapting a pronunciation dictionary used for speech synthesis
DE60309131T DE60309131T2 (en) 2002-09-20 2003-09-22 METHOD FOR ADAPTING A DEBATE LEXICON FOR LANGUAGE SYNTHESIS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0221945A GB2393369A (en) 2002-09-20 2002-09-20 A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system

Publications (2)

Publication Number Publication Date
GB0221945D0 GB0221945D0 (en) 2002-10-30
GB2393369A true GB2393369A (en) 2004-03-24

Family

ID=9944517

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0221945A Withdrawn GB2393369A (en) 2002-09-20 2002-09-20 A method of implementing a text to speech (TTS) system and a mobile telephone incorporating such a TTS system

Country Status (5)

Country Link
EP (1) EP1454313B1 (en)
AU (1) AU2003267582A1 (en)
DE (1) DE60309131T2 (en)
GB (1) GB2393369A (en)
WO (1) WO2004027757A1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6208968B1 (en) * 1998-12-16 2001-03-27 Compaq Computer Corporation Computer method and apparatus for text-to-speech synthesizer dictionary reduction
US6871178B2 (en) * 2000-10-19 2005-03-22 Qwest Communications International, Inc. System and method for converting text-to-voice

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1985004747A1 (en) * 1984-04-10 1985-10-24 First Byte Real-time text-to-speech conversion system
US6119085A (en) * 1998-03-27 2000-09-12 International Business Machines Corporation Reconciling recognition and text to speech vocabularies
EP1049072A2 (en) * 1999-04-30 2000-11-02 Lucent Technologies Inc. Graphical user interface and method for modyfying pronunciations in text-to-speech and speech recognition systems

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2405066A (en) * 2003-05-13 2005-02-16 Intellprop Ltd Auditory assistance with language learning and pronunciation via a text to speech translation in a mobile communications device
GB2481992A (en) * 2010-07-13 2012-01-18 Sony Europe Ltd Updating text-to-speech converter for broadcast signal receiver
EP2407961A3 (en) * 2010-07-13 2012-02-01 Sony Europe Limited Broadcast system using text to speech conversion
CN102378050A (en) * 2010-07-13 2012-03-14 索尼欧洲有限公司 Broadcast system using text to speech conversion
US9263027B2 (en) 2010-07-13 2016-02-16 Sony Europe Limited Broadcast system using text to speech conversion
CN102378050B (en) * 2010-07-13 2017-03-01 索尼欧洲有限公司 Broadcast system using text-to-speech conversion

Also Published As

Publication number Publication date
AU2003267582A1 (en) 2004-04-08
GB0221945D0 (en) 2002-10-30
EP1454313A1 (en) 2004-09-08
EP1454313B1 (en) 2006-10-18
DE60309131D1 (en) 2006-11-30
WO2004027757A1 (en) 2004-04-01
DE60309131T2 (en) 2007-08-23


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)