CA2343071A1 - Device and method for digital voice processing - Google Patents
- Publication number
- CA2343071A1
- Authority
- CA
- Canada
- Prior art keywords
- prosody
- generating
- speech
- speaker
- generated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Abstract
The invention relates to a device for digital voice processing which comprises a sentence melody generating device for generating a sentence melody for a text, and an editing device for displaying and modifying the generated sentence melody.
Description
APPARATUS AND METHOD FOR DIGITAL SPEECH PROCESSING
The present invention relates to an apparatus and a method for digital speech processing and speech generation. Present speech-output systems are typically applied in areas in which synthetic-sounding speech is acceptable or even desired. The present invention, by contrast, relates to a system which makes it possible to synthetically generate speech that gives a natural impression.
In present systems for digital speech generation, the information regarding prosody and intonation is generated automatically, as described e.g. in EP 0689706. In some systems it is possible to insert additional commands into the text stream before it is handed over to the speech generator, e.g. in EP 0598599. Those commands are input e.g. as (non-pronounceable) special characters, as described e.g. in EP 0598598.
The commands inserted into the text stream may also contain indications regarding the characteristics of the speaker (e.g. parameters of the speaker's model). In EP 0762384 a system is described in which those speaker characteristics may be entered on the screen by means of a graphical user interface.
The speech synthesis is carried out using auxiliary information which is stored in a database (e.g. as waveform sequences in the case of EP 0831460).
Nevertheless, for the pronunciation of words which are not stored in the database, pronunciation rules need to be provided in the program. The combination of the individual sequences leads to distortions and acoustic artefacts if no measures are taken for their suppression. This problem (commonly called "segmental quality"), however, is nowadays mostly solved (cp. for example Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997). Nevertheless, even in modern speech synthesis systems several further problems arise.
One problem with digital speech output is, for example, multiple language capability.
Another problem consists in the improvement of the prosodic quality, i.e. the quality of the intonation, cp. e.g. Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997. This difficulty stems from the fact that the intonation can only insufficiently be constructed from the orthographic input information alone. It also depends on higher levels such as semantics and pragmatics, as well as on the situation and the type of the speaker. In general it can be said that the quality of today's speech output systems fulfils the requirements where the listener expects or accepts synthetic speech. Often, however, the quality of synthetic speech is considered insufficient or unsatisfactory.
It is therefore an object of the present invention to provide an apparatus and a method for digital speech processing which allow the generation of synthetic speech of better quality. It is a further object of the invention to synthetically generate speech giving a natural impression. The applications range from the generation of simple texts for multimedia applications up to sound generation for the dubbing of movies, radio dramas, and audio books.
Even if the synthetically generated speech gives a natural impression, the possibility to intervene is sometimes needed in order to generate dramaturgical effects. Another object of the present invention therefore resides in the provision of such possibilities to intervene.
The present invention is defined in the independent claims. The dependent claims define particular embodiments of the invention.
The problem of the invention is substantially solved by providing the capability to modify the prosody generated for a text by means of an editor.
Particular embodiments of the invention provide, in addition to the editing of the prosody, for the modification of further characteristics of the synthetically generated speech.
Thereby the starting point is the written text. However, in order to achieve a sufficient (in particular prosodic) quality as well as to achieve dramaturgical effects, the user in a particular embodiment is provided with far-reaching capabilities to intervene. The user is in the position of a director who defines the speakers on the system and assigns to them speech rhythm and prosody, pronunciation and intonation.
Preferably the present invention also comprises the generation of a phonetic transcription for a written text, as well as the capability to modify the generated phonetic transcription, or to modify the phonetic transcription based on modifiable rules. Thereby, for example, a particular accent of a speaker may be generated.
According to a further preferred embodiment the present invention comprises a dictionary means in which the words of one or more languages are stored together with their pronunciation. In the latter case this allows for multiple language capability (multilinguality), i.e. the processing of text in different languages.
Preferably the editing of the generated phonetic transcription or of the prosody is carried out by means of an easy-to-use editor, such as a graphical user interface.
According to a further preferred embodiment, speaker's models which are either predefined or which are defined or modified by a user are taken into account in the speech processing. Thereby the characteristics of different speakers can be realized, be they male or female voices, or different accents of a speaker, such as a Bavarian, a Swabian or a Northern German accent.
According to a particularly preferred embodiment the apparatus comprises: a dictionary in which for all words the pronunciation is also stored in a phonetic transcription (where reference is made hereinafter to a phonetic transcription, this may mean an arbitrary phonetic notation, such as for example the SAMPA notation, compare for example "Multilingual speech input/output assessment, methodology and standardization, standard computer-compatible transcription, pp. 29-31, in Esprit Project 2589 (SAM) Final Report SAM-UCC-037", or the international phonetic script known from teaching materials, compare for example "The Principles of the International Phonetic Association: A description of the International Phonetic Alphabet and the Manner of Using it. International Phonetic Association, Dept. of Phonetics, Univ. College London"); a translator which converts input text into a phonetic transcription and generates a prosody; an editor with which text can be input and assigned to a speaker, and in which the generated phonetic transcription as well as the prosody can be displayed and modified; an input module in which the speaker's models can be defined; a system for digital speech generation which generates, from the phonetic transcription together with the prosody, signals representing spoken speech or data representing such signals, and which is capable of processing different speaker's models; a system of digital filters and other devices (for reverberation, echo, etc.) with which particular effects can be generated; a sound archive; as well as a mixing device in which the generated speech signals can be mixed with sounds from the archive and edited with effects. The invention can either be realized in a hybrid manner by means of software and hardware, or fully in software. The generated digital speech signals can be output by means of a particular device for digital audio or by means of a PC sound board.
The present invention will be described hereinafter in detail by means of several embodiments and by referring to the accompanying drawings.
Fig. 1 shows a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.
In the embodiment of the present invention described hereinafter, the invention comprises several individual components which may be realized by means of one or more digital processing apparatuses, and whose combination and operation are described in more detail hereinafter.
The dictionary 100 comprises simple tables (one for each language) in which the words of a language are stored together with their pronunciation.
The tables may be extended arbitrarily to incorporate additional words and their pronunciation. For particular purposes, e.g. for the generation of accents, additional tables with different phonetic entries may also be created for one language.
One table of the dictionary is assigned to each of the different speakers.
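The per-language pronunciation tables described above can be sketched as follows. This is a minimal illustration in Python; the patent itself contains no code, and all names (`Dictionary`, `add`, `lookup`) and sample transcriptions are illustrative, not taken from the patent.

```python
class Dictionary:
    """One pronunciation table per language, extensible at runtime."""

    def __init__(self):
        # language -> {word -> phonetic transcription (SAMPA-like, illustrative)}
        self.tables = {
            "en": {"speech": "spi:tS", "hello": "h@l@U"},
            "de": {"Sprache": "SpRa:x@", "hallo": "halo:"},
        }

    def add(self, language, word, transcription):
        # The tables may be extended arbitrarily with new words.
        self.tables.setdefault(language, {})[word] = transcription

    def lookup(self, language, word):
        # Returns None for words not stored; the translator must then
        # fall back to pronunciation rules.
        return self.tables.get(language, {}).get(word)

d = Dictionary()
d.add("en", "prosody", "prQs@di")
print(d.lookup("en", "prosody"))   # prQs@di
print(d.lookup("en", "unknown"))   # None
```

Separate tables per language also make the multilinguality mentioned earlier a simple matter of selecting the table assigned to the current speaker.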
The translator 110 on the one hand generates the phonetic transcriptions by replacing the words of the input text by their phonetic correspondences in the dictionary. If modifiers, which will be described later in more detail, are stored in the speaker's model, they are used for modifying the pronunciation.
Additionally, it generates the prosody using heuristics known in speech processing. Such heuristics are e.g. the model of Fujisaki (1992) or other acoustic methods, as well as perceptual models, e.g. that of d'Alessandro and Mertens (1995). Both of these, as well as older linguistic models, are described e.g. in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997". There one can also find methods for segmentation (setting of breaks), which are likewise generated by the translator.
The selection of the method is thereby of relatively low importance, since the translator only generates a first version of the prosody which can still be modified by the user.
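The translator's two tasks, dictionary substitution and a first-pass prosody, can be sketched as follows. This is a hypothetical illustration: the word table, the stress heuristic, and all names are invented for the example and are far simpler than the heuristics (Fujisaki, d'Alessandro/Mertens) cited above.

```python
import re

# Illustrative word -> SAMPA-like transcription table.
TABLE = {"the": "D@", "cat": "k{t", "sleeps": "sli:ps"}

def translate(text, table):
    """Replace each word by its phonetic correspondence and attach a
    draft prosody that the user may later edit in the editor."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Words missing from the table fall back to their spelling; a real
    # system would apply pronunciation rules here.
    phonetic = [table.get(t, t) for t in tokens]
    # Trivial stand-in heuristic: stress everything except articles.
    prosody = [{"word": t, "stress": t not in ("the", "a")} for t in tokens]
    return phonetic, prosody

phones, prosody = translate("The cat sleeps.", TABLE)
print(phones)    # ['D@', 'k{t', 'sli:ps']
```

Because the translator only produces a first version, the crudeness of the heuristic is acceptable: the user refines the result in the editor.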
Editor 120 provides the user with an instrument with which he can input and modify pronunciation, intonation, accentuation, speed, volume, breaks (interruptions), etc.
At first the user assigns a speaker's model to the text segments to be processed; its composition and operation will be explained in more detail later. The translator reacts to this assignment by adapting the phonetics, and possibly the prosody, to the speaker's model and generating them anew. The phonetics is displayed to the user in a phonetic transcription; the prosody is displayed e.g. in a symbolic notation taken from music (musical notation). The user then has the possibility to modify them, to listen to individual text segments, to improve his inputs once again, and so on.
The texts themselves may of course be kept in the editor if they cannot be directly imported from another text processing system.
Speaker's models 130 are for example parameterizations for speech generation. In these models the characteristics of the human speech organs are modelled. The function of the vocal cords is represented by a sequence of pulses of which only the frequency (pitch) can be varied. The remaining characteristics (oral cavity, nasal cavity) of the speech organs are realized by means of digital filters, whose parameters are stored in the speaker's model. Standard models (child, young lady, old man, etc.) are stored. The user may generate additional models from them by suitably choosing or amending the parameters and storing the model. The parameters stored therein are used during the speech generation, which will be explained later, together with the prosody information for the intonation.
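The source-filter idea behind such a speaker's model, a pulse train whose frequency gives the pitch, shaped by digital filters whose coefficients stand in for the vocal-tract characteristics, can be sketched minimally as follows. All parameter values and model names are illustrative, not values from the patent.

```python
def pulse_train(pitch_hz, duration_s, rate=8000):
    """Source: impulse sequence; only the frequency (pitch) varies."""
    period = int(rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0
            for i in range(int(duration_s * rate))]

def one_pole(signal, a):
    """Filter: y[n] = x[n] + a*y[n-1]; 'a' stands in for the tract
    parameters stored in the speaker's model."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + a * prev
        out.append(prev)
    return out

# Illustrative stored standard models.
child = {"pitch": 300.0, "a": 0.90}
old_man = {"pitch": 90.0, "a": 0.97}

voiced = one_pole(pulse_train(child["pitch"], 0.01), child["a"])
print(len(voiced))  # 80 samples for 10 ms at 8 kHz
```

A real system would use a filter bank with many coefficients per model; the point here is only that pitch and filter parameters together define a voice.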
Thereby also particularities of the speaker, such as e.g. accents or speech impediments, may be input. These are used by the translator for modifying the pronunciation. A simple example of such a modifier is e.g. the rule to replace (in the phonetic transcription) "ʃt" by "st" (to generate the accent of a person from Hamburg).
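Such a pronunciation modifier can be sketched as a rewrite rule applied to the phonetic transcription. The rule set below is illustrative (written here in SAMPA, where "S" denotes the sound ʃ); the function name and data layout are invented for the example.

```python
import re

def apply_modifiers(transcription, rules):
    """Apply each (pattern, replacement) rewrite rule from the
    speaker's model to the phonetic transcription in order."""
    for pattern, replacement in rules:
        transcription = re.sub(pattern, replacement, transcription)
    return transcription

# Illustrative Hamburg rule set: "St"/"Sp" onsets become plain "st"/"sp".
hamburg = [("St", "st"), ("Sp", "sp")]

print(apply_modifiers("StaIn", hamburg))  # staIn
```

Because the rules operate on the transcription rather than on the audio, the same modifier works with any downstream synthesizer.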
A speaker's model may e.g. concern the rules according to which the translator generates the phonetic transcription; different speaker's models may thereby proceed according to different rules. A speaker's model may, however, also correspond to a certain set of filter parameters in order to process the speech signals according to the speech characteristics prescribed thereby. Of course, different combinations of these two aspects of a speaker's model are also conceivable.
The task of the speech generation unit 140 consists in generating a numerical data stream representing digital speech signals, based on the given text together with the phonetic and prosodic additional information generated by the translator and edited by the user. This data stream can then be converted into analog sound signals, the text to be output, by an output device 150 which may be a digital audio device or a sound board in a PC.
For generating the speech, a conventional text-to-speech conversion method can be used, in which, however, the pronunciation and the prosody have already been generated. In general one distinguishes between rule-based synthesizers and concatenation-based synthesizers.
Rule-based synthesizers operate using rules for the generation of the sounds and the transitions between them. These synthesizers operate with a large number of parameters, the determination of which is very demanding. However, very good results may be achieved with this type of synthesizer. An overview of this type of system and further references may be found in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997".
On the other hand, concatenation-based synthesizers are easier to handle. They work with a database which stores all possible pairs of sounds. These can be easily concatenated; however, systems providing a good quality require a high computational power. These types of systems are described in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997" and in "Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997".
In principle both types of systems can be used. In rule-based synthesizers the prosodic information directly influences the rules, while in concatenation-based systems it is superposed on the concatenated units in a suitable manner.
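The concatenation idea, and the seam-smoothing needed to suppress the distortions mentioned earlier, can be sketched as follows. This is an illustrative toy: real systems concatenate stored waveform units (e.g. sound pairs) of thousands of samples, and the crossfade length and data here are invented.

```python
def crossfade_concat(units, overlap=4):
    """Join stored units, blending 'overlap' samples at each seam with
    a linear crossfade to avoid an audible discontinuity."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)       # fade-in weight of the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

a = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # stand-in for a stored sound unit
b = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
joined = crossfade_concat([a, b])
print(len(joined))  # 6 + 6 - 4 = 8 samples
```

The blending at each seam is the kind of "measure for suppression" of acoustic artefacts that the introduction refers to.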
For the generation of particular effects 160, known techniques from digital signal processing are used, such as e.g. digital filters (e.g. bandpass filters for a telephone effect), reverberation generators, etc. They may also be applied to sounds stored in an archive 170.
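A crude version of the telephone effect mentioned above can be sketched as a band-pass built from a first-order high-pass followed by a first-order low-pass, removing energy outside the narrow telephone band (roughly 300-3400 Hz). The filter coefficients below are illustrative, not tuned values from the patent.

```python
def low_pass(signal, a=0.3):
    """First-order low-pass: y[n] = y[n-1] + a*(x[n] - y[n-1])."""
    out, prev = [], 0.0
    for x in signal:
        prev = prev + a * (x - prev)
        out.append(prev)
    return out

def high_pass(signal, a=0.95):
    """First-order high-pass: y[n] = a*(y[n-1] + x[n] - x[n-1])."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in signal:
        y = a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

def telephone(signal):
    # Band-pass = high-pass then low-pass.
    return low_pass(high_pass(signal))

samples = [0.0, 1.0, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0]
filtered = telephone(samples)
print(len(filtered))  # 8
```

A production system would use higher-order filters with coefficients chosen for the target band; the cascade structure, however, is the same.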
In archive 170, sounds such as e.g. street noise, railway noise, noise of children, sound of the sea, background music etc. are stored. The archive may be extended arbitrarily. The archive may simply be a collection of files containing digitized noises; it may, however, also be a database in which the noises are stored as BLOBs (binary large objects).
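The database variant of the archive can be sketched with an embedded SQL database storing each noise as a BLOB. Table and column names, and the stand-in audio bytes, are illustrative.

```python
import sqlite3

# In-memory database standing in for the sound archive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sounds (name TEXT PRIMARY KEY, data BLOB)")

street_noise = bytes(range(16))   # stand-in for digitized audio data
conn.execute("INSERT INTO sounds VALUES (?, ?)", ("street", street_noise))

row = conn.execute(
    "SELECT data FROM sounds WHERE name = ?", ("street",)
).fetchone()
print(row[0] == street_noise)  # True
```

Storing noises as BLOBs rather than loose files gives the archive atomic updates and keyed lookup, at the cost of a database dependency.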
In the mixing device 180, the speech signals thus generated are combined with the background noises. The volume of each signal may be adjusted before combination. Additionally, it is possible to apply effects to each signal individually or to all of them together.
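The mixing step, per-signal volume adjustment before summation, can be sketched as follows. Gains and sample values are illustrative; the clipping stage keeps the combined signal inside the valid sample range.

```python
def mix(signals_with_gain):
    """Sum signals after applying each one's gain; clip to [-1, 1]."""
    length = max(len(s) for s, _ in signals_with_gain)
    out = [0.0] * length
    for signal, gain in signals_with_gain:
        for i, x in enumerate(signal):
            out[i] += gain * x
    return [max(-1.0, min(1.0, x)) for x in out]

speech = [0.5, 0.5, 0.5, 0.5]                 # generated speech (stand-in)
background = [0.8, -0.8, 0.8, -0.8, 0.8]      # archive noise (stand-in)

mixed = mix([(speech, 1.0), (background, 0.25)])
print([round(x, 2) for x in mixed])  # [0.7, 0.3, 0.7, 0.3, 0.2]
```

Effects such as the telephone filter above could be applied to either input before mixing, or to the mixed result, matching the per-signal and global effect options described in the text.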
The signal thus generated may be handed over (transmitted) to a suitable device for digital audio 150, such as a sound board of a PC, and may thereby be acoustically checked or acoustically output. Additionally, a storage means (not shown) is provided in order to store the signal so that it may later be transmitted in a suitable manner to the target medium.
As a mixing device one can use a device classically implemented in hardware, or it can be realized in software and form part of the whole program.
Modifications of the above embodiment are easily apparent to the skilled person. For example, in a further embodiment of the present invention the output device 150 may be replaced by a further computer connected to mixing device 180 by means of a network connection. The generated speech signal can thereby be transmitted through a computer network, e.g. the Internet, to another computer.
In another embodiment, the speech signal generated by the speech generation unit 140 can be transmitted directly to the output device 150, without passing through mixing device 180. Further comparable modifications are easily apparent to the skilled person.
Claims (21)
1. A digital speech processing apparatus comprising:
a prosody generation means for generating a prosody for a text; and an editing means for displaying and modifying the generated prosody.
2. The apparatus of claim 1 further comprising:
translation means for translating the text into a phonetic transcription, said translation means further comprising:
means for displaying and modifying the generated phonetic transcription.
3. The apparatus of claim 1 or 2, wherein said prosody generating means and/or said translation means generates said prosody and/or said phonetic transcription based on, or in dependence on, a particular speaker's model.
4. The apparatus of one of claims 1 to 3, further comprising:
means for displaying and/or modifying one or more speaker's models.
5. The apparatus of claim 4, wherein said speaker's model modification means comprises:
means for modifying phonetic transcription elements for the generation of accents.
6. An apparatus for generating digital speech comprising:
an apparatus for digital speech processing according to one of claims 1 to 4; and means for generating speech signals based on said phonetic transcription which may have been edited using said editing means and/or based on said prosody.
7. The apparatus of claim 6, wherein said speech signal generating means further comprises:
a speaker's model processing means for generating said speech signals based on, or depending on, a particular speaker's model.
8. The apparatus of claim 7, wherein said speaker's model processing means comprises one or more of the following:
a digital filter system;
means for adopting a set of filter parameters representing a particular speaker's model.
9. The apparatus of claim 7 or 8, wherein said speaker's model processing means further comprises:
means for selecting and/or modifying a speaker's model.
10. The apparatus of one of claims 6 to 9, further comprising:
effect generating means for generating sound effects.
11. The apparatus of claim 10, wherein said effect generating means comprises one or more of the following:
digital filter means for modifying the generated speech signals, and/or a reverb generator for generating a reverb effect.
12. The apparatus of one of claims 6 to 11, further comprising:
archive means for storing sounds; and mixing means for mixing the generated speech signals with the sounds stored in said archive means.
13. The apparatus of one of the preceding claims, further comprising:
a graphical user interface for editing the generated phonetic transcription and/or prosody.
14. The apparatus of one of the preceding claims, further comprising:
means for modifying speech rhythm and/or pronunciation and/or intonation.
15. The apparatus of one of the preceding claims, further comprising:
display means for displaying the prosody by means of a symbolic notation.
16. The apparatus of one of the preceding claims, further comprising:
dictionary means in which the words of one or more languages are stored together with their pronunciation.
17. The apparatus of claim 16, wherein for at least one dictionary entry different phonetic entries are stored in said dictionary means.
18. The apparatus of one of claims 6 to 17, further comprising:
means for converting said digital speech signals into acoustic signals.
19. A digital speech processing method comprising:
generating a prosody for a text;
displaying said generated prosody; and editing said generated and displayed prosody.
20. The method of claim 19, further comprising:
using an apparatus according to one of claims 1 to 18 for generating digital speech.
21. A computer program product comprising:
a medium, in particular a data carrier for storing and/or transmitting digital data readable by a computer, wherein said stored and/or transmitted data comprise:
a sequence of computer-executable instructions causing said computer to carry out a method according to one of claims 19 or 20.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19841683A DE19841683A1 (en) | 1998-09-11 | 1998-09-11 | Device and method for digital speech processing |
DE19841683.0 | 1998-09-11 | ||
PCT/EP1999/006712 WO2000016310A1 (en) | 1998-09-11 | 1999-09-10 | Device and method for digital voice processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2343071A1 true CA2343071A1 (en) | 2000-03-23 |
Family
ID=7880683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002343071A Abandoned CA2343071A1 (en) | 1998-09-11 | 1999-09-10 | Device and method for digital voice processing |
Country Status (7)
Country | Link |
---|---|
EP (1) | EP1110203B1 (en) |
JP (1) | JP2002525663A (en) |
AT (1) | ATE222393T1 (en) |
AU (1) | AU769036B2 (en) |
CA (1) | CA2343071A1 (en) |
DE (2) | DE19841683A1 (en) |
WO (1) | WO2000016310A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8566880B2 (en) | 2008-07-22 | 2013-10-22 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Device and method for providing a television sequence using database and user inputs |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10117367B4 (en) * | 2001-04-06 | 2005-08-18 | Siemens Ag | Method and system for automatically converting text messages into voice messages |
JP2002318593A (en) * | 2001-04-20 | 2002-10-31 | Sony Corp | Language processing system and language processing method as well as program and recording medium |
AT6920U1 (en) | 2002-02-14 | 2004-05-25 | Sail Labs Technology Ag | METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS |
DE10207875A1 (en) * | 2002-02-19 | 2003-08-28 | Deutsche Telekom Ag | Parameter-controlled, expressive speech synthesis from text, modifies voice tonal color and melody, in accordance with control commands |
EP1726005A4 (en) * | 2004-03-05 | 2007-06-20 | Lessac Technologies Inc | Prosodic speech text codes and their use in computerized speech systems |
DE102004012208A1 (en) * | 2004-03-12 | 2005-09-29 | Siemens Ag | Individualization of speech output by adapting a synthesis voice to a target voice |
US10424288B2 (en) | 2017-03-31 | 2019-09-24 | Wipro Limited | System and method for rendering textual messages using customized natural voice |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5695295A (en) * | 1979-12-28 | 1981-08-01 | Sharp Kk | Voice sysnthesis and control circuit |
FR2494017B1 (en) * | 1980-11-07 | 1985-10-25 | Thomson Csf | METHOD FOR DETECTING THE MELODY FREQUENCY IN A SPEECH SIGNAL AND DEVICE FOR CARRYING OUT SAID METHOD |
JPS58102298A (en) * | 1981-12-14 | 1983-06-17 | キヤノン株式会社 | Electronic appliance |
US4623761A (en) * | 1984-04-18 | 1986-11-18 | Golden Enterprises, Incorporated | Telephone operator voice storage and retrieval system |
US5559927A (en) * | 1992-08-19 | 1996-09-24 | Clynes; Manfred | Computer system producing emotionally-expressive speech messages |
US5956685A (en) * | 1994-09-12 | 1999-09-21 | Arcadia, Inc. | Sound characteristic converter, sound-label association apparatus and method therefor |
AU3400395A (en) * | 1994-09-12 | 1996-03-29 | Atr Human Information Processing Research Laboratories Co.,Ltd. | Sound characteristic convertor, sound/label associating apparatus and method to form them |
DE19503419A1 (en) * | 1995-02-03 | 1996-08-08 | Bosch Gmbh Robert | Method and device for outputting digitally coded traffic reports using synthetically generated speech |
JPH08263094A (en) * | 1995-03-10 | 1996-10-11 | Winbond Electron Corp | Synthesizer for generation of speech mixed with melody |
EP0762384A2 (en) * | 1995-09-01 | 1997-03-12 | AT&T IPM Corp. | Method and apparatus for modifying voice characteristics of synthesized speech |
DE19610019C2 (en) * | 1996-03-14 | 1999-10-28 | Data Software Gmbh G | Digital speech synthesis process |
JP3616250B2 (en) * | 1997-05-21 | 2005-02-02 | 日本電信電話株式会社 | Synthetic voice message creation method, apparatus and recording medium recording the method |
US6226614B1 (en) * | 1997-05-21 | 2001-05-01 | Nippon Telegraph And Telephone Corporation | Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon |
-
1998
- 1998-09-11 DE DE19841683A patent/DE19841683A1/en not_active Withdrawn
-
1999
- 1999-09-10 AT AT99947314T patent/ATE222393T1/en not_active IP Right Cessation
- 1999-09-10 AU AU60813/99A patent/AU769036B2/en not_active Ceased
- 1999-09-10 WO PCT/EP1999/006712 patent/WO2000016310A1/en active IP Right Grant
- 1999-09-10 CA CA002343071A patent/CA2343071A1/en not_active Abandoned
- 1999-09-10 EP EP99947314A patent/EP1110203B1/en not_active Expired - Lifetime
- 1999-09-10 DE DE59902365T patent/DE59902365D1/en not_active Expired - Fee Related
- 1999-09-10 JP JP2000570766A patent/JP2002525663A/en not_active Withdrawn
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8566880B2 (en) | 2008-07-22 | 2013-10-22 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Device and method for providing a television sequence using database and user inputs |
Also Published As
Publication number | Publication date |
---|---|
ATE222393T1 (en) | 2002-08-15 |
JP2002525663A (en) | 2002-08-13 |
DE19841683A1 (en) | 2000-05-11 |
AU769036B2 (en) | 2004-01-15 |
DE59902365D1 (en) | 2002-09-19 |
AU6081399A (en) | 2000-04-03 |
WO2000016310A1 (en) | 2000-03-23 |
EP1110203A1 (en) | 2001-06-27 |
EP1110203B1 (en) | 2002-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7979274B2 (en) | Method and system for preventing speech comprehension by interactive voice response systems | |
WO2006123539A1 (en) | Speech synthesizer | |
JPH0833744B2 (en) | Speech synthesizer | |
AU769036B2 (en) | Device and method for digital voice processing | |
JP2008058379A (en) | Speech synthesis system and filter device | |
JPH08335096A (en) | Text voice synthesizer | |
JPH07200554A (en) | Sentence read-aloud device | |
JP2577372B2 (en) | Speech synthesis apparatus and method | |
JP3113101B2 (en) | Speech synthesizer | |
JPH09179576A (en) | Voice synthesizing method | |
JP2703253B2 (en) | Speech synthesizer | |
JP2573586B2 (en) | Rule-based speech synthesizer | |
JP2658109B2 (en) | Speech synthesizer | |
JP3862300B2 (en) | Information processing method and apparatus for use in speech synthesis | |
JP3292218B2 (en) | Voice message composer | |
JP2573587B2 (en) | Pitch pattern generator | |
KR100269215B1 (en) | Method for producing fundamental frequency contour of prosodic phrase for tts | |
JP2573585B2 (en) | Speech spectrum pattern generator | |
JP2586040B2 (en) | Voice editing and synthesis device | |
JP2006349787A (en) | Speech synthesis method and apparatus | |
JPH1011083A (en) | Text voice converting device | |
JPH06214585A (en) | Voice synthesizer | |
JPH06138894A (en) | Device and method for voice synthesis | |
JP2001166787A (en) | Voice synthesizer and natural language processing method | |
JPH0553595A (en) | Speech synthesizing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FZDE | Discontinued |