CA2343071A1 - Device and method for digital voice processing - Google Patents
- Publication number
- CA2343071A1
- Authority
- CA
- Canada
- Prior art keywords
- prosody
- generating
- speech
- speaker
- generated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Abstract
The invention relates to a device for digital voice processing which comprises a sentence melody generating device for generating a sentence melody for a text, and an editing device for displaying and modifying the generated sentence melody.
Description
APPARATUS AND METHOD FOR DIGITAL SPEECH PROCESSING
The present invention relates to an apparatus and a method for digital speech processing and speech generation. Present speech-output systems are typically applied in areas in which synthetic-sounding speech is acceptable or even desired. The present invention, by contrast, relates to a system which makes it possible to synthetically generate speech that gives a natural impression.
In present systems for digital speech generation, the information regarding prosody and intonation is generated automatically, as described e.g. in EP 0689706. In some systems it is possible to insert additional commands into the text stream before it is handed over to the speech generator, e.g. in EP 0598599. Those commands are input e.g. as (non-pronounceable) special characters, as described e.g. in EP 0598598.
The commands inserted into the text stream may also contain indications regarding the characteristics of the speaker (e.g. parameters of the speaker's model). In EP 0762384 a system is described in which those speaker characteristics may be entered on the screen by means of a graphical user interface.
The speech synthesis is carried out using auxiliary information which is stored in a database (e.g. as waveform sequences in the case of EP 0831460).
Nevertheless, for the pronunciation of words which are not stored in the database, pronunciation rules need to be provided in the program. The combination of the individual sequences leads to distortions and acoustic artefacts if no measures are taken for their suppression. This problem (commonly called "segmental quality"), however, is nowadays mostly solved (cp. for example Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997). Nevertheless, even in modern speech synthesis systems several further problems arise.
One problem with digital speech output is, for example, multiple language capability.
Another problem consists in the improvement of the prosodic quality, i.e. the quality of the intonation, cp. e.g. Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997. This difficulty stems from the fact that the intonation can only insufficiently be constructed from the orthographic input information alone. It also depends on higher levels such as semantics and pragmatics, as well as on the situation and the type of the speaker. In general it can be said that the quality of today's speech output systems fulfils the requirements where the listener expects or accepts synthetic speech. Often, however, the quality of synthetic speech is considered insufficient or unsatisfactory.
It is therefore an object of the present invention to provide an apparatus and a method for digital speech processing which allow the generation of synthetic speech of better quality. It is a further object of the invention to synthetically generate speech giving a natural impression. The applications range from the generation of simple texts for multimedia applications up to sound generation for the dubbing of movies, radio dramas, and audio books.
Even if the synthetically generated speech gives a natural impression, the possibility to intervene is sometimes needed in order to generate dramaturgical effects. Another object of the present invention therefore resides in the provision of such possibilities to intervene.
The present invention is defined in the independent claims. The dependent claims define particular embodiments of the invention.
The problem of the invention is substantially solved by providing the capability to modify the prosody generated for a text by means of an editor.
Particular embodiments of the invention provide, in addition to the editing of the prosody, for the modification of further characteristics of the synthetically generated speech.
Thereby the starting point is the written text. However, in order to achieve a sufficient (in particular prosodic) quality as well as to achieve dramaturgical effects, the user in a particular embodiment is provided with far-reaching capabilities to intervene. The user is in the position of a director who defines the speakers on the system and assigns to them speech rhythm and prosody, pronunciation and intonation.
Preferably the present invention also comprises the generation of a phonetic transcription for a written text, as well as the capability to modify the generated phonetic transcription, or to modify the phonetic transcription based on modifiable rules. Thereby, for example, a particular accent of a speaker may be generated.
According to a further preferred embodiment the present invention comprises a dictionary means in which the words of one or more languages are stored together with their pronunciation. In the latter case this allows for multiple language capability (multilinguality), i.e. the processing of text in different languages.
Preferably the editing of the generated phonetic transcription or of the prosody is carried out by means of an easy-to-use editor, such as a graphical user interface.
According to a further preferred embodiment, speaker's models which are either predefined or which are defined or modified by a user are taken into account in the speech processing. Thereby the characteristics of different speakers can be realized, be they male or female voices, or different accents of a speaker, such as a Bavarian, a Swabian or a Northern German accent.
According to a particularly preferred embodiment the apparatus comprises: a dictionary in which for all words the pronunciation is also stored in a phonetic transcription (where reference is made hereinafter to a phonetic transcription, this may mean an arbitrary phonetic notation, such as for example the SAMPA notation, compare for example "Multilingual speech input/output assessment, methodology and standardization, standard computer-compatible transcription, pp. 29-31, in Esprit Project 2589 (SAM) Final Report SAM-UCC-037", or the international phonetic script known from teaching materials, compare for example "The Principles of the International Phonetic Association: A description of the International Phonetic Alphabet and the Manner of Using it. International Phonetic Association, Dept. of Phonetics, Univ. College London"); a translator which converts input text into a phonetic transcription and generates a prosody; an editor with which text can be input and assigned to a speaker, and in which the generated phonetic transcription as well as the prosody can be displayed and modified; an input module in which the speaker's models can be defined; a system for digital speech generation which generates, from the phonetic transcription together with the prosody, signals representing spoken speech or data representing such signals, and which is capable of processing different speaker's models; a system of digital filters and other devices (for reverberation, echo, etc.) with which particular effects can be generated; a sound archive; as well as a mixing device in which the generated speech signals can be mixed with sounds from the archive and edited with effects. The invention can either be realized in a hybrid manner by means of software and hardware, or fully in software. The generated digital speech signals can be output by means of a particular device for digital audio or by means of a PC sound board.
The present invention will be described hereinafter in detail by means of several embodiments and by referring to the accompanying drawings.
Fig. 1 shows a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.
In the embodiment of the present invention described hereinafter, the invention comprises several individual components which may be realized by means of one or more digital processing apparatuses, and whose combination and operation are described in more detail hereinafter.
The dictionary 100 comprises simple tables (one for each language) in which the words of a language are stored together with their pronunciation.
The tables may be extended arbitrarily to incorporate additional words and their pronunciation. For particular purposes, e.g. for the generation of accents, additional tables with different phonetic entries may also be created for one language.
One table of the dictionary is assigned to each of the different speakers.
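The per-language pronunciation tables described above can be sketched as follows. This is a minimal illustration in Python; the patent itself contains no code, and all names (`Dictionary`, `add`, `lookup`) and sample transcriptions are illustrative, not taken from the patent.

```python
class Dictionary:
    """One pronunciation table per language, extensible at runtime."""

    def __init__(self):
        # language -> {word -> phonetic transcription (SAMPA-like, illustrative)}
        self.tables = {
            "en": {"speech": "spi:tS", "hello": "h@l@U"},
            "de": {"Sprache": "SpRa:x@", "hallo": "halo:"},
        }

    def add(self, language, word, transcription):
        # The tables may be extended arbitrarily with new words.
        self.tables.setdefault(language, {})[word] = transcription

    def lookup(self, language, word):
        # Returns None for words not stored; the translator must then
        # fall back to pronunciation rules.
        return self.tables.get(language, {}).get(word)

d = Dictionary()
d.add("en", "prosody", "prQs@di")
print(d.lookup("en", "prosody"))   # prQs@di
print(d.lookup("en", "unknown"))   # None
```

Separate tables per language also make the multilinguality mentioned earlier a simple matter of selecting the table assigned to the current speaker.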
The translator 110 on the one hand generates the phonetic transcriptions by replacing the words of the input text by their phonetic correspondences in the dictionary. If modifiers, which will be described later in more detail, are stored in the speaker's model, they are used for modifying the pronunciation.
Additionally, it generates the prosody using heuristics known in speech processing. Such heuristics are e.g. the model of Fujisaki (1992) or other acoustic methods, as well as perceptual models, e.g. that of d'Alessandro and Mertens (1995). Both of these, as well as older linguistic models, are described e.g. in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997". There one can also find methods for segmentation (setting of breaks), which are likewise generated by the translator.
The selection of the method is thereby of relatively low importance, since the translator only generates a first version of the prosody which can still be modified by the user.
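The translator's two tasks, dictionary substitution and a first-pass prosody, can be sketched as follows. This is a hypothetical illustration: the word table, the stress heuristic, and all names are invented for the example and are far simpler than the heuristics (Fujisaki, d'Alessandro/Mertens) cited above.

```python
import re

# Illustrative word -> SAMPA-like transcription table.
TABLE = {"the": "D@", "cat": "k{t", "sleeps": "sli:ps"}

def translate(text, table):
    """Replace each word by its phonetic correspondence and attach a
    draft prosody that the user may later edit in the editor."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Words missing from the table fall back to their spelling; a real
    # system would apply pronunciation rules here.
    phonetic = [table.get(t, t) for t in tokens]
    # Trivial stand-in heuristic: stress everything except articles.
    prosody = [{"word": t, "stress": t not in ("the", "a")} for t in tokens]
    return phonetic, prosody

phones, prosody = translate("The cat sleeps.", TABLE)
print(phones)    # ['D@', 'k{t', 'sli:ps']
```

Because the translator only produces a first version, the crudeness of the heuristic is acceptable: the user refines the result in the editor.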
Editor 120 provides the user with an instrument with which he can input and modify pronunciation, intonation, accentuation, speed, volume, breaks (interruptions), etc.
At first the user assigns a speaker's model to the text segments to be processed; its composition and operation will be explained in more detail later. The translator reacts to this assignment by adapting the phonetics, and possibly the prosody, to the speaker's model and generating them anew. The phonetics is displayed to the user in a phonetic transcription; the prosody is displayed e.g. in a symbolic notation taken from music (musical notation). The user then has the possibility to modify them, to listen to individual text segments, to improve his inputs once again, and so on.
The texts themselves may of course be kept in the editor if they cannot be directly imported from another text processing system.
Speaker's models 130 are for example parameterizations for speech generation. In these models the characteristics of the human speech organs are modelled. The function of the vocal cords is represented by a sequence of pulses of which only the frequency (pitch) can be varied. The remaining characteristics (oral cavity, nasal cavity) of the speech organs are realized by means of digital filters, whose parameters are stored in the speaker's model. Standard models (child, young lady, old man, etc.) are stored. The user may generate additional models from them by suitably choosing or amending the parameters and storing the model. The parameters stored therein are used during the speech generation, which will be explained later, together with the prosody information for the intonation.
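The source-filter idea behind such a speaker's model, a pulse train whose frequency gives the pitch, shaped by digital filters whose coefficients stand in for the vocal-tract characteristics, can be sketched minimally as follows. All parameter values and model names are illustrative, not values from the patent.

```python
def pulse_train(pitch_hz, duration_s, rate=8000):
    """Source: impulse sequence; only the frequency (pitch) varies."""
    period = int(rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0
            for i in range(int(duration_s * rate))]

def one_pole(signal, a):
    """Filter: y[n] = x[n] + a*y[n-1]; 'a' stands in for the tract
    parameters stored in the speaker's model."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + a * prev
        out.append(prev)
    return out

# Illustrative stored standard models.
child = {"pitch": 300.0, "a": 0.90}
old_man = {"pitch": 90.0, "a": 0.97}

voiced = one_pole(pulse_train(child["pitch"], 0.01), child["a"])
print(len(voiced))  # 80 samples for 10 ms at 8 kHz
```

A real system would use a filter bank with many coefficients per model; the point here is only that pitch and filter parameters together define a voice.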
Thereby also particularities of the speaker, such as e.g. accents or speech impediments, may be input. These are used by the translator for modifying the pronunciation. A simple example of such a modifier is e.g. the rule to replace (in the phonetic transcription) "ʃt" by "st" (to generate the accent of a person from Hamburg).
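Such a pronunciation modifier can be sketched as a rewrite rule applied to the phonetic transcription. The rule set below is illustrative (written here in SAMPA, where "S" denotes the sound ʃ); the function name and data layout are invented for the example.

```python
import re

def apply_modifiers(transcription, rules):
    """Apply each (pattern, replacement) rewrite rule from the
    speaker's model to the phonetic transcription in order."""
    for pattern, replacement in rules:
        transcription = re.sub(pattern, replacement, transcription)
    return transcription

# Illustrative Hamburg rule set: "St"/"Sp" onsets become plain "st"/"sp".
hamburg = [("St", "st"), ("Sp", "sp")]

print(apply_modifiers("StaIn", hamburg))  # staIn
```

Because the rules operate on the transcription rather than on the audio, the same modifier works with any downstream synthesizer.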
A speaker's model may e.g. concern the rules according to which the translator generates the phonetic transcription; different speaker's models may thereby proceed according to different rules. A speaker's model may, however, also correspond to a certain set of filter parameters in order to process the speech signals according to the speech characteristics prescribed thereby. Of course, different combinations of these two aspects of a speaker's model are also conceivable.
The task of the speech generation unit 140 consists in generating a numerical data stream representing digital speech signals, based on the given text together with the phonetic and prosodic additional information generated by the translator and edited by the user. This data stream can then be converted into analog sound signals, the text to be output, by an output device 150 which may be a digital audio device or a sound board in a PC.
For generating the speech, a conventional text-to-speech conversion method can be used, in which, however, the pronunciation and the prosody have already been generated. In general one distinguishes between rule-based synthesizers and concatenation-based synthesizers.
Rule-based synthesizers operate using rules for the generation of the sounds and the transitions between them. These synthesizers operate with a large number of parameters, the determination of which is very demanding. However, very good results may be achieved with this type of synthesizer. An overview of this type of system and further references may be found in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997".
On the other hand, concatenation-based synthesizers are easier to handle. They work with a database which stores all possible pairs of sounds. These can be easily concatenated; however, systems providing a good quality require a high computational power. These types of systems are described in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997" and in "Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997".
In principle both types of systems can be used. In rule-based synthesizers the prosodic information directly influences the rules, while in concatenation-based systems it is superposed on the concatenated units in a suitable manner.
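The concatenation idea, and the seam-smoothing needed to suppress the distortions mentioned earlier, can be sketched as follows. This is an illustrative toy: real systems concatenate stored waveform units (e.g. sound pairs) of thousands of samples, and the crossfade length and data here are invented.

```python
def crossfade_concat(units, overlap=4):
    """Join stored units, blending 'overlap' samples at each seam with
    a linear crossfade to avoid an audible discontinuity."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)       # fade-in weight of the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

a = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # stand-in for a stored sound unit
b = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
joined = crossfade_concat([a, b])
print(len(joined))  # 6 + 6 - 4 = 8 samples
```

The blending at each seam is the kind of "measure for suppression" of acoustic artefacts that the introduction refers to.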
For the generation of particular effects 160, known techniques from digital signal processing are used, such as e.g. digital filters (e.g. bandpass filters for a telephone effect), reverberation generators, etc. They may also be applied to sounds stored in an archive 170.
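A crude version of the telephone effect mentioned above can be sketched as a band-pass built from a first-order high-pass followed by a first-order low-pass, removing energy outside the narrow telephone band (roughly 300-3400 Hz). The filter coefficients below are illustrative, not tuned values from the patent.

```python
def low_pass(signal, a=0.3):
    """First-order low-pass: y[n] = y[n-1] + a*(x[n] - y[n-1])."""
    out, prev = [], 0.0
    for x in signal:
        prev = prev + a * (x - prev)
        out.append(prev)
    return out

def high_pass(signal, a=0.95):
    """First-order high-pass: y[n] = a*(y[n-1] + x[n] - x[n-1])."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in signal:
        y = a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

def telephone(signal):
    # Band-pass = high-pass then low-pass.
    return low_pass(high_pass(signal))

samples = [0.0, 1.0, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0]
filtered = telephone(samples)
print(len(filtered))  # 8
```

A production system would use higher-order filters with coefficients chosen for the target band; the cascade structure, however, is the same.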
In archive 170, sounds such as e.g. street noise, railway noise, noise of children, sound of the sea, background music etc. are stored. The archive may be extended arbitrarily. The archive may simply be a collection of files containing digitized noises; it may, however, also be a database in which the noises are stored as BLOBs (binary large objects).
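The database variant of the archive can be sketched with an embedded SQL database storing each noise as a BLOB. Table and column names, and the stand-in audio bytes, are illustrative.

```python
import sqlite3

# In-memory database standing in for the sound archive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sounds (name TEXT PRIMARY KEY, data BLOB)")

street_noise = bytes(range(16))   # stand-in for digitized audio data
conn.execute("INSERT INTO sounds VALUES (?, ?)", ("street", street_noise))

row = conn.execute(
    "SELECT data FROM sounds WHERE name = ?", ("street",)
).fetchone()
print(row[0] == street_noise)  # True
```

Storing noises as BLOBs rather than loose files gives the archive atomic updates and keyed lookup, at the cost of a database dependency.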
In the mixing device 180, the speech signals thus generated are combined with the background noises. The volume of each signal may be adjusted before combination. Additionally, it is possible to apply effects to each signal individually or to all of them together.
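The mixing step, per-signal volume adjustment before summation, can be sketched as follows. Gains and sample values are illustrative; the clipping stage keeps the combined signal inside the valid sample range.

```python
def mix(signals_with_gain):
    """Sum signals after applying each one's gain; clip to [-1, 1]."""
    length = max(len(s) for s, _ in signals_with_gain)
    out = [0.0] * length
    for signal, gain in signals_with_gain:
        for i, x in enumerate(signal):
            out[i] += gain * x
    return [max(-1.0, min(1.0, x)) for x in out]

speech = [0.5, 0.5, 0.5, 0.5]                 # generated speech (stand-in)
background = [0.8, -0.8, 0.8, -0.8, 0.8]      # archive noise (stand-in)

mixed = mix([(speech, 1.0), (background, 0.25)])
print([round(x, 2) for x in mixed])  # [0.7, 0.3, 0.7, 0.3, 0.2]
```

Effects such as the telephone filter above could be applied to either input before mixing, or to the mixed result, matching the per-signal and global effect options described in the text.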
The signal thus generated may be handed over (transmitted) to a suitable device for digital audio 150, such as a sound board of a PC, and may thereby be acoustically checked or acoustically output. Additionally, a storage means (not shown) is provided in order to store the signal so that it may later be transmitted in a suitable manner to the target medium.
As a mixing device one can use a device classically implemented in hardware, or it can be realized in software and form part of the whole program.
Modifications of the above embodiment are easily apparent to the skilled person. For example, in a further embodiment of the present invention the output device 150 may be replaced by a further computer connected to mixing device 180 by means of a network connection. The generated speech signal can thereby be transmitted through a computer network, e.g. the Internet, to another computer.
In another embodiment, the speech signal generated by the speech generation unit 140 can be transmitted directly to the output device 150, without passing through mixing device 180. Further comparable modifications are easily apparent to the skilled person.
Claims (21)
1. A digital speech processing apparatus comprising:
a prosody generation means for generating a prosody for a text; and an editing means for displaying and modifying the generated prosody.
2. The apparatus of claim 1 further comprising:
translation means for translating the text into a phonetic transcription, said translation means further comprising:
means for displaying and modifying the generated phonetic transcription.
3. The apparatus of claim 1 or 2, wherein said prosody generating means and/or said translation means generates said prosody and/or said phonetic transcription based on, or in dependence on, a particular speaker's model.
4. The apparatus of one of claims 1 to 3, further comprising:
means for displaying and/or modifying one or more speaker's models.
5. The apparatus of claim 4, wherein said speaker's model modification means comprises:
means for modifying phonetic transcription elements for the generation of accents.
6. An apparatus for generating digital speech comprising:
an apparatus for digital speech processing according to one of claims 1 to 4; and means for generating speech signals based on said phonetic transcription which may have been edited using said editing means and/or based on said prosody.
7. The apparatus of claim 6, wherein said speech signal generating means further comprises:
a speaker's model processing means for generating said speech signals based on, or depending on, a particular speaker's model.
8. The apparatus of claim 7, wherein said speaker's model processing means comprises one or more of the following:
a digital filter system;
means for adopting a set of filter parameters representing a particular speaker's model.
9. The apparatus of claim 7 or 8, wherein said speaker's model processing means further comprises:
means for selecting and/or modifying a speaker's model.
10. The apparatus of one of claims 6 to 9, further comprising:
effect generating means for generating sound effects.
11. The apparatus of claim 10, wherein said effect generating means comprises one or more of the following:
digital filter means for modifying the generated speech signals, and/or a reverb generator for generating a reverb effect.
12. The apparatus of one of claims 6 to 11, further comprising:
archive means for storing sounds; and mixing means for mixing the generated speech signals with the sounds stored in said archive means.
13. The apparatus of one of the preceding claims, further comprising:
a graphical user interface for editing the generated phonetic transcription and/or prosody.
14. The apparatus of one of the preceding claims, further comprising:
means for modifying speech rhythm and/or pronunciation and/or intonation.
15. The apparatus of one of the preceding claims, further comprising:
display means for displaying the prosody by means of a symbolic notation.
16. The apparatus of one of the preceding claims, further comprising:
dictionary means in which the words of one or more languages are stored together with their pronunciation.
17. The apparatus of claim 16, wherein for at least one dictionary entry different phonetic entries are stored in said dictionary means.
18. The apparatus of one of claims 6 to 17, further comprising:
means for converting said digital speech signals into acoustic signals.
19. A digital speech processing method comprising:
generating a prosody for a text;
displaying said generated prosody; and editing said generated and displayed prosody.
20. The method of claim 19, further comprising:
using an apparatus according to one of claims 1 to 18 for generating digital speech.
21. A computer program product comprising:
a medium, in particular a data carrier for storing and/or transmitting digital data readable by a computer, wherein said stored and/or transmitted data comprise:
a sequence of computer-executable instructions causing said computer to carry out a method according to one of claims 19 or 20.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19841683A DE19841683A1 (en) | 1998-09-11 | 1998-09-11 | Device and method for digital speech processing |
DE19841683.0 | 1998-09-11 | ||
PCT/EP1999/006712 WO2000016310A1 (en) | 1998-09-11 | 1999-09-10 | Device and method for digital voice processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2343071A1 true CA2343071A1 (en) | 2000-03-23 |
Family
ID=7880683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002343071A Abandoned CA2343071A1 (en) | 1998-09-11 | 1999-09-10 | Device and method for digital voice processing |
Country Status (7)
Country | Link |
---|---|
EP (1) | EP1110203B1 (en) |
JP (1) | JP2002525663A (en) |
AT (1) | ATE222393T1 (en) |
AU (1) | AU769036B2 (en) |
CA (1) | CA2343071A1 (en) |
DE (2) | DE19841683A1 (en) |
WO (1) | WO2000016310A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8566880B2 (en) | 2008-07-22 | 2013-10-22 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Device and method for providing a television sequence using database and user inputs |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10117367B4 (en) * | 2001-04-06 | 2005-08-18 | Siemens Ag | Method and system for automatically converting text messages into voice messages |
JP2002318593A (en) * | 2001-04-20 | 2002-10-31 | Sony Corp | Language processing system and language processing method as well as program and recording medium |
AT6920U1 (en) | 2002-02-14 | 2004-05-25 | Sail Labs Technology Ag | METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS |
DE10207875A1 (en) * | 2002-02-19 | 2003-08-28 | Deutsche Telekom Ag | Parameter-controlled, expressive speech synthesis from text, modifies voice tonal color and melody, in accordance with control commands |
EP1726005A4 (en) * | 2004-03-05 | 2007-06-20 | Lessac Technologies Inc | Prosodic speech text codes and their use in computerized speech systems |
DE102004012208A1 (en) * | 2004-03-12 | 2005-09-29 | Siemens Ag | Individualization of speech output by adapting a synthesis voice to a target voice |
US10424288B2 (en) | 2017-03-31 | 2019-09-24 | Wipro Limited | System and method for rendering textual messages using customized natural voice |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5695295A (en) * | 1979-12-28 | 1981-08-01 | Sharp Kk | Voice sysnthesis and control circuit |
FR2494017B1 (en) * | 1980-11-07 | 1985-10-25 | Thomson Csf | METHOD FOR DETECTING THE MELODY FREQUENCY IN A SPEECH SIGNAL AND DEVICE FOR CARRYING OUT SAID METHOD |
JPS58102298A (en) * | 1981-12-14 | 1983-06-17 | キヤノン株式会社 | Electronic appliance |
US4623761A (en) * | 1984-04-18 | 1986-11-18 | Golden Enterprises, Incorporated | Telephone operator voice storage and retrieval system |
US5559927A (en) * | 1992-08-19 | 1996-09-24 | Clynes; Manfred | Computer system producing emotionally-expressive speech messages |
US5956685A (en) * | 1994-09-12 | 1999-09-21 | Arcadia, Inc. | Sound characteristic converter, sound-label association apparatus and method therefor |
AU3400395A (en) * | 1994-09-12 | 1996-03-29 | Atr Human Information Processing Research Laboratories Co.,Ltd. | Sound characteristic convertor, sound/label associating apparatus and method to form them |
DE19503419A1 (en) * | 1995-02-03 | 1996-08-08 | Bosch Gmbh Robert | Method and device for outputting digitally coded traffic reports using synthetically generated speech |
JPH08263094A (en) * | 1995-03-10 | 1996-10-11 | Winbond Electron Corp | Synthesizer for generation of speech mixed with melody |
EP0762384A2 (en) * | 1995-09-01 | 1997-03-12 | AT&T IPM Corp. | Method and apparatus for modifying voice characteristics of synthesized speech |
DE19610019C2 (en) * | 1996-03-14 | 1999-10-28 | Data Software Gmbh G | Digital speech synthesis process |
JP3616250B2 (en) * | 1997-05-21 | 2005-02-02 | 日本電信電話株式会社 | Synthetic voice message creation method, apparatus and recording medium recording the method |
US6226614B1 (en) * | 1997-05-21 | 2001-05-01 | Nippon Telegraph And Telephone Corporation | Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon |
-
1998
- 1998-09-11 DE DE19841683A patent/DE19841683A1/en not_active Withdrawn
-
1999
- 1999-09-10 AT AT99947314T patent/ATE222393T1/en not_active IP Right Cessation
- 1999-09-10 AU AU60813/99A patent/AU769036B2/en not_active Ceased
- 1999-09-10 WO PCT/EP1999/006712 patent/WO2000016310A1/en active IP Right Grant
- 1999-09-10 CA CA002343071A patent/CA2343071A1/en not_active Abandoned
- 1999-09-10 EP EP99947314A patent/EP1110203B1/en not_active Expired - Lifetime
- 1999-09-10 DE DE59902365T patent/DE59902365D1/en not_active Expired - Fee Related
- 1999-09-10 JP JP2000570766A patent/JP2002525663A/en not_active Withdrawn
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8566880B2 (en) | 2008-07-22 | 2013-10-22 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Device and method for providing a television sequence using database and user inputs |
Also Published As
Publication number | Publication date |
---|---|
ATE222393T1 (en) | 2002-08-15 |
JP2002525663A (en) | 2002-08-13 |
DE19841683A1 (en) | 2000-05-11 |
AU769036B2 (en) | 2004-01-15 |
DE59902365D1 (en) | 2002-09-19 |
AU6081399A (en) | 2000-04-03 |
WO2000016310A1 (en) | 2000-03-23 |
EP1110203A1 (en) | 2001-06-27 |
EP1110203B1 (en) | 2002-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7979274B2 (en) | Method and system for preventing speech comprehension by interactive voice response systems | |
WO2006123539A1 (en) | Speech synthesizer | |
JPH0833744B2 (en) | Speech synthesizer | |
AU769036B2 (en) | Device and method for digital voice processing | |
JP2008058379A (en) | Speech synthesis system and filter device | |
JPH08335096A (en) | Text voice synthesizer | |
JPH07200554A (en) | Sentence read-aloud device | |
JP2577372B2 (en) | Speech synthesis apparatus and method | |
JP3113101B2 (en) | Speech synthesizer | |
JPH09179576A (en) | Voice synthesizing method | |
JP2703253B2 (en) | Speech synthesizer | |
JP2573586B2 (en) | Rule-based speech synthesizer | |
JP2658109B2 (en) | Speech synthesizer | |
JP3862300B2 (en) | Information processing method and apparatus for use in speech synthesis | |
JP3292218B2 (en) | Voice message composer | |
JP2573587B2 (en) | Pitch pattern generator | |
KR100269215B1 (en) | Method for producing fundamental frequency contour of prosodic phrase for tts | |
JP2573585B2 (en) | Speech spectrum pattern generator | |
JP2586040B2 (en) | Voice editing and synthesis device | |
JP2006349787A (en) | Speech synthesis method and apparatus | |
JPH1011083A (en) | Text voice converting device | |
JPH06214585A (en) | Voice synthesizer | |
JPH06138894A (en) | Device and method for voice synthesis | |
JP2001166787A (en) | Voice synthesizer and natural language processing method | |
JPH0553595A (en) | Speech synthesizing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FZDE | Discontinued |