JP2002525663A

JP2002525663A - Digital voice processing apparatus and method

Info

Publication number: JP2002525663A
Application number: JP2000570766A
Authority: JP
Inventors: ハンスクル
Original assignee: ハンスクル
Priority date: 1998-09-11
Filing date: 1999-09-10
Publication date: 2002-08-13
Also published as: ATE222393T1; DE19841683A1; AU769036B2; DE59902365D1; AU6081399A; WO2000016310A1; EP1110203A1; EP1110203B1; CA2343071A1

Abstract

(57) [Summary] A digital speech processing apparatus having prosody generation means for generating a prosody for a text, and editing means for displaying and modifying the generated prosody.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] The present invention relates to an apparatus and a method for digital speech processing and speech generation. Systems that output speech digitally have so far been used where synthetic speech is acceptable or desired. The present invention relates to a system that makes it possible to synthetically generate speech that gives a natural impression.

[0002] In systems for digital speech generation, information about prosody and intonation is generated automatically, for example as described in EP0689706. In some systems, as shown for example in EP0598599, additional commands can be inserted into the text screen before it is passed to the speech generation means. These commands are entered, for example, as special characters that are not pronounced, as described for example in EP0598598.

[0003] The commands inserted into the text screen may include notations concerning the characteristics of a speaker (for example, parameters of a speaker model). EP0762384 describes a system in which these speaker characteristics can be entered on screen by means of a graphical user interface.

[0004] Speech synthesis is carried out using auxiliary information stored in a data bank (for example, as a waveform sequence in the case of EP0831460). For the pronunciation of words that are not stored in the data bank, however, pronunciation rules must be provided in the program. Simply combining the individual sequences without special measures produces distortion and artefacts. This problem (commonly called "segmental quality") has largely been solved today (see, for example, Volker Kraft: "Verkettung naturlichsprachlicher Baustein zur Sprachsynthese: Anforderungen, Techniken und Evaluierung" [Concatenation of natural-language components for speech synthesis: requirements, techniques and evaluation], Fortschr-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag, 1997). Nevertheless, several problems remain even in recent speech synthesis.

[0005] One problem of digital speech output is, for example, multilingual capability.

[0006] Another problem concerns improving the prosodic quality, that is, the quality of the intonation (see, for example, Volker Kraft: "Verkettung naturlichsprachlicher Baustein zur Sprachsynthese: Anforderungen, Techniken und Evaluierung", Fortschr-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag, 1997). The problem arises because orthographic input information alone is not sufficient to construct the intonation. Intonation also depends on higher levels such as semantics and pragmatics, on the speaker's situation, and on the type of speaker. In general, the quality of today's speech output systems is adequate where the listener expects or accepts synthetic speech. Often, however, the quality of synthetic speech is judged insufficient or unsatisfactory.

[0007] Accordingly, it is an object of the present invention to provide an apparatus and a method for digital speech processing that can generate synthetic speech of improved quality. Another object of the present invention is to synthetically generate speech that gives a natural impression. The range of applications extends from producing audio for simple texts in multimedia applications to producing audio for film dubbing, radio dramas and audio books.

[0008] Even when synthetically generated speech gives a natural impression, it must be possible to intervene in order to produce dramaturgical effects. Another object of the present invention is to provide such possibilities for intervention.

[0009] The invention is defined in the independent claims. The dependent claims define particular embodiments of the invention.

[0010] The object of the present invention is essentially achieved by providing the ability to modify, by means of an editor, the prosody generated for a text.

[0011] Particular embodiments of the present invention make it possible, in addition to editing the prosody, to give further characteristics to the synthetically generated speech.

[0012] The starting point is therefore written text. However, in order to achieve sufficient quality (in particular with regard to prosody) and to produce dramaturgical effects, particular embodiments give the user extensive possibilities to intervene. The user takes the role of a director, defines speakers in the system, and assigns them speech rhythm and prosody, pronunciation and intonation.

[0013] The present invention preferably also includes generating a phonetic transcription for the written text, together with the ability to modify the generated phonetic transcription directly or to have it modified on the basis of modifiable rules. This makes it possible, for example, to generate a particular accent for a speaker.

[0014] In yet another preferred embodiment of the present invention, dictionary means are provided in which the words of one or more languages are stored together with their pronunciation. Where two or more languages are stored, the apparatus is multilingual, that is, it can process texts in different languages.

[0015] The generated phonetic transcription or prosody is preferably edited with an editor that is easy to use, for example a graphical user interface.

[0016] In yet another embodiment of the present invention, speaker models that are predefined, or defined or modified by the user, are taken into account in the speech processing. In this way the characteristics of different speakers can be realized, whether a male or a female voice, or different accents of a speaker such as a Bavarian (bayerischer), Swabian (schwabischer) or North German accent.

[0017] In a particularly preferred embodiment, the apparatus comprises the following:
A dictionary, in which the pronunciation of every word is stored in phonetic transcription. (Hereinafter, "phonetic transcription" means any phonetic notation. Examples are the SAMPA notation, see for example "Multilingual speech input/output assessment, methodology and standardization, standard computer-compatible transcription", pp. 29-31, in Esprit Project 2589 (SAM) Fin Report SAM-UCC-037, or the international phonetic script described in teaching material, see for example "The Principles of the International Phonetic Association: A description of the International Phonetic Alphabet and the Manner of Using it", International Phonetic Association, Dept. Phonetics, Univ. College of London.)
A translator (conversion means), which converts the entered text into phonetic transcription and generates a prosody.
An editor (editing means), with which text can be entered and assigned to speakers, and with which the generated phonetic transcription and prosody can be displayed and modified.
An input module, with which speaker models can be defined.
A system for digital speech generation, which generates, from the phonetic transcription and the prosody, signals representing spoken speech (or data representing such signals) and which can process different speaker models.
A system of digital filters and similar components (for reverberation, echo and the like), with which particular effects can be produced.
A sound archive.
A mixing device, in which the generated speech signals can be mixed with sounds from the archive and modified with effects.
The invention can be implemented in hybrid form in software and hardware, or entirely in software. The generated digital speech signals can be output by a dedicated device for digital audio or by the sound board of a personal computer (PC).

[0018] The invention is described below by way of several embodiments with reference to the drawing.

[0019] FIG. 1 is a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.

[0020] In the embodiment described below, the invention comprises several components that can be implemented on one or more digital processing devices, or on combinations thereof. Their operation is described below.

[0021] The dictionary 100 consists of simple tables (one per language) in which the words of a language are stored together with their pronunciation. The tables can be extended as desired to include additional words and their pronunciation. For particular purposes, for example to generate an accent, additional tables holding different phonetic data for the same language may be provided. One table of the dictionary is assigned to each of the different speakers.
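The table layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `PronunciationDictionary`, the `variant` key used for accent tables, and the SAMPA-like sample strings are hypothetical and do not come from the patent.

```python
from typing import Dict, Optional, Tuple

# word -> phonetic transcription (sample strings in the usage note are illustrative)
PronunciationTable = Dict[str, str]


class PronunciationDictionary:
    """Hypothetical sketch: one base table per language plus accent tables."""

    def __init__(self) -> None:
        # Keyed by (language, variant); variant None denotes the base table.
        self._tables: Dict[Tuple[str, Optional[str]], PronunciationTable] = {}

    def add_word(self, language: str, word: str, transcription: str,
                 variant: Optional[str] = None) -> None:
        """Extend a table with an additional word and its pronunciation."""
        table = self._tables.setdefault((language, variant), {})
        table[word.lower()] = transcription

    def lookup(self, language: str, word: str,
               variant: Optional[str] = None) -> Optional[str]:
        """Prefer the accent table, then fall back to the base table."""
        key = word.lower()
        if variant is not None:
            hit = self._tables.get((language, variant), {}).get(key)
            if hit is not None:
                return hit
        return self._tables.get((language, None), {}).get(key)
```

A speaker would then be associated with one `(language, variant)` pair, so that a lookup for that speaker first consults the accent table and only then the base table of the language.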

[0022] The translator 110 generates the phonetic transcription by replacing the words of the input text with their phonetic counterparts from the dictionary. If modifiers (described in detail below) are stored in the speaker model, they are used to modify the pronunciation.
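A minimal sketch of this word-by-word substitution, assuming the dictionary is a plain mapping and each modifier is a function on transcription strings. The function name `transcribe`, the `Modifier` type and the angle-bracket marker for unknown words are assumptions made for illustration; the patent does not specify how unknown words are handled at this step.

```python
from typing import Callable, Dict, List

# A modifier rewrites one phonetic transcription (hypothetical type).
Modifier = Callable[[str], str]


def transcribe(text: str, table: Dict[str, str],
               modifiers: List[Modifier]) -> List[str]:
    """Replace each word by its dictionary transcription, then apply modifiers."""
    result: List[str] = []
    for raw in text.split():
        key = raw.strip(".,;:!?\"'()").lower()
        if not key:
            continue  # token consisted of punctuation only
        # Assumption: unknown words are flagged so a user can fix them later.
        phonetic = table.get(key, "<" + key + ">")
        for modify in modifiers:
            phonetic = modify(phonetic)
        result.append(phonetic)
    return result
```

With an empty modifier list the function returns the dictionary transcriptions unchanged; passing the modifiers stored in a speaker model changes the pronunciation for that speaker only.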

[0023] In addition, the prosody is generated using heuristics known from speech processing. Examples of such heuristics are the model of Fujisaki (1992) and other acoustic methods, as well as perceptual models such as that of d'Alessandro and Mertens (1995). These and older linguistic models are described, for example, in Thierry Dutoit: "An Introduction to Text-to-Speech Synthesis", Kluwer, 1997. That work also describes methods of segmentation (the placement of breaks), which is likewise performed by the translator.

[0024] The choice of method is of minor importance, because the translator only produces a first version of the prosody, which the user can then modify.

[0025] The editor 120 gives the user a tool for entering and changing pronunciation, intonation, accentuation, speed, volume, pauses and the like.

[0026] First, the user assigns speaker models to the text segments to be processed. The structure and operation of the speaker models are described in detail below. In accordance with this assignment, the translator adapts the phonetics, and where applicable the prosody, to the speaker model and generates them anew. The phonetics are displayed to the user as a phonetic transcription; the prosody is displayed, for example, in a symbolic notation borrowed from music (a score). The user can then modify these, listen to individual text segments, make further changes, and so on.
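One possible in-memory representation of what the editor manipulates is sketched below. It assumes that a score-like prosody can be stored as one note per syllable; the names `ProsodyNote`, `TextSegment`, `transpose` and `stretch`, and the choice of semitones and milliseconds as units, are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProsodyNote:
    """One syllable in a score-like prosody notation (hypothetical layout)."""
    syllable: str
    pitch_semitones: float   # relative to the speaker's base pitch
    duration_ms: int
    pause_after_ms: int = 0


@dataclass
class TextSegment:
    """A piece of text assigned to one speaker, with editable prosody."""
    text: str
    speaker: str
    notes: List[ProsodyNote] = field(default_factory=list)

    def transpose(self, semitones: float) -> None:
        """Raise or lower the intonation contour of the whole segment."""
        for note in self.notes:
            note.pitch_semitones += semitones

    def stretch(self, factor: float) -> None:
        """Change the speaking rate by scaling durations and pauses."""
        for note in self.notes:
            note.duration_ms = round(note.duration_ms * factor)
            note.pause_after_ms = round(note.pause_after_ms * factor)
```

A graphical editor could draw the `notes` on a staff and translate each user gesture into an update of these fields before the segment is synthesized again for listening.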

[0027] The text itself can be stored in the editor if it cannot be imported directly from another text processing system.

[0028] The speaker models 130 are, for example, parameterizations for speech generation. The models reproduce the characteristics of the human vocal tract. The function of the vocal cords is represented by a sequence of pulses of which only the frequency (pitch) is adjustable. The other characteristics of the vocal tract (oral cavity, nasal cavity) can be realized by digital filters. Their parameters are stored in the speaker model. Standard models (child, young woman, elderly man, etc.) are stored. Starting from these, the user can create and store additional models by selecting and adjusting the parameters. The stored parameters are used during the speech generation described below, together with the prosody information for the intonation.
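The pulse-plus-filter idea can be illustrated with a very small source-filter sketch. It assumes a unit pulse train as the vocal-cord source and a cascade of two-pole resonators as the digital filters; the names `SpeakerModel`, `pulse_train`, `resonator` and `synthesize`, and all numeric values in the usage note, are illustrative assumptions rather than parameters given in the patent.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeakerModel:
    """Hypothetical parameter set: pitch plus (centre, bandwidth) resonances in Hz."""
    name: str
    pitch_hz: float
    formants: List[Tuple[float, float]]


def pulse_train(pitch_hz: float, duration_s: float,
                sample_rate: int = 16000) -> List[float]:
    """Vocal-cord source: one unit pulse per pitch period."""
    period = max(1, round(sample_rate / pitch_hz))
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal: List[float], centre_hz: float, bandwidth_hz: float,
              sample_rate: int = 16000) -> List[float]:
    """Two-pole digital filter standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * centre_hz / sample_rate)
    a2 = -r * r
    out: List[float] = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(model: SpeakerModel, duration_s: float,
               sample_rate: int = 16000) -> List[float]:
    """Excite the cascade of resonators with the pulse train."""
    signal = pulse_train(model.pitch_hz, duration_s, sample_rate)
    for centre_hz, bandwidth_hz in model.formants:
        signal = resonator(signal, centre_hz, bandwidth_hz, sample_rate)
    return signal
```

For example, `synthesize(SpeakerModel("demo", 120.0, [(700.0, 130.0), (1200.0, 70.0)]), 0.5)` yields half a second of a buzzy, vowel-like sound. A user-defined model would differ only in the stored numbers, and the prosody would vary `pitch_hz` over time instead of keeping it constant as this sketch does.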

[0029] Special characteristics of a speaker, such as an accent or a speech impediment, can also be entered. These are used by the translator to modify the pronunciation. A simple example of such a modification is replacing "ʃt" with "st" in the phonetic transcription (to produce the accent of a person from Hamburg).
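Such a modification can be stored as data in the speaker model, for instance as an ordered list of replacements. The sketch below writes the sound with the IPA symbol ʃ; the class name `AccentRules` and the sample transcriptions are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AccentRules:
    """Hypothetical container for the replacement rules of one accent."""
    name: str
    replacements: List[Tuple[str, str]] = field(default_factory=list)

    def apply(self, transcription: str) -> str:
        """Apply every (old, new) replacement in the stored order."""
        for old, new in self.replacements:
            transcription = transcription.replace(old, new)
        return transcription


# The Hamburg example from the text: "ʃt" becomes "st".
hamburg = AccentRules("Hamburg", [("ʃt", "st")])
```

Calling `hamburg.apply("ʃtaɪn")` gives `"staɪn"`, while transcriptions that do not contain the sequence are returned unchanged. Because the rules are plain data, a user could edit them without touching the translator itself.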

[0030] A speaker model relates, for example, to the rules that the translator follows when generating the phonetic transcription; different speaker models follow different rules. A speaker model may, however, also correspond to a set of filter parameters used to process the speech signal according to specified voice characteristics. Any combination of these two aspects of a speaker model is of course also conceivable.

[0031] The task of the speech generation unit 140 is to produce, from the given text and the additional phonetic and prosodic information (generated by the translator and edited in the editor), a numerical data stream representing the digital speech signal. The output device 150 (a digital audio device or the sound board of a PC) converts this data stream into an analog sound signal (the text to be output).

[0032] Conventional text-to-speech methods can be used to generate the speech, with the pronunciation and prosody already generated beforehand. In general, a distinction is made between rule-based synthesizers and concatenation-based synthesizers.

[0033] Rule-based synthesizers operate using rules for producing sounds and the transitions between sounds. These synthesizers work with 60 or fewer parameters, but determining the parameters is very laborious. Very good results can nevertheless be achieved with this type of synthesizer. An overview of these systems and further details can be found in Thierry Dutoit: "An Introduction to Text-to-Speech Synthesis", Kluwer, 1997.

[0034] Concatenation-based synthesizers are easier to handle. They operate with a database that stores all possible pairs of sounds, which can be concatenated easily. However, building a system of good quality requires considerable computing power. Systems of this type are described in Thierry Dutoit: "An Introduction to Text-to-Speech Synthesis", Kluwer, 1997, and in Volker Kraft: "Verkettung naturlichsprachlicher Baustein zur Sprachsynthese: Anforderungen, Techniken und Evaluierung", Fortschr-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag, 1997.

[0035] In principle, either type of system can be used. In rule-based synthesizers the prosodic information directly influences the rules; in concatenation-based systems it is superimposed in a suitable manner.

[0036] To generate particular sound effects 160, known digital signal processing techniques can be used, such as digital filters (for example, a band-pass filter for a telephone effect), reverberation generators and the like. These can also be applied to the sounds stored in the archive 170.
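Two of the effects named above can be sketched with very simple filters. The following is an illustration under assumptions, not the filter design of the patent: the band edges of 300 Hz and 3400 Hz are the conventional telephone voice band and do not appear in the source, the band-pass is built crudely from two one-pole low-pass filters, and a single delayed copy stands in for a full reverberation generator. All function names are hypothetical.

```python
import math
from typing import List


def one_pole_lowpass(signal: List[float], cutoff_hz: float,
                     sample_rate: int) -> List[float]:
    """First-order recursive low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out: List[float] = []
    y = 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def telephone_bandpass(signal: List[float], sample_rate: int,
                       low_hz: float = 300.0,
                       high_hz: float = 3400.0) -> List[float]:
    """Crude band-pass: low-pass at high_hz, then remove content below low_hz."""
    upper = one_pole_lowpass(signal, high_hz, sample_rate)
    lower = one_pole_lowpass(upper, low_hz, sample_rate)
    return [u - l for u, l in zip(upper, lower)]


def add_echo(signal: List[float], delay_samples: int,
             decay: float) -> List[float]:
    """Add one delayed, attenuated copy of the signal (a single echo)."""
    out = list(signal) + [0.0] * delay_samples
    for i, x in enumerate(signal):
        out[i + delay_samples] += decay * x
    return out
```

Both functions take and return plain sample lists, so they can be applied equally to generated speech and to archive sounds, as the text describes.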

[0037] The archive 170 stores sounds such as road noise, railway sounds, the murmur of children, the sound of the sea, background music and the like. The archive can be extended as desired. It may simply be a collection of files containing digitized sounds, or it may be a database in which the sounds are stored as BLOBs (binary large objects).
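The database variant of the archive could look roughly like the following sketch, which uses Python's standard `sqlite3` module. The table name `sounds`, its columns and the helper function names are assumptions made for this example.

```python
import sqlite3
from typing import Optional


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a sound archive; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sounds ("
        "name TEXT PRIMARY KEY, audio BLOB NOT NULL)"
    )
    return conn


def store_sound(conn: sqlite3.Connection, name: str, audio: bytes) -> None:
    """Add a digitized sound to the archive, replacing any older entry."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO sounds (name, audio) VALUES (?, ?)",
            (name, audio),
        )


def load_sound(conn: sqlite3.Connection, name: str) -> Optional[bytes]:
    """Return the stored audio bytes, or None if the sound is unknown."""
    row = conn.execute(
        "SELECT audio FROM sounds WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

The file-collection variant mentioned in the text would replace the table with a directory and the two helpers with ordinary file reads and writes.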

[0038] In the mixing device 180, the generated speech signals are combined with the background sounds. The volume of every signal can be adjusted before the signals are combined. In addition, effects can be applied to each signal individually or to all signals together.
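The mixing step amounts to a gain-weighted sum of the signals. A minimal sketch follows, assuming floating-point samples in the range -1.0 to 1.0 and hard clipping of the result; the function name `mix` and the clipping behaviour are assumptions, not details from the patent.

```python
from typing import List, Tuple


def mix(tracks: List[Tuple[List[float], float]]) -> List[float]:
    """Sum several (samples, gain) tracks; shorter tracks are padded with silence."""
    length = max((len(samples) for samples, _ in tracks), default=0)
    out = [0.0] * length
    for samples, gain in tracks:
        for i, x in enumerate(samples):
            out[i] += gain * x  # per-track volume adjustment before summing
    # Assumption: limit the sum to the valid sample range.
    return [max(-1.0, min(1.0, v)) for v in out]
```

For instance, `mix([(speech, 1.0), (sea_noise, 0.3)])` would place a quiet background under the generated speech. An effect can be applied to one track before mixing or to the returned sum, corresponding to the individual and common application described above.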

[0039] The signal generated in this way is passed to a suitable device 150 for digital audio, for example the sound board of a PC, where it can be checked by listening or output acoustically. In addition, storage means (not shown) are provided so that the signal can be stored and later transferred to the target medium in a suitable manner.

[0040] Mixing devices implemented in hardware have long been known. Such a device may be used, or the mixing device may be implemented in software, for example as part of a program.

[0041] Those skilled in the art will readily be able to modify the embodiment described above. For example, in a further embodiment of the present invention, a further computer connected to the mixing device 180 via a network can be used in place of the output device 150. In this case, the generated speech signal can be transmitted to the other computer via a computer network, for example the Internet.

[0042] In yet another embodiment, the speech signal generated by the speech generation unit 140 can be transmitted directly to the output device 150 without passing through the mixing device 180. Further similar modifications will be apparent to those skilled in the art.

[Brief description of the drawings]

FIG. 1 is a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.

Continued from the front page: (71) Applicant 2, Kalkarra Cr, Mt. Duneed Vic 3216, Australia

Claims

1. A digital speech processing apparatus comprising: a prosody generation means for generating a prosody for a text; and an editing means for displaying and modifying the generated prosody.

2. The apparatus according to claim 1, further comprising conversion means for converting the text into a phonetic transcription, wherein the conversion means includes means for displaying and modifying the generated phonetic transcription.

3. The apparatus according to claim 1, wherein the prosody generation means and/or the conversion means generate the prosody and/or the phonetic transcription based on, or depending on, a specific speaker model.

4. The apparatus according to claim 1, further comprising means for selecting and/or modifying one or more speaker models.

5. The apparatus according to claim 4, wherein the means for modifying the speaker model comprises means for modifying elements of the phonetic transcription in order to produce an accent.

6. The apparatus for digital speech processing according to claim 1, further comprising means for generating a speech signal based on the phonetic transcription and/or the prosody, either of which may have been edited using the editing means.

7. The apparatus according to claim 6, wherein said voice signal generating means further comprises speaker model processing means for generating said voice signal based on or dependent on a specific speaker model.

8. The apparatus according to claim 7, wherein said speaker model processing means comprises at least one of: a digital filter system; and means for employing a set of filter parameters representing a particular speaker model.

9. The apparatus according to claim 7, wherein said speaker model processing means comprises means for selecting and/or modifying a speaker model.

10. The apparatus according to claim 6, further comprising sound effect generating means for generating a sound effect.

11. The apparatus according to claim 10, wherein said sound effect generating means has at least one of: digital filter means for modifying the generated audio signal; and reverberation generating means for generating a reverberation effect.

12. The apparatus according to claim 6, further comprising: archiving means for storing sounds; and mixing means for mixing the generated audio signal with the sounds stored in the archiving means.

13. The apparatus according to claim 1, further comprising a graphical user interface for editing the generated phonetic transcription and/or prosody.

14. The apparatus according to claim 1, further comprising means for modifying speech rhythm and/or pronunciation and/or intonation.

15. The apparatus according to claim 1, further comprising display means for displaying prosody by symbolic notation.

16. Apparatus according to claim 1, further comprising dictionary means for storing words in one or more languages together with their pronunciation.

17. The apparatus according to claim 16, wherein different voice data is stored in the dictionary means for at least one dictionary entry.

18. The apparatus according to claim 6, further comprising means for converting said digital audio signal into an audio signal.

19. A digital speech processing method comprising: generating a prosody of a text; displaying the generated prosody; and editing the generated and displayed prosody.

20. The method according to claim 19, further comprising generating digital speech.

21. A computer program product comprising a computer-readable medium for storing and transmitting digital data, in particular a data carrier, wherein the data stored and/or transmitted comprises a sequence of computer-executable instructions that cause a computer to carry out the method according to claim 19 or 20.