JP4244706B2 - Audio playback device

Info

Publication number: JP4244706B2
Application number: JP2003152895A
Authority: JP
Inventors: 隆宏川嶋
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2003-05-29
Filing date: 2003-05-29
Publication date: 2009-03-25
Anticipated expiration: 2023-05-29
Also published as: JP2004354748A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声再生装置に関し、特に音声合成により特定のフレーズ（例えば「こんにちは」など）について高品質に再生することができる音声再生装置に関する。
【０００２】
【従来の技術】
従来、電子メールなどの文字列情報を音声に変換して出力する文字列音声変換装置が考え出されている。従来の文字列音声変換装置としては、文字列情報を文節単位に区切り、音声出力すると同時にその内容を表示するものがある（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開２００１−７９３７号公報
【０００４】
【発明が解決しようとする課題】
しかしながら、従来の文字列音声変換装置では、文字列情報を文節単位に区切って音声出力するものの、その音声出力は発音単位（又は文字単位）の音声の集合であるので、発音単位のつなぎ目の再生（音声出力）に違和感がある。すなわち、従来の文字列音声変換装置では、文節全体について品質の良い音声で声色を変化させて出力すること、すなわち自然な音声（例えば「こんにちは」）として出力することができないという問題点がある。
【０００５】
また、この問題点を解決するために、例えば文節（又はフレーズ）についての音声を予めサンプリングして音声データとして保持しておき、再生時には音声波形として出力する手法が考えられる。しかし、この手法では、音声出力の品質を上げるためにはサンプリング周波数を上げなければならず、大容量の音声データを保持する必要があり、携帯電話などにおいて大きなデメリットがある。
【０００６】
本発明は、上記問題を解決するためになされたもので、文字列情報などからなる所望のフレーズ（例えば「こんにちは」など）を品質の良い音声として声色を変化させて再生（出力）することができる音声再生装置を提供するものである。
【０００７】
【課題を解決するための手段】
上記課題を解決するため、この発明は以下の構成を有する。
即ち、本発明は、予め発音単位に対応するフォルマントフレームデータを保持するデータベースである合成辞書を有して、発音単位が羅列された情報が与えられることにより前記合成辞書を用いて音声合成する音声再生装置において、前記合成辞書に保持されている発音単位のフォルマントフレームデータを任意のユーザデータに置き換える置換手段と、前記発音単位が羅列された情報が与えられたときに、前記置換手段によって保持データが置き換えられた前記合成辞書を用いて音声を合成する音声合成手段とを有することを特徴とする。
【０００８】
また、本発明は、前記ユーザデータがフレーズ単位で取得されたフォルマントフレームデータであることを特徴とする。
【０００９】
また、本発明は、前記ユーザデータが前記合成辞書に保持されるフォルマントフレームデータを加工する音色パラメータに付加されていることを特徴とする。
【００１０】
また、本発明は、前記置換手段が、前記ユーザデータが付加されている音色パラメータが与えられ、かつ、再生時に該音色パラメータが指定されたときに、前記合成辞書の保持データであるフォルマントフレームデータを、前記ユーザデータに置き換え、前記音声合成手段は、音声単位の羅列情報が与えられたときに、前記音色パラメータにより置き換えられた合成辞書を用いて音声合成することを特徴とする。
【００１１】
また、本発明は、楽曲及び音声を同期させて所望データを再生するための情報構造を定義したデータ交換フォーマットに、前記ユーザデータを含ませ、該データ交換フォーマットを用いて音声を合成するものであることを特徴とする。
【００１２】
また、本発明は、前記データ交換フォーマットとして構成された情報に含まれる楽曲部情報についてはそのまま再生し、該情報に含まれる音声情報については前記置換手段及び前記音声合成手段を用いて再生するものであることを特徴とする。
【００１３】
また、本発明は、前記データ交換フォーマットが音声パラメータに前記ユーザデータを付加した情報を構成要素とすることを特徴とする。
【００１４】
【発明の実施の形態】
以下、図面を参照し、本発明の実施形態について説明する。
図１は、本発明の実施形態に係る音声再生装置の構成を示すブロック図である。まず、本実施形態に係る音声再生装置の基盤構成について説明する。
【００１５】
本音声再生装置１は、アプリケーション１４、ミドルウェアＡＰＩ１５、コンバータ１６、ドライバ１７、デフォルト音色パラメータ１８、デフォルト合成辞書１９及び音源２０を備え、スクリプト１１、ユーザ音色パラメータ１２、ユーザフレーズ合成辞書（可変長）１３が入力されることにより音声を再生する構成となっている。
【００１６】
音声再生装置１は、ＦＭ音源のリソースを用いたＣＳＭ（複合正弦波モデル）音声合成方式によるフォルマント合成により音声を再生する手法を基本としている。そして、本実施形態では、ユーザフレーズ合成辞書１３を定義し、音声再生装置１が音色パラメータに音素単位でユーザフレーズを割り付ける。そして、音声再生装置１は再生時において音色パラメータにユーザフレーズ合成辞書１３のデータが割り付けられているときは、デフォルト合成辞書１９の音素をユーザフレーズに置き換え、その置き換えたデータにより音声合成を行う。なお、上記「音素（Phoneme）」とは、発音の最小単位であり、日本語でいえば母音と子音の２種類がある。次に、音声再生装置１の詳細について説明する。
【００１７】
スクリプト１１は、「ＨＶ（Human Voice：前記手法により合成される音声）」を再生するためのデータフォーマットを定義しているものである。すなわち、スクリプト１１は、韻律記号を含んだ合成文字列、発音させる音の設定、再生アプリケーションなどのメッセージからなる音声合成を行うためのフォーマットであり、ユーザによる入力を容易にするために例えばテキスト入力となっている。このスクリプト１１におけるデータフォーマットの定義は、言語依存性があり、様々な言語による定義が可能であるが、本実施形態では日本語による定義のみを一例として取り上げる。
【００１８】
ユーザフレーズ合成辞書１３及びデフォルト合成辞書１９は、実際の声を発音文字単位で（例えば「あ」，「い」など）サンプリング及び分析することで８組のフォルマント周波数、フォルマントレベル及びピッチをパラメータとして割り出し、予めフォルマントフレームデータとしてそれらパラメータを発音文字単位で保持しているデータベースである。ユーザフレーズ合成辞書１３は、ミドルウェア外に構築されたデータベースであり、かかるデータベースをユーザが任意に作成することができ、保持内容についてはミドルウェアＡＰＩ１５を介してデフォルト合成辞書１９の保持内容と丸ごと入れ替えることができる。すなわち、デフォルト合成辞書１９の内容を丸ごとユーザフレーズ合成辞書１３の内容に置き換えることができる。一方、デフォルト合成辞書１９は、ミドルウェア内に構築されたデータベースである。
【００１９】
ユーザフレーズ合成辞書１３及びデフォルト合成辞書１９としては、それぞれ男声用と女声用との２種類を持つのが好ましい。また、ユーザフレーズ合成辞書１３及びデフォルト合成辞書１９が保持するフレームデータの間隔により、音声再生装置１の出力音声の品質が変化するが、例えばフレームデータの間隔を２０ｍｓとする。
【００２０】
ユーザ音色パラメータ１２及びデフォルト音色パラメータ１８は、音声再生装置１の出力音声における声質を制御するパラメータ群である。そして、ユーザ音色パラメータ１２及びデフォルト音色パラメータ１８は、例えば８組のフォルマント周波数及びフォルマントレベルの変更（ユーザフレーズ合成辞書１３及びデフォルト合成辞書１９に登録されているフォルマント周波数、フォルマントレベルからの変化量の指定）、並びに、フォルマント合成のための基本波形の指定をすることができ、様々な音色を作り出すことができる。
【００２１】
デフォルト音色パラメータ１８は、予めミドルウェア内にデフォルトで保持されている音色パラメータセットである。ユーザ音色パラメータ１２は、ユーザが任意に作成することができるパラメータであって、ミドルウェアの外側に保持されているものであり、ミドルウェアＡＰＩ１５を介してデフォルト音色パラメータ１８を拡張するものである。
【００２２】
アプリケーション１４は、スクリプト１１を再生するためのソフトウェアである。
ミドルウェアＡＰＩ（Application Program Interface）１５は、ソフトウェアからなるアプリケーション１４と、ミドルウェアからなるコンバータ１６、ドライバ１７、デフォルト音色パラメータ１８及びデフォルト合成辞書１９とのインターフェースとなるものである。
【００２３】
コンバータ１６は、スクリプト１１を解釈し、ドライバ１７を用いて最終的にフレームデータが連続して構成されるフォルマントフレーム列のデータへ変換するものである。
ドライバ１７は、スクリプト１１に含まれる発音文字とデフォルト合成辞書１９とに基づいてフォルマントフレーム列を生成し、音色パラメータを解釈しフォルマントフレーム列を加工するものである。
音源２０はコンバータ１６から出力されたデータに対応した音信号を出力するものであり、その音信号がスピーカに出力されて音となる。
【００２４】
次に、本実施形態に係る音声再生装置１の特徴について詳細に説明する。
まず、ユーザ音色パラメータ１２では、任意の発音単位に対して、ユーザフレーズ合成辞書１３が保持するフレーズＩＤを割り付けるというパラメータがある。図２は発音単位毎にフレーズＩＤを割り付けたものの一例を示す図である。すなわち、図２はモーラとフレーズＩＤとの割り付けを示すものである。
なお、モーラとは、拍を意味し、日本語でいえば仮名文字単位である。
【００２５】
発音単位毎にフレーズＩＤを割り付けることにより、ユーザ音色パラメータ１２で指定された発音単位がデフォルト合成辞書１９ではなく、ユーザフレーズ合成辞書１３を用いることを規定する。また、ユーザ音色パラメータ１２は、１つの音色パラメータ中に指定できる発音単位数が任意であるとするのが好ましい。上記のように、ユーザ音色パラメータ１２において、発音単位毎にフレーズＩＤを割り付ける構成は本実施形態の一例であり、発音単位に置き換えることができるものであればその手法は問わない。
【００２６】
次いで、ユーザフレーズ合成辞書１３の詳細について説明する。図３はユーザフレーズ合成辞書１３の内容例を示す図である。ユーザフレーズ合成辞書１３では、フレーズＩＤ毎に、８組のフォルマント周波数、フォルマントレベル及びピッチからなるフレームデータを格納している。図３における「フレーズ」とは、例えば「おはよう」など一つのまとまりを持った句である。そして、「フレーズ」は、単語、音節、文章など、特にまとまりは規定せず、任意の一塊を意味する。
【００２７】
ユーザフレーズ合成辞書１３を製作するツールは、通常のサウンドファイル（＊．ｗａｖ，＊ａｉｆなど）から、分析して８組のフォルマント周波数、フォルマントレベル及びピッチからなるフレームデータを生成する分析エンジンを搭載する必要がある。
【００２８】
スクリプト１１には、声質変更のイベントが用意されているが、このイベントにより、ユーザ音色パラメータ１２を指定することができる。
【００２９】
例えば、スクリプト１１の記述としては、「ＴＪＫ１２みなさんＸ１０あか」とする。
この例では、「Ｋ」がデフォルト音色パラメータ１８を指定するイベントであり、「Ｘ」がユーザ音色パラメータ１２を指定するイベントである。また、「Ｘ１０」が図２に示すユーザ音色パラメータを指定するものとする。
【００３０】
そして、この場合、再生音声は「みなさんこんにちは鈴木です」となる。
「みなさん」はデフォルト音色パラメータ１８及びデフォルト合成辞書１９を用いた音声となり、また、「こんにちは」と「鈴木です」はユーザ音色パラメータ１２及びユーザフレーズ合成辞書１３を用いた音声となる。すなわち、「みなさん」は「み」と「な」と「さ」と「ん」のそれぞれのフォルマントフレームデータをデフォルト合成辞書１９から読み出して合成した音声となり、「こんにちは」と「鈴木です」はそれぞれのフレーズ単位のフォルマントフレームデータをユーザフレーズ合成辞書１３から読み出して合成した音声となる。
【００３１】
上記例では、「あ」、「い」、「か」を使ったが、テキストで表記できる文字及び記号であれば何でもよい。また、上記例では、「Ｘ１０」以降、「あ」は「こんにちは」、「か」は「鈴木です」と発音されるので、次に本来の「あ」を発音させたい時はデフォルト合成辞書に戻す記号（例えばＸ○○）を入れればよい。
【００３２】
次に、本実施形態に係る音声再生装置１で用いられる音楽再生シーケンスデータ（ＳＭＡＦ：Synthetic music Mobile Application Format）のデータ交換フォーマットについて、図４を参照して説明する。図４は、本実施形態に係るＳＭＡＦファイルのフォーマットを示す説明図である。ＳＭＡＦは、音源を用いて音楽を表現するためのデータを配布したり相互に利用したりするためのデータ交換フォーマットの一つであり、携帯端末などにおいてマルチメディアコンテンツを表現するためのデータフォーマット仕様である。
【００３３】
図４に示すデータ交換フォーマットのＳＭＡＦファイル３０は、チャンク（Ｃｈｕｎｋ）と呼ばれるデータの塊が基本構造となっている。チャンクは、固定長（８バイト）のヘッダ部と任意長のボディ部とからなる。ヘッダ部は、４バイトのチャンクＩＤと４バイトのチャンクサイズに分けられる。チャンクＩＤはチャンクの識別子に用い、チャンクサイズはボディ部の長さを示している。ＳＭＡＦファイル３０は、それ自体及びそれに含まれる各種データも全てチャンク構造となっている。
【００３４】
図４に示すようにＳＭＡＦファイル３０は、コンテンツ・インフォ・チャンク（Contents Info Chunk）３１と、オプショナル・データ・チャンク（Optional Data Chunk）３２と、トラック・チャンク(Score Track Chunk)３３と、ＨＶチャンク(ＨＶ Chunk)３６とからなる。
【００３５】
コンテンツ・インフォ・チャンク３１には、ＳＭＡＦファイル３０についての各種管理用情報が格納されており、例えばコンテンツのクラス、種類、著作権情報、ジャンル名、曲名、アーティスト名、作詞／作曲者名などが格納されている。オプショナル・データ・チャンク３２には、例えば著作権情報、ジャンル名、曲名、アーティスト名、作詞／作曲者名などの情報が格納されている。なお、ＳＭＡＦファイル３０においてオプショナル・データ・チャンク３２は設けなくてもよい。
【００３６】
トラック・チャンク３３は、音源へ送り込む楽曲のシーケンス・トラックを格納するチャンクであり、セットアップ・データ・チャンク（Setup Data Chunk(オプション)）３４及びシーケンス・データ・チャンク（Sequence Data Chunk）３５を含んでいる。
【００３７】
セットアップ・データ・チャンク３４は、音源部分の音色データなどを格納するチャンクであり、イクスクルーシブ・メッセージの並びを格納する。イクスクルーシブ・メッセージは、例えば音色パラメータ登録メッセージである。
【００３８】
シーケンス・データ・チャンク３５は、実演奏データを格納するものであり、スクリプト１１の再生タイミングを決めるＨＶ（Human Voice：音声）ノートオンとその他のシーケンス・イベントとを混在させて格納している。ここで、ＨＶとそれ以外の楽曲のイベントとは、ＨＶのチャネル指定により区別される。
【００３９】
ＨＶチャンク３６は、ＨＶセットアップ・データ・チャンク（HV Setup Data Chunk(オプション)）３７と、ＨＶユーザ・フレーズ・辞書チャンク（HV User Phrase Dictionary Chunk(オプション)）３８と、ＨＶ-Ｓチャンク３９とを含んでいる。
【００４０】
ＨＶセットアップ・データ・チャンク３７には、ＨＶユーザ音色パラメータや、ＨＶのチャネルを指定するためのメッセージが格納されている。また、ＨＶ-Ｓチャンク３９には、ＨＶ-スクリプトデータが格納されている。
【００４１】
ＨＶユーザ・フレーズ・辞書チャンク３８には、ユーザフレーズ合成辞書１３の内容が格納されている。また、ＨＶセットアップ・データ・チャンク３７に格納されるＨＶユーザ音色パラメータには、図２に示すモーラとフレーズＩＤを割り付けるパラメータが必要である。
【００４２】
これらの図４に示すＳＭＡＦファイル３０を上記音声再生装置１に適用することにより、楽曲と同期して音声（ＨＶ）を再生することができるとともに、ユーザフレーズ合成辞書１３の内容についても再生することが可能となる。
【００４３】
次に、図１におけるユーザフレーズ合成辞書１３及び図４に示すＳＭＡＦファイル３０を作成するためのツールであるＨＶオーサリングツールについて、図５を参照して説明する。図５はＨＶオーサリングツールの一例を示す機能イメージ図である。
【００４４】
ＨＶオーサリングツール４２は、ＳＭＡＦファイル３０を作成する場合、予めＭＩＤＩシーケンサによって作成されたＳＭＦ（Standard MIDI File）ファイル４１（ＨＶの発音タイミングを決めるノートオンを含む）を読み込み、ＨＶスクリプトＵＩ４４及びＨＶボイスエディタ４５から得られた情報を元にＳＭＡＦファイル４３（ＳＭＡＦファイル３０に相当）への変換処理を行う。
【００４５】
ＨＶボイスエディタ４５は、ＨＶユーザ音色ファイル４８に含まれるＨＶユーザ音色パラメータ（ユーザ音色パラメータ１２に相当）を編集することができるエディタである。このＨＶボイスエディタ４５は、各種のＨＶ音色パラメータの編集に加え、任意のモーラに対してユーザフレーズを割り付けることができる。
【００４６】
ＨＶボイスエディタ４５のインターフェースとしては、モーラを選択するメニューと、そのモーラに対して任意のサウンドファイル５０を割り付ける機能を持つ。ＨＶボイスエディタ４５のインターフェースによって割り付けられたサウンドファイル５０は、波形分析器４６により分析され、８組のフォルマント周波数、フォルマントレベル及びピッチのフレームデータを生成する。これらのフレームデータは、個別ファイル（ＨＶユーザ音色ファイル４８、ＨＶユーザ合成辞書ファイル４９）として入出力することができる。
【００４７】
ＨＶスクリプトＵＩ４４は、ＨＶスクリプトを直接編集することができる。このＨＶスクリプトも、個別ファイル（ＨＶスクリプトファイル４７）として入出力することができる。また、本実施形態に係るＨＶオーサリングツール４０は、上記ＨＶオーサリングツール４２と、ＨＶスクリプトＵＩ４４と、ＨＶボイスエディタ４５と、波形分析器４６とからなるものとしてもよい。
【００４８】
次に、上記音声再生装置１を携帯通信端末に適用した例について、図６を参照して説明する。図６は、音声再生装置１を備える携帯通信端末６０の構成例を示すブロック図である。
【００４９】
携帯通信端末６０は、例えば、携帯電話などからなり、ＣＰＵ６１、ＲＯＭ６２、ＲＡＭ６３、表示部６４、バイブレータ６５、入力部６６、通信部６７、アンテナ６８、音声処理部６９、音源７０、スピーカ７１及びバス７２を備えている。ＣＰＵ６１は、携帯通信端末６０全体の制御を行う。ＲＯＭ６２は、各種通信制御プログラム及び楽曲再生のためのプログラムなどの制御プログラム、並びに、各種定数データなどを格納している。
【００５０】
ＲＡＭ６３は、ワークエリアとして使用されるとともに、楽曲ファイル及び各種アプリケーションプログラムなどを記憶する。表示部６４は、液晶表示装置（ＬＣＤ）などからなる。バイブレータ６５は着信などがあったときに振動する。入力部６６は、複数の釦などからなる。通信部６７は、変復調部などからなり、アンテナ６８に接続されている。
【００５１】
音声処理部６９は、送話マイク及び受話スピーカに接続されており、通話のために音声信号について符号化及び復号化を行う機能を有する。音源７０は、ＲＡＭ６３などに記憶された楽曲ファイルに基づいて楽曲を再生するとともに、音声を再生して、スピーカ７１に出力する。バス７２は、ＣＰＵ６１、ＲＯＭ６２、ＲＡＭ６３、表示部６４、バイブレータ６５、入力部６６、通信部６７、音声処理部６９及び音源７０の各構成要素間でデータ転送を行うための伝送路である。
【００５２】
さらに、通信部６７は、ＨＶ−スクリプトファイル又は図４に示すＳＭＡＦファイル３０をコンテンツサーバなどからダウンロードしてＲＡＭ６３へ記憶させることができる。そして、ＲＯＭ６２には図１に示す音声再生装置１のアプリケーション１４及びミドルウェアのプログラムも記憶されている。そのアプリケーション１４及びミドルウェアのプログラムはＣＰＵ６１によって読み出され起動される。また、ＣＰＵ６１は、ＲＡＭ６３で記憶されているＨＶ−スクリプトを解釈してフォルマントフレームデータを生成し、そのフォルマントフレームデータを音源７０へ送る。
【００５３】
（動作）
次に、上記音声再生装置１の動作について説明する。先ず、ユーザフレーズ合成辞書１３の制作方法について説明する。図７は、ユーザフレーズ合成辞書１３の制作方法を示すフローチャートである。
【００５４】
先ず、図５に示すＨＶオーサリングツール４２を用いて、ユーザフレーズ合成辞書１３を使用するＨＶ音色を選択し、ＨＶボイスエディタ４５を起動させる（ステップＳ１）。
次いで、ＨＶボイスエディタ４５を用いて、当てはめたいモーラを選択し、サウンドファイルを貼り付ける。すると、ＨＶボイスエディタ４５は、ユーザフレーズ辞書（ＨＶユーザ合成辞書ファイル４９に相当）を出力する（ステップＳ２）。
【００５５】
次いで、ＨＶボイスエディタ４５を用いて、ＨＶ音色パラメータを編集する。すると、ＨＶボイスエディタ４５は、ユーザ音色パラメータ（ＨＶユーザ音色ファイル４８に相当）を出力する（ステップＳ３）。
【００５６】
次いで、ＨＶスクリプトＵＩ４４を用いて、ＨＶ−スクリプトに、該当するＨＶ音色を指定する声質変更イベントを記述し、再生したいモーラを記述する。すると、ＨＶスクリプトＵＩ４４は、ＨＶ−スクリプト（ＨＶスクリプトファイル４７に相当）を出力する（ステップＳ４）。
【００５７】
次に、音声再生装置１におけるユーザフレーズ辞書の再生動作について、図８を参照して説明する。図８は、音声再生装置１におけるユーザフレーズ合成辞書の再生動作を示すフローチャートである。
先ず、ユーザ音色パラメータ１２及びユーザフレーズ合成辞書１３を、音声再生装置１のミドルウェアに登録する。そして、スクリプト１１を音声再生装置１のミドルウェアに登録し、再生を開始する（ステップＳ１１，Ｓ１２）。
【００５８】
その再生においては、スクリプト１１中に、ユーザ音色パラメータ１２を指定する声質変更イベント（Ｘイベント）があるか監視する（ステップＳ１３）。
ステップＳ１３で声質変更イベントを見つけた場合、そのユーザ音色パラメータ１２からモーラに割り付けられているフレーズＩＤを探し、フレーズＩＤに対応するデータをユーザフレーズ合成辞書１３から読み取り、ＨＶドライバが管理するデフォルト合成辞書１９のデータのうち、該当するモーラの辞書データをユーザフレーズ合成辞書１３のデータに置き換える（ステップＳ１４）。
ステップＳ１４の置き換え処理は、再生前に事前に行ってもよい。
【００５９】
ステップＳ１４が終了した場合、及び、ステップＳ１３で声質変更イベントが見つからなかった場合は、コンバータ１６がスクリプト１１（ステップＳ１４が行われた場合は該ステップＳ１４の置き換え処理後のスクリプト）のモーラを解釈し、ＨＶドライバを用いて最終的にフォルマントフレーム列のデータへコンバートする（ステップＳ１５）。
次いで、ステップＳ１５でコンバートされたデータを音源２０により再生する（ステップＳ１６）。
【００６０】
次いで、スクリプト１１が終了か否か判断し（ステップＳ１７）、終了していない場合は上記ステップＳ１３に戻り、終了した場合はユーザフレーズ辞書の再生動作を終了する。
【００６１】
次に、図４に示すＳＭＡＦファイル３０の制作方法について、図９を参照して説明する。図９は、ＳＭＡＦファイル３０の制作方法を示すフローチャートである。
先ず、図７に示す手順によりユーザフレーズ合成辞書１３、ユーザ音色パラメータ１２及びスクリプト１１を制作する（ステップＳ２１）。
【００６２】
次いで、楽曲データ及びＨＶスクリプトの発音を制御するイベントを含んだＳＭＦファイル４１を制作する（ステップＳ２２）。
次いで、図５に示すＨＶオーサリングツール４２へＳＭＦファイル４１を入力し、ＨＶオーサリングツール４２によりＳＭＦファイル４１をＳＭＡＦファイル４３（ＳＭＡＦファイル３０に相当）に変換する（ステップＳ２３）。
【００６３】
そして、ステップＳ２１で作られたユーザ音色パラメータ１２が図４に示すＳＭＡＦファイル３０のＨＶチャンク３６のＨＶセットアップ・データ・チャンク３７へ入れられ、ステップＳ２１で作られたユーザフレーズ合成辞書１３が同ＳＭＡＦファイル３０のＨＶチャンク３６のＨＶユーザ・フレーズ・辞書チャンク３８へ入れられ、ＳＭＡＦファイル３０として出力される（ステップＳ２４）。
【００６４】
次に、ＳＭＡＦファイル３０の再生方法について図１０を参照して説明する。図１０は、ＳＭＡＦファイル３０の再生方法を示すフローチャートである。
先ず、ＳＭＡＦファイル３０を図１に示す音声再生装置１のミドルウェアに登録する（ステップＳ３１）。
ここで、音声再生装置１は、通常、ＳＭＡＦファイル３０内の楽曲データの部分をミドルウェアの楽曲再生部に登録し、再生準備を行う。
【００６５】
次いで、音声再生装置１は、ＳＭＡＦファイル３０にＨＶチャンク３６があるか否か判断する（ステップＳ３２）。
ステップＳ３２でＨＶチャンク３６があった場合、音声再生装置１はＨＶチャンク３６の内容を解釈する（ステップＳ３３）。
次いで、音声再生装置１は、ユーザ音色パラメータの登録、ユーザフレーズ合成辞書の登録及びスクリプトの登録をする（ステップＳ３４）。
【００６６】
ステップＳ３２でＨＶチャンク３６がなかった場合、もしくはステップ３４における登録が終了した場合、音声再生装置１は楽曲部のチャンクを解釈する（ステップＳ３５）。
次いで、音声再生装置１は、「スタート」信号に対応してシーケンス・データ・チャンク３５内のシーケンスデータ（実演奏データ）の解釈をスタートさせることにより、楽曲再生を行う（ステップＳ３６）。
【００６７】
この再生において、音声再生装置１はシーケンスデータにおけるイベントを順次解釈する過程において、そのイベントがＨＶノートオンであるか否か判断する（ステップＳ３７）。
ステップＳ３７において、ＨＶノートオンであった場合、音声再生装置１はそのＨＶノートオンで指定されているＨＶチャンクのＨＶスクリプトデータの再生を開始する（ステップＳ３８）。
【００６８】
このステップＳ３８の後、音声再生装置１は図８に示すユーザフレーズ辞書の再生動作を行う。
すなわち、音声再生装置１はステップＳ３８の再生において、ユーザ音色パラメータ１２を指定する声質変更イベント（Ｘイベント）があるか監視する（ステップＳ３９）。
【００６９】
ステップＳ３９で声質変更イベントを見つけた場合、そのユーザ音色パラメータ１２からモーラに割り付けられているフレーズＩＤを探し、フレーズＩＤに対応するデータをユーザフレーズ合成辞書１３から読み取り、ＨＶドライバが管理するデフォルト合成辞書１９のデータのうち、該当するモーラの辞書データをユーザフレーズ辞書データに置き換える（ステップＳ４０）。
ステップＳ４０の置き換え処理は、再生前に事前に行ってもよい。
【００７０】
ステップＳ４０が終了した場合、及び、ステップＳ３９で声質変更イベントが見つからなかった場合は、コンバータ１６がスクリプトのモーラを解釈し、ＨＶドライバを用いて最終的にフォルマントフレーム列のデータへコンバートする（ステップＳ４１）。
【００７１】
次いで、音声再生装置１は、ステップＳ４１でコンバートされたデータを音源２０のＨＶ部にて再生する（ステップＳ４２）。
次いで、音声再生装置１は、楽曲が終了したか否か判断し（ステップＳ４３）、楽曲が終了した場合はＳＭＡＦファイル３０の再生を終了させ、楽曲が終了していない場合はステップＳ３７に戻る。
【００７２】
ステップＳ３７において、イベントがＨＶノートオンでなかった場合、音声再生装置１はそのイベントを楽曲データとして、音源再生イベントデータにコンバートする（ステップＳ４４）。
次いで、音声再生装置１は、ステップＳ４４でコンバートされたデータを音源２０の楽曲部にて再生する（ステップＳ４５）。
【００７３】
これらにより、本実施形態によれば、ＦＭ音源のリソースを用いてフォルマント合成により再生する方法において、以下の３つの利点がある。
第１に、本実施形態によれば、ユーザが好みのフレーズを割り付けることができる。これにより、固定辞書に依存することなく、好みの声色により近づけた再生をすることができる。
第２に、本実施形態によれば、デフォルト合成辞書１９の一部をユーザフレーズ合成辞書１３で置き換えるため、音声再生装置１においてデータ容量が過大に増加することを回避することができる。また、デフォルト合成辞書１９の一部を任意のフレーズに置き換えることもできるため、フレーズ単位の発音をすることができ、従来の発音単位の合成音声で生じる各発音のつなぎ目での違和感をなくすことができる。
第３に、本実施形態によれば、ＨＶスクリプトにおいて任意のフレーズ指定をすることができるので、モーラ単位の合成とフレーズ単位の発音を併用することができる。
【００７４】
さらに、本実施形態によれば、フレーズを予めサンプリングして構成した波形データを再生する方法に比べて、フォルマントレベルで声色変化させることができる。そして、本実施形態によれば、データサイズ及び品質はフレームレートによるが、サンプリング波形データに比べてはるかに少ないデータ容量で高品質な再生をすることができる。したがって、例えば、本実施形態の音声再生装置１を携帯電話などの携帯通信端末に組み込むことが容易に実行でき、電子メールの内容などを高品質な音声で再生することもできる。
【００７５】
以上、本発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、本発明の要旨を逸脱しない範囲の設計変更等も含まれる。
【００７６】
【発明の効果】
以上説明したように、本発明によれば、合成辞書に発音単位で保持されているデータを任意のユーザデータに置き換えることができるので、所望のフレーズを品質のよい音声で再生することができる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る音声再生装置を示すブロック図である。
【図２】発音単位毎にフレーズＩＤを割り付けた例を示す図である。
【図３】ユーザフレーズ合成辞書の内容例を示す図である。
【図４】ＳＭＡＦファイルのフォーマットを示す図である。
【図５】ＨＶオーサリングツールの一例を示す機能イメージ図である。
【図６】本実施形態の音声再生装置を備える携帯通信端末の一例を示すブロック図である。
【図７】ユーザフレーズ合成辞書の制作方法のフローチャートである。
【図８】ユーザフレーズ合成辞書の再生動作のフローチャートである。
【図９】ＳＭＡＦファイルの制作方法を示すフローチャートである。
【図１０】ＳＭＡＦファイル３０の再生方法のフローチャートである。
【符号の説明】
１…音声再生装置、１１…スクリプト、１２…ユーザ音色パラメータ、１３…ユーザフレーズ合成辞書（可変長）、１４…アプリケーション、１５…ミドルウェアＡＰＩ、１６…コンバータ、１７…ドライバ、１８…デフォルト音色パラメータ、１９…デフォルト合成辞書、２０…音源、３０…ＳＭＡＦファイル、３１…コンテンツ・インフォ・チャンク、３２…オプショナル・データ・チャンク、３３…トラック・チャンク、３４…セットアップ・データ・チャンク、３５…シーケンス・データ・チャンク、３６…ＨＶチャンク、３７…ＨＶセットアップ・データ・チャンク、３８…ＨＶユーザ・フレーズ・辞書チャンク、３９…ＨＶ-Ｓチャンク、４１…ＳＭＦファイル、４２…ＨＶオーサリングツール、４３…ＳＭＡＦファイル、４４…ＨＶスクリプトＵＩ、４５…ＨＶボイスエディタ、４６…波形分析器、４７…ＨＶスクリプトファイル、４８…ＨＶユーザ音色ファイル、４９…ＨＶユーザ合成辞書ファイル、５０…サウンドファイル
[0001]
[Technical Field of the Invention]
The present invention relates to an audio playback device, and in particular to an audio playback device that can reproduce a specific phrase (for example, "konnichiwa", i.e. "hello") with high quality by speech synthesis.
[0002]
[Prior art]
Conventionally, character-string-to-speech conversion devices have been devised that convert character string information such as e-mail into speech and output it. One such conventional device divides the character string information into clause units and displays the content at the same time as outputting it as speech (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP 2001-7937 A
[0004]
[Problems to be solved by the invention]
However, although the conventional character-string-to-speech conversion device divides the character string information into clause units and outputs it as speech, that speech output is a collection of sounds in pronunciation units (or character units), so the reproduction (speech output) sounds unnatural at the joints between pronunciation units. That is, the conventional device has the problem that it cannot output an entire clause as good-quality speech with a varied voice tone, in other words as natural speech (for example, "konnichiwa").
[0005]
To solve this problem, one conceivable method is to sample the speech for a clause (or phrase) in advance, hold it as audio data, and output it as a speech waveform at playback. With this method, however, the sampling frequency must be raised to improve the quality of the speech output, so a large volume of audio data has to be held, which is a major disadvantage in mobile phones and the like.
[0006]
The present invention has been made to solve the above problems, and provides an audio playback device that can reproduce (output) a desired phrase composed of character string information or the like (for example, "konnichiwa") as good-quality speech with a varied voice tone.
[0007]
[Means for Solving the Problems]
In order to solve the above problems, the present invention has the following configuration.
That is, the present invention is an audio playback device that has a synthesis dictionary, which is a database holding in advance formant frame data corresponding to pronunciation units, and that synthesizes speech using the synthesis dictionary when given information in which pronunciation units are listed, the device comprising: replacement means for replacing the formant frame data of a pronunciation unit held in the synthesis dictionary with arbitrary user data; and speech synthesis means for synthesizing speech, when the information listing pronunciation units is given, using the synthesis dictionary whose held data has been replaced by the replacement means.
[0008]
Further, in the present invention, the user data is formant frame data acquired in units of phrases.
[0009]
Further, in the present invention, the user data is attached to a tone color parameter that processes the formant frame data held in the synthesis dictionary.
[0010]
Further, in the present invention, when a tone color parameter to which the user data is attached has been given and that tone color parameter is designated at playback, the replacement means replaces the formant frame data held in the synthesis dictionary with the user data, and the speech synthesis means, when given the information listing speech units, synthesizes speech using the synthesis dictionary as replaced by way of the tone color parameter.
[0011]
Further, in the present invention, the user data is included in a data exchange format that defines an information structure for reproducing desired data with music and voice synchronized, and speech is synthesized using the data exchange format.
[0012]
Further, in the present invention, music part information included in information structured in the data exchange format is reproduced as is, and voice information included in that information is reproduced using the replacement means and the speech synthesis means.
[0013]
Further, in the present invention, the data exchange format has, as a constituent element, information in which the user data is attached to a voice parameter.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing a configuration of an audio reproducing apparatus according to an embodiment of the present invention. First, the basic configuration of the audio reproduction device according to the present embodiment will be described.
[0015]
The audio playback device 1 includes an application 14, a middleware API 15, a converter 16, a driver 17, a default tone color parameter 18, a default synthesis dictionary 19, and a sound source 20, and is configured to reproduce speech when a script 11, a user tone color parameter 12, and a user phrase synthesis dictionary (variable length) 13 are input.
[0016]
The audio playback device 1 is based on a technique of reproducing speech by formant synthesis, using the CSM (composite sinusoidal model) speech synthesis method with the resources of an FM sound source. In this embodiment, a user phrase synthesis dictionary 13 is defined, and the audio playback device 1 assigns user phrases to a tone color parameter on a per-phoneme basis. At playback, when data of the user phrase synthesis dictionary 13 is assigned to the tone color parameter, the audio playback device 1 replaces the corresponding phonemes of the default synthesis dictionary 19 with the user phrases and performs speech synthesis using the replaced data. Here, a "phoneme" is the smallest unit of pronunciation; in Japanese there are two kinds, vowels and consonants. Next, the audio playback device 1 is described in detail.
[0017]
The script 11 defines a data format for reproducing "HV" (Human Voice: speech synthesized by the above technique). That is, the script 11 is a format for speech synthesis made up of a synthesis character string including prosodic symbols, settings for the sounds to be produced, messages for the playback application, and so on; to make input easy for the user it takes the form of, for example, text input. The data format definition of the script 11 is language dependent and can be defined for various languages, but this embodiment takes up only the Japanese definition as an example.
[0018]
The user phrase synthesis dictionary 13 and the default synthesis dictionary 19 are databases built by sampling and analyzing a real voice in units of pronunciation characters (for example, "a" and "i") to derive eight sets of formant frequencies, formant levels, and pitch as parameters; they hold those parameters in advance, per pronunciation character, as formant frame data. The user phrase synthesis dictionary 13 is a database constructed outside the middleware. The user can create such a database freely, and its held contents can be swapped wholesale for the held contents of the default synthesis dictionary 19 via the middleware API 15. That is, the entire contents of the default synthesis dictionary 19 can be replaced with the contents of the user phrase synthesis dictionary 13. The default synthesis dictionary 19, on the other hand, is a database constructed inside the middleware.
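The dictionary layout described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class and function names (`FormantFrame`, `make_frame`, `replace_whole_dictionary`), the numeric values, and the units are hypothetical, and the text does not say how the parameters are actually encoded.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

NUM_FORMANTS = 8  # the text specifies eight sets per frame


@dataclass(frozen=True)
class FormantFrame:
    """One frame: eight (frequency, level) pairs plus a pitch value."""
    formants: Tuple[Tuple[float, float], ...]  # (frequency, level); units are assumed
    pitch: float

    def __post_init__(self) -> None:
        if len(self.formants) != NUM_FORMANTS:
            raise ValueError(f"expected {NUM_FORMANTS} formants, got {len(self.formants)}")


# A synthesis dictionary maps a pronunciation character to its frame sequence.
SynthesisDictionary = Dict[str, List[FormantFrame]]


def make_frame(base_hz: float, pitch: float) -> FormantFrame:
    """Build a placeholder frame (illustrative values only, not real speech data)."""
    return FormantFrame(tuple((base_hz * (i + 1), 1.0) for i in range(NUM_FORMANTS)), pitch)


default_dictionary: SynthesisDictionary = {
    "あ": [make_frame(700.0, 120.0)] * 3,
    "い": [make_frame(300.0, 120.0)] * 3,
}


def replace_whole_dictionary(default: SynthesisDictionary,
                             user: SynthesisDictionary) -> SynthesisDictionary:
    """Model the wholesale swap: the result holds only the user's contents."""
    return dict(user)
```

Making the frame immutable is a design choice of this sketch; it keeps a dictionary swap from accidentally mutating frames that another dictionary still refers to.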
[0019]
The user phrase synthesis dictionary 13 and the default synthesis dictionary 19 each preferably come in two kinds, one for male voices and one for female voices. The quality of the output speech of the audio playback device 1 varies with the interval of the frame data held in the user phrase synthesis dictionary 13 and the default synthesis dictionary 19; the frame data interval is set to, for example, 20 ms.
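To give a feel for what a 20 ms frame interval implies, the following back-of-the-envelope calculation compares the data rate of formant frames with that of sampled audio. Only the 20 ms interval and the eight-formant frame layout come from the text; the bytes per parameter and the PCM sampling settings are assumptions chosen purely for illustration.

```python
FRAME_INTERVAL_MS = 20            # from the text
VALUES_PER_FRAME = 8 * 2 + 1      # 8 formant frequencies + 8 formant levels + 1 pitch
BYTES_PER_VALUE = 2               # assumption: 16 bits per stored parameter

frames_per_second = 1000 // FRAME_INTERVAL_MS
formant_bytes_per_second = frames_per_second * VALUES_PER_FRAME * BYTES_PER_VALUE

PCM_SAMPLE_RATE_HZ = 8000         # assumption: telephone-quality sampling
PCM_BYTES_PER_SAMPLE = 2          # assumption: 16-bit samples
pcm_bytes_per_second = PCM_SAMPLE_RATE_HZ * PCM_BYTES_PER_SAMPLE

print(frames_per_second, formant_bytes_per_second, pcm_bytes_per_second)
# prints: 50 1700 16000
```

Under these assumed sizes, the formant representation needs roughly a tenth of the bytes per second of the sampled waveform; a shorter frame interval would raise both the quality and the data rate.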
[0020]
The user tone color parameter 12 and the default tone color parameter 18 are groups of parameters that control the voice quality of the output speech of the audio playback device 1. With them it is possible, for example, to change the eight sets of formant frequencies and formant levels (by specifying the amount of change from the formant frequencies and formant levels registered in the user phrase synthesis dictionary 13 and the default synthesis dictionary 19) and to specify the basic waveform used for formant synthesis, so that a variety of tone colors can be created.
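A tone color parameter of this kind could be modeled as a set of per-formant offsets applied to each dictionary frame. The sketch below assumes the "amount of change" is additive, which the text does not state; the names `Frame`, `ToneColorParameter`, and `apply_tone_color`, and the `"sine"` waveform label, are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NUM_FORMANTS = 8


@dataclass
class Frame:
    frequencies: List[float]   # eight formant frequencies
    levels: List[float]        # eight formant levels
    pitch: float


@dataclass
class ToneColorParameter:
    frequency_deltas: List[float]   # change from the registered frequencies
    level_deltas: List[float]       # change from the registered levels
    basic_waveform: str = "sine"    # hypothetical label for the basic waveform


def apply_tone_color(frame: Frame, tone: ToneColorParameter) -> Frame:
    """Return a new frame with the offsets applied (additive model is an assumption)."""
    for values in (frame.frequencies, frame.levels,
                   tone.frequency_deltas, tone.level_deltas):
        if len(values) != NUM_FORMANTS:
            raise ValueError("every per-formant list must have eight entries")
    return Frame(
        frequencies=[f + d for f, d in zip(frame.frequencies, tone.frequency_deltas)],
        levels=[lv + d for lv, d in zip(frame.levels, tone.level_deltas)],
        pitch=frame.pitch,
    )


base = Frame([100.0 * (i + 1) for i in range(NUM_FORMANTS)], [1.0] * NUM_FORMANTS, 120.0)
brighter = ToneColorParameter([50.0] * NUM_FORMANTS, [0.0] * NUM_FORMANTS)
shifted = apply_tone_color(base, brighter)
```

The dictionary frame itself is left untouched, so the same registered data can be rendered with several tone colors.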
[0021]
The default tone color parameter 18 is a tone color parameter set held by default in the middleware in advance. The user tone color parameter 12 is a parameter that the user can create freely; it is held outside the middleware and extends the default tone color parameter 18 via the middleware API 15.
[0022]
The application 14 is software for reproducing the script 11.
The middleware API (Application Program Interface) 15 serves as the interface between the application 14, which is software, and the middleware made up of the converter 16, the driver 17, the default tone color parameter 18, and the default synthesis dictionary 19.
[0023]
The converter 16 interprets the script 11 and, using the driver 17, ultimately converts it into data of a formant frame sequence made up of consecutive frame data.
The driver 17 generates a formant frame sequence based on the pronunciation characters included in the script 11 and the default synthesis dictionary 19, and interprets the tone color parameters to process the formant frame sequence.
The sound source 20 outputs a sound signal corresponding to the data output from the converter 16, and the sound signal is output to a speaker to become a sound.
[0024]
Next, features of the audio playback device 1 according to the present embodiment will be described in detail.
First, the user tone color parameter 12 includes a parameter that assigns a phrase ID held in the user phrase synthesis dictionary 13 to an arbitrary pronunciation unit. FIG. 2 shows an example in which phrase IDs are assigned to pronunciation units; that is, FIG. 2 shows the assignment between morae and phrase IDs.
A mora is a beat; in Japanese it corresponds to one kana character.
[0025]
Assigning a phrase ID to a pronunciation unit specifies that the pronunciation unit designated in the user tone color parameter 12 uses the user phrase synthesis dictionary 13 rather than the default synthesis dictionary 19. It is preferable that any number of pronunciation units can be designated in a single tone color parameter. Assigning a phrase ID to each pronunciation unit in the user tone color parameter 12, as described above, is only one example of this embodiment; any method may be used as long as it allows a pronunciation unit to be replaced.
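The lookup rule described here, where an assigned pronunciation unit is served from the user phrase synthesis dictionary and everything else from the default synthesis dictionary, might be sketched as follows. The function name `frames_for`, the phrase ID values, and the string placeholders standing in for frame data are all hypothetical.

```python
from typing import Dict, List


def frames_for(unit: str,
               assignment: Dict[str, int],
               user_phrase_dict: Dict[int, List[str]],
               default_dict: Dict[str, List[str]]) -> List[str]:
    """Pick frame data for one pronunciation unit.

    If the tone color parameter assigns a phrase ID to the unit, use the
    user phrase dictionary; otherwise fall back to the default dictionary.
    """
    phrase_id = assignment.get(unit)
    if phrase_id is not None:
        return user_phrase_dict[phrase_id]
    return default_dict[unit]


# Hypothetical stand-in for a mora-to-phrase-ID assignment table.
assignment = {"あ": 1, "か": 2}
user_phrase_dict = {1: ["<frames: phrase 1>"], 2: ["<frames: phrase 2>"]}
default_dict = {"あ": ["<frames: あ>"], "か": ["<frames: か>"], "み": ["<frames: み>"]}
```

For example, `frames_for("あ", assignment, user_phrase_dict, default_dict)` yields the frames of phrase 1, while `frames_for("み", ...)` still comes from the default dictionary.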
[0026]
Next, the user phrase synthesis dictionary 13 is described in detail. FIG. 3 shows an example of the contents of the user phrase synthesis dictionary 13. For each phrase ID, the user phrase synthesis dictionary 13 stores frame data consisting of eight sets of formant frequencies, formant levels, and pitch. A "phrase" in FIG. 3 is an expression that forms a single unit, for example "ohayou" ("good morning"). No particular granularity is prescribed for a "phrase": it may be a word, a syllable, a sentence, or any other chunk.
[0027]
A tool for producing the user phrase synthesis dictionary 13 needs to include an analysis engine that analyzes ordinary sound files (*.wav, *.aif, and so on) and generates frame data consisting of eight sets of formant frequencies, formant levels, and pitch.
[0028]
The script 11 has an event for changing voice quality, and the user tone color parameter 12 can be designated by this event.
[0029]
For example, suppose the script 11 is written as "TJK12みなさんX10あか" (the kana read "mi-na-sa-n" and "a", "ka").
In this example, "K" is the event that designates the default tone color parameter 18, and "X" is the event that designates the user tone color parameter 12. "X10" is taken to designate the user tone color parameter shown in FIG. 2.
[0030]
In this case, the reproduced speech is "みなさんこんにちは鈴木です" ("Minasan, konnichiwa, Suzuki desu": "Hello everyone, this is Suzuki").
"Minasan" is speech produced using the default tone color parameter 18 and the default synthesis dictionary 19, while "konnichiwa" and "Suzuki desu" are speech produced using the user tone color parameter 12 and the user phrase synthesis dictionary 13. In other words, "minasan" is synthesized by reading the formant frame data for each of "mi", "na", "sa", and "n" from the default synthesis dictionary 19, whereas "konnichiwa" and "Suzuki desu" are each synthesized by reading phrase-level formant frame data from the user phrase synthesis dictionary 13.
[0031]
The above example uses "a", "i", and "ka", but any character or symbol that can be written as text may be used. Also, in the above example, from "X10" onward "a" is pronounced as "konnichiwa" and "ka" as "Suzuki desu", so to pronounce the original "a" afterwards it suffices to insert a symbol that switches back to the default synthesis dictionary (for example, X○○).
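The script example can be traced with a toy interpreter. This is a simplified sketch, not the actual script grammar: it assumes that an event is the letter K or X followed by digits, that each hiragana character is one mora, and that the leading "TJ" (whose meaning the text does not explain) can be skipped. The function name `render` and the table `user_tone_colors` are hypothetical, and phrases are represented by their text rather than by formant frames.

```python
import re
from typing import Dict, List

# An event (K or X plus digits) or a single hiragana character (one mora).
TOKEN = re.compile(r"([KX])(\d+)|([\u3041-\u3096])")


def render(script: str, user_tone_colors: Dict[int, Dict[str, str]]) -> List[str]:
    """Return the units that would be spoken, in order."""
    active: Dict[str, str] = {}   # mora -> user phrase for the current tone color
    spoken: List[str] = []
    for match in TOKEN.finditer(script):   # unmatched text such as "TJ" is skipped
        event, number, mora = match.groups()
        if event == "K":
            active = {}                    # default tone color: no substitutions
        elif event == "X":
            # Unknown numbers fall back to no substitutions (an assumption).
            active = user_tone_colors.get(int(number), {})
        else:
            spoken.append(active.get(mora, mora))
    return spoken


# Tone color 10 maps "あ" and "か" to user phrases, as in the example above.
user_tone_colors = {10: {"あ": "こんにちは", "か": "鈴木です"}}
result = render("TJK12みなさんX10あか", user_tone_colors)
print("".join(result))
# prints: みなさんこんにちは鈴木です
```

In this sketch, morae before "X10" pass through unchanged, which corresponds to reading them from the default synthesis dictionary, and morae after it are replaced by whole phrases.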
[0032]
Next, the data exchange format of the music playback sequence data (SMAF: Synthetic music Mobile Application Format) used in the audio playback device 1 according to this embodiment is described with reference to FIG. 4. FIG. 4 is an explanatory diagram showing the format of an SMAF file according to this embodiment. SMAF is one of the data exchange formats for distributing and mutually using data that expresses music using a sound source, and is a data format specification for expressing multimedia content on mobile terminals and the like.
[0033]
The SMAF file 30 in the data exchange format shown in FIG. 4 has a basic structure of data chunks called “chunks”. The chunk is composed of a fixed length (8 bytes) header part and an arbitrary length body part. The header part is divided into a 4-byte chunk ID and a 4-byte chunk size. The chunk ID is used as a chunk identifier, and the chunk size indicates the length of the body part. The SMAF file 30 has a chunk structure for itself and various data included therein.
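The chunk layout described above (an 8-byte header holding a 4-byte chunk ID and a 4-byte chunk size, followed by a body of that size) can be walked with a few lines of code. This is a minimal sketch: the byte order of the size field is not stated in the text, so big-endian is assumed here, and the chunk IDs in the example are made up.

```python
import struct
from typing import Iterator, Tuple

HEADER_SIZE = 8  # 4-byte chunk ID + 4-byte chunk size


def iter_chunks(data: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (chunk_id, body) for each chunk laid out back to back in data."""
    offset = 0
    # Fewer than 8 trailing bytes cannot form a header and are ignored.
    while offset + HEADER_SIZE <= len(data):
        # ">4sI": 4 raw ID bytes, then an unsigned 32-bit size (big-endian assumed).
        chunk_id, size = struct.unpack_from(">4sI", data, offset)
        body_start = offset + HEADER_SIZE
        body_end = body_start + size
        if body_end > len(data):
            raise ValueError("chunk body runs past the end of the data")
        yield chunk_id, data[body_start:body_end]
        offset = body_end


# Two made-up chunks: "AAAA" with a 3-byte body and "BBBB" with an empty body.
example = b"AAAA" + (3).to_bytes(4, "big") + b"xyz" + b"BBBB" + (0).to_bytes(4, "big")
chunks = list(iter_chunks(example))
```

Because the file and its contents share the same chunk structure, the same function could be applied again to a chunk body to walk its nested chunks.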
[0034]
As shown in FIG. 4, the SMAF file 30 consists of a contents info chunk (Contents Info Chunk) 31, an optional data chunk (Optional Data Chunk) 32, a track chunk (Score Track Chunk) 33, and an HV chunk (HV Chunk) 36.
[0035]
The contents info chunk 31 stores various management information about the SMAF file 30, for example the content class, type, copyright information, genre name, song title, artist name, and lyricist/composer names. The optional data chunk 32 stores information such as copyright information, genre name, song title, artist name, and lyricist/composer names. The optional data chunk 32 need not be provided in the SMAF file 30.
[0036]
The track chunk 33 is a chunk that stores the sequence track of the music to be sent to the sound source, and contains a setup data chunk (Setup Data Chunk, optional) 34 and a sequence data chunk (Sequence Data Chunk) 35.
[0037]
The setup data chunk 34 is a chunk for storing timbre data of the sound source portion and the like, and stores a sequence of exclusive messages. The exclusive message is, for example, a tone color parameter registration message.
[0038]
The sequence data chunk 35 stores actual performance data, and stores a mix of HV (Human Voice) note-on that determines the playback timing of the script 11 and other sequence events. Here, HV and other music events are distinguished by HV channel designation.
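Separating HV events from music events by channel, as this paragraph describes, amounts to a simple routing step. In the sketch below the event representation (a `(channel, payload)` tuple), the function name `route_events`, and the channel numbers are hypothetical; the text only states that the distinction is made by the HV channel designation.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (channel, payload); a hypothetical representation


def route_events(events: Iterable[Event], hv_channel: int) -> Iterator[Tuple[str, str]]:
    """Tag each sequence event as "hv" or "music" according to its channel."""
    for channel, payload in events:
        target = "hv" if channel == hv_channel else "music"
        yield target, payload


sequence = [(0, "note-on C4"), (3, "HV note-on"), (1, "note-on E4")]
routed = list(route_events(sequence, hv_channel=3))
```

Events tagged `"hv"` would trigger playback of the script, while the others would go to ordinary music playback.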
[0039]
The HV chunk 36 contains an HV setup data chunk (HV Setup Data Chunk, optional) 37, an HV user phrase dictionary chunk (HV User Phrase Dictionary Chunk, optional) 38, and an HV-S chunk 39.
[0040]
The HV setup data chunk 37 stores HV user tone color parameters and a message for specifying an HV channel. The HV-S chunk 39 stores HV-script data.
[0041]
The HV user phrase dictionary chunk 38 stores the contents of the user phrase synthesis dictionary 13. The HV user tone color parameters stored in the HV setup data chunk 37 must include the parameter that assigns phrase IDs to morae, as shown in FIG. 2.
[0042]
Applying the SMAF file 30 shown in FIG. 4 to the audio playback device 1 makes it possible to reproduce speech (HV) in synchronization with the music and also to reproduce the contents of the user phrase synthesis dictionary 13.
[0043]
Next, an HV authoring tool, which is a tool for creating the user phrase synthesis dictionary 13 in FIG. 1 and the SMAF file 30 shown in FIG. 4, will be described with reference to FIG. FIG. 5 is a functional image diagram showing an example of the HV authoring tool.
[0044]
When creating the SMAF file 30, the HV authoring tool 42 reads an SMF (Standard MIDI File) file 41 created in advance with a MIDI sequencer (and including the note-ons that determine the HV sounding timing), and converts it into an SMAF file 43 (corresponding to the SMAF file 30) based on information obtained from the HV script UI 44 and the HV voice editor 45.
[0045]
The HV voice editor 45 is an editor capable of editing the HV user tone color parameter (corresponding to the user tone color parameter 12) included in the HV user tone color file 48. The HV voice editor 45 can assign a user phrase to an arbitrary mora in addition to editing various HV tone parameters.
[0046]
As its interface, the HV voice editor 45 has a menu for selecting a mora and a function for assigning an arbitrary sound file 50 to that mora. A sound file 50 assigned through the interface of the HV voice editor 45 is analyzed by the waveform analyzer 46, which generates frame data consisting of eight sets of formant frequencies, formant levels, and pitch. These frame data can be input and output as individual files (HV user tone color file 48, HV user synthesis dictionary file 49).
[0047]
The HV script UI 44 allows the HV script to be edited directly. This HV script can also be input and output as an individual file (HV script file 47). Further, the HV authoring tool 40 according to the present embodiment may include the HV authoring tool 42, the HV script UI 44, the HV voice editor 45, and the waveform analyzer 46.
[0048]
Next, an example in which the audio reproduction device 1 is applied to a mobile communication terminal will be described with reference to FIG. FIG. 6 is a block diagram illustrating a configuration example of the mobile communication terminal 60 including the audio reproduction device 1.
[0049]
The mobile communication terminal 60 is, for example, a mobile phone, and includes a CPU 61, a ROM 62, a RAM 63, a display unit 64, a vibrator 65, an input unit 66, a communication unit 67, an antenna 68, an audio processing unit 69, a sound source 70, a speaker 71, and a bus 72. The CPU 61 controls the entire mobile communication terminal 60. The ROM 62 stores various communication control programs, control programs such as a music reproduction program, and various constant data.
[0050]
The RAM 63 is used as a work area and stores music files and various application programs. The display unit 64 includes a liquid crystal display device (LCD). Vibrator 65 vibrates when an incoming call is received. The input unit 66 includes a plurality of buttons. The communication unit 67 includes a modem unit and the like, and is connected to the antenna 68.
[0051]
The audio processing unit 69 is connected to a transmission microphone and a reception speaker, and has a function of encoding and decoding the voice signal for a call. The sound source 70 reproduces music based on the music file stored in the RAM 63 or the like, reproduces audio, and outputs the result to the speaker 71. The bus 72 is a transmission path for transferring data among the CPU 61, ROM 62, RAM 63, display unit 64, vibrator 65, input unit 66, communication unit 67, audio processing unit 69, and sound source 70.
[0052]
Further, the communication unit 67 can download the HV-script file or the SMAF file 30 shown in FIG. 4 from the content server or the like and store it in the RAM 63. The ROM 62 also stores the application 14 and middleware program of the audio reproduction device 1 shown in FIG. The application 14 and middleware program are read and activated by the CPU 61. Further, the CPU 61 generates formant frame data by interpreting the HV-script stored in the RAM 63, and sends the formant frame data to the sound source 70.
[0053]
(Operation)
Next, the operation of the audio playback device 1 will be described. First, a method for producing the user phrase synthesis dictionary 13 will be described. FIG. 7 is a flowchart showing a method for producing the user phrase synthesis dictionary 13.
[0054]
First, by using the HV authoring tool 42 shown in FIG. 5, the HV tone color using the user phrase synthesis dictionary 13 is selected, and the HV voice editor 45 is activated (step S1).
Next, using the HV voice editor 45, a mora to be applied is selected and a sound file is pasted. Then, the HV voice editor 45 outputs a user phrase dictionary (corresponding to the HV user synthesis dictionary file 49) (step S2).
[0055]
Next, the HV voice parameter is edited using the HV voice editor 45. Then, the HV voice editor 45 outputs a user tone color parameter (corresponding to the HV user tone color file 48) (step S3).
[0056]
Next, using the HV script UI 44, a voice quality change event for designating the corresponding HV tone is described in the HV-script, and a mora to be reproduced is described. Then, the HV script UI 44 outputs an HV-script (corresponding to the HV script file 47) (step S4).
[0057]
Next, the reproduction operation of the user phrase dictionary in the audio reproduction device 1 will be described with reference to FIG. FIG. 8 is a flowchart showing the reproduction operation of the user phrase synthesis dictionary in the audio reproduction device 1.
First, the user tone color parameter 12 and the user phrase synthesis dictionary 13 are registered in the middleware of the audio reproduction device 1. Then, the script 11 is registered in the middleware of the audio reproduction device 1 and reproduction is started (steps S11 and S12).
[0058]
In the reproduction, it is monitored whether there is a voice quality change event (X event) for designating the user tone color parameter 12 in the script 11 (step S13).
When a voice quality change event is found in step S13, the phrase ID assigned to the mora is looked up in the user tone color parameter 12, and the data corresponding to that phrase ID is read from the user phrase synthesis dictionary 13. Then, of the data in the default synthesis dictionary 19 managed by the HV driver, the dictionary data of the corresponding mora is replaced with the data of the user phrase synthesis dictionary 13 (step S14).
The replacement process in step S14 may be performed in advance before reproduction.
[0059]
After step S14 ends, or when no voice quality change event is found in step S13, the converter 16 interprets the mora of the script 11 (or of the script after the replacement process in step S14, if step S14 was performed) and finally converts it into formant frame sequence data using the HV driver (step S15).
Next, the data converted in step S15 is reproduced by the sound source 20 (step S16).
[0060]
Next, it is determined whether or not the script 11 is finished (step S17). If not finished, the process returns to step S13. If finished, the reproduction operation of the user phrase dictionary is finished.
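The loop of steps S13 to S17 can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the middleware itself: the script is modeled as a list of hypothetical `("X", tone_name)` and `("mora", symbol)` tuples, frame data is represented by placeholder strings, and the "reproduction" of step S16 is reduced to collecting frames in a list. Working on a copy of the default dictionary is a choice made for this sketch; the description does not say whether the default synthesis dictionary 19 is modified in place.

```python
from typing import Dict, List, Tuple

Frames = List[str]  # placeholder for formant frame data


def play_script(
    script: List[Tuple[str, str]],
    user_tone_params: Dict[str, Dict[str, int]],  # tone name -> {mora: phrase ID}
    user_phrase_dict: Dict[int, Frames],          # phrase ID -> frames
    default_dict: Dict[str, Frames],              # mora -> frames
) -> Frames:
    """Walk the script once, swapping dictionary entries on voice quality change events."""
    working = dict(default_dict)  # copy, so the caller's default dictionary is untouched
    output: Frames = []
    for kind, value in script:  # repeats until the script ends (S17)
        if kind == "X":  # voice quality change event found (S13)
            for mora, phrase_id in user_tone_params[value].items():
                # Replace the mora's default data with user phrase data (S14).
                working[mora] = user_phrase_dict[phrase_id]
        else:
            # Convert the mora to frame data (S15) and "reproduce" it (S16).
            output.extend(working[value])
    return output
```

In this model, a mora that appears before the voice quality change event is rendered from the default data, and the same mora appearing after the event is rendered from the user phrase data.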
[0061]
Next, a method for producing the SMAF file 30 shown in FIG. 4 will be described with reference to FIG. FIG. 9 is a flowchart showing a method for producing the SMAF file 30.
First, the user phrase synthesis dictionary 13, the user tone color parameter 12, and the script 11 are produced according to the procedure shown in FIG. 7 (step S21).
[0062]
Next, the SMF file 41, which contains the music data and events for controlling the sounding of the HV script, is produced (step S22).
Next, the SMF file 41 is inputted to the HV authoring tool 42 shown in FIG. 5, and the SMF file 41 is converted into the SMAF file 43 (corresponding to the SMAF file 30) by the HV authoring tool 42 (step S23).
[0063]
Then, the user tone color parameter 12 created in step S21 is placed in the HV setup data chunk 37 of the HV chunk 36 of the SMAF file 30 shown in FIG. 4, the user phrase synthesis dictionary 13 created in step S21 is placed in the HV user phrase dictionary chunk 38 of the HV chunk 36 of the same SMAF file 30, and the result is output as the SMAF file 30 (step S24).
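Step S24 amounts to attaching the HV data to an already converted file. A minimal Python sketch of that step is given below, using plain dictionaries rather than the real SMAF binary layout; the function name `embed_hv_data` and all keys are hypothetical. Storing the script in the HV-S chunk 39 follows the chunk description given earlier rather than the wording of step S24 itself.

```python
from typing import Any, Dict


def embed_hv_data(
    smaf: Dict[str, Any],
    tone_params: Dict[str, int],
    phrase_dict: Dict[int, bytes],
    script: str,
) -> Dict[str, Any]:
    """Return a copy of `smaf` with an HV chunk (36) populated, as in step S24."""
    out = dict(smaf)  # leave the converted input untouched
    out["hv_chunk"] = {
        # HV setup data chunk 37
        "hv_setup_data": {"user_tone_color_parameters": dict(tone_params)},
        # HV user phrase dictionary chunk 38
        "hv_user_phrase_dictionary": dict(phrase_dict),
        # HV-S chunk 39
        "hv_s": script,
    }
    return out
```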
[0064]
Next, a method for reproducing the SMAF file 30 will be described with reference to FIG. FIG. 10 is a flowchart showing a method of reproducing the SMAF file 30.
First, the SMAF file 30 is registered in the middleware of the audio reproduction device 1 shown in FIG. 1 (step S31).
Here, the audio reproduction device 1 registers the music data portion of the SMAF file 30 in the music reproduction unit of the middleware in the usual way and prepares for reproduction.
[0065]
Next, the audio reproducing device 1 determines whether or not the HV chunk 36 is present in the SMAF file 30 (step S32).
If there is an HV chunk 36 in step S32, the audio reproducing device 1 interprets the contents of the HV chunk 36 (step S33).
Next, the audio reproduction device 1 registers the user tone color parameters, the user phrase synthesis dictionary, and the script (step S34).
[0066]
When there is no HV chunk 36 in step S32, or when the registration in step S34 is completed, the audio reproducing device 1 interprets the chunk of the music part (step S35).
Next, the audio reproducing device 1 performs music reproduction by starting interpretation of the sequence data (actual performance data) in the sequence data chunk 35 in response to the “start” signal (step S36).
[0067]
In this reproduction, in the process of sequentially interpreting events in the sequence data, the audio reproduction device 1 determines whether or not the event is HV note-on (step S37).
If it is determined in step S37 that the event is an HV note-on, the audio reproduction device 1 starts reproducing the HV script data of the HV chunk designated by that HV note-on (step S38).
[0068]
After this step S38, the audio reproducing device 1 performs the reproducing operation of the user phrase dictionary shown in FIG.
That is, the audio reproducing device 1 monitors whether or not there is a voice quality change event (X event) that specifies the user tone color parameter 12 in the reproduction in step S38 (step S39).
[0069]
If a voice quality change event is found in step S39, the phrase ID assigned to the mora is looked up in the user tone color parameter 12, and the data corresponding to that phrase ID is read from the user phrase synthesis dictionary 13. Then, of the data in the default synthesis dictionary 19 managed by the HV driver, the dictionary data of the corresponding mora is replaced with the user phrase dictionary data (step S40).
The replacement process in step S40 may be performed in advance before reproduction.
[0070]
After step S40 is completed, or when no voice quality change event is found in step S39, the converter 16 interprets the mora of the script and finally converts it into formant frame sequence data using the HV driver (step S41).
[0071]
Next, the audio reproducing device 1 reproduces the data converted in step S41 in the HV portion of the sound source 20 (step S42).
Next, the audio reproducing device 1 determines whether or not the music has ended (step S43). When the music has ended, the audio reproducing device 1 ends the reproduction of the SMAF file 30, and when the music has not ended, the process returns to step S37.
[0072]
In step S37, when the event is not HV note-on, the audio reproducing device 1 converts the event into music source reproduction event data as music data (step S44).
Next, the audio reproducing device 1 reproduces the data converted in step S44 on the music portion of the sound source 20 (step S45).
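The branching on event type in steps S37 to S45 can be outlined as a dispatch loop. The Python sketch below is a simplification under stated assumptions: events are hypothetical `(kind, value)` tuples, the two players are passed in as callbacks, and the handling of an HV note-on that names an unregistered script (skipping it) is an assumption, since the description does not cover that case.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # ("hv_note_on", script_id) or ("music", payload)


def play_smaf(
    sequence: List[Event],
    hv_scripts: Dict[str, str],        # registered in S34; empty if no HV chunk
    play_hv: Callable[[str], None],    # HV portion of the sound source
    play_music: Callable[[str], None], # music portion of the sound source
) -> None:
    """Interpret sequence events in order until the music ends (S36 to S43)."""
    for kind, value in sequence:
        if kind == "hv_note_on":  # S37: is the event an HV note-on?
            script = hv_scripts.get(value)
            if script is not None:
                play_hv(script)   # S38 to S42: reproduce the HV script
            # Unregistered script: skipped (assumption, not in the description).
        else:
            play_music(value)     # S44 and S45: ordinary music event
```

Keeping both kinds of event in one sequence is what lets the voice be triggered in time with the music: the HV note-on takes its timing from its position in the sequence data.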
[0073]
Thus, according to the present embodiment, the method of reproducing by formant synthesis using the resources of the FM sound source has the following three advantages.
First, according to this embodiment, a user can assign a favorite phrase. Thereby, it is possible to perform reproduction closer to the favorite voice color without depending on the fixed dictionary.
Secondly, according to the present embodiment, a part of the default synthesis dictionary 19 is replaced with the user phrase synthesis dictionary 13, so that an excessive increase in the data capacity of the audio reproduction device 1 can be avoided. In addition, since a part of the default synthesis dictionary 19 can be replaced with an arbitrary phrase, pronunciation in units of phrases becomes possible, which eliminates the unnatural feel at the joints between pronunciation units that occurs in conventional speech synthesized unit by unit.
Thirdly, according to the present embodiment, since an arbitrary phrase can be designated in the HV script, synthesis in mora units and pronunciation in phrase units can be used in combination.
[0074]
Furthermore, according to the present embodiment, it is possible to change the voice color at the formant level as compared with the method of reproducing the waveform data configured by sampling the phrase in advance. According to the present embodiment, although the data size and quality depend on the frame rate, high-quality reproduction can be performed with a much smaller data capacity than the sampling waveform data. Therefore, for example, the audio reproducing device 1 of the present embodiment can be easily incorporated into a mobile communication terminal such as a mobile phone, and the contents of an e-mail can be reproduced with high quality audio.
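The dependence of data size on frame rate can be made concrete with simple arithmetic. The Python sketch below compares the two storage approaches; every number in the usage note (sampling rate, sample width, frame rate, bytes per frame) is a hypothetical value chosen for illustration and does not come from this description.

```python
def pcm_bytes(seconds: int, sample_rate_hz: int, bytes_per_sample: int) -> int:
    """Size of uncompressed sampled waveform data."""
    return seconds * sample_rate_hz * bytes_per_sample


def formant_bytes(seconds: int, frame_rate_hz: int, bytes_per_frame: int) -> int:
    """Size of formant frame data at a given frame rate."""
    return seconds * frame_rate_hz * bytes_per_frame


# Hypothetical frame layout: 8 formants x (2-byte frequency + 1-byte level)
# plus a 2-byte pitch value.
ASSUMED_BYTES_PER_FRAME = 8 * (2 + 1) + 2
```

With an assumed 8 kHz, 16-bit waveform, `pcm_bytes(1, 8000, 2)` gives 16,000 bytes for one second, whereas `formant_bytes(1, 50, ASSUMED_BYTES_PER_FRAME)` at an assumed 50 frames per second gives 1,300 bytes. Raising the frame rate increases the formant data size proportionally, which is the trade-off between size and quality noted above.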
[0075]
Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment, and design changes and the like within a range that does not depart from the gist of the present invention are also included.
[0076]
[Effects of the invention]
As described above, according to the present invention, the data held for each pronunciation unit in the synthesis dictionary can be replaced with arbitrary user data, so that a desired phrase can be reproduced with high-quality sound.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an audio reproduction device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example in which a phrase ID is assigned to each sound generation unit.
FIG. 3 is a diagram showing an example of the contents of a user phrase synthesis dictionary.
FIG. 4 is a diagram showing a format of a SMAF file.
FIG. 5 is a functional image diagram showing an example of an HV authoring tool.
FIG. 6 is a block diagram illustrating an example of a mobile communication terminal including the audio reproduction device according to the present embodiment.
FIG. 7 is a flowchart of a method for creating a user phrase synthesis dictionary.
FIG. 8 is a flowchart of a reproduction operation of a user phrase synthesis dictionary.
FIG. 9 is a flowchart illustrating a method for producing a SMAF file.
FIG. 10 is a flowchart of a method for reproducing the SMAF file 30.
[Explanation of symbols]
1 ... Audio reproduction device, 11 ... Script, 12 ... User tone color parameter, 13 ... User phrase synthesis dictionary (variable length), 14 ... Application, 15 ... Middleware API, 16 ... Converter, 17 ... Driver, 18 ... Default tone color parameter, 19 ... Default synthesis dictionary, 20 ... Sound source, 30 ... SMAF file, 31 ... Content info chunk, 32 ... Optional data chunk, 33 ... Track chunk, 34 ... Setup data chunk, 35 ... Sequence data chunk, 36 ... HV chunk, 37 ... HV setup data chunk, 38 ... HV user phrase dictionary chunk, 39 ... HV-S chunk, 41 ... SMF file, 42 ... HV authoring tool, 43 ... SMAF file, 44 ... HV script UI, 45 ... HV voice editor, 46 ... Waveform analyzer, 47 ... HV script file, 48 ... HV user tone color file, 49 ... HV user synthesis dictionary file, 50 ... Sound file

Claims

1. An audio reproduction device having a synthesis dictionary, which is a database that holds in advance formant frame data corresponding to pronunciation units, and synthesizing speech using the synthesis dictionary when given information in which pronunciation units are listed, the device comprising:
replacement means for replacing the formant frame data of a pronunciation unit held in the synthesis dictionary with user data that is formant frame data acquired in phrase units; and
speech synthesis means for synthesizing speech, when the information in which pronunciation units are listed is given, using the synthesis dictionary whose held data has been replaced by the replacement means.

2. The audio reproduction device according to claim 1, wherein the user data is added to tone color parameters for processing the formant frame data held in the synthesis dictionary.

3. The audio reproduction device according to claim 1 or 2, wherein, when a tone color parameter to which the user data is added is given and that tone color parameter is designated at the time of reproduction, the replacement means replaces the formant frame data held in the synthesis dictionary with the user data, and
the speech synthesis means, when given information in which speech units are listed, synthesizes speech using the synthesis dictionary whose data has been replaced according to the tone color parameter.

4. The audio reproduction device according to any one of claims 1 to 3, wherein the user data is included in a data exchange format that defines an information structure for reproducing desired data with music and voice synchronized, and speech is synthesized using the data exchange format.

5. The audio reproduction device according to claim 4, wherein music information included in the information configured in the data exchange format is reproduced as it is, and audio information included in that information is reproduced using the replacement means and the speech synthesis means.

6. The audio reproduction device according to claim 4, wherein the data exchange format includes information obtained by adding the user data to an audio parameter.