JP3971577B2

JP3971577B2 - Speech synthesis apparatus and speech synthesis method, portable terminal, speech synthesis program, and program recording medium

Info

Publication number: JP3971577B2
Application number: JP2001017189A
Authority: JP
Inventors: 浩幸勘座
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-01-25
Filing date: 2001-01-25
Publication date: 2007-09-05
Anticipated expiration: 2021-01-25
Also published as: JP2002221982A

Abstract

PROBLEM TO BE SOLVED: To automatically suppress unnatural prosody caused by expressions that do not follow the grammar rules. SOLUTION: A first accent phrase generation section 23 generates an accent phrase based on the text analysis result of a text analysis section 21. A second accent phrase generation section 24 generates an accent phrase without depending on the text analysis result. An accent phrase generation determination section 22 determines, based on the text analysis result, that the accent phrase is to be generated by the section 23 when the input text is written language and by the section 24 when it is spoken language. When the input text is spoken language that does not follow the grammar, the accent phrase is therefore generated without depending on the text analysis result. Thus, generation of an unnatural pitch pattern from an incorrect text analysis result of the section 21 is prevented.

Description

【０００１】
【発明の属する技術分野】
この発明は、文字情報から音声を合成する音声合成装置および音声合成方法、携帯端末器、音声合成プログラム、並びに、プログラム記録媒体に関する。
【０００２】
【従来の技術】
従来より、文字情報から音声を合成するテキスト音声合成として、テキスト解析処理,韻律生成処理および音声合成処理の３つの処理を順次行う方法が知られている。図９に従来の音声合成装置のブロック図を示す。
【０００３】
テキスト解析部１は、上記テキスト解析処理を行ない、入力文字情報から単語境界を検出し、各単語の音素記号列を求める。また、韻律生成部２は、上記韻律生成処理を行ない、上記求められた音素の継続時間長,単語のアクセント,文イントネーション等の韻律情報を付与する。また、音声合成部３は、上記音声合成処理を行ない、予め蓄積してある合成単位と規則とに基づいて、音声合成器の制御信号を生成する。
【０００４】
以下、日本語のテキスト音声合成装置を例に、テキスト音声合成方法について詳細に説明する。日本語は、英語のように単語の境界をスペースで区切る言語と異なり、単語境界が明確でない所謂膠着語であるため、テキスト解析処理を行って単語境界を検出するのである。このテキスト解析処理は、単語の表記や読みの情報を記憶した辞書と単語の接続関係情報を記憶した文法とを用いて、文章の先頭から順次照合処理を行うことによって実行される。
【０００５】
上記単語には、名詞や動詞のような自立語と、助詞や助動詞のような付属語とがある。例えば、「今日は天気です。」という文は、以下のようにテキスト解析される。
「今日(名詞)/ は(助詞)/ 天気(名詞)/ です(助動詞)。」
【０００６】
このようなテキスト解析結果に基づいて、韻律生成処理および音声合成処理を行うのが一般的なテキスト音声合成方法である。尚、韻律生成処理および音声合成処理の詳細については、例えば古井著「ディジタル音声処理」(東海大学出版会)に記載されている通りである。
【０００７】
【発明が解決しようとする課題】
しかしながら、上記従来のテキスト音声合成方法においては、以下のような問題がある。すなわち、近年、インターネット等の普及によって電子化された文字情報が一般社会で日常使われるようになってきている。特に、メール文のように日常会話で使う言葉で書かれたテキストが増加している。日常会話で使うような所謂話し言葉は、表現が多様であるため文法で規則化することは困難である。
【０００８】
このように、文法では規定できないような話し言葉が入力テキストとして与えられた場合、テキスト解析が正しく行われないことが多い。その場合、上記韻律生成処理はテキスト解析結果が正しいという前提で行われるために、不自然な韻律が生成されてしまうのである。
【０００９】
例えば、「見たことなーい」という話し言葉文が、テキスト解析処理によって以下のように解析されたとする。
「見(動詞)/ たこ(名詞)/ となー(名詞)/ い(名詞)」
このテキスト解析結果に基づいて韻律生成処理が行われると、「見る」という動詞と「たこ」という名詞に誤解析されたことが原因となって、音節「た」の位置で声立て成分が開始されて不自然なアクセントになってしまうのである。
【００１０】
このような問題を解決するために、特開平１１‐２５９０９４号公報においては、図１０にブロック図を示すような音声合成装置が提案されている。図１０において、テキスト解析部１１,韻律生成部１２および音声合成部１３は、図９におけるテキスト解析部１,韻律生成部２および音声合成部３と同じである。本音声合成装置は、ユーザの選択した文字列に付与された韻律情報をユーザの指示に応じて修正する韻律編集部１４を有している。したがって、テキスト解析部１１の誤解析等に起因して韻律生成部１２によって不自然な韻律が生成された場合には、韻律の不自然な箇所を韻律編集部１４の修正機能を用いてユーザが修正することによって、自然な音声に修正することができるのである。
【００１１】
しかしながら、上記特開平１１‐２５９０９４号公報に記載された音声合成装置においては、ユーザが手作業で修正する必要があり、ユーザに手間と負担が掛るという問題がある。
【００１２】
そこで、この発明の目的は、話し言葉等に出現する文法規定外の表現に起因して生成される不自然な韻律を自動的に抑制できる音声合成装置および音声合成方法、この音声合成装置が搭載された携帯端末器、音声合成プログラム、並びに、プログラム記録媒体を提供することにある。
【００１３】
【課題を解決するための手段】
上記目的を達成するため、第１の発明は、
入力テキストを解析するテキスト解析手段と、上記テキスト解析結果に基づいて韻律情報を生成する韻律生成手段と、上記テキスト解析結果および韻律情報に基づいて音声を合成する音声合成手段を有する音声合成装置において、
上記テキスト解析結果のうちの品詞付き単語に基づいてアクセント句を生成して上記韻律生成手段に送出する第１アクセント句生成手段と、
上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなくアクセント句を生成して上記韻律生成手段に送出する第２アクセント句生成手段と、
上記テキスト解析結果に基づいて、上記第１アクセント句生成手段と第２アクセント句生成手段との何れによってアクセント句を生成するかを、仮名連鎖分岐確率およびテキスト解析尤度分岐確率の少なくとも一つを用いて判定するアクセント句生成判定手段
を備え、
上記仮名連鎖分岐確率は、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表しており、
上記テキスト解析尤度分岐確率は、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表している
ことを特徴としている。
【００１４】
上記構成によれば、アクセント句生成判定手段によって、入力テキストに基づくアクセント句の生成を、テキスト解析結果のうちの品詞付き単語に基づいて生成する第１アクセント句生成手段と上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなく生成する第２アクセント句生成手段との何れによって行うかが予め判定される。したがって、例えば話し言葉のようにテキスト解析手段によって誤解析され易い入力テキストに関するアクセント句は、上記第２アクセント句生成手段によって、テキスト解析結果のうちの品詞付き単語に囚われることなく生成することが可能になる。
【００１５】
さらに、上記アクセント句生成判定手段によって、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表す仮名連鎖情報、および、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表すテキスト解析尤度情報の少なくとも一つを基準として、第１アクセント句生成手段か第２アクセント句生成手段かの判定が行われる。したがって、話し言葉での入力テキストに基づくアクセント句の生成は第２アクセント句生成手段によって行うべきと、的確に判定される。
【００１６】
また、第１の実施例は、上記第１の発明の音声合成装置において、
上記第２アクセント句生成手段は、生成するアクセント句における声立て成分の開始位置を、仮名連鎖情報,テキスト解析尤度情報,アクセント句候補のモーラ数およびアクセント句候補中の位置の少なくとも一つを用いて設定する
ことを特徴としている。
【００１７】
この実施例によれば、上記第２アクセント句生成手段によって、仮名連鎖情報,テキスト解析尤度情報,アクセント句候補のモーラ数およびアクセント句候補中の位置の少なくとも一つを用いて、生成するアクセント句における声立て成分の開始位置が設定される。こうして、上記品詞付き単語に囚われることなく正しくアクセント句が生成される。すなわち、例えば話し言葉のように文法では規定できないような入力テキストが与えられても、不自然なピッチパターンの生成が抑制されて自然な韻律が生成される。
【００１８】
また、第２の実施例は、上記第１の発明の音声合成装置において、
上記仮名連鎖情報は、テキストデータに基づいて予め求められた連続する二つの仮名文字の間で声立て成分が開始される確率であり、
上記テキスト解析尤度情報は、上記テキスト解析尤度分岐確率の逆数の値で与えられる声立て成分が開始される確率であり、
上記アクセント句候補のモーラ数は、アクセント句候補の先頭文字に上記アクセント句候補モーラ数に応じて与えられる声立て成分が開始される確率であり、
上記アクセント句候補中の位置は、上記アクセント句候補中で文字が占める位置に基づいて与えられる声立て成分が開始される確率である
ことを特徴としている。
【００１９】
この実施例によれば、上記第２アクセント句生成手段によって、テキストデータに基づいて予め求められた連続する二つの仮名文字の間で声立て成分が開始される確率である仮名連鎖情報、上記テキスト解析尤度分岐確率の逆数の値で与えられる声立て成分が開始される確率であるテキスト解析尤度情報、アクセント句候補の先頭文字に上記アクセント句候補モーラ数に応じて与えられる声立て成分が開始される確率であるアクセント句候補のモーラ数、および、上記アクセント句候補中で文字が占める位置に基づいて与えられる声立て成分が開始される確率であるアクセント句候補中の位置、の少なくとも一つを用いて、生成するアクセント句における声立て成分の開始位置が設定される。
【００２０】
また、第２の発明は、
入力テキストを解析し、このテキスト解析結果に基づいて韻律情報を生成し、上記テキスト解析結果および韻律情報に基づいて音声を合成する音声合成方法において、
上記テキスト解析結果のうちの品詞付き単語に基づいて、上記韻律情報を生成する際に用いる第１アクセント句を生成する第１アクセント句生成ステップと、
上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなく、上記韻律情報を生成する際に用いる第２アクセント句を生成する第２アクセント句生成ステップと、
上記テキスト解析結果に基づいて、上記第１アクセント句と第２アクセント句とのうちの何れのアクセント句を生成するかを、仮名連鎖分岐確率およびテキスト解析尤度分岐確率の少なくとも一つを用いて判定するアクセント句生成判定ステップ
を備え、
上記仮名連鎖分岐確率は、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表しており、
上記テキスト解析尤度分岐確率は、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表している
ことを特徴としている。
【００２１】
上記構成によれば、入力テキストに基づくアクセント句の生成を、テキスト解析結果のうちの品詞付き単語に基づいて生成するか、上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなく生成するかが、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表す仮名連鎖分岐確率、および、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表すテキスト解析尤度分岐確率、の少なくとも一つを用いて予め判定される。したがって、例えば話し言葉のようにテキスト解析の際に誤解析され易い入力テキストに関するアクセント句は、テキスト解析結果のうちの上記品詞付き単語に囚われることなく生成することが可能になる。
【００２２】
また、第３の発明の携帯端末器は、上記第１の発明の音声合成装置を搭載したことを特徴としている。
【００２３】
上記構成によれば、例えば話し言葉のように文法では規定できない入力テキストに対して自然なアクセント句を与えることができる音声合成装置が携帯端末器に搭載される。したがって、日常会話で使う言葉で書かれたメール文を受信した場合でも合成音声によって正確に出力することが可能になり、携帯端末器の操作性が向上される。
【００２４】
また、第４の発明の音声合成プログラムは、コンピューターを、上記第１の発明におけるテキスト解析手段,韻律生成手段,音声合成手段,アクセント句生成判定手段,第１アクセント句生成手段および第２アクセント句生成手段として機能させることを特徴としている。
【００２５】
また、第５の発明のプログラム記録媒体は、上記第４の発明の音声合成プログラムが記録されたことを特徴としている。
【００２６】
上記第４,第５の発明の構成によれば、上記第１の発明の場合と同様に、例えば話し言葉のようにテキスト解析手段で誤解析され易い入力テキストに関するアクセント句が、上記第２アクセント句生成手段によって、テキスト解析結果のうちの品詞付き単語に囚われることなく生成することが可能になる。
【００２７】
【発明の実施の形態】
以下、この発明を図示の実施の形態により詳細に説明する。図１は、本実施の形態の音声合成装置におけるブロック図である。テキスト解析部２１は、入力されたテキストを解析して単語境界を検出し、各単語の音素記号列を求める。アクセント句生成判定部２２は、上記テキスト解析結果に基づいて、アクセント句の生成を第１アクセント句生成部２３で行なうか第２アクセント句生成部２４で行なうかを判定する。そして、第１アクセント句生成部２３によって、上記テキスト解析結果に基づいてアクセント句が生成される。一方、第２アクセント句生成部２４は、上記テキスト解析結果に依存せずにアクセント句を生成する。
【００２８】
韻律生成部２５は、上記第１アクセント句生成部２３あるいは第２アクセント句生成部２４によって生成された各アクセント句に対して、音素の継続時間長,アクセント核の位置および文イントネーション等の韻律情報を付与する。音声合成部２６は、上記付与された韻律生成情報に基づいて、予め蓄積されている合成単位と規則とによって音声合成器の制御信号を生成する。
【００２９】
上記テキスト解析部２１,第１アクセント句生成部２３,韻律生成部２５および音声合成部２６の詳細については、例えば、古井著「ディジタル音声処理」(東海大学出版会)に記載されている通りであり、ここでは用語の簡単な説明にとどめる。
【００３０】
アクセント核を１個保有するアクセントのまとまりをアクセント句という。ここで、上記アクセント核とは、個々の語において、声の高さが高から低へ移る位置をいう。声は、その出始めでは高いが、次第に声門下圧の低下等によって高さが低下する。このようなピッチ(基本周波数)が時間と共に低下する特性を声立て成分と呼び、この特性の上に、アクセントによって決まる単語および文節固有のピッチパターン(アクセント成分)が重畳されて、文全体のピッチパターンが決まる。図６にピッチパターンを求める過程を示す。
【００３１】
以下においては、説明を容易にするために、アクセント句生成判定部２２は、テキスト解析結果を見て、書き言葉であれば第１アクセント句生成部２３に解析結果データを送る一方、話し言葉であれば第２アクセント句生成部２４に判定結果データを送るものとする。しかしながら、この発明はこれに限定されるものではない。また、説明の都合上、先ず第１アクセント句生成部２３による話し言葉の処理に関する問題点について述べる。尚、第１アクセント句生成部２３の機能は、図９や図１０に示す従来の音声合成装置においては、テキスト解析部１,１１または韻律生成部２,１２の何れか、あるいは両者で行われるものである。そして次に、アクセント句生成判定部２２の処理、最後に第２アクセント句生成部２４の処理の順に説明する。
【００３２】
上記第１アクセント句生成部２３は、上記テキスト解析部２１によるテキスト解析の結果に基づいてアクセント句を生成するものであり、上述したように従来から一般的に行なわれている技術である。例として、単語の接続関係情報を記憶した文法に則った文「今日は天気です。」に対する第１アクセント句生成部２３でのピッチパターンの生成は、上述のように図６に示す手順によって行なわれる。こうして、文法に則った文が正しくテキスト解析されれば、問題なく第１アクセント句生成部２３によってピッチパターンが生成されるのである。
【００３３】
ここで、仮に、上記第１アクセント句生成部２３によって、文法に則っていない「なーんちゃってぇー」という文のピッチパターンを生成すると図７に示すようになる。すなわち、テキスト解析部２１によるテキスト解析結果は、「なー(助詞：終助詞)/ ん(助詞：格助詞)/ ちゃっ(動詞：５段ワ行)/ て(助詞：接続助詞)/ ぇ(未知語)/ ー(未知語)」のように解析され、「ん」と「ちゃっ」の間にアクセント句の区切れがあると判断されることで、「ちゃっ」のところで次の声立て成分が開始される。これは、図７において、声立て成分が２つに別れていることで示されており、不自然なピッチパターンの原因になっている。
【００３４】
そこで、本実施の形態における音声合成装置では、上記第２アクセント句生成部２４を設けて、図８に示すように、アクセント句の区切れで生成される次の声立て成分の開始を抑制し、更にアクセント成分も抑制することによって、ピッチパターンの変動を抑えて大きく誤らないようにするのである。
【００３５】
上記テキスト解析部２１によるテキスト解析が確実に正しく行われれば、第１アクセント句生成部２３だけで十分なのである。ところが、現時点におけるテキスト解析処理では、区切り位置の誤りや品詞の判断誤り、あるいは辞書に登録されていない未知語の処理等、不完全な部分がまだある。特に、話し言葉のような文法規定外の入力テキストからは、韻律情報を付与するための正確な情報は得にくい。すなわち、「なーんちゃってぇー」のような話し言葉を辞書や文法で表現しようとしても、多くのバリエーションがあるために書き言葉に比べて規則化が困難なのである。
【００３６】
上記話し言葉の特徴は仮名文字列に現れる。本実施の形態においては、この仮名文字列の特徴を捕えて不自然な韻律を抑制するのである。例えば、「なーんちゃってぇー」の例の場合には、「ちゃっ」が動詞であるというテキスト解析結果を用いないために、「なーんちゃってぇー」という一つのアクセント句に対してピッチパターンを生成できるのである。
【００３７】
次に、「なーんちゃってぇー」を一つのアクセント句として第２アクセント句生成部２４で処理すべきであると判定するアクセント句生成判定部２２について述べる。書き言葉のテキスト解析結果は、一般的に自立語と付属語とが連続する形になる。これに対して、話し言葉をテキスト解析すると、誤解析によって、自立語がない文節ができたり辞書に登録されていない未知語と判定されたりするという現象が見られる。そこで、この現象を捕えて、テキスト解析結果が信頼できると判定すれば第１アクセント句生成部２３でアクセント句生成の処理を行ない、そうでなければ第２アクセント句生成部２４でアクセント句生成の処理を行なうのである。
【００３８】
したがって、上記第２アクセント句生成部２４で処理を行なう場合には、どの単位をアクセント句とするかを予め決めてやる必要がある。その場合、テキスト解析部２１によるテキスト解析の結果は信頼性が低いため、区切り位置や品詞情報は使用しないようにする。そして、未知語と判定された単語および小文字「ぇ」や長音記号「ー」を含む部分は話し言葉である可能性が高いため、アクセント句を細切れとせずに広い範囲をアクセント句としてまとめるのである。
【００３９】
このように、上記テキスト解析結果に未知語を含んだりあるいは話し言葉特有の文字が存在するという情報を手がかりにすることによって、アクセント句生成判定部２２によって、入力された文字列が書き言葉であるか話し言葉であるか、すなわち第１アクセント句生成部２３で処理するか第２アクセント句生成部２４で処理するかを判断することが可能になるのである。
【００４０】
図２に、上記テキスト解析部２１,アクセント句生成判定部２２,第１アクセント句生成部２３および第２アクセント句生成部２４によって行なわれるアクセント句生成処理動作のフローチャートを示す。以下、第１アクセント句生成部２３で処理される通常のテキスト「今日は天気です」と、第２アクセント句生成部２４で処理される話し言葉のテキスト「なーんちゃってぇー」とを例に、アクセント句生成処理動作の具体的手法について説明する。
【００４１】
ステップＳ1で、上記テキスト解析部２１によって入力テキストに対してテキスト解析処理が行なわれる。ステップＳ2で、単語番号ｉに初期値「１」がセットされる。ステップＳ3で、単語番号ｉが、上記テキスト解析処理結果に基づく当該入力テキストの単語数Ｎ1よりも大きいか否かが判別される。その結果、Ｎ1よりも大きければアクセント句生成処理動作を終了する。一方、Ｎ1以下であればステップＳ4に進む。ステップＳ4で、ｉ番目の単語が読み出されて変数Ｔiに代入される。ステップＳ5で、単語Ｔi中に連続する仮名列が在るか否かが判別される。その結果、在ればステップＳ6に進む。一方、なければステップＳ9に進む。ステップＳ6で、仮名連鎖分岐確率テーブルが参照される。
【００４２】
ここで、仮名連鎖分岐確率とは、２つの仮名文字の第１文字Ｗiと第２文字Ｗjとが連続して出現する場合に第２アクセント句生成部２４での処理に分岐すべきと判断される確率(つまり、話し言葉である確率)であり、予め求められて仮名連鎖分岐確率テーブルに格納されている。上記仮名連鎖分岐確率テーブルの求め方は次のように行う。
【００４３】
予め大量のテキストデータに基づいて、任意の平仮名文字連鎖Ｗi,Ｗjが書き言葉のテキストコーパスＬ1と話し言葉のテキストコーパスＬ2との夫々に出現する確率Ｐ(Ｗi,Ｗj,Ｌ1)とＰ(Ｗi,Ｗj,Ｌ2)とを求める。そして、平仮名文字連鎖Ｗi,Ｗjが出現した場合に話し言葉のテキストコーパスＬ2に属する確率Ｒ(Ｗi,Ｗj)を、次式
Ｒ(Ｗi,Ｗj)＝Ｐ(Ｗi,Ｗj,Ｌ2)/{Ｐ(Ｗi,Ｗj,Ｌ1)＋Ｐ(Ｗi,Ｗj,Ｌ2)}
によって求める。こうして求めた、話し言葉のテキストコーパスＬ2に属する確率Ｒ(Ｗi,Ｗj)を上記分岐確率として、第１文字Ｗiと第２文字Ｗjとに対応付けてテーブルに格納することによって、上記仮名連鎖分岐確率テーブルが得られるのである。
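As an illustration of the formula above, the following Python sketch estimates R(Wi, Wj) from two tiny corpora. The helper names (`bigram_probs`, `branch_probability`), the toy corpora, the use of relative bigram frequency for P, and the value returned for unseen chains are all assumptions made for this example; the description only states that the probabilities are obtained in advance from a large amount of text data.

```python
from collections import Counter

def bigram_probs(corpus):
    """Relative frequency of each adjacent character pair in a list of strings."""
    counts = Counter()
    for text in corpus:
        counts.update(zip(text, text[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

def branch_probability(pair, p_written, p_spoken):
    """R(Wi,Wj) = P(Wi,Wj,L2) / (P(Wi,Wj,L1) + P(Wi,Wj,L2))."""
    p1 = p_written.get(pair, 0.0)
    p2 = p_spoken.get(pair, 0.0)
    if p1 + p2 == 0.0:
        return 0.0  # assumption: a chain seen in neither corpus does not trigger the branch
    return p2 / (p1 + p2)

# Toy stand-ins for the written-language corpus L1 and the spoken-language corpus L2.
written = ["きょうはてんきです", "あしたははれです"]
spoken = ["なーんちゃってぇー", "みたことなーい"]

p_l1, p_l2 = bigram_probs(written), bigram_probs(spoken)
r_desu = branch_probability(("で", "す"), p_l1, p_l2)  # chain seen only in L1
r_naa = branch_probability(("な", "ー"), p_l1, p_l2)   # chain seen only in L2
```

With these toy corpora the chain 「で」「す」 gets a branch probability of 0 and the chain 「な」「ー」 gets 1, which matches the qualitative behaviour described for FIG. 3. A real table would need far larger corpora and probably some smoothing.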
【００４４】
図３は上記仮名連鎖分岐確率テーブルの一例を示し、例えば、第１文字「で」と第２文字「す」と両仮名文字連鎖が現れた場合にテキストコーパスＬ2に属する確率値Ｒ(で,す)である分岐確率とが対応付けられて格納されている。この場合、仮名文字「で」と「す」との連鎖は話し言葉特有のものではないために、分岐確率Ｒ(で,す)の値は小さい。一方、仮名文字「な」と「ー」との連鎖は話し言葉特有のものであり、分岐確率Ｒ(な,ー)の値は大きい。
【００４５】
ステップＳ7で、解析尤度分岐確率テーブルが参照される。ここで、解析尤度分岐確率は、テキスト解析の結果の信頼性が低いために第２アクセント句生成部２４での処理に分岐すべきと判断される確率(つまり、話し言葉である確率)である。例えば、品詞が「未知語」であれば解析尤度分岐確率は高くなり、その他の品詞であれば小さくなる。また、文頭が付属語で始まる場合にはテキスト解析の信頼性は低いと考えられるため、解析尤度分岐確率は高くなる。この解析尤度分岐確率は、品詞条件とその品詞条件を満たす場合には第２アクセント句生成部２４での処理に分岐すべきと判断される分岐確率とが対応付けられて格納された解析尤度分岐確率テーブルを参照することで求められる。図４は上記解析尤度分岐確率テーブルの一例を示す。例えば、「今日は天気です」中の「です」は、品詞が助動詞で付属語ではあるが名詞「天気」に後続しているために文頭の付属語ではなく、解析尤度分岐確率値は小さい値となるのである。
【００４６】
ステップＳ8で、上記ステップＳ6において求められた仮名連鎖分岐確率値とステップＳ7において求められた解析尤度分岐確率値とに基づいて、分岐確率が計算される。ステップＳ9で、アクセント句が形成されるか否かが判別される。その結果、アクセント句が形成される場合はステップＳ10に進む一方、形成されない場合はステップＳ13に進む。ステップＳ10で、分岐確率は所定値αよりも大きいか否かが判別される。その結果、所定値αよりも大きければステップＳ11に進み、所定値α以下であればステップＳ12に進む。ステップＳ11で、上記第２アクセント句生成部２４によってアクセント句が生成される。そうした後にステップＳ13に進む。ステップＳ12で、テキスト解析結果に基づいて、第１アクセント句生成部２３によってアクセント句が生成される。ステップＳ13で、単語番号ｉがインクリメントされる。そうした後に上記ステップＳ3に戻って、次の単語番号ｉの処理に移行する。そして、上記ステップＳ3において、単語番号ｉが入力テキストの単語数Ｎ1よりも大きいと判別されると、アクセント句生成処理動作を終了する。
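The loop of steps S2 to S13 can be sketched in Python as follows. This is a minimal sketch under several assumptions that the description leaves open: the table values, the threshold `ALPHA`, the part-of-speech labels, combining the two table look-ups by a simple mean, taking the maximum over the words of a phrase, and supplying the phrase-boundary decision of step S9 as a precomputed flag (`ends_phrase`). All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    surface: str       # word text from the text analysis (step S4)
    pos: str           # part-of-speech label
    ends_phrase: bool  # assumption: outcome of the step S9 decision is given

KANA_CHAIN = {("な", "ー"): 0.9, ("で", "す"): 0.1}       # hypothetical FIG. 3 values
POS_LIKELIHOOD = {"unknown": 0.9, "initial_ancillary": 0.8}  # hypothetical FIG. 4 values
DEFAULT_LIKELIHOOD = 0.1
ALPHA = 0.5  # hypothetical threshold α

def is_kana(ch):
    """True for hiragana and katakana, including the long-vowel mark ー."""
    return "\u3041" <= ch <= "\u30ff"

def word_branch_probability(word):
    """Steps S5 to S8: combine the two table look-ups for one word."""
    s = word.surface
    pairs = [(a, b) for a, b in zip(s, s[1:]) if is_kana(a) and is_kana(b)]
    if not pairs:
        return 0.0  # no consecutive kana: no branch probability is calculated
    chain = max(KANA_CHAIN.get(p, 0.0) for p in pairs)
    likelihood = POS_LIKELIHOOD.get(word.pos, DEFAULT_LIKELIHOOD)
    return (chain + likelihood) / 2  # assumption: simple mean of the two values

def route(words):
    """Steps S2 to S13: return (phrase text, generator) for each accent phrase."""
    routed, pending, prob = [], "", 0.0
    for w in words:
        pending += w.surface
        prob = max(prob, word_branch_probability(w))
        if w.ends_phrase:                                      # step S9
            generator = "second" if prob > ALPHA else "first"  # steps S10 to S12
            routed.append((pending, generator))
            pending, prob = "", 0.0
    return routed

written = [Word("今日", "noun", False), Word("は", "particle", True),
           Word("天気", "noun", False), Word("です", "auxiliary_verb", True)]
spoken = [Word("なー", "initial_ancillary", False), Word("ん", "particle", False),
          Word("ちゃっ", "verb", False), Word("て", "particle", False),
          Word("ぇ", "unknown", False), Word("ー", "unknown", True)]
```

With these toy values, `route(written)` sends 「今日は」 and 「天気です」 to the first generator, while `route(spoken)` sends the single phrase 「なーんちゃってぇー」 to the second.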
【００４７】
以下、通常のテキスト「今日は天気です」が入力された場合を例に、上述したアクセント句生成処理動作について具体的に説明する。先ず、テキスト「今日は天気です」に対してテキスト解析が行なわれ、処理結果「今日(名詞)/ は(助詞)/ 天気(名詞)/ です(助動詞)」が得られる。この場合には、上記テキスト解析処理によって、入力テキスト「今日は天気です」は４つの単語(Ｎ1＝４)に区切られる。
【００４８】
次に、１番目の単語「今日」が読み出される。そして、この単語「今日」には連続する仮名列はないので、アクセント句を形成するか否かが判別される。そして、後方に助詞が続くのでアクセント句は形成されないと判定されて、２番目の単語「は」が読み出される。そして、前の単語「今日」との連結を考慮しても連続する仮名列がないので、アクセント句を形成するか否かが判別される。そして、前の単語「今日」との結合で「今日は」という文節になるため、アクセント句を形成すると判別される。ここで、連続する仮名列はなく分岐確率の計算処理を行っていないため分岐確率は「０」となり、第１アクセント句生成部２３によって、テキスト解析結果に基づいてアクセント句が生成される。
【００４９】
次に、３番目の単語「天気」に対する処理が１番目の単語「今日」の場合と同様に処理される。次に、４番目の単語「です」が読み出される。そして、この単語「です」には、連続する仮名列(「で」と「す」)とがあるので、「で」と「す」との仮名連鎖分岐確率と解析尤度分岐確率とが求められる。また、求められた仮名連鎖分岐確率値と解析尤度分岐確率値とに基づいて、分岐確率が計算される。この場合、仮名連鎖分岐確率値と解析尤度分岐確率値との両者共に小さいために、単語「です」の分岐確率の値は小さくなる。さらに、アクセント句「天気です」が形成されると判断される。そして、上記分岐確率の値は小さいためにαより小さいと判断されて、第１アクセント句生成部２３によるテキスト解析結果に基づくアクセント句の生成が行なわれるのである。そして、単語番号ｉの内容が単語数「４」より大きくなると、テキスト「今日は天気です」によるアクセント句生成部判定処理動作を終了する。尚、上述の例においては２連鎖の仮名列を例に説明しているが、３連鎖以上であっても同様である。
【００５０】
次に、話し言葉によるテキスト「なーんちゃってぇー」が入力された場合を例に挙げて、上述したアクセント句生成処理動作について具体的に説明する。先ず、テキスト「なーんちゃってぇー」に対してテキスト解析が行なわれ、処理結果「なー(助詞：終助詞)/ ん(助詞：格助詞)/ ちゃっ(動詞：５段ワ行)/ て(助詞：接続助詞)/ ぇ(未知語)/ ー(未知語)」が得られる。この場合は、上記テキスト解析処理によって、入力テキスト「なーんちゃってぇー」は６つの単語に区切られる。
【００５１】
次に、１番目の単語「なー」が読み出される。そして、この単語「なー」には、連続する仮名列(「な」と「ー」)とがあるため、「な」と「ー」との仮名連鎖分岐確率と解析尤度分岐確率とが求められる。その場合、「な」と「ー」との連鎖は話し言葉特有のものであるために、仮名連鎖分岐確率Ｒ(な,ー)の値は大きくなっている。また、文頭が付属語で始まる場合はテキスト解析の信頼性が低いと考えられるために、解析尤度分岐確率は大きくなっている。そして、求められた仮名連鎖分岐確率値と解析尤度分岐確率値とに基づいて、分岐確率が計算される。この場合、仮名連鎖分岐確率値と解析尤度分岐確率値との両者共に大きいため、単語「なー」の分岐確率の値は大きくなる。
【００５２】
さらに、後続の単語「ん」とまとまってアクセント句が形成されるため、当該単語「なー」だけではアクセント句が形成されないと判断される。次に、２番目の単語「ん」に対する処理が１番目の単語「なー」の場合と同様に処理される。そして、アクセント句を形成するか否かを判別する際に、後続の「ちゃっ」という動詞との間にアクセント句の切れ目がないと判断され、「なーん」だけではアクセント句は形成しないと判別される。このことは、「なーん」や「ちゃっ」の分岐確率がある程度高いことから判断される。以下、３番目の単語「ちゃっ」から６番目の単語「ー」に対して同様の処理が行われ、何れの単語も分岐確率が高いことからアクセント句を形成することはないと判断される。結局、入力テキスト「なーんちゃってぇー」に対するテキスト解析によって区切られた単語は、夫々分岐確率が高いことから「なーんちゃってぇー」という一つのアクセント句が形成されることになる。
【００５３】
このようにして形成された一つのアクセント句は、上記分岐確率が大きいためにαより大きいと判断されて、第２アクセント句生成部２４によって、テキスト解析の結果を用いずにアクセント句が生成されるのである。したがって、第１アクセント句生成部２３によって、テキスト解析の誤解析結果を用いてアクセント句を生成することによる不自然なアクセントの生成を避けることができるのである。
【００５４】
次に、上記第２アクセント句生成部２４によって実行されるテキスト解析結果を用いないアクセント句生成処理について詳細に説明する。図５に、第２アクセント句生成部２４によるアクセント句生成処理動作のフローチャートを示す。図２に示すアクセント句生成処理動作における上記ステップＳ11において、アクセント句候補「なーんちゃってぇー」が第２アクセント句生成部２４に送出されるとアクセント句生成処理動作がスタートする。
【００５５】
ステップＳ21で、入力アクセント句候補のモーラ番号ｊに初期値「１」がセットされる。ステップＳ22で、入力アクセント句候補「なーんちゃってぇー」からｊ番目のモーラに該当する文字が読み出されて変数Ｍjに代入される。ステップＳ23で、仮名連鎖Ｍ(j-1),Ｍjに基づいて、文字Ｍjの部分で声立て成分が開始される確率(以下、声立て確率と言う)が仮名連鎖情報テーブルを用いて求められ、変数ａ1に代入される。ここで、上記仮名連鎖情報テーブルは、連続する二つの仮名文字の間で声立て成分が開始される確率を予め大量のテキストデータに基づいて求めたものである。アクセント句生成判定部２２で用いられる上記仮名連鎖分岐確率テーブルは、その確率値(分岐確率値)は話し言葉である確率値である。これに対して、仮名連鎖情報テーブルの確率値は、上記声立て確率値であることだけが異なるのである。したがって、上記仮名連鎖情報テーブルの確率値が大きければ、第２文字Ｍjで声立て成分が開始される可能性が高いのである。例えば、入力アクセント句候補「なーんちゃってぇー」における「ん」と「ちゃ」との場合には、大量のテキストデータ中において「ん」と「ちゃ」との間で声立て成分が開始される場合は少ないので、その声立て確率値は低くなるのである。
【００５６】
ステップＳ24で、仮名Ｍjに続く文字列に基づいて、図２に示すアクセント句生成処理動作における上記ステップＳ7において参照された解析尤度分岐確率の値が検索され、その逆数の値が変数ａ2に代入される。ここで、上記解析尤度分岐確率が高いと言うことはテキスト解析結果の信頼性が低いことを意味しているので、解析尤度分岐確率の値が大きければ文字Ｍjが声立て成分の開始位置となる可能性は低くなる。例えば、解析尤度を計る尺度として品詞情報を例に説明すると、未知語と解析された仮名文字列は、テキスト解析結果が正しい確率は低いので声立て成分の開始位置となる可能性も低い。これに対して、代名詞,副詞等と解析された平仮名は、テキスト解析結果が正しい確率は高いので声立て成分の開始位置となる可能性も高いのである。
【００５７】
入力アクセント句候補の仮名文字連鎖「なーん」の場合は、文頭であるにも拘らず助詞＋助詞(つまり、文頭の付属語)と解析されているので、解析尤度分岐確率の値は高くなる。したがって、その逆数であるａ2の値は小さくなるのである。
【００５８】
ステップＳ25で、入力アクセント句候補のモーラ数に基づく声立て成分開始確率が変数ａ3に代入される。入力アクセント句候補のモーラ数が多ければ当該アクセント句候補の先頭で声立て成分を開始する必要性は高くなるので、先頭文字における上記声立て確率はモーラ数に対して単調増加の関数になる。そこで、文字Ｍjが入力アクセント句候補の先頭文字である場合には、上記関数に基づいて上記声立て確率が得られる。例えば、上記入力アクセント句候補「なーんちゃってぇー」の場合には７モーラであるから、「な」で声立て成分が開始される可能性が高くなる。尚、当該文字Ｍjが入力アクセント句候補の先頭文字でない場合には、変数ａ3には「０」が代入される。
【００５９】
ステップＳ26で、文字Ｍjが入力アクセント句候補中において占める位置に基づく声立て成分の開始確率が変数ａ4に代入される。注目文字Ｍjが入力アクセント句候補の先頭であれば声立て成分が開始される可能性が高くなり、末尾に近づく程低くなるので、先頭からの位置に対する上記声立て確率は単調減少の関数になる。したがって、この関数に基づいて、注目文字Ｍjにおける上記声立て確率が求められるのである。すなわち、上記入力アクセント句候補「なーんちゃってぇー」の場合には、「な」で声立て成分が開始される確率は高いが、「ちゃ」で声立て成分が開始される確率は低くなる。
【００６０】
ステップＳ27で、上述のようにして上記ステップＳ23〜ステップＳ26において求められた変数ａ1〜ａ4に重み係数ｂ1〜ｂ4が乗じられて加算され、変数Ａに代入される。ステップＳ28で、変数Ａの値が所定値βよりも大きいか否かが判別される。その結果、Ａ＞βであればステップＳ29に進み、Ａ≦βであればステップＳ30に進む。ステップＳ29で、文字列Ｍ1〜Ｍ(j-1)に対して声立て成分が与えられる。そうした後にステップＳ31に進む。ステップＳ30で、文字列Ｍ1〜Ｍ(j-1)に対して声立て成分が与えられない。
【００６１】
ステップＳ31で、上記モーラ番号ｊが、上記入力アクセント句候補の総モーラ数Ｎ2よりも小さいか否かが判別される。その結果、総モーラ数Ｎ2よりも小さければステップＳ32に進み、総モーラ数Ｎ2以上であればアクセント句生成処理動作を終了する。ステップＳ32で、モーラ番号ｊがインクリメントされる。そうした後、上記ステップＳ22に戻り、次のモーラに該当する文字に対する処理に移行する。そして、上記ステップＳ31においてモーラ番号ｊが総モーラ数Ｎ2以上であると判別されると、アクセント句生成処理動作を終了するのである。
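The per-mora loop of steps S21 to S32 can be sketched as below. The description fixes only the structure: four values a1 to a4, weights b1 to b4, and a threshold β. Everything else here is an assumption for illustration: the function name, the table contents, the concrete monotonic functions chosen for a3 and a4, the handling of the first mora (which has no predecessor) with a `None` key, and the numeric weights and threshold.

```python
def onset_positions(morae, chain_onset, likelihood_branch, weights, beta):
    """Return 0-based indices of morae at which a new voice component would start."""
    n = len(morae)
    starts = []
    for j, mora in enumerate(morae):                  # S21, S22, S31, S32
        prev = morae[j - 1] if j > 0 else None
        a1 = chain_onset.get((prev, mora), 0.0)       # S23: kana-chain onset probability
        a2 = 1.0 / likelihood_branch[j]               # S24: reciprocal of the analysis-likelihood branch probability
        a3 = min(1.0, n / 10.0) if j == 0 else 0.0    # S25: grows with mora count, first mora only
        a4 = 1.0 - j / n                              # S26: falls with distance from the phrase start
        score = sum(b * a for b, a in zip(weights, (a1, a2, a3, a4)))  # S27: A = Σ b_k a_k
        if score > beta:                              # S28
            starts.append(j)
    return starts

# The 7-mora candidate from the description, with hypothetical table values.
morae = ["な", "ー", "ん", "ちゃ", "っ", "てぇ", "ー"]
chain_onset = {(None, "な"): 0.9, ("ん", "ちゃ"): 0.05}
likelihood_branch = [0.8] * len(morae)  # high branch probability, so a2 stays small
starts = onset_positions(morae, chain_onset, likelihood_branch,
                         weights=(0.4, 0.1, 0.3, 0.2), beta=0.6)
```

With these toy values only the first mora 「な」 exceeds β, so no second voice component starts at 「ちゃ」, which is the behaviour the embodiment aims for.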
【００６２】
このように、上記第２アクセント句生成部２４は、入力アクセント句候補の仮名連鎖に基づく上記声立て確率、１/解析尤度分岐確率、モーラ数に基づく上記声立て確率、アクセント句候補中に占める位置に基づく上記声立て確率に基づいて、入力アクセント句候補に対して新たに声立て成分開始位置を設定するか否かを判定するようにしている。したがって、話し言葉のテキストに基づくアクセント句候補「なーんちゃってぇー」が入力された場合には、文字列「ちゃっ」に関する仮名連鎖に基づく上記声立て確率,１/解析尤度分岐確率,モーラ数に基づく上記声立て確率およびアクセント句候補中に占める位置に基づく上記声立て確率の値は何れも小さく、文字列「ちゃっ」で声立て成分が開始されることはない。こうして、声立て成分が２つに別れて不自然なピッチパターンの要因になることが抑制されるのである。
【００６３】
上述したように、本実施の形態においては、テキスト解析部２１によるテキスト解析結果に基づいてアクセント句を生成する第１アクセント句生成部２３に加えて、上記テキスト解析結果に依存せずにアクセント句を生成する第２アクセント句生成部２４を設けている。そして、アクセント句生成判定部２２によって、上記テキスト解析結果に基づいて、入力テキストが書き言葉である場合には、アクセント句の生成を第１アクセント句生成部２３で行なうと判定する。一方、話し言葉である場合には、第２アクセント句生成部２４で行なうと判定するようにしている。
【００６４】
したがって、入力テキストが、文法に則っていない話し言葉「なーんちゃってぇー」である場合には、第２アクセント句生成部２４によって、上記テキスト解析結果に依存せずにアクセント句を生成することができる。その結果、テキスト解析部２１による誤ったテキスト解析結果に基づいてアクセント句が生成された場合のように「ちゃっ」のところで次の声立て成分が開始されることはなく、不自然なピッチパターンが生成されることを防止できるのである。
【００６５】
その際に、上記アクセント句生成判定部２２は、２つの仮名文字の連鎖と第２アクセント句生成部２４での処理に分岐すべき確率とを対応付けた仮名連鎖分岐確率テーブルと、品詞条件とその品詞条件を満たす場合に第２アクセント句生成部２４での処理に分岐すべき確率とを対応付けた解析尤度分岐確率テーブルとを参照して、第１アクセント句生成部２３で処理するか第２アクセント句生成部２４で処理するかを判定するようにしている。したがって、話し言葉特有の仮名文字列情報および品詞条件に基づいて、的確に第２アクセント句生成部２４で処理するか否かを判定することができるのである。
【００６６】
また、上記第２アクセント句生成部２４は、上記アクセント句生成判定部２２から入力されたアクセント句候補の仮名連鎖に基づく上記声立て確率,１/解析尤度分岐確率,モーラ数に基づく上記声立て確率,アクセント句候補中に占める位置に基づく上記声立て確率に基づいて、入力アクセント句候補に対して新たに声立て成分開始位置を設定するか否かを判定するようにしている。したがって、例えば話し言葉のように文法では規定できないテキストが入力された場合でも、誤ったテキスト解析結果に基づいて不自然な声立てが与えられることが抑制されて、自然な韻律が生成されるのである。
【００６７】
尚、上記実施の形態においては、アクセント句の生成を第１アクセント句生成部２３で行なうか第２アクセント句生成部２４で行なうかのアクセント句生成判定部２２による判定を、書き言葉であるか話し言葉であるかによって行う場合を例に説明しているが、この発明はこれに限定されるものではない。要は、テキスト解析によって誤解析が生ずるような文法では規定できない文章を第２アクセント句生成部２４で処理すると判定すればよいのである。
【００６８】
上述したような話し言葉によるテキスト入力は、携帯端末器によるメール文の入力時によく行われる。そして、上記携帯端末器においては、画面における表示文字数に制限があるため、受信したメール文を合成音声によって出力することが望ましい。そこで、上記実施の形態で述べたような音声合成装置を上記携帯端末器に搭載することによって、携帯端末器の機能を大幅に向上することができるのである。
【００６９】
ところで、上記実施の形態におけるテキスト解析部２１,アクセント句生成判定部２２,第１アクセント句生成部２３および第２アクセント句生成部２４による上記テキスト解析手段,アクセント句生成判定手段,第１アクセント句生成手段および第２アクセント句生成手段としての機能は、プログラム記録媒体に記録された音声合成処理プログラムによって実現される。上記実施の形態における上記プログラム記録媒体は、ＲＯＭ(リード・オンリ・メモリ)でなるプログラムメディアである。または、外部補助記憶装置に装着されて読み出されるプログラムメディアであってもよい。尚、何れの場合においても、上記プログラムメディアから音声合成処理プログラムを読み出すプログラム読み出し手段は、上記プログラムメディアに直接アクセスして読み出す構成を有していてもよいし、ＲＡＭ(ランダム・アクセス・メモリ)に設けられたプログラム記憶エリア(図示せず)にダウンロードし、上記プログラム記憶エリアにアクセスして読み出す構成を有していてもよい。尚、上記プログラムメディアからＲＡＭの上記プログラム記憶エリアにダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。
【００７０】
ここで、上記プログラムメディアとは、本体側と分離可能に構成され、磁気テープやカセットテープ等のテープ系、フロッピーディスク,ハードディスク等の磁気ディスクやＣＤ(コンパクトディスク)‐ＲＯＭ,ＭＯ(光磁気)ディスク,ＭＤ(ミニディスク),ＤＶＤ(ディジタルビデオディスク)等の光ディスクのディスク系、ＩＣ(集積回路)カードや光カード等のカード系、マスクＲＯＭ,ＥＰＲＯＭ（紫外線消去型ＲＯＭ),ＥＥＰＲＯＭ(電気的消去型ＲＯＭ),フラッシュＲＯＭ等の半導体メモリ系を含めた、固定的にプログラムを坦持する媒体である。
【００７１】
また、上記実施の形態における音声合成装置は、モデムを備えてインターネットを含む通信ネットワークと接続可能な構成を有している場合には、上記プログラムメディアは、通信ネットワークからのダウンロード等によって流動的にプログラムを坦持する媒体であっても差し支えない。尚、その場合における上記通信ネットワークからダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。あるいは、別の記録媒体からインストールされるものとする。
【００７２】
尚、上記記録媒体に記録されるものはプログラムのみに限定されるものではなく、データも記録することが可能である。
【００７３】
【発明の効果】
以上より明らかなように、第１の発明の音声合成装置は、テキスト解析結果のうちの品詞付き単語に基づいてアクセント句を生成する第１アクセント句生成手段と上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなくアクセント句を生成する第２アクセント句生成手段とを有し、アクセント句生成判定手段によって、アクセント句の生成を上記第１アクセント句生成手段で行うか第２アクセント句生成手段で行うかを判定するので、例えば話し言葉のようにテキスト解析の際に誤解析され易い入力テキストに関するアクセント句を、上記第２アクセント句生成手段によって、テキスト解析結果のうちの品詞付き単語に囚われることなく生成することが可能になる。
【００７４】
したがって、この発明によれば、話し言葉のように文法では規定できないテキストに対して自然なピッチパターンを付与することが可能になり、不自然な韻律を抑制することが可能になる。
【００７５】
さらに、上記アクセント句生成判定手段は、上記判定の基準として、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表す仮名連鎖情報、および、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表すテキスト解析尤度情報の少なくとも一つを用いるように成したので、話し言葉のように文法では規定できないテキストに基づくアクセント句の生成は上記第２アクセント句生成手段によって行うべきと、的確に判定することができる。
【００７６】
また、第１の実施例は、上記第２アクセント句生成手段を、生成するアクセント句における声立て成分の開始位置を、仮名連鎖情報,テキスト解析尤度情報,アクセント句候補のモーラ数およびアクセント句候補中の位置の少なくとも一つを用いて設定するようにしたので、テキスト解析結果のうちの品詞付き単語に囚われることなく正しくアクセント句を生成することができる。したがって、話し言葉のように文法では規定できないような入力テキストが与えられても、不自然なピッチパターンの生成を抑制して自然な韻律を生成することができる。
【００７７】
また、第２の実施例は、上記第２アクセント句生成手段によって、テキストデータに基づいて予め求められた連続する二つの仮名文字の間で声立て成分が開始される確率である仮名連鎖情報、上記テキスト解析尤度分岐確率の逆数の値で与えられる声立て成分が開始される確率であるテキスト解析尤度情報、アクセント句候補の先頭文字に上記アクセント句候補モーラ数に応じて与えられる声立て成分が開始される確率であるアクセント句候補のモーラ数、および、上記アクセント句候補中で文字が占める位置に基づいて与えられる声立て成分が開始される確率であるアクセント句候補中の位置、の少なくとも一つを用いて、生成するアクセント句における声立て成分の開始位置が設定される。したがって、話し言葉のように文法では規定できないような入力テキストが与えられても、不自然なピッチパターンの生成を抑制してより自然な韻律を生成することができる。
【００７８】
また、第２の発明の音声合成方法は、入力テキストに基づくアクセント句の生成を、テキスト解析結果のうちの品詞付き単語に基づいて生成するか上記テキスト解析結果に基づいて且つ上記品詞付き単語に囚われることなく生成するかを、仮名文字連鎖が話し言葉のテキストコーパスに属する確率であって、上記第２アクセント句生成手段によるアクセント句生成への分岐確率を表す仮名連鎖分岐確率、および、品詞条件に応じて予め設定されて、上記第２アクセント句生成部によるアクセント句生成への分岐確率を表すテキスト解析尤度分岐確率、の少なくとも一つを用いて予め判定し、その判定結果に従って上記アクセント句を生成するので、例えば話し言葉のようにテキスト解析の際に誤解析され易い入力テキストに関するアクセント句を、テキスト解析結果のうちの上記品詞付き単語に囚われることなく生成することが可能になる。
【００７９】
また、第３の発明の携帯端末器は、話し言葉のように文法では規定できない入力テキストに対して自然なアクセント句を与えることができる上記第１の発明の音声合成装置を搭載したので、日常会話で使う言葉で書かれたメール文を受信した場合でも合成音声によって正確に出力することが可能になり、携帯端末器の操作性を向上することができる。
【００８０】
また、第４の発明の音声合成プログラムは、コンピューターを、上記第１の発明におけるテキスト解析手段,韻律生成手段,音声合成手段,アクセント句生成判定手段,第１アクセント句生成手段および第２アクセント句生成手段として機能させる。また、第５の発明のプログラム記録媒体は、上記第４の発明の音声合成プログラムを記録している。したがって、上記第１の発明の場合と同様に、話し言葉のようにテキスト解析手段で誤解析され易い入力テキストに関するアクセント句を、上記第２アクセント句生成手段によってテキスト解析結果のうちの品詞付き単語に囚われることなく生成することが可能になる。
【図面の簡単な説明】
【図１】この発明の音声合成装置におけるブロック図である。
【図２】図１に示す音声合成装置によって行なわれるアクセント句生成処理動作のフローチャートである。
【図３】仮名連鎖分岐確率テーブルの一例を示す図である。
【図４】解析尤度分岐確率テーブルの一例を示す図である。
【図５】図１における第２アクセント句生成部によって行われるアクセント句生成処理動作のフローチャートである。
【図６】ピッチパターンを求める過程を示す図である。
【図７】図１における第１アクセント句生成部によって話し言葉に基づいてピッチパターンを生成する過程を示す図である。
【図８】図１における第２アクセント句生成部によって話し言葉に基づいてピッチパターンを生成する過程を示す図である。
【図９】従来の音声合成装置のブロック図である。
【図１０】図９とは異なる従来の音声合成装置のブロック図である。
【符号の説明】
２１…テキスト解析部、
２２…アクセント句生成判定部、
２３…第１アクセント句生成部、
２４…第２アクセント句生成部、
２５…韻律生成部、
２６…音声合成部。
[0001]
[Technical Field of the Invention]
  The present invention relates to a speech synthesizer and a speech synthesis method for synthesizing speech from character information, a portable terminal device, a speech synthesis program, and a program recording medium.
[0002]
[Prior art]
  Conventionally, as text-to-speech synthesis for synthesizing speech from character information, a method of sequentially performing three processes of text analysis, prosody generation, and speech synthesis is known. FIG. 9 shows a block diagram of a conventional speech synthesizer.
[0003]
  The text analysis unit 1 performs the text analysis process: it detects word boundaries in the input character information and obtains a phoneme symbol string for each word. The prosody generation unit 2 performs the prosody generation process and assigns prosodic information such as the duration of each obtained phoneme, word accent, and sentence intonation. The speech synthesis unit 3 performs the speech synthesis process and generates a control signal for the speech synthesizer based on synthesis units and rules stored in advance.
[0004]
  Hereinafter, the text-to-speech synthesis method will be described in detail, taking a Japanese text-to-speech synthesis device as an example. Unlike languages such as English that separate words with spaces, Japanese is a so-called agglutinative language in which word boundaries are not explicit, so text analysis processing is performed to detect the word boundaries. This text analysis processing is executed by matching the sentence sequentially from its beginning against a dictionary that stores the notation and reading of words and a grammar that stores the connection relations between words.
[0005]
  Words are divided into independent words (自立語), such as nouns and verbs, and ancillary words (付属語), such as particles and auxiliary verbs. For example, the sentence 「今日は天気です。」 ("The weather is fine today.") is analyzed as follows.
    「今日 (noun) / は (particle) / 天気 (noun) / です (auxiliary verb)。」
[0006]
  A general text-to-speech synthesis method performs prosody generation processing and speech synthesis processing based on such a text analysis result. The details of the prosody generation processing and speech synthesis processing are as described in, for example, "Digital Speech Processing" by Furui (Tokai University Press).
[0007]
[Problems to be solved by the invention]
  However, the conventional text-to-speech synthesis method has the following problem. In recent years, with the spread of the Internet and the like, digitized character information has come into everyday use in society at large. In particular, text written in the words of daily conversation, such as e-mail messages, is increasing. Because the so-called spoken language used in daily conversation is so varied in expression, it is difficult to capture it in grammar rules.
[0008]
  When spoken language that cannot be defined by the grammar is given as input text, text analysis often fails. Since the prosody generation process is performed on the premise that the text analysis result is correct, an unnatural prosody is then generated.
[0009]
  For example, suppose that the spoken-language sentence 「見たことなーい」 ("I've never seen it") is analyzed as follows by the text analysis process.
    「見 (verb) / たこ (noun) / となー (noun) / い (noun)」
When the prosody generation process is performed on this text analysis result, the misanalysis into the verb 「見る」 ("to see") and the noun 「たこ」 ("octopus") causes a voice component (声立て成分) to start at the syllable 「た」, which results in an unnatural accent.
[0010]
  To solve this problem, Japanese Patent Laid-Open No. 11-259094 proposes a speech synthesizer whose block diagram is shown in FIG. 10. In FIG. 10, the text analysis unit 11, the prosody generation unit 12, and the speech synthesis unit 13 are the same as the text analysis unit 1, the prosody generation unit 2, and the speech synthesis unit 3 in FIG. 9. This speech synthesizer additionally has a prosody editing unit 14 that modifies the prosodic information assigned to a character string selected by the user, in accordance with the user's instructions. Therefore, when the prosody generation unit 12 generates an unnatural prosody because of a misanalysis by the text analysis unit 11 or the like, the user can correct the unnatural part with the editing function of the prosody editing unit 14 and obtain natural speech.
[0011]
  However, the speech synthesizer described in the above-mentioned Japanese Patent Application Laid-Open No. 11-259094 has a problem that it is necessary for the user to make corrections manually, which places a burden on the user.
[0012]
  Accordingly, an object of the present invention is to provide a speech synthesizer and a speech synthesis method that can automatically suppress unnatural prosody caused by expressions outside the grammar rules, such as those appearing in spoken language, as well as a portable terminal equipped with this speech synthesizer, a speech synthesis program, and a program recording medium.
[0013]
[Means for Solving the Problems]
  In order to achieve the above object, the first invention provides:
  in a speech synthesizer having text analysis means for analyzing input text, prosody generation means for generating prosodic information based on the text analysis result, and speech synthesis means for synthesizing speech based on the text analysis result and the prosodic information,
  first accent phrase generation means for generating an accent phrase based on the part-of-speech-tagged words in the text analysis result and sending it to the prosody generation means;
  second accent phrase generation means for generating an accent phrase based on the text analysis result, without being bound by the part-of-speech-tagged words, and sending it to the prosody generation means; and
  accent phrase generation determination means for determining, based on the text analysis result and using at least one of a kana chain branch probability and a text analysis likelihood branch probability, which of the first accent phrase generation means and the second accent phrase generation means is to generate the accent phrase,
wherein
  the kana chain branch probability is the probability that a kana character chain belongs to a spoken-language text corpus, and represents the probability of branching to accent phrase generation by the second accent phrase generation means, and
  the text analysis likelihood branch probability is preset according to part-of-speech conditions, and represents the probability of branching to accent phrase generation by the second accent phrase generation unit.
This is the characterizing feature.
[0014]
  According to the above configuration, the accent phrase generation determination means determines in advance whether the accent phrase for the input text is to be generated by the first accent phrase generation means, which works from the part-of-speech-tagged words in the text analysis result, or by the second accent phrase generation means, which works from the text analysis result without being bound by those words. Therefore, an accent phrase for input text that the text analysis means is likely to misanalyze, such as spoken language, can be generated by the second accent phrase generation means without being bound by the part-of-speech-tagged words in the text analysis result.
[0015]
  Furthermore, the accent phrase generation determination means chooses between the first accent phrase generation means and the second accent phrase generation means on the basis of at least one of the following: kana chain information, which is the probability that a kana character chain belongs to the spoken-language text corpus and represents the probability of branching to accent phrase generation by the second accent phrase generation means; and text analysis likelihood information, which is preset according to part-of-speech conditions and represents the probability of branching to accent phrase generation by the second accent phrase generation unit. Therefore, it is accurately determined that accent phrase generation for input text in spoken language should be performed by the second accent phrase generation means.
[0016]
  In a first embodiment, in the speech synthesizer of the first invention,
  the second accent phrase generation means sets the start position of the voice component in the accent phrase to be generated, using at least one of the kana chain information, the text analysis likelihood information, the mora count of the accent phrase candidate, and the position within the accent phrase candidate.
This is the characterizing feature.
[0017]
  According to this embodiment, the second accent phrase generation means sets the start position of the voice component in the accent phrase to be generated, using at least one of the kana chain information, the text analysis likelihood information, the mora count of the accent phrase candidate, and the position within the accent phrase candidate. In this way, the accent phrase is generated correctly without being bound by the part-of-speech-tagged words. That is, even when input text that cannot be defined by the grammar, such as spoken language, is given, generation of an unnatural pitch pattern is suppressed and a natural prosody is generated.
[0018]
  In a second embodiment, in the speech synthesizer of the first invention,
  the kana chain information is the probability, obtained in advance from text data, that a voice component starts between two consecutive kana characters,
  the text analysis likelihood information is a probability that a voice component starts, given by the reciprocal of the text analysis likelihood branch probability,
  the mora count of the accent phrase candidate is a probability that a voice component starts, assigned to the first character of the accent phrase candidate according to the mora count of that candidate, and
  the position within the accent phrase candidate is a probability that a voice component starts, assigned according to the position the character occupies in the accent phrase candidate.
This is the characterizing feature.
[0019]
  According to this embodiment, the second accent phrase generation means sets the start position of the voice component in the generated accent phrase using at least one of the following: the kana chain information, which is the probability, obtained in advance from text data, that a voice component is started between two consecutive kana characters; the text analysis likelihood information, which is the probability that a voice component is started, given by the reciprocal of the text analysis likelihood branch probability; the number of mora of the accent phrase candidate, which is the probability that a voice component is started at the first character of the accent phrase candidate, given according to the number of mora of the candidate; and the position in the accent phrase candidate, which is the probability that a voice component is started, given based on the position the character occupies in the accent phrase candidate.
[0020]
  In addition, the second invention,
  In a speech synthesis method for analyzing input text, generating prosodic information based on the text analysis result, and synthesizing speech based on the text analysis result and the prosodic information,
  A first accent phrase generating step for generating a first accent phrase to be used when generating the prosodic information based on a part-of-speech word in the text analysis result;
  A second accent phrase generation step for generating a second accent phrase to be used when generating the prosodic information based on the text analysis result and without being bound by the part-of-speech word;
  An accent phrase generation determination step for determining, based on the text analysis result and using at least one of a kana chain branch probability and a text analysis likelihood branch probability, which of the first accent phrase and the second accent phrase is to be generated
With,
  The kana chain branch probability is the probability that the kana character chain belongs to the spoken text corpus, and represents the branch probability to the accent phrase generation by the second accent phrase generation means,
  The text analysis likelihood branch probability is preset according to the part-of-speech condition and represents the branch probability to the accent phrase generation by the second accent phrase generation unit.
It is characterized by that.
[0021]
  According to the above configuration, whether the accent phrase for the input text is generated based on the part-of-speech words in the text analysis result, or generated based on the text analysis result without being bound by the part-of-speech words, is determined in advance using at least one of the kana chain branch probability and the text analysis likelihood branch probability. The kana chain branch probability is the probability that a kana character chain belongs to the spoken-language text corpus and represents the probability of branching to accent phrase generation by the second accent phrase generation means. The text analysis likelihood branch probability is preset according to part-of-speech conditions and likewise represents the probability of branching to accent phrase generation by the second accent phrase generation means. Therefore, an accent phrase for input text that is likely to be misanalyzed during text analysis, such as spoken language, can be generated without being bound by the part-of-speech words in the text analysis result.
[0022]
  According to a third aspect of the present invention, there is provided a portable terminal equipped with the speech synthesizer of the first aspect.
[0023]
  According to the above configuration, the mobile terminal is equipped with a speech synthesizer that can give natural accent phrases to input text that cannot be defined by grammar, such as spoken language. Therefore, even when an e-mail sentence written in words used in daily conversation is received, it is possible to output it accurately with synthesized speech, and the operability of the portable terminal is improved.
[0024]
  The speech synthesis program according to the fourth invention is characterized by causing a computer to function as the text analysis means, prosody generation means, speech synthesis means, accent phrase generation determination means, first accent phrase generation means, and second accent phrase generation means of the first invention.
[0025]
  A program recording medium according to a fifth aspect is characterized in that the speech synthesis program according to the fourth aspect is recorded.
[0026]
  According to the configurations of the fourth and fifth inventions, as in the case of the first invention, an accent phrase for input text that is easily misanalyzed by the text analysis means, such as spoken language, can be generated by the second accent phrase generation means without being bound by the part-of-speech words in the text analysis result.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. FIG. 1 is a block diagram of the speech synthesizer according to the present embodiment. The text analysis unit 21 analyzes the input text to detect word boundaries and obtains the phoneme symbol string of each word. The accent phrase generation determination unit 22 determines, based on the text analysis result, whether accent phrase generation is performed by the first accent phrase generation unit 23 or by the second accent phrase generation unit 24. The first accent phrase generation unit 23 generates an accent phrase based on the text analysis result. The second accent phrase generation unit 24, on the other hand, generates an accent phrase without depending on the text analysis result.
[0028]
  The prosody generation unit 25 assigns prosodic information such as phoneme duration, accent nucleus position, and sentence intonation to each accent phrase generated by the first accent phrase generation unit 23 or the second accent phrase generation unit 24. The speech synthesis unit 26 generates a control signal for the speech synthesizer from the assigned prosodic information, based on synthesis units and rules stored in advance.
[0029]
  Details of the text analysis unit 21, the first accent phrase generation unit 23, the prosody generation unit 25, and the speech synthesis unit 26 are as described in, for example, “Digital Speech Processing” by Furui (Tokai University Press). Only a brief explanation of the terminology is given here.
[0030]
  A unit that has one accent nucleus is called an accent phrase. Here, the accent nucleus means the position in each word where the pitch of the voice shifts from high to low. The voice is high at the beginning of an utterance but gradually falls as the subglottic pressure decreases. This characteristic, in which the pitch (fundamental frequency) decreases with time, is called the voice component. The pitch pattern of the whole sentence is determined by superimposing on this characteristic the word- and phrase-specific pitch pattern (accent component) determined by the accent. FIG. 6 shows the process for obtaining the pitch pattern.
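As a rough illustration of the superposition described above, the following Python sketch adds an accent component to a declining voice component. The linear decline, the rectangular accent shape, the function names, and all numeric values are assumptions made for illustration only; the specification gives no formulas for these components.

```python
def voice_component(n_mora, start_hz=200.0, decline_hz_per_mora=5.0):
    """Baseline pitch that falls gradually over time (assumed linear here)."""
    return [start_hz - decline_hz_per_mora * i for i in range(n_mora)]


def accent_component(n_mora, rise_index, nucleus_index, height_hz=30.0):
    """Raised pitch from rise_index through the accent nucleus, low afterwards
    (assumed rectangular shape)."""
    return [height_hz if rise_index <= i <= nucleus_index else 0.0
            for i in range(n_mora)]


def pitch_pattern(n_mora, rise_index, nucleus_index):
    """Superimpose the accent component on the voice component."""
    base = voice_component(n_mora)
    accent = accent_component(n_mora, rise_index, nucleus_index)
    return [b + a for b, a in zip(base, accent)]


if __name__ == "__main__":
    # Five mora, pitch rises at index 1 and the accent nucleus is at index 2.
    print(pitch_pattern(5, rise_index=1, nucleus_index=2))
```

In this sketch a second voice component starting mid-phrase would reset the baseline to its initial height, which is the kind of discontinuity the embodiment aims to avoid for spoken-language input.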
[0031]
  In the following, for ease of explanation, it is assumed that the accent phrase generation determination unit 22 examines the text analysis result and sends the analysis result data to the first accent phrase generation unit 23 if the text is written language, and to the second accent phrase generation unit 24 if it is spoken language. However, the present invention is not limited to this. For convenience of explanation, the problems that arise when the first accent phrase generation unit 23 processes spoken language are described first. The function of the first accent phrase generation unit 23 is one performed by either or both of the text analysis units 1 and 11 and the prosody generation units 2 and 12 in the conventional speech synthesizers shown in the drawings. Next, the processing of the accent phrase generation determination unit 22 is described, and finally the processing of the second accent phrase generation unit 24.
[0032]
  The first accent phrase generation unit 23 generates an accent phrase based on the result of text analysis by the text analysis unit 21, which is the technique generally used in the past, as described above. As an example, for the sentence “Today is the weather”, which conforms to the grammar storing the word connection relation information, the first accent phrase generation unit 23 generates the pitch pattern according to the procedure shown in FIG. 6, as described above. Thus, if a sentence conforming to the grammar is analyzed correctly, the first accent phrase generation unit 23 generates the pitch pattern without any problem.
[0033]
  Here, if the pitch pattern of the sentence “Nanchattee”, which does not conform to the grammar, is generated by the first accent phrase generation unit 23, the result is as follows. The text analysis result from the text analysis unit 21 is “na-” (particle: final particle) / “n” (particle: case particle) / “chat” (verb: godan conjugation, wa row) / “te” (particle: conjunctive particle) / “e” (unknown word) / “-” (unknown word). Because it is judged that there is an accent phrase boundary between “n” and “cha”, the next voice component is started at “cha”. This appears in FIG. 7 as the voice component being divided into two, which causes an unnatural pitch pattern.
[0034]
  Therefore, the speech synthesizer in the present embodiment is provided with the second accent phrase generation unit 24, which suppresses the start of the next voice component at an accent phrase delimiter as in the example above. Further, by suppressing the accent component, the variation of the pitch pattern is kept small so that the result does not go badly wrong.
[0035]
  If the text analysis by the text analysis unit 21 is performed correctly, the first accent phrase generation unit 23 is sufficient. However, current text analysis processing remains imperfect: it can misplace delimiter positions, misjudge parts of speech, or mishandle unknown words that are not registered in the dictionary. In particular, it is difficult to obtain accurate information for assigning prosodic information from input text that departs from grammatical rules, such as spoken language. In other words, even if one tries to describe a spoken expression such as “Nanchattee” in a dictionary or grammar, it has so many variations that it is harder to regularize than written language.
[0036]
  The features of spoken language appear in the kana character string. In this embodiment, these features of the kana character string are captured in order to suppress an unnatural prosody. For example, in the case of “Nanchattee”, the text analysis result that “chat” is a verb is not used, so a pitch pattern for the single accent phrase “Nanchattee” can be generated.
[0037]
  Next, the accent phrase generation determination unit 22, which determines that “Nanchattee” should be processed as one accent phrase by the second accent phrase generation unit 24, will be described. In general, the text analysis result for written language is a sequence of independent words and attached words. When spoken language is analyzed, by contrast, misanalysis produces phrases that contain no independent word, or words that are judged to be unknown words not registered in the dictionary. Therefore, these phenomena are detected: if the text analysis result is judged to be reliable, the first accent phrase generation unit 23 performs the accent phrase generation processing; otherwise, the second accent phrase generation unit 24 performs it.
[0038]
  Therefore, when processing is performed by the second accent phrase generation unit 24, the unit to be treated as an accent phrase must be determined in advance. In that case, because the text analysis result from the text analysis unit 21 has low reliability, its delimiter positions and part-of-speech information are not used. Instead, since words judged to be unknown words and portions containing the small kana “e” or the long-vowel symbol “-” are highly likely to be spoken language, a wide span is gathered into a single accent phrase rather than being cut into small accent phrases.
[0039]
  In this way, by using the information that the text analysis result includes an unknown word or that a character peculiar to spoken language is present, the accent phrase generation determination unit 22 can determine whether the input character string is written language or spoken language, that is, whether it should be processed by the first accent phrase generation unit 23 or by the second accent phrase generation unit 24.
[0040]
  FIG. 2 shows a flowchart of the accent phrase generation processing operation performed by the text analysis unit 21, the accent phrase generation determination unit 22, the first accent phrase generation unit 23, and the second accent phrase generation unit 24. Hereinafter, the accent phrase generation processing operation will be described concretely, taking as examples the ordinary text “Today is the weather”, which is processed by the first accent phrase generation unit 23, and the spoken-language text “Nanchattee”, which is processed by the second accent phrase generation unit 24.
[0041]
  In step S1, the text analysis unit 21 performs text analysis processing on the input text. In step S2, an initial value “1” is set to the word number i. In step S3, it is determined whether or not the word number i is larger than the word number N1 of the input text based on the text analysis processing result. As a result, if it is larger than N1, the accent phrase generation processing operation is terminated. On the other hand, if N1 or less, the process proceeds to step S4. In step S4, the i-th word is read and assigned to the variable Ti. In step S5, it is determined whether or not there is a continuous kana string in the word Ti. As a result, if present, the process proceeds to step S6. On the other hand, if not, the process proceeds to step S9. In step S6, the kana chain branch probability table is referenced.
[0042]
  Here, the kana chain branch probability is the probability with which, when two kana characters, a first character Wi and a second character Wj, appear in succession, it is determined to branch to the processing in the second accent phrase generation unit 24 (that is, the probability of being spoken language). It is obtained in advance and stored in the kana chain branch probability table, which is constructed as follows.
[0043]
  From a large amount of text data, the probabilities P(Wi, Wj, L1) and P(Wi, Wj, L2) that an arbitrary hiragana character chain Wi, Wj appears in the written-language text corpus L1 and the spoken-language text corpus L2, respectively, are obtained in advance. Then the probability R(Wi, Wj) that the hiragana character chain Wi, Wj, when it appears, belongs to the spoken-language text corpus L2 is obtained by the following equation:
    R(Wi, Wj) = P(Wi, Wj, L2) / {P(Wi, Wj, L1) + P(Wi, Wj, L2)}
  The probability R(Wi, Wj) thus obtained is stored in the table as a branch probability in association with the first character Wi and the second character Wj, which yields the kana chain branch probability table.
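The construction of the kana chain branch probability table can be sketched in Python as follows. This is a minimal sketch under stated assumptions: P(Wi, Wj, L) is estimated as the relative frequency of the adjacent kana pair within each corpus, only hiragana and the long-vowel symbol are counted, and the function names and toy corpora are hypothetical. The specification does not say how the probabilities are estimated.

```python
from collections import Counter

HIRAGANA_FIRST, HIRAGANA_LAST = "\u3041", "\u3096"
LONG_VOWEL_SYMBOL = "\u30fc"


def is_kana(ch):
    """True for hiragana and the long-vowel symbol (an assumed character set)."""
    return HIRAGANA_FIRST <= ch <= HIRAGANA_LAST or ch == LONG_VOWEL_SYMBOL


def chain_probabilities(corpus):
    """Relative frequency of each adjacent kana pair in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        for first, second in zip(sentence, sentence[1:]):
            if is_kana(first) and is_kana(second):
                counts[(first, second)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def branch_probability_table(written_corpus, spoken_corpus):
    """R(Wi, Wj) = P(Wi, Wj, L2) / (P(Wi, Wj, L1) + P(Wi, Wj, L2))."""
    p_l1 = chain_probabilities(written_corpus)  # written language, L1
    p_l2 = chain_probabilities(spoken_corpus)   # spoken language, L2
    table = {}
    for pair in set(p_l1) | set(p_l2):
        p1 = p_l1.get(pair, 0.0)
        p2 = p_l2.get(pair, 0.0)
        table[pair] = p2 / (p1 + p2)  # denominator > 0: pair occurs somewhere
    return table


if __name__ == "__main__":
    written = ["です", "です"]
    spoken = ["です", "なー"]
    for pair, r in sorted(branch_probability_table(written, spoken).items()):
        print(pair, round(r, 3))
```

A pair that occurs only in the spoken-language corpus receives R = 1, and a pair that occurs only in the written-language corpus receives R = 0, which matches the qualitative behaviour described for “na”, “-” and “de”, “su”.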
[0044]
  FIG. 3 shows an example of the kana chain branch probability table. For example, the first character “de”, the second character “su”, and the probability value R(de, su) that applies when this kana character chain appears are stored in association with one another. In this case, since the chain of the kana characters “de” and “su” is not peculiar to spoken language, the branch probability R(de, su) is small. On the other hand, the chain of the kana characters “na” and “-” is peculiar to spoken language, so the branch probability R(na, -) is large.
[0045]
  In step S7, the analysis likelihood branch probability table is referenced. Here, the analysis likelihood branch probability is the probability with which it is determined to branch to the processing in the second accent phrase generation unit 24 because the reliability of the text analysis result is low (that is, the probability of being spoken language). For example, if the part of speech is “unknown word”, the analysis likelihood branch probability is high, and for any other part of speech it is low. Further, when the sentence begins with an attached word, the reliability of the text analysis is considered to be low, so the analysis likelihood branch probability is high. The analysis likelihood branch probability is obtained by referring to the analysis likelihood branch probability table, which stores each part-of-speech condition in association with the branch probability with which it is determined to branch to the processing in the second accent phrase generation unit 24 when that condition is satisfied. FIG. 4 shows an example of the analysis likelihood branch probability table. For example, “desu” (rendered “is”) in “Today is the weather” is an auxiliary verb and therefore an attached word, but because it follows the noun “weather” it is not a sentence-initial attached word, and its analysis likelihood branch probability is small.
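The table lookup of step S7 can be sketched in Python as follows. The condition names, the set of attached-word parts of speech, and the probability values are hypothetical placeholders, since the contents of FIG. 4 are not reproduced in the text. Only the qualitative ordering (high for unknown words and sentence-initial attached words, low otherwise) comes from the description.

```python
# Parts of speech treated as attached words (assumed set for this sketch).
ATTACHED_WORD_POS = {"particle", "auxiliary verb"}

# Hypothetical analysis likelihood branch probability table:
# part-of-speech condition -> probability of branching to unit 24.
ANALYSIS_LIKELIHOOD_TABLE = {
    "unknown word": 0.9,
    "sentence-initial attached word": 0.8,
    "other": 0.1,
}


def analysis_likelihood_branch_probability(pos, is_sentence_initial):
    """Return the branch probability for the first matching condition."""
    if pos == "unknown word":
        return ANALYSIS_LIKELIHOOD_TABLE["unknown word"]
    if is_sentence_initial and pos in ATTACHED_WORD_POS:
        return ANALYSIS_LIKELIHOOD_TABLE["sentence-initial attached word"]
    return ANALYSIS_LIKELIHOOD_TABLE["other"]


if __name__ == "__main__":
    # "desu" follows a noun, so it is not sentence-initial: low value.
    print(analysis_likelihood_branch_probability("auxiliary verb", False))
    # A particle at the head of the sentence: high value.
    print(analysis_likelihood_branch_probability("particle", True))
```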
[0046]
  In step S8, a branch probability is calculated based on the kana chain branch probability value obtained in step S6 and the analysis likelihood branch probability value obtained in step S7. In step S9, it is determined whether or not an accent phrase is formed. As a result, if an accent phrase is formed, the process proceeds to step S10, whereas if not formed, the process proceeds to step S13. In step S10, it is determined whether the branch probability is greater than a predetermined value α. As a result, if larger than the predetermined value α, the process proceeds to step S11, and if it is equal to or smaller than the predetermined value α, the process proceeds to step S12. In step S11, the second accent phrase generation unit 24 generates an accent phrase. After that, the process proceeds to step S13. In step S12, an accent phrase is generated by the first accent phrase generation unit 23 based on the text analysis result. In step S13, the word number i is incremented. After that, the process returns to step S3, and the process proceeds to the next word number i. When it is determined in step S3 that the word number i is larger than the number N1 of words in the input text, the accent phrase generation processing operation is terminated.
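The loop of steps S3 to S13 can be sketched in Python as follows. Several points are assumptions, because the text does not fix them: the two probabilities are combined by an arithmetic mean, the most recently computed branch probability is the one compared with α, the branch probability is reset to 0 after each accent phrase, and the step S9 decision (whether a word closes an accent phrase) is supplied by the caller. The `Word` class, the romanized surfaces, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    surface: str
    kana_chain_prob: Optional[float]  # None: no consecutive kana string (S5 "no")
    analysis_prob: float              # analysis likelihood branch probability
    closes_phrase: bool               # outcome of step S9, supplied by the caller


def combine(kana_chain_prob: float, analysis_prob: float) -> float:
    """Step S8. The combination rule is an assumption (arithmetic mean)."""
    return (kana_chain_prob + analysis_prob) / 2.0


def dispatch(words: List[Word], alpha: float) -> List[Tuple[str, str]]:
    """Return (accent phrase, generating unit) pairs for the word sequence."""
    results = []
    pending = []   # words collected for the accent phrase being formed
    branch = 0.0   # stays 0 when no branch probability is calculated
    for word in words:                              # S3, S4, S13
        pending.append(word.surface)
        if word.kana_chain_prob is not None:        # S5
            branch = combine(word.kana_chain_prob, word.analysis_prob)  # S6-S8
        if word.closes_phrase:                      # S9
            unit = "second" if branch > alpha else "first"  # S10-S12
            results.append(("".join(pending), unit))
            pending, branch = [], 0.0
    return results


if __name__ == "__main__":
    written = [Word("kyou", None, 0.1, False), Word("wa", None, 0.1, True),
               Word("tenki", None, 0.1, False), Word("desu", 0.05, 0.1, True)]
    print(dispatch(written, alpha=0.5))
```

With these placeholder values both phrases of the written sentence are routed to the first unit, while a sequence whose words all carry high probabilities and close only at the end is routed to the second unit as a single phrase.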
[0047]
  Hereinafter, the above-described accent phrase generation processing operation will be described concretely, taking as an example the case where the ordinary text “Today is the weather” is input. First, text analysis is performed on the text “Today is the weather”, and the processing result “today” (noun) / “wa” (particle) / “weather” (noun) / “desu” (auxiliary verb) is obtained. In this case, the input text “Today is the weather” is divided into four words (N1 = 4) by the text analysis processing.
[0048]
  Next, the first word “today” is read out. Since the word “today” contains no continuous kana string, it is next determined whether or not an accent phrase is formed. Because a particle follows, it is determined that no accent phrase is formed yet, and the second word, the particle “wa”, is read out. Even when its concatenation with the preceding word “today” is taken into account, there is no continuous kana string, so it is determined whether or not an accent phrase is formed. Since “wa” combines with the preceding word “today” into one phrase, it is determined that an accent phrase is formed. Here, because there is no continuous kana string and the branch probability calculation is not performed, the branch probability is “0”, and the first accent phrase generation unit 23 generates the accent phrase based on the text analysis result.
[0049]
  Next, the third word “weather” is processed in the same manner as the first word “today”. Then the fourth word, the auxiliary verb “desu”, is read out. Since this word contains a continuous kana string (“de” and “su”), the kana chain branch probability of “de” and “su” and the analysis likelihood branch probability are obtained, and the branch probability is calculated from the two values. In this case, since both the kana chain branch probability value and the analysis likelihood branch probability value are small, the branch probability of the word “desu” is small. It is then determined that “weather” and “desu” together form an accent phrase. Because the branch probability is small, it is determined to be smaller than α, and the first accent phrase generation unit 23 generates the accent phrase based on the text analysis result. When the word number i becomes larger than the number of words, “4”, the accent phrase generation processing operation for the text “Today is the weather” is terminated. In the above example a two-character kana chain has been described, but the same applies to chains of three or more characters.
[0050]
  Next, the above-described accent phrase generation processing operation will be described concretely, taking as an example the case where the spoken-language text “Nanchattee” is input. First, text analysis is performed on the text “Nanchattee”, and the processing result “na-” (particle: final particle) / “n” (particle: case particle) / “chat” (verb: godan conjugation, wa row) / “te” (particle: conjunctive particle) / “e” (unknown word) / “-” (unknown word) is obtained. In this case, the input text “Nanchattee” is divided into six words by the text analysis processing.
[0051]
  Next, the first word “na-” is read out. Since this word contains a continuous kana string (“na” and “-”), the kana chain branch probability of “na” and “-” and the analysis likelihood branch probability are obtained. In this case, since the chain of “na” and “-” is peculiar to spoken language, the kana chain branch probability R(na, -) is large. In addition, because the sentence begins with an attached word, the reliability of the text analysis is considered to be low, so the analysis likelihood branch probability is also large. The branch probability is then calculated from the two values. Since both the kana chain branch probability value and the analysis likelihood branch probability value are large, the branch probability of the word “na-” is large.
[0052]
  Furthermore, since an accent phrase is formed together with the following word “n”, it is determined that no accent phrase is formed by the word “na-” alone. Next, the second word “n” is processed in the same manner as the first word “na-”. When determining whether or not an accent phrase is formed, it is determined that there is no accent phrase boundary before the following verb “chat”, and therefore that no accent phrase is formed by “na-n” alone. This is judged from the fact that the branch probabilities of both “na-n” and “chat” are reasonably high. Thereafter, the same processing is performed for the third word “chat” through the sixth word “-”, and for each of them it is determined that no accent phrase boundary is formed because the branch probability is high. In the end, since every word separated by the text analysis of the input text “Nanchattee” has a high branch probability, the single accent phrase “Nanchattee” is formed.
[0053]
  Because its branch probability is large, the single accent phrase formed in this way is determined to exceed α, and the second accent phrase generation unit 24 generates the accent phrase without using the result of the text analysis. This avoids the unnatural accent that would result if the first accent phrase generation unit 23 generated the accent phrase from a misanalyzed text analysis result.
[0054]
  Next, the accent phrase generation processing executed by the second accent phrase generation unit 24, which does not rely on the text analysis result, will be described in detail. FIG. 5 shows a flowchart of the accent phrase generation processing operation by the second accent phrase generation unit 24. The operation starts when, in step S11 of the accent phrase generation processing operation shown in FIG. 2, the accent phrase candidate “Nanchattee” is sent to the second accent phrase generation unit 24.
[0055]
  In step S21, the initial value “1” is set to the mora number j of the input accent phrase candidate. In step S22, the character corresponding to the j-th mora is read from the input accent phrase candidate “Nanchattee” and assigned to the variable Mj. In step S23, based on the kana chain M(j-1), Mj, the probability that a voice component is started at the character Mj (hereinafter referred to as the voice probability) is obtained from the kana chain information table and assigned to the variable a1. Here, the kana chain information table holds the probability, obtained in advance from a large amount of text data, that a voice component is started between two consecutive kana characters. It differs from the kana chain branch probability table used in the accent phrase generation determination unit 22 only in that its probability value is the voice probability, whereas the value in the branch probability table is the probability of being spoken language. Therefore, if the probability value in the kana chain information table is large, a voice component is highly likely to be started at the second character Mj. For example, for “n” and “cha” in the input accent phrase candidate “Nanchattee”, there are few cases in a large amount of text data in which a voice component is started between “n” and “cha”, so the voice probability value is low.
[0056]
  In step S24, the analysis likelihood branch probability referred to in step S7 of the accent phrase generation processing operation shown in FIG. 2 is looked up based on the character string that follows the kana Mj, and its reciprocal is assigned to the variable a2. Here, a high analysis likelihood branch probability means that the reliability of the text analysis result is low. Therefore, if the analysis likelihood branch probability is large, the character Mj is less likely to be the start position of a voice component. For example, taking part-of-speech information as a measure of the analysis likelihood, a kana character string analyzed as an unknown word has a low probability of being a correct text analysis result, so it is less likely to be the start position of a voice component. On the other hand, hiragana analyzed as a pronoun, an adverb, or the like has a high probability of being a correct text analysis result, so it is highly likely to be the start position of a voice component.
[0057]
  In the case of the kana character chain “nan” in the input accent phrase candidate, it is analyzed as particle + particle even though it is at the beginning of the sentence (that is, as a sentence-initial attached word), so the analysis likelihood branch probability is high. Therefore, the value of a2, its reciprocal, is small.
[0058]
  In step S25, the voice component start probability based on the number of mora of the input accent phrase candidate is substituted into the variable a3. If the number of mora of the input accent phrase candidate is large, the necessity of starting a voice component at the head of the accent phrase candidate becomes higher. Therefore, the voice probability of the first character is a function that increases monotonously with respect to the number of mora. Therefore, when the character Mj is the first character of the input accent phrase candidate, the voice probabilities are obtained based on the function. For example, in the case of the input accent phrase candidate “Nanchattee”, since there are 7 mora, there is a high possibility that the voice component starts at “Na”. If the character Mj is not the first character of the input accent phrase candidate, “0” is assigned to the variable a3.
[0059]
  In step S26, the start probability of the voice component based on the position occupied by the character Mj in the input accent phrase candidate is assigned to the variable a4. A voice component is most likely to be started when the character of interest Mj is at the head of the input accent phrase candidate, and less likely as the character approaches the tail. Therefore, the voice probability is a monotonically decreasing function of the position from the head, and the voice probability for the character of interest Mj is obtained from this function. That is, in the case of the input accent phrase candidate “Nanchattee”, the probability that a voice component starts at “na” is high, but the probability that a voice component starts at “cha” is low.
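The two position-dependent terms of steps S25 and S26 can be sketched in Python as follows. The text only requires a3 to increase monotonically with the number of mora (for the first character) and a4 to decrease monotonically with the position from the head. The specific functional forms below, the `scale` parameter, and the function names are assumptions chosen for illustration.

```python
import math


def mora_count_probability(position, n_mora, scale=4.0):
    """a3 (step S25): nonzero only for the first character of the candidate,
    increasing monotonically with the number of mora (assumed saturating form)."""
    if position != 1:
        return 0.0
    return 1.0 - math.exp(-n_mora / scale)


def position_probability(position):
    """a4 (step S26): monotonically decreasing in the 1-based position
    from the head of the candidate (assumed reciprocal form)."""
    return 1.0 / position


if __name__ == "__main__":
    n_mora = 7  # the text counts 7 mora in "Nanchattee"
    for j in (1, 3):
        print(j, mora_count_probability(j, n_mora), position_probability(j))
```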
[0060]
  In step S27, the variables a1 to a4 obtained in steps S23 to S26 as described above are multiplied by the weighting factors b1 to b4 and added, and assigned to the variable A. In step S28, it is determined whether or not the value of the variable A is larger than a predetermined value β. As a result, if A> β, the process proceeds to step S29, and if A ≦ β, the process proceeds to step S30. In step S29, voice components are given to the character strings M1 to M (j-1). After that, the process proceeds to step S31. In step S30, no voice component is given to the character strings M1 to M (j-1).
[0061]
  In step S31, it is determined whether or not the mora number j is smaller than the total mora number N2 of the input accent phrase candidates. If the result is smaller than the total number of mora N2, the process proceeds to step S32. If the total number of mora is N2 or more, the accent phrase generation processing operation is terminated. In step S32, the mora number j is incremented. After that, the process returns to step S22, and the process proceeds to the process for the character corresponding to the next mora. When it is determined in step S31 that the mora number j is equal to or greater than the total number of mora N2, the accent phrase generation processing operation is terminated.
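The loop of steps S21 to S32 can be sketched in Python as follows. The sketch takes the per-mora values a1 to a4 as given inputs and reports the mora numbers at which the weighted sum exceeds β, that is, the positions where a new voice component would start. The feature values, the weights b1 to b4, and the threshold below are hypothetical placeholders, and the function name is not from the specification.

```python
def voice_component_starts(features, weights, beta):
    """features: one (a1, a2, a3, a4) tuple per mora, in order.
    weights:  (b1, b2, b3, b4).
    Returns the 1-based mora numbers j for which A > beta."""
    starts = []
    for j, a in enumerate(features, start=1):            # S21, S31, S32
        score = sum(b * x for b, x in zip(weights, a))   # S27: A = sum of bk * ak
        if score > beta:                                 # S28
            starts.append(j)                             # S29: component starts at Mj
    return starts


if __name__ == "__main__":
    # Placeholder values for a 7-mora candidate: only the head scores high.
    features = [(0.5, 1.25, 0.9, 1.0)] + [
        (0.1, 1.25, 0.0, 1.0 / j) for j in range(2, 8)
    ]
    weights = (0.3, 0.1, 0.3, 0.3)
    print(voice_component_starts(features, weights, beta=0.6))
```

With these placeholder numbers only the first mora exceeds the threshold, so no second voice component is started inside the candidate.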
[0062]
  As described above, the second accent phrase generation unit 24 determines whether or not to set a new voice component start position in the input accent phrase candidate based on four values: the voice probability based on the kana chain of the input accent phrase candidate, 1/(analysis likelihood branch probability), the voice probability based on the number of mora, and the voice probability based on the position occupied in the accent phrase candidate. Therefore, when the accent phrase candidate “Nanchattee” based on spoken-language text is input, all four of these values are small for the character string “cha”, and no voice component is started at “cha”. In this way, the voice component is prevented from being divided into two and causing an unnatural pitch pattern.
[0063]
  As described above, in the present embodiment, in addition to the first accent phrase generation unit 23 that generates an accent phrase based on the text analysis result of the text analysis unit 21, a second accent phrase generation unit 24 that generates an accent phrase without depending on the text analysis result is provided. The accent phrase generation determination unit 22 determines that the accent phrase is generated by the first accent phrase generation unit 23, based on the text analysis result, when the input text is written language, and that it is generated by the second accent phrase generation unit 24 when the input text is spoken language.
[0064]
  Therefore, when the input text is a spoken-language expression such as “Nanchattee” that does not conform to the grammar, the second accent phrase generation unit 24 can generate an accent phrase without depending on the text analysis result. As a result, unlike the case where an accent phrase is generated based on an erroneous text analysis result from the text analysis unit 21, the next voice component is not started at “Cha”, and generation of an unnatural pitch pattern can be prevented.
[0065]
  At this time, the accent phrase generation determination unit 22 determines whether the processing is performed by the first accent phrase generation unit 23 or by the second accent phrase generation unit 24 by referring to two tables: a kana chain branch probability table, which associates a chain of two kana characters with the probability of branching to the processing in the second accent phrase generation unit 24, and an analysis likelihood branch probability table, which associates a part-of-speech condition with the probability of branching to the processing in the second accent phrase generation unit 24 when that condition is satisfied. Therefore, based on the kana character string information peculiar to spoken language and on the part-of-speech condition, it is possible to accurately determine whether or not the second accent phrase generation unit 24 should perform the processing.
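  A table-driven determination of this kind could look roughly like the following sketch. Everything concrete in it is an assumption made for illustration: the table entries, the part-of-speech condition label `unknown_word`, the threshold, and the rule of taking the largest looked-up probability. The passage above only states that the two tables are referred to, not how their values are combined.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical table contents; the actual tables are those of FIG. 3 and FIG. 4.
KANA_CHAIN_BRANCH_PROB: Dict[str, float] = {"ちゃ": 0.25, "って": 0.75}
ANALYSIS_LIKELIHOOD_BRANCH_PROB: Dict[str, float] = {"unknown_word": 0.5}


def kana_chains(kana: str) -> Iterator[str]:
    """Yield every chain of two consecutive kana characters."""
    return (kana[i:i + 2] for i in range(len(kana) - 1))


def use_second_generator(kana: str, pos_conditions: Iterable[str],
                         threshold: float = 0.5) -> bool:
    """Return True to branch to the second accent phrase generation unit.

    Assumed rule: branch when the largest probability found in either
    table exceeds the threshold. Entries missing from a table count as 0.
    """
    kana_p = max((KANA_CHAIN_BRANCH_PROB.get(c, 0.0)
                  for c in kana_chains(kana)), default=0.0)
    pos_p = max((ANALYSIS_LIKELIHOOD_BRANCH_PROB.get(p, 0.0)
                 for p in pos_conditions), default=0.0)
    return max(kana_p, pos_p) > threshold
```

  With these assumed tables, the kana string "なんちゃってー" contains the chain "って" and is routed to the second unit, while a string with no listed chain and no satisfied part-of-speech condition stays with the first unit.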
[0066]
  Further, the second accent phrase generation unit 24 determines whether or not to set a new voice component start position in the accent phrase candidate input from the accent phrase generation determination unit 22, based on the voice component start probability derived from the kana chain of that candidate, the reciprocal of the analysis likelihood branch probability, the voice component start probability based on the number of mora, and the voice component start probability based on the position occupied in the accent phrase candidate. Therefore, even when text that cannot be defined by a grammar, such as spoken language, is input, the assignment of an unnatural voice component based on an erroneous text analysis result is suppressed, and a natural prosody can be generated.
[0067]
  In the above embodiment, the accent phrase generation determination unit 22 decides whether the accent phrase is generated by the first accent phrase generation unit 23 or by the second accent phrase generation unit 24 according to whether the input is written language or spoken language. However, the present invention is not limited to this. In short, it is only necessary that text which cannot be defined by a grammar, and which therefore causes erroneous text analysis, be determined to be processed by the second accent phrase generation unit 24.
[0068]
  Text written in spoken language as described above is often entered when a mail message is composed on a portable terminal. Since the number of characters that can be displayed on the screen of such a portable terminal is limited, it is desirable to output a received mail message as synthesized speech. Therefore, by mounting the speech synthesizer described in the above embodiment on a portable terminal, the functionality of the portable terminal can be greatly improved.
[0069]
  The functions of the text analysis means, the accent phrase generation determination means, the first accent phrase generation means, and the second accent phrase generation means, which are performed by the text analysis unit 21, the accent phrase generation determination unit 22, the first accent phrase generation unit 23, and the second accent phrase generation unit 24 in the above embodiment, are realized by a speech synthesis processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium composed of a ROM (Read Only Memory). Alternatively, it may be a program medium that is loaded into an external auxiliary storage device and read out. In either case, the program reading means for reading the speech synthesis processing program from the program medium may be configured to access the program medium directly and read it, or to download the program to a program storage area (not shown) provided in a RAM (random access memory) and read it by accessing that program storage area. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main unit in advance.
[0070]
  Here, the program medium is a medium that is configured to be separable from the main body and that carries the program in a fixed manner. It includes tape systems such as magnetic tapes and cassette tapes; disk systems such as magnetic disks (floppy disks, hard disks) and optical discs (CD (compact disk)-ROMs, MO (magneto-optical) discs, MDs (mini discs), DVDs (digital video discs)); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROMs, EPROMs (ultraviolet erasable ROMs), EEPROMs (electrically erasable ROMs), and flash ROMs.
[0071]
  Further, when the speech synthesizer in the above embodiment includes a modem and can be connected to a communication network including the Internet, the program medium may be a medium that carries the program in a fluid manner, for example by downloading it from the communication network. In this case, it is assumed that a download program for downloading from the communication network is stored in the main unit in advance, or is installed from another recording medium.
[0072]
  It should be noted that what is recorded on the recording medium is not limited to a program, and data can also be recorded.
[0073]
[Effects of the invention]
  As is clear from the above, the speech synthesizer of the first invention comprises first accent phrase generating means for generating an accent phrase based on the part-of-speech words in the text analysis result, and second accent phrase generating means for generating an accent phrase based on the text analysis result without being bound by the part-of-speech words, and the accent phrase generation determining means determines whether the accent phrase is generated by the first accent phrase generating means or by the second accent phrase generating means. Therefore, an accent phrase for input text that is easily misanalyzed during text analysis, such as spoken language, can be generated by the second accent phrase generating means without being bound by the part-of-speech words in the text analysis result.
[0074]
  Therefore, according to the present invention, it is possible to give a natural pitch pattern to text that cannot be defined by grammar, such as spoken language, and to suppress unnatural prosody.
[0075]
  Further, the accent phrase generation determination means uses, as a criterion for the above determination, at least one of kana chain information and text analysis likelihood information. The kana chain information is the probability that a kana character chain belongs to a spoken-language text corpus, and represents the probability of branching to accent phrase generation by the second accent phrase generation means. The text analysis likelihood information is preset according to a part-of-speech condition, and likewise represents the probability of branching to accent phrase generation by the second accent phrase generation means. It is therefore possible to accurately determine that accent phrase generation for text that cannot be defined by a grammar, such as spoken language, should be performed by the second accent phrase generation means.
[0076]
  Also, in one embodiment of the first invention, the second accent phrase generating means sets the start position of the voice component in the accent phrase to be generated using at least one of the kana chain information, the text analysis likelihood information, the number of mora of the accent phrase candidate, and the position in the accent phrase candidate. The accent phrase can therefore be generated correctly without being bound by the part-of-speech words in the text analysis result. Consequently, even if input text that cannot be defined by a grammar, such as spoken language, is given, generation of an unnatural pitch pattern can be suppressed and a natural prosody can be generated.
[0077]
  In the second embodiment, the second accent phrase generating means sets the start position of the voice component in the generated accent phrase using at least one of the following: kana chain information, which is the probability, obtained in advance from text data, that a voice component starts between two consecutive kana characters; text analysis likelihood information, which is the probability that a voice component starts, given by the reciprocal of the text analysis likelihood branch probability; the number of mora of the accent phrase candidate, which is the probability that a voice component starts, given to the first character of the accent phrase candidate according to the number of mora of that candidate; and the position in the accent phrase candidate, which is the probability that a voice component starts, given based on the position occupied by the character in the accent phrase candidate. Therefore, even if input text that cannot be defined by a grammar, such as spoken language, is given, generation of an unnatural pitch pattern can be suppressed and a more natural prosody can be generated.
[0078]
  In the speech synthesis method according to the second invention, whether the accent phrase for the input text is generated based on the part-of-speech words in the text analysis result, or based on the text analysis result without being bound by the part-of-speech words, is determined using at least one of a kana chain branch probability and a text analysis likelihood branch probability. The kana chain branch probability is the probability that a kana character chain belongs to a spoken-language text corpus, and represents the probability of branching to accent phrase generation by the second accent phrase generating means. The text analysis likelihood branch probability is preset according to a part-of-speech condition, and likewise represents the probability of branching to accent phrase generation by the second accent phrase generating means. Since the accent phrase is generated according to the determination result, an accent phrase for input text that is easily misanalyzed during text analysis, such as spoken language, can be generated without being bound by the part-of-speech words in the text analysis result.
[0079]
  In addition, the portable terminal device of the third invention is equipped with the speech synthesizer of the first invention, which can give natural accent phrases to input text that cannot be defined by a grammar, such as spoken language. Therefore, even when an e-mail message written in spoken language is received, it can be output accurately as synthesized speech, and the operability of the portable terminal can be improved.
[0080]
  The speech synthesis program according to the fourth invention causes a computer to function as the text analysis means, prosody generation means, speech synthesis means, accent phrase generation determination means, first accent phrase generation means, and second accent phrase generation means of the first invention. The program recording medium according to the fifth invention records the speech synthesis program of the fourth invention. Therefore, as in the case of the first invention, an accent phrase for input text that is easily misanalyzed by the text analysis means, such as spoken language, can be generated by the second accent phrase generation means without being bound by the part-of-speech words in the text analysis result.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech synthesizer according to the present invention.
FIG. 2 is a flowchart of an accent phrase generation processing operation performed by the speech synthesizer shown in FIG. 1.
FIG. 3 is a diagram illustrating an example of a kana chain branch probability table.
FIG. 4 is a diagram illustrating an example of an analysis likelihood branch probability table.
FIG. 5 is a flowchart of an accent phrase generation processing operation performed by the second accent phrase generation unit in FIG. 1.
FIG. 6 is a diagram illustrating a process of obtaining a pitch pattern.
FIG. 7 is a diagram showing a process of generating a pitch pattern based on spoken language by the first accent phrase generation unit in FIG. 1.
FIG. 8 is a diagram showing a process of generating a pitch pattern based on spoken language by the second accent phrase generation unit in FIG. 1.
FIG. 9 is a block diagram of a conventional speech synthesizer.
FIG. 10 is a block diagram of a conventional speech synthesizer different from that of FIG. 9.
[Explanation of symbols]
21 ... Text analysis unit,
22 ... Accent phrase generation determination unit,
23 ... First accent phrase generation unit,
24 ... Second accent phrase generation unit,
25 ... Prosody generation unit,
26 ... Speech synthesis unit.

Claims

In a speech synthesis apparatus having text analysis means for analyzing input text, prosody generation means for generating prosody information based on the text analysis result, and speech synthesis means for synthesizing speech based on the text analysis result and the prosody information,
First accent phrase generation means for generating an accent phrase based on part-of-speech words in the text analysis result and sending it to the prosody generation means;
Second accent phrase generation means for generating an accent phrase based on the text analysis result, without being bound by the part-of-speech words, and sending it to the prosody generation means; and
Accent phrase generation determination means for determining, based on the text analysis result and using at least one of a kana chain branch probability and a text analysis likelihood branch probability, whether the accent phrase is generated by the first accent phrase generation means or by the second accent phrase generation means, are provided,
The kana chain branch probability is the probability that a kana character chain belongs to a spoken-language text corpus, and represents the probability of branching to accent phrase generation by the second accent phrase generation means, and
The text analysis likelihood branch probability is preset according to a part-of-speech condition and represents the probability of branching to accent phrase generation by the second accent phrase generation means; a speech synthesis apparatus characterized by the above.

The speech synthesis apparatus according to claim 1,
The speech synthesis apparatus characterized in that the second accent phrase generating means sets the start position of the voice component in the accent phrase to be generated using at least one of the kana chain information, the text analysis likelihood information, the number of mora of the accent phrase candidate, and the position in the accent phrase candidate.

The speech synthesis apparatus according to claim 2,
The kana chain information is the probability, obtained in advance based on text data, that a voice component is started between two consecutive kana characters,
The text analysis likelihood information is the probability that a voice component is started, given by the reciprocal of the text analysis likelihood branch probability,
The number of mora of the accent phrase candidate is the probability that a voice component is started, given to the first character of the accent phrase candidate according to the number of mora of the accent phrase candidate, and
The position in the accent phrase candidate is the probability that a voice component is started, given based on the position occupied by the character in the accent phrase candidate; a speech synthesis apparatus characterized by the above.

In a speech synthesis method for analyzing input text, generating prosody information based on the text analysis result, and synthesizing speech based on the text analysis result and the prosody information,
A first accent phrase generation step for generating a first accent phrase to be used when generating the prosody information, based on part-of-speech words in the text analysis result;
A second accent phrase generation step for generating a second accent phrase to be used when generating the prosody information, based on the text analysis result and without being bound by the part-of-speech words; and
An accent phrase generation determination step for determining, based on the text analysis result and using at least one of a kana chain branch probability and a text analysis likelihood branch probability, which of the first accent phrase and the second accent phrase is generated, are provided,
The kana chain branch probability is the probability that a kana character chain belongs to a spoken-language text corpus, and represents the probability of branching to accent phrase generation by the second accent phrase generating means, and
The text analysis likelihood branch probability is preset according to a part-of-speech condition and represents the probability of branching to accent phrase generation by the second accent phrase generating means; a speech synthesis method characterized by the above.

A portable terminal device comprising the speech synthesizer according to any one of claims 1 to 3.

A speech synthesis program for causing a computer to function as the text analysis means, prosody generation means, speech synthesis means, accent phrase generation determination means, first accent phrase generation means, and second accent phrase generation means according to claim 1.

A computer-readable program recording medium on which the speech synthesis program according to claim 6 is recorded.