JP2015118222A

JP2015118222A - Voice synthesis system and voice synthesis method

Info

Publication number: JP2015118222A
Application number: JP2013261142A
Authority: JP
Inventors: Kenji Nagamatsu (永松 健司); Takeo Mori (森 竹雄); Takeshi Honma (本間 健)
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2013-12-18
Filing date: 2013-12-18
Publication date: 2015-06-25
Anticipated expiration: 2033-12-18
Also published as: JP6336749B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis system that allows a user of a voice synthesis service to easily know what corrections are available for a synthetic voice. SOLUTION: The voice synthesis system receives read-out text data and outputs a synthetic voice of that text. The system includes a correction candidate information generation processing section that generates, based on the information used for generating the synthetic voice, correction candidate information indicating the points in the synthetic voice that can be corrected and the content of the correction candidates.

Description

The present invention relates to a speech synthesis system and a speech synthesis method.

Text-to-speech synthesis techniques, which convert text into speech and read it aloud, and text-to-speech synthesis systems that use them have long been available. Applications of these techniques include guidance voices in car navigation, e-mail reading and spoken dialogue interfaces on mobile phones and smartphones, screen readers for visually impaired users, and read-aloud functions for electronic books.

In such systems, speech synthesis processing has often been performed on the device that plays back the synthesized speech, such as a PC (Personal Computer). In recent years, however, speech synthesis services that use a server have been increasingly commercialized by combining speech synthesis with Web technologies. In such a service, when the read-out text is sent to the server, the server returns the synthesized speech data produced by the conversion. Implementing speech synthesis conversion as a Web-based service makes it easier to handle a large number of conversion requests and to provide the service to many users simultaneously. Normally, a speech synthesis service completes in a single exchange: the read-out text is sent and the converted synthesized speech is received.

Patent Document 1: JP 2012-108378 A

Because speech synthesis processing determines word readings and intonation automatically, misreadings and passages with unnatural intonation or accent are unavoidable. For this reason, some conventional speech synthesis programs provide a mechanism by which the user corrects the errors or unnatural parts that he or she has noticed and then synthesizes the speech again. For example, Patent Document 1 discloses a technique that recognizes a spoken improvement instruction from the user by speech recognition and corrects the synthesized speech data accordingly.

A speech synthesis service is normally divided into a server side and a client side. The speech synthesis processing is performed on the server side, and the user on the client side can normally only listen to the synthesized speech that is returned. Consequently, even when the reading or intonation becomes unnatural because of an analysis error or similar cause as described above, the client side does not know what corrections can be made to the synthesized speech, and so the client side cannot be given a function for correcting errors or unnatural parts.

An object of the present invention is to provide a technique that allows a user of a speech synthesis service to easily know what corrections can be made to synthesized speech.

The present inventors have found that the above problem can be solved by generating, based on the information used when the synthesized speech was generated, correction candidate information indicating the correctable portions of the synthesized speech and the content of the correction candidates.

To solve the above problem, for example, the configurations described in the claims are adopted. The present application includes a plurality of means for solving the above problem. As one example, there is provided a speech synthesis system that receives a read-out text and outputs synthesized speech in which the read-out text is read aloud, the system including a correction candidate information generation processing unit that generates, based on the information used when the synthesized speech was generated, correction candidate information indicating the correctable portions of the synthesized speech and the content of the correction candidates.

According to another example, there is provided a speech synthesis method including a first step of generating synthesized speech in which an input read-out text is read aloud, and a second step of generating, based on the information used when the synthesized speech was generated, correction candidate information indicating the correctable portions of the synthesized speech and the content of the correction candidates.

According to the present invention, the result of the speech synthesis processing includes not only the synthesized speech produced by the conversion but also the correctable portions of that speech and the corrections that can be applied to them. A user of the speech synthesis service can therefore easily know what corrections can be made to the synthesized speech.

Further features related to the present invention will become apparent from the description in this specification and the accompanying drawings. Problems, configurations, and effects other than those described above will be made clear by the following description of the embodiments.

A diagram explaining the basic configuration of the speech synthesis system according to the present invention.
A diagram explaining the internal processing of the reading generation processing unit and the reading correction candidate information generation processing unit in the basic configuration of the speech synthesis system.
A diagram explaining the internal processing of the prosody correction candidate information generation processing unit in the basic configuration of the speech synthesis system.
A diagram explaining the internal processing of the reading selection processing unit and the word reading correction candidate information generation processing unit in the basic configuration of the speech synthesis system.
A data example of word network information.
An output data example of the optimum word string selection processing.
An output data example of the word integration processing.
A data example of the optimum morpheme string.
A flowchart of the word reading correction candidate information generation processing in the basic configuration of the speech synthesis system.
A data example of word reading correction candidate information.
An output data example of the break determination processing.
An output data example of the break determination processing.
A data example of break correction candidate information.
An output data example of the combination determination processing.
A data example of combination correction candidate information.
A data example of reading correction candidate information.
A diagram showing the processing configuration of the speech synthesis system for a normal speech synthesis request.
A diagram showing the processing configuration of the speech synthesis system for a correction and improvement instruction request.
A data example of meta information with IDs for the speech synthesis service of the first embodiment.
A diagram showing the configuration of the server-side system in the second embodiment.
A diagram showing the configuration of the client-side system in the second embodiment.
A data example of the user information database.
A data example of reading correction candidate information output for an English text.
A data example after the data of FIG. 20 has been corrected by the service content correction unit.
An output data example of correction candidate information.
An example of a screen displayed on the client-side system.

Embodiments of the present invention are described below with reference to the accompanying drawings. The accompanying drawings show specific embodiments in accordance with the principle of the present invention, but they are provided to aid understanding of the present invention and are in no way to be used to interpret the present invention restrictively.

First, the basic configuration of the speech synthesis system according to the present invention is described. FIG. 1 is a diagram explaining this basic configuration. The speech synthesis system receives a read-out text and outputs synthesized speech in which that text is read aloud. The specific configuration described below may be implemented by software program code that realizes the functions of the embodiments. That is, the basic configuration of the speech synthesis system can be realized by storing predetermined programs in memory as program code and having a central processing unit execute that program code.

In the following description, the information handled in the embodiments is described using a "table" structure. This information does not necessarily have to be represented by a table data structure, however, and may be represented by a list, a DB, a queue, or some other data structure. To indicate that it does not depend on a particular data structure, the various data are sometimes simply called "information" below.

This basic configuration extends conventional speech synthesis processing so that correction candidate information can be generated together with the synthesized speech. The speech synthesis system includes a reading generation processing unit 102, a prosody generation processing unit 103, and a waveform generation processing unit 104. When a read-out text 101 is input, the reading generation processing unit 102 performs text analysis processing such as morphological analysis, reading disambiguation, and break position determination, and thereby determines the reading, accent, and so on of the read-out text. Next, the prosody generation processing unit 103 determines the duration, fundamental frequency, power, and so on of each syllable. Finally, the waveform generation processing unit 104 generates synthesized speech waveform data, which is output as read-out speech 105. Methods for implementing the reading generation processing unit 102, the prosody generation processing unit 103, and the waveform generation processing unit 104 are disclosed in various known documents and are therefore not described here.

In this embodiment, a reading correction candidate information generation processing unit 106, a prosody correction candidate information generation processing unit 107, a waveform correction candidate information generation processing unit 108, and a correction candidate information integration processing unit 109 are added to the conventional speech synthesis system described above. The reading correction candidate information generation processing unit 106 generates reading correction candidate information, which is correction information for the reading generation processing that generates the reading of the synthesized speech. The prosody correction candidate information generation processing unit 107 generates prosody correction candidate information, which is correction information for the prosody generation processing that generates the prosody of the synthesized speech. The waveform correction candidate information generation processing unit 108 generates waveform correction candidate information, which is correction information for the waveform generation processing that generates the waveform of the synthesized speech. A feature of the present invention is that these processing units make it possible to output, in addition to the read-out speech 105, correction candidate information 110 indicating the correctable portions of the synthesized speech and the content of the correction candidates.
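The relationship between the three generation stages and their correction candidate generators can be sketched as follows. This is only an illustrative outline: the function names, the convention that each stage returns a pair of (output, candidate information), and the dictionary used for the integrated result are all assumptions made for the example, and the stub stages stand in for processing that the text does not specify.

```python
def synthesize(text, reading_stage, prosody_stage, waveform_stage):
    """Run the three stages in order; each returns (output, candidate_info)."""
    reading, reading_cands = reading_stage(text)        # units 102 + 106
    prosody, prosody_cands = prosody_stage(reading)     # units 103 + 107
    audio, waveform_cands = waveform_stage(prosody)     # units 104 + 108
    # Unit 109: merge the per-stage candidates into one structure (110).
    candidates = {
        "reading": reading_cands,
        "prosody": prosody_cands,
        "waveform": waveform_cands,
    }
    return audio, candidates


# Stub stages standing in for the real processing (placeholders only).
def stub_reading(text):
    return text.upper(), ["reading candidate"]


def stub_prosody(reading):
    return (reading, 1.0), ["prosody candidate"]


def stub_waveform(prosody):
    return b"\x00\x01", []


audio, candidates = synthesize("hello", stub_reading, stub_prosody, stub_waveform)
```

The point of the sketch is that the caller receives both the audio and the candidate information from a single request.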

The following describes embodiments of the processing of the reading correction candidate information generation processing unit 106, the prosody correction candidate information generation processing unit 107, the waveform correction candidate information generation processing unit 108, and the correction candidate information integration processing unit 109. FIG. 2A explains the internal processing of the reading generation processing unit 102 and the reading correction candidate information generation processing unit 106.

The reading generation processing unit 102 broadly consists of a word division processing unit 202, a reading selection processing unit 203, a break determination processing unit 204, and a combination determination processing unit 205. These are the minimum components required for reading generation in speech synthesis processing; depending on the embodiment, other components such as dependency analysis processing may be added. Even if such additional components are present, the description in this embodiment loses none of its generality. Methods for implementing the word division processing unit 202, the reading selection processing unit 203, the break determination processing unit 204, and the combination determination processing unit 205 are disclosed in various known documents and are therefore not described here.

The reading correction candidate information generation processing unit 106 consists of a word reading correction candidate information generation processing unit 207, a break correction candidate information generation processing unit 208, a combination correction candidate information generation processing unit 211, and a reading correction candidate information integration processing unit 209. The word reading correction candidate information generation processing unit 207 generates word reading correction candidate information, which is correction information about the readings of words in the synthesized speech. The break correction candidate information generation processing unit 208 generates break correction candidate information, which is correction information about the break (pause) positions in the synthesized speech. The combination correction candidate information generation processing unit 211 generates combination correction candidate information, which is correction information about the positions of accent combination in the synthesized speech.

In this embodiment, the word reading correction candidate information generation processing unit 207, the break correction candidate information generation processing unit 208, and the combination correction candidate information generation processing unit 211 are added to the reading selection processing unit 203, the break determination processing unit 204, and the combination determination processing unit 205, respectively. The reading correction candidate information integration processing unit 209 integrates the outputs of the word reading correction candidate information generation processing unit 207, the break correction candidate information generation processing unit 208, and the combination correction candidate information generation processing unit 211, and outputs reading correction candidate information 210.

FIG. 3 explains the internal processing of the reading selection processing unit 203 and the word reading correction candidate information generation processing unit 207. The reading selection processing unit 203 broadly consists of an optimum word string selection processing unit 302, a word integration processing unit 303, and a reading disambiguation processing unit 304. The read-out text 201 input to the speech synthesis system is converted into word network information 301 by the word division processing unit 202. This processing is what is generally called morphological analysis; algorithms for it are disclosed in various known documents, so it is not described here. The word network information 301 is a network with a restricted link structure, generally called a word lattice. FIG. 4 shows an example of the word network information 301. FIG. 4 shows a lattice structure in which every dictionary word that matches part of the written form of the sentence 「今日は晴れです。」 ("It is sunny today.") has been extracted and listed as a word candidate.

When the word network information 301 is input to the reading selection processing unit 203, the optimum word string selection processing unit 302 first determines the optimum word string from the word network information. The optimum word string is the word string that, among all combinations of words in the word network information 301, is the most plausible combination of words as Japanese. FIG. 5 shows an example of the optimum word string for the word network information of FIG. 4. In FIG. 5, the word string linked by arrows from "sentence start" to "sentence end" is the optimum word string. The optimum word string is determined by calculating the costs of the words and links, or the occurrence probabilities of the words and links, and searching for the combination of words with the smallest total cost or the largest product of occurrence probabilities. This processing is also a known technique disclosed in various documents.
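A minimal sketch of such a minimum-cost search over a word lattice is shown below. It is an illustration under stated assumptions rather than the method used in the embodiment: only per-word costs are considered (link costs are omitted for brevity), the `Word` structure and `best_word_sequence` function are hypothetical names, and the lattice entries and cost values are invented for the example sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    start: int      # index of the first character covered
    end: int        # index one past the last character covered
    surface: str
    reading: str
    pos: str        # part of speech
    cost: float     # hypothetical per-word cost (lower is better)


def best_word_sequence(words, length):
    """Dynamic programming over character positions 0..length."""
    inf = float("inf")
    best = [inf] * (length + 1)   # best[p]: cheapest cost to reach position p
    back = [None] * (length + 1)  # back[p]: word that ends the cheapest path at p
    best[0] = 0.0
    for pos in range(length):
        if best[pos] == inf:
            continue              # position not reachable
        for w in words:
            if w.start == pos and best[pos] + w.cost < best[w.end]:
                best[w.end] = best[pos] + w.cost
                back[w.end] = w
    if best[length] == inf:
        return []                 # no path covers the whole sentence
    path, pos = [], length
    while pos > 0:                # follow back-pointers from the sentence end
        path.append(back[pos])
        pos = back[pos].start
    return path[::-1]


# Invented lattice for 「今日は晴れです。」 (8 characters).
lattice = [
    Word(0, 2, "今日", "キョウ", "noun", 1.0),
    Word(0, 2, "今日", "コンニチ", "noun", 2.0),
    Word(0, 1, "今", "イマ", "noun", 2.0),
    Word(1, 2, "日", "ヒ", "noun", 2.0),
    Word(2, 3, "は", "ワ", "particle", 0.5),
    Word(3, 5, "晴れ", "ハレ", "noun", 1.0),
    Word(5, 7, "です", "デス", "auxiliary", 0.5),
    Word(7, 8, "。", "", "symbol", 0.1),
]
path = best_word_sequence(lattice, 8)
```

Because every word ends after it starts, processing positions from left to right guarantees that the cost at a position is final before any word starting there is extended.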

The subsequent word integration processing unit 303 groups together morpheme candidates that can be integrated. "Candidates that can be integrated" are words of the same length that start at the same position and are interchangeable in terms of part of speech. In the example of FIG. 5, two word candidates are output for the character string 「今日」: word candidate 501, read 「キョウ (kyou)」, and word candidate 502, read 「コンニチ (konnichi)」. Of these, word candidate 501, 「キョウ (kyou)」, is output in the optimum word string as the more appropriate reading, but there are also cases in which word candidate 502, 「コンニチ (konnichi)」, is the correct reading. These two words have the same length, start at the same position, and are both nouns, so they are equivalent as morpheme candidates and can therefore be integrated. The word integration processing unit 303 integrates the morpheme candidates into the optimum word string shown in FIG. 5 and converts it into the word string shown in FIG. 6. In FIG. 6, both the reading 「キョウ (kyou)」 and the reading 「コンニチ (konnichi)」 are associated with the character string 「今日」.
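The integration rule described above (same start position, same length, same part of speech) can be sketched as follows. The `Word` tuple and the `integrate` function are hypothetical names used only for this illustration, and the output format, a list of (surface, readings) pairs with the selected reading first, is an assumption.

```python
from typing import NamedTuple


class Word(NamedTuple):
    start: int
    end: int
    surface: str
    reading: str
    pos: str


def integrate(best_path, lattice):
    """Attach to each selected word the readings of interchangeable candidates."""
    merged = []
    for chosen in best_path:
        readings = [chosen.reading]          # selected reading comes first
        for cand in lattice:
            same_slot = (cand.start, cand.end, cand.pos) == (
                chosen.start, chosen.end, chosen.pos)
            if same_slot and cand.reading not in readings:
                readings.append(cand.reading)
        merged.append((chosen.surface, readings))
    return merged


kyou = Word(0, 2, "今日", "キョウ", "noun")
konnichi = Word(0, 2, "今日", "コンニチ", "noun")
wa = Word(2, 3, "は", "ワ", "particle")
merged = integrate([kyou, wa], [kyou, konnichi, wa])
```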

The subsequent reading disambiguation processing unit 304 determines the most appropriate reading for each word in the optimum word string that has a plurality of readings. This processing selects the appropriate reading from among the candidates using information such as a co-occurrence probability database, which collects information on which words tend to occur together. In the example of FIG. 6, the co-occurrence probability database contains the information that the word 「今日 (キョウ: kyou)」 and the word 「晴れ (ハレ: hare)」 tend to co-occur, and so the reading 「キョウ (kyou)」 is selected. Through the above processing, the reading selection processing unit 203 outputs the optimum morpheme string (FIG. 7), a word division result in which the readings have also been determined.
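One simple way to use co-occurrence information for this selection is sketched below. The scoring rule (summing co-occurrence values over the other words in the sentence), the dictionary layout standing in for the co-occurrence probability database, and the probability values are all assumptions made for illustration.

```python
def choose_reading(surface, readings, context, cooccurrence):
    """Pick the reading whose co-occurrence score with the context is highest.

    cooccurrence maps (surface, reading, context_word) to a probability.
    On a tie, max() keeps the earliest reading in the list.
    """
    def score(reading):
        return sum(cooccurrence.get((surface, reading, other), 0.0)
                   for other in context)

    return max(readings, key=score)


# Invented values standing in for the co-occurrence probability database.
cooc = {
    ("今日", "キョウ", "晴れ"): 0.6,
    ("今日", "コンニチ", "晴れ"): 0.1,
}
selected = choose_reading("今日", ["キョウ", "コンニチ"], ["は", "晴れ", "です"], cooc)
```

Listing the reading from the optimum word string first means that it is retained whenever the database offers no evidence either way.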

The processing described so far is the flow of the conventional reading selection processing unit 203. This embodiment is characterized by the addition of the word reading correction candidate information generation processing unit 306 of FIG. 3, which corresponds to the word reading correction candidate information generation processing unit 207 of FIG. 2A. The word reading correction candidate information generation processing unit 306 generates word reading correction candidate information 307 about the readings of words, using the optimum word string information with multiple readings output from the word integration processing unit 303 and the most appropriate reading information determined by the reading disambiguation processing unit 304.

The processing flow of the word reading correction candidate information generation processing unit 306 is described below with reference to FIG. 8. The unit receives as input the integrated optimum word string information (FIG. 6) output from the word integration processing unit 303 and the optimum word string information (FIG. 7) output from the reading disambiguation processing unit 304.

When the processing starts (S801), initialization is performed (S802), and it is then determined whether i-th word information exists (S803). If it exists, the i-th word information is acquired (S804). In the example of FIGS. 6 and 7, the first word is 「今日」. If there is no i-th word information, the character string S is output as it is (S809).

After the i-th word information has been acquired (S804), it is determined from the integrated optimum word string information whether a plurality of readings are set for this word (S805). If there are not, the written form of the word is appended unchanged to the variable s in step S807, i is incremented, and the processing moves to the next loop iteration (S808). If there are a plurality of readings, a word reading correction candidate designation tag is added (S806). In the example of FIG. 6, the first word 「今日」 has two readings, 「キョウ」 and 「コンニチ」, so a word reading correction candidate designation tag is added for the word 「今日」 in step S806. After the tag has been added, i is incremented and the processing moves to the next loop iteration (S808).

Various forms are conceivable for the word reading correction candidate designation tag. One example is <WORD YOMI="キョウ (kyou)" OTHER="コンニチ (konnichi)">今日</WORD>. This tag indicates that the reading 「キョウ (kyou)」 has been selected for the character string 「今日」 and that 「コンニチ (konnichi)」 exists as another reading candidate. FIG. 9 is an example of the word reading correction candidate information 307 output as the result of the processing by the word reading correction candidate information generation processing unit 306. This character string form is only one embodiment; the information can also be expressed in various other forms, such as a table or a list.
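The loop of steps S801 to S809 can be sketched as follows, with the step numbers noted in comments. The input layout (a list of (surface, selected reading, all readings) tuples) is an assumption, as is joining several alternative readings with commas in the OTHER attribute; the readings are given in katakana only to keep the example short.

```python
def tag_reading_candidates(words):
    """words: list of (surface, selected_reading, all_readings) tuples."""
    s = ""                                     # S802: initialization
    i = 0
    while i < len(words):                      # S803: is there an i-th word?
        surface, selected, readings = words[i]  # S804: acquire the i-th word
        if len(readings) > 1:                  # S805: plural readings set?
            others = [r for r in readings if r != selected]
            # S806: add the word reading correction candidate designation tag
            s += '<WORD YOMI="{}" OTHER="{}">{}</WORD>'.format(
                selected, ",".join(others), surface)
        else:
            s += surface                       # S807: append the surface as is
        i += 1                                 # S808: move to the next word
    return s                                   # S809: output the string


words = [
    ("今日", "キョウ", ["キョウ", "コンニチ"]),
    ("は", "ワ", ["ワ"]),
    ("晴れ", "ハレ", ["ハレ"]),
    ("です", "デス", ["デス"]),
    ("。", "", [""]),
]
tagged = tag_reading_candidates(words)
```

For this input, only 「今日」 is wrapped in a tag and the remaining words are copied through unchanged.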

Following the reading selection processing unit 203, processing is executed by the break determination processing unit 204, which determines where pauses are to be placed in the read-out text. Various algorithms can be used to determine the breaks; many of them are disclosed in known documents, so they are not described here. Examples include inserting a break only where the character 「、」 (the Japanese comma) occurs, or additionally inserting a break after particular parts of speech (for example, the binding particle 「は」). These are examples for Japanese, but breaks can of course also be determined for other languages, in accordance with their grammar and expressions.

Here, it is assumed that the break determination processing unit 204 has output break position candidates and the break insertion probability at each of them. FIG. 10 shows an example of how the break position candidates and break insertion probabilities can be represented. Based on break position candidate information such as that in FIG. 10, the break determination processing unit 204 determines the break positions in the reading designation intermediate representation and outputs them. In this determination, for example, positions whose break insertion probability is equal to or greater than a predetermined threshold can be determined to be break positions. FIG. 11 is an output example of the break positions in the reading designation intermediate representation.

以上で説明した処理は従来の区切り決定処理部２０４の処理の流れであるが、本実施例では図２Ａの区切り修正候補情報生成処理部２０８を追加していることを特徴とする。この区切り修正候補情報生成処理部２０８は、図１０に示すような区切り位置候補情報、および区切り決定処理部２０４から出力された区切り位置情報（図１１）を用いて、区切り位置候補への区切り修正候補タグを入力テキストに対して追加する。その処理フローは、図８に示す単語読み修正候補タグの挿入処理フローと同様であるため、説明を省略する。この処理の結果、例えば図１２に示すような文字列情報が出力される。ここで、ＰＲＯＢ属性は内部で計算された区切り挿入確率を示し、ＭＯＤＥ属性は合成音声において実際に区切りが出力されたかどうかを示す。したがって、図１２の例では、「今日は」の後に、区切りが「有」と判定され、その区切り挿入確率が「０．８」であることが示されている。同様に、図１２では、「晴れ」の後に、区切りが「無」と判定され、その区切り挿入確率が「０．２」であることが示されている。 The processing described above is the flow of the conventional delimiter determination processing unit 204; this embodiment is characterized by the addition of the delimiter correction candidate information generation processing unit 208 of FIG. 2A. Using the delimiter position candidate information shown in FIG. 10 and the delimiter position information (FIG. 11) output from the delimiter determination processing unit 204, the unit 208 adds delimiter correction candidate tags for the delimiter position candidates to the input text. The processing flow is the same as the flow for inserting word reading correction candidate tags shown in FIG. 8, so its description is omitted. As a result, character string information such as that shown in FIG. 12 is output. Here, the PROB attribute indicates the internally calculated delimiter insertion probability, and the MODE attribute indicates whether a delimiter was actually output in the synthesized speech. In the example of FIG. 12, a delimiter is judged to be present (「有」) after 「今日は」 (“today”), with an insertion probability of 0.8. Similarly, a delimiter is judged to be absent (「無」) after 「晴れ」 (“sunny”), with an insertion probability of 0.2.
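
As a rough sketch of how such a tagged character string might be produced: FIG. 12 itself is not reproduced in this text, so the tag name `PAUSE`, the self-closing syntax, the `ON`/`OFF` values for MODE, and the trailing word 「です」 are all assumptions made for illustration. Only the PROB and MODE attribute names and the values 0.8 and 0.2 come from the description.

```python
def add_break_tags(words, break_info):
    """Insert a hypothetical <PAUSE .../> tag after each candidate word.

    words:      list of surface strings, in reading order
    break_info: {word_index: (probability, inserted)} for delimiter candidates
    """
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in break_info:
            prob, inserted = break_info[i]
            mode = "ON" if inserted else "OFF"   # assumed MODE encoding
            out.append(f'<PAUSE PROB="{prob}" MODE="{mode}"/>')
    return "".join(out)


tagged = add_break_tags(
    ["今日は", "晴れ", "です"],
    {0: (0.8, True), 1: (0.2, False)},
)
print(tagged)
```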

区切り決定処理部２０４に続いて、結合決定処理部２０５において処理が実行される。ここでの処理は、日本語におけるアクセント結合が生じるかどうかを判定し、それに基づいてアクセント位置の移動を行う。「アクセント結合」とは、例えば、「音声（オ’ンセー）」と「合成（ゴーセー）」という二つの単語（それぞれアクセント型は１型と０型）が連続して「音声合成」という文字列で現れた場合に、元のアクセント位置のまま、「オ’ンセーゴーセー」と発音されるのではなく、「オンセーゴ’ーセー」とアクセント位置が移動する現象を指す。このアクセント結合に関する処理も従来の音声合成処理で行われているものであり、様々な公知文献でその処理形態は開示されている。アクセント結合は二つのアクセント句の間で生じ、生じた場合にはその結果はアクセント結合規則から予測可能なものである。そのため、結合決定処理部２０５は、アクセント句それぞれについて、その後ろのアクセント句と結合するかどうかのフラグ情報を判定する。 Subsequent to the delimiter determination processing unit 204, processing is executed in the combination determination processing unit 205. This processing determines whether accent combination occurs in Japanese and moves the accent position accordingly. “Accent combination” refers to the following phenomenon: when two words such as 「音声」 (オ’ンセー, “speech”) and 「合成」 (ゴーセー, “synthesis”), whose accent types are type 1 and type 0 respectively, appear consecutively as the character string 「音声合成」 (“speech synthesis”), the result is not pronounced with the original accent position as 「オ’ンセーゴーセー」; instead the accent position moves, giving 「オンセーゴ’ーセー」 (the apostrophe marks the accent position). Processing for accent combination is also performed in conventional speech synthesis, and its forms are disclosed in various publicly known documents. Accent combination occurs between two accent phrases, and when it occurs, the result is predictable from the accent combination rules. The combination determination processing unit 205 therefore determines, for each accent phrase, flag information indicating whether it combines with the accent phrase that follows it.

以上で説明した処理は従来の結合決定処理部２０５の流れであるが、本実施例では図２Ａの結合修正候補情報生成処理部２１１を追加していることを特徴とする。この結合修正候補情報生成処理部２１１は、図１３に示すようなアクセント句結合フラグ情報を用いて、アクセント結合候補箇所へのアクセント結合候補タグを入力テキストに対して追加する。図１３は、結合決定処理部２０５において決定された判定結果の例を示す。図１３では、各アクセント句について、アクセント結合フラグ１３０１及びアクセント結合候補箇所フラグ１３０２が関連付けられている。アクセント結合フラグ１３０１は、結合決定処理部２０５において決定されたアクセント結合が起きる箇所を示すフラグであり、アクセント結合候補箇所フラグ１３０２は、アクセント結合が起きるうる候補を示すフラグである。アクセント句「音声」の箇所は、アクセント結合が起きると判定され、アクセント句「高品質」及び「音声」の箇所は、アクセント結合が起きるうる候補として判定されている。 The processing described above is the flow of the conventional combination determination processing unit 205; this embodiment is characterized by the addition of the combination correction candidate information generation processing unit 211 of FIG. 2A. Using accent phrase combination flag information such as that shown in FIG. 13, the unit 211 adds accent combination candidate tags to the input text at the accent combination candidate locations. FIG. 13 shows an example of the determination results of the combination determination processing unit 205. In FIG. 13, an accent combination flag 1301 and an accent combination candidate location flag 1302 are associated with each accent phrase. The accent combination flag 1301 indicates a location where the combination determination processing unit 205 determined that accent combination occurs, and the accent combination candidate location flag 1302 indicates a candidate location where accent combination can occur. In this example, accent combination is determined to occur at the accent phrase 「音声」 (“speech”), and the accent phrases 「高品質」 (“high quality”) and 「音声」 are determined to be candidates where accent combination can occur.

また、結合修正候補情報生成処理部２１１の処理フローは、図８に示す単語読み修正候補タグの挿入処理フローと同様であるため、説明は省略する。この処理の結果、例えば図１４に示すような文字列情報が出力される。ここで、この「ＡＣＣ」タグが追加された箇所はアクセント結合しうる候補の箇所であり、そのうちＭＯＤＥ属性が「ＯＮ」の箇所は合成音声において実際にアクセント結合した箇所を示す。 The processing flow of the combination correction candidate information generation processing unit 211 is the same as the flow for inserting word reading correction candidate tags shown in FIG. 8, so its description is omitted. As a result of this processing, character string information such as that shown in FIG. 14 is output. Here, a location where an “ACC” tag has been added is a candidate location where accent combination can occur, and among these, a location whose MODE attribute is “ON” is one where accent combination actually occurred in the synthesized speech.

単語読み修正候補情報生成処理部２０７、区切り修正候補情報生成処理部２０８、および結合修正候補情報生成処理部２１１からそれぞれ出力情報が出されると、それらは読み修正候補情報統合処理部２０９に入力される。読み修正候補情報統合処理部２０９は、それら出力された３つのタグ付き文字列の統合処理を実行する。以上の３つの処理から出力されたタグ付き文字列は、同じ入力文字列に対してタグ情報の追加がされたものである。そのため、読み修正候補情報統合処理部２０９では、同じ一つの入力文字列に対して上記３つのタグ情報をすべて埋め込む統合処理を実施する。この統合処理の目的は、タグ付き文字列の受け手（音声合成サービスのユーザ）側でタグ付き文字列の解析処理を容易にするためであり、冗長性に問題がなければ統合処理は必須ではない。 When output information is produced by the word reading correction candidate information generation processing unit 207, the delimiter correction candidate information generation processing unit 208, and the combination correction candidate information generation processing unit 211, it is input to the reading correction candidate information integration processing unit 209. The reading correction candidate information integration processing unit 209 integrates the three tagged character strings. The tagged character strings output from these three processes are all obtained by adding tag information to the same input character string. The unit 209 therefore performs integration processing that embeds all three kinds of tag information in the one input character string. The purpose of this integration is to make analysis of the tagged character string easier for its recipient (the user of the speech synthesis service); if the redundancy is not a problem, the integration is not essential.

この統合処理では開始タグと終了タグが入れ子構造になるようにしなければならない。読み生成処理の結果、タグが埋め込まれる位置はほぼ単語（形態素）単位であるため、単語単位に設定される開始・終了タグが入れ子構造にできない／ならない場合はほとんどない。そのような場合でのタグの統合処理は、スタックを用いた簡単なタグ対応付けを考慮するだけで実現できる。図１５は、読み修正候補情報統合処理部２０９から出力された読み修正候補情報の例を示す。 In this integration process, the start tag and the end tag must be nested. As a result of the reading generation process, the position where the tag is embedded is almost in units of words (morphemes). Therefore, there are almost no cases where the start / end tags set in units of words cannot be nested. Tag integration processing in such a case can be realized only by considering simple tag association using a stack. FIG. 15 illustrates an example of reading correction candidate information output from the reading correction candidate information integration processing unit 209.
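
One way to realize the stack-based integration mentioned above is sketched below. It is an illustrative assumption rather than the patent's algorithm: tags from the three sources are represented as spans over word indices, attributes are omitted, and the tag name `WORD` is hypothetical (only `ACC` appears in the text). The stack guarantees that end tags are emitted in the reverse order of their start tags and detects spans that cross and therefore cannot be nested.

```python
def merge_tags(words, spans):
    """Merge tag spans into one properly nested tagged string.

    words: list of surface strings
    spans: list of (start, end, tag) over word indices, end exclusive
    """
    opens = {}
    # Longer spans open first so that they enclose shorter ones.
    for start, end, tag in sorted(spans, key=lambda s: (s[0], -s[1])):
        opens.setdefault(start, []).append((end, tag))

    out, stack = [], []
    for pos in range(len(words) + 1):
        # Close every tag that ends here, innermost first.
        while stack and stack[-1][0] == pos:
            out.append(f"</{stack.pop()[1]}>")
        # A tag that should have closed but is buried in the stack means crossing spans.
        if any(end <= pos for end, _ in stack):
            raise ValueError("tags cross and cannot be nested")
        for end, tag in opens.get(pos, []):
            out.append(f"<{tag}>")
            stack.append((end, tag))
        if pos < len(words):
            out.append(words[pos])
    if stack:
        raise ValueError("span extends past the end of the text")
    return "".join(out)


merged = merge_tags(["音声", "合成"], [(0, 1, "WORD"), (0, 2, "ACC")])
print(merged)
```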

以上のようにして、本実施例による音声合成システムでは、読み上げテキストが入力されると、合成音声に加えて、修正可能箇所および修正可能候補を記述したメタ情報を出力する。これにより、このメタ情報を受け取った利用者側のシステムでは、この情報を解析することにより、ユーザによる修正の指示が可能になる。また、利用者側のシステムが、自動的に修正候補を本音声合成システムに通知することで、別のタイプの合成音声を取得してユーザに提示することが可能となる。 As described above, in the speech synthesis system according to the present embodiment, when read-out text is input, meta information describing a correctable portion and a correctable candidate is output in addition to the synthesized speech. As a result, the user's system that has received this meta information can instruct the user to make corrections by analyzing this information. In addition, the user-side system automatically notifies the speech synthesis system of correction candidates, so that another type of synthesized speech can be acquired and presented to the user.

また上記の説明では、図１の読み生成処理部１０２からの情報を用いた読み修正候補情報の生成方法について詳細を述べたが、韻律生成処理部１０３や波形生成処理部１０４でも同様の処理が可能である。韻律生成処理部１０３や波形生成処理部１０４では、音声合成における内部情報を決定している。例えば韻律生成処理部１０３では、読み上げテキストの各部分（文やフレーズ（呼気段落）やアクセント句や単語など）に対する音の高さ（ピッチ）のパターン、話速、音素基本周波数、音素継続長などを決定している。韻律修正候補情報生成処理部１０７は、これらの情報について韻律修正候補を出力することができる。 The above description detailed the method of generating reading correction candidate information using information from the reading generation processing unit 102 of FIG. 1, but similar processing is possible in the prosody generation processing unit 103 and the waveform generation processing unit 104. These units determine internal information used in speech synthesis. For example, the prosody generation processing unit 103 determines the pitch pattern, speech speed, phoneme fundamental frequency, phoneme duration, and the like for each part of the read-out text (sentence, phrase (breath group), accent phrase, word, and so on). The prosody correction candidate information generation processing unit 107 can output prosody correction candidates for these pieces of information.

図２Ｂは、韻律生成処理部１０３、および韻律修正候補情報生成処理部１０７の内部処理を説明するものである。韻律修正候補情報生成処理部１０７は、継続長修正候補情報生成処理部２２１と、基本周波数修正候補情報生成処理部２２２と、高さ修正候補情報生成処理部２２３と、話速修正候補情報生成処理部２２４と、韻律修正候補情報統合処理部２２５とを備える。 FIG. 2B explains the internal processing of the prosody generation processing unit 103 and the prosody modification candidate information generation processing unit 107. The prosody correction candidate information generation processing unit 107 includes a duration correction candidate information generation processing unit 221, a fundamental frequency correction candidate information generation processing unit 222, a height correction candidate information generation processing unit 223, and a speech speed correction candidate information generation process. Unit 224 and prosodic correction candidate information integration processing unit 225.

継続長修正候補情報生成処理部２２１は、合成音声の音素継続長に関する修正情報である継続長修正候補情報を生成する。また、基本周波数修正候補情報生成処理部２２２は、合成音声の音素基本周波数に関する修正情報である基本周波数修正候補情報を生成する。また、高さ修正候補情報生成処理部２２３は、合成音声の部分の音の高さのパターンに関する修正情報である高さ修正候補情報を生成する。さらに、話速修正候補情報生成処理部２２４は、合成音声の部分の話速に関する修正情報である話速修正候補情報を生成する。韻律修正候補情報統合処理部２２５は、これらの処理部２２１〜２２４から出力された修正候補情報を統合し、韻律修正候補情報２２６を出力する。 The duration correction candidate information generation processing unit 221 generates duration correction candidate information, which is correction information on the phoneme durations of the synthesized speech. The fundamental frequency correction candidate information generation processing unit 222 generates fundamental frequency correction candidate information, which is correction information on the phoneme fundamental frequencies of the synthesized speech. The height correction candidate information generation processing unit 223 generates height correction candidate information, which is correction information on the pitch pattern of a part of the synthesized speech. Furthermore, the speech speed correction candidate information generation processing unit 224 generates speech speed correction candidate information, which is correction information on the speech speed of a part of the synthesized speech. The prosodic correction candidate information integration processing unit 225 integrates the correction candidate information output from the processing units 221 to 224 and outputs prosodic correction candidate information 226.

韻律修正候補情報生成処理部１０７は、読み上げテキストに対して、例えばＰＩＴＣＨタグやＳＰＥＥＤタグを追加することにより、韻律修正候補情報２２６を示すことができる。図２２の例では、読み上げテキストの各部分に対する音の高さ（ピッチ）のパターンの修正候補がＰＩＴＣＨタグとして示され、読み上げテキストの各部分に対する話速の修正候補がＳＰＥＥＤタグとして示されている。 The prosody correction candidate information generation processing unit 107 can represent the prosodic correction candidate information 226 by adding, for example, PITCH tags and SPEED tags to the read-out text. In the example of FIG. 22, correction candidates for the pitch pattern of each part of the read-out text are shown as PITCH tags, and correction candidates for the speech speed of each part are shown as SPEED tags.

また、波形生成処理部１０４では、合成音声で使用される素片（音声部品）が決定される。波形修正候補情報生成処理部１０８は、合成音声で使用された素片に関する修正情報である素片修正候補情報を生成する。例えば、波形修正候補情報生成処理部１０８は、音声合成処理で実際に使用された音声部品データのＩＤを示した上で、代わりに使用することもできた別の音声部品データＩＤを列挙するなどすることで波形修正候補情報を出力することもできる。 Further, the waveform generation processing unit 104 determines a segment (speech component) used in the synthesized speech. The waveform correction candidate information generation processing unit 108 generates segment correction candidate information that is correction information related to the segments used in the synthesized speech. For example, the waveform correction candidate information generation processing unit 108 lists the IDs of voice component data actually used in the voice synthesis process, and lists other voice component data IDs that could be used instead. Thus, the waveform correction candidate information can be output.

図２２は、最終的に修正候補情報統合処理部１０９から出力された修正候補情報１１０の例を示す。修正候補情報統合処理部１０９は、読み修正候補情報生成処理部１０６と、韻律修正候補情報生成処理部１０７と、波形修正候補情報生成処理部１０８とから出力された情報を統合する。 FIG. 22 shows an example of the correction candidate information 110 that is finally output from the correction candidate information integration processing unit 109. The correction candidate information integration processing unit 109 integrates information output from the reading correction candidate information generation processing unit 106, the prosody correction candidate information generation processing unit 107, and the waveform correction candidate information generation processing unit 108.

［第１実施例］ [First embodiment]
第１実施例では、図１の基本構成を用いて、音声合成結果に対する修正及び改善指示要求を受け付ける音声合成システムについて説明する。本実施例では、本発明の基本構成を用いたサーバ・クライアント構成での音声合成サービスを想定する。サーバ側は本発明の音声合成手法を用いた音声合成装置を構成し、この音声合成装置は、クライアント（ユーザ）側から要求された読み上げテキストに対して、合成音声を生成してそれを送信するサービスを行う。この時、本発明の基本構成で生成される読み上げテキストのメタ情報を用いることで、クライアント側に対して、生成された合成音声に対して、どのような修正・改善が可能かを示すことができる。 In the first embodiment, a speech synthesis system that accepts correction and improvement instruction requests for speech synthesis results is described using the basic configuration of FIG. 1. This embodiment assumes a speech synthesis service in a server-client configuration using the basic configuration of the present invention. The server side constitutes a speech synthesizer using the speech synthesis method of the present invention; this speech synthesizer provides a service that generates synthesized speech for the read-out text requested by the client (user) side and transmits it. By using the meta information of the read-out text generated by the basic configuration of the present invention, the server can show the client side what corrections and improvements are possible for the generated synthesized speech.

クライアント側では、受信したメタ情報をもとにして、クライアント側システム、またはユーザによる修正及び改善指示要求を作成し、それをサーバ側システムへと返送する。サーバ側システムでは、その修正及び改善指示要求を処理することにより、クライアント側が意図した箇所を意図したように修正された合成音声を再度生成して、クライアント側にそれを返送することができる。また、サーバ側システムでは、その修正及び改善指示の内容が、次回以降の音声合成処理でも反映されるように、音声合成用データや音声合成用パラメータの内容を更新することができる。 On the client side, based on the received meta information, a request for correction and improvement instruction by the client side system or the user is created and returned to the server side system. In the server side system, by processing the correction and improvement instruction request, it is possible to generate again the synthesized speech modified as intended by the client side and return it to the client side. Further, in the server side system, the contents of the speech synthesis data and the speech synthesis parameters can be updated so that the contents of the correction and improvement instructions are reflected in the subsequent speech synthesis processing.

以上の処理により、第１実施例では、サーバ・クライアント構成での音声合成サービスにおいて、クライアント側が指示した修正及び改善指示を反映して、修正された合成音声を生成できるとともに、その修正及び改善指示の内容を今後の音声合成処理にも反映できる音声合成サービスを実現することができる。 Through the above processing, the first embodiment realizes a speech synthesis service in a server-client configuration that can generate corrected synthesized speech reflecting the correction and improvement instructions given by the client side, and can also reflect the content of those instructions in future speech synthesis processing.

以下、図１６Ａ及び図１６Ｂを用いて、本実施列の構成について説明する。図１６Ａ及び図１６Ｂは、本実施例のサーバ側システム（音声合成サーバシステム）の構成について説明する図である。音声合成サーバシステムは、通常の音声合成要求時と、修正及び改善指示要求時とで処理内容が異なる。 Hereinafter, the configuration of this embodiment will be described with reference to FIGS. 16A and 16B. 16A and 16B are diagrams illustrating the configuration of the server-side system (speech synthesis server system) of this embodiment. The speech synthesis server system has different processing contents when a normal speech synthesis is requested and when a correction and improvement instruction is requested.

図１６Ａは、通常の音声合成要求時の処理構成を示す。音声合成サーバシステムは、音声合成用の構成要素として、音声合成コンテキスト生成処理部１６０２と、音声合成データ格納装置１６１０と、コンテキスト格納装置１６１１と、メタ情報付き音声合成処理部１６０３と、解析単位別ＩＤ設定処理部１６０５と、サービス応答生成処理部１６０８とを備える。 FIG. 16A shows the processing configuration for a normal speech synthesis request. As components for speech synthesis, the speech synthesis server system includes a speech synthesis context generation processing unit 1602, a speech synthesis data storage device 1610, a context storage device 1611, a speech synthesis processing unit with meta information 1603, an ID setting processing unit 1605 for each analysis unit, and a service response generation processing unit 1608.

以下、読み上げテキストが入力されて合成音声およびそのメタ情報を含んだサービス応答が返送されるまでの処理の流れを説明する。まず、ユーザからのサービス要求情報内の読み上げテキスト１６０１が入力されると、音声合成コンテキスト生成処理部１６０２によって処理が実行される。音声合成コンテキスト生成処理部１６０２は、音声合成要求を行ったユーザ（クライアント）が設定している音声合成パラメータを図示しないデータベースから取り出して、音声合成コンテキスト情報を生成する。生成された音声合成コンテキスト情報はコンテキスト格納装置１６１１に格納される。ユーザの判定は、ユーザＩＤをもとに行うことができる。このユーザＩＤは、読み上げテキスト１６０１が格納されていたサービス要求情報の中に埋め込まれている場合が多い。 Hereinafter, the flow of processing from when the read-out text is input until the service response including the synthesized speech and its meta information is returned will be described. First, when the read-out text 1601 in the service request information from the user is input, the speech synthesis context generation processing unit 1602 executes the process. The speech synthesis context generation processing unit 1602 extracts speech synthesis parameters set by the user (client) who has made the speech synthesis request from a database (not shown), and generates speech synthesis context information. The generated speech synthesis context information is stored in the context storage device 1611. The user can be determined based on the user ID. This user ID is often embedded in the service request information in which the read-out text 1601 is stored.

ここで、ユーザごとの音声合成パラメータが格納されているデータベースは、通常のリレーショナルデータベースをそのまま用いることができる。このデータベースに格納される情報には、キー情報としてのユーザＩＤ、合成音声の話者ＩＤ、音声話速、音声高さ、音量などが含まれる。さらには、音声合成システムによっては、記号を読む／読まないという指定や、数字列の桁読み／棒読みという区別をパラメータとして指定できるものもあり、このような音声合成システムを用いている場合は、これらのパラメータもユーザごとに取得される音声合成パラメータの要素となる。また、別の音声合成システムでは読み上げテキストの文脈を判定して読み方や抑揚などを変更できるものも存在する。例えば、ある文章Ａが単独で読み上げられる場合の読み方・抑揚と、同じ文章Ａが別の文章Ｂの後で読み上げられる場合の読み方・抑揚とが異なることがある。このような音声合成システムの場合では、現時点の文脈を特定するための情報を出力できる。音声合成コンテキスト生成処理部１６０２は、このような文脈特定情報も併せて音声合成用パラメータとして出力してもよい。 Here, a normal relational database can be used as it is as the database storing the speech synthesis parameters for each user. The information stored in this database includes a user ID as key information, a speaker ID of synthesized speech, a speech speed, a speech height, a volume, and the like. Furthermore, depending on the speech synthesis system, there is a specification that the symbol can be read / not read and the distinction between digit reading / bar reading of the numeric string can be specified as a parameter. When such a speech synthesis system is used, These parameters are also elements of speech synthesis parameters acquired for each user. There are other speech synthesis systems that can determine the context of the text to be read and change the reading or inflection. For example, the reading / inflection when a certain sentence A is read out separately may differ from the reading / inflection when the same sentence A is read out after another sentence B. In the case of such a speech synthesis system, information for specifying the current context can be output. The speech synthesis context generation processing unit 1602 may also output such context specifying information as a speech synthesis parameter.

音声合成コンテキスト生成処理部１６０２は、さらに、今回の音声合成要求を特定するための音声合成要求ＩＤ情報を生成する。音声合成サーバシステムは複数のユーザに対して同時にサービス提供しているため、この情報をもとにしてどのユーザに対していつ生成した合成音声かを識別する必要がある。この処理については特別な構成は必要ない。音声合成サーバシステムが、音声合成要求を受けたときに一意のシーケンス番号を各合成音声に付与すればよい。こうして、音声合成コンテキスト生成処理部１６０２は、ユーザＩＤをもとにして、今回の合成音声を特定するための音声合成要求ＩＤ、および音声合成パラメータセットを生成して、それらをコンテキスト格納装置１６１１に格納する。 The speech synthesis context generation processing unit 1602 further generates speech synthesis request ID information for identifying the current speech synthesis request. Because the speech synthesis server system serves multiple users simultaneously, this information is needed to identify for which user and when a given synthesized speech was generated. No special configuration is required for this processing: the speech synthesis server system may simply assign a unique sequence number to each synthesized speech when it receives a speech synthesis request. In this way, based on the user ID, the speech synthesis context generation processing unit 1602 generates a speech synthesis request ID for identifying the current synthesized speech and a speech synthesis parameter set, and stores them in the context storage device 1611.
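
The sequence-number scheme could look roughly like the sketch below. The class name, the in-memory dictionary standing in for the context storage device 1611, the `R` prefix with eight digits, and the parameter names are all assumptions chosen for illustration; a real service would use persistent storage and its own ID format.

```python
import itertools


class ContextStore:
    """In-memory stand-in (an assumption) for the context storage device 1611."""

    def __init__(self):
        self._seq = itertools.count(1)   # unique sequence numbers
        self._contexts = {}

    def register(self, user_id, params):
        """Issue a request ID and store the user's parameter set under it."""
        request_id = f"R{next(self._seq):08d}"   # hypothetical ID format
        self._contexts[request_id] = {"user_id": user_id, "params": dict(params)}
        return request_id

    def lookup(self, request_id):
        return self._contexts[request_id]


store = ContextStore()
rid = store.register("user42", {"speaker_id": 3, "speed": 1.0, "pitch": 1.0, "volume": 0.8})
print(rid, store.lookup(rid))
```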

続いて、読み上げテキスト１６０１は、メタ情報付き音声合成処理部１６０３に入力される。メタ情報付き音声合成処理部１６０３は、図１の基本構成で説明したメタ情報付き音声合成処理を実装したものである。この処理の結果、読み上げテキストを読み上げた合成音声１６０６、および、読み上げ文の解析結果である文解析メタ情報１６０４が出力される。合成音声１６０６は、図１で説明した読み上げ音声１０５に相当する。また、文解析メタ情報１６０４は、図１で説明した修正候補情報１１０に相当する。ここでの処理は基本構成ですでに説明しているので省略する。 Subsequently, the read-out text 1601 is input to the speech synthesis processing unit 1603 with meta information. The speech synthesis processing unit with meta information 1603 is implemented with the speech synthesis processing with meta information described in the basic configuration of FIG. As a result of this processing, the synthesized speech 1606 read out from the read-out text and the sentence analysis meta information 1604 that is the analysis result of the read-out sentence are output. The synthesized voice 1606 corresponds to the reading voice 105 described with reference to FIG. The sentence analysis meta information 1604 corresponds to the correction candidate information 110 described with reference to FIG. Since the processing here has already been described in the basic configuration, it will be omitted.

なお、メタ情報付き音声合成処理部１６０３は、音声合成データ格納装置１６１０から音声合成データを取得して、メタ情報付き音声合成処理を実行する。この音声合成データには、読み生成処理部１０２が利用する辞書データ、韻律生成処理部１０３が利用する韻律データ、波形生成処理部１０４が利用する波形データなどが少なくとも含まれている。 Note that the speech synthesis processing unit with meta information 1603 acquires speech synthesis data from the speech synthesis data storage device 1610 and executes speech synthesis processing with meta information. This speech synthesis data includes at least dictionary data used by the reading generation processing unit 102, prosody data used by the prosody generation processing unit 103, waveform data used by the waveform generation processing unit 104, and the like.

文解析メタ情報１６０４は、続いて解析単位別ＩＤ設定処理部１６０５に入力される。解析単位別ＩＤ設定処理部１６０５は、基本構成で説明した文解析メタ情報に対して、解析単位別にＩＤ（識別子）を設定する処理を行う。ここで言う解析単位とは、読み上げテキスト１６０１を構成する文、文を構成するフレーズ（ポーズ間の一息で読まれる単位）、フレーズを構成するアクセント句（一つのアクセント核を持つ抑揚の単位）、アクセント句を構成する単語（形態素）、文中に含まれるポーズなどを指す。このように読み上げテキスト１６０１を構成する様々な単位にＩＤを付加しておくことで、クライアント側からの修正及び改善指示要求がどの箇所に対するものなのかを特定しやすくする。この解析単位別ＩＤの付与方法としては、例えば音声合成要求ＩＤに対して解析単位ごとのＩＤを順番に付加して生成する方法などが考えられる。 The sentence analysis meta information 1604 is then input to the ID setting processing unit 1605 for each analysis unit, which sets an ID (identifier) for each analysis unit in the sentence analysis meta information described in the basic configuration. The analysis units referred to here are the sentences that make up the read-out text 1601, the phrases that make up a sentence (units read in one breath between pauses), the accent phrases that make up a phrase (intonation units having one accent nucleus), the words (morphemes) that make up an accent phrase, the pauses contained in a sentence, and so on. Adding IDs to the various units of the read-out text 1601 in this way makes it easier to identify which location a correction and improvement instruction request from the client side refers to. One possible way to assign the analysis-unit IDs is to generate them by appending a per-unit ID, in order, to the speech synthesis request ID.
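
A minimal sketch of the ID assignment method suggested above, under stated assumptions: the single-letter unit type codes (`S` for sentence, `W` for word, `P` for pause), the `-` separator, and the per-type counters are hypothetical; the text only says that a per-unit ID is appended in order to the speech synthesis request ID.

```python
from collections import defaultdict


def assign_unit_ids(request_id, units):
    """Give every analysis unit an ID of the form <request_id>-<type><n>.

    units: list of (unit_type, text) in document order
    Returns a list of (unit_id, unit_type, text).
    """
    counters = defaultdict(int)   # one running counter per unit type
    result = []
    for unit_type, text in units:
        counters[unit_type] += 1
        unit_id = f"{request_id}-{unit_type}{counters[unit_type]}"
        result.append((unit_id, unit_type, text))
    return result


ids = assign_unit_ids(
    "R00000001",
    [("S", "今日は晴れです"), ("W", "今日"), ("W", "は"), ("P", ""), ("W", "晴れ")],
)
for unit_id, unit_type, text in ids:
    print(unit_id, unit_type, text)
```

Because each ID starts with the request ID, the server can later recover the originating request from any unit ID without a separate lookup table.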

図１７は、図１５の読み修正候補情報に対して解析単位別ＩＤを付与した結果を示す。読み修正候補情報に限らず、韻律修正候補情報や波形修正候補情報、それらを統合した修正候補情報１１０の全体に対しても同様の方法で解析単位別ＩＤを付与することができる。解析単位別ＩＤ設定処理部１６０５は、この解析単位別ＩＤが付与された修正候補情報をＩＤ付きメタ情報１６０７として出力する。なお、メタ情報１６０７は、コンテキスト格納装置１６１１に格納されてもよい。 FIG. 17 shows the result of assigning IDs by analysis unit to the reading correction candidate information of FIG. Not only the reading correction candidate information but also the prosodic correction candidate information, the waveform correction candidate information, and the entire correction candidate information 110 obtained by integrating them can be assigned IDs by analysis unit by the same method. The ID setting processing unit 1605 for each analysis unit outputs the correction candidate information to which the ID for each analysis unit is given as meta information 1607 with ID. The meta information 1607 may be stored in the context storage device 1611.

最後に、サービス応答生成処理部１６０８は、合成音声１６０６とＩＤ付きメタ情報１６０７を用いて、音声合成サービス要求に対するサービス応答情報１６０９を生成する。このサービス応答情報はＨＴＴＰプロトコルのレスポンス情報などに対応する。レスポンス情報の主となるデータとして合成音声データを設定し、ＩＤ付きメタ情報はレスポンス情報のヘッダ属性値として設定するなどすればここでのサービス応答情報を実現できる。 Finally, the service response generation processing unit 1608 generates service response information 1609 for the speech synthesis service request using the synthesized speech 1606 and the ID-added meta information 1607. This service response information corresponds to HTTP protocol response information and the like. If synthetic voice data is set as the main data of the response information and the meta information with ID is set as a header attribute value of the response information, the service response information here can be realized.
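
The response layout described above (audio as the main data, meta information in a header) might be assembled as in the following sketch. The header name `X-TTS-Meta`, the `audio/wav` content type, and the Base64 encoding are assumptions introduced here: the text does not name a header, and Base64 is used only because HTTP header values cannot safely carry raw multi-byte text or line breaks.

```python
import base64


def build_service_response(audio: bytes, meta_with_ids: str):
    """Return (headers, body) for a hypothetical HTTP response.

    audio:         synthesized speech data (body of the response)
    meta_with_ids: ID-added meta information as a tagged string
    """
    encoded_meta = base64.b64encode(meta_with_ids.encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "audio/wav",          # assumed audio format
        "Content-Length": str(len(audio)),
        "X-TTS-Meta": encoded_meta,           # hypothetical header name
    }
    return headers, audio


headers, body = build_service_response(b"RIFF....", '<ACC ID="R00000001-A1">音声合成</ACC>')
decoded = base64.b64decode(headers["X-TTS-Meta"]).decode("utf-8")
print(headers["Content-Length"], decoded)
```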

以上の流れにより、本実施例の音声合成サーバシステムは、読み上げテキスト１６０１が入力されると、その音声合成要求を特定して再現するためのパラメータセットをコンテキスト格納装置１６１１に保存した上で、テキストを読み上げた合成音声１６０６と、その合成音声に対する修正可能候補箇所及び修正可能候補内容を示すＩＤ付きメタ情報１６０７を出力することができる。 Through the above flow, when the read-out text 1601 is input, the speech synthesis server system of this embodiment stores in the context storage device 1611 the parameter set needed to identify and reproduce the speech synthesis request, and then outputs the synthesized speech 1606 reading out the text together with the ID-added meta information 1607, which indicates the correctable candidate locations and the correctable candidate contents for that synthesized speech.

続いて、本実施例の音声合成サーバシステムがユーザからの修正及び改善指示要求に対して処理する流れについて説明する。図１６Ｂは、修正及び改善指示要求時の処理構成を示す。音声合成サーバシステムは、修正及び改善指示要求用の構成要素として、改善指示解釈処理部１６５２と、音声合成コンテキスト選択処理部１６５４と、コンテキスト格納装置１６１１と、メタ情報付き音声合成処理部１６０３と、音声合成データ格納装置１６１０と、改善箇所決定処理部１６６０と、改善データ反映処理部１６５９と、改善データ作成処理部１６６１と、サービス応答生成処理部１６０８とを備える。図１６Ｂにおいて、図１６Ａと同じ符号が付された構成要素については同様の機能を備える。 Next, the flow in which the speech synthesis server system of this embodiment processes a correction and improvement instruction request from the user is described. FIG. 16B shows the processing configuration for a correction and improvement instruction request. As components for correction and improvement instruction requests, the speech synthesis server system includes an improvement instruction interpretation processing unit 1652, a speech synthesis context selection processing unit 1654, the context storage device 1611, the speech synthesis processing unit with meta information 1603, the speech synthesis data storage device 1610, an improved part determination processing unit 1660, an improved data reflection processing unit 1659, an improved data creation processing unit 1661, and the service response generation processing unit 1608. Components in FIG. 16B that bear the same reference numerals as in FIG. 16A have the same functions.

本実施例の音声合成サーバシステムに、クライアント側から修正及び改善指示要求が送信されると、まずそこから改善指示情報１６５１および対象ＩＤ１６５３が分離される。修正及び改善指示要求情報の構成方法としては、ＨＴＴＰプロトコルのリクエスト情報のフォーマットなどが可能である。ここで、修正及び改善対象を指定する対象ＩＤ１６５３として、音声合成要求時に渡された解析単位別ＩＤを含むことができる。また改善指示情報１６５１としては、同じく音声合成時に渡された修正候補情報の中に記述されている修正候補の中から、ユーザ（クライアント）側が望む修正結果を指定する情報を含むことができる。分離された改善指示情報１６５１は、改善指示解釈処理部１６５２に入力され、一方、対象ＩＤ１６５３は音声合成コンテキスト選択処理部１６５４に入力される。 When a correction and improvement instruction request is transmitted from the client side to the speech synthesis server system of the present embodiment, the improvement instruction information 1651 and the target ID 1653 are first separated therefrom. As a configuration method of the correction and improvement instruction request information, a request information format of the HTTP protocol can be used. Here, as the target ID 1653 for designating the correction and improvement target, the ID for each analysis unit passed at the time of the speech synthesis request can be included. Further, the improvement instruction information 1651 can include information designating a correction result desired by the user (client) from among the correction candidates described in the correction candidate information passed at the time of speech synthesis. The separated improvement instruction information 1651 is input to the improvement instruction interpretation processing unit 1652, while the target ID 1653 is input to the speech synthesis context selection processing unit 1654.

音声合成コンテキスト選択処理部１６５４は、対象ＩＤ１６５３をもとに、コンテキスト格納装置１６１１から音声合成コンテキスト情報を取り出す。その合成音声が生成された際の音声合成要求ＩＤは、対象ＩＤ１６５３から取り出すことができる。簡単には上述のように音声合成要求ＩＤに解析単位ＩＤが付加される形でこの対象ＩＤ１６５３が生成されている場合には、その音声合成要求ＩＤの部分だけを対象ＩＤ１６５３から取り出せばよい。また、別の構成方法（例えば、変換テーブル）で音声合成要求ＩＤと対象ＩＤが対応づけられている場合には、その変換テーブルを参照することにより、音声合成要求ＩＤを取り出すことができる。こうして対象ＩＤ１６５３から取り出された音声合成要求ＩＤをキー情報としてコンテキスト格納装置１６１１を検索することで、この合成音声が生成された際の様々な音声合成パラメータ、または文脈情報も含むパラメータ情報を取得することができる。こうして取り出された音声合成パラメータセット（コンテキスト情報）はメタ情報付き音声合成処理部１６０３に出力される。 The speech synthesis context selection processing unit 1654 retrieves speech synthesis context information from the context storage device 1611 based on the target ID 1653. The speech synthesis request ID under which the synthesized speech was generated can be extracted from the target ID 1653. In the simplest case, when the target ID 1653 was generated by appending an analysis unit ID to the speech synthesis request ID as described above, only the speech synthesis request ID portion needs to be extracted from the target ID 1653. When the speech synthesis request ID and the target ID are associated by another method (for example, a conversion table), the speech synthesis request ID can be obtained by referring to that conversion table. By searching the context storage device 1611 with the extracted speech synthesis request ID as the key, the various speech synthesis parameters in effect when the synthesized speech was generated, or parameter information that also includes context information, can be acquired. The speech synthesis parameter set (context information) retrieved in this way is output to the speech synthesis processing unit with meta information 1603.
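
Both lookup paths described above (splitting the request ID off the target ID, or consulting a conversion table) can be sketched as follows. The `-` separator, the function names, and the plain dictionaries standing in for the context storage device 1611 and the conversion table are assumptions for illustration only.

```python
def split_target_id(target_id, sep="-"):
    """Split a target ID into (request_id, unit_id), assuming '<request>-<unit>'."""
    request_id, _, unit_id = target_id.partition(sep)
    if not request_id or not unit_id:
        raise ValueError(f"malformed target ID: {target_id!r}")
    return request_id, unit_id


def select_context(target_id, contexts, conversion_table=None):
    """Return (request_id, context) for the synthesis that produced target_id.

    contexts:         {request_id: context} stand-in for the context storage
    conversion_table: optional {target_id: request_id} for IDs that do not
                      embed the request ID
    """
    if conversion_table is not None:
        request_id = conversion_table[target_id]
    else:
        request_id, _ = split_target_id(target_id)
    return request_id, contexts[request_id]


contexts = {"R00000001": {"user_id": "user42", "params": {"speed": 1.0}}}
rid, ctx = select_context("R00000001-W2", contexts)
print(rid, ctx)
```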

メタ情報付き音声合成処理部１６０３は、図１の基本構成で説明したメタ情報付き音声合成処理を実装したものである。ここで行われる音声合成処理は、今回の修正及び改善対象となっている読み上げテキストを前回音声合成した際（つまり、この修正及び改善指示要求の対象となっている合成音声を作成した際）の音声合成処理を再度繰り返すことである。その結果、前回出力されたもの（すなわち、図１６Ａの文解析メタ情報１６０４）と同一の合成音声１６５７および文解析メタ情報１６５８が出力されることになる。この実施例では、修正及び改善要求時に再度、前回の音声合成処理を繰り返して文解析メタ情報１６５８を生成することとしているが、もちろん、前回生成した際に文解析メタ情報１６５８を音声合成要求ＩＤと関連づけてデータベースに格納しておくという方法をとってもよい。この場合は、対象ＩＤ１６５３から音声合成要求ＩＤが分離された後に、このデータベースを参照することで文解析メタ情報１６５８を取得することができる。 The speech synthesis processing unit with meta information 1603 implements the speech synthesis processing with meta information described for the basic configuration of FIG. 1. The processing performed here repeats the speech synthesis that was carried out when the read-out text now subject to correction and improvement was previously synthesized (that is, when the synthesized speech targeted by this correction and improvement instruction request was created). As a result, synthesized speech 1657 and sentence analysis meta information 1658 identical to those output previously (that is, the sentence analysis meta information 1604 in FIG. 16A) are output. In this embodiment, the previous speech synthesis processing is repeated at the time of the correction and improvement request in order to generate the sentence analysis meta information 1658. Alternatively, the sentence analysis meta information 1658 may of course be stored in a database in association with the speech synthesis request ID when it is first generated. In that case, after the speech synthesis request ID is separated from the target ID 1653, the sentence analysis meta information 1658 can be acquired by referring to this database.

メタ情報付き音声合成処理部１６０３は、音声合成データ格納装置１６１０に格納されている音声合成データを参照する。図１６Ａでの構成と同様に、この音声合成データには、読み生成処理部１０２が利用する辞書データ、韻律生成処理部１０３が利用する韻律データ、波形生成処理部１０４が利用する波形データなどが少なくとも含まれている。メタ情報付き音声合成処理部１６０３は、これらの情報を用いて合成音声１６５７を出力することができる。 The speech synthesis processing unit with meta information 1603 refers to the speech synthesis data stored in the speech synthesis data storage device 1610. As in the configuration of FIG. 16A, this speech synthesis data includes at least the dictionary data used by the reading generation processing unit 102, the prosody data used by the prosody generation processing unit 103, and the waveform data used by the waveform generation processing unit 104. The speech synthesis processing unit with meta information 1603 can output the synthesized speech 1657 using these pieces of information.

こうして出力された文解析メタ情報１６５８は、改善箇所決定処理部１６６０へ入力される。改善箇所決定処理部１６６０は、文解析メタ情報１６５８および対象ＩＤ１６５３を基に、ユーザ（クライアント）側がどの箇所に対する修正及び改善指示をしているかを判定する。対象ＩＤ１６５３として前回音声合成した際の文解析メタ情報内に記述されている解析単位ＩＤを用いていれば、今回の文解析メタ情報１６５８内でその解析単位ＩＤを検索するだけで改善箇所を決定することができる。 The sentence analysis meta information 1658 output in this way is input to the improvement location determination processing unit 1660. Based on the sentence analysis meta information 1658 and the target ID 1653, the improvement location determination processing unit 1660 determines which location the user (client) side is asking to correct and improve. If the target ID 1653 uses the analysis unit ID described in the sentence analysis meta information from the previous speech synthesis, the improvement location can be determined simply by searching for that analysis unit ID in the current sentence analysis meta information 1658.
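
The ID search described above can be sketched as follows, assuming the sentence analysis meta information is XML-like markup in which each analysis unit carries an `ID` attribute. The `SENTENCE` wrapper element and the exact attribute layout are hypothetical; only the `WORD` tag and `YOMI` attribute names come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence analysis meta information for two analysis units.
META = (
    '<SENTENCE>'
    '<WORD ID="201306301230123456W001" YOMI="キョウ">今日</WORD>'
    '<WORD ID="201306301230123456W002" YOMI="ワ">は</WORD>'
    '</SENTENCE>'
)


def find_improvement_target(meta_xml, target_id):
    """Return the element whose ID attribute equals target_id, or None."""
    root = ET.fromstring(meta_xml)
    for element in root.iter():
        if element.get("ID") == target_id:
            return element
    return None


target = find_improvement_target(META, "201306301230123456W001")
```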

改善指示解釈処理部１６５２は、改善指示情報１６５１を解釈し、ユーザ（クライアント）側からの読み修正情報、韻律修正情報、波形修正情報の少なくとも１つの修正及び改善情報を抽出する。改善指示解釈処理部１６５２から出力された修正及び改善情報、および改善箇所決定処理部１６６０から出力された改善箇所情報（解析単位ＩＤ）は、改善データ作成処理部１６６１に入力される。 The improvement instruction interpretation processing unit 1652 interprets the improvement instruction information 1651 and extracts at least one correction and improvement information of reading correction information, prosody correction information, and waveform correction information from the user (client) side. The correction and improvement information output from the improvement instruction interpretation processing unit 1652 and the improvement point information (analysis unit ID) output from the improvement point determination processing unit 1660 are input to the improvement data creation processing unit 1661.

改善データ作成処理部１６６１は、指定された箇所の音声合成結果が指定された改善指示になるように、合成音声を更新し、サービス応答生成処理部１６０８に出力する。サービス応答生成処理部１６０８は、更新された合成音声とＩＤ付きメタ情報を用いて、修正及び改善指示要求に対するサービス応答情報１６０９を生成する。 The improved data creation processing unit 1661 updates the synthesized speech so that the speech synthesis result at the designated location becomes the designated improvement instruction, and outputs the synthesized speech to the service response generation processing unit 1608. The service response generation processing unit 1608 generates service response information 1609 for the correction / improvement instruction request using the updated synthesized speech and ID-added meta information.

また、改善データ作成処理部１６６１は、指定された箇所の音声合成結果が指定された改善指示になるように、元となる音声合成データの内容を更新するための情報を生成する。ここで生成された情報をもとに、改善データ反映処理部１６５９は、音声合成データ格納装置１６１０に格納されている音声合成用データの中から更新対象のデータを取り出し、そのデータに対して指定された更新処理を実施する。以上により、ユーザから指定された修正及び改善指示要求に対して、その箇所及び内容を決定し、その修正及び改善を実施するための音声合成データの更新が実現される。 The improvement data creation processing unit 1661 also generates information for updating the content of the underlying speech synthesis data so that the speech synthesis result at the designated location matches the designated improvement instruction. Based on the information generated here, the improvement data reflection processing unit 1659 retrieves the data to be updated from the speech synthesis data stored in the speech synthesis data storage device 1610 and applies the designated update processing to that data. In this way, for a correction and improvement instruction request designated by the user, the location and content are determined, and the speech synthesis data is updated so as to implement the correction and improvement.

［第２実施例］ [Second Embodiment]
第２実施例では、本発明の基本構成および第１実施例に記載の構成を用いて、音声合成結果に対する修正及び改善指示要求を受け付ける音声合成サービスについて説明する。 In the second embodiment, a speech synthesis service that accepts correction and improvement instruction requests for speech synthesis results will be described using the basic configuration of the present invention and the configuration described in the first embodiment.

本実施例では、第１実施例と同様のサーバ・クライアント構成での音声サービスを想定する。さらに本実施例では、第１実施例のシステムをベースとして、ユーザレベルによって提供するサービス内容（修正可能な箇所の制限など）を変更したり、日本語以外の外国語の音声合成に適用したりした場合についても説明を行う。さらに、クライアント側システムにおいて、合成音声の修正可能箇所のＧＵＩ表示を行ったり、そのＧＵＩ表示を用いてユーザから合成音声の修正及び改善指示をサーバに送信したりするための機能についても説明を行う。 This embodiment assumes a speech service with the same server-client configuration as the first embodiment. Taking the system of the first embodiment as a base, this embodiment also describes cases in which the service content provided (such as restrictions on the locations that can be corrected) is varied according to the user level, and in which the system is applied to speech synthesis for languages other than Japanese. It further describes functions of the client-side system for displaying the correctable locations of the synthesized speech in a GUI and for using that GUI display to send the user's correction and improvement instructions for the synthesized speech to the server.

図１８Ａは、第２実施例におけるサーバ側システムの構成を示す図である。図１８Ｂは、第２実施例におけるクライアント側システムの構成を示す図である。本実施例の全体システムは、サーバ側システムとクライアント側システムで構成される。サーバ側システムは、音声合成サービスを提供する事業者側が稼働及び運用しているシステムであり、クライアント側システムは、上記音声合成サービスの提供を受けるユーザが用いるシステム（プログラム）となる。これらのシステムは、図１８Ａ及び図１８Ｂに記載されていないネットワークを介して相互に接続されている。 FIG. 18A is a diagram illustrating the configuration of the server-side system in the second embodiment. FIG. 18B is a diagram illustrating the configuration of the client-side system in the second embodiment. The overall system of this embodiment consists of a server-side system and a client-side system. The server-side system is run and operated by the provider of the speech synthesis service, and the client-side system is the system (program) used by a user who receives that service. These systems are connected to each other via a network that is not shown in FIGS. 18A and 18B.

また、サーバ側システムとクライアント側システムは、例えば、パーソナルコンピュータやワークステーションなどの情報処理装置によって構成される。各システムは、中央演算処理装置と、補助記憶装置と、主記憶装置と、出力装置と、入力装置とを備えている。例えば、中央演算処理装置は、ＣＰＵ（Central Processing Unit）などのプロセッサ（又は演算装置ともいう）で構成されている。また、例えば、補助記憶装置はハードディスクであり、主記憶装置はメモリであり、出力装置はディスプレイやスピーカなどであり、入力装置はキーボード及びポインティングデバイス（マウスなど）である。本発明で使用される各種データは、情報処理装置の記憶装置に格納される。 Further, the server side system and the client side system are configured by an information processing apparatus such as a personal computer or a workstation. Each system includes a central processing unit, an auxiliary storage device, a main storage device, an output device, and an input device. For example, the central processing unit is composed of a processor (or a processing unit) such as a CPU (Central Processing Unit). For example, the auxiliary storage device is a hard disk, the main storage device is a memory, the output device is a display, a speaker, and the like, and the input device is a keyboard and a pointing device (such as a mouse). Various data used in the present invention is stored in the storage device of the information processing apparatus.

サーバ側システムは、ユーザ情報取得部１８０２と、音声合成選択部１８０３と、ユーザ情報データベース１８０４と、音声合成データ・データベース１８０６と、サービス内容修正部１８０７と、第１実施例で説明した修正及び改善指示要求を受け付ける音声合成システム１８０５とを備える。 The server-side system includes a user information acquisition unit 1802, a speech synthesis selection unit 1803, a user information database 1804, a speech synthesis data database 1806, a service content modification unit 1807, and the speech synthesis system 1805 described in the first embodiment, which accepts correction and improvement instruction requests.

サーバ側システムは、まず、クライアント側システムから読み上げリクエスト情報１８０１を受け取る。この読み上げリクエスト情報１８０１は、例えばＨＴＴＰなどのネットワークプロトコルを用いた音声合成サービス要求情報である。この情報の中には、少なくとも音声合成したい読み上げテキスト情報とサービスを受けるユーザを識別するためのユーザ情報が格納されている。また、改善指示要求フェーズなどの場合では、この読み上げリクエスト情報の中に第１実施例で説明した改善指示情報などの情報が格納されることもある。 The server side system first receives the read request information 1801 from the client side system. The reading request information 1801 is speech synthesis service request information using a network protocol such as HTTP, for example. In this information, at least read-out text information to be synthesized and user information for identifying a user who receives the service are stored. In the case of the improvement instruction request phase, information such as the improvement instruction information described in the first embodiment may be stored in the reading request information.

ユーザ情報取得部１８０２は、読み上げリクエスト情報１８０１の中からユーザ情報を分離する。ユーザ情報取得部１８０２は、そのユーザ情報をユーザ情報データベース１８０４で検索することにより、そのユーザに対してどのようなサービス内容が許可されているかを知ることができる。 The user information acquisition unit 1802 separates user information from the reading request information 1801. The user information acquisition unit 1802 can know what service content is permitted for the user by searching the user information database 1804 for the user information.

音声合成選択部１８０３は、そのユーザ情報をもとに、利用可能な音声合成手段を決定する。例えば、音声合成手段として日本語音声合成手段、英語音声合成手段のように言語に応じた音声合成手段のセットが利用可能な場合、ユーザ情報を調べることで英語の音声合成サービスが許可されていないユーザには英語テキスト読み上げリクエストに対してはエラー応答をすることが可能となる。音声合成選択部１８０３は、ユーザごとに設定されている利用可能な音声合成サービス情報を参照することで、そのユーザが利用できる音声合成手段を選択する、またはエラー応答を返す機能を備える。 The speech synthesis selection unit 1803 determines the available speech synthesis means based on the user information. For example, when a set of language-specific speech synthesis means is available, such as Japanese speech synthesis means and English speech synthesis means, checking the user information makes it possible to return an error response to an English text read-out request from a user who is not permitted to use the English speech synthesis service. By referring to the available speech synthesis service information set for each user, the speech synthesis selection unit 1803 either selects the speech synthesis means that the user may use or returns an error response.
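
A minimal sketch of this select-or-reject behavior is shown below. The language codes, the exception type, and the string values standing in for speech synthesis means are assumptions made for illustration.

```python
class ServiceNotPermitted(Exception):
    """Raised when the user may not use the requested synthesis service."""


def select_synthesizer(requested_language, permitted_languages, synthesizers):
    """Return the synthesis means for the language, or raise an error."""
    if requested_language not in permitted_languages:
        # Corresponds to returning an error response to the client.
        raise ServiceNotPermitted(
            f"language {requested_language!r} is not permitted for this user"
        )
    return synthesizers[requested_language]


# Hypothetical registry of language-specific synthesis means.
synthesizers = {"ja": "japanese-tts", "en": "english-tts"}
chosen = select_synthesizer("ja", {"ja"}, synthesizers)
```

A caller would catch `ServiceNotPermitted` and translate it into whatever error response format the service uses.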

音声合成選択部１８０３は、音声合成データ・データベース１８０６の中から、その読み上げリクエストに対して必要な音声合成データであり、かつそのユーザが利用可能なデータを選択する。そのデータを音声合成システム１８０５が利用することにより、読み上げリクエストに対して適切なデータであり、かつユーザが利用可能なデータが適切に選択されることになる。この実施例では、さまざまな音声合成データを切り替えることで同一の音声合成システム１８０５で様々な音声合成サービス（例えば、日本語、英語、中国語などの言語の異なる音声合成、またはアナウンサ調、会話調などの発話スタイルの異なる音声合成など）を提供できるように記載しているが、場合によっては音声合成データを切り替えるだけでなく、音声合成システム１８０５で実行される音声合成プログラムそのものも切り替えないといけない場合もありえる。その場合は、音声合成データ・データベース１８０６に加えて、音声合成プログラムデータベースを新たに設けて、音声合成選択部１８０３によって音声合成プログラム自体も切り替える手法を採用してもよい。 The speech synthesis selection unit 1803 selects, from the speech synthesis data database 1806, the speech synthesis data that is both required for the read-out request and available to the user. Because the speech synthesis system 1805 uses the data selected in this way, it works with data that is appropriate for the read-out request and that the user is permitted to use. This embodiment is described on the assumption that a single speech synthesis system 1805 can provide various speech synthesis services by switching among different speech synthesis data (for example, synthesis in different languages such as Japanese, English, and Chinese, or synthesis in different speaking styles such as an announcer style or a conversational style). In some cases, however, it may be necessary to switch not only the speech synthesis data but also the speech synthesis program itself that runs on the speech synthesis system 1805. In that case, a speech synthesis program database may be provided in addition to the speech synthesis data database 1806, and the speech synthesis selection unit 1803 may also switch the speech synthesis program.

音声合成システム１８０５は、第１実施例（図１６Ａ及び図１６Ｂ）で説明した機能を有する音声合成システムである。音声合成システム１８０５は、読み上げリクエスト情報１８０１で指定された読み上げテキストから、利用可能な音声合成用データを使って合成音声へと変換する。同時に、音声合成システム１８０５は、合成音声における修正しうる箇所と修正候補の内容を示す修正候補情報であるＩＤ付きメタ情報を生成する。 The speech synthesis system 1805 is a speech synthesis system having the functions described in the first embodiment (FIGS. 16A and 16B). The speech synthesis system 1805 converts the read text specified by the read request information 1801 into synthesized speech using available speech synthesis data. At the same time, the speech synthesis system 1805 generates meta information with ID, which is correction candidate information indicating the portion that can be corrected in the synthesized speech and the contents of the correction candidates.

最後にサービス内容修正部１８０７は、音声合成システム１８０５で生成されたＩＤ付きメタ情報から、このユーザに提供が許可されていない修正候補情報を削除する処理を行う。例えば、あるユーザには単語の読みやアクセント等の変更修正のみを許可し、抑揚・リズムなどの韻律変更修正を許可しない設定の場合、音声合成システム１８０５が出力したＩＤ付きメタ情報から韻律に関わる修正候補情報を削除する。これにより、そのユーザに対して許可されている修正情報のみを提供するサービスが可能となる。なお、ユーザに対して許可されている修正に関する情報は、ユーザ情報データベース１８０４に格納されている。 Finally, the service content modification unit 1807 deletes, from the meta information with IDs generated by the speech synthesis system 1805, any correction candidate information that this user is not permitted to receive. For example, if a user is permitted to change only word readings, accents, and the like, and is not permitted to change prosody such as intonation and rhythm, the prosody-related correction candidate information is deleted from the meta information with IDs output by the speech synthesis system 1805. This enables a service that provides each user with only the correction information permitted for that user. Information on the corrections permitted for each user is stored in the user information database 1804.

図１８Ｂに示すように、クライアント側システムは、修正候補リスト提示部１８５２と、改善要求対象指示部１８５５と、改善指示リクエスト作成部１８５７とを備える。修正候補リスト提示部１８５２は、提示デバイスであるディスプレイ装置１８５３及びスピーカ装置１８５４への出力を行う。また、改善要求対象指示部１８５５は、指示デバイスであるマウスまたはキーボード１８５６からの入力を受け取る。 As illustrated in FIG. 18B, the client side system includes a correction candidate list presentation unit 1852, an improvement request target instruction unit 1855, and an improvement instruction request creation unit 1857. The correction candidate list presenting unit 1852 performs output to the display device 1853 and the speaker device 1854 that are presentation devices. Further, the improvement request target instruction unit 1855 receives an input from a mouse or a keyboard 1856 that is an instruction device.

クライアント側システムでは、まずクライアント側システムからサーバ側システムに対して、新たな読み上げテキストの音声合成処理要求が発行される。ただし、この最初の音声合成要求処理は本発明において特異な部分ではないため、図１８Ａ及び図１８Ｂでは省略している。図１８Ａでは、最初の音声合成処理要求が発行され、その読み上げリクエストに対してサーバ側システムからサービス応答情報１８５１が返された際の処理の流れを示している。 In the client-side system, first, a new text-to-speech synthesis request is issued from the client-side system to the server-side system. However, since this first speech synthesis request process is not a unique part in the present invention, it is omitted in FIGS. 18A and 18B. FIG. 18A shows a processing flow when the first speech synthesis processing request is issued and the service response information 1851 is returned from the server side system in response to the reading request.

ある読み上げリクエストに対するサービス応答情報１８５１が返されると、修正候補リスト提示部１８５２は、その応答情報に含まれる合成音声およびＩＤ付きメタ情報を取り出し、ユーザに対して修正箇所が提示される。修正箇所をＧＵＩによって視覚的に提示するためにはディスプレイ装置１８５３が使用され、読み上げテキストの合成音声全体の確認や修正箇所部分の音声確認や修正箇所の修正結果の音声確認にはスピーカ装置１８５４が使用される。この修正候補リスト提示部１８５２の詳細については後で説明を行う。 When the service response information 1851 for a certain reading request is returned, the correction candidate list presentation unit 1852 takes out the synthesized speech and the ID-added meta information included in the response information, and presents the correction location to the user. A display device 1853 is used for visually presenting the corrected portion using the GUI, and a speaker device 1854 is used for checking the synthesized speech of the read-out text, checking the voice of the corrected portion, and checking the voice of the correction result of the corrected portion. used. Details of the correction candidate list presentation unit 1852 will be described later.

修正候補リスト提示部１８５２、ディスプレイ装置１８５３、およびスピーカ装置１８５４によって、読み上げテキストに対する音声合成結果、または修正及び改善指示要求を行った後の修正結果がＧＵＩや音声によって提示される。この際、クライアント側システムを使用しているユーザは、そのＧＵＩを通じて、読み上げ合成音声に対する修正及び改善指示要求を出せるようになる。ユーザは、マウス、キーボード１８５６などの入力装置によって、ＧＵＩ上において、読み上げテキストのどの部分に対してどのような修正及び改善指示要求を行うかを指定する。改善要求対象指示部１８５５は、入力装置からの情報を解析し、読み上げテキストのどの部分に対してどのような修正及び改善要求が指示されたかを決定する。最後に、改善指示リクエスト作成部１８５７は、ユーザが指示した読み上げテキスト位置に対する修正及び改善要求をもとに、サーバ側システムに送信する修正及び改善指示要求リクエストを作成する。 The correction candidate list presentation unit 1852, the display device 1853, and the speaker device 1854 present the speech synthesis result for the read-out text or the correction result after the correction / improvement instruction request is made by GUI or voice. At this time, the user using the client side system can issue a correction and improvement instruction request for the read-out synthesized speech through the GUI. The user designates what correction and improvement instruction request is to be made for which part of the text to be read on the GUI by using an input device such as a mouse or a keyboard 1856. The improvement request target instruction unit 1855 analyzes information from the input device and determines what correction and improvement requests are instructed for which part of the text to be read. Finally, the improvement instruction request creating unit 1857 creates a correction and improvement instruction request to be transmitted to the server-side system based on the correction and improvement request for the reading text position designated by the user.

次にサーバ側システムおよびクライアント側システムのそれぞれについて、各構成要素の技術の実施例についてより詳細に説明する。 Next, for each of the server-side system and the client-side system, an embodiment of the technology of each component will be described in more detail.

ユーザ情報取得部１８０２は、読み上げリクエスト情報１８０１からユーザ情報を分離する。このユーザ情報は、ユーザを識別する情報であり、ユーザＩＤなどの情報を意味する。読み上げリクエスト情報１８０１がＨＴＴＰリクエストなどのプロトコルで送信される場合、ユーザ情報はＨＴＴＰプロトコルのヘッダ情報として、またはＨＴＴＰリクエストの本体部分情報に、本システム独自のフォーマットで埋め込まれている。ユーザ情報取得部１８０２は、ユーザ情報取得部１８０２で分離されたユーザ情報を用いて、ユーザ情報データベース１８０４を検索し、そのユーザに対して許可されているサービス内容を取得する。 The user information acquisition unit 1802 separates user information from the reading request information 1801. This user information is information for identifying the user, and means information such as a user ID. When the read request information 1801 is transmitted by a protocol such as an HTTP request, the user information is embedded as HTTP protocol header information or in the main part information of the HTTP request in a format unique to this system. The user information acquisition unit 1802 searches the user information database 1804 using the user information separated by the user information acquisition unit 1802, and acquires the service content permitted for the user.
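
The two embedding locations mentioned above (an HTTP header or the request body) can be sketched as follows. The header name `X-TTS-User-Id`, the body field `user_id`, and the form-encoded body are hypothetical; the description only states that the user information is embedded in a system-specific format.

```python
from urllib.parse import parse_qs

USER_HEADER = "X-TTS-User-Id"  # hypothetical header name
USER_FIELD = "user_id"         # hypothetical body field name


def extract_user_id(headers, body):
    """Return the user ID from the header if present, else from the body."""
    for name, value in headers.items():
        # HTTP header field names are case-insensitive.
        if name.lower() == USER_HEADER.lower():
            return value
    values = parse_qs(body).get(USER_FIELD)
    return values[0] if values else None


from_header = extract_user_id({"x-tts-user-id": "1234567"}, "text=hello")
from_body = extract_user_id({}, "user_id=1234567&text=hello")
```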

図１９は、ユーザ情報データベース１８０４の内容の例を示す。ユーザ情報データベース１８０４には、少なくともユーザを識別するためのユーザＩＤ、および、そのユーザに対して許可されているサービス情報が格納されている。許可されているサービス情報としては、取得できる情報（読み、韻律、波形など、処理部から出力される個々のタグ情報）や改善要求を出すことができるかなどの情報を少なくとも含む。また、多言語音声合成を対象とするシステムの場合には、許可されているサービス情報として、さらに利用可能な読み上げ言語などの情報も含まれる。したがって、図１９に示すように、ユーザ情報データベース１８０４は、ユーザＩＤ１９０１と、利用できる言語１９０２と、取得できる情報１９０３と、改善要求の可否１９０４とを、構成項目として少なくとも含んでもよい。 FIG. 19 shows an example of the contents of the user information database 1804. The user information database 1804 stores at least a user ID for identifying the user and service information permitted for the user. Permitted service information includes at least information that can be acquired (individual tag information output from the processing unit such as reading, prosody, waveform, etc.) and whether an improvement request can be issued. In the case of a system for multilingual speech synthesis, information such as a usable reading language is further included as permitted service information. Accordingly, as illustrated in FIG. 19, the user information database 1804 may include at least a user ID 1901, an available language 1902, information 1903 that can be acquired, and whether or not an improvement request is possible 1904 as configuration items.

例えば、図１９の例において、ユーザＩＤ「１２３４５６７」のユーザに対しては、「日本語」と「英語」の読み上げサービスが提供可能であり、日本語に対しては読み（ＷＯＲＤ、ＡＣＣ、ＰＡＵＳＥタグ）と韻律（ＰＩＴＣＨ、ＳＰＥＥＤタグ）の情報の取得、およびそれらへの改善指示要求が可能であることが示されている。一方、英語のサービスに対しては読み（ＰＡＵＳＥ）の情報のみの提供であり、それに対する改善指示要求を受け付けることはできない。ユーザ情報データベース１８０４で検索された情報は、音声合成選択部１８０３、音声合成データ・データベース１８０６、およびサービス内容修正部１８０７に渡されて、適切な情報の選択や削除などが行われる。 For example, FIG. 19 shows that the user with user ID "1234567" can be provided with read-out services in "Japanese" and "English". For Japanese, this user can obtain reading information (WORD, ACC, and PAUSE tags) and prosody information (PITCH and SPEED tags) and can issue improvement instruction requests for them. For the English service, on the other hand, only reading information (PAUSE) is provided, and improvement instruction requests for it are not accepted. The information retrieved from the user information database 1804 is passed to the speech synthesis selection unit 1803, the speech synthesis data database 1806, and the service content modification unit 1807, where the appropriate information is selected or deleted.
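
The record described for user ID "1234567" could be represented as in the sketch below. The in-memory dictionary, the class and field names, and the language codes `ja` and `en` are assumptions; only the tag names and the permissions come from the description of FIG. 19.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguagePermission:
    """Service content permitted for one user in one language."""
    tags: frozenset                 # obtainable information (1903)
    can_request_improvement: bool   # improvement request allowed (1904)


# Hypothetical stand-in for the user information database 1804.
USER_DB = {
    "1234567": {
        "ja": LanguagePermission(
            frozenset({"WORD", "ACC", "PAUSE", "PITCH", "SPEED"}), True
        ),
        "en": LanguagePermission(frozenset({"PAUSE"}), False),
    }
}


def lookup_permission(user_id, language):
    """Return the permission record, or None if the service is not offered."""
    return USER_DB.get(user_id, {}).get(language)


english = lookup_permission("1234567", "en")
```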

音声合成選択部１８０３は、ユーザ情報データベース１８０４で検索された、ユーザに許可されている読み上げ言語情報をもとに、音声合成手段の選択を行う。音声合成手段の選択処理は、言語ごとに音声合成プログラムと音声合成データが異なる場合、音声合成プログラムは共通で音声合成データのみが異なる場合などがありえる。たとえ後者だとしても、音声合成プログラムに対して読み上げテキストの言語を指定する必要はある。同じアルファベットを用いたテキストだとしても、言語を特定することが不可能な場合は多いためである。 The speech synthesis selection unit 1803 selects the speech synthesis means based on the read-out languages permitted for the user, as retrieved from the user information database 1804. The selection may involve a case in which both the speech synthesis program and the speech synthesis data differ for each language, or a case in which the program is shared and only the data differs. Even in the latter case, the language of the read-out text must be specified to the speech synthesis program, because the language of a text often cannot be identified from the text itself when several languages share the same alphabet.

図１８Ａの実施例では、音声合成選択部１８０３は、読み上げテキストの言語情報のみを音声合成システム１８０５に渡す構成をとっている。つまり、音声合成プログラムである音声合成システム１８０５は、読み上げ言語によらず同一のものを利用できるという想定である。 In the embodiment of FIG. 18A, the speech synthesis selection unit 1803 is configured to pass only the language information of the text to be read to the speech synthesis system 1805. That is, it is assumed that the speech synthesis system 1805 that is a speech synthesis program can use the same system regardless of the reading language.

一方、音声合成プログラムが言語ごとに異なる場合は、音声合成選択部１８０３によって適切な音声合成プログラムが選択され、音声合成システム１８０５においてそのプログラムが実行されることとなる。この場合は、音声合成選択部１８０３に加えて、言語ごとの音声合成プログラムを格納した音声合成プログラムデータベースが存在し、その中から適切なプログラムが選択され、プログラム実行手段であるところの音声合成システム１８０５でそのプログラムが実行されるという構成となる。 On the other hand, if the speech synthesis program differs for each language, the speech synthesis selection unit 1803 selects the appropriate program, and the speech synthesis system 1805 executes it. In this configuration, a speech synthesis program database storing a speech synthesis program for each language exists in addition to the speech synthesis selection unit 1803; the appropriate program is selected from that database and is executed by the speech synthesis system 1805, which serves as the program execution means.

本実施例では、音声合成選択部１８０３からは読み上げテキストの言語のみが出力される場合を想定する。読み上げテキストの言語を特定する手法としては、読み上げリクエスト情報１８０１に属性値として言語情報を指定することが一般的である。音声合成選択部１８０３は、その属性値を読み上げリクエストから分離して後段へと出力する。 This embodiment assumes that the speech synthesis selection unit 1803 outputs only the language of the read-out text. A common way to identify the language of the read-out text is to specify the language as an attribute value in the reading request information 1801. The speech synthesis selection unit 1803 separates that attribute value from the read-out request and outputs it to the subsequent stage.

また、別の実施例としては、音声合成選択部１８０３が、言語情報だけではなく、発話スタイルなどの情報を取得し、その情報を音声合成システム１８０５、または音声合成データ・データベース１８０６に出力する構成も考えられる。例えば、朗読調や話し言葉調などによって音声合成プログラムや音声合成データが異なる場合には、音声合成選択部１８０３で発話スタイル情報を取得して、プログラムとデータを切り替えることが必要となる。このような場合でも、読み上げリクエスト情報１８０１内に記述された発話スタイル情報を抽出して音声合成システム１８０５や音声合成データ・データベース１８０６に渡すことで対応することが可能となる。 In another possible embodiment, the speech synthesis selection unit 1803 acquires not only language information but also information such as the speaking style, and outputs that information to the speech synthesis system 1805 or the speech synthesis data database 1806. For example, when the speech synthesis program or data differs between a recitation style and a conversational style, the speech synthesis selection unit 1803 must acquire the speaking style information and switch the program and data accordingly. Even in such a case, this can be handled by extracting the speaking style information described in the reading request information 1801 and passing it to the speech synthesis system 1805 or the speech synthesis data database 1806.

音声合成データ・データベース１８０６は、音声合成システム１８０５が利用する複数の音声合成用データを格納している。上述のように、日本語音声合成用データや英語音声合成用データのような言語ごとのデータ、および、日本語朗読調データや日本語話し言葉調データのように発話スタイルごとのデータなどが格納されており、音声合成選択部１８０３で取得された情報により、実際の処理で利用される音声合成データが検索及び出力されることになる。また、音声合成データ・データベース１８０６には、音声合成システム１８０５において使用される辞書データ、韻律データ、および波形データなども格納されている。 The voice synthesis data database 1806 stores a plurality of voice synthesis data used by the voice synthesis system 1805. As described above, data for each language, such as data for Japanese speech synthesis and data for English speech synthesis, and data for each utterance style, such as Japanese reading tone data and Japanese spoken tone tone data, are stored. Therefore, based on the information acquired by the speech synthesis selection unit 1803, speech synthesis data used in actual processing is retrieved and output. The speech synthesis data database 1806 also stores dictionary data, prosody data, waveform data, and the like used in the speech synthesis system 1805.

音声合成システム１８０５は、基本構成または第１実施例で説明した音声合成手段である。その詳細説明は省略するが、上述のように複数の言語に対して処理することが可能である。読み生成処理において、読みとアクセントという情報が日本語固有の情報であるため、言語によって解析される情報は変わりうる。例えば、英語の場合には読みとアクセントに代えて、発音とストレス情報が、読み生成処理部１０２において解析される情報となる。また、例えば中国語の場合には、発音（ピンイン）と声調が、読み生成処理部１０２において解析される情報となる。このように言語によって読み生成処理部１０２で解析されて出力される情報は変わりうるが、それぞれの言語のテキスト解析処理はさまざまな文献で開示されており、その手法を利用すればよい。さらに、それらの手法で解析された上記言語ごとの解析情報を読み修正候補情報として出力する方法についても、情報の種類の違いはあってもフォーマットや処理アルゴリズムは上記基本構成の説明で日本語音声合成に対して説明したものがそのまま利用できる。 The speech synthesis system 1805 is the speech synthesis means described in the basic configuration or the first embodiment. Its detailed description is omitted, but as noted above it can process multiple languages. In the reading generation processing, reading and accent are information specific to Japanese, so the information analyzed varies by language. For English, for example, pronunciation and stress information are analyzed by the reading generation processing unit 102 in place of reading and accent. For Chinese, pronunciation (pinyin) and tone are the information analyzed by the reading generation processing unit 102. Although the information analyzed and output by the reading generation processing unit 102 thus depends on the language, text analysis processing for each language is disclosed in various publications, and those methods may be used. Likewise, when the per-language analysis information obtained by those methods is output as reading correction candidate information, the format and processing algorithms described for Japanese speech synthesis in the basic configuration can be used as they are, even though the types of information differ.

図２０は、英語テキストに対して出力された読み修正候補情報の例を示す。この読み修正候補情報において、ＷＯＲＤタグは、日本語のＹＯＭＩ属性の代わりに発音を示すＰＲＯＮ属性を有する。また、この読み修正候補情報は、アクセントを示すＡＣＣタグの代わりに、ストレスを示すＳＴＲＥＳＳタグを有する。もちろん、このフォーマットは一つの実施例であり、他のタグを設定することも可能である。 FIG. 20 shows an example of the reading correction candidate information output for the English text. In this reading correction candidate information, the WORD tag has a PRON attribute indicating pronunciation instead of the Japanese YOMI attribute. This reading correction candidate information has a STRESS tag indicating stress instead of an ACC tag indicating accent. Of course, this format is one embodiment, and other tags can be set.
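
The attribute substitution described above can be sketched as a small lookup table. The table layout, the function name, the language codes, and the sample pronunciation string are hypothetical; only the tag and attribute names (`WORD`, `YOMI`, `PRON`, `ACC`, `STRESS`) come from the description. FIG. 20 itself is not reproduced here, so the generated markup is not claimed to match it.

```python
import xml.etree.ElementTree as ET

# Per-language markup: which WORD attribute holds the pronunciation and
# which tag carries accent or stress information.
READING_MARKUP = {
    "ja": {"word_attribute": "YOMI", "prominence_tag": "ACC"},
    "en": {"word_attribute": "PRON", "prominence_tag": "STRESS"},
}


def word_element(language, unit_id, surface, pronunciation):
    """Build a WORD element using the attribute name for the language."""
    attribute = READING_MARKUP[language]["word_attribute"]
    element = ET.Element("WORD", {"ID": unit_id, attribute: pronunciation})
    element.text = surface
    return element


english = word_element("en", "W001", "today", "t@deI")
japanese = word_element("ja", "W001", "今日", "キョウ")
```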

最後に、サービス内容修正部１８０７は、ユーザ情報データベース１８０４から出力された、対象ユーザに対して許可された情報のみの選別を行う。この処理では、最も単純には図１７や図２０のような出力された修正候補付き情報から、対象ユーザに対して提供を許可されていないタグ情報が削除される。例えば、現在の対象ユーザがユーザＩＤ「１２３４５６７」のユーザで読み上げテキストが英語の場合、音声合成システム１８０５から出力された修正候補付き情報（図２０）からＰＡＵＳＥタグ以外のタグ情報が削除され、図２１で示すような情報と合成音声を含むサービス応答情報１８０８が出力される。これにより、ユーザごとに許可された情報の提供が可能となる。 Finally, the service content modification unit 1807 keeps only the information that is permitted for the target user, according to the output of the user information database 1804. In the simplest form of this processing, tag information that the target user is not permitted to receive is deleted from the output information with correction candidates, such as that shown in FIG. 17 or FIG. 20. For example, when the current target user is the user with user ID "1234567" and the read-out text is in English, all tag information other than PAUSE tags is deleted from the information with correction candidates output by the speech synthesis system 1805 (FIG. 20), and service response information 1808 containing the information shown in FIG. 21 together with the synthesized speech is output. This makes it possible to provide each user with only the permitted information.
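
A minimal sketch of this tag filtering follows, assuming the information with correction candidates is XML-like markup whose candidate tags are children of a container element. The `META` container, the attribute values, and the flat layout are hypothetical and are not claimed to match FIG. 20 or FIG. 21.

```python
import xml.etree.ElementTree as ET

# Hypothetical information with correction candidates for an English text.
META = (
    '<META>'
    '<WORD ID="201306301230123456W001" PRON="t@deI"/>'
    '<STRESS ID="201306301230123456S001" VALUE="2"/>'
    '<PAUSE ID="201306301230123456P001" VALUE="short"/>'
    '</META>'
)


def filter_candidates(meta_xml, permitted_tags):
    """Remove every element whose tag the user is not permitted to receive.

    The root container itself is always kept.
    """
    root = ET.fromstring(meta_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in permitted_tags:
                parent.remove(child)
    return root


# For the English service of user "1234567", only PAUSE is permitted.
filtered = filter_candidates(META, {"PAUSE"})
```

Note that removing an element this way also drops any text nested inside it, so markup in which a tag wraps the read-out text would need a variant that strips the tag but keeps its content.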

次に図１８Ｂのクライアント側システムの実現方法について説明する。まず、クライアント側システムでは、サーバ側システムからサービス応答情報１８５１を受け取る。この情報は、クライアント側システムから先に送信された読み上げリクエスト情報１８０１に対する合成音声データおよび修正候補付き情報（図１７、図２０、図２１）の組である。これらの情報は、修正候補リスト提示部１８５２に出力される。 Next, a method for realizing the client-side system in FIG. 18B will be described. First, the client side system receives service response information 1851 from the server side system. This information is a set of synthesized voice data and information with correction candidates (FIGS. 17, 20, and 21) for the reading request information 1801 previously transmitted from the client-side system. These pieces of information are output to the correction candidate list presentation unit 1852.

修正候補リスト提示部１８５２は、修正候補付き情報をグラフィカルにディスプレイ装置１８５３に表示するための処理を行う。音声合成装置では、生成された合成音声の読みや抑揚を確認するために読み上げテキストをグラフィカルに画面に表示し、その各部分をマウス等の入力装置で指定することで、その部分の合成音声を確認したりする製品が実現されている。例えば、図２３に示すような画面となる。マウスで「今日」の部分をクリックすると、スピーカ装置１８５４から「キョウ」に相当する音声部分が再生されるなどのインタフェースはすでに実現例がある。 The correction candidate list presentation unit 1852 performs processing for graphically displaying the information with correction candidates on the display device 1853. Among speech synthesizers, products already exist that display the read-out text graphically on a screen so that the reading and intonation of the generated synthesized speech can be checked, and that let the user check the synthesized speech of any part by designating it with an input device such as a mouse. The screen looks, for example, like the one shown in FIG. 23. Interfaces in which clicking the "今日" ("today") portion with the mouse plays the audio corresponding to "キョウ" (kyo) from the speaker device 1854 have already been implemented.

In addition to this graphical presentation, the correction candidate list presentation unit 1852 graphically displays, for the information provided as the correction candidate information of the present invention (such as readings, accents, and prosody), both the value realized in the current synthesized speech and the values to which it can be changed. FIG. 23 shows an example: for the reading, the current value ("キョウ", kyō) and the alternative value ("コンニチ", konnichi) of the reading and accent for the portion corresponding to "今日" are presented graphically in menu form. By pointing at one of these values with a mouse or similar device, the user tells the client-side system which value to change to.

Although the example of FIG. 23 illustrates only readings, the same graphical presentation can of course be used for accents, possible pause positions, prosodic information such as pitch and speech speed, the IDs of the speech component data used in the synthesized speech, and so on. Client-side systems that graphically display generated synthesized speech and allow its reading, accent, prosody, and the like to be changed have already been realized. Such an existing client-side system can of course be used as the client-side system of the present invention.

When the user selects one of the changeable values with the mouse or the like, the improvement request target instruction unit 1855 determines which ID's element the selected position corresponds to. For example, when "コンニチ" is clicked in FIG. 23, the unit determines that the element designated by the user is the "今日" portion, which is the element with the ID "201306301230123456W001" in the correction candidate information of FIG. 22. It is also straightforward to determine that "コンニチ" was selected as the reading correction candidate.
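One way to picture this lookup is a menu whose entries each remember the ID of the element they belong to. The sketch below is only an illustration under that assumption; `MenuEntry`, `resolve_selection`, and the index-based click handling are hypothetical and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MenuEntry:
    element_id: str   # ID of the element in the correction candidate information
    surface: str      # text portion shown on screen
    value: str        # reading offered by this menu entry
    is_current: bool  # True if this is the value used in the current synthesized speech


# Menu for the "今日" portion, following the example in the text.
menu = [
    MenuEntry("201306301230123456W001", "今日", "キョウ", True),
    MenuEntry("201306301230123456W001", "今日", "コンニチ", False),
]


def resolve_selection(entries, clicked_index):
    """Map a clicked menu entry to (element ID, selected correction candidate).

    Returns None when the user clicks the value that is already in use,
    since no change is requested in that case.
    """
    entry = entries[clicked_index]
    if entry.is_current:
        return None
    return entry.element_id, entry.value


# Clicking the second entry selects the alternative reading for element W001.
selection = resolve_selection(menu, 1)
```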

In this way, when the user points with an input device such as a mouse at the read-out text that represents the synthesized speech and is graphically displayed by the client-side system, the improvement request target instruction unit 1855 can obtain two pieces of information: the ID of the location where the user wants to change the synthesized speech, and the improvement information. The improvement instruction request creation unit 1857 converts these two pieces of information into a format such as XML or JSON and transmits them as improvement instruction information 1858 to the speech synthesis server system (the system shown in FIGS. 16A and 16B or in FIG. 18A). The server system processes the improvement instruction request as described in the first embodiment.
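As a rough sketch of the JSON case, the two pieces of information could be serialized as shown below. The field names (`target_id`, `improvement`, `type`, `value`) and the function name are assumptions for illustration; the text specifies only that the target ID and the improvement information are converted to a format such as XML or JSON.

```python
import json


def build_improvement_instruction(target_id, kind, value):
    """Serialize the target ID and the improvement information as a JSON string.

    The field names are hypothetical; they are not defined in the source text.
    """
    payload = {
        "target_id": target_id,
        "improvement": {"type": kind, "value": value},
    }
    # ensure_ascii=False keeps the katakana reading human-readable in the output.
    return json.dumps(payload, ensure_ascii=False)


message = build_improvement_instruction(
    "201306301230123456W001", "reading", "コンニチ"
)
```

The resulting string would then be sent to the server as the body of the improvement instruction; the transport itself is outside the scope of this sketch.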

As described above, in the server-client speech synthesis system of the second embodiment, the server-side system can set a fine-grained service level for each user and provide only the information that the user is permitted to receive. On the client side, the synthesized speech is displayed graphically within the scope of the information permitted for the user, and the user can correct and improve the synthesized speech by selecting the desired value from among the correction candidates passed from the server-side system.

According to the embodiments described above, for each of the reading generation, prosody generation, and waveform generation processes that make up speech synthesis, the server-side system describes as meta information both the information used to generate the synthesized speech and the information available as alternative candidates. The server-side system then returns to the client-side system, together with the synthesized speech, the read-out text with the correction candidate information added as meta information. With this configuration, the client receives, as the result of speech synthesis, not only the synthesized speech but also meta information describing the correctable portions of that speech and the possible corrections. Using this information, the client-side system can provide a user interface in which the user can easily specify where and how to correct the speech. As a result, the synthesized speech that the user of the speech synthesis service wants can be created easily.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments have been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations having all of the elements described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations may also be added, deleted, or substituted.

Some or all of the functions, processing units, processing means, and the like of the server-side system and the user-side system may be implemented in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. In this case, a non-transitory computer readable medium on which the program code is recorded is provided to an information processing apparatus (computer), and the information processing apparatus (or its CPU) reads the program code stored on the non-transitory computer readable medium. Examples of non-transitory computer readable media include storage devices such as memory, hard disks, and SSDs (Solid State Drives), and storage media such as IC cards, SD cards, and DVDs.

The program code may also be supplied to the information processing apparatus via various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the information processing apparatus over a wired communication path, such as an electric wire or an optical fiber, or over a wireless communication path.

The control lines and information lines shown in the above embodiments are those considered necessary for the explanation; not all of the control lines and information lines in an actual product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

101: Read-out text
102: Reading generation processing unit
103: Prosody generation processing unit
104: Waveform generation processing unit
105: Read-out speech
106: Reading correction candidate information generation processing unit
107: Prosody correction candidate information generation processing unit
108: Waveform correction candidate information generation processing unit
109: Correction candidate information integration processing unit
110: Correction candidate information
201: Read-out text
202: Word division processing unit
203: Reading selection processing unit
204: Delimiter determination processing unit
205: Combination determination processing unit
207: Word reading correction candidate information generation processing unit
208: Delimiter correction candidate information generation processing unit
209: Reading correction candidate information integration processing unit
210: Reading correction candidate information
211: Combined correction candidate information generation processing unit
221: Continuation length correction candidate information generation processing unit
222: Fundamental frequency correction candidate information generation processing unit
223: Height correction candidate information generation processing unit
224: Speech speed correction candidate information generation processing unit
225: Prosody correction candidate information integration processing unit
226: Prosody correction candidate information
301: Word network information
302: Optimal word string selection processing unit
303: Word integration processing unit
304: Reading disambiguation processing unit
306: Word reading correction candidate information generation processing unit
307: Word reading correction candidate information
501, 502: Word candidates
1301: Accent combination flag
1302: Accent combination candidate location flag
1601: Read-out text
1602: Speech synthesis context generation processing unit
1603: Speech synthesis processing unit with meta information
1604: Sentence analysis meta information
1605: Per-analysis-unit ID setting processing unit
1606: Synthesized speech
1607: Meta information with IDs
1608: Service response generation processing unit
1609: Service response information
1610: Speech synthesis data storage device
1611: Context storage device
1651: Improvement instruction information
1652: Improvement instruction interpretation processing unit
1653: Target ID
1654: Speech synthesis context selection processing unit
1657: Synthesized speech
1658: Sentence analysis meta information
1659: Improved data reflection processing unit
1660: Improved portion determination processing unit
1661: Improved data creation processing unit
1801: Read-out request information
1802: User information acquisition unit
1803: Speech synthesis selection unit
1804: User information database
1805: Speech synthesis system
1806: Speech synthesis data database
1807: Service content correction unit
1808: Service response information
1851: Service response information
1852: Correction candidate list presentation unit
1853: Display device
1854: Speaker device
1855: Improvement request target instruction unit
1856: Mouse and keyboard
1857: Improvement instruction request creation unit
1858: Improvement instruction information
1901: User ID
1902: Available languages
1903: Obtainable information
1904: Whether improvement requests are permitted

Claims

A speech synthesis system that receives read-out text and outputs synthesized speech in which the read-out text is read aloud, the speech synthesis system comprising:
a correction candidate information generation processing unit that generates, based on the information used when the synthesized speech was generated, correction candidate information indicating correctable portions of the synthesized speech and the contents of the correction candidates.

The speech synthesis system according to claim 1,
wherein the correction candidate information generation processing unit comprises at least one of:
a reading correction candidate information generation processing unit that generates reading correction candidate information, which is correction information for the reading generation process that generates the reading of the synthesized speech;
a prosody correction candidate information generation processing unit that generates prosody correction candidate information, which is correction information for the prosody generation process that generates the prosody of the synthesized speech; and
a waveform correction candidate information generation processing unit that generates waveform correction candidate information, which is correction information for the waveform generation process that generates the waveform of the synthesized speech.

The speech synthesis system according to claim 2,
wherein the reading correction candidate information generation processing unit comprises at least one of:
a word reading correction candidate information generation processing unit that generates word reading correction candidate information, which is correction information relating to the reading of a word in the synthesized speech;
a delimiter correction candidate information generation processing unit that generates delimiter correction candidate information, which is correction information relating to the delimiter positions of the synthesized speech; and
a combined correction candidate information generation processing unit that generates combined correction candidate information, which is correction information relating to the positions of accent combining in the synthesized speech.

The speech synthesis system according to claim 2,
wherein the prosody correction candidate information generation processing unit comprises at least one of:
a continuation length correction candidate information generation processing unit that generates continuation length correction candidate information, which is correction information relating to the phoneme continuation lengths of the synthesized speech;
a fundamental frequency correction candidate information generation processing unit that generates fundamental frequency correction candidate information, which is correction information relating to the phoneme fundamental frequencies of the synthesized speech;
a height correction candidate information generation processing unit that generates height correction candidate information, which is correction information relating to the pitch pattern of a portion of the synthesized speech; and
a speech speed correction candidate information generation processing unit that generates speech speed correction candidate information, which is correction information relating to the speech speed of a portion of the synthesized speech.

The speech synthesis system according to claim 2,
wherein the waveform correction candidate information generation processing unit includes a unit correction candidate information generation processing unit that generates unit correction candidate information, which is correction information relating to the units used in the synthesized speech.

The speech synthesis system according to claim 1,
further comprising an improved data creation processing unit that updates the synthesized speech based on improvement instruction information, which is improvement information for the synthesized speech.

The speech synthesis system according to claim 6,
further comprising:
a speech synthesis data storage device that stores speech synthesis data for generating the synthesized speech; and
an improved data reflection processing unit that updates the speech synthesis data in the speech synthesis data storage device based on the improvement instruction information.

The speech synthesis system according to claim 1,
further comprising:
a user information database that stores information about the services provided to users; and
a service content correction unit that corrects the content of the correction candidate information based on the information about the services.

The speech synthesis system according to claim 1,
wherein the correction candidate information is information obtained by adding, to the read-out text data, tags indicating the correctable portions and the contents of the correction candidates.

The speech synthesis system according to claim 9,
wherein the correction candidate information includes an identifier for each correctable portion.

The speech synthesis system according to claim 1,
comprising:
a server device that includes the correction candidate information generation processing unit and transmits the synthesized speech and the correction candidate information; and
a client device that receives the synthesized speech and the correction candidate information from the server device.

The speech synthesis system according to claim 11,
wherein the client device includes an improvement instruction request creation unit that creates improvement instruction information relating to improvement information selected from the correction candidate information, and
the server device includes an improved data creation processing unit that updates the synthesized speech based on the improvement instruction information.

The speech synthesis system according to claim 11,
wherein the client device further comprises:
a correction candidate list presentation unit that displays the correction candidate information on a display device; and
an improvement request target instruction unit that receives improvement information selected from the correction candidate information displayed on the display device.

A speech synthesis method comprising:
a first step of generating synthesized speech in which input read-out text is read aloud; and
a second step of generating, based on the information used when the synthesized speech was generated, correction candidate information indicating correctable portions of the synthesized speech and the contents of the correction candidates.

The speech synthesis method according to claim 14,
wherein the second step includes at least one of:
generating reading correction candidate information, which is correction information for the reading generation process that generates the reading of the synthesized speech;
generating prosody correction candidate information, which is correction information for the prosody generation process that generates the prosody of the synthesized speech; and
generating waveform correction candidate information, which is correction information for the waveform generation process that generates the waveform of the synthesized speech.