JP6840124B2

JP6840124B2 - Language processing device, language processing program and language processing method

Info

Publication number: JP6840124B2
Application number: JP2018244555A
Authority: JP
Inventors: 悟行松永; 大和大谷
Original assignee: AI Inc Canada
Current assignee: AI Inc Canada
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2021-03-10
Anticipated expiration: 2038-12-27
Also published as: JP2020106643A

Description

The present invention relates to a language processing device, a language processing program and a language processing method, and more particularly to, for example, a language processing device, a language processing program and a language processing method that generate and output synthesized speech according to input text.

An example of the background art is disclosed in Patent Document 1. Patent Document 1 discloses a speech synthesizer that trains a deep neural network acoustic model from speech data and generates synthesized speech using the trained model. In this speech synthesizer, the deep neural network acoustic model is trained to take as input the concatenation of a language feature vector, which expresses context data as a numerical vector, and a speaker code, and to output speech parameters corresponding to the speaker and the context data.

Another example of the background art is disclosed in Patent Document 2. Patent Document 2 discloses a speech synthesizer that pre-trains a time-length DNN (deep neural network) and an acoustic feature DNN from a speech corpus and uses the trained time-length DNN and acoustic feature DNN to synthesize a speech waveform corresponding to text. In this speech synthesizer, a pre-learning unit generates, from the speech corpus, phoneme language features, phoneme-frame language features, phoneme time lengths and phoneme-frame acoustic features, and assigns speaker labels and emotion labels. The pre-learning unit then trains the time-length DNN on the phoneme language features, speaker labels, emotion labels and phoneme time lengths, and trains the acoustic feature DNN on the phoneme-frame language features, speaker labels, emotion labels and phoneme-frame acoustic features.

Patent Document 1: JP-A-2017-032839
Patent Document 2: JP-A-2018-146803

Non-Patent Document 1: H. Zen et al., IEICE Trans. Inf. & Syst., vol. E90-D, no. 5, pp. 825-834, May 2007
Non-Patent Document 2: Zhizheng Wu et al., ISCA SSW9, vol. PS2-13, pp. 218-223, Sep 2016

Patent Documents 1 and 2 disclose nothing about normalizing the language features, but Non-Patent Documents 1 and 2, which those patent documents reference, normalize using the mean and variance, or the minimum and maximum, computed from all of the training data. With these normalization methods, however, a speech synthesizer that accepts free-form text as input encounters language feature values outside the range seen in training, which produces outliers. Moreover, because the extrapolation ability of a neural network is insufficient, prediction becomes unstable. The usual countermeasure is to enlarge the training data so that it covers a wider range. This cannot cover every possible input pattern, however, and it raises the cost of collecting the training data.
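The outlier problem described above can be illustrated with a minimal sketch. It assumes min-max normalization fitted on training data and a hypothetical attribute (the number of moras in an accent phrase); the function names and the numbers are illustrative and are not taken from the cited documents.

```python
# Illustrative sketch only: attribute choice and values are assumptions.

def fit_min_max(train_values):
    """Return (lo, hi) computed over all of the training data."""
    return min(train_values), max(train_values)


def min_max_normalize(value, lo, hi):
    """Scale value so that the training range [lo, hi] maps onto [0, 1]."""
    return (value - lo) / (hi - lo)


# Hypothetical training corpus: accent phrases contain between 1 and 8 moras.
train_mora_counts = [1, 2, 3, 5, 8]
lo, hi = fit_min_max(train_mora_counts)

in_range = min_max_normalize(8, lo, hi)   # value seen during training
unseen = min_max_normalize(15, lo, hi)    # free text with a 15-mora accent phrase

print(in_range, unseen)
```

The second value falls outside the [0, 1] range that the network saw during training, which is the kind of outlier the passage refers to.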

Therefore, a main object of the present invention is to provide a novel language processing device, language processing program and language processing method.

Another object of the present invention is to provide a language processing device, a language processing program and a language processing method capable of preventing outliers from occurring.

A first invention is a language processing device that normalizes a language feature vector sequence which is composed of a plurality of different attributes and is input to a deep neural network of a speech synthesizer that generates synthesized speech, the device comprising normalization means for normalizing a first attribute in the language feature vector sequence for one utterance by a second attribute, different from the first attribute, in the language feature vector sequence for that utterance.

A second invention is dependent on the first invention, wherein the first attribute and the second attribute are linguistically related values.

A third invention is dependent on the first or second invention, wherein the normalization means normalizes by dividing the first attribute by the second attribute.

A fourth invention is dependent on any one of the first to third inventions, wherein the absolute value of the first attribute is less than or equal to the absolute value of the second attribute.

A fifth invention is a language processing program executed by a language processing device that normalizes a language feature vector sequence which is composed of a plurality of different attributes and is input to a deep neural network of a speech synthesizer that generates synthesized speech, the program causing a processor of the language processing device to execute a normalization step of normalizing a first attribute in the language feature vector sequence for one utterance by a second attribute, different from the first attribute, in the language feature vector sequence for that utterance.

A sixth invention is a language processing method of a language processing device that normalizes a language feature vector sequence which is composed of a plurality of different attributes and is input to a deep neural network of a speech synthesizer that generates synthesized speech, the method causing a processor of the language processing device to execute processing of normalizing a first attribute in the language feature vector sequence for one utterance by a second attribute, different from the first attribute, in the language feature vector sequence for that utterance.

According to the present invention, the first attribute in the language feature vector sequence for one utterance is normalized by a second attribute different from the first attribute, so that outliers can be prevented from occurring.
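The normalization by division set out in the third and fourth inventions can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the attribute pair (position of a mora within its accent phrase, and the number of moras in that accent phrase), the function name and the zero check are assumptions made for the example.

```python
# Illustrative sketch only: names, attribute pair and zero check are assumptions.

def normalize_by_attribute(first, second):
    """Normalize the first attribute by dividing it by the second attribute
    taken from the same utterance."""
    if second == 0:
        raise ValueError("second attribute must be non-zero")
    return first / second


# Hypothetical pair of linguistically related attributes:
# first  = position of the mora within its accent phrase (1-based)
# second = number of moras in that accent phrase
positions = [1, 2, 3]
mora_count = 3
normalized = [normalize_by_attribute(p, mora_count) for p in positions]

# A much longer accent phrase than any hypothetical training example
# still yields values no greater than 1, because |first| <= |second|.
long_phrase = [normalize_by_attribute(p, 15) for p in range(1, 16)]

print(normalized)
print(max(long_phrase))
```

Because the divisor comes from the utterance itself rather than from statistics of the training data, the normalized value stays bounded regardless of how long the input phrase is.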

The above-mentioned object, other objects, features and advantages of the present invention will become more apparent from the following detailed description of the embodiment, given with reference to the drawings.

FIG. 1 is a functional block diagram showing an example of the speech synthesizer of this embodiment.
FIG. 2 is a diagram for explaining the pre-learning unit shown in FIG. 1.
FIG. 3 is a diagram for explaining the text analysis unit shown in FIG. 2.
FIG. 4 is a diagram for explaining the acoustic analysis unit shown in FIG. 2.
FIG. 5 is a diagram for explaining the synthesis processing unit shown in FIG. 1.
FIG. 6 is a diagram for explaining the relationship among the phoneme language features, the phoneme-frame language features, the phoneme durations and the phoneme-frame acoustic features, which are the input data and output data of the duration (continuation length) DNN and the acoustic feature DNN.
FIG. 7 is a diagram showing an example of language feature data.
FIG. 8 is a flow chart showing the attribute value calculation processing of the CPU built into the speech synthesizer.
FIG. 9 is a flow chart showing the first attribute value calculation processing of the CPU built into the speech synthesizer.
FIG. 10 is a flow chart showing the second attribute value calculation processing of the CPU built into the speech synthesizer.

FIG. 1 is a functional block diagram of the speech synthesizer 10 of this embodiment. The speech synthesizer 10 is a general-purpose personal computer or workstation, and also functions as a language processing device, or normalization processing device, that normalizes the language features described later. Although not illustrated, the speech synthesizer 10 includes components such as a CPU, memory (HDD, ROM, RAM) and a communication device (network connection device).

The speech synthesizer 10 is described below. The processing for learning and for generating synthesized speech is performed by the CPU of the speech synthesizer 10 in accordance with various programs. The storage units 12 and 16 each mean the memory (HDD and/or RAM) of the speech synthesizer 10, memory built into a computer on a network accessible to the speech synthesizer 10, or a database on an accessible network.

As shown in FIG. 1, the speech synthesizer 10 includes a storage unit 12, a pre-learning unit 14, a storage unit 16 and a synthesis processing unit 18. The storage unit 12 stores a speech corpus. The speech corpus is information about speech in which specific sentences are read aloud by one or more speakers. In this embodiment, the information about speech consists of text and speech waveforms. The text and the speech waveform of the speech reading that text aloud are stored in the storage unit 12 as a pair (linked to each other).

The pre-learning unit 14 performs a predetermined text analysis on the text of the speech corpus read from the storage unit 12 and a predetermined acoustic analysis on the speech waveform for that text, thereby generating information such as the language features used to train the duration (continuation length) DNN and the acoustic features used to train the acoustic feature DNN. Here, DNN means deep neural network. Using information such as the language features and the acoustic features, the pre-learning unit 14 trains in advance the duration DNN and the acoustic feature DNN stored in the storage unit 16.

Since the methods of text analysis and acoustic analysis are known, a detailed description of them is omitted in this embodiment. In this embodiment the duration DNN and the acoustic feature DNN are stored in the same storage unit 16, but they may be stored in separate storage units.

In this embodiment, the duration DNN and the acoustic feature DNN are each a feedforward network whose nodes are organized into an input layer, a plurality of hidden layers (intermediate layers) and an output layer.
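A feedforward network of this shape can be sketched in a few lines. The sketch below is a generic illustration under stated assumptions, not the network of the embodiment: the layer sizes, the tanh activation, the linear output layer and the random initialization are all chosen only for the example.

```python
import math
import random

# Illustrative sketch only: sizes, activation and initialization are assumptions.


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Propagate inputs through the hidden layers (tanh) and a linear output layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = [math.tanh(v) for v in dense(activations, weights, biases)]
    out_weights, out_biases = layers[-1]
    return dense(activations, out_weights, out_biases)


def init_layers(sizes, seed=0):
    """Create small random weights for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


# 4 input units, two hidden layers of 8 units, 1 output unit.
layers = init_layers([4, 8, 8, 1])
output = forward([0.2, 0.5, 1.0, 0.0], layers)
print(output)
```

With a single output unit this mirrors the shape of a duration predictor; an acoustic feature predictor would differ only in having several output units.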

During training, the duration DNN is given phoneme language features at the units of its input layer and phoneme durations at the unit of its output layer; the weights and other parameters of the units in the input, hidden and output layers are computed, and learning proceeds phoneme by phoneme. In this embodiment, the phoneme language features used for training include, for example, phoneme labels, mora information, accent phrase information, breath group (exhalation paragraph) information and utterance information. The duration of a phoneme is expressed as the number of phoneme frames that make up the phoneme. In this embodiment, one phoneme frame is 5 msec long.
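Since a phoneme's duration is counted in phoneme frames of 5 msec each, converting between frame counts and time is simple arithmetic. The helper names below are hypothetical, and rounding to the nearest frame in the reverse direction is an assumption, as the text does not say how partial frames are handled.

```python
FRAME_MSEC = 5  # frame length stated for this embodiment


def frames_to_msec(n_frames):
    """Convert a duration given as a phoneme-frame count into milliseconds."""
    return n_frames * FRAME_MSEC


def msec_to_frames(msec):
    """Convert milliseconds into a frame count (nearest-frame rounding is an assumption)."""
    return round(msec / FRAME_MSEC)


print(frames_to_msec(6))   # a phoneme lasting 6 frames
print(msec_to_frames(30))
```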

When the speech synthesis processing described later is executed, phoneme language features are given to the units of the input layer of the duration DNN. The unit of the output layer of the duration DNN then outputs the phoneme duration corresponding to the phoneme language features given to the input layer.

During training, the acoustic feature DNN is given phoneme-frame language features at the units of its input layer and phoneme-frame acoustic features at the units of its output layer; the weights and other parameters of the units in the input, hidden and output layers are computed, and learning proceeds phoneme frame by phoneme frame. In this embodiment, the phoneme-frame acoustic features include, for example, information such as spectral coefficients, noise coefficients, pitch and a voiced/unvoiced decision.

When the speech synthesis processing described later is executed, phoneme-frame language features are given to the units of the input layer of the acoustic feature DNN. The units of the output layer of the acoustic feature DNN then output the phoneme-frame acoustic features corresponding to the phoneme-frame language features given to the input layer.

FIG. 2 is a diagram for explaining the pre-learning unit 14 shown in FIG. 1. As shown in FIG. 2, the pre-learning unit 14 includes a text analysis unit 14a and an acoustic analysis unit 14b. As shown in FIG. 3, the text analysis unit 14a includes a text analysis means 140, a frame processing means 142 and a normalization processing means 144.

The text analysis means 140 performs text analysis such as morphological analysis on the text read from the speech corpus in the storage unit 12, generates phoneme language features for each phoneme, and outputs them to the frame processing means 142 and the normalization processing means 144. It also outputs the phoneme labels contained in the phoneme language features to the acoustic analysis unit 14b.

Here, the phoneme language features mean the information generated by the text analysis. For example, the phoneme language features generated by text analysis include, for each phoneme, various kinds of information such as the phoneme label, accent position, part-of-speech information, accent phrase information, breath group (exhalation paragraph) information and total-count information. The phoneme label is information (phoneme information) that identifies the phonemes making up the text, and includes the preceding and following phonemes in addition to the phoneme itself.

The frame processing means 142 receives the phoneme language features for pre-learning from the text analysis means 140 and receives the phoneme durations from the acoustic analysis unit 14b. Based on the phoneme language features for pre-learning and the phoneme durations, the frame processing means 142 generates phoneme-frame language features for the number of phoneme frames indicated by each phoneme duration. The generated phoneme-frame language features are output to the normalization processing means 144.
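The expansion from phoneme-level features to frame-level features can be sketched as follows. This is a minimal illustration: the dictionary representation, the function name and the added `frame_index` and `frame_count` fields are assumptions, since the text does not specify which per-frame information is added.

```python
# Illustrative sketch only: data layout and per-frame fields are assumptions.

def expand_to_frames(phoneme_features, durations):
    """Produce one frame-level feature dict per phoneme frame.

    phoneme_features: list of dicts, one per phoneme
    durations: list of frame counts, one per phoneme
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for feats, n_frames in zip(phoneme_features, durations):
        for i in range(n_frames):
            frame = dict(feats)            # copy the phoneme-level features
            frame["frame_index"] = i       # hypothetical per-frame field
            frame["frame_count"] = n_frames
            frames.append(frame)
    return frames


# Illustrative phonemes with durations of 6, 3 and 5 frames.
phonemes = [{"label": "a"}, {"label": "r"}, {"label": "e"}]
frames = expand_to_frames(phonemes, [6, 3, 5])
print(len(frames))
```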

The normalization processing means 144 normalizes both the phoneme language features and the phoneme-frame language features, outputs the normalized phoneme language features to the duration DNN, and outputs the normalized phoneme-frame language features to the acoustic feature DNN.

The normalization processing in the normalization processing means 144 is described in detail later.

As shown in FIG. 4, the acoustic analysis unit 14b includes a phoneme segmentation means 150 and an acoustic analysis means 152. The phoneme segmentation means 150 receives the phoneme labels from the text analysis unit 14a and performs acoustic analysis, using predetermined training data, on the speech waveform read from the speech corpus in the storage unit 12. It identifies where in the speech waveform the phoneme indicated by each phoneme label is located and obtains the phoneme boundary positions. The obtained phoneme boundary positions are output to the acoustic analysis means 152.

The phoneme segmentation means 150 also obtains, based on the phoneme boundary positions, the duration of the phoneme indicated by each phoneme label. As described above, the duration of a phoneme is expressed as the number of phoneme frames that make up the phoneme. The obtained phoneme duration is output to each unit of the output layer of the duration DNN in the storage unit 16 and is also output to the text analysis unit 14a (frame processing means 142).
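Deriving frame-count durations from boundary positions can be sketched as below. The sketch assumes that boundary positions are given in milliseconds, are sorted, and fall on multiples of the 5 msec frame length; the text does not state the unit of the boundary positions, and the function name and boundary values are hypothetical.

```python
FRAME_MSEC = 5  # frame length stated for this embodiment


def durations_from_boundaries(boundaries_msec):
    """Turn sorted boundary times into per-phoneme frame counts.

    Consecutive boundaries delimit one phoneme, so N + 1 boundaries
    yield N durations.
    """
    durations = []
    for start, end in zip(boundaries_msec, boundaries_msec[1:]):
        if end <= start:
            raise ValueError("boundaries must be strictly increasing")
        durations.append((end - start) // FRAME_MSEC)
    return durations


# Hypothetical boundaries for three consecutive phonemes.
boundaries = [0, 30, 45, 70]
durations = durations_from_boundaries(boundaries)
print(durations)
```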

The acoustic analysis means 152 receives the phoneme boundary positions from the phoneme segmentation means 150, performs acoustic analysis on the speech waveform read from the speech corpus in the storage unit 12, and generates phoneme-frame acoustic features for each of the phoneme frames that make up a phoneme. For example, the phoneme-frame acoustic features include information such as spectral coefficients, noise coefficients, pitch and a voiced/unvoiced decision. The generated phoneme-frame acoustic features are output to the units of the output layer of the acoustic feature DNN in the storage unit 16.

Since the method of obtaining the phoneme boundary positions and phoneme durations by acoustic analysis and generating the phoneme-frame acoustic features is known, a detailed description of it is omitted in this embodiment.

As described above, the text analysis unit 14a outputs the phoneme language features for pre-learning to the input layer of the duration DNN, and the acoustic analysis unit 14b outputs the phoneme durations to the output layer of the duration DNN. The duration DNN is thereby pre-trained. Likewise, the text analysis unit 14a outputs the phoneme-frame language features to the input layer of the acoustic feature DNN, and the acoustic analysis unit 14b outputs the phoneme-frame acoustic features to the output layer of the acoustic feature DNN. The acoustic feature DNN is thereby pre-trained.

FIG. 5 is a diagram showing an example of a specific configuration of the synthesis processing unit 18 shown in FIG. 1. As shown in FIG. 5, the synthesis processing unit 18 includes a text analysis unit 180, a duration generation unit 182, an acoustic feature generation unit 184 and a speech waveform synthesis unit 186.

The text analysis unit 180 performs the same processing as the text analysis unit 14a shown in FIG. 2. Specifically, the text analysis unit 180 receives free-form text, performs text analysis on it, and generates and normalizes phoneme language features for each phoneme. Based on the phoneme language features generated and normalized by the text analysis, the text analysis unit 180 generates phoneme language features of the same kind as the phoneme language features for pre-learning generated by the text analysis unit 14a shown in FIG. 2. The text analysis unit 180 then outputs the generated phoneme language features to the duration generation unit 182 and the acoustic feature generation unit 184.

The text analysis unit 180 also receives, from the duration generation unit 182 and the acoustic feature generation unit 184, the phoneme durations corresponding to the phoneme language features that it output to those units and, based on the phoneme language features and the phoneme durations, generates phoneme-frame language features for the number of phoneme frames indicated by each phoneme duration. The text analysis unit 180 then outputs the phoneme-frame language features to the duration generation unit 182 and the acoustic feature generation unit 184.

The duration generation unit 182 receives the phoneme language features from the text analysis unit 180 and, using the duration DNN in the storage unit 16, generates the phoneme durations based on the phoneme language features. The duration generation unit 182 then outputs the phoneme durations to the text analysis unit 180. The acoustic feature generation unit 184 receives the phoneme-frame language features from the text analysis unit 180 and, using the acoustic feature DNN in the storage unit 16, generates the phoneme-frame acoustic features based on the phoneme-frame language features and outputs them to the speech waveform synthesis unit 186.

The speech waveform synthesis unit 186 receives the phoneme-frame acoustic features from the acoustic feature generation unit 184, synthesizes a speech waveform based on the phoneme-frame acoustic features, and outputs the synthesized speech waveform. Specifically, the speech waveform synthesis unit 186 generates a glottal source waveform based on information such as the pitch and noise characteristics contained in the phoneme-frame acoustic features. It then applies vocal tract filtering to the glottal source waveform based on information such as the spectral coefficients contained in the phoneme-frame acoustic features, and synthesizes the speech waveform. In this way, synthesized speech corresponding to the text is generated and output.
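The data flow through the synthesis processing unit 18 can be summarized as a short pipeline sketch. Every stage is passed in as a callable, and the stand-ins used in the example call are trivial placeholders, not real text analysis, DNN inference or vocoding; all function and parameter names are hypothetical.

```python
# Illustrative sketch only: every stage is a hypothetical placeholder callable.

def synthesize(text, analyze, normalize, predict_duration, expand,
               predict_acoustic, vocode):
    """Chain the stages: text -> phoneme features -> durations ->
    frame features -> acoustic features -> waveform."""
    phoneme_feats = [normalize(f) for f in analyze(text)]
    durations = [predict_duration(f) for f in phoneme_feats]   # duration DNN
    frame_feats = expand(phoneme_feats, durations)
    acoustic = [predict_acoustic(f) for f in frame_feats]      # acoustic feature DNN
    return vocode(acoustic)


# Placeholder stages, chosen only so that the pipeline runs end to end.
waveform = synthesize(
    "are",
    analyze=lambda text: [{"label": ch} for ch in text],
    normalize=lambda feats: feats,
    predict_duration=lambda feats: 2,
    expand=lambda feats, durs: [f for f, d in zip(feats, durs) for _ in range(d)],
    predict_acoustic=lambda frame: {"pitch": 100.0, "label": frame["label"]},
    vocode=lambda frames: [fr["pitch"] for fr in frames],
)
print(len(waveform))
```

Passing the stages as parameters keeps the sketch independent of any particular analyzer or network; in the embodiment these roles correspond to units 180, 182, 184 and 186.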

Since the method of synthesizing a speech waveform based on the phoneme-frame acoustic features is well known, a detailed description of it is omitted in this embodiment.

FIG. 6 is a diagram for explaining the relationship among the phoneme language features, the phoneme-frame language features, the phoneme durations and the phoneme-frame acoustic features, which are the input data and output data of the duration DNN and the acoustic feature DNN.

As shown in FIG. 6, when the text for one utterance is 「あれがこれで、それはどれ。」 ("are ga kore de, sore wa dore."), the breath groups (exhalation paragraphs) are 「あれがこれで」 and 「それはどれ」. The accent phrases of 「あれがこれで」 are 「あれが」 and 「これで」, and the moras of 「あれが」 are 「あ」, 「れ」 and 「が」. The phoneme label of 「あ」 is "a", the phoneme labels of 「れ」 are "r" and "e", and the phoneme labels of 「が」 are "g" and "a". In the example shown in FIG. 6, the durations of the phonemes labeled "a", "r", "e", "g" and "a" are given as "6", "3", "5", "5", "5" and "3", respectively. Thus utterances, breath groups, accent phrases, moras and phonemes form a hierarchical structure, and the information whose elements are attributes of these units is likewise hierarchical. As described above, one phoneme frame is 5 msec long in this embodiment.
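The hierarchy of utterance, breath group (exhalation paragraph), accent phrase, mora and phoneme can be written down as a nested data structure. The class and field names below are assumptions made for illustration; only the parts of the example that the text spells out are filled in, and the remaining accent phrases are left empty rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: class and field names are assumptions.


@dataclass
class Mora:
    text: str
    phonemes: List[str]


@dataclass
class AccentPhrase:
    text: str
    moras: List[Mora] = field(default_factory=list)


@dataclass
class BreathGroup:
    text: str
    accent_phrases: List[AccentPhrase] = field(default_factory=list)


@dataclass
class Utterance:
    text: str
    breath_groups: List[BreathGroup] = field(default_factory=list)


are_ga = AccentPhrase("あれが", [
    Mora("あ", ["a"]),
    Mora("れ", ["r", "e"]),
    Mora("が", ["g", "a"]),
])
utterance = Utterance("あれがこれで、それはどれ。", [
    BreathGroup("あれがこれで", [are_ga, AccentPhrase("これで")]),
    BreathGroup("それはどれ"),
])


def phoneme_labels(phrase):
    """Flatten an accent phrase into its phoneme labels, in order."""
    return [p for mora in phrase.moras for p in mora.phonemes]


print(phoneme_labels(are_ga))
```

Counts taken from such a structure, for example the number of moras in an accent phrase, are the kind of utterance-internal quantities that can serve as divisors in the normalization.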

As shown in FIG. 6, in the time interval of the phoneme label "a", one set of phoneme language features (the items of information described above) is generated for this one phoneme, together with six sets of phoneme-frame language features and six sets of phoneme-frame acoustic features. In the time interval of the phoneme label "r", one set of phoneme language features is generated for this one phoneme, together with three sets of phoneme-frame language features and three sets of phoneme-frame acoustic features. Further, in the time interval of the phoneme label "e", one set of phoneme language features is generated for this one phoneme, together with five sets of phoneme-frame language features and five sets of phoneme-frame acoustic features.

Thus, in pre-learning, the units of the input layer of the duration DNN are given the phoneme language features and the unit of the output layer is given the phoneme duration, and this pre-learning is performed with the phoneme as the unit. That is, the duration DNN is given the phoneme language features and the phoneme duration for each phoneme, and pre-learning is performed. In speech synthesis, the phoneme duration is generated and output for each phoneme, using the duration DNN, based on the phoneme language features.

また、上述したように、事前学習において、音響特徴量ＤＮＮの入力層の各ユニットには、音素フレームの言語特徴量が与えられ、出力層の各ユニットには、音素フレームの音響特徴量が与えられ、この事前学習は音素フレームを単位として行われる。つまり、音響特徴量ＤＮＮには、音素フレーム毎に、音素フレームの言語特徴量および音素フレームの音響特徴量が与えられ、事前学習が行われる。音声合成においては、音素フレーム毎に、音響特徴量ＤＮＮを用いて、音素フレームの言語特徴量に基づいて、音素フレームの音響特徴量が生成され、出力される。 Further, as described above, in the pre-learning, each unit of the input layer of the acoustic feature DNN is given the language features of a phoneme frame, and each unit of the output layer is given the acoustic features of the phoneme frame; this pre-learning is performed in units of phoneme frames. That is, the acoustic feature DNN is given the language features of the phoneme frame and the acoustic features of the phoneme frame for each phoneme frame, and pre-learning is performed. In speech synthesis, the acoustic feature DNN is used for each phoneme frame to generate and output the acoustic features of the phoneme frame based on the language features of the phoneme frame.
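
The difference between the two units of processing can be illustrated with the frame counts given for FIG. 6. The following is only a minimal sketch: the function name expand_to_frames and the (label, frame count) representation are assumptions made for illustration and do not appear in the embodiment.

```python
from typing import List, Tuple


def expand_to_frames(segments: List[Tuple[str, int]]) -> List[str]:
    """Repeat each phoneme label once for every frame the phoneme spans."""
    frames: List[str] = []
    for label, n_frames in segments:
        frames.extend([label] * n_frames)
    return frames


# Frame counts from the FIG. 6 description: "a" spans 6 frames, "r" 3, "e" 5.
segments = [("a", 6), ("r", 3), ("e", 5)]
frame_labels = expand_to_frames(segments)

# One feature set per phoneme is handled by the continuation length DNN.
print(len(segments))
# One feature set per phoneme frame is handled by the acoustic feature DNN.
print(len(frame_labels))
```

In this sketch the continuation length DNN would see three items (one per phoneme), while the acoustic feature DNN would see fourteen (one per phoneme frame).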

図７は言語特徴量のデータの一例を示す図である。上述したように、言語特徴量は、音素に関する属性、モーラに関する属性、アクセント句に関する属性、呼気段落に関する属性および発話に関する属性を含み、時刻ｔにおいてｄ次元のベクトルで表される。言語特徴量としては、時刻ｔにおける各属性の情報が数値で表される。各属性の詳細についての説明は省略するが、一例として、アクセント句に関する属性には、「当該アクセント句中のモーラの昇順位置」が含まれる。ただし、「当該」とは、正規化処理を行う場合の処理の対象であることを意味する。 FIG. 7 is a diagram showing an example of language feature data. As described above, the language feature quantity includes attributes related to phonemes, attributes related to mora, attributes related to accent phrases, attributes related to exhalation paragraphs, and attributes related to utterance, and is represented by a d-dimensional vector at time t. As the language feature quantity, the information of each attribute at time t is represented by a numerical value. Although the details of each attribute will be omitted, as an example, the attribute related to the accent phrase includes "ascending position of the mora in the accent phrase". However, "corresponding" means that it is the target of the processing when the normalization processing is performed.

ここで、上述した正規化処理手段１４４には、テキスト解析手段１４０でテキスト解析された音素の言語特徴量と、テキスト解析された音素の言語特徴量に、フレーム処理手段１４２で処理を施された音素フレームの言語特徴量が入力される。入力される言語特徴量はベクトル系列であり、数１で示すことができる。正規化処理手段１４４は、数２に示すように、第１の属性（第１属性値）を、当該第１の属性とは異なる属性であり、かつ当該第１の属性よりも絶対値の大きいまたは等しい第２の属性（第２属性値）で除することで言語特徴量を正規化し、正規化した言語特徴量を出力する。ただし、第１の属性と第２の属性は関連があるものとする。また、１発話分の言語特徴量Ｌは、図７に示したように、時刻ｔにおけるｄ次元の属性を要素に持つ言語特徴量ベクトルの系列である。ただし、数２において、｜・｜は絶対値を意味する。なお、この実施例では、第１属性値の絶対値は、第２属性値の絶対値以下である。 Here, the normalization processing means 144 described above receives the language features of the phonemes obtained by text analysis in the text analysis means 140, and the language features of the phoneme frames obtained by applying the frame processing means 142 to those phoneme language features. The input language features form a vector series, which can be expressed by Equation 1. As shown in Equation 2, the normalization processing means 144 normalizes the language features by dividing a first attribute (first attribute value) by a second attribute (second attribute value) that is a different attribute from the first attribute and whose absolute value is greater than or equal to that of the first attribute, and outputs the normalized language features. The first attribute and the second attribute are assumed to be related. As shown in FIG. 7, the language feature L for one utterance is a series of language feature vectors whose elements are the d-dimensional attributes at time t. In Equation 2, |·| denotes the absolute value. In this embodiment, the absolute value of the first attribute value is less than or equal to the absolute value of the second attribute value.
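
Equations 1 and 2 are not reproduced in this text. One reading that is consistent with the description above, and with the symbols L_td and L_tδ used later in this description, is sketched below. The hat notation, the vector symbol l_t, and the explicit condition δ ≠ d are assumptions introduced for illustration, not the published equations.

```latex
% Equation 1 (assumed form): language features of one utterance as a vector series
L = (\boldsymbol{l}_1, \boldsymbol{l}_2, \ldots, \boldsymbol{l}_T)

% Equation 2 (assumed form): first attribute divided by a related second attribute
\hat{L}_{td} = \frac{L_{td}}{L_{t\delta}},
\qquad |L_{td}| \le |L_{t\delta}|, \quad \delta \ne d
```

Under this reading, every normalized value satisfies |\hat{L}_{td}| ≤ 1 whenever L_{tδ} is nonzero.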

また、上記の「関連があるもの」について図６および図７を用いて説明する。テキスト解析手段１４０によってテキストを解析すると、図７のように、発話、呼気段落、アクセント句、モーラ、音素に関する属性を要素とする情報が得られる。これらの情報は、図６のように階層的な構造となっており、各階層の情報は主に下位の階層の情報で構成される。たとえば、アクセント句の階層は、モーラと音素の階層を下位に持ち、アクセント句の属性には、当該アクセント句中のモーラの昇順位置や、当該アクセント句中のモーラの総数などがある。よって、この実施例の正規化処理手段１４４における正規化処理では、基本的に、位置に関する属性は同じ階層という関連性のもとで同じ階層の総数に関する属性で除され、総数に関する属性は総数という関連性のもとで別の階層の総数に関する属性で除されることになる。また、継続長に関しては総数と同様である。なお、正規化処理はすべての階層に対して適用される。 Further, the above "related items" will be described with reference to FIGS. 6 and 7. When the text is analyzed by the text analysis means 140, information whose elements are attributes related to utterances, exhalation paragraphs, accent phrases, moras, and phonemes is obtained, as shown in FIG. 7. This information has a hierarchical structure as shown in FIG. 6, and the information of each layer is mainly composed of the information of the lower layers. For example, the accent phrase layer has the mora and phoneme layers beneath it, and the attributes of an accent phrase include the ascending position of a mora in the accent phrase and the total number of moras in the accent phrase. Therefore, in the normalization process in the normalization processing means 144 of this embodiment, an attribute relating to position is basically divided by the attribute relating to the total number in the same layer, on the basis that both belong to the same layer, and an attribute relating to a total number is divided by the attribute relating to the total number in another layer, on the basis that both are totals. Continuation lengths are handled in the same way as total numbers. The normalization process is applied to all layers.

数２に示す正規化手法は、非特許文献１および非特許文献２のようにすべての学習データから計算した平均と分散または最小値と最大値のような当該発話以外の条件が入ることはなく、１発話内の限られた条件で計算されるため外れ値が発生しない。そのため、外れ値による予測性能の低下を回避することができ、従来よりも安定した音響特徴量の予測を可能にする。 Unlike the methods of Non-Patent Document 1 and Non-Patent Document 2, the normalization method shown in Equation 2 does not bring in conditions from outside the utterance, such as the mean and variance or the minimum and maximum values calculated from all the training data. Because it is calculated under the limited conditions within one utterance, no outliers occur. Therefore, it is possible to avoid deterioration of the prediction performance due to outliers, and the acoustic features can be predicted more stably than before.
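
The contrast with corpus-level statistics can be made concrete with a small numeric sketch. The training range (mora positions 1 to 8) and the 12-mora accent phrase below are hypothetical values chosen for illustration, and min_max and ratio are illustrative helper names rather than functions from the cited documents.

```python
def min_max(value: float, lo: float, hi: float) -> float:
    """Min-max normalization using statistics taken from the training data."""
    return (value - lo) / (hi - lo)


def ratio(position: float, total: float) -> float:
    """Per-utterance normalization: position divided by the related total."""
    return position / total


# Hypothetical training statistics: mora positions ranged from 1 to 8.
train_lo, train_hi = 1, 8

# Hypothetical unseen input: the last mora of a 12-mora accent phrase.
position, total = 12, 12

corpus_scaled = min_max(position, train_lo, train_hi)  # 11/7, outside [0, 1]
utterance_scaled = ratio(position, total)              # stays within [0, 1]
print(corpus_scaled, utterance_scaled)
```

With the corpus-level range, an input longer than anything seen in training falls outside the expected interval, whereas the ratio depends only on values inside the utterance.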

図８は図１に示した音声合成装置１０のＣＰＵの属性正規化処理の一例を示すフロー図である。この属性正規化処理についてのプログラム（「言語処理プログラム」に相当する）は、音声合成装置１０のメモリに記憶され、ＣＰＵによって実行される。学習処理および音声合成処理に必要な他のプログラムおよびデータについても同様である。 FIG. 8 is a flow chart showing an example of the attribute normalization process of the CPU of the speech synthesizer 10 shown in FIG. The program for this attribute normalization process (corresponding to the "language processing program") is stored in the memory of the speech synthesizer 10 and executed by the CPU. The same applies to other programs and data required for learning processing and speech synthesis processing.

また、この実施例では、言語特徴量に含まれる複数の属性のうちの「当該アクセント句中のモーラの昇順位置」の属性を正規化する場合の処理について説明する。詳細な説明は省略するが、他の属性についても同様の属性正規化処理が実行される。 Further, in this embodiment, a process for normalizing the attribute of "ascending position of mora in the accent phrase" among a plurality of attributes included in the language feature will be described. Although detailed description is omitted, the same attribute normalization process is executed for other attributes.

図８に示すように、ＣＰＵは、属性正規化処理を開始すると、ステップＳ１で、後述する第１属性値算出処理（図９参照）を実行し、ステップＳ３で、後述する第２属性値算出処理（図１０参照）を実行する。 As shown in FIG. 8, when the CPU starts the attribute normalization process, the CPU executes the first attribute value calculation process (see FIG. 9) described later in step S1 and calculates the second attribute value described later in step S3. The process (see FIG. 10) is executed.

次のステップＳ５では、時刻についての変数ｔを初期化し、配列array m[T]を設定する。つまり、ＣＰＵは、変数ｔに０を代入し、音素フレームの最大値Ｔまでの要素（ここでは、当該アクセント句中のモーラの昇順位置）を格納可能な配列array m[T]を設定する。ただし、変数ｔはフレーム数をカウントするための変数である。これは、後述する第１属性値算出処理（図９）および第２属性値算出処理（図１０）においても同じである。また、配列array m[T]は正規化された属性値を格納するための配列である。なお、最大値Ｔは１発話における音素フレームの総数である。 In the next step S5, the variable t for time is initialized and the array array m [T] is set. That is, the CPU assigns 0 to the variable t and sets an array m [T] capable of storing the elements up to the maximum value T of the phoneme frame (here, the ascending position of the mora in the accent clause). However, the variable t is a variable for counting the number of frames. This also applies to the first attribute value calculation process (FIG. 9) and the second attribute value calculation process (FIG. 10), which will be described later. The array array m [T] is an array for storing normalized attribute values. The maximum value T is the total number of phoneme frames in one utterance.

次のステップＳ７では、変数ｔが音素フレームの最大値Ｔよりも小さいかどうかを判断する。ステップＳ７で“ＮＯ”であれば、つまり、変数ｔが音素フレームの最大値Ｔ以上であれば、「当該アクセント句中のモーラの昇順位置」をすべてのモーラについて正規化したと判断し、属性正規化処理を終了する。 In the next step S7, it is determined whether or not the variable t is smaller than the maximum value T of the phoneme frames. If "NO" in step S7, that is, if the variable t is equal to or greater than the maximum value T of the phoneme frames, it is determined that the "ascending position of the mora in the accent phrase" has been normalized for all moras, and the attribute normalization process ends.

一方、ステップＳ７で“ＹＥＳ”であれば、つまり、変数ｔが音素フレームの最大値Ｔ未満であれば、「当該アクセント句中のモーラの昇順位置」を正規化していないモーラが残っていると判断し、ステップＳ９で、昇順の位置が変数ｔにおける要素ｍ［ｔ］を算出し（ｍ［ｔ］＝ｘ［ｔ］／ｙ［ｔ］）、ステップＳ１１で、変数ｔを１加算して（ｔ＝ｔ＋１）、ステップＳ７に戻る。 On the other hand, if "YES" in step S7, that is, if the variable t is less than the maximum value T of the phoneme frames, it is determined that moras whose "ascending position of the mora in the accent phrase" has not yet been normalized remain. In step S9, the element m[t] for the ascending position at the variable t is calculated (m[t] = x[t] / y[t]), in step S11 the variable t is incremented by 1 (t = t + 1), and the process returns to step S7.
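
The loop of steps S5 to S11 can be sketched as follows. This is an illustrative reading of the flow in FIG. 8, not the program of the embodiment; the function name normalize_attribute and the sample arrays are assumptions, and the arrays x and y are supplied directly here rather than computed by the processes of FIGS. 9 and 10.

```python
from typing import List


def normalize_attribute(x: List[int], y: List[int]) -> List[float]:
    """Divide the first attribute x[t] by the second attribute y[t] per frame."""
    T = len(x)
    m = [0.0] * T          # step S5: array for the normalized attribute values
    t = 0                  # step S5: initialize the frame counter
    while t < T:           # step S7: continue while frames remain
        m[t] = x[t] / y[t]  # step S9: m[t] = x[t] / y[t]
        t += 1             # step S11: advance to the next frame
    return m


# Hypothetical 8-frame utterance: a 3-mora accent phrase, then a 2-mora one.
x = [1, 1, 2, 2, 3, 1, 1, 2]  # ascending mora position per frame
y = [3, 3, 3, 3, 3, 2, 2, 2]  # total number of moras in the phrase per frame
m = normalize_attribute(x, y)
print(m)
```

Because each position is at most the total for its accent phrase, every value in m lies in the interval (0, 1].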

図９は図８に示したステップＳ１および後述する図１０のステップＳ５１で実行されるＣＰＵの第１属性値算出処理を示すフロー図である。以下、第１属性値算出処理について説明するが、既に説明した処理と同じ処理については、簡単に説明することにする。 FIG. 9 is a flow chart showing a first attribute value calculation process of the CPU executed in step S1 shown in FIG. 8 and step S51 of FIG. 10 described later. Hereinafter, the first attribute value calculation process will be described, but the same process as the process already described will be briefly described.

図９に示すように、ＣＰＵは、第１属性値算出処理を開始すると、ステップＳ２１で、変数ｔおよび変数ｉを初期化するとともに（t=0, i=1）、配列array x[T]を用意する（x[0],x[1],…,x[T-1]）。ただし、変数ｉは、当該アクセント句中のモーラの昇順位置をカウントするための変数である。図１０においても同じである。また、配列array x[T]は、各属性値についてのモーラの昇順位置を格納するための配列である。 As shown in FIG. 9, when the CPU starts the first attribute value calculation process, in step S21 it initializes the variable t and the variable i (t=0, i=1) and prepares the array x[T] (x[0], x[1], …, x[T-1]). The variable i is a variable for counting the ascending position of the mora in the accent phrase. The same applies in FIG. 10. The array x[T] is an array for storing the ascending position of the mora for each attribute value.

次のステップＳ２３では、変数ｔが音素フレームの最大値Ｔよりも小さいかどうかを判断する。ステップＳ２３で“ＮＯ”であれば、第１属性値算出処理を終了して、図８に示した属性正規化処理にリターンする。 In the next step S23, it is determined whether or not the variable t is smaller than the maximum value T of the phoneme frame. If "NO" in step S23, the first attribute value calculation process is terminated, and the process returns to the attribute normalization process shown in FIG.

一方、ステップＳ２３で“ＹＥＳ”であれば、ステップＳ２５で、要素ｘ［ｔ］に変数ｉの数値を代入する。つまり、当該アクセント句における当該モーラの昇順の番号が割り当てられる。次のステップＳ２７では、変数ｔにおいてモーラの終わりかどうかを判断する。ここでは、ＣＰＵは、変数ｔが当該モーラにおける最終フレームを示すかどうかを判断する。ステップＳ２７で“ＮＯ”であれば、つまり、変数ｔにおいてモーラの終わりでなければ、ステップＳ３１に進む。一方、ステップＳ２７で“ＹＥＳ”であれば、つまり、変数ｔにおいてモーラの終わりであれば、ステップＳ２９で、変数ｉを１加算して（i=i+1）、ステップＳ３１に進む。 On the other hand, if "YES" in step S23, the value of the variable i is assigned to the element x[t] in step S25. That is, the ascending number of the mora within the accent phrase is assigned. In the next step S27, it is determined whether the variable t is at the end of a mora. Here, the CPU determines whether the variable t indicates the final frame of the mora. If "NO" in step S27, that is, if the variable t is not at the end of a mora, the process proceeds to step S31. On the other hand, if "YES" in step S27, that is, if the variable t is at the end of a mora, the variable i is incremented by 1 in step S29 (i=i+1), and the process proceeds to step S31.

ステップＳ３１では、変数ｔにおいてアクセント句の終わりであるかどうかを判断する。ここでは、ＣＰＵは、変数ｔが当該アクセント句における最終フレームを示すかどうかを判断する。 In step S31, it is determined whether or not the variable t is the end of the accent phrase. Here, the CPU determines whether the variable t indicates the final frame in the accent clause.

ステップＳ３１で“ＮＯ”であれば、変数ｔにおいてアクセント句の終わりでなければ、ステップＳ３５に進む。一方、ステップＳ３１で“ＹＥＳ”であれば、変数ｔにおいてアクセント句の終わりであれば、ステップＳ３３で、変数ｉを初期値に設定し、ステップＳ３５で、変数ｔを１加算して（t=t+1）、ステップＳ２３に戻る。 If "NO" in step S31, that is, if the variable t is not at the end of an accent phrase, the process proceeds to step S35. On the other hand, if "YES" in step S31, that is, if the variable t is at the end of an accent phrase, the variable i is reset to its initial value in step S33, the variable t is incremented by 1 in step S35 (t=t+1), and the process returns to step S23.
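
The forward pass of FIG. 9 (steps S21 to S35) can be sketched as below. The sketch assumes that step S27 answers "YES" when frame t is the final frame of a mora, in line with how step S31 treats the end of an accent phrase. The boolean lists mora_end and phrase_end and the function name first_attribute are hypothetical inputs and names introduced for illustration.

```python
from typing import List


def first_attribute(mora_end: List[bool], phrase_end: List[bool]) -> List[int]:
    """Assign to each frame the ascending position of its mora in its accent phrase."""
    T = len(mora_end)
    x = [0] * T
    t, i = 0, 1            # step S21: t = 0, i = 1
    while t < T:           # step S23
        x[t] = i           # step S25: store the current mora position
        if mora_end[t]:    # step S27: final frame of the mora?
            i += 1         # step S29: next mora gets the next position
        if phrase_end[t]:  # step S31: final frame of the accent phrase?
            i = 1          # step S33: reset the position to its initial value
        t += 1             # step S35
    return x


# Hypothetical 8-frame utterance: moras of 2, 2, 1 frames, then 2, 1 frames.
mora_end = [False, True, False, True, True, False, True, True]
phrase_end = [False, False, False, False, True, False, False, True]
x = first_attribute(mora_end, phrase_end)
print(x)
```

For this input the positions count up 1, 2, 3 over the first accent phrase and restart at 1 for the second.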

図１０は図８に示したステップＳ３の第２属性値算出処理のフロー図である。図１０に示すように、第２属性値算出処理を開始すると、ステップＳ５１で、図９に示した第１属性値算出処理を実行する。 FIG. 10 is a flow chart of the second attribute value calculation process in step S3 shown in FIG. As shown in FIG. 10, when the second attribute value calculation process is started, the first attribute value calculation process shown in FIG. 9 is executed in step S51.

次のステップＳ５３では、変数ｔおよび変数ｉを初期化するとともに（t=T, i=x[T-1]）、配列array y[T]を用意する（y[0],y[1],…,y[T-1]）。ただし、配列array y[T]は、第１属性値の各要素を正規化するための第２属性値（ここでは、当該アクセント句中のモーラの総数）の各要素を格納するための配列である。なお、この第２属性値算出処理においては、配列array y[T]は末尾の要素ｙ[T-1]から先頭の要素y[0]に向けて値が代入される。 In the next step S53, the variable t and the variable i are initialized (t=T, i=x[T-1]), and the array y[T] is prepared (y[0], y[1], …, y[T-1]). The array y[T] is an array for storing each element of the second attribute value (here, the total number of moras in the accent phrase) used to normalize each element of the first attribute value. In this second attribute value calculation process, values are assigned to the array y[T] from the last element y[T-1] toward the first element y[0].

続いて、ステップＳ５５では、変数ｔが０よりも大きいかどうかを判断する。ステップＳ５５で“ＮＯ”であれば、第２属性値算出処理を終了して、図８に示した属性正規化処理にリターンする。 Subsequently, in step S55, it is determined whether or not the variable t is larger than 0. If "NO" in step S55, the second attribute value calculation process is terminated, and the process returns to the attribute normalization process shown in FIG.

一方、ステップＳ５５で“ＹＥＳ”であれば、ステップＳ５７で、変数ｔを１減算して（t=t-1）、ステップＳ５９で、変数ｔにおけるアクセント句の終わりかどうかを判断する。ステップＳ５９で“ＮＯ”であれば、つまり、変数ｔにおけるアクセント句の終わりでなければ、ステップＳ６３に進む。 On the other hand, if "YES" in step S55, the variable t is subtracted by 1 in step S57 (t = t-1), and in step S59, it is determined whether or not the end of the accent phrase in the variable t is reached. If "NO" in step S59, that is, if it is not the end of the accent phrase in the variable t, the process proceeds to step S63.

一方、ステップＳ５９で“ＹＥＳ”であれば、つまり、変数ｔにおけるアクセント句の終わりであれば、ステップＳ６１で、変数ｉに要素ｘ［ｔ］を代入し（i=x[t]）、さらに、ステップＳ６３で、要素ｙ［ｔ］に変数ｉを代入して（y[t]=i）、ステップＳ５５に戻る。 On the other hand, if "YES" in step S59, that is, if it is the end of the accent clause in the variable t, the element x [t] is assigned to the variable i in step S61 (i = x [t]), and further. , In step S63, the variable i is assigned to the element y [t] (y [t] = i), and the process returns to step S55.
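
The backward pass of FIG. 10 (steps S53 to S63) can be sketched as follows. The array x is supplied directly instead of being recomputed by step S51, the list phrase_end is a hypothetical input marking the final frame of each accent phrase, and the function name second_attribute is illustrative. The sketch assumes the utterance has at least one frame, since step S53 reads x[T-1].

```python
from typing import List


def second_attribute(x: List[int], phrase_end: List[bool]) -> List[int]:
    """Propagate the mora total of each accent phrase back over its frames."""
    T = len(x)             # assumed to be at least 1
    y = [0] * T
    t, i = T, x[T - 1]     # step S53: t = T, i = x[T-1]
    while t > 0:           # step S55
        t -= 1             # step S57
        if phrase_end[t]:  # step S59: final frame of an accent phrase?
            i = x[t]       # step S61: the last position equals the phrase total
        y[t] = i           # step S63
    return y


# Same hypothetical utterance as a 3-mora phrase followed by a 2-mora phrase.
x = [1, 1, 2, 2, 3, 1, 1, 2]
phrase_end = [False, False, False, False, True, False, False, True]
y = second_attribute(x, phrase_end)
print(y)
```

Scanning from the end works because the position stored at the final frame of an accent phrase equals the number of moras in that phrase, so it can be copied to every earlier frame of the same phrase.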

この実施例によれば、１発話内の値のみを用いて、言語的に関連する属性の比を取ることにより、言語特徴量を正規化するので、外れ値が発生するのを防止することができる。このため、音響特徴量の予測精度が良好である。 According to this embodiment, the language features are normalized by taking the ratio of linguistically related attributes using only the values within one utterance, so that outliers can be prevented from occurring. Therefore, the prediction accuracy of the acoustic features is good.

また、この実施例によれば、外れ値が発生しないため、外れ値が発生するのを防止するために学習データを増やす必要が無い。つまり、この実施例によれば、少量の学習データであっても、音響特徴量の予測精度が良好である。 Further, according to this embodiment, since outliers do not occur, it is not necessary to increase the learning data in order to prevent outliers from occurring. That is, according to this embodiment, the prediction accuracy of the acoustic feature amount is good even with a small amount of learning data.

なお、この実施例では、第１属性値（Ｌ_ｔｄ）の絶対値が第２属性値（Ｌ_ｔδ）の絶対値以下であることを条件とすることにより（数２）、ＤＮＮの入力値が０から１の間（または範囲）に収まるように言語特徴量（の各属性）を正規化したが、これに限定される必要はない。たとえば、第１属性値の絶対値が第２属性値の絶対値よりも大きいことを条件とし、第１属性値（Ｌ_ｔｄ）および第２属性値（Ｌ_ｔδ）に所定の定数を加算したり乗算したりすることでスケールを変化させ、正規化後の値を０から１の範囲を超える値にするようにしてもよい。この場合、各属性値のスケールが変化するだけであるため、スケールを変化させる前と同様の効果が得られる。また、正規化後の値の範囲によっては、第２属性値の絶対値が第１属性値の絶対値より大きいこと（｜Ｌ_ｔｄ｜＜｜Ｌ_ｔδ｜）を条件としてもよい。 In this embodiment, the language features (each attribute) are normalized so that the input values of the DNN fall between 0 and 1 (or within that range), by imposing the condition that the absolute value of the first attribute value (L_td) is equal to or less than the absolute value of the second attribute value (L_tδ) (Equation 2); however, the invention need not be limited to this. For example, on the condition that the absolute value of the first attribute value is larger than the absolute value of the second attribute value, the scale may be changed by adding a predetermined constant to, or multiplying a predetermined constant by, the first attribute value (L_td) and the second attribute value (L_tδ), so that the normalized values exceed the range of 0 to 1. In this case, only the scale of each attribute value changes, so the same effect as before the scale change is obtained. Further, depending on the range of the normalized values, the condition may be that the absolute value of the second attribute value is larger than the absolute value of the first attribute value (|L_td| < |L_tδ|).

なお、上述の実施例で示した具体的な数値は単なる一例であり、限定されるべきではなく、実施される製品等に応じて適宜変更可能である。 It should be noted that the specific numerical values shown in the above-described examples are merely examples, and should not be limited, and can be appropriately changed depending on the product to be implemented and the like.

１０ …音声合成装置 10 ... Speech synthesizer
１２、１６ …記憶部 12, 16 ... Storage unit

Claims

A language processing device that normalizes a language feature vector series that is input to the deep neural network of a speech synthesizer that generates synthetic speech and that is composed of a plurality of different attributes, the language processing device comprising:
a normalization means for normalizing a first attribute in the language feature vector series for one utterance by a second attribute, different from the first attribute, in the language feature vector series for the one utterance.

The language processing device according to claim 1, wherein the first attribute and the second attribute are linguistically related values.

The language processing apparatus according to claim 1 or 2, wherein the normalization means normalizes by dividing the first attribute by the second attribute.

The language processing apparatus according to any one of claims 1 to 3, wherein the absolute value of the first attribute is equal to or less than the absolute value of the second attribute.

A language processing program executed by a language processing device that normalizes a language feature vector series that is input to the deep neural network of a speech synthesizer that generates synthetic speech and that is composed of a plurality of different attributes,
the language processing program causing a processor of the language processing device to execute a normalization step of normalizing a first attribute in the language feature vector series for one utterance by a second attribute, different from the first attribute, in the language feature vector series for the one utterance.

The language processing program according to claim 5, wherein the first attribute and the second attribute are linguistically related values.

The language processing program according to claim 5 or 6, wherein the normalization step normalizes by dividing the first attribute by the second attribute.

The language processing program according to any one of claims 5 to 7, wherein the absolute value of the first attribute is equal to or less than the absolute value of the second attribute.

A language processing method of a language processing device that normalizes a language feature vector series that is input to the deep neural network of a speech synthesizer that generates synthetic speech and that is composed of a plurality of different attributes,
wherein a processor of the language processing device executes a normalization process of normalizing a first attribute in the language feature vector series for one utterance by a second attribute, different from the first attribute, in the language feature vector series for the one utterance.

The language processing method according to claim 9, wherein the first attribute and the second attribute are linguistically related values.

The language processing method according to claim 9 or 10, wherein the first attribute is normalized by dividing by the second attribute.

The language processing method according to any one of claims 9 to 11, wherein the absolute value of the first attribute is equal to or less than the absolute value of the second attribute.