JPS63121899A - Voice improvement - Google Patents
Info
- Publication number
- JPS63121899A (application number JP61266704A)
- Authority
- JP
- Japan
- Prior art keywords
- hoarseness
- pitch
- analysis
- normal
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
(57) [Abstract] This publication contains application data filed before the introduction of electronic filing, so no abstract data is recorded.
Description
[Detailed Description of the Invention]
[Field of Industrial Application]
The present invention improves the voice of a speaker whose vocal cords do not function normally because of illness or other causes, the so-called hoarse voice, by adding pitch information so that the voice approaches normally produced speech, thereby enabling a qualitative transformation of the hoarse voice.
[Prior Art]
In the field of medicine, speech-assisting devices such as the artificial larynx have long been developed for people who have lost the larynx for some reason. For people whose vocal cords are abnormal but have not yet been removed, however, there is little remedy other than a vibration device that applies mechanical vibration to the vocal cords from outside. Moreover, the vibration period (pitch) of such a device is fixed, so it cannot vary the intonation according to the content of the utterance.
[Problems to Be Solved by the Invention]
There has so far been no attempt to improve a hoarse voice by purely engineering means, that is, by artificially processing the uttered speech without placing any constraint on the speaker. The present invention makes such improvement possible to some extent; its object is to bring the voice of a speaker whose vocal-cord vibration is abnormal closer to that of a normal speaker by computer processing.
[Means for Solving the Problems]
To achieve this object, the hoarseness-improvement method of the present invention extracts spectral information from the voiced portions of a hoarse-voice signal to construct a vocal-tract filter, and synthesizes speech by inputting to that vocal-tract filter pitch information extracted from a speech signal other than the hoarse-voice signal.
[Function]
According to the present invention, the pitch information that is deficient in a hoarse voice is extracted from normal speech or from an artificial vocal-cord waveform and combined with the spectral information extracted from the hoarse voice, whereby the hoarse voice can be improved.
[Embodiments]
The present invention is described in detail below with reference to the drawings.
Fig. 1 shows a block diagram of the present system. The whole system consists of three parts, an analysis section 1, a feature-extraction section 2, and a synthesis section 3, all executed by digital signal processing within a computer. The basis of the invention is the improvement of hoarse speech, but because information from normal speech of the same content is used in addition to the hoarse voice to be improved, the system in principle requires two inputs, the hoarse voice and the normal voice, as shown in the figure. When normal speech of the same content cannot be obtained, an artificial vocal-cord waveform can be created and supplied to the synthesis section as a second-best measure.
The analysis section 1 performs an analysis (1) of the hoarse voice and an analysis (2) of the normal voice.
First, the analysis (1) of the hoarse voice is explained.
The hoarse-voice signal sampled by an A/D converter is divided into speech and silence intervals, and the speech intervals are further classified into unvoiced and voiced intervals. Unvoiced intervals are recorded as they are; the voiced intervals undergo so-called linear predictive analysis, which yields the linear prediction coefficients carrying the spectral information and a residual signal carrying the pitch information. In analysis (1), i.e. the analysis of the hoarse voice, the frame length (analysis window width) is fixed at 20 to 30 ms and the frame period is 1/2 of the frame length, as shown in Fig. 2. Normal speech consists of a fundamental and its harmonics and has clear pitch information, so analysis synchronized to the pitch is possible; in a hoarse voice, however, the pitch information is incomplete, so the frame length is fixed in this way and the analysis proceeds by shifting the window one frame period at a time.
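The fixed-frame analysis described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 10 kHz sampling rate, the Hamming window, and the order-12 model are all assumptions made for the example.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    # Autocorrelation-method LPC: solve the normal equations R a = r for the
    # prediction coefficients of A(z) = 1 - sum_k a_k z^{-k}.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])

def analyze_hoarse(signal, fs=10000, frame_ms=25, order=12):
    # Fixed frame length of 20-30 ms, frame period of half a frame (Fig. 2).
    n = int(fs * frame_ms // 1000)
    hop = n // 2
    coeffs, residuals = [], []
    for start in range(0, len(signal) - n + 1, hop):
        frame = signal[start:start + n] * np.hamming(n)
        a = lpc_coefficients(frame, order)
        # Inverse filtering with A(z) gives the residual (pitch information),
        # which the feature-extraction step discards for the hoarse voice.
        e = np.convolve(np.concatenate(([1.0], -a)), frame)[:n]
        coeffs.append(a)
        residuals.append(e)
    return coeffs, residuals
```

Because the linear predictor removes the short-term (spectral) structure, the residual of a well-modeled frame carries much less energy than the frame itself.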
The following feature-extraction section retains only the spectral coefficients (prediction coefficients) obtained from the linear predictive analysis with the frame length and frame period of Fig. 2, and discards the pitch information (the residual signal).
Next, analysis (2), the analysis of normal speech with the same content as the hoarse voice, is performed. Here a pitch-synchronous analysis is carried out on the voiced portions before the ordinary analysis: as shown in Fig. 3, the pitch period is detected and taken as the frame length, and in this case the frames do not overlap. The length of each voiced portion of the normal speech is then compared with that of the corresponding voiced portion of the hoarse voice, and the normal speech is stretched or shrunk so that the two lengths agree to within one pitch period. Fig. 4 shows the details of the analysis (2) block of Fig. 1. Adjusting the length after the pitch-synchronous analysis has the advantage that stretching and shrinking can be done in units of one pitch period, and uniformly over the whole voiced interval rather than at one particular spot. That is, as shown in Fig. 5, when a voiced portion of the normal speech is shorter than that of the hoarse voice, copies of the immediately preceding pitch period are inserted at equally spaced positions; when it is longer, pitch periods are thinned out at equally spaced positions so that the lengths match.
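The whole-pitch-period stretching and shrinking of Fig. 5 can be sketched as a uniform re-indexing of the pitch periods: when stretching, neighboring periods are duplicated at evenly spaced positions, and when shrinking, periods are skipped at evenly spaced positions. The function name and data layout (one array per pitch period) are assumptions for this sketch.

```python
import numpy as np

def match_length(periods, target_count):
    # Map target_count positions uniformly onto the source periods; rounding
    # duplicates a period where the section must grow and skips one where it
    # must shrink, so the change is spread evenly over the voiced interval.
    idx = np.round(np.linspace(0, len(periods) - 1, target_count)).astype(int)
    return [periods[i] for i in idx]
```

Ordering is preserved, and the adjustment never touches one particular spot more than its neighbors, which is the stated advantage of adjusting length after pitch-synchronous analysis.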
In this way the length of the normal speech is adjusted for each voiced interval so that it becomes almost equal to the corresponding voiced interval of the hoarse voice. After that, the same analysis as for the hoarse voice (analysis (1) in Fig. 1) is performed to obtain spectral information (prediction coefficients) and pitch information (the residual signal); the spectral information is not needed and is discarded. In the synthesis section of Fig. 1, a vocal-tract filter is constructed from the spectral information obtained from the hoarse voice, i.e. the linear prediction coefficients, and the pitch information obtained from the normal speech, i.e. the residual signal, is input to that filter to produce the synthesized speech. Incidentally, if in Fig. 1 the pitch information is taken from the hoarse voice, the output is the original hoarse voice restored unchanged; conversely, if the spectral information is taken from the normal speech, the original normal speech is restored. A hoarse voice has no pitch information to begin with, or at best extremely incomplete pitch information, but by substituting the normal speaker's residual signal in this way at synthesis time, pitch information is added and a voice with a timbre close to normal speech is obtained.
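The synthesis step, driving the hoarse speaker's vocal-tract filter 1/A(z) with the normal speaker's residual, can be sketched frame by frame as below. This is an illustrative skeleton under assumed names; per-frame joins are handled separately (see the windowing discussed later in the text), and the direct-form recursion is written out explicitly rather than using a library filter.

```python
import numpy as np

def allpole_filter(a, e):
    # Synthesis filter 1/A(z): y[t] = e[t] + sum_k a[k] * y[t-1-k].
    y = np.zeros(len(e))
    for t in range(len(e)):
        y[t] = e[t] + sum(a[k] * y[t - 1 - k] for k in range(min(len(a), t)))
    return y

def synthesize(hoarse_lpc_frames, normal_residual_frames):
    # Vocal-tract shape from the hoarse voice, excitation (pitch) from the
    # normal voice; per-frame outputs are simply concatenated here.
    return np.concatenate([allpole_filter(a, e)
                           for a, e in zip(hoarse_lpc_frames, normal_residual_frames)])
```

Feeding the hoarse voice's own residual back in reproduces the hoarse frame exactly, which matches the remark that using the hoarse pitch information restores the original hoarse voice.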
Normal speech of the same content cannot always be obtained, however. In such a case, as shown in Fig. 6, an artificial vocal-cord waveform modeled on a triangular wave is created, and the pitch information derived from it is supplied to the vocal-tract filter. Since the period T of the triangular wave gives the pitch, T is set to an average value determined by the sex and age of the hoarse speaker, and rise and fall times (T1 and T2) of a few milliseconds each are appropriate. In this case, however, because an artificial vocal-cord wave is used, the resulting synthesized output has a mechanical ring and sounds considerably removed from a natural voice.
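A generator for the triangular vocal-cord waveform of Fig. 6 might look like the sketch below. The parameter values (10 kHz rate, 8 ms period, 3 ms rise and fall) are illustrative stand-ins for the speaker-dependent average pitch the text describes, not values taken from the patent.

```python
import numpy as np

def triangular_source(n_samples, fs=10000, period_ms=8.0, t1_ms=3.0, t2_ms=3.0):
    # One pulse: linear rise over T1, linear fall over T2, zero for the rest
    # of the period T; the pulse is repeated to the requested length.
    T = int(fs * period_ms / 1000)
    t1 = int(fs * t1_ms / 1000)
    t2 = int(fs * t2_ms / 1000)
    pulse = np.zeros(T)
    pulse[:t1] = np.linspace(0.0, 1.0, t1, endpoint=False)          # rise (T1)
    pulse[t1:t1 + t2] = np.linspace(1.0, 0.0, t2, endpoint=False)   # fall (T2)
    reps = int(np.ceil(n_samples / T))
    return np.tile(pulse, reps)[:n_samples]
```

The period T is what sets the perceived pitch, so changing `period_ms` over time would be the natural place to impose an intonation contour.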
Fig. 7 shows examples of the waveform of a hoarse voice and of the speech improved by the two methods described above: (A) is the hoarse voice, (B) is the speech waveform improved using normal speech, and (C) is the speech waveform improved using a triangular wave. Note that changing the vocal-tract characteristics in this way deforms the waveform somewhat, so discontinuities may appear at the joins between frames during synthesis.
Therefore, as shown in Fig. 8, a triangular window of amplitude 1 is applied to the synthesized output of each frame, set so that the gains of adjacent frames always sum to 1 where they overlap, and the overlapping waveforms are added. This operation preserves the continuity of the waveform and is effective for obtaining smooth synthesized speech.
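The windowing of Fig. 8 is a constant-overlap-add scheme: with a half-frame hop, the rising half of one triangular window and the falling half of the next sum to exactly 1 at every overlapping sample. A minimal sketch, with assumed function names:

```python
import numpy as np

def tri_window(n):
    # Triangular window of peak amplitude 1, built so that at a half-frame
    # hop the overlapping gains of adjacent frames always sum to 1.
    hop = n // 2
    return np.concatenate([np.linspace(0.0, 1.0, hop, endpoint=False),
                           np.linspace(1.0, 0.0, hop, endpoint=False)])

def overlap_add(frames):
    # Weight each synthesized frame and add it at half-frame spacing (Fig. 8),
    # hiding the discontinuities at frame joins.
    n = len(frames[0])
    hop = n // 2
    win = tri_window(n)
    out = np.zeros(hop * (len(frames) - 1) + n)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n] += win * f
    return out
```

With constant-valued frames the interior of the output is exactly flat, which is a quick check that the overlapping gains really do sum to 1.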
[Effects of the Invention]
As explained above, according to the present invention the pitch information that is deficient in a hoarse voice is extracted from normal speech or from an artificial vocal-cord waveform and combined with the spectral information extracted from the hoarse voice, whereby the hoarse voice can be improved.
Fig. 1 is a block diagram of the present invention; Fig. 2 is a waveform diagram explaining the frame and analysis-window width in the analysis of the hoarse voice; Fig. 3 is a waveform diagram explaining the pitch-synchronous analysis of normal speech; Fig. 4 is a block diagram of the analysis of normal speech; Fig. 5 is a diagram showing the method of adjusting the length of the voiced intervals of normal speech; Fig. 6 is a diagram of the artificial vocal-cord waveform; Fig. 7(A) is a waveform diagram of the hoarse voice, Fig. 7(B) the waveform improved using normal speech, and Fig. 7(C) the waveform improved using a triangular wave; Fig. 8 is a schematic diagram showing the method of preserving the continuity of the synthesized waveform.
1: analysis section; 2: feature-extraction section; 3: synthesis section.
Claims (1)
[Claims]
1) A hoarseness-improvement method characterized in that spectral information is extracted from the voiced portions of a hoarse-voice signal to construct a vocal-tract filter, and speech is synthesized by inputting to the vocal-tract filter pitch information extracted from a speech signal other than the hoarse-voice signal.
2) The hoarseness-improvement method of claim 1, characterized in that the pitch information is extracted from the voiced portions of a normal speech signal having the same content as the hoarse-voice signal.
3) The hoarseness-improvement method of claim 2, characterized in that the pitch information is extracted after the length of the voiced portions of the normal speech signal has been made equal to the length of the voiced portions of the hoarse-voice signal.
4) The hoarseness-improvement method of claim 1, characterized in that the pitch information is extracted from a triangular wave simulating a vocal-cord waveform.
5) The hoarseness-improvement method of claim 4, characterized in that the period of the triangular wave is set to an average pitch determined by the sex and age of the speaker of the hoarse voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP61266704A JPS63121899A (en) | 1986-11-11 | 1986-11-11 | Voice improvement |
Publications (1)
Publication Number | Publication Date |
---|---|
JPS63121899A true JPS63121899A (en) | 1988-05-25 |
Family
ID=17434523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP61266704A Pending JPS63121899A (en) | 1986-11-11 | 1986-11-11 | Voice improvement |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPS63121899A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016002246A (en) * | 2014-06-17 | 2016-01-12 | 株式会社電制 | Electric type artificial larynx |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3078205B2 (en) | Speech synthesis method by connecting and partially overlapping waveforms | |
Schroeder | Vocoders: Analysis and synthesis of speech | |
EP0982713A2 (en) | Voice converter with extraction and modification of attribute data | |
WO2014046789A1 (en) | System and method for voice transformation, speech synthesis, and speech recognition | |
Bonada et al. | Sample-based singing voice synthesizer by spectral concatenation | |
RU2296377C2 (en) | Method for analysis and synthesis of speech | |
JPH05307399A (en) | Voice analysis system | |
US6594631B1 (en) | Method for forming phoneme data and voice synthesizing apparatus utilizing a linear predictive coding distortion | |
Acero | Source-filter models for time-scale pitch-scale modification of speech | |
JPS63121899A (en) | Voice improvement | |
JP2841797B2 (en) | Voice analysis and synthesis equipment | |
JP2612867B2 (en) | Voice pitch conversion method | |
US10354671B1 (en) | System and method for the analysis and synthesis of periodic and non-periodic components of speech signals | |
JP3035939B2 (en) | Voice analysis and synthesis device | |
US7130799B1 (en) | Speech synthesis method | |
JPS59501520A (en) | Device for articulatory speech recognition | |
JP2000003200A (en) | Voice signal processor and voice signal processing method | |
JP3294192B2 (en) | Voice conversion device and voice conversion method | |
JPS62102294A (en) | Voice coding system | |
JPH02293900A (en) | Voice synthesizer | |
JP3949828B2 (en) | Voice conversion device and voice conversion method | |
JP3083830B2 (en) | Method and apparatus for controlling speech production time length | |
KR100359988B1 (en) | real-time speaking rate conversion system | |
JPH0690638B2 (en) | Speech analysis method | |
KR100322704B1 (en) | Method for varying voice signal duration time |