JP2707564B2

JP2707564B2 - Audio coding method

Info

Publication number: JP2707564B2
Application number: JP62315621A
Authority: JP
Inventors: 吉章浅川; 熹市川; 和弘近藤; 俊郎鈴木
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-12-14
Filing date: 1987-12-14
Publication date: 1998-01-28
Anticipated expiration: 2013-01-28
Also published as: US5119424A; JPH01155400A

Description

[Field of Industrial Application]

The present invention relates to a speech coding method, and
more particularly to a method for improving the quality of coded speech when speech information is compressed to about 8 kbps.

[Prior Art]

To transmit a speech signal over a wideband cable, the signal is sampled, quantized, and converted into a binary digital code, that is, it is transmitted by PCM.

When a communication network is built on leased digital lines, however, reducing communication cost is a very important issue, and a speech signal whose information rate reaches 60 kbps is too large to be transmitted as it is. Information compression of the speech signal for transmission (that is, low-bit-rate coding) has therefore become necessary.

As a speech coding method that compresses a speech signal to about 8 kbps, it is known to separate speech into spectral envelope information and sound source information and to code each separately. Among such methods, the one that models the sound source information with a single pulse train and white noise is the so-called PARCOR (partial autocorrelation) method. It can code at a low bit rate, but the quality degradation is large. In contrast, methods that represent the sound source with a plurality of pulse trains include the multi-pulse method (see, for example, Ozawa et al., "Quality Improvement of Multi-Pulse Driven Speech Coding," Acoustical Society of Japan, Speech Research Group Material S83-78 (January 1984)) and the residual compression method (see Asakawa et al., "A Study of Speech Synthesis Using Residual Information," Proceedings of the Acoustical Society of Japan, 3-1-7 (October 1984)).

A residual compression method is proposed, for example, in JP-A-61-296398, and is also described in the specifications of Japanese Patent Applications No. 60-241419 and No. 61-35148.

Because these methods represent the sound source more precisely, their quality is better than that of the PARCOR method.

[Problems to be Solved by the Invention]

In the prior art described above, the plurality of pulse trains serving as the sound source are generated independently for each frame according to a fixed criterion.
Here, a frame is the time unit over which speech is analyzed, and is usually set to about 20 ms.

Assume that the speech waveform has been sampled and converted into a sequence of sample values x_i. Let the current sample be x_t, and let the p sample values preceding it be {x_{t-i}} (i = 1, ..., p). Assume further that the speech waveform can be approximately predicted from the past p samples. The simplest form of prediction is linear prediction, so the current value is taken to be approximated by the sum of the past sample values, each multiplied by a fixed coefficient.

The difference between the actual value x_t at the current point t and the predicted value y_t is the prediction error ε, which is called the prediction residual or simply the residual. The prediction residual waveform of speech can be regarded as the sum of two kinds of waveform. One is the so-called error component, whose amplitude is not very large and which resembles random noise. The other is the error that arises when a pulse due to vocal cord vibration is applied to the input: the prediction fails badly and a large-amplitude residual results. This residual component recurs periodically, following the periodicity of the sound source.

Speech is broadly divided into sections that have periodicity (voiced sounds) and sections in which periodicity is not pronounced (unvoiced sounds). Accordingly, the prediction residual waveform is also periodic in voiced sections.

The pulse train generated by the multi-pulse method or the residual compression method can be regarded as an approximation of the residual, so it should be periodic in voiced sections. However, because these pulse trains are generated independently of the preceding and following frames, the relative position of the pulse train can shift from frame to frame, and the periodicity may be disturbed.

When speech is synthesized with such a pulse train as the sound source, the sound quality degrades with a rumbling ("goro-goro") noise.

An object of the present invention is to remedy this problem of the prior art and to provide a speech coding method that prevents the degradation of sound quality caused by frame-to-frame disturbance of periodicity in pulse trains generated by the multi-pulse method or the residual compression method.

[Means for Solving the Problems]

To achieve the above object, the speech coding method of the present invention is characterized by comprising: means for determining whether the current frame is a voiced frame immediately after a switch from an unvoiced frame, a voiced frame that continues a run of voiced frames, or an unvoiced frame; first sound source pulse generating means for generating sound source pulses immediately after a switch from an unvoiced frame to a voiced frame; second sound source pulse generating means for generating sound source pulses when voiced frames continue; and third sound source pulse generating means for generating sound source pulses in an unvoiced frame.

[Operation]

In the present invention, the pulse train generated first is used as a reference, the position of the pulse train in the next frame is estimated from the pitch period, and a new pulse train is generated near the estimated position, so that periodicity is preserved. The period of voiced speech corresponds to the pitch period, which is the reciprocal of the pitch frequency, that is, of the height of the voice. Because pitch changes relatively slowly, it can be regarded as nearly constant within one frame. Therefore, in the first reference frame, for example the first frame after a switch from unvoiced to voiced sound, the sound source pulse train is generated according to a fixed criterion as in the prior art; thereafter, the position of the sound source pulse train in each following frame is estimated from the pulse train already generated, and the pulse train is generated there.
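The three-way decision described under [Means for Solving the Problems] can be sketched as a small classifier that selects which pulse generator handles a frame. This is only an illustration: the names `FrameKind`, `classify_frame`, and `classify_sequence` are hypothetical and do not come from the patent, and the frame before the first one is assumed to be unvoiced.

```python
from enum import Enum


class FrameKind(Enum):
    VOICED_ONSET = 1      # voiced frame right after an unvoiced frame
    VOICED_CONTINUED = 2  # voiced frame following another voiced frame
    UNVOICED = 3          # unvoiced frame


def classify_frame(prev_voiced: bool, cur_voiced: bool) -> FrameKind:
    """Pick which of the three pulse generators handles the current frame."""
    if not cur_voiced:
        return FrameKind.UNVOICED
    return FrameKind.VOICED_CONTINUED if prev_voiced else FrameKind.VOICED_ONSET


def classify_sequence(voiced_flags):
    """Classify a run of frames; the frame before the first is assumed unvoiced."""
    kinds, prev = [], False
    for cur in voiced_flags:
        kinds.append(classify_frame(prev, cur))
        prev = cur
    return kinds
```

Only the `VOICED_CONTINUED` case needs the pulse position of the previous frame, which is why the decision is made before any pulses are generated.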
In the multi-pulse method and the residual compression method the number of sound source pulses is small, so the generated pulses generally form one cluster per pitch period. The position obtained by advancing the pulse train of the last pitch period of a frame by one pitch period along the time axis is therefore taken as the position of the first pulse train of the next frame. This preserves the periodicity of the pulse train across the two frames. In the next frame, the first sound source pulse train is generated near this reference position. As a result, the disturbance of periodicity between frames is eliminated, degradation of sound quality is prevented, and the pulse train obtained is still optimal under the pulse generation criterion.

[Embodiment]

An embodiment of the present invention is described below in detail with reference to the drawings.

FIG. 1 is a block diagram of a speech coding apparatus (speech CODEC) based on the residual compression method to which the speech coding method of the present invention is applied; (a) is the coding unit and (b) is the decoding unit.

As shown in FIG. 1(a), the coding unit comprises a buffer memory 1 that stores the digital speech signal, a linear prediction circuit 3 that performs linear prediction, an inverse filter 5 controlled by parameter 4, a pitch extraction circuit 7 that extracts the pitch by the residual correlation method or the like, a voiced/unvoiced determination circuit 9, a sound source generation unit 11 that generates sound source pulses according to the voiced/unvoiced determination result, and a quantization coding circuit 13.
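As a rough illustration of what the inverse filter produces, the prediction residual defined earlier (actual value minus the linearly predicted value) can be computed as below. This is a minimal sketch under stated assumptions: the function name `prediction_residual` is hypothetical, the predictor is given as direct-form coefficients `a` rather than the partial autocorrelation parameters the apparatus uses, and samples before the start of the signal are treated as zero.

```python
def prediction_residual(x, a):
    """Return e[t] = x[t] - sum_{i=1..p} a[i-1] * x[t-i].

    Samples before the start of x are treated as zero (an assumption made
    only to keep the example short).
    """
    p = len(a)
    residual = []
    for t in range(len(x)):
        # Linear prediction: weighted sum of the p preceding samples.
        predicted = sum(a[i - 1] * x[t - i] for i in range(1, p + 1) if t - i >= 0)
        residual.append(x[t] - predicted)
    return residual
```

For a signal built as x[t] = 0.5·x[t−1] plus unit pulses at t = 0 and t = 4, calling `prediction_residual([1.0, 0.5, 0.25, 0.125, 1.0625, 0.53125], [0.5])` leaves only the two pulses, which is the periodic large-amplitude component the text describes.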
As shown in FIG. 1(b), the decoding unit comprises a decoding circuit 16 that separates the input signal into four kinds of parameters, a buffer memory 19 that stores the decoded spectral parameters, a sound source pulse reproducing circuit 17 that generates sound source pulses from the pitch period, the voiced/unvoiced determination result, and the sound source information, and a synthesis filter 20 whose coefficients are the spectral parameters after correction for the delay in the sound source pulse reproducing circuit 17.

In FIG. 1(a), during coding, one frame of the digitized speech signal is stored in buffer memory 1 and converted by the well-known linear prediction circuit 3 into parameters 4 that represent the spectral envelope (for example, partial autocorrelation coefficients). Inverse filter 5 is then configured with parameters 4 as its coefficients, and the speech signal 2 is fed to it to obtain the residual signal 6. Pitch extraction circuit 7 uses a well-known technique such as the residual correlation method or the AMDF (Average Magnitude Difference Function) method and extracts the pitch period 8 of the frame from the residual signal 6. Voiced/unvoiced determination circuit 9 outputs a determination result 10a indicating whether the frame is voiced or unvoiced, and a signal 10b indicating that a switch from an unvoiced frame to a voiced frame has occurred. Sound source generation unit 11, which is newly provided by the present invention, generates sound source pulses according to the voiced/unvoiced determination result 10a and the switching signal 10b, and outputs their information 12.

Quantization coding circuit 13 receives the spectral parameters 4, the pitch period 8, the voiced/unvoiced determination result 10a, and the sound source information 12, quantizes them to predetermined numbers of bits, converts them into a predetermined format, and sends the result 14 to the digital line 15.

In FIG. 1(b), during decoding, the digital data 14 received from digital line 15 is input to decoding circuit 16 and separated into the four kinds of parameters shown in (a) (pitch period 8', sound source information 12', voiced/unvoiced determination result 10a', and spectral parameters 4'). Three of these (the decoded pitch period 8', voiced/unvoiced determination result 10a', and sound source information 12') are input to sound source pulse reproducing circuit 17, which produces the desired sound source pulses 18.

The remaining parameter (the decoded spectral parameters 4') is stored in buffer memory 19 to compensate for the delay in sound source pulse reproducing circuit 17, and the output of buffer memory 19 is then used as the coefficients of synthesis filter 20. Feeding the sound source pulses 18 into synthesis filter 20 yields the synthesized speech 21 at its output.

FIG. 2 is a functional block diagram of the sound source generation unit in FIG. 1.

As shown in FIG. 2, sound source generation unit 11 consists of a switching control unit 31 that switches control when the frame changes from unvoiced to voiced, a buffer memory 111 that stores the residual signal, a pulse extraction position determination unit (I) 112 that determines the pulse extraction position when the frame has just changed from unvoiced to voiced, a head position memory 30 that stores the head address of the representative residual determined in the previous frame after conversion to an address in buffer memory 111, a pulse extraction position determination unit (II) 32 that determines the pulse extraction position when voiced frames continue, a sound source extraction unit 115 that extracts the sound source from buffer memory 111 starting at the head address, and an unvoiced sound source generation unit 116 that generates the unvoiced sound source.

Because this embodiment concerns sound source generation for voiced frames, the voiced/unvoiced determination result 10a is assumed to indicate voiced, and the pitch period 8 is assumed to have a definite value (hereinafter denoted NPTCH).

First, when the voiced/unvoiced switching signal 10b indicates that the frame immediately follows a switch from unvoiced to voiced, a signal from switching control unit 31 transfers control to pulse extraction position determination unit (I) 112. In this case the function of sound source generation unit 11 is the same as in the conventional residual compression method (for example, the one described as the second method in the above-mentioned JP-A-61-296398). That is, LN consecutive residual pulses are extracted from a representative pitch section, where LN is the value of the number of extracted pulses 113.

When, as described in the above-mentioned Japanese Patent Application No. 60-241419, the decoded residual of the previous frame and the representative residual of the current frame are interpolated at decoding, the representative pitch section is chosen so that it contains the last point of the current frame. Pulse extraction position determination unit (I) 112 therefore evaluates the cumulative amplitude AMP(i) of equation (1) [equation (1) is not reproduced in this text] for every i that satisfies

iFRM − NPTCH + 1 ≦ i ≦ iFRM ... (2)

Here x_j is the residual pulse amplitude at address j, read from buffer memory 111. Buffer memory 111 is a ring buffer that holds the residuals of the previous frame and the current frame. iFRM is the frame length, and LN is the value of the number of extracted pulses 113.

For example, to obtain the amplitude and position information of the next residual pulses to be interpolated, pulse extraction position determination unit (I) 112 first computes the cumulative amplitude using expressions (1) and (2). Suppose the current frame occupies addresses 0 to 159 of buffer memory 111 and 20 consecutive residual pulses are taken for the representative pitch section. The representative pitch section is chosen so that it contains the last point of the current frame, so by expression (2) the position i is sought in the range above the frame length minus the pitch period and not beyond the frame length. The head address is then obtained from the cumulative amplitudes computed by equation (1), and the 20 residual pulses starting at that address are read from buffer memory 111 and used for the interpolation.

Let i₀ be the value of i that maximizes AMP(i) computed by equation (1); i₀ is the head address 114a of the representative residual. When head address 114a is sent to sound source extraction unit 115, LN residuals starting at the head address are read from buffer memory 111 and sent to the following stage as sound source information 12.

Next, the case in which the voiced/unvoiced switching signal 10b indicates that the frame does not immediately follow a switch from unvoiced to voiced, that is, that voiced frames are continuing, is described in detail.

In this case a signal from switching control unit 31 transfers control to pulse extraction position determination unit (II) 32.

Buffer memory 111 holds two frames of residuals. Addresses −iFRM + 1 to 0 hold the previous frame and addresses 1 to iFRM hold the current frame. Head position memory 30 holds the head address i₀ of the representative residual determined in the previous frame, converted to an address in buffer memory 111 (i₀′ = i₀ − iFRM).
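The search carried out for the first voiced frame, and the address conversion kept in the head position memory, can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch assumes AMP(i) is the sum of the absolute amplitudes of the LN residual pulses starting at address i; that form, the function names (`amp`, `onset_head_address`, `to_buffer_address`), and the choice to count samples past the end of the frame as zero are all assumptions. Addresses are 1-based within the current frame, as in the two-frame buffer layout.

```python
def amp(cur, i, ln):
    """Assumed form of AMP(i): sum of |x_j| for j = i .. i+ln-1 (1-based).

    Addresses beyond the end of the frame contribute nothing (assumption).
    """
    return sum(abs(cur[j - 1]) for j in range(i, i + ln) if j <= len(cur))


def onset_head_address(cur, nptch, ln):
    """Return i0, the i in [iFRM-NPTCH+1, iFRM] that maximises AMP(i)."""
    ifrm = len(cur)
    candidates = range(ifrm - nptch + 1, ifrm + 1)  # range (2)
    return max(candidates, key=lambda i: amp(cur, i, ln))


def to_buffer_address(i0, ifrm):
    """Address of i0 once its frame has become the previous frame."""
    return i0 - ifrm
```

With a 16-sample frame, NPTCH = 8 and LN = 2, a pulse pair at addresses 11 and 12 gives i₀ = 11, and `to_buffer_address(11, 16)` gives i₀′ = −5, which falls in the previous-frame half of the buffer.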
The head position of the representative residual of the current frame is determined from i₀′ by equation (3) [equation (3) is not reproduced in this text]. In equation (3), STADRS_1, ..., STADRS_N are the head addresses used to interpolate the representative residual at decoding. STADRS_N is the one in the last pitch section of the current frame, that is, the head address of the representative residual:

i₀ = STADRS_N ... (4)

In this way the head address of the representative residual of the current frame is obtained very simply from that of the previous frame.

However, the pitch period NPTCH is the average pitch period of the current frame, so it may deviate from the actual pitch positions. The position is therefore determined more precisely as follows.

First, a short-term cross-correlation value COR(i) is defined by equation (5) [equation (5) is not reproduced in this text] for

i₀′ + NPTCH − D ≦ i ≦ i₀′ + NPTCH + D ... (6)

Here D (> 0) is a value determined by pitch fluctuation and the like, and COR denotes the cross-correlation value. Expression (6) states that the head address of the first sound source pulse train of the current frame lies within ±D of the head address of the representative residual of the previous frame advanced by one pitch period. Equation (5) accumulates over the LN extracted residual pulses starting at the head address, and the correlation value is largest when the phases coincide.

The first start address STADRS_1 is obtained by equation (7) [equation (7) is not reproduced in this text], which selects the position i with the highest correlation value in the neighborhood of the position NPTCH away from the representative residual of the previous frame. Thereafter i₀′ is replaced by STADRS_1 and STADRS_2 is obtained by the same procedure, and so on up to STADRS_N (= i₀).

Equation (1) can also be used to determine STADRS_n (where n is an arbitrary integer). That is, taking expression (6) as the range of i in equation (1) gives equation (8) [equation (8) is not reproduced in this text]. The same procedure is then repeated up to STADRS_N (= i₀).

The head address (i₀) 114b of the representative residual determined by either of the methods above is sent to sound source extraction unit 115.

At decoding, sound source pulses are reproduced by a conventional method (see, for example, the above-mentioned Japanese Patent Application No. 60-241419) while interpolating between the representative residual and the decoded residual of the previous frame. The interpolation correspondence address is the representative residual position of the previous frame itself, so it need not be transmitted separately.

As described in detail above, the sound source generation unit 11 of this embodiment can be realized simply with adders, correlators, comparators, and the like. The same function can also be realized with a general-purpose microprocessor.

When the voiced/unvoiced determination result 10a for the current frame is unvoiced, a control signal from switching control unit 31 switches control to unvoiced sound source generation unit 116. Unvoiced sound source generation unit 116 generates sound source pulses independently of the pitch period, for example by a previously proposed method (see, for example, Japanese Patent Application No. 61-35148).

FIG. 3 is a time chart for explaining the effect of the present invention.

FIG. 3(a) shows the input speech waveform 41, residual waveform 42, representative residual waveform 43a, and synthesized waveform 44a for the conventional method, and FIG. 3(b) shows the input speech waveform 41, residual waveform 42, representative residual 43b, and synthesized waveform 44b for this embodiment.

The input speech waveform 41 is the same in (a) and (b), and so is the waveform 42 of the residual signal from inverse filter 5. In the conventional method the representative residual (after decoding) is extracted independently for each frame, so, as waveform 43a shows, the representative residual is displaced in frame #3 and the periodicity is disturbed. The arrow indicates the amount of displacement. As a result, as shown in FIG. 3(a), the amplitude of synthesized waveform 44a is attenuated where the displacement occurs, and the sound quality is degraded.

In this embodiment, as shown in FIG. 3(b), when voiced frames continue, the representative residual (after decoding) 43b is extracted dependently, with the representative residual position of the previous frame as the reference. Representative residual 43b has no displacement, so synthesized waveform 44b shows no attenuation and is natural, and the sound quality is improved over the conventional method of FIG. 3(a).

[Effects of the Invention]

As described above, according to the present invention, when voiced sound continues, the sound source pulse train is generated without disturbing the periodicity inherent in the speech. The degradation of sound quality caused by disturbed periodicity is thereby prevented, and the quality of the coded speech is improved.
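The procedure the embodiment describes for consecutive voiced frames (advance the previous head address by one pitch period, then refine within ±D by correlation) can be sketched as follows. Equations (3), (5), and (7) are not reproduced in this text, so the forms used here are assumptions: COR(i) is taken as the product sum of the LN residuals at the reference address with the LN residuals at address i, and the search stops once another pitch period would leave the current frame. The function names and the zero value returned outside the buffer are likewise hypothetical. The buffer is a list of two frames, with addresses −iFRM + 1 to iFRM mapped to list indices 0 to 2·iFRM − 1.

```python
def sample(buf, addr, ifrm):
    """Residual at buffer address addr (-ifrm+1 .. ifrm); zero outside."""
    k = addr + ifrm - 1
    return buf[k] if 0 <= k < len(buf) else 0.0


def cor(buf, ref, i, ln, ifrm):
    """Assumed form of COR(i): correlate LN samples at ref with LN at i."""
    return sum(sample(buf, ref + k, ifrm) * sample(buf, i + k, ifrm)
               for k in range(ln))


def continued_head_addresses(buf, i0_prev, nptch, ln, d):
    """Return [STADRS_1, ..., STADRS_N]; the last entry is the new i0."""
    if not 0 <= d < nptch:
        raise ValueError("d must satisfy 0 <= d < nptch so the search advances")
    ifrm = len(buf) // 2
    ref, starts = i0_prev, []
    while ref + nptch <= ifrm:
        # Search range (6), clipped to the end of the current frame.
        lo, hi = ref + nptch - d, min(ref + nptch + d, ifrm)
        ref = max(range(lo, hi + 1), key=lambda i: cor(buf, ref, i, ln, ifrm))
        starts.append(ref)
    return starts
```

With a 16-sample frame, i₀′ = −5, NPTCH = 8, LN = 2, D = 2, and pulse pairs actually located at addresses 4 and 11, the plain one-pitch-period prediction would give 3 and then 11 or 12 depending on drift, whereas the correlation search returns `[4, 11]`, tracking the pulses despite the mismatch between the average and actual pitch.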

[Brief Description of the Drawings]

FIG. 1 is a block diagram of a speech coding system showing an embodiment of the present invention, FIG. 2 is a block diagram of the sound source generation unit in FIG. 1, and FIG. 3 is a waveform time chart explaining the effect of the present invention.

1, 19, 111: buffer memory; 3: linear prediction circuit; 5: inverse filter; 7: pitch extraction circuit; 9: voiced/unvoiced discriminator; 11: sound source generation unit; 17: sound source pulse reproducer; 20: synthesis filter; 31: switching control unit; 112, 32: pulse extraction position determination circuit; 30: head position memory; 116: unvoiced sound source generation unit; 115: sound source extraction unit; 6: residual signal; 12: sound source information; 21: synthesized speech; 43a, 43b: representative residual waveform; 44a, 44b: synthesized waveform; 42: residual waveform; 41: input speech waveform.

Continuation of front page

(72) Inventor: Toshiro Suzuki, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(56) References: JP-A-59-65897 (JP, A); JP-A-60-162300 (JP, A); JP-A-60-235200 (JP, A); JP-A-61-7899 (JP, A); JP-A-62-38500 (JP, A)

Claims

(57) [Claims]

1. A speech coding method in which a speech signal is analyzed frame by frame and separated into spectral envelope information and sound source information, each frame is determined to be voiced or unvoiced, and in a voiced frame a plurality of pulses per pitch period are used as the sound source, comprising: means for determining whether a voiced frame immediately follows a switch from an unvoiced frame or whether voiced frames are continuing; first sound source pulse generating means for generating sound source pulses immediately after a switch from an unvoiced frame to a voiced frame; second sound source pulse generating means for generating sound source pulses when voiced frames continue; and third sound source pulse generating means for generating sound source pulses in an unvoiced frame; wherein the second sound source pulse generating means determines the sound source pulse position of the current voiced frame from the pitch period, taking as reference the sound source pulse position of the voiced frame immediately preceding the current voiced frame, and generates a sound source pulse train near the determined position.

2. The speech coding method according to claim 1, wherein a correlation method is used to determine the sound source pulse position of the current voiced frame.