JP3199128B2

JP3199128B2 - Audio encoding method

Info

Publication number: JP3199128B2
Application number: JP08890392A
Authority: JP
Inventors: 仲大室; 一則間野; 健弘守谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1992-04-09
Filing date: 1992-04-09
Publication date: 2001-08-13
Anticipated expiration: 2016-08-13
Also published as: JPH05289696A

Abstract

PURPOSE: To obtain perceptually high-quality encoded speech with a small number of bits, even where the periodicity of the speech is non-stationary or cannot be reproduced sufficiently by the adaptive codebook model. CONSTITUTION: Fluctuations L1, L2, and L3 of the pitch period within a frame are detected in the residual signal (a), which is obtained by passing the input speech through the inverse of the synthesis filter. The input speech is converted into speech with the uniform pitch period L4=(L2+L3)/2 as shown in (b), or with the uniform pitch period L5=(L1+L2+L3)/3 as shown in (c). When an adaptive vector is selected from the adaptive codebook, the distortion of the synthesized speech is calculated against the converted speech, and the adaptive vector that minimizes the distortion is selected.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] This invention relates to a high-efficiency speech coding method that digitally encodes a speech signal sequence with a small amount of information by predictive coding, in which speech is synthesized by driving a filter that represents the spectral envelope characteristics of the speech with an excitation vector.

[0002]

[Prior Art] High-efficiency speech coding methods are used to make efficient use of radio spectrum in digital mobile communication and of storage media in voice storage services and the like. At present, one proposed approach to high-efficiency speech coding divides the original speech into fixed-length sections called frames, of about 5 to 50 ms each, separates the speech of each frame into two kinds of information, namely the characteristics of a linear filter representing the envelope of the frequency spectrum and the excitation signal that drives that filter, and encodes each separately. Within this approach, a known way of encoding the excitation signal is to separate it into a periodic component, considered to correspond to the pitch period (fundamental frequency) of the speech, and the remaining (non-periodic) component, and to encode each. Code-Excited Linear Prediction (CELP) is an example of such an excitation coding method. For details, see M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates", IEEE Proc. ICASSP-85, pp. 937-940, 1985.

[0003] FIG. 8 shows a configuration example of the above coding method. The original speech input at input terminal 1 is passed to the linear prediction analysis unit 2, which calculates linear prediction parameters representing the frequency spectrum envelope of the original speech. The resulting linear prediction parameters are encoded by the linear prediction parameter encoding unit 3 and sent to the linear prediction parameter decoding unit 4. The linear prediction parameter decoding unit 4 reconstructs the filter coefficients from the received code and sends them to the synthesis filter 11 and the distortion calculation unit 12. Details of linear prediction analysis and examples of encoding the linear prediction parameters are given in, for example, Sadaoki Furui, "Digital Speech Processing" (Tokai University Press). The linear prediction analysis unit 2, the linear prediction parameter encoding unit 3, the linear prediction parameter decoding unit 4, and the synthesis filter 11 may be replaced with nonlinear analysis and a nonlinear filter.

[0004] The adaptive codebook 5 takes the immediately preceding excitation vectors stored in a buffer (the already quantized excitation vectors of the preceding one to several frames), cuts out a segment whose length corresponds to a given period, and repeats that segment until it reaches the frame length. It thereby outputs a candidate time-series vector corresponding to the periodic component of the speech.
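The cut-and-repeat operation of the adaptive codebook can be illustrated with a minimal Python sketch. The function name `adaptive_code_vector` and the sample values are hypothetical, and the sketch assumes an integer lag no longer than the buffer; it is not taken from the patent.

```python
def adaptive_code_vector(past_excitation, lag, frame_len):
    """Cut the last `lag` samples of the past excitation and repeat them
    until one frame is filled (integer lag assumed)."""
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the buffer length")
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(frame_len)]


# Hypothetical buffer of already quantized excitation samples.
past = [0.0, 1.0, 0.5, -0.5, 0.0, 1.0, 0.5, -0.5]
candidate = adaptive_code_vector(past, lag=4, frame_len=10)
# candidate repeats [0.0, 1.0, 0.5, -0.5] and is truncated to 10 samples
```

Each candidate lag yields one such vector; the encoder would evaluate every lag it is allowed to signal with the period code.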

[0005] The noise codebook 6 outputs candidate time-series code vectors, each one frame long, corresponding to the non-periodic component of the speech. These candidates are usually based on white Gaussian noise, and a number of candidate vectors fixed in advance according to the number of bits available for encoding is stored independently of the input speech. The candidate time-series vectors output from the adaptive codebook 5 and the noise codebook 6 are multiplied in the multiplication units 8 and 9 by the weights g_a and g_r, respectively, produced by the weight creation unit 7, and are added in the addition unit 10 to form a candidate excitation vector.

[0006] The synthesis filter 11 is a linear filter whose coefficients are the output of the linear prediction parameter decoding unit 4. It takes the candidate excitation vector output from the addition unit 10 as input and outputs a candidate reproduced speech signal. The order of the synthesis filter 11, that is, the order of the linear prediction analysis, is typically about 10 to 16. As already noted, the synthesis filter 11 may be a nonlinear filter.
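As a rough illustration of how a candidate excitation vector is formed and passed through the synthesis filter, the following Python sketch assumes an all-pole filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. This sign convention, the function names, and the sample values are assumptions made for the example.

```python
def build_excitation(e, c, g_a, g_r):
    """Weighted sum of the adaptive vector e and the noise vector c."""
    return [g_a * ei + g_r * ci for ei, ci in zip(e, c)]


def synthesize(excitation, lpc, state=None):
    """All-pole synthesis: y[n] = x[n] - sum_k lpc[k-1] * y[n-k].
    `state` holds past outputs, most recent first (zeros if omitted)."""
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order
    out = []
    for x in excitation:
        y = x - sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = ([y] + history)[:order]
    return out


# Hypothetical first-order example: a single pole at z = 0.5.
excitation = build_excitation([1.0, 0.0, 0.0, 0.0], [0.0] * 4, g_a=1.0, g_r=0.0)
speech = synthesize(excitation, lpc=[-0.5])
```

A practical coder would use an order of about 10 to 16, as stated above, and would carry the filter state from frame to frame.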

[0007] FIG. 9 schematically shows how one reproduced speech candidate is produced by the above procedure. The distortion calculation unit 12 calculates the distortion between the reproduced speech candidate output from the synthesis filter 11 and the input speech. This calculation often takes into account the coefficients of the synthesis filter 11 or the unquantized linear prediction coefficients, for example through perceptual weighting.

[0008] The codebook search control unit 13 selects the period code, noise code, and weight code that minimize the distortion between the reproduced speech candidates and the input speech, and thereby determines the excitation vector for the frame. The period code, noise code, and weight code determined by the codebook search control unit 13, together with the linear prediction parameter code output from the linear prediction parameter encoding unit 3, are sent to the code transmission unit 14 and, depending on the application, are either stored or transmitted to the receiving side.

[0009] FIG. 10 shows a configuration example of the decoding method. The decoding method differs from the coding method in that it reproduces speech from the received codes without searching the codebooks, but its configuration is otherwise basically the same. The code receiving unit 21 receives the linear prediction parameter code, period code, noise code, and weight code from a transmission channel or storage medium. The linear prediction parameter code is sent to the linear prediction parameter decoding unit 22 and converted into filter coefficients.

[0010] The adaptive codebook 23 uses the excitation vectors of previous frames stored in a buffer (the already decoded excitation vectors of the preceding one to several frames) and the received period code. It cuts out from the immediately preceding excitation a segment whose length corresponds to the period specified by the period code, and repeats that segment until it reaches the frame length, thereby outputting a time-series vector corresponding to the periodic component of the speech.

[0011] The noise codebook 24 outputs, from a predetermined number of stored time-series vectors each one frame long, the time-series vector specified by the received noise code. The weight creation unit 25 derives from the received weight code the weight values by which the code vectors output from the adaptive codebook 23 and the noise codebook 24 are to be multiplied.

[0012] The code vectors are multiplied by their weight values in the multiplication units 26 and 27. The two weighted vectors are added in the addition unit 28 to form the excitation vector. The synthesis filter 29 is a linear filter whose coefficients are the output of the linear prediction parameter decoding unit 22; it receives the excitation vector from the addition unit 28 and reproduces the speech. As mentioned for the coding method, a nonlinear filter may also be used as the synthesis filter 29.
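The decoding path of FIG. 10 can be summarized in a short Python sketch. It assumes that the codes have already been translated into a lag, a noise vector, two weights, and filter coefficients, and that the synthesis filter is all-pole with the convention y[n] = x[n] - sum_k a_k y[n-k]. The function `decode_frame` and its arguments are hypothetical names used only for this illustration.

```python
def decode_frame(past_excitation, lag, noise_vector, g_a, g_r, lpc, filter_state):
    """Rebuild one frame: adaptive vector + noise vector -> excitation -> speech.
    Returns the speech, the updated excitation buffer, and the filter state."""
    frame_len = len(noise_vector)
    segment = past_excitation[-lag:]
    adaptive = [segment[i % lag] for i in range(frame_len)]
    excitation = [g_a * a + g_r * c for a, c in zip(adaptive, noise_vector)]

    history = list(filter_state)  # past outputs, most recent first
    speech = []
    for x in excitation:
        y = x - sum(a * h for a, h in zip(lpc, history))
        speech.append(y)
        history = ([y] + history)[:len(lpc)]

    # Keep the buffer length fixed so the next frame sees the newest excitation.
    new_past = (past_excitation + excitation)[-len(past_excitation):]
    return speech, new_past, history


# Hypothetical decoded parameters for one four-sample frame.
speech, past, state = decode_frame(
    past_excitation=[1.0, 0.0], lag=2, noise_vector=[0.0] * 4,
    g_a=0.5, g_r=1.0, lpc=[-0.5], filter_state=[0.0],
)
```

No codebook search appears here, which is the difference from the encoder noted above: the decoder only looks up and combines what the codes specify.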

[0013] The reproduced speech is output from terminal 30.

[0014]

[Problems to Be Solved by the Invention] The adaptive codebook plays a major role in such a coding method. In portions where the periodic component of the speech is large, such as voiced portions, the weight g_a that multiplies the adaptive code vector output from the adaptive codebook is generally encoded as a large value (close to 1). In portions with little periodic component, such as unvoiced portions, g_a is encoded as a small value (close to 0). However, even in portions where the speech is periodic, if that periodicity is non-stationary or cannot be adequately represented by the adaptive codebook model, then choosing the weight code (the value of g_a) in the codebook search control unit 13 so as to minimize the distortion calculated by the distortion calculation unit 12 results in a small encoded value of g_a, and the quality of the reproduced speech degrades markedly. The object of this invention is to provide a low-bit-rate speech coding method with perceptually high quality by improving the part of the coding performed by the adaptive codebook, even in portions where the periodicity of the speech is non-stationary or where the adaptive codebook model cannot adequately represent the periodicity.

[0015]

[Means for Solving the Problems] In this invention, in portions where the periodicity of the speech is non-stationary or where the adaptive codebook model cannot adequately represent the periodicity, the input speech is first transformed into a form that fits the adaptive codebook model, and the period code and weight code (the weight g_a) of the adaptive codebook are then encoded. This allows g_a to be encoded as a large value and makes the coding highly efficient. To keep the perceptual degradation caused by the transformation to a minimum, either the transformation itself is confined to a range that has almost no audible effect, or processing that compensates for the degradation caused by the transformation is performed during decoding. A low-bit-rate, perceptually high-quality speech coding method is thereby realized.

[0016]

[Embodiments] FIG. 1 shows a configuration example of the coding method according to this invention; parts corresponding to those in FIG. 8 carry the same reference numerals. In this invention, the original speech input at terminal 1 is transformed by the preprocessing unit 31 into a signal that is easier to represent with the model of the adaptive codebook 5, and is sent to the distortion calculation unit 12. This preprocessing takes the components of the input speech that are defined in units of the pitch period, or of a waveform whose length corresponds to the pitch period, and that vary continuously or fluctuate, and changes them so that the signal becomes stationary within the frame; specific examples are given later. The distortion calculation unit 12 calculates the distortion between the preprocessed speech and each reproduced speech candidate, and the codebook search control unit 13 selects the codes that minimize the distortion.

[0017] Depending on the preprocessing method, the excitation vector of the previous frame may be used in the preprocessing. If the transformed, preprocessed speech shows almost no audible degradation, no restoration code needs to be transmitted, and the conventional decoding method shown in FIG. 10 can be used as is. On the other hand, to improve coding efficiency substantially, preprocessing that does cause audible degradation may be necessary. In that case, a restoration code corresponding to the processing in the preprocessing unit 31 is sent from the code transmission unit 14. In the decoding method shown in FIG. 2, where parts corresponding to those in FIG. 10 carry the same reference numerals, the restoration processing unit 32 applies to the speech reproduced by the synthesis filter 29 a restoration process that uses the restoration code to compensate for the quality degradation caused by the transformation. However, if the degradation caused by the preprocessing can be compensated by a predetermined process without a restoration code, no restoration code needs to be sent.

[0018] In the decoding method of FIG. 2, the restoration processing unit 32 is placed after the synthesis filter 29, but it may instead be placed between the addition unit 28 and the synthesis filter 29. Also, in the coding method of FIG. 1 the original speech is preprocessed and then used in the distortion calculation, but equivalent processing can be achieved by leaving the reference speech input to the distortion calculation unit 12 as the original speech and instead modifying the candidate excitation vector output from the addition unit 10.

[0019] The parts drawn with broken lines in FIGS. 1 and 2 are used only as needed. In code-excited linear prediction (CELP) coding, let x be the reference (input) speech vector, e the adaptive code vector, g_a the weight applied to the adaptive code vector, c the noise code vector, g_r the weight applied to the noise code vector, and H the matrix whose elements are the impulse response of the synthesis filter 11, given by [Equation 1].

[0020]

[Equation 1] With these definitions, the distortion E is generally defined as

$$E = \| x - g_a H e - g_r H c \|^2$$

where terms such as perceptual weighting have been omitted. Even when perceptual weighting is taken into account, E can be written in the same form with only slight changes to x and H. Ideally, g_a, e, g_r, and c are searched so as to minimize E. In practice, however, to keep the amount of computation realistic, g_a and e are usually determined first so as to minimize E with g_r H c = 0, and g_r and c are then determined with g_a and e held fixed. With this procedure, the optimum value of g_a is

$$g_a = \frac{x^t H e}{\| H e \|^2}$$

and the resulting distortion is

$$E = \| x \|^2 - \frac{(x^t H e)^2}{\| H e \|^2}$$

Here e is an unknown vector, and the best vector is searched for within the adaptive codebook model. If the value of g_a is small even though the portion is voiced, the quality of the reproduced speech is generally poor. Therefore, if preprocessing that modifies the vector x makes a larger g_a possible, coding efficiency improves, which in turn improves the quality of the reproduced speech.
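The first stage of the sequential search, in which g_r H c is set to zero, can be sketched in Python as follows. The sketch assumes that each adaptive codebook candidate has already been passed through the synthesis filter, so each entry of `filtered_candidates` stands for one vector He. The function names and the numbers are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adaptive_gain_and_distortion(x, He):
    """Optimum g_a = (x.He)/|He|^2 and distortion E = |x|^2 - (x.He)^2/|He|^2."""
    energy = dot(He, He)
    if energy == 0.0:
        return 0.0, dot(x, x)  # a silent candidate cannot reduce the distortion
    corr = dot(x, He)
    return corr / energy, dot(x, x) - corr * corr / energy


def best_candidate(x, filtered_candidates):
    """Index, gain, and distortion of the candidate He with the smallest E."""
    scored = [adaptive_gain_and_distortion(x, He) for He in filtered_candidates]
    index = min(range(len(scored)), key=lambda i: scored[i][1])
    return index, scored[index][0], scored[index][1]


# Hypothetical reference vector and two filtered adaptive candidates.
x = [1.0, 1.0]
index, g_a, E = best_candidate(x, [[1.0, 0.0], [2.0, 2.0]])
```

In this toy case the second candidate is parallel to x, so it is selected with g_a = 0.5 and zero distortion. A candidate that matches the shape of x poorly yields both a larger E and, for comparable energy, a smaller g_a, which is the situation the preprocessing is meant to avoid.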

[0021] Next, specific embodiments of the preprocessing unit 31 in FIG. 1 are described. First, FIG. 3 schematically shows an example of preprocessing the pitch period. (a) is the residual signal obtained by applying the inverse filter of the synthesis filter 11 to the input speech. Because speech is a non-stationary signal, measuring the pitch period precisely for each pitch waveform may show that it varies slightly, as L2 and L3, even within one frame. Besides what is generally called "fluctuation", the pitch may also change gradually because of accent and the like. Such changes in the pitch period cannot be fully represented by the model of the adaptive codebook 5, and as a result the value of g_a becomes small. Therefore, as shown in FIG. 3(b), the pitch periods are converted to the uniform interval L4 = (L2 + L3)/2. Alternatively, as shown in FIG. 3(c), the uniform interval L5 = (L1 + L2 + L3)/3 is used. Furthermore, in CELP the lag (repetition unit) of the adaptive codebook may be two or three times the actual pitch period, so, as shown in FIG. 3(d), the intervals may be converted with L6 = L0 and L7 = (L1 + L2 + L3 - L0)/2 so that they are optimal when repeated at the adaptive codebook lag. (d) shows the case where the lag is twice the pitch period, but this extends readily to three or four times.
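The target intervals of FIG. 3(b) to 3(d) are simple averages, as the following Python sketch shows. The period values, in samples, are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise assumptions.

```python
def uniform_period(periods):
    """Mean of the measured pitch periods, as used for L4 and L5."""
    return sum(periods) / len(periods)


def doubled_lag_periods(l0, l1, l2, l3):
    """L6 = L0 and L7 = (L1 + L2 + L3 - L0) / 2, as in FIG. 3(d)."""
    return l0, (l1 + l2 + l3 - l0) / 2


# Hypothetical measured pitch periods in samples.
L0, L1, L2, L3 = 41, 40, 38, 42
L4 = uniform_period([L2, L3])                 # 40.0
L5 = uniform_period([L1, L2, L3])             # 40.0
L6, L7 = doubled_lag_periods(L0, L1, L2, L3)  # 41 and 39.5
```

In the doubled-lag case the total L6 + 2 * L7 equals L1 + L2 + L3, so the overall duration covered by the three measured periods is unchanged.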

[0022] As shown in FIG. 4A, these processes are carried out by having the pitch period analysis unit 33 analyze the precise pitch period of each pitch waveform of the original speech while the pitch period conversion unit 34 modifies the speech waveform. FIG. 3 shows both the previous frame and the current frame being processed as residuals, but for past frames the excitation signal of the previous frame may be used instead. In that case, the pitch period analysis unit 33 performs its analysis using the excitation signal of the preceding frame. When the amount by which the pitch period is changed is small, the processing causes almost no audible degradation, so omitting compensation for the preprocessing during decoding poses no major problem. It is nevertheless better if the decoding process makes the pitch period vary smoothly again.

[0023] FIG. 5 shows an example of a pitch period analysis method. FIG. 5(a) represents the residual signal obtained by applying the inverse filter of the synthesis filter 11 to the input speech. First, the maximum point of the residual in the current frame is found, and a residual waveform one pitch long, centered on that point, is cut out. The cut-out waveform is shown in FIG. 5(b). The pitch length used here may be the average pitch period within the frame, calculated with, for example, the modified correlation method. The cut-out length need not be exactly one pitch length and may be any length; about 70 to 80% of one pitch length is often used.

[0024] Next, the cut-out waveform is used as the coefficients of a matched filter and moved along the residual waveform to compute a sequence of correlation values like that shown in FIG. 5(c). At the time of the maximum point from which the waveform was cut out, the correlation value is 1. The reference positions that determine the precise pitch intervals are found by detecting the peaks of this correlation sequence. The reference positions so determined are shown in FIG. 5(d). The spacing between these pitch reference positions is taken as the pitch period (interval), and the reference positions are also used in the subsequent transformation. In this process, the residual shown in FIG. 5(a) may be replaced, for past frames, with the already quantized excitation signal. For the residual of the current frame, the computation may be done with the internal state of the inverse filter replaced by the already quantized excitation signal.
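The matched-filter analysis of FIG. 5 might look roughly like the following Python sketch. It assumes a normalized cross-correlation (so that the value is 1 where the template was cut out), a template one average pitch period long, and a simple greedy peak picker with a threshold; the peak-picking rule, the threshold, and all names are assumptions, since the text does not specify them.

```python
import math


def normalized_correlation(signal, template):
    """Correlation of the template with every window of the signal, scaled to [-1, 1]."""
    m = len(template)
    t_energy = sum(t * t for t in template)
    corr = []
    for start in range(len(signal) - m + 1):
        window = signal[start:start + m]
        denom = math.sqrt(t_energy * sum(w * w for w in window))
        num = sum(w * t for w, t in zip(window, template))
        corr.append(num / denom if denom > 0.0 else 0.0)
    return corr


def pitch_reference_positions(residual, avg_pitch, threshold=0.5):
    """Cut a template around the largest sample, then take correlation peaks."""
    centre = max(range(len(residual)), key=lambda n: residual[n])
    half = avg_pitch // 2
    lo = max(0, centre - half)
    template = residual[lo:min(len(residual), centre + half)]
    offset = centre - lo  # position of the maximum inside the template
    corr = normalized_correlation(residual, template)
    picked = []
    for s in sorted(range(len(corr)), key=lambda s: corr[s], reverse=True):
        if corr[s] < threshold:
            break
        if all(abs(s - p) >= half for p in picked):  # one peak per pitch pulse
            picked.append(s)
    return sorted(s + offset for s in picked)
```

The differences between consecutive returned positions are the per-waveform pitch intervals, which play the role of L1, L2, L3 in FIG. 3.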

[0025] FIG. 6 shows an example of a method for converting the pitch period (interval). FIG. 6(a) is the residual and FIG. 6(b) shows the pitch reference positions obtained by the method above. To convert the pitch intervals, the pitch reference positions of FIG. 6(b) are moved as shown in FIG. 6(c) so that the spacing between reference positions takes the specified value. First, for each reference position in FIG. 6(b), a window function is created whose weight is 1 at that position and 0 at the adjacent reference positions, as shown in FIG. 6(d), and the residual signal of FIG. 6(a) is multiplied by this window to cut out a waveform. Next, as shown in FIG. 6(e), each cut-out waveform is shifted to follow the movement of its reference position in FIG. 6(c), and the shifted waveforms are overlapped and added again to produce a residual with converted pitch intervals. If the transformation spans frame boundaries, the speech must be read ahead and the residual of FIG. 6(a) held in a buffer. A triangular window is used as the window function in FIG. 6, but a Hanning window or the like is preferable when preserving the waveform near the reference positions is important.
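The window, shift, and overlap-add steps of FIG. 6 can be sketched in Python as below. The sketch uses triangular windows and assumes integer reference positions; at the two ends of the signal, where no neighboring reference position exists, it keeps the weight at 1, which is an assumption of this example and not something the text specifies.

```python
def window_weight(n, left, centre, right):
    """Triangular weight: 1 at `centre`, falling to 0 at the neighbouring
    reference positions. A missing neighbour (None) keeps the weight at 1."""
    if n == centre:
        return 1.0
    if n < centre:
        return 1.0 if left is None else max(0.0, (n - left) / (centre - left))
    return 1.0 if right is None else max(0.0, (right - n) / (right - centre))


def convert_pitch_intervals(residual, old_pos, new_pos):
    """Cut one windowed waveform per reference position, shift it to the new
    position, and overlap-add the shifted waveforms."""
    out = [0.0] * len(residual)
    last = len(old_pos) - 1
    for k, (centre, target) in enumerate(zip(old_pos, new_pos)):
        left = old_pos[k - 1] if k > 0 else None
        right = old_pos[k + 1] if k < last else None
        shift = target - centre
        lo = 0 if left is None else left
        hi = len(residual) - 1 if right is None else right
        for n in range(lo, hi + 1):
            m = n + shift
            if 0 <= m < len(out):
                out[m] += window_weight(n, left, centre, right) * residual[n]
    return out
```

With unchanged positions the windows sum to 1 and the residual is returned essentially unaltered; moving a reference position by a sample or two moves the corresponding pitch pulse while cross-fading the low-energy region between pulses.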

[0026] The above embodiment of pitch interval conversion relies on overlap-adding waveforms, but, as shown in FIG. 7, the pitch interval may instead be converted by stretching or compressing the waveform between one reference position and the next. The simplest method is to take the portion of the waveform midway between two reference positions and delete samples there to shorten the pitch, or insert zeros there to lengthen it. The processed speech is produced by passing the modified residual through the synthesis filter.
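The simplest method, deleting samples or inserting zeros midway between two reference positions, is easy to express in Python. In this sketch `segment` is assumed to hold the residual samples from one reference position up to, but not including, the next; the function name is hypothetical.

```python
def resize_interval(segment, new_len):
    """Shorten a pitch interval by deleting samples around its midpoint,
    or lengthen it by inserting zeros at the midpoint."""
    old_len = len(segment)
    mid = old_len // 2
    if new_len <= old_len:
        drop = old_len - new_len
        start = mid - drop // 2
        return segment[:start] + segment[start + drop:]
    return segment[:mid] + [0.0] * (new_len - old_len) + segment[mid:]


shorter = resize_interval([1, 2, 3, 4, 5, 6], 4)  # [1, 2, 5, 6]
longer = resize_interval([1, 2, 3, 4], 6)         # [1, 2, 0.0, 0.0, 3, 4]
```

Editing the middle of the interval leaves the samples next to the reference positions, where the residual energy is concentrated, untouched.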

[0027] Next, FIGS. 3(e) and 3(f) schematically show an example of preprocessing the amplitude (pitch gain), as a speech component defined in units of a waveform whose length corresponds to the pitch period. FIG. 3(e) is the residual signal obtained by applying the inverse filter of the synthesis filter 11 to the input speech. Because speech is a non-stationary signal, measuring the correlation between the waveform and a copy delayed by one pitch period (the pitch gain) precisely for each pitch waveform may show that the correlation value varies slightly even within one frame; that is, the peak values differ slightly. This case, too, cannot be adequately represented by the adaptive codebook model. Therefore, as shown in FIG. 3(f), if the waveform is modified so that the pitch gain is constant within the frame, giving every pitch waveform the same peak (amplitude), the prediction accuracy of the adaptive codebook rises and, as a result, the weight applied to the adaptive code vector can take a larger value.

[0028] As shown in FIG. 4B, this process is carried out by having the pitch gain analysis unit 35 precisely analyze the pitch gain within the frame while the pitch gain conversion unit 36 performs the modification. Here too, using the excitation vector of the previous frame in the analysis by the pitch gain analysis unit 35 allows the processing to be optimized for the quantization level. The pitch gain can be converted by, for example, multiplying each one-pitch waveform in the residual domain by a certain value. Because this preprocessing may cause a slight loss of quality, it is desirable, where necessary, to apply compensation during decoding so that the pitch gain varies smoothly, using information from the preceding and following frames. It is considered unlikely, however, that bits will need to be allocated to restoration information and transmitted as a restoration code.
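One way to multiply each one-pitch waveform in the residual domain by a certain value is to scale every pitch segment to a common peak amplitude, as in the Python sketch below. Choosing the mean peak as the target and measuring the peak as the largest absolute sample are assumptions of this example; the text only requires that the pitch gain become constant within the frame.

```python
def equalize_pitch_gain(residual, boundaries):
    """Scale each one-pitch segment so that all segments share the same peak.
    `boundaries` lists segment edges, e.g. [0, 20, 40, 60]; segments must be
    non-empty."""
    edges = list(zip(boundaries, boundaries[1:]))
    peaks = [max(abs(v) for v in residual[a:b]) for a, b in edges]
    target = sum(peaks) / len(peaks)  # assumed target: the mean peak
    out = list(residual)
    for (a, b), peak in zip(edges, peaks):
        scale = target / peak if peak > 0.0 else 1.0
        for n in range(a, b):
            out[n] = residual[n] * scale
    return out


# Hypothetical frame with three pitch waveforms of growing amplitude.
frame = [0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
flat = equalize_pitch_gain(frame, [0, 4, 8, 12])
```

After scaling, the three pulses all have amplitude 2, so a delayed copy of one pitch waveform predicts the next one without any change of gain.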

[0029] Next, FIGS. 3(g) and 3(h) show an example of preprocessing the phase characteristics of the frequency spectrum, as a speech component defined in units of a waveform whose length corresponds to the pitch period. FIG. 3(g) schematically shows one pitch waveform cut out from the residual signal. A residual waveform generally has a white power spectrum, but its phase characteristics vary non-stationarily. For example, the waveform of FIG. 3(g) can have only its phase changed while its power spectrum is kept, as shown schematically in FIG. 3(h). Not only when the phase changes within a frame, but also when it differs from that of the already quantized excitation signal of the previous frame, the adaptive codebook model cannot represent the signal adequately, and quality degrades.

[0030] Therefore, as shown in FIG. 4C, the phase analysis unit 37 analyzes the phase of the original speech one pitch at a time while the phase conversion unit 38 performs the transformation. The effect is large if the phase analysis unit 37 also analyzes the phase of the excitation signal of the previous frame and optimal processing, such as matching the phase to that of the previous frame, is performed. In FIG. 3(h) the waveform of FIG. 3(g) is transformed so that the pulse becomes sharp, but in the transformation of this invention the phase change is arbitrary. Because human hearing is insensitive to changes in phase information, this processing rarely causes extreme quality degradation. In general, however, when the phase becomes monotonous the reproduced speech becomes buzzer-like and tends to be a so-called "synthetic sound that lacks naturalness". Accordingly, if there are no spare bits for transmission or storage, omitting the compensation processing is not considered to cause any particular problem, but allocating some bits to restoration information yields a more natural reproduced sound.
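As an illustration of changing only the phase of one pitch waveform while keeping its power spectrum, the following sketch sets every DFT phase to zero, which concentrates the energy into a sharp pulse at the start of the waveform. The zero-phase target and the names dft, idft and to_zero_phase are assumptions made for this example; the patent states that the phase change is arbitrary and does not prescribe this particular transform.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def to_zero_phase(pitch_waveform):
    """Keep every DFT magnitude and set every phase to zero.

    For a real input the magnitude spectrum is symmetric, so the
    inverse transform is real and its largest sample is at index 0.
    """
    magnitudes = [abs(c) for c in dft(pitch_waveform)]
    return idft(magnitudes)
```

Because only the phases are altered, the magnitude of each DFT bin of the result matches that of the input. A practical phase conversion unit could instead align the phase with that of the previous frame's excitation signal, as the description suggests.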

[0031] The three preprocessing operations described above may all be performed simultaneously, or any of them may be used in combination. To satisfy the adaptive codebook model more fully, just as the adaptive codebook repeats a past excitation vector with a certain period, the waveform to be encoded may itself be transformed into a waveform in which a segment cut out from a certain position is repeated with a certain period. In this case as well, the preprocessing methods described above can be used together. With this processing the input (reference) speech becomes an extremely monotonous waveform, so it is considered effective in application fields where the amount of information takes priority even if the result is a so-called "synthetic sound that lacks naturalness". However, when a certain degree of naturalness is required, the difference between the processed speech and the original speech must be encoded with an appropriate number of bits and transmitted as a restoration code.
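The idea of replacing the frame with a cut-out segment repeated at a fixed period, and of forming the difference that a restoration code would have to describe, can be sketched as below. The function names periodize and restoration_difference and their parameters are hypothetical, and how the cut-out position and the period are chosen is left open here, as it is in this passage.

```python
def periodize(frame, start, period):
    """Return a frame built by repeating frame[start:start + period].

    The repetition is aligned so that the cut-out segment stays at its
    original position inside the frame.
    """
    if period <= 0 or start < 0 or start + period > len(frame):
        raise ValueError("the cut-out segment must lie inside the frame")
    segment = frame[start:start + period]
    return [segment[(n - start) % period] for n in range(len(frame))]


def restoration_difference(original, processed):
    """Sample-wise difference between the original and processed frames.

    Assumption: this difference is what would be encoded as the
    restoration code when some naturalness has to be recovered.
    """
    return [o - p for o, p in zip(original, processed)]
```

The difference is zero over the cut-out segment itself and nonzero elsewhere, so the number of bits spent on it directly trades information rate against naturalness.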

[0032] In addition, as a direct preprocessing step for increasing the value of g_a mentioned above, the input x may be transformed using a dynamic programming technique. This invention applies not only to CELP but to all coding schemes based on the adaptive codebook model. It also applies to coding schemes of the type that express the periodicity of speech with a filter, called a pitch filter, whose tap position is the pitch period.
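For readers unfamiliar with the pitch filter mentioned above, a minimal one-tap version can be sketched as follows. It is a generic long-term synthesis filter, y[n] = e[n] + gain * y[n - T], with the tap at the pitch period T; the function name, the optional history argument and the single-tap form are assumptions for illustration and are not taken from the patent.

```python
def pitch_synthesis_filter(excitation, pitch_period, gain, history=None):
    """One-tap long-term filter: y[n] = e[n] + gain * y[n - pitch_period].

    history holds past output samples (at least pitch_period of them);
    when omitted, the past output is assumed to be silence.
    """
    if pitch_period <= 0:
        raise ValueError("pitch_period must be positive")
    past = list(history) if history is not None else [0.0] * pitch_period
    if len(past) < pitch_period:
        raise ValueError("history must cover at least one pitch period")
    signal = past[:]
    for e in excitation:
        # The single tap sits exactly one pitch period in the past.
        signal.append(e + gain * signal[-pitch_period])
    return signal[len(past):]
```

Feeding a single impulse into the filter produces a decaying pulse train spaced by the pitch period, which is the same kind of periodic repetition that the adaptive codebook expresses by repeating a past excitation vector.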

[0033]

[Effects of the Invention] As described above, this invention makes speech easier to represent with the adaptive codebook model. As a result, the weight (quantized value) by which the adaptive code vector is multiplied can be made large, and the distortion caused by encoding can be further reduced. In addition, by performing preprocessing for which little or no information has to be transmitted or stored in order to restore the transformation, higher-quality encoding can be achieved without increasing the amount of information to be transmitted or stored.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an encoding method according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a decoding method according to the embodiment of the present invention.

FIG. 3 is a waveform chart for explaining various examples of preprocessing in the present invention.

FIG. 4 is a block diagram showing various configuration examples of the preprocessing unit 31.

FIG. 5 is a diagram schematically showing how pitch reference positions are determined in one embodiment for analyzing pitch intervals.

FIG. 6 is a diagram schematically showing one embodiment for converting pitch intervals.

FIG. 7 is a diagram schematically showing an example of a method of transforming the waveform between one reference position and the next in one embodiment for converting pitch intervals.

FIG. 8 is a block diagram showing a general configuration example of the encoding method of code-excited linear prediction (CELP) coding.

FIG. 9 is a diagram schematically showing how reproduced speech candidates are created from the adaptive codebook and the noise codebook in code-excited linear prediction coding.

FIG. 10 is a block diagram showing a general configuration example of the decoding method of code-excited linear prediction coding.

Continuation of front page
(56) References: JP-A-59-61891 (JP, A); JP-A-6-27996 (JP, A); JP-A-3-119398 (JP, A); JP-A-4-344699 (JP, A); JP-A-61-150000 (JP, A)
(58) Fields investigated (Int. Cl.⁷, DB name): G10L 19/08; G10L 19/12

Claims

(57) [Claims]

1. A speech encoding method in which input speech is encoded by driving a filter with a time-series vector extracted from an adaptive codebook, obtained by repeating a past excitation vector at a period corresponding to the pitch period, and with a time-series vector extracted from a noise codebook, so as to reproduce speech, wherein the pitch period and a speech component defined in units of a waveform having a length corresponding to the pitch period, which are contained in the input speech and vary continuously or fluctuate, are transformed into signals that are stationary within a frame, and the transformed speech is input again as the speech to be encoded.