JPH1165599A

JPH1165599A - Audio compression / expansion method and apparatus, and storage medium for storing audio compression / expansion processing program

Info

Publication number: JPH1165599A
Application number: JP9223512A
Authority: JP
Inventors: Mitsuhiro Inazumi; 満広稲積
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-08-20
Filing date: 1997-08-20
Publication date: 1999-03-09
Anticipated expiration: 2017-08-20
Also published as: JP3661363B2

Abstract

(57) [Abstract]
[Problem] To enable efficient, high-quality audio compression and expansion with simple processing that is well suited to hardware implementation and parallel processing.
[Solution] A speech segment cutout unit 1 cuts out a speech segment of a predetermined interval as the speech segment to be processed. A similarity determination unit 3 determines similarity by referring to a speech segment table 4 containing a group of speech segments of several types, and a speech segment selection unit 5 selects the speech segment containing the portion with the highest similarity. An encoding unit 6 then encodes the speech segment to be processed based on data about the selected speech segment. In addition, after the expansion process or after the spectral envelope parameters are extracted, a speech segment update unit 11 updates the content of each speech segment stored in the speech segment table.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to an audio compression/expansion method and apparatus that compress and expand audio signals efficiently with simple processing, and to a storage medium storing an audio compression/expansion processing program.

[0002]

[Description of the Related Art] Various encoding methods have been proposed for compressing and expanding audio signals. One of them is described in JP-A-59-116973 (hereinafter, the first prior art).

[0003] The first prior art comprises: means for dividing input audio data into short intervals to obtain a short-time audio signal sequence; spectral envelope parameter extraction means for extracting spectral envelope parameters from the short-time audio signal sequence; impulse response sequence calculation means for calculating an impulse response sequence from the spectral envelope parameters; means for calculating an autocorrelation function sequence from the impulse response sequence; means for calculating a cross-correlation function sequence from the impulse response sequence and the short-time audio signal sequence; means for calculating and encoding an excitation signal sequence from the autocorrelation and cross-correlation function sequences; and means for combining and outputting the spectral envelope code and the excitation signal. It further has target signal calculation means for applying a predetermined correction to the short-time audio signal.

[0004] According to the first prior art, the positions and gains of the excitation pulses can be determined efficiently when encoding speech, and the amount of computation and memory used is also reduced to some extent.

[0005] However, when the first prior art encodes a speech signal such as a female voice and then synthesizes speech from it, many excitation pulses must be extracted to obtain high-quality synthesized speech, so the compression ratio deteriorates.

[0006] That is, a female voice is more complex than a male voice, so many excitation pulses must be extracted to obtain accurate synthesized sound, and the compression ratio ends up being poor.

[0007] Meanwhile, techniques for obtaining a high compression ratio include JP-A-63-37399 (hereinafter, the second prior art) and JP-A-3-4300 (hereinafter, the third prior art).

[0008] The second prior art estimates the pitch from the speech signal, obtains the residual between the value estimated from the past pulse train and the actual signal, and calculates the excitation pulses from this residual.

[0009] The third prior art performs pitch estimation and estimates the excitation (multipulse) for one pitch period. It then approximates the other pitch periods by correcting the gain and phase of that multipulse. A second multipulse is further estimated from the residual between the estimated value and the actual value. A noise codebook may also be used in addition to the multipulse signal.

[0010]

[Problems to Be Solved by the Invention] The second and third prior arts described above find the period at which the same waveform repeats, estimate the next period from the preceding one, calculate the difference between the estimate and the actual speech waveform, and calculate the excitation from that difference. A high compression ratio can therefore be achieved.

[0011] However, because the pitch and the differences must be calculated, the amount of computation is large, and a large memory is needed to store these data.

[0012] Moreover, because the excitation pulses are calculated from a residual, if part of the data is lost, the lost portion strongly affects all subsequent calculations, and accurate speech synthesis becomes impossible. This is a serious problem.

[0013] As described above, each of the conventional techniques has its own problems. The first prior art is a basic technique for obtaining excitation pulses, but raising the quality of the synthesized sound requires many excitation pulses, so the compression ratio is particularly poor for audio data such as female voices. The second and third prior arts achieve high compression ratios but require a large amount of computation and memory and, because they use differential information, are vulnerable to data loss.

[0014] Portable information devices that handle audio data have recently come into use in a wide range of fields. Because such devices are severely constrained in CPU speed and memory capacity, a large amount of computation or memory use is a serious problem. Methods that use differential information are also problematic for improving product performance in information devices that must allow for data loss. This applies not only to portable devices: in real-time transmission over computer networks as well, data loss strongly affects the transmitted data.

[0015] As described above, the conventional speech coding methods all share complicated processing, which makes hardware implementation and speed-up by parallel processing relatively difficult. In particular, methods that include pitch period estimation require a large amount of computation and are strongly affected when an error occurs. Furthermore, conventional methods that use excitation pulses and an impulse response derived from the spectral envelope parameters produce discontinuities before and after each pulse, which appear as noise.

[0016] An object of the present invention is therefore to provide an audio compression/expansion method and apparatus, and a storage medium storing an audio compression/expansion processing program, whose processing is simple, which lend themselves readily to hardware implementation and parallel processing, and which enable efficient encoding and audio data compression at a relatively high compression ratio.

[0017]

[Means for Solving the Problems] The audio compression/expansion method according to claim 1 of the present invention includes a process of: cutting out a speech segment of a predetermined interval from the input speech as the speech segment to be processed; referring to a speech segment table containing a group of speech segments of several types; comparing each speech segment in the table with the speech segment to be processed for similarity; selecting the speech segment with the highest similarity; and encoding the speech segment to be processed, based on data about the selected speech segment, to create encoded data.

[0018] In the invention of claim 2, which depends on claim 1, after the encoded data is created, the encoded data is expanded and the expanded data is subtracted from the speech segment to be processed to obtain a residual. The residual waveform is then compared for similarity with each speech segment in the speech segment table containing the several types of speech segments. This process is performed one or more times to obtain the encoded data.

[0019] In the invention of claim 3, which depends on claim 1 or 2, the speech segments stored in the speech segment table include at least: a speech segment created from an already compressed and expanded speech waveform temporally behind the speech segment to be processed; a speech segment created from a temporally forward-predicted speech waveform and a temporally backward-predicted speech waveform estimated from the spectral envelope parameters; and a speech segment created from a noise component. The content of each speech segment is updated after the encoded data is expanded or after the spectral envelope parameters are extracted.

[0020] In the invention of claim 4, which depends on any one of claims 1 to 3, each speech segment in the table spans a longer interval than the speech segment to be processed. When similarity is determined, the speech segment to be processed is compared against positions throughout the length of each table segment, and the speech segment containing the portion with the highest similarity is selected.

[0021] In the invention of claim 5, which depends on claim 4, the encoded data consists of the number of the speech segment containing the most similar portion, position data indicating where that portion lies within the speech segment, and an amplitude adjustment parameter, with spectral envelope parameters added where necessary.

[0022] The audio compression/expansion apparatus of the present invention according to claim 6 includes, as constituent elements: a speech segment cutout unit that cuts out a speech segment of a preset predetermined interval from the input speech as the speech segment to be processed; a spectral envelope parameter extraction unit that extracts spectral envelope parameters from the input speech; a speech segment table that stores speech segments of several types; a similarity determination unit that refers to the speech segment table and compares each speech segment in it with the speech segment to be processed to obtain a similarity; a speech segment selection unit that selects the speech segment with the highest similarity based on the similarities obtained by the similarity determination unit; an encoding unit that encodes the speech segment to be processed based on data about the speech segment selected by the speech segment selection unit; and an encoded data output unit that outputs the data encoded by the encoding unit as encoded data and, where necessary, creates and outputs encoded data in which the spectral envelope parameters extracted by the spectral envelope parameter extraction unit are added to the data encoded by the encoding unit.

[0023] The invention of claim 7 further includes: an expansion unit that expands the data encoded by the encoding unit; a residual generation unit that subtracts the data expanded by the expansion unit from the speech segment to be processed to obtain a residual; and a speech segment update unit that updates the content of the speech segments stored in the speech segment table using the data expanded by the expansion unit or the spectral envelope parameters extracted by the spectral envelope parameter extraction unit.

[0024] In the invention of claim 8, which depends on claim 7, the similarity determination unit, speech segment selection unit, encoding unit, expansion unit, and residual generation unit form a loop in the processing procedure. The residual waveform obtained through similarity determination, speech segment selection, encoding, expansion, and residual generation is compared for similarity with each speech segment in the speech segment table, and this process is performed one or more times before the encoded data is created and output.

[0025] In the invention of claim 9, which depends on any one of claims 6 to 8, the speech segments stored in the speech segment table include at least: a speech segment created from an already compressed and expanded speech waveform temporally behind the speech segment to be processed; a speech segment created from a temporally forward-predicted speech waveform and a temporally backward-predicted speech waveform estimated from the spectral envelope parameters; and a speech segment created from a noise component. The content of each speech segment is updated by the speech segment update unit after the expansion process or after the spectral envelope parameters are extracted.

[0026] In the invention of claim 10, which depends on any one of claims 6 to 9, each speech segment in the table spans a longer interval than the speech segment to be processed. When similarity is determined, the speech segment to be processed is compared against positions throughout the length of each table segment, and the speech segment containing the portion with the highest similarity is selected.

[0027] In the invention of claim 11, which depends on claim 10, the encoded data consists of the number of the speech segment containing the most similar portion, position data indicating where that portion lies within the speech segment, and an amplitude adjustment parameter, with spectral envelope parameters added where necessary.

[0028] Further, in the invention of claim 12, a storage medium storing an audio compression/expansion processing program, the program performs a process of: cutting out a speech segment of a predetermined interval from the input speech as the speech segment to be processed; referring to a speech segment table containing a group of speech segments of several types; comparing each speech segment in the table with the speech segment to be processed for similarity; selecting the speech segment with the highest similarity; encoding the speech segment to be processed based on data about the selected speech segment; and creating encoded data, with spectral envelope parameters added where necessary. The program also performs a process of updating the content of each speech segment stored in the speech segment table after the encoded data is expanded or after the spectral envelope parameters are extracted.

[0029] Thus, the basic process of the present invention is to compare each speech segment in the speech segment table with a speech segment to be processed cut out from the input speech (for example, a segment about 4 msec long), select the speech segment with the highest similarity, and encode the speech segment to be processed based on data about the selected speech segment. Encoding therefore requires only very simple processing, which is advantageous for hardware implementation and parallel processing.
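The basic process in [0029] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text here does not define the similarity measure, so the sketch assumes the squared error remaining after a least-squares gain fit, and the names `best_match`, `target`, and `table` are hypothetical.

```python
def best_match(target, table):
    """Search every table segment at every offset for the window closest to target.

    Returns (segment_index, position, gain, error), or None if every window
    is silent. Similarity is ASSUMED to be squared error after scaling the
    window by a least-squares gain; the source text does not specify it.
    """
    n = len(target)
    best = None
    for idx, segment in enumerate(table):
        for pos in range(len(segment) - n + 1):
            window = segment[pos:pos + n]
            energy = sum(w * w for w in window)
            if energy == 0.0:
                continue  # a silent window cannot be scaled to match anything
            dot = sum(t * w for t, w in zip(target, window))
            gain = dot / energy  # amplitude scale minimizing squared error
            err = sum((t - gain * w) ** 2 for t, w in zip(target, window))
            if best is None or err < best[3]:
                best = (idx, pos, gain, err)
    return best


# Toy table with two segments, each longer than the 3-sample target.
table = [
    [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
]
target = [6.0, 8.0, 10.0]  # exactly twice the window [3, 4, 5]
print(best_match(target, table))
```

Each (segment, offset) candidate is evaluated independently of the others, which is one way to read the text's remark that the process suits hardware implementation and parallel processing.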

[0030] After the encoded data is created, the following process may be performed one or more times: expanding the encoded data, generating a residual by subtracting the expanded data from the speech segment to be processed, and referring to the speech segment table again to determine similarity for the residual waveform. Encoded data of still higher accuracy is obtained in this way.
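The repeated residual matching in [0030] can be illustrated with a short Python sketch. It is a hedged example only: the similarity measure (least-squares gain fit), the fixed number of passes, and the names `search` and `encode_multistage` are assumptions, and the table update described elsewhere in the text is deliberately left out.

```python
def search(target, table):
    """Return (segment_index, position, gain) of the best window, or None.

    ASSUMPTION: similarity is squared error after a least-squares gain fit.
    """
    n = len(target)
    best, best_err = None, None
    for idx, seg in enumerate(table):
        for pos in range(len(seg) - n + 1):
            win = seg[pos:pos + n]
            energy = sum(w * w for w in win)
            if energy == 0.0:
                continue  # skip silent windows
            gain = sum(t * w for t, w in zip(target, win)) / energy
            err = sum((t - gain * w) ** 2 for t, w in zip(target, win))
            if best_err is None or err < best_err:
                best, best_err = (idx, pos, gain), err
    return best


def encode_multistage(target, table, passes=2):
    """Encode target, then re-encode the residual, `passes` times in total."""
    codes = []
    residual = list(target)
    for _ in range(passes):
        match = search(residual, table)
        if match is None:
            break
        idx, pos, gain = match
        codes.append(match)
        # "Expand" the code just produced and subtract it to get the residual.
        window = table[idx][pos:pos + len(target)]
        residual = [r - gain * w for r, w in zip(residual, window)]
    return codes, residual


table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
codes, residual = encode_multistage([3.0, 5.0], table, passes=2)
print(codes, residual)
```

With a least-squares gain, each pass cannot increase the residual energy, which matches the text's claim that repeating the process improves accuracy. Each extra pass adds one more code, so the accuracy comes at the cost of a larger encoded size.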

[0031] Because the speech segment table contains at least a speech segment created from an already compressed and expanded speech waveform temporally behind the speech segment to be processed, a speech segment created from temporally forward-predicted and backward-predicted speech waveforms estimated from the spectral envelope parameters, and a speech segment created from a noise component, the input speech can be encoded efficiently and accurately. In particular, when a predicted speech waveform estimated from the spectral envelope parameters is used, conventional methods generally use only the temporally forward-predicted speech waveform (the impulse response). The present invention instead creates a speech segment from both the temporally forward-predicted and the temporally backward-predicted speech waveforms estimated from the spectral envelope parameters.

[0032] Using the temporally backward-predicted speech waveform in addition to the forward-predicted speech waveform reduces noise. If a speech segment were built from the impulse response (forward-predicted speech waveform) alone, its waveform would rise abruptly from a level of almost zero, so compressing and expanding with that segment would create a discontinuity that appears as noise. Using the backward-predicted speech waveform makes the discontinuity extremely small and greatly improves the quality of the compressed and expanded speech.

[0033] Because the content of each speech segment is updated after the encoded data is expanded or after the spectral envelope parameters are extracted, the table always holds speech segments suited to the speech segment to be processed, unlike a conventional codebook with fixed content, and high-quality encoding becomes possible.

[0034] The encoded data can be expressed as the number of the speech segment containing the similar portion, position data indicating where that portion lies within the speech segment, and an amplitude adjustment parameter, with spectral envelope parameters added in some cases. The encoded data therefore amounts to only a few bytes, allowing substantial data compression. In general, speech rarely changes abruptly, so with speech segments to be processed of about 4 msec each, the spectral envelope parameters change slowly, and extracting them about once every 10 speech segments gives sufficient accuracy. Even with the spectral envelope parameters added, the data is therefore still substantially compressed.

[0035]

[Embodiments of the Invention] Embodiments of the present invention are described below. Before the specific embodiments, the basic processing of the embodiments is described first.

[0036] FIG. 1 shows an input speech waveform, from which a speech segment of, for example, about 4 msec is cut out. The cut-out speech segment h1 (hereinafter, the speech segment to be processed) is compared with the speech segments stored in the speech segment table, the speech segment with the highest similarity is selected from the table, and encoded data is created using the selected speech segment. The speech segment to be processed is set to 4 msec because, in the system used for this embodiment, cutting out segments of about 4 msec gave the best results. Making the segment shorter than 4 msec improves sound quality but lowers the compression ratio, while making it longer than 4 msec favors the compression ratio but may degrade sound quality.

[0037] The speech segment table referred to here holds speech segments created from several kinds of elements, as shown in FIG. 2 (four speech segments, A1 to A4, in this example). How these speech segments are created is described later. The table always holds the latest speech segments, and FIG. 2 shows its content at one particular time.

[0038] Assuming the speech segment table in FIG. 2 holds the latest content, it is determined which portion of which speech segment in the table is most similar to the roughly 4 msec speech segment h1 cut out in FIG. 1. In this case, the portion of speech segment A2 starting at position p1 is determined to be most similar to h1. This most similar portion is hereinafter called the similar portion.

[0039] The encoded data for the speech segment to be processed h1 can thus be expressed by the speech segment number A2 in the table, the position p1, and a scale factor for matching the speech level.

[0040] Because there are four speech segment numbers in this case, A1 to A4, the number can be expressed in 2 bits. If each speech segment is 16 msec long, it contains 128 sampling points (assuming a sampling frequency of 8 kHz), so the position p1 can be expressed in 7 bits. If the speech level is adjusted in, for example, 128 steps, it can also be expressed in 7 bits. These total 16 bits, that is, 2 bytes of data.
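The 2 + 7 + 7 bit budget in [0040] can be illustrated by packing the three fields into one 16-bit word. The field widths come from the text; the field order, the zero-based segment numbering (A1 to A4 mapped to 0 to 3), and the names `pack_code` and `unpack_code` are assumptions made only for this sketch.

```python
def pack_code(segment_no, position, gain_step):
    """Pack (2-bit segment number, 7-bit position, 7-bit gain step) into 16 bits.

    ASSUMED layout, most significant bits first: segment | position | gain.
    """
    if not 0 <= segment_no < 4:
        raise ValueError("segment_no must fit in 2 bits")
    if not 0 <= position < 128:
        raise ValueError("position must fit in 7 bits")
    if not 0 <= gain_step < 128:
        raise ValueError("gain_step must fit in 7 bits")
    return (segment_no << 14) | (position << 7) | gain_step


def unpack_code(word):
    """Inverse of pack_code: return (segment_no, position, gain_step)."""
    return (word >> 14) & 0x3, (word >> 7) & 0x7F, word & 0x7F


# Segment A2 (index 1), position 5, gain step 100.
word = pack_code(1, 5, 100)
payload = word.to_bytes(2, "big")  # the 2-byte code for one 4 msec segment
print(word, payload.hex(), unpack_code(word))
```

A 7-bit position covers offsets 0 to 127, which is enough for every starting point of a 32-sample window inside a 128-sample table segment.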

[0041] In contrast, if each sampling point of the speech segment to be processed h1 carries about 2 bytes of data and the segment contains 32 sampling points, the segment holds 64 bytes of data. The amount of data after encoding is therefore 1/32 of the original.

[0042] When spectral envelope parameters are used, they require about 4.5 bytes of data. In general, however, speech rarely changes abruptly, so with speech segments to be processed of about 4 msec each, the spectral envelope parameters change slowly, and extracting them about once every 10 speech segments gives sufficient accuracy. Even with the spectral envelope parameters added, the encoded data is therefore still substantially compressed relative to the original data.
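The figures in [0041] and [0042] can be combined into a rough estimate of the average code size. The sketch below uses only numbers stated in the text (32 samples of about 2 bytes, a 2-byte code, about 4.5 bytes of envelope data roughly once per 10 segments); the function name `compression_summary` and the simple averaging over segments are assumptions for illustration.

```python
def compression_summary(samples_per_segment=32, bytes_per_sample=2,
                        code_bytes=2.0, envelope_bytes=4.5,
                        envelope_interval=10):
    """Return (raw_bytes, average_coded_bytes, ratio) for one segment.

    The envelope cost is spread evenly over `envelope_interval` segments,
    which is an ASSUMED way of averaging the "about once in 10" figure.
    """
    raw = samples_per_segment * bytes_per_sample
    average = code_bytes + envelope_bytes / envelope_interval
    return raw, average, raw / average


raw, average, ratio = compression_summary()
print(f"raw={raw} B, coded={average:.2f} B, ratio={ratio:.1f}:1")

# Without spectral envelope parameters the ratio is the 1/32 stated in [0041].
print(compression_summary(envelope_bytes=0.0))
```

Under these assumptions the average code is about 2.45 bytes per segment, so the output is roughly 1/26 of the original size instead of 1/32, consistent with the statement that the data remains substantially compressed.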

【００４３】このように、本発明では、処理そのものは
単純であり、しかも効率のよい音声データの圧縮が可能
となる。As described above, according to the present invention, the processing itself is simple, and efficient compression of audio data is possible.

【００４４】次に本発明の具体的な実施の形態について
説明する。Next, a specific embodiment of the present invention will be described.

【００４５】図３は本発明の実施の形態の処理手順を説
明するフロ−チャ−トである。図３において、まず、入
力音声から４msec程度の処理対象音声片ｈ１を切り出す
（ステップｓ１）。この処理は、前述の図１により説明
した処理である。そして、スペクトル包絡パラメータを
抽出するか否かを判断し（ステップｓ２）、スペクトル
包絡パラメータを必要とする場合は、スペクトル包絡パ
ラメータの抽出を行う（ステップｓ３）。なお、前述し
たように、音声は急激に変化することは少ないので、切
り出される処理対象音声片それぞれが４msec程度として
考えた場合、スペクトル包絡パラメータの変化は緩やか
である。したがって、処理対象音声片の１０個に１回程
度の頻度でスペクトル包絡パラメータを抽出することで
十分な精度が得られる。FIG. 3 is a flowchart for explaining the processing procedure of the embodiment of the present invention. In FIG. 3, first, a processing target speech piece h1 of about 4 msec is cut out from the input speech (step s1). This process is the process described with reference to FIG. Then, it is determined whether or not to extract the spectrum envelope parameter (step s2). If the spectrum envelope parameter is required, the spectrum envelope parameter is extracted (step s3). Note that, as described above, since the voice rarely changes abruptly, the spectral envelope parameter changes slowly when each of the target voice segments to be cut is assumed to be about 4 msec. Therefore, sufficient accuracy can be obtained by extracting the spectral envelope parameter at a frequency of about once in 10 speech segments to be processed.

【００４６】そして、次のステップｓ４において、その
時点における音声片表を参照して、最も類似度の高い類
似部分を有する音声片を選択する。たとえば、或る時点
における処理対象音声片ｈ１に対して、その時点の音声
片表の内容が図２に示す内容であったとすると、処理対
象音声片ｈ１は、音声片表の音声片Ａ２の位置ｐ１から
の部分が最も類似していると判定され、その音声片Ａ２
が類似部分を有する音声片として選択される。Then, in the next step s4, the speech segment table at that point in time is referred to, and the speech segment containing the portion most similar to the speech segment being processed is selected. For example, if the speech segment table has the contents shown in FIG. 2 when the speech segment h1 is processed, the portion of speech segment A2 starting at position p1 is judged to be the most similar to h1, and speech segment A2 is selected as the speech segment having the similar portion.
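The search in step s4 might look like the following minimal sketch. This passage does not fix a similarity measure, so the code assumes a least-squares criterion: for every position in every table segment it computes the gain that best scales the table window onto the target, and keeps the candidate with the smallest squared error. The function name `best_match` and the tuple layout are hypothetical.

```python
def best_match(target, table):
    """Return (segment_index, position, gain, error) with the smallest squared error.

    `target` is a list of samples (e.g. 32 samples for 4 msec at 8 kHz) and
    `table` is a list of longer segments (e.g. 128 samples each).
    Returns None if every candidate window has zero energy.
    """
    best = None
    n = len(target)
    for seg_idx, segment in enumerate(table):
        for pos in range(len(segment) - n + 1):
            window = segment[pos:pos + n]
            energy = sum(w * w for w in window)
            if energy == 0.0:
                continue  # an all-zero window cannot be scaled onto the target
            corr = sum(t * w for t, w in zip(target, window))
            gain = corr / energy  # least-squares optimal scaling factor
            err = sum((t - gain * w) ** 2 for t, w in zip(target, window))
            if best is None or err < best[3]:
                best = (seg_idx, pos, gain, err)
    return best
```

Because every (segment, position) candidate is evaluated independently, the two loops could be distributed across parallel workers, which fits the emphasis on hardware and parallel implementation.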

【００４７】次に、選択された音声片Ａ２についてのデ
ータ（音声片番号、位置、音声レベルを合わせるための
倍率）などに基づいて符号化処理を行う（ステップｓ
５）。Next, an encoding process is performed based on the data on the selected speech segment A2 (the speech segment number, the position, and the scaling factor for matching the speech level) (step s5).

【００４８】そして、圧縮処理が終了であるか否かを判
断して（ステップｓ６）、圧縮処理が終了であれば、ス
テップｓ５にて符号化処理した符号化データを出力し
（ステップｓ７）、入力音声についてすべての圧縮処理
が終了か否かを判断して（ステップｓ８）、終了であれ
ば処理を終了とし、まだ、終了していなければ、ステッ
プｓ１に戻る。Then, it is determined whether or not the compression processing has been completed (step s6). If the compression processing has been completed, the coded data that has been coded in step s5 is output (step s7). It is determined whether or not all compression processing has been completed for the input voice (step s8). If the processing has been completed, the processing is terminated. If not completed, the processing returns to step s1.

【００４９】一方、ステップｓ６において、圧縮処理終
了でなければ、伸張処理（ステップｓ９）、残差生成処
理（ステップｓ１０）を行ったのち、ステップｓ４に処
理が戻り、ステップｓ４からステップｓ１０で形成され
るループ処理を行う。以下、このループ処理について説
明する。On the other hand, if the compression processing is not completed in step s6, the decompression processing (step s9) and the residual generation processing (step s10) are performed, after which the processing returns to step s4, so that steps s4 through s10 form a loop. This loop processing is described below.

【００５０】前述したように、たとえば、処理対象音声
片ｈ１に対して音声片表の音声片Ａ２の位置ｐ１からの
部分が最も類似していると判定され、その類似部分を有
する音声片Ａ２が選択されたとする。そして、選択され
た音声片Ａ２についてのデータ（音声片番号、位置、音
声レベルを合わせるための倍率）などに基づいて符号化
処理を行う。この段階で圧縮処理を終了としないで、同
じ処理を何回か繰り返す。つまり、ステップｓ５におい
て符号化されたあと、符号化されたデータを、一旦、伸
張処理し（ステップｓ９）、その後、残差生成処理を行
う（ステップｓ１０）。As described above, suppose, for example, that the portion of speech segment A2 in the speech segment table starting at position p1 is judged to be the most similar to the speech segment h1 being processed, and that speech segment A2 is selected. The encoding process is then performed based on the data on the selected speech segment A2 (the speech segment number, the position, and the scaling factor for matching the speech level). Instead of ending the compression processing at this stage, the same processing is repeated several times. That is, after the data is encoded in step s5, the encoded data is first decompressed (step s9), and a residual generation process is then performed (step s10).

【００５１】この残差生成処理というのは、符号化され
て伸張された音声データを、元の入力音声（この場合、
処理対象音声片ｈ１）から差し引いて、その差分を取る
処理である。つまり、図４に示すように、処理対象音声
片ｈ１から伸張処理された音声データＨ１を引いて、そ
の残差ｄ１を求める。そして、求められた残差ｄ１につ
いて、その時点における音声片表を参照して、最も類似
度の高い部分（類似部分）を有する音声片を選択すると
いう処理を行う。このような処理を1回以上行うことに
より、より一層、高精度な圧縮データが得られるが、２
回程度でも十分な精度が得られる。The residual generation process subtracts the encoded and then decompressed speech data from the original input speech (in this case, the speech segment h1 being processed) and takes the difference. That is, as shown in FIG. 4, the decompressed speech data H1 is subtracted from the speech segment h1 to obtain the residual d1. For the residual d1, the speech segment table at that point in time is again referred to, and the speech segment containing the most similar portion is selected. Performing this processing one or more times yields more accurate compressed data, and about two passes are enough for sufficient accuracy.
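The loop of steps s4 to s10 can be sketched as repeated matching on the residual. The code below is a simplified illustration under several assumptions: similarity is least-squares error with a free gain, the table is held fixed during the passes (the text updates it after each decompression), and gains are not quantized. `best_match` and `encode_multistage` are hypothetical names.

```python
def best_match(target, table):
    """Return (segment_index, position, gain, error) with the smallest squared error."""
    best = None
    n = len(target)
    for seg_idx, segment in enumerate(table):
        for pos in range(len(segment) - n + 1):
            window = segment[pos:pos + n]
            energy = sum(w * w for w in window)
            if energy == 0.0:
                continue
            gain = sum(t * w for t, w in zip(target, window)) / energy
            err = sum((t - gain * w) ** 2 for t, w in zip(target, window))
            if best is None or err < best[3]:
                best = (seg_idx, pos, gain, err)
    return best


def encode_multistage(target, table, stages=2):
    """Encode `target` as up to `stages` (segment, position, gain) codes.

    Each pass matches the current residual, reconstructs the approximation
    (the "decompression" step), and subtracts it to form the next residual.
    """
    codes = []
    residual = list(target)
    for _ in range(stages):
        match = best_match(residual, table)
        if match is None:
            break
        seg, pos, gain, _err = match
        codes.append((seg, pos, gain))
        approx = [gain * s for s in table[seg][pos:pos + len(residual)]]
        residual = [r - a for r, a in zip(residual, approx)]
    return codes, residual
```

With a least-squares gain the residual energy cannot increase from one pass to the next, which is consistent with the observation that a small number of passes already improves accuracy.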

【００５２】ところで、ステップｓ９にて行われる伸張
処理は、図５のフロ−チャ−トに示されるような処理手
順にて行われる。The decompression process performed in step s9 is performed according to the processing procedure shown in the flowchart of FIG.

【００５３】すなわち、符号化されたデータを入力し
（ステップｓ１１）、スペクトル包絡パラメータの更新か
否かを判断する（ステップｓ１２）。つまり、スペクト
ル包絡パラメータが抽出されている場合は、これまでの
スペクトル包絡パラメータの値を新たなスペクトル包絡
パラメータの値に更新する（ステップｓ１３）。That is, the encoded data is input (step s11), and it is determined whether the spectral envelope parameter is to be updated (step s12). If a spectral envelope parameter has been extracted, the previous value of the spectral envelope parameter is replaced with the new value (step s13).

【００５４】次に、その時点における音声片表を参照し
て、符号化データに基づいて最も類似度の高い部分（類
似部分）を有する音声片を選択する（ステップｓ１
４）。そして、選択された音声片データに基づいて伸張
データを作成する（ステップｓ１５）。そして、処理が
終了したか否かを判断する（ステップｓ１６）。処理終
了でなければ、ステップｓ１５にて伸張処理されたデー
タを用いて、それまでの音声片表の内容を、この新たな
音声片によって更新する（ステップｓ１７）。Next, the speech segment table at that point in time is referred to, and the speech segment containing the most similar portion is selected based on the encoded data (step s14). Decompressed data is then created from the selected speech segment data (step s15), and it is determined whether the processing has finished (step s16). If it has not finished, the contents of the speech segment table are updated with a new speech segment built from the data decompressed in step s15 (step s17).

【００５５】そして、さらに符号化データ存在すれば、
その符号化データに対して、同様の処理が行われる。If further encoded data exists, the same processing is performed on that encoded data.
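A minimal decoder-side sketch of steps s14, s15, and s17 follows. It assumes each code is a (segment number, position, gain) triple, that multi-pass codes for one segment are summed, and that the table entry holding past decoded speech is a sliding window of the most recent samples; the update policy and the names `decode_segment` and `update_history` are assumptions, not details given in the text.

```python
def decode_segment(codes, table, length):
    """Rebuild `length` samples by summing the scaled table windows named by `codes`."""
    out = [0.0] * length
    for seg, pos, gain in codes:
        window = table[seg][pos:pos + length]
        for i, sample in enumerate(window):
            out[i] += gain * sample
    return out


def update_history(table, history_index, decoded):
    """Append decoded samples to the history segment, keeping its length fixed."""
    keep = len(table[history_index])
    table[history_index] = (table[history_index] + decoded)[-keep:]
```

If the encoder applies the same update after its own internal decompression step, both sides hold identical tables, so the transmitted positions refer to the same samples on each side.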

【００５６】なお、この伸張処理は、図３の処理手順の
一つとしてだけ用いられるのではなく、伸張処理単独で
も用いられる。たとえば、符号化されたデータが所定の
メモリに蓄えられている場合、その符号化されたデータ
を伸張処理する場合にも用いられる。Note that this decompression processing is used not only as one step of the procedure in FIG. 3 but also on its own, for example to decompress encoded data that has been stored in a given memory.

【００５７】このようにして伸張処理が終了すると、図
３のフローチャートにおいては、残差生成を行う（ステ
ップｓ１０）。つまり、前述したように、図４に示すよ
うに、音声片ｈ１から伸張処理された音声データＨ１を
引いて、その残差ｄ１を求める。そして、求められた残
差ｄ１について、その時点における音声片表（伸張処理
後に新たに更新された音声片表）を参照して、最も類似
度の高い部分（類似部分）を有する音声片を選択すると
いう処理を行う。このような処理を1回以上行うことに
より、より一層、高精度な圧縮データが得られるが、前
述の如く、２回程度でも十分な精度が得られる。When the decompression processing ends in this way, the flowchart of FIG. 3 proceeds to residual generation (step s10). That is, as described above and shown in FIG. 4, the decompressed speech data H1 is subtracted from the speech segment h1 to obtain the residual d1. For the residual d1, the speech segment table at that point in time (the table as newly updated after the decompression processing) is referred to, and the speech segment containing the most similar portion is selected. Performing this processing one or more times yields more accurate compressed data, but, as noted above, about two passes are enough for sufficient accuracy.

【００５８】ところで、以上の処理で用いられる音声片
表は、少なくとも以下に示す要素により作成された音声
片を含むものである。By the way, the speech piece table used in the above processing includes at least speech pieces created by the following elements.

【００５９】（１）現在、切り出された処理対象音声片
に対し、すでに圧縮伸張処理された音声データ（処理対
象音声片に対し、時間的に後方の圧縮伸張処理された音
声データ）を用いる。なお、ここでは、すでに過ぎ去っ
た時間を時間的に後方といい、これから先の時間を時間
的に前方という表現を用いる。(1) Speech data that has already been compressed and decompressed is used, that is, compressed and decompressed speech data that is temporally behind the speech segment currently cut out for processing. Here, time that has already passed is called temporally backward, and time still to come is called temporally forward.

【００６０】たとえば、入力音声が図６（ａ）であると
し、ある時刻ｔ１までの入力音声がすでに圧縮伸張処理
され、その圧縮伸張処理された音声波形が図６（ｂ）の
ようであったとする。そして、現在、処理対象音声片が
ｈ１であったとすると、その処理対象音声片ｈ１に対し
ては、図６（ｂ）に示す圧縮伸張された音声波形の所定
部分（処理対象音声片ｈ１に対する直前の圧縮伸張され
た音声波形）を音声片として用いる。これは、図２に示
す音声片表においては、たとえば、Ａ２の音声片に相当
する。なお、その音声片の時間的な長さは、１６msec程
度とする。For example, suppose that the input speech is as shown in FIG. 6(a), that the input speech up to a certain time t1 has already been compressed and decompressed, and that the resulting compressed and decompressed waveform is as shown in FIG. 6(b). If the speech segment currently being processed is h1, a predetermined portion of the compressed and decompressed waveform shown in FIG. 6(b), namely the waveform immediately preceding h1, is used as a speech segment. In the speech segment table of FIG. 2 this corresponds to, for example, speech segment A2. The speech segment is about 16 msec long.

【００６１】（２）処理対象音声片の近傍のスペクトル
包絡パラメータより推定される時間的前方予測音声波形
およびそれと連続する時間的後方予測音声波形を用い
る。(2) A temporally forward predicted speech waveform estimated from a spectral envelope parameter near a speech segment to be processed and a temporally backward predicted speech waveform that follows the waveform are used.

【００６２】前にも述べたように、スペクトル包絡パラ
メータは、切り出された音声片ごとに送る必要はない。
これは、音声は急激には変化することは殆どないと考え
られるためであり、たとえば、数個から十数個の処理対
象音声片に対して１回というような割合でスペクトル包
絡パラメータを送ればよい。そういう意味で、ここで
は、処理対象音声片の“近傍”のスペクトル包絡パラメ
ータという表現を用いている。As mentioned earlier, the spectral envelope parameter does not need to be sent for every extracted speech segment. Because speech is considered to change abruptly only rarely, it is enough to send the spectral envelope parameter at a rate of, for example, once per several to a dozen or so speech segments to be processed. It is in this sense that the text refers to the spectral envelope parameter "in the vicinity of" the speech segment being processed.

【００６３】なお、この現在処理対象音声片の近傍のス
ペクトル包絡パラメータより推定される時間的前方予測
音声波形およびそれと連続する時間的後方予測音声波形
というのは、図７に示すように、インパルス応答（前方
予測音声波形）ｘ１に加えて、時間的に後方の後方予測
音声波形ｘ２を指している。As shown in FIG. 7, the temporally forward predicted speech waveform estimated from the spectral envelope parameter in the vicinity of the speech segment currently being processed, together with the temporally backward predicted speech waveform continuous with it, refers to the impulse response (forward predicted speech waveform) x1 plus the backward predicted speech waveform x2 that lies temporally behind it.

【００６４】このように、インパルス応答（前方予測音
声波形）に加えて、時間的に後方の後方予測音声波形を
用いると、雑音の低減を図れる効果がある。すなわち、
インパルス応答（前方予測音声波形）のみを用いた音声
片とした場合、音声レベルが殆ど０の状態から急激に波
形が立ち上がった音声片となってしまうため、その音声
片を用いて圧縮伸張処理したとき、不連続点が生じるこ
とによってその部分が雑音となって現れるという問題点
がある。これに対して、時間的に後方の後方予測音声波
形を用いると不連続点を限りなく小さくすることがで
き、圧縮伸張音声の品質を大幅に改善できる。Using the temporally backward predicted speech waveform in addition to the impulse response (forward predicted speech waveform) in this way reduces noise. If a speech segment were built from the impulse response alone, its waveform would rise abruptly from a level of almost zero, so compression and decompression using that segment would create a discontinuity that is heard as noise. Adding the temporally backward predicted speech waveform makes the discontinuity as small as possible and greatly improves the quality of the compressed and decompressed speech.

【００６５】（３）雑音波形を用いる。(3) Use a noise waveform.

【００６６】この雑音波形は乱数で与えられたものでも
よく、また、実際の入力音声中からサンプル化されたも
のを用いてもよい。The noise waveform may be given by random numbers, or may be sampled from actual input speech.
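One way to obtain a noise waveform from random numbers is sketched below. The uniform distribution, the amplitude range, the 128-sample length (16 msec at 8 kHz), and the fixed seed are assumptions for illustration. A fixed seed is used because a decoder would need to build the same noise segment as the encoder for the position codes to refer to the same samples.

```python
import random


def noise_segment(length=128, seed=0, amplitude=1.0):
    """Return `length` pseudo-random samples drawn uniformly from [-amplitude, amplitude]."""
    rng = random.Random(seed)  # private generator, so the result is reproducible
    return [rng.uniform(-amplitude, amplitude) for _ in range(length)]
```

Calling `noise_segment()` twice with the same seed yields identical lists, so the encoder and the decoder can each construct the segment locally instead of transmitting it.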

【００６７】以上のように、本発明で使用する音声片表
の内容としては、（１）〜（３）で説明した音声片を少
なくとも含むものとする。そして、これら各音声片は１
６msec程度の長さの音声片として、たとえば、図２に示
すような状態で保持され、常に、最新のデータが蓄えら
れる。As described above, the speech segment table used in the present invention contains at least the speech segments described in (1) to (3). Each of these speech segments is held as a segment about 16 msec long, for example in the state shown in FIG. 2, and the latest data is always stored.

【００６８】図８は本発明の音声圧縮伸張装置の構成を
示すブロック図である。図８において、音声入力部１か
ら入力された音声は、音声切り出し部２によって、前述
したように、たとえば、４msec程度の処理対象音声片と
して切り出される。この切り出された処理対象音声片
は、類似度判定部３によって、音声片表４内の幾つかの
音声片Ａ１，Ａ２，・・・，Ａｎと比較され類似度を得
る。そして、音声片選択部５によって最も類似度の高い
部分（類似部分）を有する音声片が選択される。FIG. 8 is a block diagram showing the configuration of the speech compression/decompression device of the present invention. In FIG. 8, speech input from the speech input unit 1 is cut out by the speech segment extraction unit 2 into speech segments to be processed of, for example, about 4 msec, as described above. The similarity determination unit 3 compares each extracted speech segment with the speech segments A1, A2, ..., An held in the speech segment table 4 to obtain similarities, and the speech segment selection unit 5 selects the speech segment containing the most similar portion.

【００６９】符号化部６は、選択された音声片について
のデータ（音声片番号、位置、音声レベルを合わせるた
めの倍率）などに基づいて符号化処理を行う。なお、こ
の段階で符号化処理を終了とすれば、その符号化データ
を符号化データ出力部７から出力する。また、このと
き、スペクトル包絡パラメータを用いる場合は、スペク
トル包絡パラメータ抽出部８によって抽出されたスペク
トル包絡パラメータを加えた符号化処理を行う。The encoding section 6 performs an encoding process based on the data (speech segment number, position, magnification for matching the speech level) of the selected speech segment. If the encoding process is terminated at this stage, the encoded data is output from the encoded data output unit 7. At this time, when using the spectrum envelope parameter, an encoding process is performed by adding the spectrum envelope parameter extracted by the spectrum envelope parameter extraction unit 8.

【００７０】一方、符号化処理終了でなければ、符号化
部６で符号化された符号化データを伸張部９によって伸
張処理し、残差生成部１０にて残差生成処理を行う。こ
の伸張処理と残差生成処理は図３におけるフローチャー
トのステップｓ９とステップｓ１０の処理である。On the other hand, if the encoding process has not finished, the data encoded by the encoding unit 6 is decompressed by the decompression unit 9, and the residual generation unit 10 performs the residual generation process. These correspond to steps s9 and s10 in the flowchart of FIG. 3.

【００７１】この残差生成処理というのは、前述したよ
うに、符号化されて伸張された音声データを元の入力音
声（この場合、処理対象音声片ｈ１）から差し引いて、
その差分を取る処理である。つまり、図４に示すよう
に、音声片ｈ１から伸張処理された音声データＨ１を引
いて、その残差ｄ１を求めるものである。そして、求め
られた残差について、その時点における音声片表を参照
して、最も類似度の高い類似部分音声片を選択するとい
う処理を行う。As described above, the residual generation process subtracts the encoded and then decompressed speech data from the original input speech (in this case, the speech segment h1 being processed) and takes the difference. That is, as shown in FIG. 4, the decompressed speech data H1 is subtracted from the speech segment h1 to obtain the residual d1. For the residual obtained, the speech segment table at that point in time is referred to, and the speech segment containing the most similar portion is selected.

【００７２】なお、前記伸張部９にて行われる伸張処理
は、図５のフロ−チャ−トに示されるような処理手順に
て行われる。そして、伸張処理された音声データを用い
て、音声片更新部１１が音声片表４の内容の更新を行
う。また、この音声片更新部１１は、スペクトル包絡パ
ラメータ抽出部８からスペクトル包絡パラメータが抽出
された場合は、そのスペクトル包絡パラメータにより推
定される時間的前方予測音声波形およびそれと連続する
時間的な後方予測音声波形をも更新する。このようにし
て、音声片表４の内容は常に最新の音声片が格納される
ことになる。The decompression processing performed by the decompression unit 9 follows the procedure shown in the flowchart of FIG. 5. The speech segment updating unit 11 then updates the contents of the speech segment table 4 using the decompressed speech data. When a spectral envelope parameter has been extracted by the spectral envelope parameter extraction unit 8, the speech segment updating unit 11 also updates the temporally forward predicted speech waveform estimated from that parameter and the temporally backward predicted speech waveform continuous with it. In this way, the speech segment table 4 always holds the latest speech segments.

【００７３】このような構成の音声圧縮伸張装置の全体
的な動作については、図３のフローチャートで説明した
ので、ここではその動作についての説明は省略する。The overall operation of a speech compression/decompression device configured in this way has already been described with reference to the flowchart of FIG. 3, so its description is omitted here.

【００７４】なお、本発明は前述の実施の形態に限定さ
れるものではなく、本発明の要旨を逸脱しない範囲で種
々変形実施可能となるものである。たとえば、切り出さ
れる処理対象音声片は、前述の実施の形態では、４msec
としたが、これは、前述の実施の形態において使用した
システムでは、４msecとすることで最もよい結果が得ら
れたからである。しかし、使用するシステムなどによっ
ては、この数値は異なる場合もあるので、これに限定さ
れるものではなく、本発明が適用されるシステムに応じ
て最適な時間を設定することができる。また、図２で示
した音声片表の内容は一例であって、これに限られるも
のではない。The present invention is not limited to the embodiment described above and can be modified in various ways without departing from its gist. For example, the speech segment to be processed was set to 4 msec in the embodiment because 4 msec gave the best results on the system used there. The best value may differ from system to system, so the length is not limited to 4 msec and can be set to suit the system to which the invention is applied. Likewise, the contents of the speech segment table shown in FIG. 2 are only an example and are not limiting.

【００７５】また、以上説明した本発明の音声圧縮伸張
処理を行う処理プログラムは、フロッピィディスク、光
ディスク、ハードディスクなどの記憶媒体に記憶させて
置くことが出来、本発明は、これらの記憶媒体をも含む
ものであり、また、ネットワークからデータを得る方式
でもよい。The processing program that performs the speech compression/decompression processing described above can be stored on a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also covers such storage media. The data may also be obtained over a network.

【００７６】[0076]

【発明の効果】以上説明したように、本発明によれば、
音声片表内のそれぞれの音声片と入力音声から切り出し
た処理対象音声片との類似性を比較し、最も類似度の高
い音声片を選択し、その選択された音声片についてのデ
ータを基に前記処理対象音声片を符号化する処理を基本
処理として行うようにしている。これにより、符号化が
きわめて単純な処理で可能となる。As described above, the basic processing of the present invention compares each speech segment in the speech segment table with the speech segment to be processed, which is cut out from the input speech, selects the speech segment with the highest similarity, and encodes the speech segment to be processed based on the data on the selected speech segment. Encoding therefore requires only very simple processing.

【００７７】また、符号化データを作成したのち、その
符号化データの伸張処理を行い、伸張されたデータを前
記処理対象音声片から差し引いて得られた残差波形に対
して、再び、音声片表を参照し、類似性を求めるという
処理を複数回行って符号化データを得ることにより、よ
り一層、高品質な符号化データを得ることができる。In addition, after the encoded data is created, it is decompressed, the decompressed data is subtracted from the speech segment to be processed, and the resulting residual waveform is again compared against the speech segment table to find the most similar portion. Repeating this processing several times yields encoded data of still higher quality.

【００７８】また、音声片表に格納される音声片は、処
理対象音声片よりも時間的に後方のすでに圧縮伸張処理
された音声波形を用いて作成された音声片、スペクトル
包絡パラメータにより推定される時間的前方予測音声波
形と時間的後方予測音声波形を用いて作成された音声
片、雑音成分により作成された音声片を少なくとも有す
ることで、入力音声を符号化する際、効率よく、しかも
高品質な符号化が可能となる。特に、スペクトル包絡パ
ラメータにより推定される予測音声波形により音声片を
作成する場合、本発明では、スペクトル包絡パラメータ
により推定される時間的前方予測音声波形に加えて、時
間的に後方の後方予測音声波形を用いているので、雑音
の低減が図れ、音声の品質を大幅に改善できる。The speech segment table holds at least a speech segment created from an already compressed and decompressed speech waveform temporally behind the speech segment to be processed, a speech segment created from the temporally forward and temporally backward predicted speech waveforms estimated from the spectral envelope parameter, and a speech segment created from a noise component. This allows the input speech to be encoded efficiently and with high quality. In particular, because the segment derived from the spectral envelope parameter uses the temporally backward predicted speech waveform in addition to the temporally forward predicted speech waveform, noise is reduced and speech quality is greatly improved.

【００７９】また、それぞれの音声片は、符号化された
データの伸張処理後あるいはスペクトル包絡パラメータ
の抽出後にその内容が更新されるようにしているので、
従来のように、固定的な内容のコードブックとは異な
り、処理対象音声片に対して、常に、最適な音声片が格
納されることになり、高品質な符号化が可能となる。Furthermore, the contents of each speech segment are updated after the encoded data is decompressed or after the spectral envelope parameter is extracted. Unlike a conventional codebook with fixed contents, the table therefore always holds the speech segments best suited to the speech segment being processed, which enables high-quality encoding.

【００８０】また、前記符号化されたデータは、類似部
分音声片を有する音声片番号、その音声片内のどの部分
であるかを表す位置データ、振幅調整用のパラメータで
表されるデータに、場合によっては、スペクトル包絡パ
ラメータをも加えたデータで表すことができ、大幅なデ
ータ圧縮が可能となる。The encoded data consists of the number of the speech segment containing the similar portion, position data indicating where in that speech segment the portion lies, and an amplitude adjustment parameter, with a spectral envelope parameter added in some cases. This allows substantial data compression.

【００８１】このように、本発明は、処理内容が単純で
しかも効率よく高品質な音声圧縮伸張が可能となり、ハ
ードウエア化や並列処理化を行う際にきわめて有利なも
のとすることができる。As described above, the present invention enables efficient, high-quality speech compression and decompression with simple processing, which is highly advantageous for hardware implementation and parallel processing.

[Brief description of the drawings]

【図１】本発明の実施の形態を説明するために入力音声
を所定の区間切り出した例を示す図。FIG. 1 is a diagram showing an example in which input speech is cut out in a predetermined section in order to explain an embodiment of the present invention.

【図２】本発明の実施の形態における音声片表の一例を
示す図。FIG. 2 is a diagram showing an example of a speech piece table according to the embodiment of the present invention.

【図３】本発明の実施の形態の処理手順を説明するフロ
ーチャート。FIG. 3 is a flowchart illustrating a processing procedure according to the embodiment of the invention.

【図４】本発明の実施の形態における残差成分を求める
処理を説明する図。FIG. 4 is a view for explaining processing for obtaining a residual component in the embodiment of the present invention.

【図５】本発明の実施の形態における伸張処理手順を説
明するフローチャート。FIG. 5 is a flowchart illustrating a decompression processing procedure according to the embodiment of the present invention.

【図６】本発明の実施の形態における音声片表内の音声
片を伸張処理後の音声波形より作成する例を説明する
図。FIG. 6 is an exemplary view for explaining an example of creating a speech segment in a speech segment table from a speech waveform after decompression processing according to the embodiment of the present invention;

【図７】本発明の実施の形態における音声片表内の音声
片をスペクトル包絡パラメータより推定される時間的前
方予測音声波形と時間的後方予測音声波形より作成する
例を説明する図。FIG. 7 is a diagram illustrating an example in which a speech segment in a speech segment table according to the embodiment of the present invention is created from a temporally forward predicted speech waveform estimated from a spectral envelope parameter and a temporally backward predicted speech waveform.

【図８】本発明の実施の形態における音声圧縮伸張装置
の構成を示すブロック図。FIG. 8 is a block diagram showing a configuration of a voice compression / decompression device according to the embodiment of the present invention.

[Explanation of symbols]

１ 音声入力部
２ 音声片切り出し部
３ 類似度判定部
４ 音声片表
５ 音声片選択部
６ 符号化部
７ 符号化データ出力部
８ スペクトル包絡パラメータ抽出部
９ 伸張部
１０ 残差生成部
１１ 音声片更新部
ｈ１ 処理対象音声片
Ａ１，Ａ２，Ａ３，Ａ４ 音声片表内に格納された音声片
ｐ１ 音声片における類似部分音声の位置

Reference Signs List
1 speech input unit
2 speech segment extraction unit
3 similarity determination unit
4 speech segment table
5 speech segment selection unit
6 encoding unit
7 encoded data output unit
8 spectral envelope parameter extraction unit
9 decompression unit
10 residual generation unit
11 speech segment updating unit
h1 speech segment to be processed
A1, A2, A3, A4 speech segments stored in the speech segment table
p1 position of the similar portion within a speech segment

Claims

[Claims]

1. A speech segment in a predetermined section is cut out from an input speech as a speech segment to be processed, a speech segment table including a plurality of types of speech segment groups is referred to, and each speech segment in the speech segment table and the speech to be processed are referred to. A process of comparing the similarity with the segment, selecting a speech segment with the highest similarity, and encoding the processing target speech segment based on data on the selected speech segment to generate encoded data. A voice compression / decompression method comprising:

2. After the encoded data is created, the encoded data is expanded, the expanded data is subtracted from the processing target speech piece to obtain a residual, and the residual waveform is Referring to a speech unit table including a plurality of types of speech unit groups, a process of comparing the similarity between each speech unit in the speech unit table and the residual waveform is performed one or more times to obtain encoded data. 2. The audio compression / decompression method according to claim 1, wherein:

3. The speech compression/decompression method according to claim 1, wherein the speech segments stored in the speech segment table include at least a speech segment created using an already compressed and decompressed speech waveform temporally behind the speech segment to be processed, a speech segment created using a temporally forward predicted speech waveform and a temporally backward predicted speech waveform estimated from a spectral envelope parameter, and a speech segment created from a noise component, and wherein the content of each speech segment is updated after the decompression process or after the extraction of the spectral envelope parameter.

4. Each of the voice segments has a section longer in time than the voice segment to be processed, and when determining the similarity with the voice segment to be processed, processing is performed within a range of the length of each voice segment. 4. The speech compression / expansion method according to claim 1, wherein a similarity to the target speech piece is determined, and a speech piece having a portion having the highest similarity is selected.

5. The coded data is a voice segment number having a portion having the highest similarity, position data indicating which portion in the voice segment, and data represented by an amplitude adjustment parameter. 5. The audio compression / decompression method according to claim 4, wherein the data is data to which a spectrum envelope parameter is added as necessary.

6. A speech segment extraction unit for extracting a speech segment of a predetermined section set in advance from an input speech as a speech segment to be processed, a spectrum envelope parameter extraction unit for extracting a spectrum envelope parameter from the input speech, A voice segment table storing the segments, and a similarity determination unit that refers to the voice segment table, compares the similarities of the respective voice segments in the voice segment table and the similarity of the processing target voice segment, and determines the similarity. A voice segment selecting unit for selecting a voice segment having the highest similarity based on the similarity determined by the similarity determining unit; and An encoding unit that encodes the fragment, and outputs the data encoded by the encoding unit as encoded data, and, in some cases, the data encoded by the encoding unit. And a coded data output unit that generates and outputs coded data obtained by adding the spectrum envelope parameter extracted by the spectrum envelope parameter extraction unit to the data, as a constituent element.

7. A decompression unit for decompressing the data encoded by the encoding unit, and a residual generation unit for subtracting the data decompressed by the decompression unit from the speech unit to be processed to obtain a residual. A speech unit updating unit that updates the contents of the speech unit stored in the speech unit table using the data expanded by the expansion unit or the spectrum envelope parameter extracted by the spectrum envelope parameter extraction unit. 7. The audio compression / decompression device according to claim 6, comprising:

8. The similarity determination unit, speech unit selection unit, encoding unit, decompression unit, and residual generation unit form a loop in a processing procedure, and perform similarity judgment, speech unit selection, encoding, decompression, For the residual waveform obtained by performing the residual generation processing, referring to the speech piece table, a process of comparing the similarity between each speech piece in the speech piece table and the residual waveform, 1 8. The audio compression / decompression device according to claim 7, wherein the encoded data is created and output after performing the operations more than once.

9. A speech segment stored in the speech segment table is a speech segment created using an already-compressed and decompressed speech waveform temporally behind the speech segment to be processed, and a spectral envelope parameter. Estimated temporal forward predicted audio waveform and temporal backward predicted audio waveform, a speech segment created using the speech component, at least a speech segment created by a noise component, each speech segment by the speech update processing unit 7. The content is updated after decompression processing or after extraction of a spectrum envelope parameter.
9. The audio compression / decompression device according to any one of items 1 to 8.

10. Each of the speech segments has a longer section than the speech segment to be processed, and when determining the degree of similarity with the speech segment to be processed, the speech segment is processed within the length of each speech segment. The speech compression / expansion apparatus according to any one of claims 6 to 9, wherein a similarity to the speech piece is determined, and a speech piece having a portion having the highest similarity is selected.

11. The coded data is a voice segment number having the highest similarity portion, position data indicating which portion is in the voice segment, and data represented by an amplitude adjustment parameter. 11. The audio compression / decompression device according to claim 10, wherein the data is data to which a spectral envelope parameter is added as necessary.

12. A storage medium storing a speech compression/decompression processing program, wherein the program cuts out a speech segment of a predetermined section from input speech as a speech segment to be processed, refers to a speech segment table containing a plurality of types of speech segment groups, compares the similarity between each speech segment in the speech segment table and the speech segment to be processed, selects the speech segment with the highest similarity, encodes the speech segment to be processed based on data on the selected speech segment, creates encoded data to which a spectral envelope parameter is added as necessary, and updates the content of each speech segment stored in the speech segment table after decompression of the encoded data or after extraction of the spectral envelope parameter.