CN1632861A

CN1632861A - A Low Bit Rate Speech Coder

Info

Publication number: CN1632861A
Application number: CNA2004101032190A
Authority: CN
Inventors: 董恩清 (Dong Enqing)
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2004-12-31
Filing date: 2004-12-31
Publication date: 2005-06-29
Anticipated expiration: 2024-12-31
Also published as: CN1327408C

Abstract

This invention discloses a speech coder that uses a flexibly adjustable shaping function to reshape the bell function used by Donoho, obtaining a new bell function that improves the concentration of spectral energy. The local cosine transform coefficients are coded by split vector quantization, the codebook for each sub-vector is designed with the LBG method, and codebook search uses a tree-structured search method.

Description

A Low Bit Rate Speech Coder

Technical field

The present invention relates to a speech coder, in particular to a low bit rate speech coder based on the local cosine transform (LCT), suitable for communication systems that require low bit rate speech coding.

Background art

Low bit rate speech coding has been a major research topic over the past 20 years, leading to the standardization of many speech coding algorithms at bit rates from 16 kb/s down to 2.4 kb/s. Current speech coder research focuses on high-quality coding at 4 kb/s and below. CELP waveform coders can still produce high-quality speech at bit rates below 6.3 kb/s. At 4 kb/s and lower, however, there are not enough bits to encode waveform details, and waveform coding systems produce a large amount of quantization noise. Parametric coders (vocoders), on the other hand, do not attempt to reproduce a waveform similar to the original signal. Instead they seek a set of parameters that represents the perceptually important properties of speech, but they are not robust to various special environmental noises.

However, for speech coding at 4 kb/s and lower, recent studies show that coding in the frequency domain can potentially give better speech quality than existing CELP-based coders. Spectral coders attempt to reconstruct the speech amplitude spectrum rather than recover the speech waveform exactly. These coders are widely used for low bit rate speech coding, but most are limited by the accuracy of their assumed models and depend heavily on correct parameter estimation, which is often hard to guarantee. In special environments these coding methods are therefore not robust, and the quality of the coded speech is limited.

The local cosine bases constructed by Coifman and Meyer (1991) and by Auscher et al. (1992) consist of products of smooth, compactly supported bell functions and cosine functions. These localized cosine functions remain orthogonal and have small Heisenberg products. In recent years the local cosine transform has been studied extensively. It is widely used in image compression, but comparatively little work has applied it to speech signal processing, and even less to speech coding. Nevertheless, Malvar H.S., "Lapped transforms for efficient transform/subband coding", IEEE Trans. on Acoustics, Speech and Signal Processing, 1990, vol. 38(6), pp. 969-978, showed that in speech coding the coding gain of the LCT method is higher than that of DCT coding and very close to that of KL transform coding.
In particular, compared with DCT coding, the LCT markedly reduces the "clicks" between frames. It therefore does not need the half-frame sliding that DCT transform coding often uses to suppress such inter-frame clicks. Because half-frame sliding roughly doubles the number of transforms, the LCT method needs nearly half the computation of the DCT method. Wickerhauser M.V. compared several two-dimensional image coding methods in "Comparison of picture compression methods: wavelet, wavelet packet and local cosine", in Wavelets: Theory, Algorithms, and Applications, eds. Charles K. Chui, Laura Montefusco and Luigia Puccio, Academic Press, San Diego, California, 1994, pp. 585-621. That study likewise showed that the LCT outperforms the DCT in coding gain and comes very close to the KL transform. Research shows that the key to improving the coding gain of transform coding is the choice of orthogonal basis. Likewise, the key in local cosine transform coding is the choice of the local cosine orthogonal basis, and the main factor determining that basis is the choice of bell function. The few existing studies that apply the LCT to speech coding stop at simple coding-gain comparisons and have not designed a workable speech coder.
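A local cosine basis is a smooth bell multiplied by a cosine, and it stays orthogonal across overlapping intervals. This property can be checked numerically. The sketch below is an illustration under stated assumptions, not the coder's own transform. It uses the sine bell and the DCT-IV phase of Malvar's modulated lapped transform for an interval of L samples with a 2L-sample window. The helper names `lct_atom` and `inner` are hypothetical. The sketch verifies two things: atoms on the same interval are orthonormal, and atoms on neighbouring intervals are orthogonal even though they overlap.

```python
import math


def lct_atom(k, L):
    """One discrete local cosine atom for an interval of L samples.

    Sketch assumptions: 50% overlap (window of 2L samples), the sine bell
    sin(pi*(n + 0.5)/(2L)) and the DCT-IV phase used by Malvar's MLT.
    """
    return [
        math.sin(math.pi * (n + 0.5) / (2 * L))                    # smooth bell
        * math.sqrt(2.0 / L)                                       # unit norm
        * math.cos(math.pi / L * (n + 0.5 + L / 2) * (k + 0.5))    # cosine
        for n in range(2 * L)
    ]


def inner(a, b, shift=0):
    """Inner product of a with b delayed by `shift` samples (both start at 0)."""
    return sum(a[n] * b[n - shift] for n in range(shift, len(a)))


if __name__ == "__main__":
    L = 8
    atoms = [lct_atom(k, L) for k in range(L)]
    worst_same = max(abs(inner(a, b) - (i == j))
                     for i, a in enumerate(atoms) for j, b in enumerate(atoms))
    worst_next = max(abs(inner(a, b, shift=L)) for a in atoms for b in atoms)
    print(f"same interval: {worst_same:.1e}, neighbouring interval: {worst_next:.1e}")
```

With L = 160 (one 20 ms frame at 8 kHz), the window spans 320 = 4m samples with m = 80. These are the values used later in this description.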

Summary of the invention

The object of the present invention is to exploit the high coding gain of the local cosine transform to provide a practical, high-performance low bit rate speech coder operating in the local cosine transform domain.

The technical solution that achieves the object of the invention is a low bit rate speech coder based on the local cosine transform. The original speech signal input to the coder is first processed by a high-pass filtering preprocessor and then by a local cosine transform (LCT). The coder is characterized in that the bell function $b_{new}(n)$ in the LCT satisfies the following conditions:

$ξ_{[n]}(t)$ is the shaping function used, which satisfies $ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)]$ and $ξ_{[0]}(t) \overset{def}{=} ξ(t)$, where:

The subscript n is the number of iterations of the shaping function. The bell function is defined over a width of 4m samples (from 1 to 4m).

The bell function $b_{new}(n)$ is guaranteed to form a local cosine orthogonal basis when multiplied by the cosine function.

The number of iterations n of the shaping function is 8 to 10.

After the LCT, the coefficients of each frame are first split, from low to high frequency, into sub-vectors of 40, 40, 40 and 20 dimensions. These sub-vectors are then quantized with four different split vector quantization codebooks. The bits allocated to the first through fourth sub-vectors are 12, 12, 8 and 8, respectively, and the gain of each frame is quantized with an 8-bit scalar quantizer. The output bits are ordered from the first sub-vector bits through the fourth sub-vector bits, followed by the gain bits. They total 48 bits, so each frame's output bitstream is represented by 6 bytes.
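A small packer makes this frame layout concrete. It is a sketch under assumptions. The source fixes the field order and widths but not the bit order, so most-significant-bit-first packing into big-endian bytes is assumed, and `pack_frame`/`unpack_frame` are hypothetical names. The sketch also shows the bit-rate arithmetic: 48 bits every 20 ms is 2400 b/s, the 2.4 kb/s listed for FBR-LCT in Table 2.

```python
FIELD_BITS = (12, 12, 8, 8, 8)   # sub-vectors 1-4 (low to high), then gain
FRAME_BITS = sum(FIELD_BITS)     # 48 bits
FRAME_BYTES = FRAME_BITS // 8    # 6 bytes
FRAME_MS = 20                    # frame length in milliseconds


def pack_frame(indices):
    """Pack five codebook/gain indices MSB-first into 6 bytes (assumed bit order)."""
    if len(indices) != len(FIELD_BITS):
        raise ValueError("expected %d indices" % len(FIELD_BITS))
    word = 0
    for idx, bits in zip(indices, FIELD_BITS):
        if not 0 <= idx < (1 << bits):
            raise ValueError("index %d does not fit in %d bits" % (idx, bits))
        word = (word << bits) | idx
    return word.to_bytes(FRAME_BYTES, "big")


def unpack_frame(data):
    """Inverse of pack_frame: recover the five indices from 6 bytes."""
    word = int.from_bytes(data, "big")
    fields = []
    for bits in reversed(FIELD_BITS):   # the gain field sits in the lowest bits
        fields.append(word & ((1 << bits) - 1))
        word >>= bits
    return tuple(reversed(fields))


bit_rate = FRAME_BITS * 1000 // FRAME_MS   # bits per second: 2400
```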

The speech coder also has a matching speech decoder.

The invention applies a flexibly adjustable shaping function to reshape the bell function used by Donoho, which yields a new bell function with better spectral energy concentration. The LCT coefficients are coded by split vector quantization, with the codebook for each sub-vector designed by the LBG method. Codebook search during coding uses a tree-structured search. The result is a speech coder that performs well at low bit rates in the local cosine transform domain. Objective measurements and informal listening tests show that the coder has better naturalness and intelligibility than the LPC-10e coder, and it is suitable for speech coding in a variety of environments.

Brief description of the drawings

Fig. 1 shows the shaping function in the speech coder of the embodiment of the invention for several numbers of recursions;

Fig. 2 shows the percentage increase in low-half-band energy obtained with the shaped bell function of the speech coder of the embodiment as the number of recursions increases (English and Chinese speech);

Fig. 3 is a schematic structural diagram of the speech coder of the embodiment of the invention;

Fig. 4 is a schematic structural diagram of the speech decoder of the embodiment of the invention.

Detailed description of the embodiments

The technical solution of the invention is further described below with reference to the drawings and embodiments.

Figs. 3 and 4 are schematic structural diagrams of the low bit rate coder and decoder of this embodiment, respectively.

The key techniques of this embodiment are as follows:

I. Obtaining the optimal shaped bell function

In Fig. 3, the original speech signal input to the coder is high-pass filtered as preprocessing and then transformed by the LCT. In the LCT, the invention uses the following shaped bell function:

The shaped bell function above is obtained by the following steps:

1. Donoho's bell function:

Wickerhauser M.V. describes the local cosine transform algorithm in his 1994 monograph. The bell function given there is fixed once $I_j$ and r are given.

The construction of the bell function used by Donoho is outlined below. Let $I_j = 2m$ and $r = m$, so that the bell window has width 4m. Let

$t(n) = n - 0.5, \quad 1 \le n \le m \qquad (1)$

$x(n) = (1 + t(n)/m)/2 \qquad (2)$
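Equation (3), the bell window itself, is not legible in this copy. The sketch below evaluates (1) and (2) and then builds a bell under a stated assumption: a sine-type bell whose interior half is sin(πx(n)/2) and whose exterior half is cos(πx(n)/2). Under that assumption the 4m-sample bell equals sin(π(n+0.5)/(4m)), the sine window of Malvar's MLT. It also satisfies the power-complementarity condition b²(n) + b²(n+2m) = 1, on which local cosine orthogonality relies. `donoho_x` and `sine_bell` are hypothetical names.

```python
import math


def donoho_x(m):
    """x(n) = (1 + t(n)/m)/2 with t(n) = n - 0.5, n = 1..m (equations (1)-(2))."""
    return [(1 + (n - 0.5) / m) / 2 for n in range(1, m + 1)]


def sine_bell(m):
    """ASSUMED sine-type bell of width 4m built from x(n).

    Interior half: sin(pi*x(n)/2); exterior half: cos(pi*x(n)/2).
    """
    x = donoho_x(m)
    interior = [math.sin(math.pi * v / 2) for v in x]
    exterior = [math.cos(math.pi * v / 2) for v in x]
    rising = exterior[::-1] + interior   # 2m samples around the left edge
    return rising + rising[::-1]         # symmetric bell, 4m samples in total


m = 80                                   # the value used for Fig. 1
bell = sine_bell(m)
```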

The bell window function used by Donoho is then:

2. Construction of the shaping function:

Let the input real-valued sequence be $t(n) = [2(n-1) - m + 0.5]/(2m), \quad 1 \le n \le m \qquad (4)$

Define a real-valued continuous function:

Repeatedly replacing t with sin(πt/2) in the above formula yields, for any fixed integer d however large, a d-times continuously differentiable function ($ξ \in C^d$). Define the following recursive function:

$ξ_{[0]}(t) \overset{def}{=} ξ(t)$

$ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)]$

The subscript of ξ denotes the number of recursions. The recursion shows that the derivatives of $ξ_{[n]}(t)$ up to order $2^n - 1$ vanish at $t = +1$ and $t = -1$, which means $ξ_{[n]} \in C^{2^n - 1}$. Fig. 1 shows several recursion results of this shaping function for m = 80.
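The recursion can be unrolled. Since $ξ_{[n+1]}(t) = ξ_{[n]}(\sin(πt/2))$, $ξ_{[n]}$ is the base function applied after n iterations of s(t) = sin(πt/2); the sketch assumes this reading. The base function (5) is not legible in this copy, so `hypothetical_base` is a stand-in chosen only for illustration. `t_grid` evaluates equation (4), and all names are hypothetical. The test lines check the recursion identity and show the curve flattening near t = -1 as n grows, which is the smoothing effect described above.

```python
import math


def t_grid(m):
    """Input sequence t(n) = [2(n-1) - m + 0.5]/(2m), n = 1..m (equation (4))."""
    return [(2 * (n - 1) - m + 0.5) / (2 * m) for n in range(1, m + 1)]


def sine_map(t, times):
    """Apply s(t) = sin(pi*t/2) `times` times; s fixes -1, 0 and +1."""
    for _ in range(times):
        t = math.sin(math.pi * t / 2)
    return t


def hypothetical_base(t):
    """HYPOTHETICAL base xi(t); the source's formula (5) is not legible here."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return math.sin(math.pi / 4 * (1 + t))


def shaping(n, base=hypothetical_base):
    """xi_[n](t): the recursion xi_[k+1](t) = xi_[k](sin(pi t/2)) unrolled n times."""
    return lambda t: base(sine_map(t, n))


xi9 = shaping(9)                         # the embodiment selects n = 9
curve = [xi9(t) for t in t_grid(80)]     # one curve of a Fig. 1-style family
```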

3. Computing the shaped bell function:

Varying the number of recursions yields a family of shaping functions. Shaping the bell function used by Donoho (equation (3) above) with the shaping function $ξ_{[n]}(t)$ obtained after n recursions gives the following new bell function:

The bell function above is guaranteed to form a local cosine orthogonal basis when multiplied by the cosine function.

In practice, the best orthogonal basis must be found for a fixed window width, which calls for a bell function that can be adjusted to the needs of the problem. In this embodiment, the aim is to decorrelate the speech signal and remove redundancy. This concentrates the spectral energy of a fixed-length speech frame in a few frequency bands, which facilitates sub-band coding. To this end, the embodiment provides a shaping function that can flexibly reshape the bell function used by Donoho. The shaping function best suited to frequency-domain speech coding is selected from this family, which yields the optimal bell function.

4. Determining the optimal bell function:

This embodiment addresses a practical problem of transform-domain speech coding: how many recursions the shaping function should have so that, applied to Donoho's bell function, it yields the most suitable bell function. Here the speech frames are 20 ms long at a sampling rate of 8 kHz, and the band is divided into a low half and a high half (0-2 kHz and 2-4 kHz). The bell function is shaped so that as much spectral energy as possible falls in the information-rich low half-band. This makes it easier for the subsequent coding stage to optimize the bit allocation between the low and high half-band coefficients.

Fig. 2 was measured on English and Chinese speech. As a function of the number of recursions, it shows how much the low half-band's share of total spectral energy increases when the LCT uses the shaped bell function instead of Donoho's bell function. The increase is largest at 9 recursions, so this embodiment uses the shaping function with 9 recursions. The increase is small, but it shows that choosing an appropriate bell function changes the degree of spectral energy concentration, which helps optimize bit allocation during coding.
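The selection criterion behind Fig. 2 can be written as a small routine. Assumptions:

- Energy is pooled over all frames (the text does not say whether frames are pooled or averaged).
- The lower half of each coefficient vector is taken as the low half-band.
- The caller supplies one set of LCT coefficient frames per candidate n.

`low_half_percent` and `best_recursion` are hypothetical names.

```python
def low_half_percent(frames):
    """Share (%) of total coefficient energy lying in the lower half of each frame.

    Energy is pooled over all frames (an assumption). With 160 coefficients
    per 20 ms frame at 8 kHz, the lower half corresponds to 0-2 kHz.
    """
    low = total = 0.0
    for coeffs in frames:
        half = len(coeffs) // 2
        low += sum(c * c for c in coeffs[:half])
        total += sum(c * c for c in coeffs)
    return 100.0 * low / total if total else 0.0


def best_recursion(frames_by_n, frames_donoho):
    """Pick the n whose shaped bell most increases the low-half energy share."""
    baseline = low_half_percent(frames_donoho)
    gains = {n: low_half_percent(f) - baseline for n, f in frames_by_n.items()}
    return max(gains, key=gains.get), gains
```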

II. Split vector quantization

Roughly speaking, the first four formants of adult speech lie near 500 Hz, 1500 Hz, 2500 Hz and 3500 Hz. These formants divide the speech spectrum into four important regions, and each region should be treated differently during coding. Transform-domain parameters are mostly coded with split vector quantization (SVQ). The coder of this embodiment therefore applies split vector quantization to the LCT coefficients and trains a separate codebook for each sub-vector. The codebooks are generated with the LBG algorithm, and a tree-structured codebook search is then used to speed up searching during encoding and decoding.
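The codebook design and search described above can be sketched as a tree-structured VQ. This is one plausible reading, not the patent's documented procedure. Each tree node is split in two with an LBG step (perturb the centroid, then refine with 2-means), and a search descends the tree by comparing the two children at each level. With d levels a search costs 2d distance computations: 24 for a 12-bit codebook instead of 4096 for an exhaustive search. The price is that the search occasionally misses the globally nearest codeword. All names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg_split(vectors, eps=1e-3, iters=20):
    """One LBG step: perturb the centroid into two codewords, refine by 2-means."""
    c = centroid(vectors)
    codewords = [[v + eps for v in c], [v - eps for v in c]]
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for v in vectors:
            nearer = 0 if sq_dist(v, codewords[0]) <= sq_dist(v, codewords[1]) else 1
            parts[nearer].append(v)
        # keep the old codeword if a cell ends up empty
        codewords = [centroid(p) if p else codewords[i] for i, p in enumerate(parts)]
    return codewords, parts


def build_tree(vectors, depth, codeword=None):
    """Tree-structured VQ: split each node in two, `depth` levels deep (2**depth leaves)."""
    if codeword is None:
        codeword = centroid(vectors)
    if depth == 0:
        return {"codeword": codeword, "children": None}
    if len(vectors) >= 2:
        codewords, parts = lbg_split(vectors)
    else:                                    # too few vectors: duplicate this codeword
        codewords, parts = [codeword, codeword], (vectors, [])
    children = [build_tree(parts[i], depth - 1, codewords[i]) for i in range(2)]
    return {"codeword": codeword, "children": children}


def tree_search(tree, x):
    """Descend choosing the nearer child; the path bits form the codebook index."""
    index, node = 0, tree
    while node["children"] is not None:
        left, right = node["children"]
        bit = 0 if sq_dist(x, left["codeword"]) <= sq_dist(x, right["codeword"]) else 1
        index = (index << 1) | bit
        node = node["children"][bit]
    return index, node["codeword"]


random.seed(0)
train = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
tree = build_tree(train, depth=3)            # 8 leaves; a 12-bit codebook needs depth=12
index, codeword = tree_search(tree, train[0])
```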

In the split, the numbers of transform coefficients in the sub-vectors, from low to high frequency, are 40, 40, 40 and 20. These are called the first, second, third and fourth sub-vectors. For speech sampled at 8 kHz, retaining only the spectral components below 3500 Hz is sufficient to recover speech of satisfactory quality. To reduce computational complexity, the fourth sub-vector therefore uses only 20 coefficients. When the decoder synthesizes speech by the inverse transform, the remaining 20 highest-frequency coefficients are set to zero.
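The split and the decoder-side zero fill can be sketched as follows. The sketch assumes one LCT coefficient per sample, so a 20 ms frame at 8 kHz gives 160 coefficients uniformly spanning 0-4 kHz (25 Hz each). Under that assumption the 140 coded coefficients end at 3500 Hz, matching the text. Names are hypothetical.

```python
FRAME_COEFFS = 160                  # 20 ms x 8 kHz (assumed: one coefficient per sample)
SPLIT = (40, 40, 40, 20)            # first to fourth sub-vector, low to high frequency
BIN_HZ = 4000.0 / FRAME_COEFFS      # 25 Hz per coefficient over 0-4 kHz


def split_frame(coeffs):
    """Cut one frame of LCT coefficients into the four sub-vectors.

    Coefficients 140-159 (3500-4000 Hz) are not coded.
    """
    if len(coeffs) != FRAME_COEFFS:
        raise ValueError("expected %d coefficients" % FRAME_COEFFS)
    subvectors, start = [], 0
    for dim in SPLIT:
        subvectors.append(coeffs[start:start + dim])
        start += dim
    return subvectors


def merge_zero_fill(subvectors):
    """Decoder side: concatenate sub-vectors and zero the uncoded top coefficients."""
    flat = [c for sv in subvectors for c in sv]
    return flat + [0.0] * (FRAME_COEFFS - len(flat))


coded = sum(SPLIT)                  # 140 coefficients
cutoff_hz = coded * BIN_HZ          # 3500.0 Hz
```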

In this embodiment, the bits allocated to the sub-vectors from low to high frequency are 12, 12, 8 and 8. The coder gain is the ratio of the spectral energy of the input signal to the sum of the spectral energies of the four code vectors selected during the search. It is quantized with an 8-bit scalar quantizer. Table 1 lists the total bit allocation per frame of the coder in this embodiment.
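The gain described above is a ratio of energies. The sketch computes it and pairs it with an 8-bit scalar quantizer. The source gives only the bit count, so the quantizer here (256 levels uniform in dB over an assumed -60 to +60 dB range) is hypothetical, as are the function names.

```python
import math


def frame_gain(input_coeffs, code_vectors):
    """Gain = input spectral energy / summed energy of the selected code vectors."""
    e_in = sum(c * c for c in input_coeffs)
    e_code = sum(c * c for cv in code_vectors for c in cv)
    return e_in / e_code if e_code > 0 else 0.0


# HYPOTHETICAL 8-bit quantizer: 256 levels uniform in dB over an assumed range.
G_MIN_DB, G_MAX_DB, LEVELS = -60.0, 60.0, 256
STEP_DB = (G_MAX_DB - G_MIN_DB) / (LEVELS - 1)


def quantize_gain(g):
    """Map a gain to an index 0..255 (clamped to the assumed dB range)."""
    db = 10 * math.log10(g) if g > 0 else G_MIN_DB
    db = min(max(db, G_MIN_DB), G_MAX_DB)
    return int(round((db - G_MIN_DB) / STEP_DB))


def dequantize_gain(index):
    """Map an index back to a gain value on the same dB grid."""
    return 10 ** ((G_MIN_DB + index * STEP_DB) / 10)
```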

The coder input is speech sampled at 8 kHz in 16-bit PCM format. This embodiment uses speech data in WAV format, so the amplitudes are normalized. The system places no special requirements on the type of speech and is suitable for coding speech in any language.
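Reading the input described above can be sketched with Python's standard `wave` and `struct` modules. Assumptions:

- The input is mono.
- Normalization divides each 16-bit sample by 32768.
- Frames are 160 samples (20 ms), and any trailing partial frame is dropped.

`read_frames` and the test helper `write_test_wav` are hypothetical names. The high-pass preprocessing stage is not included because its parameters are not given.

```python
import io
import struct
import wave

FRAME_LEN = 160   # 20 ms at 8 kHz


def read_frames(fileobj, frame_len=FRAME_LEN):
    """Read 8 kHz, 16-bit mono PCM WAV and return normalized frames in [-1, 1).

    A trailing partial frame is dropped (an assumption; it could be zero-padded).
    """
    with wave.open(fileobj, "rb") as w:
        if (w.getframerate(), w.getsampwidth(), w.getnchannels()) != (8000, 2, 1):
            raise ValueError("expected 8 kHz, 16-bit, mono PCM")
        raw = w.readframes(w.getnframes())
    count = len(raw) // 2
    samples = [s / 32768.0 for s in struct.unpack("<%dh" % count, raw)]
    return [samples[i:i + frame_len] for i in range(0, count - frame_len + 1, frame_len)]


def write_test_wav(samples):
    """Build an in-memory 8 kHz, 16-bit mono WAV file from integer samples."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    buf.seek(0)
    return buf
```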

Evaluation of the coder of this embodiment:

1. Objective evaluation

The coder of this embodiment was compared against the standardized coders G.729 Annex B (G.729B), GSM Half-Rate, FS1016 and FS1015 (LPC-10e). The objective measures used are the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR):

$SNR = 10 \log_{10} \frac{σ_x^2}{σ_e^2}$

Here $σ_x^2$ is the mean square of the speech signal, and $σ_e^2$ is the mean square of the difference between the original and reconstructed speech signals.

$PSNR = 10 \log_{10} \frac{N X^2}{\|x - \tilde{x}\|^2}$

Here N is the length of the reconstructed signal, X is the maximum absolute value of the signal x of length N, and $\|x - \tilde{x}\|^2$ is the sum of squared differences between the original and reconstructed signals.
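The two measures translate directly into code. This minimal sketch uses the hypothetical names `snr_db` and `psnr_db`. It takes $σ_x^2$ and $σ_e^2$ as mean squares, as defined above, and returns infinity for an error-free reconstruction.

```python
import math


def snr_db(x, y):
    """SNR = 10 log10(sigma_x^2 / sigma_e^2), both taken as mean squares."""
    n = len(x)
    sig = sum(v * v for v in x) / n
    err = sum((a - b) ** 2 for a, b in zip(x, y)) / n
    return math.inf if err == 0 else 10 * math.log10(sig / err)


def psnr_db(x, y):
    """PSNR = 10 log10(N X^2 / ||x - y||^2), where X = max |x(n)|."""
    n = len(x)
    peak = max(abs(v) for v in x)
    err = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.inf if err == 0 else 10 * math.log10(n * peak ** 2 / err)
```

Because $X^2 \ge σ_x^2$ for any signal, PSNR is never below SNR. This is consistent with Table 2, where SNR values near 0 dB coexist with PSNR values of roughly 12 to 22 dB.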

It is well known that objective evaluation of coded speech sometimes gives puzzling results. Speech from a coder with a high SNR is not necessarily of higher quality than speech from another coder with a lower SNR, and the converse also holds. Objective measures therefore cannot serve as the main criterion for evaluating speech coder performance, only as an auxiliary one.

Table 2 compares the speech coder of this embodiment (FBR-LCT) with the G.729B, GSM Half-Rate, FS1016 and FS1015 coding standards. These results also illustrate how far objective measures can be relied on when evaluating speech coders. G.729B, GSM Half-Rate and FS1016 are medium-to-low bit rate standards whose coded speech quality far exceeds that of FS1015 and of the LCT method, yet on these two measures the LCT method compares quite favorably. Compared with the FS1015 coder at the same bit rate, the LCT method has higher SNR and PSNR in every test condition of Table 2, by up to about 3.15 dB (SNR, English speech).

The coding method of this embodiment operates in the transform domain and is essentially waveform coding, so the SNR and PSNR measures favor it. A few objective measures alone therefore cannot settle how good a coder is; they serve only as a reference.

2. Subjective evaluation:

The speech produced by a speech coder is ultimately heard by the human ear, so the quality of coded speech is judged mainly by listeners' auditory perception. Informal listening tests are generally used to evaluate speech quality.

For clean, noise-free speech, the speech reconstructed by the LCT coding method of this embodiment (FBR-LCT) is slightly blurred. It does not sound as sonorous as speech reconstructed by LPC-10e, and it is less clear than speech produced by the G.729B, GSM Half-Rate and FS1016 standards. Its intelligibility and naturalness are nevertheless good, and clearly better than those of LPC-10e at the same bit rate. The LCT coding method is also robust: its coding distortion is insensitive to changes in the signal, and it remains stable even on signals for which the G.729B, GSM Half-Rate, FS1016 and LPC-10e methods fail. With background music or other non-speech signals, FBR-LCT is clearly better than LPC-10e. This is because the LCT method is waveform coding in the transform domain and therefore does not depend on speech features such as pitch. G.729B, GSM Half-Rate, FS1016 and LPC-10e, by contrast, are based on a source-filter model of speech production and on estimated linear prediction parameters, and are particularly sensitive to the accuracy of that estimation. The LCT-based low bit rate coder of the invention can also be implemented as a software simulation.

Table 1

| | 1st sub-vector | 2nd sub-vector | 3rd sub-vector | 4th sub-vector | Gain | Frame total |
|---|---|---|---|---|---|---|
| Bits | 12 | 12 | 8 | 8 | 8 | 48 |

Table 2

| Encoder | English SNR (dB) | English PSNR (dB) | Chinese SNR (dB) | Chinese PSNR (dB) | Chinese + background music SNR (dB) | Chinese + background music PSNR (dB) | Bit rate (kb/s) |
|---|---|---|---|---|---|---|---|
| G.729 Annex B | -0.95 | 15.08 | -1.46 | 18.32 | -1.18 | 15.58 | 8 |
| GSM Half-Rate | -1.24 | 14.81 | -0.82 | 19.46 | -0.74 | 16.09 | 5.6 |
| FS1016 | 0.71 | 16.74 | 1.37 | 21.63 | 1.27 | 18.09 | 4.8 |
| FS1015 (LPC-10e) | -3.59 | 12.47 | -2.65 | 17.64 | -1.80 | 15.02 | 2.4 |
| FBR-LCT | -0.44 | 15.08 | 0.26 | 20.54 | -1.07 | 15.75 | 2.4 |

Claims

1. A low bit rate speech coder based on the local cosine transform, in which the original speech signal input to the coder is processed by a high-pass filtering preprocessor and then by a local cosine transform, characterized in that the bell function $b_{new}(n)$ in said local cosine transform satisfies the following conditions:

$ξ_{[n]}(t)$ is the shaping function used, which satisfies the conditions

$ξ_{[n+1]}(t) \overset{def}{=} ξ_{[n]}[\sin(πt/2)]$

and

$ξ_{[0]}(t) \overset{def}{=} ξ(t),$

where:

The subscript n is the number of iterations of the shaping function, and the bell function is defined over a width of 4m samples (from 1 to 4m).

2. The low bit rate speech coder according to claim 1, characterized in that said bell function $b_{new}(n)$ is guaranteed to form a local cosine orthogonal basis when multiplied by a cosine function.

3. The low bit rate speech coder according to claim 1, characterized in that the number of iterations n of said shaping function is 8 to 10.

4. The low bit rate speech coder according to claim 1, characterized in that:
- after the local cosine transform, the coefficients of each frame are first split, from low to high frequency, into sub-vectors of 40, 40, 40 and 20 dimensions, which are then quantized with four different split vector quantization codebooks;
- the bits allocated to the first through fourth sub-vectors are 12, 12, 8 and 8, respectively;
- the gain of each frame is quantized with an 8-bit scalar quantizer; and
- the output bits, ordered from the first sub-vector bits through the fourth sub-vector bits followed by the gain bits, total 48 bits, so that each frame's output bitstream is represented by 6 bytes.

5. The low bit rate speech coder according to claim 1, characterized in that said speech coder also has a matching speech decoder.