CN1275223C - A low bit-rate speech coder

Info

Publication number: CN1275223C
Application number: CNB2004101032203A
Authority: CN
Inventors: 董恩清
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2004-12-31
Filing date: 2004-12-31
Publication date: 2006-09-13
Anticipated expiration: 2024-12-31
Also published as: CN1632862A

Abstract

The invention discloses a speech coder suited to communication systems that require low-bit variable-rate speech coding. It applies the support vector machine (SVM) method to voice activity detection (VAD), which improves the coder's rate of correct speech detection. It adopts the GSM speech mode classification and merges the original four speech modes into three, so that only two bits are needed to represent the speech mode. It also takes full advantage of the high coding gain of the local cosine transform (LCT), combining LCT and SVM-VAD to provide a practical, high-performance low-bit variable-rate speech coder.

Description

A Low Bit Variable Rate Speech Coder

Technical field

The present invention relates to a speech coder, and in particular to a speech coder suited to communication systems that require low-bit variable-rate speech coding.

Background art

Variable bit rate (VBR) coding has been developed in recent years. Its core idea is to encode the transient, steady and silent segments of speech at different rates, so that the average rate of VBR coding is much lower than that of fixed bit rate (FBR) coding at the same speech quality.

VBR shows its advantages best in applications that place no strict limit on the speech coding rate but require rate "elasticity", such as CDMA, VoIP and ATM. Wireless communication systems and IP technology are developing rapidly and will occupy an increasingly important position in global communications. For this reason, ITU-T SG16 is drawing up a new variable-rate coding standard for future packet voice networks (such as VoIP), IMT-2000 speech coding, and high-quality low-bit-rate speech compression. In these applications the user can trade off speech quality against coding rate (channel capacity), which gives a form of "soft" control.

A well-known example of variable bit rate coding is QCELP, the variable-bit-rate speech coder of the IS-95 standard drawn up by CTIA. To date, CELP-based methods have received comparatively more study among variable-bit-rate speech coding methods.

Well-known VAD methods include those of the QCELP speech coder in the IS-95 standard, EVRC in the IS-127 standard, the DTX mode of the GSM standard, and Annex B of G.729 (G.729B) proposed by ITU-T.

Over the past few years, support vector machines (SVM) have attracted strong interest. Experience shows that SVMs generally perform well in many applications, such as handwriting recognition, face recognition and text classification. However, their application to voice activity detection has rarely been reported.

Low-bit-rate speech coding has been a major research topic over the past 20 years, and many speech coding algorithms with bit rates from 16 kb/s down to 2.4 kb/s have been standardized as a result. Current research focuses on high-quality speech coding at 4 kb/s and below, and recent work suggests that coding speech in the frequency domain has the potential to give better quality than existing CELP-based coders. Spectral coders try to reconstruct the amplitude spectrum of speech rather than recover the waveform exactly. Although CELP-based and parametric coders are widely used for low-bit-rate speech coding, most of them are limited by the accuracy of the assumed model and depend on correct parameter estimation, and these requirements are often hard to guarantee. Their robustness is therefore poor in adverse environments, which limits the quality of the coded speech.

The local cosine bases constructed by Coifman and Meyer (1991) and by Auscher et al. (1992) are products of a smooth, compactly supported bell function and a cosine function. These localized cosine functions remain orthogonal and have a small Heisenberg product. The theory of the local cosine transform (LCT) has been studied widely and in depth in recent years, but comparatively little work has applied it to speech signal processing, and even less to speech coding. However, an article published by H. S. Malvar in 1990 showed that, in speech coding, the coding gain of the LCT is higher than that of DCT coding and very close to that of KL transform coding. In particular, compared with DCT coding, it clearly reduces the "click" sounds between frames.

There is strong practical demand for low-bit variable-rate speech coding, while earlier model-based coding methods are limited by the accuracy of the assumed model and of the estimated parameters, which often degrades coding performance and narrows the range of application of the coder.

Summary of the invention

The object of the present invention is to provide a practical, high-performance low-bit variable-rate speech coder that exploits the high coding gain of the local cosine transform.

To achieve this object, the technical solution adopted by the invention is a low-bit variable-rate speech coder based on the local cosine transform. The coder preprocesses the input speech signal with a high-pass filter and passes it to a voice activity detector, which separates active speech frames from inactive ones; each kind of frame is then processed by an LCT transformer to complete the coding, wherein:

The voice activity detector is an SVM-VAD module, which works as follows:

① Parameters are extracted from the input speech data to obtain four classification features of the current frame: line spectral frequencies, full-band energy, low-band energy and zero-crossing rate;

② Initialization: the same four features for the background-noise-only condition are computed and updated continually as the background noise changes;

③ Differencing: the four noise-only features from initialization are subtracted from the four features of the current frame, giving the four differential features needed for voice activity classification;

④ The SVM algorithm performs the voice activity detection; the support vector machine is trained with the sequential minimal optimization (SMO) method, and speech is finally divided into two types, active and inactive;

⑤ A four-step smoothing and correction algorithm smooths the VAD decision;

⑥ After VAD processing of each frame, an inactive or active speech frame signal is output; if the estimated background noise energy of the frame exceeds the background noise energy threshold, the average background noise parameters are corrected again;
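Steps ① to ③ can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the line spectral frequencies and the low-band energy need an LPC analysis and a band filter that the text does not specify, so only the full-band energy and the zero-crossing rate are computed. The function names and the smoothing factor `alpha` are hypothetical.

```python
import math

def frame_features(frame):
    """Return (full-band energy in dB, zero-crossing rate) for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    energy_db = 10.0 * math.log10(energy + 1e-12)  # small floor avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return energy_db, zcr

def update_noise(noise, feats, alpha=0.9):
    """Running average of the features over noise-only frames (alpha is assumed)."""
    return tuple(alpha * n + (1.0 - alpha) * f for n, f in zip(noise, feats))

def differential(feats, noise):
    """Current-frame features minus the background-noise averages (step 3)."""
    return tuple(f - n for f, n in zip(feats, noise))
```

The differential values are what the classifier in step ④ would receive; a loud frame over a quiet background gives a large positive energy difference.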

The LCT transformer processing is as follows:

① A frame that the SVM-VAD module detects as inactive is split according to the sub-vector dimensions of the silence/background-noise mode, and each sub-vector is vector-quantized with the corresponding codebook of that mode, giving two split vector quantization indices of 7 bits each. The gain of the frame is scalar-quantized at the same time. The 2 bits for the speech mode, the 8 bits for the gain, and the 7 bits each for the first and second sub-vectors are assembled in that order into 3 bytes and output, which completes the coding of the frame;

② A frame that the SVM-VAD module detects as active is classified into one of three speech modes: unvoiced (mode 0), lightly voiced (mode 1) or medium-strong voiced (mode 2). It is split according to the sub-vector dimensions of its mode, and each of the four sub-vectors is vector-quantized with the corresponding codebook of that mode, giving four indices whose bit lengths depend on the mode. The gain of the frame is scalar-quantized at the same time. The 2 bits for the speech mode, the 8 bits for the gain, and the bits of the first through fourth sub-vectors are assembled in that order into a whole number of bytes and output, which completes the coding of the frame.

In the silence/background-noise mode, the first and second sub-vectors both have dimension 40. In the unvoiced, lightly voiced and medium-strong voiced modes, the first, second and third sub-vectors have dimension 40 and the fourth has dimension 20.

In the silence/background-noise mode, the first and second sub-vectors are allocated 7 bits each, the third and fourth 0 bits, the gain 8 bits and the mode 2 bits. In the unvoiced mode, the first and second sub-vectors are allocated 7 bits each, the third and fourth 8 bits each, the gain 8 bits and the mode 2 bits. In the lightly voiced mode, the first and second sub-vectors are allocated 11 bits each, the third and fourth 8 bits each, the gain 8 bits and the mode 2 bits. In the medium-strong voiced mode, the first and second sub-vectors are allocated 8 bits each, the third and fourth 8 and 6 bits respectively, the gain 8 bits and the mode 2 bits.
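The field order and widths described above (mode, gain, then the quantization indices) can be sketched as a bit-packing routine. Two points are assumptions rather than statements of the patent: the 2-bit code 3 for the silence/background-noise mode, and most-significant-bit-first packing. The names `VQ_BITS` and `pack_frame` are hypothetical.

```python
# Bits per quantization index, first to fourth group, taken from the text above.
VQ_BITS = {
    0: (7, 7, 8, 8),    # unvoiced
    1: (11, 11, 8, 8),  # lightly voiced
    2: (8, 8, 8, 6),    # medium-strong voiced
    3: (7, 7, 0, 0),    # silence/background noise (mode code 3 is assumed)
}

def pack_frame(mode, gain_index, vq_indices):
    """Pack mode (2 bits), gain (8 bits) and VQ indices, MSB first, into bytes."""
    fields = [(mode, 2), (gain_index, 8)]
    fields += [(i, w) for i, w in zip(vq_indices, VQ_BITS[mode]) if w > 0]
    value, nbits = 0, 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit in its bit width")
        value = (value << width) | field
        nbits += width
    assert nbits % 8 == 0  # every mode totals a whole number of bytes
    return value.to_bytes(nbits // 8, "big")
```

With these widths a silence frame packs into 3 bytes, an unvoiced or medium-strong voiced frame into 5 bytes, and a lightly voiced frame into 6 bytes.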

Because the invention makes full use of the SVM method by applying it to VAD, it improves the coder's rate of correct speech detection. It adopts the GSM speech mode classification and merges the original four speech modes into three, so that only two bits are needed to represent the speech mode.

Description of drawings

Fig. 1 is the operating flow chart of the SVM-VAD voice activity detection module provided by the embodiment of the present invention.

Fig. 2 is a schematic diagram of the structure of the VBR-LCT speech coder provided by the embodiment of the present invention.

Detailed description

The invention is further described below with reference to the drawings and an embodiment:

Embodiment:

1. Classification of active speech modes

The criteria for speech mode selection in the GSM system are as follows:

Mode = 0: P_v < 1.7 (unvoiced).

Mode = 1: P_v ≥ 1.7 and P_m < 3.5 for all m (lightly voiced).

Mode = 2: 3.5 ≤ P_m < 7.0 for all m (medium voiced).

Mode = 3: P_m > 7.0 for all m (strongly voiced).

Here m = 1, 2, 3, 4 indexes the subframes of a frame, P_m is the open-loop LTP prediction gain (dB) of the m-th subframe, and P_v is the open-loop LTP prediction gain (dB) of the whole frame.

Strongly voiced and medium voiced speech both have strong periodicity and high energy. According to the speech production model, the formants of these two modes are strong, and representing them well helps produce clear voiced speech. For frequency-domain coding, the spectral components of strongly voiced and medium voiced speech differ little, so this embodiment merges the two into a single mode called the medium-strong voiced mode. A second reason for the merge is that the silent frame type detected by the VAD, together with the three remaining speech modes, gives four frame types, so 2 bits suffice to signal the coding mode. This embodiment therefore has only three modes for active speech, mode 0, mode 1 and mode 2, which stand for the unvoiced, lightly voiced and medium-strong voiced modes respectively.
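The merged three-mode decision can be sketched as a small function. The thresholds 1.7 dB and 3.5 dB come from the GSM criteria above. The criteria as written do not say what happens to a frame whose subframe gains fall on both sides of 3.5 dB, so sending every remaining frame to mode 2 is an assumption of this sketch, and `classify_mode` is a hypothetical name.

```python
def classify_mode(p_v, p_m):
    """Map open-loop LTP prediction gains (dB) to the three active-speech modes.

    p_v: gain over the whole frame; p_m: gains of the four subframes.
    """
    if p_v < 1.7:
        return 0  # unvoiced
    if all(p < 3.5 for p in p_m):
        return 1  # lightly voiced
    return 2      # medium-strong voiced (GSM modes 2 and 3 merged)
```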

2. Split vector quantization

Roughly speaking, the first four formants of adult speech lie near 500 Hz, 1500 Hz, 2500 Hz and 3500 Hz. This in effect divides the speech signal into four important regions whose spectra should be treated differently during coding. The embodiment therefore quantizes the local cosine transform coefficients as separate sub-vectors (split vector quantization). The codebook for each sub-vector is trained with the vector quantization method proposed by Linde, Buzo and Gray in 1980 (the LBG algorithm). Once the codebooks have been generated, a tree-structured codebook search is used to speed up codebook search during encoding and decoding.
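As a rough illustration of LBG codebook training, the sketch below grows a codebook by repeated splitting and refines it with nearest-neighbor and centroid steps. It is a plain-Python sketch under assumptions: the perturbation `eps`, the fixed iteration count and the squared Euclidean distortion are illustrative choices, and `nearest` is an exhaustive search rather than the tree-structured search mentioned in the text.

```python
def nearest(codebook, v):
    """Index of the code vector with the smallest squared distance to v."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def lbg(training, bits, eps=0.01, iters=20):
    """Train a codebook of 2**bits vectors by splitting and refinement."""
    codebook = [centroid(training)]
    while len(codebook) < (1 << bits):
        # Split every code vector into a slightly perturbed pair.
        codebook = [[c * (1 + s * eps) + s * eps for c in cv]
                    for cv in codebook for s in (1, -1)]
        for _ in range(iters):
            cells = [[] for _ in codebook]
            for v in training:
                cells[nearest(codebook, v)].append(v)
            # Keep the old code vector when a cell receives no training data.
            codebook = [centroid(cell) if cell else cv
                        for cell, cv in zip(cells, codebook)]
    return codebook
```

In the coder, one such codebook would be trained per coefficient group and per mode, with `bits` set to that group's allocation.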

In the split quantization used in this embodiment, the local cosine transform coefficients of an active speech frame, in every mode, are divided from low to high frequency into groups of 40, 40, 40 and 20. For a silent or background-noise frame, only the two lowest groups of 40 coefficients each are used. The four groups are called the first, second, third and fourth sub-vectors. For speech sampled at 8 kHz, keeping only the spectral components below 3500 Hz is enough to recover speech of satisfactory quality. To reduce computational complexity, the fourth sub-vector of an active frame therefore uses only 20 coefficients, and a silent or background-noise frame does not use the coefficients of the upper half band. Table 1 lists the sub-vector dimensions for each mode. When the decoder applies the inverse transform to synthesize speech, the remaining 20 highest-frequency coefficients of an active frame, and the 80 upper-half-band coefficients of a silent (background-noise) frame, are set to 0.
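The grouping and the decoder-side zero filling can be sketched as follows, assuming 160 transform coefficients per frame (140 kept plus 20 zeroed for active frames, 80 kept plus 80 zeroed for silent frames, as the counts above imply). The names are hypothetical, and the quantization step between splitting and rebuilding is left out.

```python
SUBVECTOR_DIMS = {"active": (40, 40, 40, 20), "silence": (40, 40)}
FRAME_COEFFS = 160  # coefficients per frame implied by the counts in the text

def split_coeffs(coeffs, kind):
    """Cut the low-frequency coefficients into the groups to be quantized."""
    parts, start = [], 0
    for dim in SUBVECTOR_DIMS[kind]:
        parts.append(list(coeffs[start:start + dim]))
        start += dim
    return parts

def rebuild_coeffs(parts):
    """Decoder side: concatenate the groups and zero-fill the discarded top band."""
    flat = [c for part in parts for c in part]
    return flat + [0.0] * (FRAME_COEFFS - len(flat))
```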

3. Bit allocation

Different bit allocation strategies are used according to the characteristics of each type of active speech frame and of silent (background-noise) frames. Table 2 is the bit allocation table of the VBR-LCT coder provided by this embodiment.
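To relate the per-frame allocations to bit rates, the sketch below converts bits per frame into kb/s, assuming the 20 ms frame length chosen in this embodiment. The per-mode totals (24, 40, 48 and 40 bits) follow from the allocations given in the summary; the frame mix passed to `average_rate_kbps` is purely hypothetical.

```python
FRAME_SECONDS = 0.020  # 20 ms frames, as chosen in this embodiment
BITS_PER_FRAME = {
    "silence": 24,
    "unvoiced": 40,
    "lightly voiced": 48,
    "medium-strong voiced": 40,
}

def rate_kbps(bits):
    """Bit rate in kb/s for a stream that spends `bits` on every frame."""
    return bits / FRAME_SECONDS / 1000.0

def average_rate_kbps(frame_counts):
    """Average rate for a mix of frames, given as {mode name: frame count}."""
    total_bits = sum(BITS_PER_FRAME[m] * n for m, n in frame_counts.items())
    total_frames = sum(frame_counts.values())
    return rate_kbps(total_bits / total_frames)
```

The individual modes thus run at 1.2, 2.0, 2.4 and 2.0 kb/s. An invented mix of half silent and half 40-bit frames averages 1.6 kb/s, which shows how an average below 2 kb/s can arise when a conversation contains many pauses.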

Speech in the medium-strong voiced mode is strongly periodic, and its energy is concentrated mostly in the low and middle bands, so those bands need more bits. A moderate total number of bits represents this mode well.

Lightly voiced speech is to some extent a mixture of voiced and unvoiced speech in some proportion. Its periodicity is weaker than that of medium-strong voiced speech, but it contains the transient parts of speech. Transient frames make up only a small share of speech yet carry a large amount of information, so how well they are represented directly affects speech quality. For this reason, the embodiment allocates a higher number of bits to frames of this mode.

Speech in the unvoiced mode can be regarded as consisting entirely of unvoiced sound, so its local cosine transform spectrum can be taken as flat. The bands are allocated essentially the same number of bits, except that each of the two bands in the upper half of the spectrum receives one extra bit to strengthen the unvoiced character of the high-frequency part.

To obtain more natural speech, this embodiment does not simply fill silent or background-noise frames with zeros. Doing so would cause abrupt energy changes between voiced and silent frames, which is unpleasant to hear. Silent or noise frames are therefore also allocated some bits. Under strong background noise or in adverse conditions, if voiced speech is misjudged as silence, these few bits can still convey some of the speech information, which is an advantage specific to coding based on the local cosine transform.

For every mode, the gain of the frame is calculated as the ratio of the spectral energy of the input signal to the sum of the spectral energies of the code vectors found by the search during encoding. The gain is quantized with an 8-bit scalar quantizer. The total number of bits allocated to a frame is a whole number of bytes in every mode, so a bit error inside one frame during transmission does not disturb the decoding of subsequent frames, which gives the coder some resistance to bit errors and some error-correction capability.
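The gain computation and its 8-bit quantization might look like the sketch below. The energy ratio follows the description above. The patent does not describe the scalar quantizer, so the uniform quantizer on a dB scale and its range of -30 dB to 60 dB are assumptions, as are all function names.

```python
import math

def frame_gain(coeffs, code_vectors):
    """Input spectral energy divided by the summed energy of the chosen code vectors."""
    e_in = sum(c * c for c in coeffs)
    e_code = sum(c * c for cv in code_vectors for c in cv)
    return e_in / e_code if e_code > 0.0 else 0.0

def quantize_gain(gain, lo_db=-30.0, hi_db=60.0):
    """Uniform 8-bit scalar quantizer on the gain in dB (the range is assumed)."""
    g_db = 10.0 * math.log10(max(gain, 1e-12))
    g_db = min(max(g_db, lo_db), hi_db)  # clamp to the quantizer range
    return round((g_db - lo_db) / (hi_db - lo_db) * 255)

def dequantize_gain(index, lo_db=-30.0, hi_db=60.0):
    """Inverse mapping used by the decoder for the same assumed range."""
    g_db = lo_db + index / 255 * (hi_db - lo_db)
    return 10.0 ** (g_db / 10.0)
```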

4. SVM-VAD method

The role of VAD is to distinguish speech from silence, which is a well-known classification problem. As in any classification problem, the features used for classification must be chosen and a discriminant function must be designed. We use a set of parameters describing signal energy and spectral content that is customary in VAD applications. The choice is governed by each parameter's contribution to the classification result, its robustness and its computational cost. The parameters chosen here are four differential measures, obtained as the difference between the current-frame parameters and the running averages of the background-noise parameters: spectral distortion, full-band energy difference, low-band energy difference and zero-crossing rate difference.
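For orientation, the discriminant function of a trained SVM has the standard form f(x) = Σ α_i y_i K(x_i, x) + b, with the frame declared active when f(x) > 0. The sketch below evaluates it on a four-component feature vector. The RBF kernel, its `gamma`, and the support vectors used in any call are assumptions; the patent states only that training uses SMO.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decide(x, support_vectors, coeffs, bias, kernel=rbf_kernel):
    """Return True (active speech) when sum_i coeffs[i] * K(sv_i, x) + bias > 0.

    coeffs[i] is alpha_i * y_i as produced by training, for example with SMO.
    """
    score = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs))
    return score + bias > 0.0
```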

Both the VAD algorithm and the inactive-speech coder operate on frames of digitized speech. For compatibility, all methods use the same frame length. Fig. 1 shows the general flow of the VAD algorithm for each frame. The decision made by the SVM is local, that is, it does not take into account the short-term stationarity of speech and noise. The preceding neighboring frames are therefore used in a four-step smoothing and correction algorithm. To cope with sudden changes in noise level, a special reset algorithm, which uses a minimum energy estimate over a long period, prevents the algorithm from locking into the active (speech) state.

Fig. 2 is a schematic diagram of the structure of the VBR-LCT speech coder provided by this embodiment. The preprocessing module in Fig. 2 is a high-pass filter that reduces low-frequency noise and the DC component. The coder's input is a speech signal in 16-bit PCM format sampled at 8 kHz. This embodiment uses speech data in wav format, so the amplitude is normalized.
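The patent does not give the high-pass filter itself. One common choice for removing the DC component is a first-order DC-blocking filter, shown here purely as an assumed stand-in; the pole radius `r` is also an assumption.

```python
def high_pass(samples, r=0.95):
    """First-order DC-blocking filter: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

A constant input decays toward zero at the output, while rapidly alternating samples pass through almost unchanged.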

Transform analysis of a signal usually uses short-time processing, and the length chosen for the short-time segment strongly affects the result. Transform coding of speech likewise involves choosing the length of the analysis window. Speech is, overall, a weakly non-stationary signal, but over a short interval such as 20 ms it can be treated as approximately stationary. To raise the compression ratio, the window should be as long as possible so as to lower the bit rate, but a longer window also increases codec delay. The frame length must therefore balance coder delay against bit rate in light of the characteristics of speech. The low-bit variable-rate coder of this embodiment requires a frame length of at least 20 ms; moreover, a 20 ms frame is what the great majority of coders use, and it counts as a low-to-medium-delay coding strategy. Within a 20 ms frame the speech signal can be treated as approximately stationary, which favors its orthogonal representation, so this embodiment uses a frame length of 20 ms, that is, 160 samples.
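The frame arithmetic (20 ms at 8 kHz gives 160 samples) and a simple framing loop are sketched below. The sketch cuts non-overlapping frames and zero-pads a short final frame; both choices are assumptions, and the window overlap that a local cosine transform applies between neighboring frames is not modeled.

```python
SAMPLE_RATE = 8000  # Hz
FRAME_MS = 20
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples

def split_frames(samples):
    """Yield consecutive non-overlapping frames; a short tail is zero-padded."""
    for start in range(0, len(samples), FRAME_LEN):
        frame = list(samples[start:start + FRAME_LEN])
        yield frame + [0.0] * (FRAME_LEN - len(frame))
```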

Evaluation of the coder:

1. Objective evaluation

Table 3 compares the VBR-LCT speech coder of this embodiment with the G.729B, GSM Half-Rate, FS1016 and FS1015 coding standards. The results also illustrate how reliable objective evaluation is for assessing speech coder performance. G.729B, GSM Half-Rate and FS1016 are all low-to-medium bit rate standards, and the speech quality they deliver far exceeds that of FS1015 and VBR-LCT; judged by these two measures alone, however, VBR-LCT compares very favorably. Against the FS1015 coder, which has a similar bit rate, the SNR and PSNR measured on several types of speech data show that VBR-LCT is clearly better, by up to nearly 5 dB.

In essence, the VBR-LCT method operates in the transform domain and belongs to the category of waveform coding. Objective evaluation with the two measures SNR and PSNR therefore favors it, and the objective scores can serve as a reference when evaluating the coder.
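The two objective measures can be computed as sketched below, using their usual definitions: SNR compares signal energy with error energy, and PSNR compares the squared peak amplitude with the mean squared error. The patent does not state its exact formulas, so these definitions, and the peak value of 1.0 for normalized amplitudes, are assumptions.

```python
import math

def snr_db(original, decoded):
    """10*log10(signal energy / error energy); infinite for a perfect copy."""
    noise = sum((o - d) ** 2 for o, d in zip(original, decoded))
    signal = sum(o * o for o in original)
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)

def psnr_db(original, decoded, peak=1.0):
    """10*log10(peak^2 / mean squared error); peak=1.0 assumes normalized samples."""
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    return math.inf if mse == 0.0 else 10.0 * math.log10(peak * peak / mse)
```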

2. Subjective evaluation

The final recipient of the speech produced by a speech coder is the human ear, so the quality of coded speech is judged mainly by human auditory perception. Here we evaluate speech quality with informal listening tests.

When coding the speech of a two-way conversation, the VBR-LCT coder of this embodiment has an average bit rate close to 1.6 kb/s. For clean, noise-free speech, the speech reconstructed by the VBR-LCT coder is also slightly blurred, so it does not sound as sonorous as speech reconstructed by LPC-10e. Its clarity is lower than that of speech produced by the G.729B, GSM Half-Rate and FS1016 standards, but its intelligibility and naturalness are good and clearly better than those of LPC-10e at a similar bit rate. The VBR-LCT method is robust to environmental noise: its coding distortion is insensitive to changes in the signal, and it remains stable even on signals for which G.729B, GSM Half-Rate, FS1016 and LPC-10e fail. With background music or other non-speech signals, VBR-LCT is clearly better than LPC-10e. All of this follows from the fact that VBR-LCT is a waveform coding method in the transform domain and so does not depend on speech feature parameters such as pitch.

Table 1. Sub-vector dimensions for each speech mode

Speech mode                      1st   2nd   3rd   4th
Silence/background noise          40    40     0     0
Mode 0 (unvoiced)                 40    40    40    20
Mode 1 (lightly voiced)           40    40    40    20
Mode 2 (medium-strong voiced)     40    40    40    20

Table 2. Bit allocation of the VBR-LCT coder (bits)

Speech mode                      1st   2nd   3rd   4th   Gain   Mode   Bits/frame
Silence/background noise           7     7     0     0      8      2           24
Mode 0 (unvoiced)                  7     7     8     8      8      2           40
Mode 1 (lightly voiced)           11    11     8     8      8      2           48
Mode 2 (medium-strong voiced)      8     8     8     6      8      2           40

Table 3. Objective comparison of coders

Coder               SNR (dB)   PSNR (dB)   Bit rate (kb/s)
G.729 Annex B          -0.95       15.08               8
GSM Half-Rate           1.24       14.81               5.6
FS1016                  0.71       16.74               4.8
FS1015 (LPC-10e)       -3.59       12.47               2.4
VBR-LCT                -0.96       15.08               1.6

Claims

1. A low-bit variable-rate speech coder, in which the input original speech signal is preprocessed by a high-pass filter and then input to a voice activity detector that detects and distinguishes active speech frames and inactive speech frames, which are then processed by respective local cosine transformers to complete the speech coding, characterized in that:

The voice activity detector is a support vector machine voice activity detection (SVM-VAD) module, whose workflow is as follows:

① Parameter extraction: extract parameters from the input speech data to obtain four classification feature parameters of the current frame: the line spectral frequencies, the full-band energy, the low-band energy, and the zero-crossing rate;

② Initialization: while only background noise is present, calculate the above four feature parameters and keep updating them as the background noise changes;

③ Differencing: take the difference between the four feature parameters of the current frame and the corresponding four feature parameters obtained during initialization, which represent the background-noise-only state, to generate the four difference feature parameters required for voice activity detection and classification;

④ Classification: perform voice activity detection with a support vector machine, which is trained by the sequential minimal optimization (SMO) method, and classify the speech into two types, active and inactive;

⑤ Smoothing: apply a four-step smoothing and correction algorithm to smooth the voice activity detection decision;

⑥ Output: after voice activity detection has been performed on a frame, output the inactive or active speech frame signal; if the estimated background noise energy of the frame is greater than the background noise energy threshold, the average background noise parameters are re-corrected;
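Steps ② to ④ above can be sketched as follows. This is a simplified illustration only: it uses two of the four features (full-band energy and zero-crossing rate), leaving out the line spectral frequencies and the low-band energy; it replaces the SMO-trained support vector machine with a linear decision function whose weights are made up; and it assumes the difference is taken as current frame minus background estimate. The frame length of 160 samples, the smoothing factor and all names are hypothetical.

```python
import math


def frame_energy_db(frame):
    """Full-band energy of one frame in dB (small floor avoids log of zero)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(mean_square + 1e-12)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


class NoiseTracker:
    """Running average of the features over noise-only frames (step 2)."""

    def __init__(self, smoothing=0.9):  # smoothing factor is an assumption
        self.smoothing = smoothing
        self.mean = None

    def update(self, features):
        if self.mean is None:
            self.mean = list(features)
        else:
            a = self.smoothing
            self.mean = [a * m + (1 - a) * f for m, f in zip(self.mean, features)]

    def difference(self, features):
        """Difference features (step 3): current frame minus background."""
        return [f - m for f, m in zip(features, self.mean)]


def is_active(diff_features, weights, bias):
    """Linear decision function standing in for the trained SVM (step 4)."""
    score = sum(w * d for w, d in zip(weights, diff_features)) + bias
    return score > 0


# Synthetic frames: low-level high-frequency "noise" and a louder low-frequency tone.
noise = [0.01 * math.sin(0.9 * n) for n in range(160)]
speech = [0.5 * math.sin(0.1 * n) for n in range(160)]

tracker = NoiseTracker()
tracker.update([frame_energy_db(noise), zero_crossing_rate(noise)])

WEIGHTS, BIAS = [1.0, -10.0], -6.0  # hypothetical values, not trained

speech_features = [frame_energy_db(speech), zero_crossing_rate(speech)]
noise_features = [frame_energy_db(noise), zero_crossing_rate(noise)]
speech_active = is_active(tracker.difference(speech_features), WEIGHTS, BIAS)
noise_active = is_active(tracker.difference(noise_features), WEIGHTS, BIAS)
```

Working on difference features rather than absolute values lets the same classifier be used at different background noise levels, since the running noise estimate absorbs the level.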

The local cosine transformer processing is performed as follows:

① For inactive speech frames detected by the SVM-VAD module, the frame is split into fractal-dimension vectors according to the fractal-dimension vector dimensions of the silence/background noise speech mode; each fractal-dimension vector is then vector quantized using the codebook of the corresponding fractal-dimension vector of that speech mode, giving two quantization results of 7 bits each; the gain of the frame is scalar quantized at the same time; and, in the order of 2 bits representing the speech mode, 8 bits representing the gain, and 7 bits each representing the first and second fractal-dimension vectors, 3 bytes are output, which completes the speech coding of the frame;

② For active speech frames detected by the SVM-VAD module, the frames are classified into three speech modes, namely unvoiced, lightly voiced, and moderately strong voiced; each frame is split into four fractal-dimension vectors according to the fractal-dimension vector dimensions of its speech mode; the four fractal-dimension vectors are then vector quantized using the codebooks of the corresponding fractal-dimension vectors of that speech mode, giving four quantization results whose bit lengths depend on the speech mode; the gain of the frame is scalar quantized at the same time; and the 2 bits representing the speech mode, the 8 bits representing the gain, and the bits representing the first through fourth fractal-dimension vectors of that speech mode are assembled, in that order, into an integer number of bytes and output, which completes the speech coding of the frame.
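The field order described in ① and ② (mode, gain, then the vector indices) can be illustrated with a short bit-packing sketch. The index widths follow claim 3; the 2-bit mode codes, the most-significant-bit-first ordering and all names are assumptions, since the claims do not specify them.

```python
# Codebook index widths (bits) per speech mode, as given in claim 3.
INDEX_WIDTHS = {
    "silence": (7, 7),
    "mode0": (7, 7, 8, 8),    # unvoiced
    "mode1": (11, 11, 8, 8),  # lightly voiced
    "mode2": (8, 8, 8, 6),    # moderately strong voiced
}

# Hypothetical 2-bit mode codes; the actual values are not given in the claims.
MODE_CODES = {"silence": 0, "mode0": 1, "mode1": 2, "mode2": 3}


def pack_frame(mode, gain_index, vector_indices):
    """Pack mode (2 bits), gain (8 bits) and vector indices into bytes, MSB first."""
    widths = INDEX_WIDTHS[mode]
    if len(vector_indices) != len(widths):
        raise ValueError("wrong number of vector indices for this mode")
    fields = [(MODE_CODES[mode], 2), (gain_index, 8)]
    fields += list(zip(vector_indices, widths))
    value = 0
    total_bits = 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError(f"value {field} does not fit in {width} bits")
        value = (value << width) | field  # append the field after earlier ones
        total_bits += width
    # Every mode totals a multiple of 8 bits (24, 40, 48 or 40).
    return value.to_bytes(total_bits // 8, "big")


silence_frame = pack_frame("silence", 0xFF, (0, 0x7F))
print(silence_frame.hex())  # 3fc07f
```

A decoder would read the first 2 bits to learn the mode, which in turn tells it how many further bits to read and how to split them.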

2. The low-bit variable-rate speech coder according to claim 1, characterized in that: for the silence/background noise speech mode, the first and second fractal-dimension vectors each have a dimension of 40, and the third and fourth fractal-dimension vectors each have a dimension of 0; for the unvoiced, lightly voiced, and moderately strong voiced speech modes, the first, second and third fractal-dimension vectors each have a dimension of 40, and the fourth fractal-dimension vector has a dimension of 20.

3. The low-bit variable-rate speech coder according to claim 1, characterized in that: for the silence/background noise speech mode, the first and second fractal-dimension vectors are each allocated 7 bits, the third and fourth fractal-dimension vectors are each allocated 0 bits, the gain is allocated 8 bits, and the mode is allocated 2 bits; for the unvoiced speech mode, the first and second fractal-dimension vectors are each allocated 7 bits, the third and fourth fractal-dimension vectors are each allocated 8 bits, the gain is allocated 8 bits, and the mode is allocated 2 bits; for the lightly voiced speech mode, the first and second fractal-dimension vectors are each allocated 11 bits, the third and fourth fractal-dimension vectors are each allocated 8 bits, the gain is allocated 8 bits, and the mode is allocated 2 bits; for the moderately strong voiced speech mode, the first and second fractal-dimension vectors are each allocated 8 bits, the third and fourth fractal-dimension vectors are allocated 8 bits and 6 bits respectively, the gain is allocated 8 bits, and the mode is allocated 2 bits.