CN101359978B - Method for control of rate variant multi-mode wideband encoding rate - Google Patents
- Publication number: CN101359978B (application CN200710153938.7)
- Authority: CN (China)
- Prior art keywords: frame, coding, sound, type, voice signal
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Compression, Expansion, Code Conversion, And Decoders
Abstract
The invention provides a new variable-rate wideband encoder and encoding method. Before determining the output coding rate, the variable-rate wideband encoder synthesizes the digital speech signal at each candidate coding rate; it then determines the output coding rate by applying voice-activity detection and voiced/unvoiced detection to the synthesized digital speech signal, so that the sound signal synthesized by the decoder accurately reflects the auditory effect of the original speech. The invention can be applied directly to the speech coding techniques of third-generation mobile communication systems.
Description
Technical field
The present invention relates to methods for determining the coding rate in variable-rate multi-mode wideband (VMR-WB) coding, and specifically to the choice of object and method for voice-activity detection, voiced/unvoiced classification, and the decision of whether a frame is stable voiced.
Background technology
Code-excited linear prediction (CELP) coders have been widely used since they were proposed in 1985. The vocoders of both CDMA (code-division multiple access) and UMTS (Universal Mobile Telecommunications System) use CELP techniques.
CELP coding comprises linear prediction analysis and quantization, adaptive-codebook search, and fixed-codebook search. Because speech itself contains silent periods, voice data can be compressed effectively by lowering the data rate during those periods; Qualcomm's variable-rate vocoder patent, application number 92104618.9, is one such scheme. There are also other methods that decide the coding rate according to characteristics of the speech.
To meet the needs of wideband speech coding, 3GPP2 (Third Generation Partnership Project 2) selected the variable-rate multi-mode wideband (VMR-WB) vocoder as a standard. VMR-WB also adopts CELP coding. According to the features of each input signal frame and the selected operating mode, a built-in rate-selection mechanism chooses a corresponding coding type. The rates of the selectable coding types are: full rate (FR) at 13.3 kb/s (kilobits/second), half rate (HR) at 6.2 kb/s, quarter rate (QR) at 2.7 kb/s, and eighth rate (ER) at 1.0 kb/s.
Linear prediction analysis and quantization comprise: forming a sequence from the sampled speech frames or the preprocessed speech frames; multiplying the sound samples in this sequence by a window function to provide a windowed speech data frame; computing a set of autocorrelation coefficients from the windowed frame; computing a set of linear prediction coefficients from the autocorrelation coefficients with the Levinson-Durbin algorithm; transforming the linear prediction coefficient set into another spectral domain; and quantizing, at the rate indicated in the coding order, the transformed 16th-order coefficient set as immittance spectral pair (ISP) values.
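The Levinson-Durbin step above can be sketched as follows. This is a minimal floating-point illustration of the recursion, not the fixed-point routine the codec actually uses; the function name and the plain autocorrelation input are assumptions for illustration.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the LP normal equations from the
    autocorrelation sequence r[0..order].

    Returns (lp_coeffs, residual_energy), where lp_coeffs are a[1..order]
    of A(z) = 1 + a_1 z^-1 + ... + a_order z^-order.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction residual
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # update coefficients 1..i-1, then append the new one
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:], err
```

For an AR(1)-like autocorrelation [1, 0.5, 0.25] the first coefficient comes out as -0.5, i.e. the model x[n] = 0.5 x[n-1] + e[n].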
In CELP, the optimum codebook vectors obtained by the adaptive-codebook search and the fixed-codebook search are each multiplied by their optimum gains and summed; the sum is the excitation signal. The excitation signal must be used in the coding process: CELP searches for the excitation whose synthesized speech has minimum error relative to the original speech.
VMR-WB includes an adaptive-codebook search, described in section 5.16 of 3GPP2 C.S0052-A. The adaptive-codebook search comprises a search based on the pitch delay and the interpolation of the past excitation signal by the integer and fractional parts of the pitch delay, after which the adaptive codebook is finally obtained; see section 5.16 of C.S0052-A for details. The past excitation signal is the excitation of previous frames over the 231+17 samples nearest to the current frame.
In the VMR-WB decoding process, the LP (linear prediction) filter parameters are decoded for every frame, forming the LP filter coefficients used to reconstruct each subframe of the speech signal. The excitation of each subframe is constructed as follows. For the FR and Generic HR coding types, which use the adaptive-codebook search, and for the voiced HR coding type, which uses signal modification and a delay contour, the excitation is the adaptive-codebook vector scaled by the adaptive-codebook gain plus the fixed-codebook vector scaled by the fixed-codebook gain; the gain values here are the quantized values found in the quantization tables from the decoded adaptive-codebook gain index and fixed-codebook index. The adaptive-codebook vector is a signal synthesized from the excitation of the previous subframe: the decoder decodes the integer and fractional pitch delay from the codebook index, interpolates the previous subframe's excitation by this delay to obtain the adaptive-codebook excitation, and then derives the adaptive-codebook vector from that excitation according to the signal-path parameter in the coded frame, which the encoder wrote while performing frequency-dependent pitch prediction (choosing one of two signal paths according to the pitch-prediction error). Note that under the RS-I (rate set 1) voiced FR and Generic FR modes there are two adaptive-codebook gains (see section 5.20.2 of C.S0052-A) and correspondingly two adaptive-codebook vectors. Under the unvoiced coding types and the QR or ER coding types, which do not use the adaptive-codebook search, the excitation is determined by the fixed-codebook vector scaled by the fixed-codebook gain.
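The per-subframe excitation construction just described can be condensed into a minimal sketch, assuming the codebook vectors and quantized gains have already been decoded; the function name and signature are hypothetical, not the codec's actual routine.

```python
import numpy as np

def build_excitation(adaptive, fixed, g_pitch, g_code, use_adaptive=True):
    """Per-subframe excitation: scaled fixed-codebook vector, plus the
    scaled adaptive-codebook vector for the coding types that use the
    adaptive-codebook search; unvoiced/QR/ER types drop the adaptive part."""
    exc = g_code * np.asarray(fixed, dtype=float)
    if use_adaptive:
        exc = exc + g_pitch * np.asarray(adaptive, dtype=float)
    return exc
```

For the RS-I voiced FR and Generic FR modes with two adaptive-codebook gains, the adaptive term would be the sum of two such scaled vectors.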
VMR-WB fixed-codebook gain quantization comprises: a predicted gain obtained from the quantified prediction errors of the energies of the preceding subframes, and the quantization of a correction factor between the fixed-codebook gain and the predicted gain. The quantified prediction error of a subframe is the logarithm of that correction factor.
VMR-WB fixed-codebook gain quantization is explained in section 5.20 of 3GPP2 C.S0052-A. Formulas (1)-(3) below show the relation between the quantified energy prediction error and the predicted gain for the FR, Generic HR and voiced HR types.

Formula (1) defines the predicted energy of the n-th subframe,

$$\tilde{E}(n) = \sum_{i=1}^{4} b_i \hat{R}(n-i) \qquad (1)$$

where $[b_1\ b_2\ b_3\ b_4] = [0.5\ 0.4\ 0.3\ 0.2]$ are the moving-average (MA) prediction coefficients and $\hat{R}(k)$ is the quantified energy prediction error of the k-th subframe. Formula (2) defines the predicted gain $g'_c$ in the linear domain, and formula (3) defines the predicted gain $G'_c$ in the log domain,

$$g'_c = 10^{(\tilde{E}(n) + \bar{E} - E_i)/20} \qquad (2)$$

$$G'_c = \tilde{E}(n) + \bar{E} - E_i \qquad (3)$$

where $\bar{E} = 30$ dB is the mean value of the innovation energy and $E_i$ is the mean innovation energy. The correction factor between the fixed-codebook gain and the linear-domain predicted gain is the ratio of the former to the latter; the energy prediction error $R(n)$ is 20 times the logarithm of this correction factor, and the quantified energy prediction error is 20 times the logarithm of the quantized correction factor.
Section 5.20 of 3GPP2 C.S0052-A likewise explains how the gain prediction error affects the unvoiced HR and unvoiced QR types. In section 5.20.1 of C.S0052-A v1.0, formula 5.20.1-4 gives the definition of the quantized prediction error and formula 5.20.1-5 gives the definition of the quantized linear-domain gain, from which it can be seen that the difference between the quantized log gain and the predicted gain given by (3) is exactly the quantized prediction error.
The formants of the synthetic digital speech frame formed after linear prediction and quantization, adaptive-codebook search and fixed-codebook search are determined mainly by the linear prediction analysis (LPC). More precisely, for VMR-WB, after the ISPs are converted to linear prediction (LP) coefficients, the 16th-order LP synthesis filter is determined by formula (4):

$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{16} \hat{a}_i z^{-i}} \qquad (4)$$

where $\hat{a}_i$ (i = 1, ..., m, m = 16) are the quantized linear prediction (LP) coefficients.
The synthetic digital speech frame is the output of the excitation signal filtered through the LP synthesis filter, so the poles of the synthesis filter correspond to the frequencies and bandwidths of the formants of the synthesized frame. These formants are reflected in the intensity of the time-domain waveform and strongly affect auditory perception.
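Running the excitation through the all-pole synthesis filter of formula (4) can be sketched in direct form; the helper name and the memory handling are assumptions for illustration, not the codec's actual fixed-point routine.

```python
import numpy as np

def lp_synthesize(excitation, lp_coeffs, memory=None):
    """All-pole synthesis 1/A(z), A(z) = 1 + sum_i a_i z^-i, formula (4):
    s(n) = u(n) - sum_i a_i s(n-i)."""
    m = len(lp_coeffs)
    mem = np.zeros(m) if memory is None else np.asarray(memory, float).copy()
    out = np.empty(len(excitation))
    for n, u in enumerate(excitation):
        s = u - np.dot(lp_coeffs, mem)   # mem holds s(n-1)..s(n-m)
        out[n] = s
        mem = np.roll(mem, 1)
        mem[0] = s
    return out
```

A single pole at z = 0.5 (lp_coeffs = [-0.5]) turns a unit impulse into the decaying sequence 1, 0.5, 0.25, ..., showing how pole positions shape the output waveform.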
According to "Linear Prediction: A Tutorial Review" (Proc. IEEE, 1975, 63(4): 561-580), the peaks of the spectral envelope obtained by linear prediction tend to approach the harmonic peaks and often depart from the positions of the true formants; that is, the spectral envelope of the synthetic digital speech frame obtained through the LP synthesis filter is not consistent with the spectral envelope of the original digital speech signal frame.
The book "Discrete-Time Speech Signal Processing: Principles and Practice" by T. F. Quatieri of the United States (Chinese edition published by Publishing House of Electronics Industry, 2004) points out in section 5.3.4, the Levinson recursion and its properties, that the all-pole model and autocorrelation method used by linear prediction force all poles of formula (4) to fall inside the unit circle, yielding a minimum-phase system; the phase function of the Fourier transform of the autocorrelation-method solution is therefore distorted, since the autocorrelation method converts maximum-phase glottal poles into minimum-phase poles. When the synthetic speech waveform is built, the phase distortion caused by this conversion can affect speech perception, that is, the waveform of the synthesized digital speech signal departs from the waveform of the original. Section 5.6 of the book, speech synthesis based on the all-pole model, points out that the synthesized signal based on the linear-prediction autocorrelation method sounds like speech but loses the absolute phase structure because of its minimum-phase property; in the example shown in Fig. 5.18 of the book, the spikes of the reconstructed speech signal are more prominent than in the original, and the ideal glottal pulse, assumed minimum phase, is time-reversed and has a steeper rising edge than the actual glottal pulse.
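The minimum-phase property cited above can be checked numerically: LP coefficients obtained by the autocorrelation method yield an A(z) whose roots all lie inside the unit circle. A small sketch under assumed parameters (synthetic AR(2)-shaped noise, order 16 as in VMR-WB):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
# shape the noise with a stable AR(2) recursion so it has spectral structure
for n in range(2, len(x)):
    x[n] += 1.3 * x[n - 1] - 0.6 * x[n - 2]

order = 16
# biased autocorrelation of the finite-length sequence
r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
# autocorrelation method: solve the Toeplitz normal equations R a = -r
R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
a = np.linalg.solve(R, -r[1:order + 1])
# poles of 1/A(z) are the roots of A(z) = 1 + a_1 z^-1 + ... + a_16 z^-16
poles = np.roots(np.concatenate(([1.0], a)))
```

All 16 roots come out with magnitude strictly below one, consistent with the minimum-phase claim.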
The current VMR-WB vocoder determines the coding rate with a rate-selection method: a classification process divided into several stages sorts each input speech frame into one of inactive speech, unvoiced, stable voiced, or unstable voiced, using a voice-activity detection (VAD) method, a voiced/unvoiced classification method, and a stable-voiced classification method.
The current voice-activity detection (VAD) method first computes the difference between the level of the preprocessed input signal and the background-noise estimate, then computes the VAD decision threshold. The initial VAD decision is made by comparing this difference with the threshold: when the former exceeds the latter, the frame is initially judged active speech; when the former is less than or equal to the latter, the initial judgment is no speech. The final VAD decision combines the initial judgment with the results of other detections, such as the tone of the preprocessed digital speech signal.
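The initial VAD decision described above can be sketched as a simple level-versus-noise comparison. The real VMR-WB VAD works per sub-band with an adaptive threshold, so this is only an illustrative reduction; the function name, dB units, and fixed threshold are assumptions.

```python
import numpy as np

def vad_initial_decision(frame, noise_level_est_db, threshold_db):
    """Initial VAD decision: a frame whose level exceeds the background-noise
    estimate by more than the decision threshold is initially judged
    active speech; otherwise it is judged no speech."""
    frame = np.asarray(frame, dtype=float)
    level_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)  # frame level
    return (level_db - noise_level_est_db) > threshold_db
```

The final decision would then combine this initial judgment with the other detections mentioned above.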
Summary of the invention
Technical problem to be solved
The objects of the VAD, voiced/unvoiced classification and stable-voiced classification adopted by the prior art are the digital speech frames formed by sampling the speech input, or those frames after preprocessing. However, the phonetic features of the synthetic digital speech frame produced from the coded frame generated by VMR-WB coding, which adopts CELP techniques, are not consistent with those of the original digital speech frame. As pointed out in the background: the peaks of the spectral envelope estimated by linear prediction analysis often depart from the true formants, and the all-pole model and autocorrelation method used by linear prediction force all poles of the model inside the unit circle, distorting the phase function of the Fourier transform of the synthesized digital speech signal; this makes the shape of the synthesized digital speech signal depart from the shape of the original.
The waveform peak positions of the synthesized digital speech signal produced from coded frames, whose excitation codebooks are obtained through linear prediction analysis, often deviate from the peak positions in the waveform of the original digital speech signal (sampled or preprocessed) used to produce them. A verifying example can be found with the AMR-WB-interoperable mode of the 3GPP2 VMR-WB encoder, whose 13.3 kb/s rate is compatible with the AMR-WB 12.65 kb/s rate. In the 3GPP test-sequence archive TS26074-500.zip (zip is the file extension), the file DTX4.INP (INP is the file extension) inside DTX_400.zip inside the TS_AMR500_DTX.zip file contains a speech signal whose frame between 7.83 seconds and 7.84 seconds carries the maximum waveform peak. Using the DTX4.INP file as the input speech signal, encoded and decoded at the AMR 12.65 kb/s rate, the peak positions of the synthesized digital speech signal and of the original do not correspond to each other. The following describes this point:
As shown in Figure 1, the maximum peak between 7.83 seconds and 7.84 seconds in the waveform of the speech signal corresponding to the DTX4.INP file lies in frame 392 (before 7.84 seconds in the figure), and a corresponding maximum peak can still be found in the preprocessed digital speech frame. But in the synthesized digital speech signal after decoding, shown in Figure 2, the corresponding waveform peak appears in frame 393 (after 7.84 seconds) of the signal produced by decoding the coded frames encoded at the 13.3 kb/s rate: frame 393 of the synthesized signal is one frame later than the corresponding frame 392. Even if unstable-voiced detection were to find the waveform peak of frame 392 in the preprocessed digital speech signal corresponding to the DTX4.INP file, and the preprocessed signal were accordingly encoded at the 13.3 kb/s rate, frame 392 of the synthesized digital speech signal produced after decoding those 13.3 kb/s coded frames would contain no waveform peak corresponding to the perceptually significant peak in frame 392 of the original signal.
Thus a preprocessed digital speech frame and its corresponding synthesized digital speech frame do not necessarily have consistent phonetic features, and the results of VAD, voiced/unvoiced detection and unstable-voiced detection on the preprocessed (or sampled) digital speech frame do not imply the same results for the synthesized frame, particularly when the coding operations map a formant detected in one digital speech input frame onto the synthesized frame corresponding to the next input frame.
As described in the background, the existing VAD, voiced/unvoiced detection and unstable-voiced detection techniques do not detect the formants of the preprocessed (or sampled) digital speech frame: the per-sub-band signal-level detection, pitch detection, tone detection and complex-signal detection used in the current techniques do not directly involve formant detection. Yet VMR-WB coding forms, at the poles of the synthesis filter based on the LP coefficients obtained by LPC, harmonic peaks that strongly affect hearing; the coding operation thus maps the frequency positions of the formants onto these harmonic peaks.
When the speech signal is very weak, the amplitudes and energies of its formants are so small that they are almost drowned by background noise; that is, in the raw sampled or preprocessed digital speech signal, the level or energy of the background noise is close to that of the weak formants, so VAD reports no speech, and multi-sub-band level detection, pitch detection and tone detection also fail, because in the prior art VAD and related operations are performed before the pitch-delay parameter and the innovative-codebook computation. The LPC in the existing VMR-WB technique is not used to detect the frequencies and bandwidths of the poles corresponding to the formants, nor to detect the amplitude and energy of the waveform at the waveform peaks corresponding to the synthesis-filter poles, although the amplitude and energy of the waveform at these peaks strongly affect speech perception.
Technical scheme
In order that VAD, voiced/unvoiced detection and unstable-voiced detection more accurately reflect, for the synthetic digital speech frame obtained from the VMR-WB coded frame, whether there is speech, whether it is voiced, and whether it is unstable voiced, the present invention applies these detections directly to the synthesized digital speech signal corresponding to the VMR-WB coded frame. So that harmonic peaks of the synthesized digital speech signal corresponding to perceptually significant formants of the original signal are not missed by VAD and the other detections, the invention also directly detects, during VAD, voiced/unvoiced detection and unstable-voiced detection, the amplitude or energy of the output of the LP synthesis filter, i.e. the synthesized digital speech signal. Although the amplitude or energy of the waveform at the peaks corresponding to the synthesis-filter poles cannot be detected directly, a harmonic peak in the spectrum of the synthesized digital signal will not be missed as long as the amplitude or short-time energy (or average amplitude) it produces in the time-domain waveform exceeds the specified detection threshold.
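The amplitude and short-time-energy screening described above might be sketched like this; the thresholds, window length and function name are all assumptions made for illustration.

```python
import numpy as np

def exceeds_detection_threshold(synth_frame, amp_thresh, energy_thresh, win=80):
    """Flag a synthesized frame whose peak amplitude, or whose short-time
    energy over any window, crosses the configured thresholds, so that
    harmonic peaks reflected in the time-domain waveform are not missed."""
    s = np.asarray(synth_frame, dtype=float)
    if np.max(np.abs(s)) > amp_thresh:          # peak-amplitude check
        return True
    for start in range(0, len(s) - win + 1, win):
        if np.mean(s[start:start + win] ** 2) > energy_thresh:
            return True                          # short-time energy check
    return False
```

Average magnitude could replace the squared mean with `np.mean(np.abs(...))` under the same structure.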
To solve the problem that the above detections are not applied to the speech features of the frame decoded from the corresponding VMR-WB coded frame, the present invention provides encode-first, detect-later methods. The encoding operations performed on the input speech frame before VAD in the methods below do not involve computing VAD coding parameters. In the methods below, "encoding" the input speech frame means obtaining the coding parameters after linear prediction analysis and quantization, the pitch-delay parameters from the adaptive-codebook search, the fixed-codebook parameters from the fixed-codebook search, and so on.
The method of determining the coding rate that requires at most four codings at different rates, and at least three, is as follows:
Encode the input speech frame with the unvoiced QR, voiced HR and FR coding types respectively, and output the synthetic digital speech frame produced from the excitation of each coding. Apply VAD to the unvoiced-QR-synthesized frame; if the VAD result is no speech, encode the input speech frame with the CNG-ER coding type and output the generated VMR-WB coded frame. Apply voiced/unvoiced detection to the voiced-HR-synthesized frame; if the result is unvoiced, output the VMR-WB coded frame generated from the input frame with the unvoiced QR coding type. Apply stable-voiced detection to the FR-synthesized frame; if the result is stable voiced, output the VMR-WB coded frame generated with the voiced HR coding type; if the result is not stable voiced (for example, the frame contains a non-stationary voice segment or a voiced signal in a fast transition phase), output the VMR-WB coded frame generated with the FR coding type.
Encode the speech input frame at the FR rate and generate the synthetic digital speech frame from the excitation produced by the coding, then apply stable-voiced detection to this FR-synthesized frame; encode at the HR rate and generate its synthesized frame, then apply voiced/unvoiced detection to it; encode at the QR rate and generate its synthesized frame, then apply voice-activity detection (VAD) to it. If the stable-voiced result is unstable voiced, the VMR-WB coded frame generated at the FR rate is the output; if the result is stable voiced and the voiced/unvoiced result is voiced, the VMR-WB coded frame generated at the HR rate is the output; if the voiced/unvoiced result is unvoiced and the VAD result is speech, the VMR-WB coded frame generated at the QR rate is the output; otherwise, encode at the CNG-ER rate.
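The decision logic of the encode-first method just described can be condensed as follows, assuming the three detections have already been run on the FR-, HR- and QR-synthesized frames; the function name and boolean inputs are illustrative.

```python
def select_rate(stable_voiced_on_fr, voiced_on_hr, speech_on_qr):
    """Choose the output coding rate from the detection results on the
    synthesized frames of the three candidate codings."""
    if not stable_voiced_on_fr:      # unstable voiced -> full rate
        return "FR"
    if voiced_on_hr:                 # stable voiced -> half rate
        return "HR"
    if speech_on_qr:                 # unvoiced but active -> quarter rate
        return "QR"
    return "CNG-ER"                  # no speech -> eighth-rate comfort noise
```

The other orderings described in this section reorder which coding and detection run first, but resolve to the same mapping from detection results to output rate.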
Besides these two methods, there is also the following method, similar to the signal classification of section 5.10 of 3GPP2 C.S0052-A:
Encode the speech input frame with the unvoiced HR or unvoiced QR coding type and generate the synthetic digital speech frame from the excitation produced by the coding (the synthesized frame is produced by passing the excitation through the LP synthesis filter). Apply voice-activity detection (VAD) to the frame synthesized under this unvoiced HR or unvoiced QR coding. If the result is no speech, encode the input frame at the CNG-ER rate and output the generated VMR-WB coded frame. If the VAD result is speech, encode the input frame with the voiced HR coding type, generate its synthesized frame from the excitation produced by the coding, and apply voiced/unvoiced detection to it. If the result is unvoiced, output the VMR-WB coded frame generated from the input frame with the unvoiced HR or unvoiced QR coding type. If the result is voiced, encode the input frame with the FR coding type, generate its synthesized frame, and apply stable-voiced detection to it. If the result is stable voiced, output the VMR-WB coded frame generated with the voiced HR coding type; if the result is not stable voiced (for example, the frame contains a non-stationary voice segment or a voiced signal in a fast transition phase), output the VMR-WB coded frame generated with the FR coding type.
This method, which performs voice-activity detection first and only then voiced and FR coding, greatly reduces the computation for conversational speech signals, since once no speech is detected only one CNG-ER coding is needed.
Since the onset of speech occurs immediately after a silent period of speech, changing the order in which the above detections are executed and determining the coding type and output coded frame in the order shown in Fig. 5 yields the following method:
Encode the speech input frame with the unvoiced HR or unvoiced QR coding type and generate the synthetic digital speech frame by passing the excitation determined by the coding through the linear prediction (LP) synthesis filter determined by the coding; apply voice-activity detection (VAD) to this synthesized frame. If the result is no speech, encode the input frame at the CNG-ER rate and output the generated VMR-WB coded frame. If the VAD result is speech, encode the input frame with the FR coding type, generate its synthesized frame in the same way, and apply stable-voiced detection to it. If the result is not stable voiced, output the VMR-WB coded frame generated with the FR coding type. If the result is stable voiced, encode the input frame with the voiced HR coding type, generate its synthesized frame, and apply voiced/unvoiced detection to it. If the result is unvoiced, output the VMR-WB coded frame generated with the unvoiced HR or unvoiced QR coding type; if the result is voiced, output the VMR-WB coded frame generated with the voiced HR coding type.
There is also a method that performs voiced/unvoiced detection first, namely:
Encode the speech input frame with the voiced HR coding type and generate the synthetic digital speech frame by passing the excitation determined by the coding through the LP synthesis filter determined by the coding; apply voiced/unvoiced detection to this synthesized frame. If the result is unvoiced, encode the input frame with the unvoiced HR or unvoiced QR coding type, generate its synthesized frame in the same way, and apply voice-activity detection (VAD) to it. If the result is voiced, encode the input frame with the FR coding type, generate its synthesized frame, and apply stable-voiced detection to it.
If the VAD result is no speech, encode the input frame at the CNG-ER rate and output the generated VMR-WB coded frame; if the VAD result is speech, output the VMR-WB coded frame generated with the unvoiced HR or unvoiced QR coding type. If the stable-voiced result is not stable voiced, output the VMR-WB coded frame generated with the FR coding type; if the result is stable voiced, output the VMR-WB coded frame generated with the voiced HR coding type.
All of the rate-determination methods above encode at several rates. Among the excitation signals, the synthesized digital speech, and the parameters produced by quantization such as the energy-prediction error that arise while encoding the speech frame with the several coding types, only those belonging to the coding type whose coded frame is actually output may be used when encoding the next frame; every state variable newly generated for a coding type whose VMR-WB frame is not output must be discarded, reverting to its previous state.
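A minimal sketch of this commit-or-discard rule for state variables, assuming a pair of shared previous-frame buffers; the class, buffer sizes, and sample values are illustrative, not taken from the specification:

```python
class SharedCoderState:
    """Previous-frame state shared by the rate modules: only the module whose
    coded frame is output commits its new state; the other trials are dropped."""
    def __init__(self):
        self.prev_excitation = [0.0]         # previous-frame excitation (illustrative size)
        self.prev_pred_error = [0.0] * 4     # quantized energy-prediction errors, 4 subframes

    def commit(self, excitation, pred_error=None):
        self.prev_excitation = excitation
        if pred_error is not None:           # e.g. a CNG frame yields no new error
            self.prev_pred_error = pred_error

# Trial results produced by encoding one frame at several rates (illustrative values):
trials = {"FR": ([0.9, -0.2], [0.1] * 4), "voiced-HR": ([0.5, 0.1], [0.3] * 4)}
state = SharedCoderState()
selected = "FR"                              # rate chosen for the output frame
exc, err = trials[selected]
state.commit(exc, err)                       # the non-selected trial state is discarded
```

Only the committed state is visible when the next frame is encoded, which is exactly the rollback behavior the paragraph requires.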
Beyond the state-variable treatment above, the present invention also analyzes and solves as follows the problem, created by multi-rate coding, of preserving and using state variables.
For the VMR-WB vocoder, among the variables appearing in the prediction-gain G'_c formula (3) above and the linear-domain prediction-gain g'_c formula (2), only the predicted energy of the subframe depends on state, namely on the quantized energy-prediction errors associated with the coding of the preceding subframes; the mean energy value is constant, and the mean-removed energy E_i depends only on the fixed codebook.
When the VMR-WB decoder decodes the coded frame produced by a speech coding module, then by formula (3), since both sides operate on the same coded frame, the mean energy and the energy E_i used by the decoder agree exactly with those of the coding module; and if both sides use the quantized prediction errors of the same four subframes of the previous frame, their prediction gains G'_c, and therefore their linear-domain prediction gains g'_c, also agree exactly.
When the encoder encodes a speech signal frame at multiple coding rates for the first time, every speech coding module in the encoder can reference an initial excitation signal and quantized prediction error consistent with the decoder's, and there is always exactly one coding module in the encoder whose coded frame is received by the decoder:

When the coded frame this module produces is an FR, Generic HR, or voiced HR coded frame, the decoder obtains directly from the received frame a pitch delay, adaptive-codebook quantized gain, fixed codebook, and quantized correction factor all consistent with this coding module, and multiplies the linear-domain prediction gain g'_c by the consistent quantized correction factor to obtain the quantized fixed-codebook gain. Because the decoder and this coding module reference the same quantized energy-prediction error and work on the same coded frame, their linear-domain prediction gains g'_c agree exactly, so their quantized fixed-codebook gains agree exactly as well. The decoder synthesizes a consistent adaptive codebook from the consistent previous-frame subframe excitation and pitch delay; after the adaptive-codebook and fixed-codebook contributions are multiplied by their respective quantized gains and summed to form the excitation of the new subframe, this new-subframe excitation agrees exactly with that of the coding module;
When the coded frame this module produces is an unvoiced HR or unvoiced QR coded frame, the decoder obtains directly from the received frame a fixed codebook and a quantized fixed-codebook gain consistent with this coding module, and multiplies the fixed codebook by the consistent quantized gain to form the excitation of the new subframe, which therefore agrees exactly with that of the coding module. Likewise, because both sides work on the same coded frame, their mean energy and energy E_i agree exactly, and, using the quantized prediction errors of the same four subframes of the previous frame, their prediction gains G'_c agree exactly; from the consistent quantized fixed-codebook gain it follows that their quantized prediction errors are also consistent. This can be derived from formula (5) below, which is equivalent to formula (5.20.1-5) of 3GPP2 C.S0052-Av1.0, where Γ is the quantized prediction error and g_c is the quantized fixed-codebook gain:

g_c = 10^(0.05 (Γ + G'_c))    (5)
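Formula (5) can be evaluated directly in the linear domain; the function below is a straightforward transcription with illustrative dB values, not code from the specification:

```python
def quantized_fixed_codebook_gain(gamma_db, predicted_gain_db):
    """Formula (5): g_c = 10^(0.05 * (Gamma + G'_c)), with Gamma the quantized
    energy-prediction error (dB) and G'_c the predicted gain (dB)."""
    return 10.0 ** (0.05 * (gamma_db + predicted_gain_db))

# If both sides agree on Gamma and G'_c, they compute the same g_c:
g_c = quantized_fixed_codebook_gain(3.0, 17.0)   # 10^(0.05 * 20) = 10.0
```

Conversely, Γ = 20·log10(g_c) − G'_c, which is the step behind the argument above: a consistent quantized fixed-codebook gain together with a consistent G'_c implies a consistent quantized prediction error.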
When the coded frame this module produces is a CNG-ER or CNG-QR coded frame, the module resets its excitation signal, and the decoder likewise resets its excitation on receiving the CNG-ER or CNG-QR frame, so the two sides' excitations remain consistent; and since CNG-type coding generates no new quantized prediction error, the two sides' quantized prediction errors also remain consistent.
Thus, through its VAD, voiced/unvoiced, and stable-voiced detection results, the encoder identifies, as soon as the coding rate of the output VMR-WB frame is specified, the speech coding module whose new-subframe excitation agrees with the decoder's. From the first time the encoder obtains the rate designation for a speech frame encoded at multiple rates, the coding module whose new-subframe excitation agrees with the decoder is determined, and every other coding module thereafter references that module's subframe excitation and quantized prediction error. This process repeats continuously, so the encoder can synthesize a consistent new excitation for the next frame's subframes using previous-frame subframe excitation consistent with the decoder's; the consistency of the excitation is passed on frame by frame and is thereby maintained over the long term.
As long as the excitation signals on the coding and decoding sides stay consistent, their synthesized digital speech also converges. Although the LP parameters may at times be inconsistent, an inconsistency in the LP parameters used to construct the LP synthesis filter does not propagate: as soon as the encoder and decoder use consistent LP parameters over a few consecutive frames, the LP parameters with which both sides construct the LP synthesis filter become consistent.
In the method of determining the VMR-WB coding rate above, the current input speech frame is therefore encoded using the new excitation produced by the output VMR-WB coded frame of the previous input frame; if that output frame produced a new quantized prediction error, the current input frame is encoded using it, and otherwise using the quantized prediction error that preceded the coding of the previous input frame.
To reduce the computation spent on synthesizing digital speech frames, the synthesis can be reduced to generating a single FR synthesized frame, on which the VAD, voiced/unvoiced detection, and stable-voiced detection are all performed, since the speech characteristics of the FR synthesized frame and of the HR-rate synthesized frame are close. This yields the following scheme:
A method of determining the VMR-WB coding rate: at the FR coding rate, perform linear prediction, adaptive-codebook search, and innovation-codebook search on the sampled sound frame, or on a digital signal frame obtained by preprocessing it, to obtain an excitation signal; filter this excitation through the linear-prediction synthesis filter determined by the linear prediction to obtain a synthesized sound digital signal frame; perform voice activity detection (VAD) on this synthesized frame; when the VAD result is active speech, perform voiced/unvoiced detection on it; and when the voiced/unvoiced result is voiced, perform stable-voiced detection on it. When the VAD result is no speech, encode with CNG-ER and generate the VMR-WB coded frame; when the voiced/unvoiced result is unvoiced, encode with unvoiced HR or unvoiced QR and generate the VMR-WB coded frame; when the stable-voiced result is stable voiced, encode at the voiced HR rate; and when it is unstable voiced, encode with FR and generate the VMR-WB coded frame.
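This reduced-complexity scheme synthesizes only once, at FR, and runs the full detector cascade on that single synthesized frame. A sketch, with the FR coder and the detectors as assumed placeholders rather than the specified modules:

```python
def select_rate_fr_only(frame, fr_coder, vad, is_voiced, is_stable_voiced):
    """Single FR analysis (LP + adaptive- and innovation-codebook search),
    one synthesis, then the detector cascade of the scheme above."""
    excitation = fr_coder.analyze(frame)      # linear prediction + codebook searches
    syn = fr_coder.synthesize(excitation)     # LP synthesis filter output
    if not vad(syn):
        return "CNG-ER"
    if not is_voiced(syn):
        return "unvoiced-HR-or-QR"
    return "voiced-HR" if is_stable_voiced(syn) else "FR"
```

Compared with the multi-synthesis flow earlier, only one trial encoding is performed per frame, which is the computational saving the paragraph claims.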
When VMR-WB operates in AMR-WB interoperability mode, there are the following two coding methods, one supporting HR and one not:
A method of determining the VMR-WB coding rate in AMR-WB interoperability mode: at the FR coding rate, perform linear prediction, adaptive-codebook search, and innovation-codebook search on the sampled sound frame, or on a digital signal frame obtained by preprocessing it, to obtain an excitation signal; filter this excitation through the linear-prediction synthesis filter determined by the linear prediction to obtain a synthesized sound digital signal frame; perform VAD on this synthesized frame, and when the VAD result is active speech, perform voiced/unvoiced detection on it. When the VAD result is no speech, encode with CNG-ER or CNG-QR and generate the VMR-WB coded frame; when the voiced/unvoiced result is unvoiced, encode at the unvoiced HR rate; and when it is voiced, encode with FR and generate the VMR-WB coded frame.
A method of determining the VMR-WB coding rate in AMR-WB interoperability mode: with FR, perform linear prediction, adaptive-codebook search, and innovation-codebook search on the sampled sound frame, or on a digital signal frame obtained by preprocessing it, to obtain an excitation signal; filter this excitation through the linear-prediction synthesis filter determined by the linear prediction to obtain a synthesized sound digital signal frame, and perform VAD on it. When the VAD result is no speech, encode with CNG-ER or CNG-QR and generate the VMR-WB coded frame; when the VAD result is active speech, encode at the FR rate and generate the VMR-WB coded frame.
The prior-art methods of voice activity detection, voiced/unvoiced detection, and stable-voiced detection remain applicable to the synthesized digital speech signal.
Furthermore, because the synthesized digital speech has higher energy at the formant peaks corresponding to the poles of the prediction synthesis filter, VAD on a synthesized frame can examine the amplitudes of its wave crests: if both the rising edge and the falling edge of a crest exceed a threshold, or either of them does, the frame is judged as active speech. In this way, once a pole's corresponding harmonic peak is reflected on the waveform as a crest amplitude above the threshold, the synthesized frame cannot be missed by VAD. The phenomenon noted in the background art, that spikes in the synthesized digital speech are more prominent than in the original signal, means those prominent spikes are detected relatively easily by comparison against a threshold. The method of setting the threshold against which the rising or falling edge of a crest is compared is not unique: the threshold may be a fixed value, or it may depend on the synthesized frame containing the crest, for example by referencing the frame's average amplitude, that is, the sum of the absolute sample values of the signal within the frame.
One VAD method proposed by the present invention checks whether any amplitude in the waveform of the synthesized digital speech exceeds a threshold; if so, the synthesized digital speech is judged as active speech. In this way, waveforms corresponding to formant peaks of the original digital speech cannot be missed once their amplitude exceeds the threshold, and the synthesized frame containing them will not be replaced by a background-noise coded frame sent to the decoding side. Another detection method checks whether the peak of the short-time average energy or short-time average magnitude of the synthesized digital speech exceeds a threshold; if so, the synthesized digital speech is judged as active speech, so that peaks of short-time average energy or magnitude corresponding to formant peaks of the original signal are detected rather than missed.
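The two VAD variants just described, an absolute amplitude threshold and a threshold referenced to the frame's average amplitude, can be sketched as follows; the relative factor `k` is an illustrative assumption, not a value from the text:

```python
def vad_amplitude(frame, threshold):
    """First variant: active speech if any sample amplitude exceeds the threshold."""
    return any(abs(s) > threshold for s in frame)

def vad_relative(frame, k=4.0):
    """Frame-relative variant: compare the peak amplitude against k times the
    frame's average magnitude (k is an assumed illustrative factor)."""
    avg = sum(abs(s) for s in frame) / len(frame)
    return max(abs(s) for s in frame) > k * avg

quiet = [1] * 16                 # low, flat waveform: no crest
spiky = [1] * 15 + [40]          # a formant-driven crest well above the rest
```

The frame-relative form adapts the threshold to the overall level of each synthesized frame, which is the option the text mentions for referencing the frame's average amplitude.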
Voiced/unvoiced detection can also operate directly on the time-domain waveform of the synthesized digital speech, since the short-time average energy or short-time average magnitude of voiced speech is higher than the corresponding value for unvoiced speech. First define the rectangular window w(n), where N is the window size:

w(n) = 1, 0 ≤ n ≤ N−1; w(n) = 0, n < 0 or n > N−1    (6)

The short-time average energy E_n of the present invention is defined as

E_n = Σ_m [x(m) w(n−m)]²

and the short-time average magnitude M_n is defined as

M_n = Σ_m |x(m)| w(n−m)

where x(m) is the synthesized digital speech signal.
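With the rectangular window of (6), both short-time measures reduce to sums over N consecutive samples. A sketch under that reading (the exact windowing convention is an assumption, since the original definitions were given as figures):

```python
def short_time_energy(x, n, N):
    """E_n with rectangular window (6): sum of x(m)^2 for m in [n, n+N-1]."""
    return sum(s * s for s in x[n:n + N])

def short_time_magnitude(x, n, N):
    """M_n with rectangular window (6): sum of |x(m)| for m in [n, n+N-1]."""
    return sum(abs(s) for s in x[n:n + N])

frame = [1, -2, 3, 0]
e0 = short_time_energy(frame, 0, 4)      # 1 + 4 + 9 + 0 = 14
m0 = short_time_magnitude(frame, 0, 4)   # 1 + 2 + 3 + 0 = 6
```

Both statistics are cheap to compute per window, which is what makes them suitable for the frame-level detections that follow.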
The present invention proposes the following methods of voiced/unvoiced detection on a synthesized digital speech frame using short-time average energy or short-time average magnitude:

Set the window size N and a short-time energy threshold; if the short-time average energy of some window within the synthesized digital speech frame exceeds this threshold, the frame's voiced/unvoiced detection result is set to voiced.

Set the window size N, a short-time energy threshold, and a count threshold; if the number of windows within the synthesized frame whose short-time average energy exceeds the energy threshold itself exceeds the count threshold, the frame's voiced/unvoiced detection result is set to voiced.

Set the window size N and a short-time magnitude threshold; if the short-time average magnitude of some window within the synthesized frame exceeds this threshold, the frame's voiced/unvoiced detection result is set to voiced.

Set the window size N, a short-time magnitude threshold, and a count threshold; if the number of windows within the synthesized frame whose short-time average magnitude exceeds the magnitude threshold itself exceeds the count threshold, the frame's voiced/unvoiced detection result is set to voiced.
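The four variants above differ only in the statistic used (energy or magnitude) and in whether one window or a count of windows must exceed the threshold. A sketch of the counting variant with non-overlapping windows (the window stride is an assumption, as the text does not fix it):

```python
def count_windows_over(frame, N, threshold, stat):
    """Number of non-overlapping length-N windows whose statistic exceeds threshold."""
    return sum(1 for n in range(0, len(frame) - N + 1, N)
               if stat(frame[n:n + N]) > threshold)

def energy(w):
    """Short-time energy of one window."""
    return sum(s * s for s in w)

def detect_voiced(frame, N, energy_threshold, count_threshold):
    """Voiced when more than count_threshold windows exceed the energy threshold."""
    return count_windows_over(frame, N, energy_threshold, energy) > count_threshold
```

Substituting a magnitude statistic for `energy` gives the corresponding magnitude-based variant without any other change.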
Short-time average energy or short-time average magnitude can also serve stable-voiced detection: at the onset of speech or at a voiced/unvoiced transition, the short-time average energy or magnitude increases markedly over its preceding values, which yields the following detection methods:
Set the window size N; if, within the FR synthesized digital speech frame, the short-time average energy of some window exceeds the maximum windowed short-time average energy over the several frames preceding it, the frame's stable-voiced detection result is set to unstable voiced. If FR synthesized frames exist before this FR synthesized frame, those preceding frames can all be FR synthesized frames; otherwise they will include HR or QR synthesized frames.
Set the window size N; if, within the synthesized digital speech frame, the short-time average magnitude of some window exceeds the maximum windowed short-time average magnitude over the several frames preceding it, the frame's stable-voiced detection result is set to unstable voiced.
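The onset test just described compares the current frame's maximum windowed statistic against the maximum over a few preceding synthesized frames. A sketch using energy (window stride and history length are assumptions):

```python
def max_window_energy(frame, N):
    """Largest short-time energy over non-overlapping length-N windows."""
    return max(sum(s * s for s in frame[n:n + N])
               for n in range(0, len(frame) - N + 1, N))

def is_stable_voiced(curr_frame, prev_frames, N):
    """Returns False (i.e. unstable voiced: onset or transition) when the
    current frame's maximum exceeds the maximum over the preceding frames."""
    curr = max_window_energy(curr_frame, N)
    hist = max(max_window_energy(f, N) for f in prev_frames)
    return curr <= hist
```

An unstable result routes the frame to FR coding in the schemes above, since onsets and fast transitions are poorly served by the voiced HR coding type.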
Beneficial effect
Because linear prediction and codebook search are performed first and the speech-feature detection (VAD, voiced/unvoiced detection, and stable-voiced detection) afterward, the excitation generated at FR or another non-ER coding rate exists before the VAD, voiced/unvoiced, and stable-voiced detection operations, which are then performed on the output of the linear-prediction synthesis filter driven by that excitation. The detection results therefore incorporate the speech features of the original digital sound frame after it has been processed by linear prediction, adaptive-codebook search, and fixed-codebook search. The speech features of the sound digital signal frame that the decoding side obtains by decoding a received non-CNG VMR-WB coded frame are similar to the features of the synthesized frame the encoding side used for detection at that same coding type; the encoding side produces a CNG-type VMR-WB frame only when it cannot detect active speech, unvoiced, voiced, or unstable-voiced features in the synthesized digital sound. Since the features of a frame decoded from a received non-CNG VMR-WB frame also bear a reasonable similarity to the features of the FR synthesized sound used for detection on the encoding side, the scheme may equally use only the FR-coded synthesized sound for feature detection.
A further benefit of detecting on the synthesized digital sound is that, when the original preprocessed digital signal has lost detectable speech features such as active speech, voiced, or unstable voiced after the linear-prediction and codebook-search processing, the compression of speech achieved by VMR-WB coding can improve further.
The VAD method of the present invention, which compares the crest amplitudes of the synthesized digital speech against a threshold, detects the synthesized frame containing a crest whenever the harmonic peak corresponding to a pole of the prediction synthesis filter is reflected on the waveform as a crest amplitude above the threshold. When the aforementioned prominence of spikes in the synthesized digital speech over the original signal shows up as a rising or falling edge of a spike, at a position corresponding to a formant peak of the original signal, that is larger than in the original signal, comparing the crest amplitude of the synthesized speech against a threshold can detect frames that the background-art method of thresholding spikes in the original waveform cannot. Likewise, when the steeper rise of the synthesized speech shows up as a rising edge of such a spike larger than in the original signal, thresholding the rising edge of the crest of the synthesized speech detects frames that previously could not be detected; and when it shows up as a rising-edge slope greater than in the original signal, thresholding the slope of the crest's rising edge detects frames that previously could not be detected.
Description of the drawings
Fig. 1 is the waveform, between 7.83 and 7.84 seconds, of the speech signal corresponding to the DTX4.INP file.
Fig. 2 is the waveform, between 7.83 and 7.84 seconds, of the digital sound signal formed by decoding the full-rate coded signal produced by the VMR-WB encoder from the speech signal corresponding to the DTX4.INP file.
Fig. 3 is a block diagram of a VMR-WB encoder that determines the coding rate from the full-rate synthesized digital speech signal.
Fig. 4 is a block diagram of a VMR-WB encoder that determines the coding rate from detection results on synthesized digital speech at multiple rates.
Fig. 5 is a functional block diagram of a VMR-WB encoder determining the coding rate from synthesized digital speech.
Embodiments
Embodiment 1: a VMR-WB encoder in RS-II mode, as shown in Fig. 3. A pulse-code-modulated (PCM) signal frame 1 with a speech sampling rate of 8 kHz is up-sampled by the sampling module into a 12.8 kHz signal frame 2, which is output simultaneously to the full-rate speech synthesis module and the speech coding module; alternatively, a PCM signal frame 1 sampled at 16 kHz is down-sampled by the sampling module into the 12.8 kHz signal frame 2 and output simultaneously to the same two modules. The full-rate speech synthesis module performs linear prediction and ISP conversion on signal frame 2 to generate the ISP coefficients and the LP coefficients used to construct the linear-prediction synthesis filter; it then performs open-loop pitch analysis and pitch tracking on the subframes of signal frame 2 to determine the open-loop pitch delay (explained in detail in section 5.8 of 3GPP2 C.S0052-A); it then performs the full-rate adaptive-codebook search, producing the adaptive-codebook vector and adaptive-codebook gain (the target signal for the adaptive-codebook search is detailed in section 5.15 of 3GPP2 C.S0052-A, and the adaptive-codebook search itself in section 5.16); after the adaptive-codebook search, it performs the FR fixed-codebook search defined in section 5.17 of 3GPP2 C.S0052-A. With the FR adaptive-codebook and fixed-codebook searches completed, the excitation of each subframe can be computed: the adaptive-codebook signal scaled by the adaptive-codebook gain is superposed with the fixed-codebook signal scaled by the fixed-codebook gain. The excitations obtained for all subframes of the signal frame are passed through the linear-prediction synthesis filter to obtain the FR synthesized digital signal frame 3, which is output to the stable-voiced detection module, the voiced/unvoiced detection module, and the voice activity detection module. The stable-voiced detection module outputs its result 6, the voiced/unvoiced detection module outputs its result 7, and the voice activity detection module outputs its result 8, all to the coded-frame output selection module. When result 8 is no speech, the selection module sends coding command signal 9 ordering CNG-ER (comfort-noise eighth-rate) coding to the speech coding module; when result 8 is active speech and result 7 is unvoiced, it sends command signal 9 ordering the QR coding rate; when result 8 is active speech, result 7 is voiced, and result 6 is stable voiced, it sends command signal 9 ordering the HR coding rate; and when result 8 is active speech, result 7 is voiced, and result 6 is unstable voiced, it sends command signal 9 ordering the FR coding rate. The speech coding module encodes signal frame 2 at the coding rate given by command signal 9 and generates the VMR-WB coded frame at that rate. The detailed coding operations for the various coding rates of RS-II mode are described in 3GPP2 C.S0052-A, and 3GPP2 also provides the source code of the VMR-WB encoder.
Embodiment 2, the VMR-NB scrambler of a RS-II pattern, as shown in Figure 2, difference from Example 1 is that it does not have voice synthetic module, by three kinds of synthetic synthetic audio digital signals of the coding module of three kinds of speed corresponding respectively the coded frame of three kinds of speed.Voice sample rate be pulsed modulation (PCM) signal frame 1 of 8kHz (KHz) through over-sampling (up-sampling) operation of sampling module, form 12.8kHz signal frame 2 simultaneously to FR (full rate) coding module, voiced sound half rate (voiced-HR) coding module, the output of voiceless sound 1/4th speed (unvoiced-QR) coding module and CNG-ER (comfort noise 1/8th speed) voice coding module, or, voice sample rate is that the signal frame 2 of owing sampling (down-sampling) operation formation 12.8kHz of pulsed modulation (PCM) the signal frame 1 process sampling module of 16kHz (KHz) is exported to the voice coding module of four kinds of speed simultaneously.The voice coding module of four kinds of speed is carried out encoding operation to signal frame 2 respectively, voice coding about these four kinds of speed is described later in detail at the C.S0052-A of 3GPP2, and 3GPP2 has also provided the source code of VMR-WB scrambler, the voice coding module that place especially of the present invention is each speed is provided with deposits the buffer memory of former frame pumping signal and the buffer memory of former frame predicated error, each voice coding module receives the former frame pumping signal 35 of Stimulus Buffer transmission and drops it off in the buffer memory of depositing former frame pumping signal before the coding that starts a frame, same each voice coding module receives the former frame predicated error 37 of predicated error buffer transmission and drops it off in the buffer memory of depositing former frame predicated error before the coding that starts a frame, each voice coding module has completed after the 
coding of a frame and the generation of synthetic audio digital signals, the pumping signal of the present frame that it is produced and predicated error send respectively to Stimulus Buffer and predicated error buffer, pumping signal that three voice coding modules except CNG-ER voice coding module produce according to their each own codings generates corresponding synthetic audio digital signals, is exactly pumping signal signal (5.21 joints at the C.S0052-Av1.0 of 3GPP2 can find this description) after the filtering of linear prediction synthesis filter.Here it may be noted that, although do not use the pumping signal of previous frame but pumping signal can be reset to fixed value when CNG-ER encode, so the pumping signal 17 after resetting will be sent to Stimulus Buffer, although do not use the predicated error of previous frame but still previous frame predicated error that it is received sends to quantized prediction error buffer as the predicated error 18 of present frame when CNG-ER encodes.For voiceless sound QR (1/4th speed) speech coder, it completes after the coding of a frame, and the pumping signal 15 producing, to Stimulus Buffer output, is sent to quantized prediction error buffer by the predicated error of generation 17.For voiced sound HR (half rate) speech coder, it completes after the coding of a frame, and the pumping signal 13 producing, to Stimulus Buffer output, is sent to quantized prediction error buffer by the quantification energy predicting error 14 of generation.For FR (full rate) speech coder, it completes after the coding of a frame, and the pumping signal 11 producing, to Stimulus Buffer output, is sent to quantized prediction error buffer by the quantification energy predicting error 12 of generation.
The voice activity detection module receives the QR synthetic digital speech signal frame 5 generated after the unvoiced QR (quarter-rate) speech coder completes the coding of a frame, performs voice activity detection on it, and outputs the detection result 8 to the coded-frame output selection module. The voiced/unvoiced detection module receives the HR synthetic digital speech signal frame 4 generated after the voiced HR (half-rate) speech coder completes the coding of a frame, performs voiced/unvoiced detection on it, and outputs the detection result 7 to the coded-frame output selection module. The stable-voiced detection module receives the FR synthetic digital speech signal frame 3 generated after the FR (full-rate) speech coder completes the coding of a frame, performs stable-voiced detection on it, and outputs the detection result 6 to the coded-frame output selection module. When the VAD result is no speech, the coded-frame output selection module outputs, as output coded frame 10, the VMR-WB coded frame 20 generated by encoding the speech input signal frame with the CNG-ER coding type, and via the excitation-and-prediction-error update signal 9 instructs the excitation buffer to update its previous-frame excitation with the excitation signal 17 of the current frame and the quantized-prediction-error buffer to update its previous-frame quantized prediction error with the prediction error 18 of the current frame. If the voiced/unvoiced detection result is unvoiced, the VMR-WB coded frame 21 generated by encoding the speech input signal frame with said unvoiced QR coding type is output as output coded frame 10, and via the update signal 9 the excitation buffer is instructed to update its previous-frame excitation with the excitation signal 15 of the current frame and the quantized-prediction-error buffer to update its previous-frame quantized prediction error with the prediction error 16 of the current frame. If the stable-voiced detection result is stable voiced, the VMR-WB coded frame 22 generated by encoding the speech input signal frame with said voiced HR coding type is output as output coded frame 10, and via the update signal 9 the excitation buffer is instructed to update its previous-frame excitation with the excitation signal 13 of the current frame and the quantized-prediction-error buffer to update its previous-frame quantized prediction error with the prediction error 14 of the current frame. If the stable-voiced detection result is not stable voiced (for example, the frame contains non-stationary speech or a voiced signal in a rapid transition phase), the VMR-WB coded frame 23 generated by encoding the speech input signal frame with said FR coding type is output as output coded frame 10, and via the update signal 9 the excitation buffer is instructed to update its previous-frame excitation with the excitation signal 11 of the current frame and the quantized-prediction-error buffer to update its previous-frame quantized prediction error with the prediction error 12 of the current frame. If the output coded frame 10 is not of the CNG-ER or CNG-QR coding type, the coded-frame output selection module sets the voice-activity-detection flag VAD-flag in the output coded frame 10 according to the VAD result 8.
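The output-selection cascade described above can be condensed into a short sketch. The dictionary keys and the boolean detector results are illustrative assumptions; the actual module works on coded frames 20 through 23 and drives the buffers via update signal 9:

```python
def select_output_frame(frames, vad_result, voiced_result, stable_voiced_result):
    """Sketch of the coded-frame output selection cascade.

    `frames` maps a coding-type label to its pre-computed VMR-WB coded
    frame (labels are illustrative, not from the specification).  The
    three booleans are the detector outputs on the synthetic frames.
    """
    if not vad_result:            # no speech: comfort noise at eighth rate
        return frames["CNG-ER"]
    if not voiced_result:         # speech present but unvoiced: quarter rate
        return frames["QR"]
    if stable_voiced_result:      # stable voiced: half rate suffices
        return frames["HR"]
    return frames["FR"]           # non-stationary / transitional voiced: full rate
```
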
Claims (16)
1. A method for determining the coding type of an output variable-rate multi-mode wideband VMR-WB coded frame, characterized in that:
The input speech signal frame is encoded with the unvoiced quarter-rate QR, voiced half-rate HR, and full-rate FR coding types, respectively, and speech-feature detection is performed on the synthetic digital speech signal frame of each, that is, on the response of the linear-prediction synthesis filter determined by that coding to the excitation signal determined by that coding: voice activity detection VAD is performed on the unvoiced QR synthetic digital speech signal frame, voiced/unvoiced detection on the voiced HR synthetic digital speech signal frame, and stable-voiced detection on the FR synthetic digital speech signal frame. If the VAD result is no speech, the VMR-WB coded frame generated by encoding the input speech signal frame with the comfort-noise eighth-rate CNG-ER coding type is used as the output coded frame; if the voiced/unvoiced detection result is unvoiced, the VMR-WB coded frame generated from the input speech signal frame with said unvoiced QR coding type is used as the output coded frame; if the stable-voiced detection result is stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said voiced HR coding type is used as the output coded frame; if the stable-voiced detection result is not stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said FR coding type is used as the output coded frame.
2. A method for determining the coding type of an output VMR-WB coded frame, characterized in that:
The input speech signal frame is encoded with the FR coding type, and the excitation signal determined by that coding is passed through the linear-prediction LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which stable-voiced detection is then performed. The input speech signal frame is encoded with the Generic HR or voiced HR coding type, and the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which voiced/unvoiced detection is then performed. The speech input frame is encoded with the unvoiced QR coding type, and the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which voice activity detection VAD is performed. If the stable-voiced detection result is unstable voiced, the VMR-WB coded frame generated from the input speech signal frame with the FR coding type is used as the output coded frame; if the stable-voiced detection result is stable voiced, the VMR-WB coded frame generated from the input speech signal frame with the Generic HR or voiced HR coding type is used as the output coded frame when the voiced/unvoiced detection result is voiced; if the voiced/unvoiced detection result is unvoiced and the VAD result is speech present, the VMR-WB coded frame generated from the input speech signal frame with the unvoiced QR coding type is used as the output coded frame; otherwise the frame is encoded with the CNG-ER coding type to generate the VMR-WB coded frame.
3. A method for determining the coding type of an output VMR-WB coded frame, characterized in that:
The input speech signal frame is encoded with the unvoiced HR or unvoiced QR coding type, and the excitation signal determined by that coding is passed through the linear-prediction LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which voice activity detection VAD is performed. If the VAD result is no speech, the VMR-WB coded frame generated by encoding the input speech signal frame with the CNG-ER coding type is used as the output coded frame. If the VAD result is speech present, the input speech signal frame is encoded with the voiced HR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and voiced/unvoiced detection is performed on it. If the voiced/unvoiced detection result is unvoiced, the VMR-WB coded frame generated from the input speech signal frame with said unvoiced HR or unvoiced QR coding type is used as the output coded frame. If the voiced/unvoiced detection result is voiced, the input speech signal frame is encoded with the FR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and stable-voiced detection is performed on it. If the stable-voiced detection result is stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said voiced HR coding type is used as the output coded frame; if the stable-voiced detection result is not stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said FR coding type is used as the output coded frame.
4. A method for determining the coding type of an output VMR-WB coded frame, characterized in that:
The input speech signal frame is encoded with the unvoiced HR or unvoiced QR coding type, and the excitation signal determined by that coding is passed through the linear-prediction LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which voice activity detection VAD is performed. If the VAD result is no speech, the VMR-WB coded frame generated by encoding the input speech signal frame with the CNG-ER coding type is used as the output coded frame. If the VAD result is speech present, the input speech signal frame is encoded with the FR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and stable-voiced detection is performed on it; if the stable-voiced detection result is not stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said FR coding type is used as the output coded frame. If the stable-voiced detection result is stable voiced, the input speech signal frame is then encoded with the voiced HR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and voiced/unvoiced detection is performed on it; if the voiced/unvoiced detection result is unvoiced, the VMR-WB coded frame generated from the input speech signal frame with said unvoiced HR or unvoiced QR coding type is used as the output coded frame; if the voiced/unvoiced detection result is voiced, the VMR-WB coded frame generated from the input speech signal frame with said voiced HR coding type is used as the output coded frame.
5. A method for determining the coding type of an output VMR-WB coded frame, characterized in that:
The input speech signal frame is encoded with the voiced HR coding type, and the excitation signal determined by that coding is passed through the linear-prediction LP synthesis filter determined by that coding to generate a synthetic digital speech frame, on which voiced/unvoiced detection is performed. If the voiced/unvoiced detection result is unvoiced, the input speech signal frame is encoded with the unvoiced HR or unvoiced QR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and voice activity detection VAD is performed on it. If the voiced/unvoiced detection result is voiced, the input speech signal frame is encoded with the FR coding type, the excitation signal determined by that coding is passed through the LP synthesis filter determined by that coding to generate a synthetic digital speech frame, and stable-voiced detection is performed on it. If the VAD result is no speech, the input speech signal frame is encoded with the CNG-ER coding type and the generated VMR-WB coded frame is used as the output coded frame; if the VAD result is speech present, the VMR-WB coded frame generated from the input speech signal frame with said unvoiced HR or unvoiced QR coding type is used as the output coded frame. If the stable-voiced detection result is not stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said FR coding type is used as the output coded frame; if the stable-voiced detection result is stable voiced, the VMR-WB coded frame generated from the input speech signal frame with said voiced HR coding type is used as the output coded frame.
6. The method according to any one of claims 1 to 5, characterized in that:
The excitation signal of the coding type of the output coded frame is used as the excitation signal of said input speech signal frame, and the quantized prediction errors of the four subframes of the coding type of the output coded frame are used as the quantized prediction errors of the four subframes of said input speech signal frame; namely:
If the coding type of the output coded frame is FR, Generic (common) half-rate HR, or voiced HR, the excitation signal of said coding type is the sum of the adaptive-codebook signal and the fixed-codebook signal after each is multiplied by its own quantized gain, and the quantized prediction error of each subframe is 20 times the logarithm of the quantized correction factor carried by the output coded frame. If the coding type of the output coded frame is unvoiced HR or unvoiced QR, the excitation signal of said coding type is the fixed-codebook signal multiplied by the quantized fixed-codebook gain, and the quantized prediction error of each subframe is the difference between the logarithm of the quantized fixed-codebook gain carried by the output coded frame and the predicted gain. If the coding type of the output coded frame is CNG-QR or CNG-ER, the excitation signal of said coding type is the initial value it is reset to, and the quantized prediction errors of the subframes of the immediately preceding output coded frame are used as the quantized prediction errors of the subframes of the current coding type.
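The per-coding-type excitation and quantized-prediction-error rules of claim 6 can be written out as a small numeric sketch. All argument names, the placeholder reset value, and the window into the gain data are assumptions for illustration, not quantities from the specification:

```python
import math

def excitation_and_error(coding_type, adaptive, fixed, g_pitch, g_code,
                         correction=None, pred_gain=None, prev_errors=None):
    """Sketch of claim 6's excitation and subframe quantized prediction error.

    `adaptive`/`fixed` are codebook sample lists, `g_pitch`/`g_code` the
    quantized gains, `correction` the quantized correction factor,
    `pred_gain` the predicted gain, `prev_errors` the previous frame's
    subframe errors.  All names are illustrative placeholders.
    """
    if coding_type in ("FR", "Generic HR", "Voiced HR"):
        # sum of gain-scaled adaptive- and fixed-codebook contributions
        exc = [g_pitch * a + g_code * f for a, f in zip(adaptive, fixed)]
        err = 20.0 * math.log10(correction)      # 20 * log10(correction factor)
    elif coding_type in ("Unvoiced HR", "Unvoiced QR"):
        exc = [g_code * f for f in fixed]        # fixed codebook only
        err = 20.0 * math.log10(g_code) - pred_gain  # log gain minus predicted gain
    else:  # CNG-QR / CNG-ER
        exc = [0.0] * len(fixed)                 # excitation reset to an initial value
        err = prev_errors                        # previous frame's errors reused
    return exc, err
```
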
7. A method for determining the coding type of an output VMR-WB coded frame, characterized in that:
Linear prediction, adaptive-codebook search, and innovation-codebook search are performed at FR on the digital speech sample frame, or on the digital signal frame obtained by preprocessing it, to obtain an excitation signal, and this excitation signal is filtered by the linear-prediction synthesis filter determined by the linear prediction to obtain a synthetic digital speech signal frame. Voice activity detection VAD is performed on this synthetic digital speech signal frame; when the VAD result is speech present, voiced/unvoiced detection is performed on this synthetic frame; when the voiced/unvoiced detection result is voiced, stable-voiced detection is performed on this synthetic frame. When the VAD result is no speech, a VMR-WB coded frame is generated by encoding with the CNG-ER coding type; when the voiced/unvoiced detection result is unvoiced, by encoding at the unvoiced HR or unvoiced QR rate; when the stable-voiced detection result is stable voiced, by encoding with voiced HR; when the stable-voiced detection result is unstable voiced, by encoding with FR.
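The decision order of claim 7, a single FR analysis pass followed by VAD, voiced/unvoiced, and stable-voiced detection gating one another, can be sketched with caller-supplied detector callables (an illustrative assumption; the claim does not fix the detector interfaces):

```python
def claim7_rate(synth_frame, vad, is_voiced, is_stable_voiced):
    """Sketch of claim 7's rate decision over one FR synthetic frame.

    `vad`, `is_voiced` and `is_stable_voiced` are detector callables
    applied to the synthetic digital speech signal frame; the returned
    labels are illustrative, not spec identifiers.
    """
    if not vad(synth_frame):
        return "CNG-ER"            # no speech: comfort noise
    if not is_voiced(synth_frame):
        return "Unvoiced HR/QR"    # speech but unvoiced
    if is_stable_voiced(synth_frame):
        return "Voiced HR"         # stable voiced
    return "FR"                    # unstable voiced
```
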
8. A method for determining the coding type of an output VMR-WB coded frame in the AMR-WB interoperable mode, characterized in that:
Linear prediction, adaptive-codebook search, and innovation-codebook search are performed at FR on the digital speech sample frame, or on the digital signal frame obtained by preprocessing it, to obtain an excitation signal, and this excitation signal is filtered by the linear-prediction synthesis filter determined by the linear prediction to obtain a synthetic digital speech signal frame. Voice activity detection VAD is performed on this synthetic digital speech signal frame; when the VAD result is speech present, voiced/unvoiced detection is performed on this synthetic frame. When the VAD result is no speech, a VMR-WB coded frame is generated by encoding with the CNG-ER or CNG-QR coding type; when the voiced/unvoiced detection result is unvoiced, the VMR-WB coded frame is generated by encoding at the unvoiced HR rate; when the voiced/unvoiced detection result is voiced, by encoding with FR.
9. A method for determining the coding type of an output VMR-WB coded frame in the AMR-WB interoperable mode, characterized in that:
Linear prediction, adaptive-codebook search, and innovation-codebook search are performed at FR on the digital speech sample frame, or on the digital signal frame obtained by preprocessing it, to obtain an excitation signal, and this excitation signal is filtered by the linear-prediction synthesis filter determined by the linear prediction to obtain a synthetic digital speech signal frame. Voice activity detection VAD is performed on this synthetic digital speech signal frame; when the VAD result is no speech, a VMR-WB coded frame is generated by encoding with the CNG-ER or CNG-QR coding type; when the VAD result is speech present, the VMR-WB coded frame is generated by encoding at the FR rate.
10. The method according to any one of claims 1 to 9, characterized in that:
If, in the synthetic digital speech signal frame on which said voice activity detection is based, the amplitude of the rising edge of a peak in the waveform exceeds a threshold, the result of said voice activity detection is set to speech present.
11. The method according to any one of claims 1 to 9, characterized in that:
If, in the synthetic digital speech signal frame on which said voice activity detection is based, the amplitudes of the rising edge and the falling edge of a peak in the waveform each exceed the threshold set for it, the result of said voice activity detection is set to speech present.
12. The method according to claim 10, characterized in that:
Said threshold is determined from the synthetic digital speech signal frame on which said voice activity detection is based.
13. The method according to claim 11, characterized in that:
Said separately set thresholds are determined from the synthetic digital speech signal frame on which said voice activity detection is based.
14. The method according to any one of claims 1 to 7, characterized in that:
A window size, a short-time-average-energy threshold, and a count threshold for threshold crossings are set; when said voiced/unvoiced detection finds that the number of times the short-time average energy of this window in the synthetic digital speech signal frame on which it is based exceeds the short-time-average-energy threshold itself exceeds the count threshold, the voiced/unvoiced detection result for this frame is set to voiced.
15. The method according to any one of claims 1 to 7, characterized in that:
A window size, a short-time-average-magnitude threshold, and a count threshold for threshold crossings are set; when said voiced/unvoiced detection finds that the number of times the short-time average magnitude of this window in the synthetic digital speech signal frame on which it is based exceeds the short-time-average-magnitude threshold itself exceeds the count threshold, the voiced/unvoiced detection result for this frame is set to voiced.
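A minimal sketch of the windowed short-time test of claims 14 and 15, assuming non-overlapping windows (the claims do not fix the window stepping) and using the energy variant of claim 14:

```python
def is_voiced_by_short_time_energy(frame, win, energy_thresh, count_thresh):
    """Sketch of claims 14/15: count how many windows' short-time average
    energy exceeds a threshold; flag the frame when the count itself
    exceeds a count threshold.  Non-overlapping windows are an assumption.
    """
    count = 0
    for start in range(0, len(frame) - win + 1, win):
        window = frame[start:start + win]
        avg_energy = sum(x * x for x in window) / win  # short-time average energy
        if avg_energy > energy_thresh:
            count += 1
    return count > count_thresh
```

Claim 15 is the same test with `sum(abs(x) for x in window) / win` (short-time average magnitude) in place of the energy.
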
16. The method according to any one of claims 1 to 7, characterized in that:
A window size is set; when said stable-voiced detection finds that the maximum short-time average magnitude of this window in the synthetic digital speech signal frame on which it is based exceeds the maximum short-time average magnitude of this window in several synthetic digital speech signal frames preceding this frame, the stable-voiced detection result for this frame is set to unstable voiced.
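A minimal sketch of claim 16's stable-voiced test, assuming non-overlapping windows and a caller-supplied list of preceding synthetic frames (the claim fixes neither the window stepping nor the history length):

```python
def is_unstable_voiced(frame, history, win):
    """Sketch of claim 16: a frame is flagged unstable voiced when its
    maximum windowed short-time average magnitude exceeds the maximum
    over the same window size in the preceding synthetic frames."""
    def max_window_avg_magnitude(samples):
        return max(
            sum(abs(x) for x in samples[s:s + win]) / win
            for s in range(0, len(samples) - win + 1, win)
        )
    current = max_window_avg_magnitude(frame)
    previous = max(max_window_avg_magnitude(f) for f in history)
    return current > previous   # sudden magnitude jump: transitional, not stable
```
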
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710153938.7A CN101359978B (en) | 2007-07-30 | 2007-09-14 | Method for control of rate variant multi-mode wideband encoding rate |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710044336.8 | 2007-07-30 | ||
CN200710044336 | 2007-07-30 | ||
CN200710153938.7A CN101359978B (en) | 2007-07-30 | 2007-09-14 | Method for control of rate variant multi-mode wideband encoding rate |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101359978A CN101359978A (en) | 2009-02-04 |
CN101359978B true CN101359978B (en) | 2014-01-29 |
Family
ID=40331904
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200710153938.7A Expired - Fee Related CN101359978B (en) | 2007-07-30 | 2007-09-14 | Method for control of rate variant multi-mode wideband encoding rate |
CNA2008100966172A Pending CN101399043A (en) | 2007-07-30 | 2008-04-29 | Self-adapting multi-speed narrowband coding method and coder |
CNA2008100882656A Pending CN101359474A (en) | 2007-07-30 | 2008-04-29 | AMR-WB coding method and encoder |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008100966172A Pending CN101399043A (en) | 2007-07-30 | 2008-04-29 | Self-adapting multi-speed narrowband coding method and coder |
CNA2008100882656A Pending CN101359474A (en) | 2007-07-30 | 2008-04-29 | AMR-WB coding method and encoder |
Country Status (1)
Country | Link |
---|---|
CN (3) | CN101359978B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102254562B (en) * | 2011-06-29 | 2013-04-03 | 北京理工大学 | Method for coding variable speed audio frequency switching between adjacent high/low speed coding modes |
CN102385863B (en) * | 2011-10-10 | 2013-02-20 | 杭州米加科技有限公司 | Sound coding method based on speech music classification |
CN107978325B (en) * | 2012-03-23 | 2022-01-11 | 杜比实验室特许公司 | Voice communication method and apparatus, method and apparatus for operating jitter buffer |
CN102723968B (en) * | 2012-05-30 | 2017-01-18 | 中兴通讯股份有限公司 | Method and device for increasing capacity of empty hole |
WO2014192604A1 (en) * | 2013-05-31 | 2014-12-04 | ソニー株式会社 | Encoding device and method, decoding device and method, and program |
CN103337243B (en) * | 2013-06-28 | 2017-02-08 | 大连理工大学 | Method for converting AMR code stream into AMR-WB code stream |
KR101621780B1 (en) * | 2014-03-28 | 2016-05-17 | 숭실대학교산학협력단 | Method for judgment of drinking using differential frequency energy, recording medium and device for performing the method |
CN105609118B (en) * | 2015-12-30 | 2020-02-07 | 生迪智慧科技有限公司 | Voice detection method and device |
CN110444192A (en) * | 2019-08-15 | 2019-11-12 | 广州科粤信息科技有限公司 | A kind of intelligent sound robot based on voice technology |
CN110619881B (en) * | 2019-09-20 | 2022-04-15 | 北京百瑞互联技术有限公司 | Voice coding method, device and equipment |
CN111429927B (en) * | 2020-03-11 | 2023-03-21 | 云知声智能科技股份有限公司 | Method for improving personalized synthesized voice quality |
CN113611325B (en) * | 2021-04-26 | 2023-07-04 | 珠海市杰理科技股份有限公司 | Voice signal speed change method and device based on clear and voiced sound and audio equipment |
CN113345446B (en) * | 2021-06-01 | 2024-02-27 | 广州虎牙科技有限公司 | Audio processing method, device, electronic equipment and computer readable storage medium |
CN115711591B (en) * | 2022-09-29 | 2024-03-15 | 成都飞机工业(集团)有限责任公司 | Gamma factor acquisition method, device, equipment and medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1424712A (en) * | 2002-12-19 | 2003-06-18 | 北京工业大学 | Method for encoding 2.3 kb/s harmonic-excited linear prediction speech |
CN1632862A (en) * | 2004-12-31 | 2005-06-29 | 苏州大学 | A Low Bit Variable Rate Speech Coder |
Also Published As
Publication number | Publication date |
---|---|
CN101359978A (en) | 2009-02-04 |
CN101359474A (en) | 2009-02-04 |
CN101399043A (en) | 2009-04-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
DD01 | Delivery of document by public notice |
Addressee: Xiang Wei Document name: Notification of an Office Action |
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20140129 Termination date: 20140914 |
EXPY | Termination of patent right or utility model |