CN101399043A - Self-adapting multi-speed narrowband coding method and coder - Google Patents
Abstract
The invention provides an adaptive multi-rate (AMR) encoder and a coding method. Compared with the prior art, voice activity detection (VAD) is changed substantially: the object of voice activity detection is the digital speech synthesized from the corresponding coded frame. The overall framework of the adaptive multi-rate encoder and coding method is updated to accommodate this change, so that the sound signal synthesized by the decoder more accurately reflects the auditory effect of the original sound. The invention can be applied directly to the speech coding technology of the third-generation mobile communication system, namely the Universal Mobile Telecommunications System (UMTS).
Description
Technical field
The present invention relates to an adaptive multi-rate narrowband (AMR-NB) encoder and its coding method, and in particular to voice activity detection (VAD) in the AMR-NB encoder and to techniques for AMR-NB coding of consecutive speech signal frames.
Background art
Code-excited linear prediction (CELP) coders have been widely used since they were proposed in 1985. The vocoders of both code division multiple access (CDMA) systems and the Universal Mobile Telecommunications System (UMTS) use CELP coding techniques.
Code-excited linear prediction comprises linear prediction and quantization, adaptive codebook search, and fixed codebook search. Because speech itself contains silent periods, the transmission rate of speech data can be compressed effectively by lowering the data rate during those periods; Qualcomm's variable-rate vocoder patent, application number 92104618.9, is one scheme based on this approach.
UMTS uses adaptive multi-rate (AMR) speech coding, a speech compression coding scheme specified by 3GPP (the 3rd Generation Partnership Project) for third-generation mobile communication. AMR speech coding is further divided into adaptive multi-rate narrowband (AMR-NB), adaptive multi-rate wideband (AMR-WB), and extended adaptive multi-rate wideband (AMR-WB+) speech coding, all of which are based on code-excited linear prediction. The CELP coder used in the AMR codec divides a speech signal frame into several subframes and performs linear prediction and quantization, adaptive codebook search and quantization, and fixed codebook search and quantization. AMR-NB speech coding supports eight speech-mode coding rates, 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15, and 4.75 kb/s (kilobits per second), plus a low-rate background-noise mode at 1.80 kb/s. Table 1 in section 5 of 3GPP TS 26.071-500 gives the codec modes corresponding to these rates: AMR_12.20, AMR_10.20, AMR_7.95, AMR_7.40, AMR_6.70, AMR_5.90, AMR_5.15, AMR_4.75, and AMR_SID.
Linear prediction and quantization comprise: forming a sequence from the sampled speech signal frame, or from the preprocessed speech signal frame, and multiplying the samples of this sequence by a window function to obtain a windowed speech data frame; computing a set of autocorrelation coefficients from the windowed frame; computing a set of linear prediction coefficients from the autocorrelation coefficients with the Levinson-Durbin algorithm; transforming the linear prediction coefficients into another spectral domain; and quantizing the transformed coefficients according to the rate given in the coding command, for example as a set of 10th-order line spectral pair (LSP) values or a set of 16th-order immittance spectral pair (ISP) values. Line spectral pairs are explained in the article "Line Spectrum Pair (LSP) and Speech Data Compression" published at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) '84, in Qualcomm's variable-rate vocoder patent (application number 92104618.9), and in 3GPP TS (technical specification) 26.090 and 3GPP2 C.S0014-A.
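The first three steps of this chain (windowing, autocorrelation, Levinson-Durbin recursion) can be illustrated with a minimal sketch. It assumes a plain symmetric Hamming window rather than the windows actually specified for AMR, uses the sign convention A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, and all function names are hypothetical.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (illustrative; not the AMR window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, order):
    """r[k] = sum over n of frame[n] * frame[n - k], for k = 0..order."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Returns (a, err): the coefficient list with a[0] == 1.0 and the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # error energy shrinks at every stage
    return a, err
```

For a frame `s` of samples, `levinson_durbin(autocorrelation([x * w for x, w in zip(s, hamming(len(s)))], 10), 10)` would give a 10th-order coefficient set; the conversion to LSP values and the quantization are not sketched here.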
In code-excited linear prediction (CELP), the best codebook vectors found by the adaptive codebook search and the fixed codebook search are each multiplied by their optimal gains and then added; the sum is the excitation signal. The excitation signal is indispensable to the coding process: CELP searches for the excitation-based synthesized speech that has the minimum error relative to the original speech.
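As a small illustration of how the excitation is formed from the two codebook contributions, the sketch below adds the gain-scaled vectors for one subframe. The function and parameter names are hypothetical and chosen only for this example.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_pitch, gain_code):
    """Excitation of one subframe: u(n) = g_p * v(n) + g_c * c(n)."""
    if len(adaptive_vec) != len(fixed_vec):
        raise ValueError("codebook vectors must have the same length")
    return [gain_pitch * v + gain_code * c
            for v, c in zip(adaptive_vec, fixed_vec)]
```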
3GPP TS 26.090 describes the adaptive codebook search of AMR-NB, for example in section 5.6 of version TS 26.090-310. The adaptive codebook search comprises a closed-loop pitch search based on the past excitation, followed by computation of the adaptive codebook vector by interpolating the past excitation at the selected integer and fractional pitch delay. The adaptive codebook parameters obtained by the search are the adaptive codebook excitation vector, the integer and fractional pitch delay, the adaptive codebook gain, and the quantized adaptive codebook gain.
The closed-loop pitch search is performed by minimizing the mean-square weighted error between the original speech and the reconstructed speech. This requires finding the minimum among the mean-square weighted errors that correspond to the delay values in the search range; the error for each delay value is determined by the adaptive codebook search target signal and by the response of the weighted synthesis filter to the past excitation. For AMR-NB, section 5.6 of 3GPP TS 26.090-310 explains that the optimal integer delay is obtained by first finding the integer delay k that maximizes the term R(k) given by formula (1):
$$R(k) = \frac{\sum_{n=0}^{39} x(n)\, y_k(n)}{\sqrt{\sum_{n=0}^{39} y_k(n)\, y_k(n)}} \qquad (1)$$
where x(n) is the target signal of the adaptive codebook search and y_k(n) is the past filtered excitation at integer delay k. Fractional delays around the optimal integer delay are evaluated by interpolating the normalized term R(k), and the fractional delay that gives the maximum is the optimal fractional delay. Excitation values are stored in the excitation buffer u(n), n = -(143+11), ..., 39; during the search stage the values u(n), n = 0, 1, ..., 39, are the linear prediction (LP) residual. The excitation signal of each subframe is obtained by scaling the adaptive codebook vector of the current subframe by the quantized adaptive codebook gain, scaling the fixed codebook vector by the quantized fixed codebook gain, and adding the two; see section 5.9 of 3GPP TS 26.090-310, where equation (64) gives the mathematical expression of the excitation.
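The integer-delay stage of this search can be sketched as follows, assuming the criterion is the normalized correlation between the target x(n) and the filtered past excitation y_k(n), that is, the sum of x(n) y_k(n) divided by the square root of the sum of y_k(n) squared. The dictionary of precomputed y_k vectors is a hypothetical simplification (a real encoder derives y_k for successive delays recursively), and fractional-delay interpolation is left out.

```python
import math

def normalized_correlation(target, filtered_past):
    """R(k) = sum x(n) y_k(n) / sqrt(sum y_k(n)^2) for one candidate delay."""
    num = sum(x * y for x, y in zip(target, filtered_past))
    den = math.sqrt(sum(y * y for y in filtered_past))
    return num / den if den > 0.0 else 0.0

def best_integer_delay(target, filtered_by_delay):
    """Return the delay k whose filtered past excitation maximizes R(k).

    filtered_by_delay maps each candidate integer delay k to its vector y_k
    (hypothetical container used only for this sketch).
    """
    return max(filtered_by_delay,
               key=lambda k: normalized_correlation(target, filtered_by_delay[k]))
```

Because R(k) is normalized by the energy of y_k only, a candidate that matches the shape of the target wins regardless of its own scale.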
The fixed codebook search of AMR-NB is described in detail in section 5.7 of 3GPP TS 26.090-500. The AMR-NB fixed codebook is an algebraic codebook; the fixed codebook parameters obtained by the search are the fixed codebook vector, the fixed codebook gain, and the quantized fixed codebook gain.
In AMR-NB speech decoding, the LP (linear prediction) filter parameters are decoded for every frame to form the LP filter coefficients of each subframe, which are used to reconstruct the speech signal of that subframe. The excitation of each subframe is constructed by scaling the adaptive codebook vector by the adaptive codebook gain, scaling the fixed codebook vector by the fixed codebook gain, and adding the two. Here the adaptive codebook gain and the fixed codebook vector are the quantized values looked up in the quantization tables with the decoded adaptive codebook gain index and fixed codebook index. The AMR-NB adaptive codebook vector is synthesized from the excitation of the previous subframe: the decoded adaptive codebook index yields the integer and fractional pitch delay, and the excitation of the previous subframe is interpolated at that delay to obtain the adaptive codebook vector.
Fixed codebook gain quantization in AMR-NB comprises a predicted fixed codebook gain, obtained from the quantified prediction errors of previous subframes, and the quantization of the correction factor between the fixed codebook gain and the predicted fixed codebook gain.
The encoder and the decoder agree on the same value for the quantified prediction error of a subframe. It may be, for example, the averaged logarithmic frame energy of the last coded signal frame, the logarithm of the correction factor of that last coded frame scaled by a fixed ratio, or each side's own quantified prediction error from the previous frame.
3GPP TS 26.090 describes fixed codebook gain quantization for AMR-NB. For example, equations (54) and (56) in section 5.8 of TS 26.090-310, given here as formulas (3) and (4), show how the quantified prediction error affects the predicted fixed codebook gain:
$$\tilde{E}(n) = \sum_{i=1}^{4} b_i\, \hat{R}(n-i) \qquad (3)$$
$$g'_c = 10^{\,0.05\,(\tilde{E}(n) + \bar{E} - E_I)} \qquad (4)$$
Formula (3) defines the predicted energy $\tilde{E}(n)$ of subframe n, where [b1 b2 b3 b4] = [0.68 0.58 0.34 0.19] are the moving-average (MA) prediction coefficients and $\hat{R}(k)$ is the quantified prediction error of subframe k. Formula (4) defines the predicted fixed-codebook gain $g'_c$, where $\bar{E}$ is the mean of the innovation energy, a constant that depends on the mode (for example 36 dB at 12.2 kb/s), and $E_I$ is the mean innovation energy. The correction factor between the fixed codebook gain and the predicted fixed codebook gain is the ratio of the former to the latter. Equation (58) in section 5.8 of TS 26.090-310 shows that the prediction error R(n) is 20 times the logarithm of this correction factor, and the quantified prediction error is 20 times the logarithm of the quantized correction factor.
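The gain prediction described above can be traced numerically with a short sketch. It assumes the usual form of the TS 26.090 predictor: the predicted energy is the MA-weighted sum of the four most recent quantified prediction errors, the mean innovation energy is 10 log10 of the mean squared code-vector sample, and the predicted gain is 10 raised to 0.05 times (predicted energy + mode constant - mean innovation energy). Function names are hypothetical, and only the 36 dB constant for 12.2 kb/s comes from the text.

```python
import math

MA_COEFFS = [0.68, 0.58, 0.34, 0.19]   # b1..b4 as given in the text

def predicted_energy(past_errors):
    """Sum of b_i * R^(n - i); past_errors[0] is the most recent subframe."""
    return sum(b * r for b, r in zip(MA_COEFFS, past_errors))

def mean_innovation_energy(code_vec):
    """E_I in dB: 10 log10 of the mean squared sample (code_vec must be non-zero)."""
    return 10.0 * math.log10(sum(c * c for c in code_vec) / len(code_vec))

def predicted_gain(past_errors, code_vec, mean_energy_db=36.0):
    """g'_c = 10 ** (0.05 * (E~(n) + E_mean - E_I)); 36 dB is the 12.2 kb/s value."""
    exponent = (predicted_energy(past_errors) + mean_energy_db
                - mean_innovation_energy(code_vec))
    return 10.0 ** (0.05 * exponent)

def prediction_error(gain, gain_pred):
    """R(n) = 20 log10 of the correction factor g_c / g'_c."""
    return 20.0 * math.log10(gain / gain_pred)
```

With all past errors at zero and a unit-energy code vector, the predicted gain reduces to 10 ** (0.05 * 36), which shows how the mode constant alone sets the starting point of the predictor.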
Section 5.2, Frame energy calculation, of 3GPP TS 26.092-500 explains how the averaged logarithmic frame energy is computed from the frame energies of the preceding frames:
$$en_{\log}(i) = \frac{1}{2}\log_2\!\left(\frac{1}{N}\sum_{n=0}^{N-1} s^2(n)\right)$$
$$en_{\log}^{mean}(i) = \frac{1}{8}\sum_{n=0}^{7} en_{\log}(i-n)$$
where s(n) is the high-pass-filtered (preprocessed) signal of input speech frame i, N is the number of samples in the frame, $en_{\log}(i)$ is the logarithmic frame energy of the current frame i, and $en_{\log}^{mean}(i)$ is the averaged logarithmic frame energy of the current frame i. The averaged logarithmic frame energy is quantized to a 6-bit energy index that is carried in the SID frame.
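A minimal sketch of this energy computation follows. It assumes that the logarithmic frame energy is half the base-2 logarithm of the mean squared sample and that the average runs over the 8 most recent frames; the small floor inside the logarithm and the averaging over fewer frames at start-up are simplifications of this sketch, and the 6-bit quantizer is not modeled.

```python
import math
from collections import deque

def log_frame_energy(frame):
    """0.5 * log2 of the mean squared sample of one preprocessed frame."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 0.5 * math.log2(max(mean_sq, 1e-12))   # floor avoids log2(0); an assumption

class EnergyAverager:
    """Running mean of the logarithmic frame energy over the last `history` frames."""

    def __init__(self, history=8):
        self._values = deque(maxlen=history)

    def update(self, frame):
        """Add one frame and return the current averaged logarithmic energy."""
        self._values.append(log_frame_energy(frame))
        return sum(self._values) / len(self._values)
```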
After a sampled digital speech frame has been preprocessed and then passed through linear prediction and quantization, adaptive codebook search, and fixed codebook search, the formants of the resulting synthesized digital speech frame are determined mainly by the linear prediction analysis (LPC). More precisely, for AMR-NB, once the line spectral pairs (LSP) have been converted to prediction (LP) coefficients, the 10th-order linear prediction synthesis filter is defined by formula (7):
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{10} \hat{a}_i z^{-i}} \qquad (7)$$
where $\hat{a}_i$, i = 1, ..., 10, are the quantized prediction (LP) coefficients.
For AMR-NB and AMR-WB, the synthesized digital speech frame is the output obtained by passing the excitation signal through the linear prediction synthesis filter. The poles of the synthesis filter therefore correspond to the frequencies and bandwidths of the formants of the synthesized digital speech frame; these formants show up in the intensity of the time-domain waveform and strongly affect how the speech sounds.
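The filtering step itself can be sketched as an all-pole recursion, assuming the synthesis filter has the form 1 / (1 + a_1 z^-1 + ... + a_m z^-m), so that each output sample is the excitation sample minus the weighted sum of the previous m outputs. The function name and the explicit memory argument are hypothetical; the memory is what a coder would carry from one subframe to the next.

```python
def synthesis_filter(excitation, lp_coeffs, memory=None):
    """All-pole synthesis: s(n) = u(n) - sum_{i=1..m} a_i * s(n - i).

    lp_coeffs holds a_1..a_m; memory holds the last m outputs, most recent
    first. Returns (output, new_memory).
    """
    order = len(lp_coeffs)
    mem = list(memory) if memory is not None else [0.0] * order
    out = []
    for u in excitation:
        s = u - sum(a * past for a, past in zip(lp_coeffs, mem))
        out.append(s)
        mem = ([s] + mem)[:order]          # shift the newest output into the memory
    return out, mem
```

Feeding an impulse through a single-pole filter with a_1 = -0.5 gives the decaying sequence 1, 0.5, 0.25, ..., which is the time-domain trace of that pole.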
According to the paper "Linear Prediction: A Tutorial Review", Proc. IEEE (Proceedings of the Institute of Electrical and Electronics Engineers), 1975, 63(4): 561-580, the peaks of the spectral envelope obtained by linear prediction are often pulled toward the nearest harmonic peak and away from the true formant position. In other words, the spectral envelope of the synthesized digital speech frame produced by the linear prediction synthesis filter is not fully consistent with the spectral envelope of the original digital speech signal frame.
Section 5.3.4, "Levinson recursion and its associated properties", of Discrete-Time Speech Signal Processing: Principles and Practice by Quatieri (USA), published by Electronic Industry Press in 2004, points out the following: the all-pole model and the autocorrelation method used in linear prediction place all poles of formula (7) inside the unit circle, so the filter is a minimum-phase system; the phase function of the Fourier transform of the autocorrelation-method solution of a sequence is distorted; the autocorrelation method of linear prediction turns the maximum-phase poles of the glottal flow into minimum-phase poles; and when a synthesized speech waveform is built, the phase distortion caused by the autocorrelation method may affect speech perception, that is, the waveform of the synthesized digital speech signal deviates from the waveform of the original digital speech signal. Section 5.6 of the same book, on synthesis based on all-pole modeling, points out that a signal synthesized with the autocorrelation method of linear prediction looks like speech but loses its absolute phase structure because of the minimum-phase property. As the example in Figure 5.18 of the book shows, the peaks of the reconstructed speech signal are more pronounced than those of the original signal, and the idealized glottal flow, modeled as minimum phase, is time-reversed and has a steeper rising edge than the actual glottal flow.
The current voice activity detection (VAD) method of adaptive multi-rate vocoders first computes the difference between the level of the preprocessed input signal and the background-noise estimate and then computes a VAD decision threshold. The initial VAD decision is made by comparing the difference with the threshold: the frame is initially judged to contain speech when the difference exceeds the threshold and to contain no speech when the difference is less than or equal to the threshold. The final VAD decision combines the initial decision with the results of other detections, such as tone detection on the preprocessed digital speech signal.
The VAD of AMR-NB and AMR-WB is also combined with discontinuous transmission (DTX). Based on the VAD results of several input signal frames, DTX detects that a speech burst has ended and only then begins the discontinuous transmission of silence descriptor (SID) frames. 3GPP TS 26.093 describes one implementation of DTX.
DTX requires several (for example 8) consecutive frames to produce a SID frame when a speech burst ends. That is, after several (for example 7) consecutive input signal frames whose VAD result is no speech have been coded at the speech-mode rate, the following frame (for example the 8th) is coded as SID_FIRST to mark the end of the speech burst. Once the SID_FIRST frame has been sent, SID_UPDATE frames are sent periodically (for example every 8 frames) as long as there is still no speech; the first SID_UPDATE frame must be sent at a specific time after the SID_FIRST frame (for example at the 3rd frame). As an exception, when the VAD result of an input signal frame that follows speech frames is no speech and less than a certain time (for example 24 frames) has passed since the end of the previous speech burst, that frame is coded as a SID_FIRST frame.
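The frame-type sequence described in this paragraph can be sketched as a small state machine. This is a simplified illustration built only from the example numbers in the text (7 hangover frames, first SID_UPDATE at the 3rd frame after SID_FIRST, then one every 8 frames); it does not model the 24-frame exception, the label SPEECH and the function name are hypothetical, and it is not the TS 26.093 procedure itself.

```python
def dtx_frame_types(vad_flags, hangover=7, first_update_offset=3, update_period=8):
    """Map per-frame VAD flags (True = speech) to transmission types."""
    types = []
    silent_run = 0            # consecutive no-speech frames seen so far
    since_sid_first = None    # frames elapsed since SID_FIRST; None during speech
    for speech in vad_flags:
        if speech:
            silent_run = 0
            since_sid_first = None
            types.append("SPEECH")
            continue
        silent_run += 1
        if silent_run <= hangover:
            types.append("SPEECH")        # hangover: still coded at the speech rate
        elif since_sid_first is None:
            since_sid_first = 0
            types.append("SID_FIRST")     # marks the end of the speech burst
        else:
            since_sid_first += 1
            offset = since_sid_first - first_update_offset
            if offset >= 0 and offset % update_period == 0:
                types.append("SID_UPDATE")
            else:
                types.append("NO_DATA")
    return types
```

For one speech frame followed by a long silence, the sketch yields seven more speech-rate frames, then SID_FIRST, then SID_UPDATE frames separated by NO_DATA frames until speech resumes.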
Current adaptive multi-rate vocoders are essentially variable-rate, single-mode coders: given the rate specified in the coding command, the vocoder decides from its detection of the speech signal frame whether to code at the commanded rate or at the background-noise rate. Music signals such as songs contain almost no silent periods, so for them the encoder has no need for silence detection. Frequent use of silence descriptor (SID) frames does improve radio resource utilization, but it can also degrade voice quality. All current variable-rate adaptive multi-rate encoders generate silence descriptor frames automatically.
Summary of the invention
Technical problem to be solved
The synthesized digital speech frame generated from a coded frame produced by AMR coding with code-excited linear prediction does not have the same speech characteristics as the original digital speech signal frame. The background section already points this out: the formant peaks of the spectral envelope estimated by linear prediction analysis often deviate from the true formant positions; and the all-pole model and autocorrelation method used in linear prediction place all poles of the model inside the unit circle, which distorts the phase function of the Fourier transform of the synthesized digital speech signal and makes its waveform shape deviate from that of the original digital speech signal.
The object of VAD in existing AMR technology is the digital speech signal frame formed by sampling the speech input, or that frame after preprocessing. The peak positions in the waveform of the synthesized digital speech signal, produced by decoding the consecutive coded frames generated by linear prediction analysis and codebook excitation, often deviate from the peak positions in the waveform of the original (or preprocessed) digital speech signal used for VAD. This description gives a concrete example using the 3GPP AMR-NB vocoder. The file DTX4.INP (.INP is the file extension) is contained in DTX_400.zip, inside TS_AMR_500_DTX.zip, inside the 3GPP archive TS26.074-500.zip (.zip is the file extension). The largest peak of the speech waveform in DTX4.INP lies between 7.83 s and 7.84 s, and the frame that contains it does not coincide with the frame that contains the corresponding peak in the synthesized digital speech signal obtained by coding DTX4.INP at 12.2 kb/s and decoding the result:
As shown in Figure 6, the largest peak in frame 392 of the preprocessed digital speech signal of DTX4.INP (before 7.84 s in the figure) corresponds to the largest peak between 7.83 s and 7.84 s in the waveform of the speech signal in DTX4.INP. For the decoded synthesized digital audio signal, as shown in Figure 7, the corresponding waveform peak appears in frame 393 (after 7.84 s) of the synthesized digital speech signal produced by decoding the frames coded at 12.2 kb/s, one frame later than frame 392. If VAD uses short-time energy detection, it detects the waveform peak in frame 392 of the preprocessed digital speech signal; yet although the preprocessed signal is coded at 12.2 kb/s, frame 392 of the synthesized signal produced by decoding the coded frames contains no clearly audible waveform peak corresponding to the one in frame 392 of the original signal. The reason is that, in the AMR-NB encoder built according to 3GPP TS 26.073-530, VAD and speech-mode coding do not operate on exactly the same digital speech signal: the signal examined by VAD lies slightly later in time than the signal being coded, so VAD performs voice activity detection on digital speech that has not yet been coded.
Therefore a preprocessed digital speech signal frame and its corresponding synthesized digital speech signal frame do not necessarily have fully identical sound characteristics. The VAD result of the preprocessed (or sampled) digital speech signal frame used for VAD does not imply that the corresponding synthesized frame has the same VAD result, particularly when a formant detected in the input frame used for VAD is mapped by the coding operation onto the synthesized frame that corresponds to the adjacent following input frame.
As stated in the background section, existing VAD techniques do not detect formants in the preprocessed (or sampled) digital speech frame. The techniques in use, namely level detection in each of several frequency subbands, pitch detection, tone detection, and complex-signal detection, do not directly address formants. In AMR coding, however, the poles of the prediction synthesis filter built from the LP coefficients obtained by LPC correspond to formants and produce harmonic peaks that strongly affect how the speech sounds; the coding operation thus maps the frequency positions of the formants onto these harmonic peaks.
When the speech signal is very weak, the amplitude and energy of its formants are so small that they are almost drowned by background noise. That is, in the raw sampled or preprocessed digital speech signal the level or energy of the background noise is close to that of the weak formants, so the VAD result is no speech, and multi-subband level detection, pitch detection, and tone detection cannot detect them either. In the prior art VAD is performed before the pitch delay parameters and the innovative codebook are computed, and LPC in existing AMR technology is not used to detect the frequencies and bandwidths of the poles that correspond to formants, let alone the amplitude and energy of the waveform at the peaks associated with the prediction synthesis filter poles, even though that amplitude and energy strongly affect speech perception.
The present invention aims to eliminate the adverse effect on VAD of the mismatch in speech characteristics between the input signal frame before coding and the synthesized digital signal frame obtained by decoding the coded frame, and the adverse effect of the mismatch in waveform characteristics between the two. For example, in the DTX4.INP case above, the VAD result for frame 392 of the preprocessed digital speech signal is speech while that for frame 393 is no speech, which can cause frame 392 to be coded at the speech-mode rate and frame 393 at the background-noise rate; the largest waveform peak, that of frame 392, is then not reflected in the synthesized digital signal frames of the variable-rate coding.
If voice activity detection is to be performed on the synthesized digital speech frame, then whether and how the parameters produced while generating that frame by linear prediction and codebook search (excitation signal, filter memory, filter error, and so on) can be used for coding the next frame is also a problem to be solved by the present invention.
Technical solution
Whether the digital speech frame obtained by decoding an AMR-NB coded frame contains speech can also be decided by performing voice activity detection on that frame. The present invention therefore performs voice activity detection directly on the synthesized digital speech signal frame of the AMR coded frame.
To ensure that the VAD process does not miss the harmonic peaks of the synthesized digital speech signal that correspond to formants of the original digital speech signal and strongly affect how it sounds, the present invention also directs this detection at the amplitude or energy of the output signal that the linear prediction synthesis filter produces from its input signal. The amplitude or energy of the waveform at the peaks of the original input signal that correspond to the synthesis filter poles cannot be detected directly; but as long as the amplitude, short-time energy, or average magnitude with which a harmonic peak of the synthesized signal spectrum appears in the time-domain waveform exceeds the specified detection threshold, that harmonic peak is not missed.
One VAD method proposed by the present invention detects whether the amplitude of the synthesized digital speech waveform exceeds a threshold; if it does, the synthesized digital speech signal is judged to contain speech. Waveform segments corresponding to formants of the original input digital speech signal are thus detected as soon as their amplitude exceeds the threshold, and the synthesized frame that contains them is not missed and replaced by a background-noise coded frame sent to the decoder. Another method detects whether the peak of the short-time average energy or of the short-time average magnitude of the synthesized digital speech signal exceeds a threshold; if it does, the synthesized signal is judged to contain speech, so such peaks corresponding to formants of the original input signal are detected rather than missed once they exceed the threshold.
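Both detection rules can be sketched on one synthesized frame as follows. The text gives no window length or threshold values, so `window` and the thresholds are hypothetical parameters, and the function names are illustrative only.

```python
def sliding_means(values, window):
    """Mean of every run of `window` consecutive values (window <= len(values))."""
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means

def amplitude_vad(synth_frame, amp_threshold):
    """Speech if any sample magnitude of the synthesized frame exceeds the threshold."""
    return max(abs(s) for s in synth_frame) > amp_threshold

def energy_vad(synth_frame, window, energy_threshold):
    """Speech if the peak of the short-time average energy exceeds the threshold."""
    return max(sliding_means([s * s for s in synth_frame], window)) > energy_threshold

def magnitude_vad(synth_frame, window, magnitude_threshold):
    """Speech if the peak of the short-time average magnitude exceeds the threshold."""
    return max(sliding_means([abs(s) for s in synth_frame], window)) > magnitude_threshold
```

The amplitude rule reacts to a single large sample, while the two short-time rules require the peak to persist over `window` samples, so they are less sensitive to isolated spikes.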
The problems to be solved also include whether and how the excitation signal and other parameters, obtained while generating the synthesized digital speech frame or while encoding the AMR frame at a non-background-noise coding rate, are used when the next frame is encoded. One solution is to follow the method given by the 3GPP standard and keep only the parameters produced by the encoding process whose AMR-NB frame is actually sent to the receiver. That is, when the VAD result is speech, the excitation signal, filter memory signal, filtering error signal, quantized energy prediction error, and other parameters obtained while encoding the AMR frame at the non-background-noise coding rate are used when the next frame is encoded. When the VAD result is no speech, and the transmission type of the current frame is consequently set to silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE), or no data (NO_DATA), all parameters obtained while encoding the AMR frame at the non-background-noise coding rate are discarded, and the excitation signal and other parameters produced, after reset, by encoding the background-noise-rate frame are used when the next frame is encoded. This is also what 3GPP specifies for switching from the background-noise coding rate of the background-noise mode to a non-background-noise coding rate of the speech mode.
Consider the present invention's method of generating a synthesized digital speech frame at a speech-mode coding rate and using it as the object of VAD. On the one hand, generating the synthesized digital speech frame involves the linear prediction, codebook search, and other operations that are carried out without interruption in AMR-NB speech-mode encoding. On the other hand, when the VAD result is no speech and the encoder finally outputs a background-noise coded frame, encoding of an AMR-NB frame at the background-noise coding rate is also involved. Uninterrupted (for example, constant-rate) speech-mode coding sounds better than variable-rate coding that mixes speech mode and background-noise mode. Therefore, when speech-mode coding resumes after background-noise-mode coding, using the parameters produced by speech-mode coding (or by generating the synthesized digital speech frame) helps to improve speech quality.
The present invention therefore proposes another method. Where the same speech input frame has undergone dual-mode encoding, in speech mode (non-background-noise coding rate) and in non-speech mode (background-noise coding rate), and only the background-noise-rate coded frame is selected as the AMR-NB frame sent to the decoder, the parameters produced by the speech-mode encoding are selectively used for encoding the next frame. The present invention provides such a selection scheme.
The selection scheme of the present invention ensures that, after the encoder has finished encoding the AMR-NB frame of the current input signal frame and the decoder has finished decoding that AMR-NB frame, both sides hold the same excitation signal. The benefit is that, given identical excitation signals on both sides, as long as the line spectral frequency (LSF) parameters in the speech-mode AMR-NB frame, from which the linear prediction synthesis filter is constructed, are transmitted without error, the synthesized digital speech frame output by the linear prediction synthesis filter in response to the excitation signal is identical at the encoder and the decoder.
In the technical solution of the present invention in which the encoder and decoder keep the same excitation signal, the encoder determines the excitation signal from the AMR-NB frame it outputs. When the output frame is a background-noise-mode AMR-NB frame, the encoder resets the excitation signal to a fixed value agreed between the encoder and the decoder. When the output frame is a speech-mode AMR-NB coded frame, the encoder obtains the adaptive codebook signal by interpolating the excitation signal up to and including the previous subframe at the integer and fractional pitch delay. It scales this adaptive codebook signal by the quantized adaptive codebook gain, adds the fixed codebook signal scaled by the quantized fixed codebook gain, and takes the sum as the excitation signal.
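The final step of this construction, the gain-weighted sum of the adaptive and fixed codebook signals, can be sketched as below. The sketch assumes the adaptive codebook vector has already been obtained by interpolating the past excitation at the pitch delay; that interpolation is not shown. The function and parameter names are illustrative only.

```python
from typing import List, Sequence


def build_excitation(adaptive_vector: Sequence[float],
                     fixed_vector: Sequence[float],
                     adaptive_gain_q: float,
                     fixed_gain_q: float) -> List[float]:
    """Excitation for one subframe: gain-scaled adaptive plus gain-scaled fixed codebook signal.

    adaptive_gain_q and fixed_gain_q stand for the quantized gains, so an
    encoder and a decoder that hold the same vectors obtain the same result.
    """
    if len(adaptive_vector) != len(fixed_vector):
        raise ValueError("codebook vectors must have the same length")
    return [adaptive_gain_q * v + fixed_gain_q * c
            for v, c in zip(adaptive_vector, fixed_vector)]
```

Because only quantized quantities enter the sum, any decoder that receives the same frame and holds the same excitation history can reproduce the result exactly.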
A speech-mode AMR-NB coded frame contains the integer and fractional pitch delay, the quantized adaptive codebook gain, and the fixed codebook signal. It does not contain the fixed codebook gain directly; instead it contains the quantized coding parameter of the correction factor between the fixed codebook gain and the predicted fixed codebook gain g'c. Because the AMR-NB encoder and decoder have agreed on the same predicted fixed codebook gain g'c, both sides can agree on the excitation signal.
The AMR-NB encoder agrees on the predicted fixed codebook gain g'c with its AMR-NB decoder by agreeing on the quantized energy prediction error. The expression for g'c in formula (4) above shows the following. Only the predicted energy of the subframe is determined by the quantized energy prediction errors. The mean value E of the innovation energy depends only on the coding rate of the AMR-NB coded frame sent by the encoder, and the mean innovation energy E_I depends only on the fixed codebook signal; formula (55) in section 5.8 of TS26.090-310 explains this point. Hence, by obtaining the coding rate and the fixed codebook parameters of the AMR-NB coded frame, the adaptive multi-rate narrowband decoder arrives at exactly the same E and E_I as the AMR-NB encoder. If both sides also compute the predicted energy of the subframe from the same quantized energy prediction errors of four subframes, the predicted fixed codebook gain g'c is identical at the encoder and the decoder.
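The dependency structure described here can be sketched as follows. The overall shape (predicted energy from four quantized errors, a rate-dependent mean, and the fixed codebook energy) comes from the text. The prediction coefficients, the 0.05 exponent scale, and the multiplicative use of the correction factor are assumptions based on the usual form of this kind of predictor and should be checked against TS 26.090 before use.

```python
import math
from typing import Sequence

# Assumed moving-average prediction coefficients, most recent subframe first.
ASSUMED_MA_COEFFS = (0.68, 0.58, 0.34, 0.19)


def predicted_energy(quantized_errors: Sequence[float],
                     coeffs: Sequence[float] = ASSUMED_MA_COEFFS) -> float:
    """Predicted energy in dB from the last four quantized energy prediction errors."""
    if len(quantized_errors) != len(coeffs):
        raise ValueError("one quantized error is needed per coefficient")
    return sum(b * r for b, r in zip(coeffs, quantized_errors))


def innovation_energy_db(fixed_vector: Sequence[float]) -> float:
    """Mean energy of the fixed codebook vector in dB (vector must not be all zero)."""
    return 10.0 * math.log10(sum(c * c for c in fixed_vector) / len(fixed_vector))


def predicted_fixed_gain(quantized_errors: Sequence[float],
                         mean_energy_db: float,
                         fixed_vector: Sequence[float]) -> float:
    """Predicted fixed codebook gain; mean_energy_db depends only on the coding rate."""
    exponent = (predicted_energy(quantized_errors)
                + mean_energy_db - innovation_energy_db(fixed_vector))
    return 10.0 ** (0.05 * exponent)


def fixed_codebook_gain(correction_factor_q: float,
                        quantized_errors: Sequence[float],
                        mean_energy_db: float,
                        fixed_vector: Sequence[float]) -> float:
    """Fixed codebook gain as quantized correction factor times predicted gain (assumed relation)."""
    return correction_factor_q * predicted_fixed_gain(
        quantized_errors, mean_energy_db, fixed_vector)
```

Every input of `predicted_fixed_gain` is either carried in the coded frame (the fixed codebook vector), implied by the coding rate (the mean energy), or shared state (the four quantized errors), which is why agreement on the quantized errors is sufficient for agreement on the gain.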
The existing 3GPP standard provides one method by which the AMR-NB encoder and decoder agree on the quantized energy prediction error. When the frame sent by the AMR-NB encoder is a speech-mode AMR-NB coded frame, then, as explained by formula (58) in section 5.8 of TS26.090-310, the energy prediction error R(n) is set to 20 times the logarithm of the correction factor in that AMR-NB frame, and the quantized energy prediction error is 20 times the logarithm of the quantized correction factor. When the coded frame is a background-noise-rate frame, the quantized energy prediction error of the subframes on both the encoder and decoder sides is set from the averaged logarithmic energy, that is, the logarithmic mean of the quantized frame energy carried in the background-noise-rate AMR-NB coded frame. Section 5.2, Frame energy calculation, of 3GPP TS26.092-500 describes how this logarithmic mean of the frame energy is calculated from the frame energies of preceding frames.
This scheme for agreeing on the quantized energy prediction error between the AMR-NB encoder and decoder is not the only one. In the 3GPP AMR-WB scheme, for example, the encoder and decoder set the quantized energy prediction error from the correction factor in the transmitted speech-mode coded frame and thereby stay in agreement. In fact, because the method of the present invention generates a synthesized digital speech frame for every input signal frame, a correction factor can be generated for every input signal frame, and when the transmitted frame is a silence descriptor frame the correction factors of the four subframes of that frame can be sent to the decoder along with it. The encoder and decoder then keep their quantized energy prediction errors consistent without the 3GPP approach of sending a SID_UPDATE frame to set both sides' quantized energy prediction error to the logarithmic mean of the quantized frame energy, although this sends slightly more bits than sending the silence descriptor frame alone.
In adaptive multi-rate narrowband coding, not all 160 samples of the previous frame's excitation signal are needed. Because the 3GPP specification limits the pitch delay search range to within 143 samples, the excitation signal buffer defined in the standard holds only 154 samples. If compatibility with the existing 3GPP standard is all that is required, using 154 of the 160 samples is sufficient.
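A small sketch of the buffer bookkeeping this implies is given below. The figures 160 and 154 come from the text; keeping the most recent 154 samples (rather than some other subset) is an assumption, made because the pitch search looks backwards from the current subframe.

```python
from typing import List, Sequence

FRAME_LENGTH = 160       # samples per frame, as stated in the text
HISTORY_LENGTH = 154     # excitation buffer size, as stated in the text


def retained_history(frame_excitation: Sequence[float]) -> List[float]:
    """Keep only the part of a frame's excitation needed to encode the next frame.

    Assumption: the retained part is the most recent HISTORY_LENGTH samples.
    """
    if len(frame_excitation) != FRAME_LENGTH:
        raise ValueError("expected one full frame of excitation")
    return list(frame_excitation[-HISTORY_LENGTH:])
```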
The technical solution that performs voice activity detection on the synthesized digital audio signal is as follows:
A method of performing adaptive multi-rate narrowband (AMR-NB) coding, at a background-noise coding rate and at a non-background-noise coding rate, on an input signal frame in an input signal frame sequence, and of performing AMR-NB coding on the next input signal frame adjacent to that input signal frame, characterized in that:
An excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by encoding said input signal frame at said non-background-noise coding rate, a linear prediction synthesis filter is determined from the linear prediction parameters obtained by encoding said input signal frame at said non-background-noise coding rate, and the excitation signal is filtered by this linear prediction synthesis filter to generate a synthesized digital audio signal frame;
Voice activity detection is performed on said synthesized digital audio signal frame, and the transmission type signal for discontinuous transmission is determined from the result of said voice activity detection;
If said transmission type signal is normal speech, SPEECH_GOOD, the excitation signal of said input signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in the AMR-NB coded frame of said input signal frame at said non-background-noise coding rate; if said transmission type signal is not SPEECH_GOOD, the excitation signal of said input signal frame is reset;
The adjacent next input signal frame is encoded at a non-background-noise coding rate using the excitation signal of said input signal frame.
In the above method, the discontinuous transmission (DTX) control and operation module still produces a transmission type signal TX_TYPE for each frame in the input signal frame sequence, but the transmission type signal is determined from the result of voice activity detection performed on the synthesized digital audio signal frame. This differs from the prior art, which does not take the synthesized digital audio signal frame of the coded frame into account.
The above method gives both sides the same excitation signal, on the premise that the AMR-NB encoder and decoder maintain the same quantized energy prediction error. There are several ways to maintain the same quantized energy prediction error, listed below:
First, the encoder updates the quantized energy prediction error from the correction factor in the coded frame only when it sends a speech-mode AMR-NB frame, and leaves it unchanged at all other times. The decoder updates the quantized energy prediction error from the correction factor in the coded frame when it receives a speech-mode AMR-NB frame, and keeps it unchanged at all other times; that is, the quantized energy prediction error of the subframes of the previous input signal frame adjacent to said input signal frame is used as the quantized energy prediction error of the subframes of said input signal frame;
Second, the encoder updates the quantized energy prediction error from the correction factor in the coded frame only when it sends an AMR-NB frame, and leaves it unchanged at all other times; when it sends a SID frame, it also sends to the decoder the coded correction factors produced by encoding the speech-mode AMR-NB frame. The decoder updates the quantized energy prediction error from the correction factor in the coded frame when it receives a speech-mode AMR-NB frame; when it receives a SID frame, it receives the correction factors and updates the quantized energy prediction error from them; at all other times it keeps the quantized energy prediction error unchanged.
Third, an existing AMR-NB decoder continues to decode according to the narrowband decoding method specified by 3GPP. On the encoding side, when said transmission type signal is SPEECH_GOOD, the quantized energy prediction error of the subframes of said input signal frame is generated from the correction factor used in the AMR-NB coded frame of said input signal frame at said non-background-noise coding rate; when said transmission type signal is silence descriptor first, SID_FIRST, or silence descriptor update, SID_UPDATE, the quantized energy prediction error of the subframes of said input signal frame is generated from the logarithmic mean of the quantized frame energy of said input signal frame; when said transmission type signal is no data, NO_DATA, the quantized energy prediction error of the subframes of the previous input signal frame adjacent to said input signal frame is used as the quantized energy prediction error of the subframes of said input signal frame.
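The third option can be sketched as a single update function keyed on the transmission type. This is an illustration under stated assumptions: the function name and argument layout are hypothetical, the logarithm is taken as base 10, and the SID value is passed in ready-made because the text does not spell out any scaling applied to the logarithmic mean of the quantized frame energy.

```python
import math
from typing import List, Optional, Sequence


def update_quantized_errors(tx_type: str,
                            previous_errors: Sequence[float],
                            correction_factors_q: Optional[Sequence[float]] = None,
                            sid_log_energy_q: Optional[float] = None) -> List[float]:
    """Quantized energy prediction errors of the four subframes of the current frame."""
    if tx_type == "SPEECH_GOOD":
        if correction_factors_q is None or len(correction_factors_q) != 4:
            raise ValueError("four quantized correction factors are required")
        # 20 times the logarithm of the quantized correction factor, per subframe.
        return [20.0 * math.log10(g) for g in correction_factors_q]
    if tx_type in ("SID_FIRST", "SID_UPDATE"):
        if sid_log_energy_q is None:
            raise ValueError("the quantized logarithmic frame energy is required")
        # All four subframes take the value derived from the SID frame energy.
        return [sid_log_energy_q] * 4
    if tx_type == "NO_DATA":
        # Nothing is sent, so both sides simply carry the previous values over.
        return list(previous_errors)
    raise ValueError(f"unsupported transmission type: {tx_type}")
```

In every branch the update uses only information the decoder also has, which is what keeps the two sides in step.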
Because encoding a background-noise-mode AMR-NB frame requires neither the excitation signal nor the quantized energy prediction error of the previous frame, in the dual-mode method of the present invention described above the excitation signal and the quantized energy prediction error of said input signal frame are used only to encode the adjacent next input signal frame at a non-background-noise coding rate.
In the above technical solution, the adaptive codebook parameters and fixed codebook parameters used to generate the excitation signal come from the AMR-NB coded frame obtained by encoding the input signal frame at the non-background-noise coding rate, and the linear prediction parameters used to construct the linear prediction synthesis filter come from the same coded frame. These parameters are, however, available before the AMR-NB coded frame is generated: the linear prediction parameters are obtained after linear prediction, the adaptive codebook parameters after the adaptive codebook search, and the fixed codebook parameters after the fixed codebook search. This leads to the following AMR-NB encoder:
An adaptive multi-rate narrowband (AMR-NB) encoder having a discontinuous transmission (DTX) control and operation device, wherein said DTX control and operation device determines the transmission type TX_TYPE and the coding rate of the AMR-NB coded frame from the voice activity detection result, and wherein said AMR-NB encoder performs linear prediction on an input audio signal frame, encodes said input audio signal frame at said coding rate and outputs an AMR-NB transmit frame of type TX_TYPE, and generates the excitation signal of said input audio signal frame for use in encoding the next input audio signal frame, characterized in that:
A linear prediction synthesis filter is determined from the linear prediction parameters obtained by performing linear prediction on said input audio signal frame;
A speech-mode excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by performing the adaptive codebook search and the fixed codebook search on said input audio signal frame at a speech-mode coding rate;
Said speech-mode excitation signal is filtered by said linear prediction synthesis filter to generate a synthesized digital audio signal frame;
Said voice activity detection result is obtained by performing voice activity detection on said synthesized digital audio signal frame;
If said TX_TYPE is normal speech, SPEECH_GOOD, the AMR-NB transmit frame for the input audio signal frame is encoded from the adaptive codebook parameters and fixed codebook parameters obtained by said adaptive codebook search and fixed codebook search on the input audio signal frame, and the excitation signal of said input signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in this coded frame;
If said TX_TYPE is silence descriptor first, SID_FIRST, or silence descriptor update, SID_UPDATE, the AMR-NB transmit frame for the input signal frame is encoded at the background-noise coding rate, and the excitation signal of said input audio signal frame is reset;
If said TX_TYPE is no data, NO_DATA, the excitation signal of said input audio signal frame is reset.
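The per-frame control flow of the encoder described in the clauses above can be sketched as follows. All processing stages are passed in as callables because the text does not define their interfaces; every name here is hypothetical, and the reset value of 0.0 merely stands in for the fixed value agreed between encoder and decoder.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def encode_one_frame(
    frame: Sequence[float],
    speech_analysis: Callable[[Sequence[float]], Tuple[Any, List[float]]],
    synthesize: Callable[[Any, List[float]], List[float]],
    vad: Callable[[List[float]], bool],
    decide_tx_type: Callable[[bool], str],
    pack_speech: Callable[[Any], Any],
    pack_sid: Callable[[Sequence[float]], Any],
    reset_value: float = 0.0,  # assumed stand-in for the agreed reset value
) -> Tuple[str, Optional[Any], List[float]]:
    """Return (TX_TYPE, transmit frame or None, excitation kept for the next frame)."""
    # Linear prediction and codebook searches at a speech-mode rate come first.
    params, excitation = speech_analysis(frame)
    # The synthesis filter output, not the input frame, is what VAD examines.
    synthetic = synthesize(params, excitation)
    tx_type = decide_tx_type(vad(synthetic))

    if tx_type == "SPEECH_GOOD":
        return tx_type, pack_speech(params), excitation
    reset = [reset_value] * len(excitation)
    if tx_type in ("SID_FIRST", "SID_UPDATE"):
        return tx_type, pack_sid(frame), reset
    # NO_DATA: nothing is transmitted, and the excitation is still reset.
    return tx_type, None, reset
```

Because the searches run before the transmission type is known, the speech-mode parameters are simply reused for packing when the frame turns out to be SPEECH_GOOD, so no frame is encoded twice at a speech rate.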
Because the above AMR-NB encoder first performs voice activity detection (VAD), then determines TX_TYPE, and then chooses the coding rate according to TX_TYPE, it needs to encode only one AMR-NB frame for each input signal frame (and none for frames whose TX_TYPE is NO_DATA, which need not be sent). Because the AMR-NB encoder and decoder make their quantized energy prediction errors consistent on the basis of the AMR-NB coded frames passed between them, determining the quantized energy prediction error in this encoder is comparatively simple: when TX_TYPE is SPEECH_GOOD it is set from the correction factor, and when TX_TYPE is SID it is either set from the frame energy of the input signal frame (the AMR-NB method) or left unchanged (the AMR-WB method).
Setting the quantized energy prediction error from the frame energy of the input signal frame (the AMR-NB method) makes the encoder of the present invention compatible with decoders that follow the 3GPP AMR-NB standard. That is, the encoder comprises a device that determines the quantized energy prediction error of the four subframes of said input audio signal frame, which is needed to encode the speech-mode AMR-NB frame of the next input signal frame adjacent to said input audio signal frame, characterized in that the quantized energy prediction error of the four subframes of said input audio signal frame is determined according to the transmission type TX_TYPE of said input audio signal frame: when said transmission type is normal speech, SPEECH_GOOD, the device generates the quantized energy prediction error of the four subframes from the correction factors given in the AMR-NB coded frame of said input audio signal frame at the non-background-noise coding rate; when said TX_TYPE is silence descriptor first, SID_FIRST, or silence descriptor update, SID_UPDATE, the device sets the quantized energy prediction error of the four subframes to the logarithmic mean of the quantized frame energy of said input audio signal frame; and when said transmission type is no data, NO_DATA, the quantized energy prediction error of the subframes of the previous input audio signal frame adjacent to said input audio signal frame is used as the quantized energy prediction error of the subframes of said input audio signal frame.
The most obvious difference between the coding method used in the encoder of the present invention and prior-art coding methods is that the object of VAD is extended to the synthesized digital speech signal, so that the features that formants produce in the waveform of the synthesized digital speech signal can be used to detect speech. The VAD that the encoder of the present invention applies to the synthesized digital speech signal therefore includes detection on the waveform of the synthesized digital audio signal frame.
Because the synthesized digital speech signal has higher energy at the resonance peaks that correspond to the poles of the prediction synthesis filter, the amplitude of its peaks can be examined when voice activity detection is performed on a synthesized digital speech frame. If the amplitudes of both the rising edge and the falling edge of a peak, or of either one of them, exceed a threshold, the frame is judged to contain speech. Thus, once the oscillation that a resonance peak associated with such a pole produces in the waveform has a peak amplitude above the threshold, the synthesized digital speech frame is not missed by VAD. Where, as noted in the background art, the peaks of the synthesized digital speech signal are more prominent than those of the original signal, these prominent spikes are even easier to detect by comparison with a threshold. The threshold against which the rising or falling edge of a peak is compared can be set in more than one way. It can be a fixed value, or it can depend on the synthesized digital speech frame that contains the peak: for example, it can refer to the average amplitude of the synthesized digital speech frame (the sum of the absolute values of the samples in the frame), or to the level of a particular sub-band of the synthesized digital speech frame. Section 3.3.1, Filter bank and computation of sub-band levels, of 3GPP26094-500 gives one method of computing sub-band levels. For the coding method of the encoder of the present invention described above, which obtains parameters from the speech-mode coded frame and regenerates the excitation signal, the following waveform-detection VAD methods are available:
A threshold is determined according to the synthesized digital audio signal frame under detection; if the amplitude of the rising edge of a peak in the waveform of said synthesized digital audio signal frame exceeds this threshold, the result of said voice activity detection is determined to be speech.
A rising-edge threshold and a falling-edge threshold are set according to the synthesized digital audio signal frame under detection, and the rising-edge amplitude and falling-edge amplitude of a peak in the waveform of said synthesized digital audio signal frame are compared with the rising-edge threshold and the falling-edge threshold respectively; if the rising-edge amplitude and the falling-edge amplitude of the peak exceed said rising-edge threshold and said falling-edge threshold respectively, the result of said voice activity detection is set to speech.
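The two waveform tests above can be sketched as follows. The text does not define how the edges of a peak are measured, so this sketch assumes that a peak is a local maximum, that its rising-edge amplitude is the climb from the preceding local minimum, and that its falling-edge amplitude is the drop to the following local minimum. Function names and thresholds are illustrative.

```python
from typing import List, Sequence, Tuple


def peak_edges(frame: Sequence[float]) -> List[Tuple[float, float]]:
    """(rising-edge amplitude, falling-edge amplitude) for each interior local maximum."""
    edges = []
    last = len(frame) - 1
    for i in range(1, last):
        if frame[i - 1] < frame[i] >= frame[i + 1]:
            left = i
            while left > 0 and frame[left - 1] < frame[left]:
                left -= 1      # walk back down to the preceding trough
            right = i
            while right < last and frame[right + 1] <= frame[right]:
                right += 1     # walk forward down to the following trough
            edges.append((frame[i] - frame[left], frame[i] - frame[right]))
    return edges


def vad_rising_edge(frame: Sequence[float], rise_threshold: float) -> bool:
    """Speech if the rising edge of any peak exceeds the threshold."""
    return any(rise > rise_threshold for rise, _ in peak_edges(frame))


def vad_both_edges(frame: Sequence[float], rise_threshold: float,
                   fall_threshold: float) -> bool:
    """Speech if some peak exceeds both the rising-edge and falling-edge thresholds."""
    return any(rise > rise_threshold and fall > fall_threshold
               for rise, fall in peak_edges(frame))
```

The thresholds are passed in rather than computed here, so either a fixed value or a value derived from the frame itself can be supplied.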
Prior-art voice activity detection methods remain applicable to the synthesized digital speech signal. When the waveform contains many peaks whose rising-edge and falling-edge amplitudes differ little, the prior-art method of comparing signal energy with background-noise energy can detect the speech signal. When the waveform contains few peaks, however, the methods given below by the present invention are better able to detect the speech signal:
An amplitude threshold and a range are determined according to the synthesized digital audio signal frame under detection; if the number of peaks in the waveform of said synthesized digital audio signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, the result of said voice activity detection is determined to be speech.
A rising-edge threshold, a falling-edge threshold, and a range are set according to the synthesized digital audio signal frame under detection, and the rising-edge amplitude and falling-edge amplitude of each peak in the waveform of said synthesized digital audio signal frame are compared with the rising-edge threshold and the falling-edge threshold respectively; if the number of peaks whose rising-edge amplitude and falling-edge amplitude exceed said rising-edge threshold and said falling-edge threshold respectively lies within said range, the result of said voice activity detection is set to speech.
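The counting variant can be sketched as below. To keep the example self-contained it takes the per-peak edge amplitudes as input, however they were measured; the inclusive range bounds and the option of ignoring the falling edge are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def vad_by_peak_count(edges: Sequence[Tuple[float, float]],
                      rise_threshold: float,
                      count_range: Tuple[int, int],
                      fall_threshold: Optional[float] = None) -> bool:
    """Speech if the number of qualifying peaks lies within count_range (inclusive).

    edges holds one (rising-edge amplitude, falling-edge amplitude) pair per peak.
    With fall_threshold=None only the rising edge is tested.
    """
    low, high = count_range
    qualifying = sum(
        1 for rise, fall in edges
        if rise > rise_threshold and (fall_threshold is None or fall > fall_threshold)
    )
    return low <= qualifying <= high
```

For example, with a range of (1, 3) a frame containing a single large peak counts as speech, which is the few-peaks case that the energy comparison of the prior art handles less well.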
The method of encoding non-background-noise-rate AMR-NB frames used in the AMR-NB encoder of the present invention also falls within the scope of protection of the present invention: a method of performing adaptive codebook search, fixed codebook search, and adaptive multi-rate narrowband (AMR-NB) coding on an input signal frame in an input signal frame sequence, and of performing AMR-NB coding at a non-background-noise coding rate on the next input signal frame adjacent to that input signal frame, characterized in that:
Linear prediction is performed on said input signal frame and a linear prediction synthesis filter is determined from the resulting linear prediction parameters; the adaptive codebook search and the fixed codebook search are performed on said input signal frame at a speech-mode coding rate, an excitation signal is generated from the resulting adaptive codebook parameters and fixed codebook parameters, and the excitation signal is filtered by this linear prediction synthesis filter to generate a synthesized digital audio signal frame;
Voice activity detection is performed on said synthesized digital audio signal frame, and the transmission type for discontinuous transmission is determined from the voice activity detection result;
If said transmission type is normal speech, SPEECH_GOOD, the AMR-NB coded frame of said input signal frame is encoded at said speech-mode coding rate, and the excitation signal of said input signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in this coded frame; if the transmission type is silence descriptor update, SID_UPDATE, the AMR-NB_SID_UPDATE frame of said input signal frame is generated by encoding at the background-noise coding rate; if the transmission type is silence descriptor first, SID_FIRST, the AMR-NB_SID_FIRST frame of said input signal frame is generated; if said transmission type is not SPEECH_GOOD, the excitation signal of said input signal frame is reset;
The adjacent next input signal frame is encoded at the speech-mode (non-background-noise) coding rate using the excitation signal of said input signal frame.
For the above coding method, too, the encoding side has several ways of keeping its quantized energy prediction error consistent with that of the decoding side. In one, both sides adjust their own quantized energy prediction error from the correction factor in the coded frame only when the encoding side sends a speech-mode AMR-NB frame to the decoding side. Another method is as follows:
If the transmission type obtained from voice activity detection on the synthesized digital speech signal is normal speech, SPEECH_GOOD, the AMR-NB frame of said input signal frame at the non-background-noise coding rate is generated, and the quantized energy prediction error is generated from the correction factor in this AMR-NB frame;
If said transmission type is silence descriptor first, SID_FIRST, or silence descriptor update, SID_UPDATE, the quantized energy prediction error of said input digital speech frame is set to the logarithmic mean of the quantized frame energy of this input signal frame. If said transmission type is no data, NO_DATA, the quantized energy prediction error of the subframes of the previous input signal frame adjacent to said input signal frame is used as the quantized energy prediction error of the subframes of said input signal frame.
The VAD in the above coding method can also use waveform detection, that is:
A fixed threshold is used, or a threshold is determined according to the synthesized digital audio signal frame under detection; if the amplitude of the rising edge of a peak in the waveform of said synthesized digital audio signal frame exceeds this threshold, the result of said voice activity detection is determined to be speech.
A fixed rising-edge threshold and a fixed falling-edge threshold are used, or a rising-edge threshold and a falling-edge threshold are set according to the synthesized digital audio signal frame under detection, and the rising-edge amplitude and falling-edge amplitude of a peak in the waveform of said synthesized digital audio signal frame are compared with the rising-edge threshold and the falling-edge threshold respectively; if the rising-edge amplitude and the falling-edge amplitude of the peak exceed said rising-edge threshold and said falling-edge threshold respectively, the result of said voice activity detection is set to speech.
Prior-art voice activity detection methods remain applicable to the synthesized digital speech signal. When the waveform contains many peaks whose rising-edge and falling-edge amplitudes differ little, the prior-art method of comparing signal energy with background-noise energy can detect the speech signal. When the waveform contains few peaks, however, the methods given below by the present invention are better able to detect the speech signal:
An amplitude threshold and a range are determined according to the synthesized digital audio signal frame under detection; if the number of peaks in the waveform of said synthesized digital audio signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, the result of said voice activity detection is determined to be speech.
A rising-edge threshold, a falling-edge threshold, and a range are set according to the synthesized digital audio signal frame under detection, and the rising-edge amplitude and falling-edge amplitude of each peak in the waveform of said synthesized digital audio signal frame are compared with the rising-edge threshold and the falling-edge threshold respectively; if the number of peaks whose rising-edge amplitude and falling-edge amplitude exceed said rising-edge threshold and said falling-edge threshold respectively lies within said range, the result of said voice activity detection is set to speech.
Technical scheme of the present invention is not repelled yet speech sampled digital signal (its pretreated digital signal) is carried out the calculating of signal level and background-noise level and comparison and determines transmission types TX_TYPE according to result relatively, though in the embodiments of the invention what be input to the VAD device is synthetic digital audio signal but not through pretreated voice signal (or voice signal).
Beneficial effect
Because VAD is performed after linear prediction and the codebook searches, the excitation signal generated from the codebook searches and linear prediction is available before the VAD operation, and VAD is performed on the output obtained by passing that excitation signal through the linear prediction synthesis filter. Consequently, if the synthesized digital signal frame formed from the original digital speech frame by linear prediction, the adaptive codebook search and the fixed codebook search has the characteristics of speech, its VAD result is speech present. The characteristics of the digital speech signal frame that the decoding side produces by decoding a received AMR coded frame of a non-background-noise coding rate are similar to those of the synthesized digital speech signal that the encoding side uses for detection at that coding rate. The encoding side can produce an AMR coded frame of the SID coding type only when it cannot detect active speech in the synthesized digital signal.
The present invention places the object of VAD directly on the synthesized digital speech signal frame corresponding to the AMR coded frame of the non-background-noise coding rate. Lowering the coding rate makes the VAD result for the synthesized digital speech signal frame at that rate tend toward no active speech. That is, for a speech signal with a given number of frames, with the method of the present invention a lower coding rate increases the number of frames that a VAD decision based on the difference between the input signal level and the background noise estimate judges to be no speech. The present invention can therefore also improve the speech compression ratio of AMR coding, so that the same radio resources can carry more speech signals.
Because VAD is performed after linear prediction and the codebook searches, the excitation signal generated at the non-background-noise coding rate is available before the VAD operation; that is, the codebook searches at the non-background-noise coding rate are executed before VAD. When the transmission type indication that the DTX control and operation module produces from the VAD result is not normal speech (SPEECH_GOOD), the parameters of the excitation signal produced while generating the synthesized digital speech signal at the non-background-noise coding rate can no longer be used for coding the next frame at the non-background-noise coding rate. In this case the present invention selectively discards the parameters obtained by linear prediction, the adaptive codebook search and the fixed codebook search in speech mode. That is, apart from the excitation signal and the quantized energy prediction error, which are taken from the coding of the background-noise-rate frame, the other parameters obtained by the linear prediction, adaptive codebook search and fixed codebook search operations in speech mode can still be used when generating the synthesized digital speech signal for the next input speech signal frame. Unlike the prior art, it is not necessary to discard, after coding a SID frame, the other parameters produced by linear prediction and codebook search at the non-background-noise coding rate. With this scheme, the synthesized digital speech signal that is generated for the next input speech signal frame and used for voice activity detection retains more of the characteristics of the input speech signal. In the prior art, by contrast, once a background-noise-rate frame is coded, all state variables in the AMR-NB encoder, including the excitation signal and the quantized energy prediction error, are reset, and the encoder then loses the characteristics of the past input speech signal.
After an AMR-NB coded frame of speech mode is received, the receiving-side decoder and the speech mode coding module in the encoder each refer to the same excitation signal on the past sample points, which include the sample points of the subframes of the previous frame, and to the same quantized energy prediction errors of four subframes. One side uses the parameters in the coded frame received over the channel and the other uses the parameters it has itself encoded into that frame, and each generates the excitation signal and synthesized speech of every subframe. The excitation signal synthesized by the receiving-side decoder is therefore identical to the one synthesized by the speech coding module, and because the decoder uses the same excitation signal as the encoder, the quality of the synthesized speech produced by decoding is assured.
The VAD method of the present invention that compares the amplitude of a peak of the synthesized digital speech signal with a threshold detects the synthesized digital speech signal frame containing that peak when the amplitude of the waveform peak that reflects a resonance peak (corresponding to a pole of the prediction synthesis filter) is above the threshold. The background art mentions that the peaks of the synthesized digital speech signal are more prominent than those of the original signal. When this shows up as the rising edge or falling edge of a peak in the synthesized waveform, corresponding to a formant of the original signal, being larger than in the original signal, the method of comparing the peak amplitude of the synthesized digital speech signal with a threshold can detect frames that cannot be detected by examining the peaks of the original signal waveform. Likewise, when the steeper rising edge of the synthesized digital speech signal shows up as the rising edge of such a peak being larger than in the original signal, the method of the present invention that compares the rising edge of a peak of the synthesized digital speech signal with a threshold can detect frames that could not be detected originally. Likewise, when the steeper rising edge shows up as the slope of the rising edge of such a peak being larger than in the original signal, comparing the slope of the rising edge of a peak of the synthesized digital speech signal with a threshold can detect frames that could not be detected originally.
Description of drawings
Fig. 1 is a block diagram of a variable-bit-rate adaptive multi-rate narrowband (AMR-NB) encoder that supports constant speech mode coding.
Fig. 2 is a simplified block diagram of the speech coding module in Fig. 1.
Fig. 3 shows the synthesized digital speech signal of frame 393, obtained with DTX4.INP from 3GPP TS26.074-500 as the input signal and 12.2kb/s as the coding rate; 7.84 in the figure denotes the instant 7.84 seconds.
Fig. 4 shows an AMR-NB encoder that generates one AMR-NB coded frame for each input signal frame.
Fig. 5 is a simplified block diagram of the speech coding module in Fig. 4.
Fig. 6 shows the preprocessed digital speech signal of frame 392 of DTX4.INP from 3GPP TS26.074-500, used as the input signal; 7.84 in the figure denotes the instant 7.84 seconds.
Fig. 7 shows frame 393 of the synthesized digital speech signal after encoding and decoding, with DTX4.INP from 3GPP TS26.074-500 as the input signal and 12.2kb/s as the coding rate; 7.84 in the figure denotes the instant 7.84 seconds.
Embodiment
Embodiment 1: an adaptive multi-rate narrowband (AMR-NB) encoder that can switch between constant speech mode and discontinuous transmission (DTX) mode. As shown in Fig. 1, the 13-bit uniform pulse code modulation (PCM) input speech signal frame 1, sampled at 8kHz, is fed simultaneously to the non-background-noise-rate speech coding module and to the background noise coding module. The speech coding module outputs the AMR-NB coded frame 11 of signal frame 1, coded at the non-background-noise coding rate, to the coded frame output selection module. The background noise coding module outputs the AMR-NB silence descriptor coded frame 12 of signal frame 1, coded at the background noise coding rate, to the coded frame output selection module. The speech coding module also outputs the synthesized digital speech signal frame 17, produced while coding signal frame 1, to the voice activity detection module. Frame 17 is generated by the method for producing local synthesized speech given in section 5.9 of 3GPP 26090-500. The voice activity detection module performs voice activity detection on frame 17 and outputs the result, VAD flag 18, to the discontinuous transmission (DTX) control and operation module. The DTX control and operation module outputs transmission type signal 19 to the coded frame output selection module, which passes the received transmission type signal 19 on to the 3G (third-generation mobile communication) radio access network (AN). Transmission type signal 19 takes one of four values: normal speech (SPEECH_GOOD), silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA). When transmission type signal 19 is SPEECH_GOOD, the information bits 2 output by the coded frame output selection module are the AMR-NB coded frame 11 coded at the non-background-noise coding rate (speech mode). When transmission type signal 19 is SID_UPDATE, the information bits 2 are the AMR-NB silence descriptor (AMR-NB_SID) frame 12 coded at the background noise coding rate. When transmission type signal 19 is SID_FIRST, the information bits 2 are likewise the AMR-NB_SID frame 12 output by the background noise coding module, rather than the SID_FIRST frame formed according to 3GPP technical specification TS26093 (a frame whose 35 comfort noise bits are all 0). When transmission type signal 19 is NO_DATA, the information bits 2 are not valid for the 3G AN. Therefore, whenever transmission type signal 19 is not SPEECH_GOOD, the coded frame output selection module places the AMR-NB_SID frame 12 output by the background noise coding module into information bits 2.
The DTX control and operation module also receives coding mode signal 5, which indicates either constant speech mode or DTX mode. When coding mode signal 5 indicates DTX mode, the transmission type signal 19 sent by the DTX control and operation module can be any of normal speech (SPEECH_GOOD), silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA); its content is then determined solely by the result of the module's operation on VAD flag 18. When coding mode signal 5 indicates constant speech mode, the content of transmission type signal 19 is SPEECH_GOOD. That is, VAD flag 18 is still delivered to the DTX control and operation module, but whatever its content (speech or no speech), the module outputs a transmission type signal 19 of SPEECH_GOOD and resets its state variables to their initial state, so that only AMR-NB frames coded by the speech coding module are sent to the 3G AN.
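The selection rules of Embodiment 1 can be summarized in a few lines. The following is a minimal sketch, not the patented implementation; the names `TxType`, `resolve_tx_type` and `select_info_bits` are hypothetical and the frames are represented as opaque byte strings.

```python
from enum import Enum


class TxType(Enum):
    SPEECH_GOOD = "SPEECH_GOOD"
    SID_FIRST = "SID_FIRST"
    SID_UPDATE = "SID_UPDATE"
    NO_DATA = "NO_DATA"


def resolve_tx_type(dtx_mode: bool, dtx_decision: TxType) -> TxType:
    """Coding mode signal 5: in constant speech mode the result is always
    SPEECH_GOOD; in DTX mode the DTX module's own decision is used."""
    return dtx_decision if dtx_mode else TxType.SPEECH_GOOD


def select_info_bits(tx_type: TxType, frame_11: bytes, frame_12: bytes) -> bytes:
    """Coded frame output selection: speech frame 11 for SPEECH_GOOD,
    otherwise the AMR-NB_SID frame 12 (ignored by the AN for NO_DATA)."""
    return frame_11 if tx_type is TxType.SPEECH_GOOD else frame_12
```

For example, `select_info_bits(TxType.SID_FIRST, b"speech", b"sid")` returns `b"sid"`, matching the statement that SID_FIRST also carries frame 12 in this embodiment.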
If the DTX control and operation module sets transmission type signal 19 to normal speech (SPEECH_GOOD) according to the input VAD flag 18, it also sends this transmission type indication for the AMR-NB coded frame of the current 13-bit uniform PCM signal frame 1 (8kHz) to the speech coding module. On receiving this transmission type signal 19, the speech coding module continues to use the excitation signal in its own excitation buffer and the quantized energy prediction errors in its own quantized energy prediction error buffer when it codes the AMR-NB frame of the frame immediately following the current 13-bit uniform PCM signal frame; that is, it uses the excitation signal in its excitation buffer and the quantized energy prediction errors as described in 3GPP TS26090. If the DTX control and operation module sets transmission type signal 19 to any of silence descriptor first (SID_FIRST), silence descriptor update (SID_UPDATE) and no data (NO_DATA) according to the input VAD flag 18, it also sends this signal 19 to the speech coding module. On receiving a transmission type signal 19 of one of these types, the speech coding module replaces the excitation signal in its own excitation buffer with excitation signal 35, which the background noise coding module produces after coding the current PCM signal frame 1, for use when coding the AMR-NB frame of the immediately following frame. Likewise, the speech coding module replaces the quantized energy prediction errors of the four subframes in its own quantized energy prediction error buffer with the quantized energy prediction errors 37 of four subframes, which the background noise coding module produces after coding the current PCM signal frame 1, for use when coding the AMR-NB frame of the immediately following frame.
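The buffer handling described above amounts to choosing which state the speech coding module carries into the next frame. A minimal sketch, assuming a hypothetical `PredictorState` container and plain string transmission types; buffer lengths and value types are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PredictorState:
    excitation: List[float]       # contents of the excitation buffer
    energy_pred_err: List[float]  # quantized energy prediction errors, 4 subframes


def state_for_next_frame(tx_type: str,
                         speech_state: PredictorState,
                         noise_state: PredictorState) -> PredictorState:
    """Return the state the speech coding module uses for the next frame.

    SPEECH_GOOD keeps the module's own buffers; any other transmission type
    substitutes excitation signal 35 and prediction errors 37 produced by
    the background noise coding module.
    """
    if tx_type == "SPEECH_GOOD":
        return speech_state
    return PredictorState(list(noise_state.excitation),
                          list(noise_state.energy_pred_err))
```

Copying the lists keeps the two modules' buffers independent after the substitution, which is a design choice of this sketch rather than something the text specifies.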
After the decoder receives an AMR_SID frame sent by the encoder, it extracts the index of the logarithmic frame energy from the frame, obtains the mean logarithmic frame energy from that index, and sets the quantized energy prediction errors of its four subframes to this mean logarithmic frame energy. Because in this embodiment both SID_UPDATE and SID_FIRST frames contain the index of the logarithmic frame energy, the decoder can use the same quantized energy prediction errors as the encoder whenever it receives an AMR_SID frame. When the decoder receives a speech-mode AMR-NB frame, it sets the quantized energy prediction errors according to the correction factor in that frame, and while the encoder suspends transmission in DTX the decoder keeps the quantized energy prediction errors unchanged, so the encoder and the decoder can keep their quantized energy prediction errors consistent.
Fig. 1 is similar to the block diagram of the encoding part on the transmit side (TRANSMIT SIDE) in Fig. 1 of 3GPP TS26.071. The difference lies in the signal that the voice activity detection (Voice Activity Detector) module receives from the speech coding module. In Fig. 1 of 3GPP TS26.071 it is the speech samples preprocessed by the speech encoder (Speech Encoder) module; in Fig. 1 of this document it is the synthesized digital speech signal frame generated after the speech coding module performs linear prediction and quantization, the adaptive codebook search and the fixed codebook search on the input digital speech signal frame. In Fig. 1 of this document, when transmission type signal 19 is normal speech (SPEECH_GOOD) or silence descriptor update (SID_UPDATE), the coded frame output selection module selects either the AMR-NB coded frame generated by the speech coding module or the AMR-NB silence descriptor (AMR-NB_SID) coded frame generated by the background noise coding module as the information bits (info bits). By contrast, in Fig. 1 of 3GPP TS26.071 the speech frame 4 and the silence descriptor frame (SID frame) 5 cannot occur at the same time, so no such selection between the two exists.
Fig. 2 is a simplified block diagram of the speech coding module in Fig. 1 and shows its signal processing flow. It is essentially the same as Fig. 3 (simplified block diagram of the AMR encoder) in 3GPP TS26.090-500. In Fig. 2, A(z) is the inverse filter with unquantized coefficients, x(n) is the target signal for the adaptive codebook search, and x_2(n) is the target signal for the fixed codebook search. The descriptions in the sections of TS26.090-500 cover the content of its Fig. 3 and therefore also cover the parts of Fig. 2 of this document that are identical to it.
Fig. 2 of this document differs from Fig. 3 of TS26.090-500 in the following respects:
The speech coding module shown in Fig. 2 uses the inverse filter with quantized coefficients to obtain the linear prediction synthesis filter, and filters the excitation signal with this synthesis filter to produce synthesized digital speech signal frame 17. Fig. 2 also shows a post-processing flow: the content of transmission type 19 is checked first, and if it is not SPEECH_GOOD, the excitation signal of the current frame is replaced with excitation signal 35 and the quantized energy prediction errors of the four subframes of the current frame are replaced with quantized energy prediction errors 37. The parameters in AMR-NB coded frame 11 of Fig. 1 come from the LSP index, adaptive codebook index, adaptive codebook gain index, fixed codebook index and fixed codebook gain index in Fig. 2.
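Filtering the excitation signal through the linear prediction synthesis filter can be illustrated with a direct-form all-pole recursion. This is a minimal sketch under the assumption that the quantized inverse filter has the form 1 + a_1 z^-1 + ... + a_m z^-m, so that each output sample is s(n) = u(n) - sum(a_i * s(n-i)); the function name `lp_synthesis` and the toy coefficients are hypothetical.

```python
def lp_synthesis(excitation, a, memory=None):
    """Filter an excitation signal u(n) through the all-pole filter 1/A(z).

    `a` holds a_1..a_m of A(z) = 1 + sum(a_i * z**-i) (assumed convention).
    `memory` optionally holds the past outputs s(n-1), s(n-2), ..., s(n-m).
    """
    m = len(a)
    hist = list(memory) if memory is not None else [0.0] * m  # hist[0] = s(n-1)
    out = []
    for u in excitation:
        s = u - sum(ai * h for ai, h in zip(a, hist))
        out.append(s)
        hist = ([s] + hist[:-1]) if m else []
    return out
```

With a single coefficient a_1 = -0.5, a unit impulse decays geometrically: `lp_synthesis([1.0, 0.0, 0.0], [-0.5])` gives `[1.0, 0.5, 0.25]`. A 12.2kb/s AMR-NB frame would use a 10th-order filter per subframe; the order here is kept small only for readability.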
The following describes in detail how the AMR-NB encoder codes, at 12.2kb/s, a stretch of the input signal represented by DTX4.INP from 3GPP TS26.074-500. DTX4.INP has a total length of 1188 frames of 20 milliseconds each, that is, 23.76 seconds. Each sample of DTX4.INP is represented with 16 bits; the AMR-NB encoder sets its 3 least significant bits (bit 2 to bit 0) to 0 and thereby forms a 13-bit digital speech signal (the resolution of this digital speech signal is 8). The speech coding module, operating at the 12.2kb/s coding rate, preprocesses the input signal frame as specified in section 5.1 of 3GPP TS26.090 and then performs non-background-noise-rate speech mode coding at 12.2kb/s. This comprises the AMR-NB coding operations of linear prediction and quantization, the adaptive codebook search and the fixed codebook search, and the generation of the synthesized digital speech signal. Before 7.7 seconds, coding mode signal 5 indicates constant speech mode and the encoder always outputs at the constant rate of 12.2kb/s; that is, for each of frames 1 (0 seconds to 0.02 seconds) to 385 of DTX4.INP, the 12.2kb/s coded frame produced by the speech coding module is selected as the output information bits. From 7.7 seconds to 8.10 seconds, coding mode signal 5 indicates DTX mode; that is, the coding rate of each of frames 386 to 405 is determined by the VAD module and the DTX control and operation module to be either 12.2kb/s or the background noise coding rate (1.80kb/s). The coding process during DTX mode operation is examined below.
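The arithmetic in this paragraph (zeroing the 3 least significant bits and mapping frame numbers to time) can be checked with a short sketch. The helper names are hypothetical, and times are kept in integer milliseconds to avoid floating-point rounding.

```python
FRAME_MS = 20  # one AMR-NB frame is 20 milliseconds (160 samples at 8 kHz)


def to_13_bit(sample_16: int) -> int:
    """Zero bits 2..0 of a 16-bit signed sample, leaving 13 significant bits."""
    return sample_16 & ~0x7


def frame_span_ms(frame_number: int):
    """Start and end time in ms of a frame, counting frames from 1."""
    start = (frame_number - 1) * FRAME_MS
    return start, start + FRAME_MS
```

Frame 386 starts at 7700 ms and frame 405 ends at 8100 ms, which matches the DTX interval of 7.7 seconds to 8.10 seconds, and 1188 frames give 23760 ms. Because the step between representable values after masking is 8, the resolution stated in the text follows directly.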
For the synthesized digital speech frames of this embodiment, the file DTX4_122.COD provided with 3GPP TS26.074-500 (COD is the file suffix) can be used as a reference: the waveform of the synthesized digital speech signal of frame 393 shown in Fig. 3 can be compared with frame 393 of the synthesized digital speech signal determined by that file. The voice activity detection of the VAD1 option of 3GPP AMR yields no speech for all of frames 386 to 405 of DTX4.INP; it cannot detect the speech characteristics of frame 392 in Fig. 7. The 3 VAD methods given below, in contrast, all detect the speech characteristics of the synthesized digital speech frame at frame 393 (7.84 seconds to 7.86 seconds). The VAD of this embodiment uses the third of these methods, and the detection results for these frames are listed in Table 1. The DTX control and operation module sets TX_TYPE to SPEECH_GOOD after receiving a VAD flag 18 indicating speech. After receiving 8 consecutive VAD flags 18 indicating no speech it sets TX_TYPE to SID_FIRST. If it then receives 3 more VAD flags 18 indicating no speech, it sets TX_TYPE to SID_UPDATE (the 2 TX_TYPE values between SID_FIRST and SID_UPDATE are NO_DATA). Thereafter, each time it receives a further 8 consecutive VAD flags 18 indicating no speech, it sets TX_TYPE to SID_UPDATE (the TX_TYPE values before this SID_UPDATE are NO_DATA).
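The TX_TYPE rules just described can be written as a small counter-based state machine. This is a sketch of one reading of the text only: it assumes TX_TYPE stays SPEECH_GOOD until the 8th consecutive no-speech flag, and the function name `tx_type_sequence` is hypothetical. It does not reproduce the full DTX handler of the 3GPP specifications.

```python
def tx_type_sequence(vad_flags):
    """Map per-frame VAD flags (True = speech) to TX_TYPE strings."""
    out = []
    silent = 0  # consecutive no-speech frames seen so far
    for speech in vad_flags:
        if speech:
            silent = 0
            out.append("SPEECH_GOOD")
            continue
        silent += 1
        if silent < 8:
            out.append("SPEECH_GOOD")   # fewer than 8 silent frames so far
        elif silent == 8:
            out.append("SID_FIRST")
        elif (silent - 11) % 8 == 0:
            out.append("SID_UPDATE")    # 11th, 19th, 27th, ... silent frame
        else:
            out.append("NO_DATA")
    return out


# Frames 386..405: no speech except frame 393, as in the embodiment.
flags = [False] * 7 + [True] + [False] * 12
types = tx_type_sequence(flags)
```

Under these assumptions `types` is SPEECH_GOOD for frames 386 to 400, then SID_FIRST, NO_DATA, NO_DATA, SID_UPDATE and NO_DATA for frames 401 to 405.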
Fig. 3 shows the waveform of the synthesized digital speech signal of frame 393. The horizontal axis is time and the vertical axis is a percentage. As the figure shows, the first half of frame 393 (7.84 seconds to 7.85 seconds) ranges from -1.6% to 2.2%. Because a 16-bit signed integer ranges from -2^15 (-32768) to 2^15 - 1 (32767), -1.6% to 2.2% corresponds to -524 to 720. The first VAD method declares speech when the rising edge and the falling edge of a peak are greater than or equal to a threshold of 502, and it detects the speech signal of frame 393: the largest peak in the first half of frame 393 (7.84 seconds to 7.85 seconds) in the figure has the value 430, the adjacent trough on its left is -176 and the adjacent trough on its right is -81. That is, the rising edge is 606 and the falling edge is 511, both of which exceed 502, so frame 393 is judged to be speech. The second VAD method declares speech when the rising edge or falling edge of a peak is greater than or equal to a threshold of 592; this method also detects the speech signal of frame 393. The third VAD method declares speech when the rising edge of a peak is greater than or equal to a threshold of 592 and the falling edge of the peak is greater than or equal to a threshold of 502; this method also detects the speech signal of frame 393.
Table 1
Frame number | VAD flag 18 of synthesized digital speech signal frame 17 | Transmission type signal 19 | Coding rate determined by transmission type 19 | Source of the previous frame's excitation signal and quantized energy prediction error used when the speech mode coding module codes the current frame |
386 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
393 | Speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
394 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
395 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
396 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
397 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
398 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
399 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
400 | No speech | SPEECH_GOOD | 12.2kb/s | Speech mode coding module itself |
401 | No speech | SID_FIRST | 1.80kb/s | Speech mode coding module itself |
402 | No speech | NO_DATA | 1.80kb/s | 35 and 37 from the background noise coding module |
403 | No speech | NO_DATA | 1.80kb/s | 35 and 37 from the background noise coding module |
404 | No speech | SID_UPDATE | 1.80kb/s | 35 and 37 from the background noise coding module |
405 | No speech | NO_DATA | 1.80kb/s | 35 and 37 from the background noise coding module |
The signal values on the sample points of synthesized digital speech signal frame 393 are listed in order in the braces below: {43, 42, 13, 15, 7, -41, -1, 33, 0, -1, 1, -6, -5, -176, -32, 215, 430, 186, -81, -74, 195, 105, 19, -29, -72, -29, -46, -235, 123, -98, -67, -72, 16, 39, 126, 71, -63, 53, 31, -153, 92, 136, 100, 2, 17, -45, 31, 45, -47, -102, -98, -44, 8, 88, 1, -41, 118, -52, 1, 59, 32, 10, -27, -41, 108, -45, -44, 55, 72, -26, 119, -110, -70, -131, 43, 54, 10, -41, -50, 16, -15, 56, 20, 13, -13, -1, -3, 6, 11, 9, -44, -119, -134, 151, 288, 104, -229, -39, -6, 25, 188, 61, -73, -27, -233, -137, 136, -2, -218, 56, 43, 139, -14, 5, -16, 246, 22, -131, 89, 76, -97, 7, 134, 9, 42, 3, -31, -102, -126, -49, -11, -36, -64, -5, 144, 201, 17, 42, 56, -146, -134, 1, -76, -153, -81, 22, 2, -39, 39, 80, 42, 80, 31, -30, -41, -52, -75, -16, 7, -17}
From these values, the average amplitude of the frame (defined here as the sum of the absolute values of the signal values on all sample points in the frame) is computed to be 10813. The 3 waveform detection methods above can be used in VAD:
The first waveform detection method sets the VAD result to speech when the rising-edge and falling-edge amplitudes are both greater than a threshold. The threshold is the larger of 500 and the product of the weighting coefficient 0.04643 and the average amplitude of the frame; the latter, 0.04643 times 10813, equals 502, so the threshold of this method is 502;
The second waveform detection method sets the VAD result to speech when the rising-edge amplitude is greater than a threshold. The threshold is the larger of 572 and the product of the weighting coefficient 0.05475 and the average amplitude of the frame; the latter, 0.05475 times 10813, equals 592, so the threshold of this method is 592;
The third waveform detection method sets the VAD result to speech when the rising-edge and falling-edge amplitudes of a peak are each greater than their own threshold. The rising-edge threshold is the larger of 572 and the product of the weighting coefficient 0.05475 and the average amplitude of the frame, and the falling-edge threshold is the larger of 500 and the product of the weighting coefficient 0.04643 and the average amplitude of the frame, so they are 592 and 502 respectively. The VAD flag values in the second column of Table 1 are obtained with this VAD method.
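The three waveform tests can be reproduced on the frame-393 samples listed above. The sketch below is illustrative only: the text does not spell out how peaks and their adjacent troughs are located, so `peak_edges` assumes a peak is a local maximum and walks downhill on each side to the nearest local minimum; all function names are hypothetical.

```python
FRAME_393 = [
    43, 42, 13, 15, 7, -41, -1, 33, 0, -1, 1, -6, -5, -176, -32, 215,
    430, 186, -81, -74, 195, 105, 19, -29, -72, -29, -46, -235, 123, -98,
    -67, -72, 16, 39, 126, 71, -63, 53, 31, -153, 92, 136, 100, 2, 17,
    -45, 31, 45, -47, -102, -98, -44, 8, 88, 1, -41, 118, -52, 1, 59,
    32, 10, -27, -41, 108, -45, -44, 55, 72, -26, 119, -110, -70, -131,
    43, 54, 10, -41, -50, 16, -15, 56, 20, 13, -13, -1, -3, 6, 11, 9,
    -44, -119, -134, 151, 288, 104, -229, -39, -6, 25, 188, 61, -73, -27,
    -233, -137, 136, -2, -218, 56, 43, 139, -14, 5, -16, 246, 22, -131,
    89, 76, -97, 7, 134, 9, 42, 3, -31, -102, -126, -49, -11, -36, -64,
    -5, 144, 201, 17, 42, 56, -146, -134, 1, -76, -153, -81, 22, 2, -39,
    39, 80, 42, 80, 31, -30, -41, -52, -75, -16, 7, -17,
]


def peak_edges(x):
    """Return (rising, falling) edge amplitudes for every local peak."""
    edges = []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            left = i
            while left > 0 and x[left - 1] <= x[left]:
                left -= 1                      # walk down to the left trough
            right = i
            while right < len(x) - 1 and x[right + 1] <= x[right]:
                right += 1                     # walk down to the right trough
            edges.append((x[i] - x[left], x[i] - x[right]))
    return edges


def thresholds(x):
    """Rising and falling thresholds from the sum of absolute sample values."""
    amp = sum(abs(v) for v in x)
    return max(572, 0.05475 * amp), max(500, 0.04643 * amp)


def vad_method1(x):
    _, fall_thr = thresholds(x)
    return any(r > fall_thr and f > fall_thr for r, f in peak_edges(x))


def vad_method2(x):
    rise_thr, _ = thresholds(x)
    return any(r > rise_thr for r, _ in peak_edges(x))


def vad_method3(x):
    rise_thr, fall_thr = thresholds(x)
    return any(r > rise_thr and f > fall_thr for r, f in peak_edges(x))
```

On `FRAME_393` the sum of absolute values is 10813, the thresholds round to 592 and 502, and the peak at 430 yields the edge pair (606, 511), so all three tests report speech under this peak definition.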
For the same DTX4.INP input, however, applying the above waveform detection methods to frame 393 of the preprocessed digital speech signal used by the VAD of the AMR-NB encoder specified in 3GPP technical specification 26.073 fails to detect this frame as a speech frame. That is, the method that declares speech when the rising-edge and falling-edge amplitudes both exceed their respective thresholds cannot judge frame 393 of the original preprocessed digital speech signal to be speech.
In the 3 waveform detection methods above, the VAD decision is set to speech as soon as a compared value exceeds its threshold. Requiring the number of peaks that exceed the thresholds to lie within a set range is also a waveform detection method. For example, it can be specified that the VAD result is set to speech when the number of peaks whose rising-edge and falling-edge amplitudes exceed their respective thresholds lies in the range 1 to 3. The rising-edge threshold is the larger of 572 and the product of the weighting coefficient 0.05475 and the average amplitude of the frame, and the falling-edge threshold is the larger of 500 and the product of the weighting coefficient 0.04643 and the average amplitude of the frame, so they are 592 and 502 respectively. Under this rule the VAD result of speech is still obtained for synthesized digital speech frame 393.
In Embodiment 1, coding is carried out at 2 coding rates, one a background noise coding rate and the other a non-background-noise coding rate. Transmission type 19 thus specifies which quantized energy prediction errors and excitation signal are used when the following frame is coded. That is, if the content of transmission type signal 19 is SPEECH_GOOD, the excitation signal and quantized prediction errors produced by coding at 12.2kb/s are used when coding the next frame; otherwise, when coding the next frame the speech coding module uses the excitation signal reset by the background noise coding module and the quantized prediction errors generated from the quantized mean logarithmic frame energy.
The DTX control and operation module stays in its initial state until frame 386 and starts operating from frame 386. The first 7 frames cannot produce a SID_FIRST output (the first SID_FIRST requires 8 frames whose VAD result is no speech). Because the 8th frame, frame 393, is detected as speech, the outputs for frames 393 to 400 in Table 1 are also AMR-NB frames at 12.2kb/s.
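The peak-count variant can be sketched as a function over (rising, falling) edge pairs. The names are hypothetical, and the example pairs are illustrative: the first is the 606/511 peak of frame 393 cited in the text, the others are made-up values that fail at least one threshold.

```python
def count_range_vad(edges, frame_amplitude, lo=1, hi=3):
    """Declare speech when the number of peaks exceeding both thresholds
    lies within [lo, hi].

    `edges` is a list of (rising, falling) edge amplitudes, one per peak.
    `frame_amplitude` is the sum of absolute sample values of the frame.
    """
    rise_thr = max(572, 0.05475 * frame_amplitude)
    fall_thr = max(500, 0.04643 * frame_amplitude)
    count = sum(1 for r, f in edges if r > rise_thr and f > fall_thr)
    return lo <= count <= hi


example_edges = [(606, 511), (422, 517), (358, 221)]  # illustrative values
speech = count_range_vad(example_edges, 10813)
```

Here exactly one pair passes both thresholds (592 and 502 after rounding), so the count falls inside the range 1 to 3 and `speech` is `True`. An upper bound on the count is what distinguishes this variant from the three any-peak tests.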
Embodiment 2, one is had only the AMR-NB scrambler of a coding module for its generation AMR-NB coded frame to an input voice signal frame as shown in Figure 4, input audio signal frame 42 is the even PCM frames of 13 bits, the 43rd, the VAD sign, the 44th, AMR-NB encoded speech frames (non-ground unrest encode speed self-adaption arrowband coded frame), the 45th, the quiet description of AMR-NB (SID) frame, the 46th, the indication of transmission types, the 47th, pass to the information bit of 3G Access Network, the voice coding module is carried out the synthetic digital voice signal frame 48 that the search of linear prediction and code book obtains to the even PCM frame of 13 bits, the 49th, the even PCM frame of 13 bits carried out the pretreated voice signal frame that obtains after the pre-service, the 50th, quantification energy predicting error---the frame energy logarithmic mean value of quantification of the subframe that generates during the coded frame of ground unrest coding module coding ground unrest code rate-quiet description (SID) frame, the logarithmic mean value (averaged logarithmic energy) that is the frame energy is through the value after the quantification treatment, the quantification energy predicting error of four subframes is all used this numerical value, calculates in (Frame energy caculation) at the 5.2 joint frame energy of the TS26.092-500 of 3GPP and has provided the logarithmic mean value of frame energy and the frame energy logarithmic mean value defined of quantification.
The block diagram of the transmit part shown in Fig. 4 is similar to the transmit-side block diagram (Transmit side) of Fig. 1 in 3GPP 26.071-400. The difference is that the voice activity detection module in Fig. 4 of the present invention operates on the synthesized digital audio signal, whereas the 3GPP method operates on the preprocessed digital audio signal.
The background noise coding module in Fig. 4 implements the coding of AMR-NB frames at the background-noise coding rate with reference to 3GPP technical specification TS 26.092. Fig. 4 shows explicitly that, when coding a SID frame, the background noise coding module provides the speech coding module with the quantized energy prediction error, that is, the quantized logarithmic mean of the frame energy. The background noise coding module receives the VAD flag 43, and it updates its frame-energy logarithmic mean 50 only after it has received 8 or more consecutive VAD flags 43 indicating no speech.
A simplified block diagram of the speech coding module of Fig. 4 is shown in Fig. 5. In the post-processing flow of Fig. 5, when the transmission type 46 of the current frame is not SPEECH_GOOD, the excitation signal of the current frame stored in the excitation buffer is set to the known reset values of the excitation signal. The excitation signal in this buffer comprises at least the signal values at the 154 sample points of the last subframe. In the same case, the quantized energy prediction errors of the four current subframes are set according to the quantized frame-energy logarithmic mean 50 from the background noise coding module.
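The post-processing step of Fig. 5 can be sketched as a small state update. The class and function names are hypothetical, and the reset value of zero is an assumption made only for illustration: the text says only that the reset values are known to both sides.

```python
from dataclasses import dataclass, field
from typing import List

EXC_BUFFER_LEN = 154  # minimum excitation history named in the text
NUM_SUBFRAMES = 4


@dataclass
class EncoderState:
    """State carried from the current frame into the coding of the next frame."""
    excitation: List[float] = field(default_factory=lambda: [0.0] * EXC_BUFFER_LEN)
    pred_errors: List[float] = field(default_factory=lambda: [0.0] * NUM_SUBFRAMES)


def post_process(state: EncoderState, tx_type: str,
                 quantized_log_energy: float,
                 reset_value: float = 0.0) -> EncoderState:
    """Apply the Fig. 5 post-processing rule for one frame."""
    if tx_type == "SPEECH_GOOD":
        # Speech frame: keep the excitation and prediction errors from speech coding.
        return state
    # Any other transmission type: reset the excitation buffer (assumed value) and
    # set all four subframe prediction errors to the quantized log frame energy.
    state.excitation = [reset_value] * EXC_BUFFER_LEN
    state.pred_errors = [quantized_log_energy] * NUM_SUBFRAMES
    return state
```

Because a decoder can apply the same rule from the transmission type it receives, both ends start the next speech frame from the same state.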
In this embodiment, the speech coding module receives the 13-bit uniform PCM frame 42 and sends to the voice activity detection module the synthesized digital audio signal frame obtained after linear prediction, adaptive codebook search and fixed codebook search on the preprocessed digital speech signal. That is, the adaptive codebook vector scaled by the adaptive codebook gain is added to the fixed codebook vector scaled by the fixed codebook gain to give the excitation signal, and the excitation signal is then passed through the linear prediction synthesis filter determined by the linear prediction (LP) parameters obtained by linear prediction, giving the synthesized digital audio signal frame 48 (the linear prediction synthesis filter used to synthesize the digital audio frame may also be determined by the linear prediction parameters A(z)). The voice activity detection module outputs the VAD result obtained from the synthesized digital audio signal frame 48, namely the VAD flag 43, to the DTX control and operation module. The function of the DTX control and operation module is the same as that specified by 3GPP; see clause 5.1 of TS 26.093-520.

When the transmission type indication 46 received by the speech coding module shown in Fig. 4 is normal speech (SPEECH_GOOD), the module produces an AMR-NB speech-mode coded frame (a frame coded at the non-background-noise coding rate); only then are the LSP index, adaptive codebook index, adaptive codebook gain index, fixed codebook index and fixed codebook gain index of Fig. 5 packed into this AMR-NB speech-mode coded frame. When the transmission type indication 46 received by the background noise coding module is not normal speech (SPEECH_GOOD), the background noise coding module codes the preprocessed digital speech signal frame 49 into the AMR-NB silence descriptor (SID) frame 45. When the transmission type indication 46 is normal speech (SPEECH_GOOD), the DTX control and operation module places the AMR-NB coded speech frame 44 in the information bits 47 and sends them to the 3G access network (AN). When the transmission type indication 46 is silence descriptor update (SID_UPDATE), the DTX control and operation module places the adaptive multi-rate silence descriptor (AMR_SID) frame 45 in the information bits 47 and sends them to the 3G access network (AN). When the transmission type indication 46 is silence descriptor first (SID_FIRST), the DTX control and operation module places a SID_FIRST frame formed according to 3GPP technical specification TS 26.093 in the information bits 47 and sends them to the 3G access network (AN). When the transmission type indication 46 is no data (NO_DATA), the DTX control and operation module instructs the 3G access network not to transmit a speech frame, so it does not matter what is placed in the information bits.
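The construction of the synthesized frame 48 described for this embodiment can be sketched as follows. This is a minimal illustration, not the fixed-point procedure of the AMR-NB codec: the function names are hypothetical, and the filter convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is an assumption.

```python
from typing import List, Optional, Sequence


def build_excitation(adaptive_vec: Sequence[float], fixed_vec: Sequence[float],
                     gain_pitch: float, gain_code: float) -> List[float]:
    """Excitation = adaptive codebook gain * adaptive vector + fixed gain * fixed vector."""
    return [gain_pitch * v + gain_code * c for v, c in zip(adaptive_vec, fixed_vec)]


def lp_synthesis(excitation: Sequence[float], lp_coeffs: Sequence[float],
                 memory: Optional[Sequence[float]] = None) -> List[float]:
    """Filter the excitation through 1/A(z), with A(z) = 1 + sum_i a_i z^-i (assumed)."""
    order = len(lp_coeffs)
    # history[0] is the most recent output sample s(n-1), history[1] is s(n-2), ...
    history = list(memory) if memory is not None else [0.0] * order
    out: List[float] = []
    for u in excitation:
        s = u - sum(a * h for a, h in zip(lp_coeffs, history))
        out.append(s)
        history = ([s] + history)[:order]
    return out
```

The output of `lp_synthesis` plays the role of frame 48, the signal on which voice activity detection is run in this embodiment.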
In Embodiment 2, the VAD result received by the discontinuous transmission (DTX) control and operation device comes from detection on the synthesized digital audio signal, and the device operates as specified in 3GPP TS 26.093.
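The routing that the DTX control and operation module performs on the transmission type in Embodiment 2 can be sketched as a simple dispatch. The function name and the use of `bytes` for frames are hypothetical; frame formats themselves are defined by the 3GPP specifications and are not modelled.

```python
from typing import Optional


def select_information_bits(tx_type: str, speech_frame: bytes,
                            sid_frame: bytes, sid_first_frame: bytes) -> Optional[bytes]:
    """Choose what goes into the information bits sent to the access network."""
    if tx_type == "SPEECH_GOOD":
        return speech_frame      # AMR-NB coded speech frame (44)
    if tx_type == "SID_UPDATE":
        return sid_frame         # AMR-NB silence descriptor frame (45)
    if tx_type == "SID_FIRST":
        return sid_first_frame   # SID_FIRST frame formed per TS 26.093
    if tx_type == "NO_DATA":
        return None              # nothing is transmitted; content is irrelevant
    raise ValueError(f"unknown transmission type: {tx_type}")
```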
As described in the technical solution, when a background-noise-mode coded frame is transmitted, both sides reset the excitation signal to the same excitation signal. The decoder sets the quantized energy prediction error according to the frame-energy logarithmic mean index in the SID_UPDATE frame so that it matches the encoder, and for a SID_FIRST frame the decoder sets the quantized energy prediction error according to the logarithmic mean of the frame energy of the several speech-mode frames received before the SID_FIRST frame so that it matches the encoder. When a speech-mode coded frame is transmitted, the decoder uses the parameters in the received coded frame and the encoder uses the parameters that it coded into that frame; starting from a consistent excitation signal and a consistent quantized energy prediction error, both generate consistent subframe excitation signals and synthesized speech. The decoder corresponding to the above encoder can therefore keep its excitation signal and quantized energy prediction error consistent with those of the encoder.
Claims (22)
1. A method of performing adaptive multi-rate narrowband (AMR-NB) coding on an input signal frame in a sequence of input signal frames according to a background-noise coding rate and a non-background-noise coding rate, and of performing AMR-NB coding on the next input signal frame adjacent to that input signal frame, characterized in that:
an excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by coding said input signal frame at said non-background-noise coding rate, a linear prediction synthesis filter is determined from the linear prediction parameters obtained by coding said input signal frame at said non-background-noise coding rate, and the excitation signal is filtered with this linear prediction synthesis filter to generate a synthesized digital audio signal frame;
voice activity detection is performed on said synthesized digital audio signal frame, and the transmission type of discontinuous transmission is determined from the result of said voice activity detection;
if said transmission type is normal speech SPEECH_GOOD, the excitation signal of said input signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in the AMR-NB coded frame of said input signal frame at said non-background-noise coding rate; if said transmission type is not SPEECH_GOOD, the excitation signal of said input signal frame is reset;
the adjacent next input signal frame is coded at the non-background-noise coding rate according to the excitation signal of said input signal frame.
2. The method according to claim 1, characterized in that, if said transmission type is SPEECH_GOOD, the quantized energy prediction errors of the subframes of said input signal frame are generated from the correction factors used in the AMR-NB coded frame of said input signal frame at said non-background-noise coding rate; if said transmission type is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, the quantized energy prediction errors of the subframes of said input signal frame are generated from the quantized logarithmic mean of the frame energy of said input signal frame; if said transmission type is no data NO_DATA, the quantized energy prediction errors of the subframes of the previous input signal frame adjacent to said input signal frame are used as the quantized energy prediction errors of the subframes of said input signal frame;
the adjacent next input signal frame is coded at the non-background-noise coding rate according to the quantized energy prediction errors of the subframes of said input signal frame.
3. The method according to claim 1 or 2, characterized in that:
the voice activity detection performed comprises waveform detection on the synthesized digital audio signal frame.
4. The method according to claim 3, characterized in that said waveform detection on said synthesized digital audio signal frame comprises:
determining a threshold from said synthesized digital audio signal frame and, if the amplitude of the rising edge of a peak in the waveform of said synthesized digital audio signal frame exceeds this threshold, determining the result of said voice activity detection to be speech present.
5. The method according to claim 3, characterized in that said waveform detection on said synthesized digital audio signal frame comprises:
setting a rising-edge threshold and a falling-edge threshold from said synthesized digital audio signal frame, and comparing the rising-edge amplitude and the falling-edge amplitude of a peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the rising-edge amplitude and the falling-edge amplitude of the peak exceed said rising-edge threshold and falling-edge threshold respectively, setting the result of said voice activity detection to speech present.
6. The method according to claim 3, characterized in that said waveform detection on said synthesized digital audio signal frame comprises: determining an amplitude threshold and a range from said synthesized digital audio signal frame and, if the number of peaks in the waveform of said synthesized digital audio signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, determining the result of said voice activity detection to be speech present.
7. The method according to claim 3, characterized in that said waveform detection on said synthesized digital audio signal frame comprises:
setting a rising-edge threshold, a falling-edge threshold and a range from said synthesized digital audio signal frame, and comparing the rising-edge amplitude and the falling-edge amplitude of each peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the number of peaks whose rising-edge amplitude and falling-edge amplitude exceed said rising-edge threshold and falling-edge threshold respectively lies within said range, setting the result of said voice activity detection to speech present.
8. An adaptive multi-rate narrowband (AMR-NB) encoder having a discontinuous transmission control and operation device, wherein said discontinuous transmission control and operation device determines the transmission type TX_TYPE and the coding rate of the AMR-NB coded frame according to a voice activity detection result, and wherein said AMR-NB encoder performs linear prediction on an input audio signal frame, codes said input audio signal frame at said coding rate and outputs an AMR-NB transmit frame of type TX_TYPE, and generates the excitation signal of said input audio signal frame that is used for coding the next input audio signal frame, characterized in that:
a linear prediction synthesis filter is determined from the linear prediction parameters obtained by performing linear prediction on said input audio signal frame;
a speech-mode excitation signal is generated from the adaptive codebook parameters and fixed codebook parameters obtained by adaptive codebook search and fixed codebook search on said input audio signal frame at the speech-mode coding rate;
said speech-mode excitation signal is filtered with said linear prediction synthesis filter to generate a synthesized digital audio signal frame;
said voice activity detection result is obtained from voice activity detection performed on said synthesized digital audio signal frame;
if said TX_TYPE is normal speech SPEECH_GOOD, the AMR-NB transmit frame is coded for the input audio signal frame from the adaptive codebook parameters and fixed codebook parameters obtained by said adaptive codebook search and fixed codebook search on the input audio signal frame, and the excitation signal of said input audio signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in this coded frame;
if said TX_TYPE is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, the AMR-NB transmit frame is coded for the input signal frame at the background-noise coding rate, and the excitation signal of said input audio signal frame is reset;
if said TX_TYPE is no data NO_DATA, the excitation signal of said input audio signal frame is reset.
9. The encoder according to claim 8, further comprising a device that determines the quantized energy prediction errors of the four subframes of said input audio signal frame needed for coding the speech-mode AMR-NB frame of the next input signal frame adjacent to said input audio signal frame, characterized in that the quantized energy prediction errors of the four subframes of said input audio signal frame are determined according to the transmission type TX_TYPE of said input audio signal frame, namely:
when said transmission type is normal speech SPEECH_GOOD, this device generates the quantized energy prediction errors of the four subframes of said input audio signal frame from the correction factors given in the AMR-NB coded frame of said input audio signal frame at the non-background-noise coding rate; when said TX_TYPE is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, this device sets the quantized energy prediction errors of the four subframes of said input audio signal frame to the quantized frame-energy logarithmic mean of said input audio signal frame; if said transmission type is no data NO_DATA, the quantized energy prediction errors of the subframes of the previous input audio signal frame adjacent to said input audio signal frame are used as the quantized energy prediction errors of the subframes of said input audio signal frame.
10. The encoder according to claim 8 or 9,
wherein the voice activity detection performed comprises detection of the waveform of the synthesized digital audio signal frame.
11. The encoder according to claim 10, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: determining a threshold from said synthesized digital audio signal frame, comparing the amplitude of the rising edge of a peak of the waveform in said synthesized digital audio signal frame with said threshold, and, when the amplitude of the rising edge of the peak of said waveform is greater than this threshold, determining the result of said voice activity detection to be speech present.
12. The encoder according to claim 10, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: determining an amplitude threshold and a range from said synthesized digital audio signal frame and, if the number of peaks in the waveform of said synthesized digital audio signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, determining the result of said voice activity detection to be speech present.
13. The encoder according to claim 10, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: setting a rising-edge threshold, a falling-edge threshold and a range from said synthesized digital audio signal frame, and comparing the rising-edge amplitude and the falling-edge amplitude of each peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the number of peaks whose rising-edge amplitude and falling-edge amplitude exceed said rising-edge threshold and falling-edge threshold respectively lies within said range, setting the result of said voice activity detection to speech present.
14. The encoder according to claim 10, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: setting a rising-edge threshold and a falling-edge threshold from said synthesized digital audio signal frame, and comparing the rising-edge amplitude and the falling-edge amplitude of a peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the rising-edge amplitude and the falling-edge amplitude of the peak exceed said rising-edge threshold and falling-edge threshold respectively, setting the result of said voice activity detection to speech present.
15. A method of performing adaptive codebook search, fixed codebook search and adaptive multi-rate narrowband (AMR-NB) coding on an input signal frame in a sequence of input signal frames, and of performing AMR-NB coding at a non-background-noise coding rate on the next input signal frame adjacent to that input signal frame, characterized in that:
linear prediction is performed on said input signal frame and a linear prediction synthesis filter is determined from the resulting linear prediction parameters; adaptive codebook search and fixed codebook search are performed on said input signal frame at the speech-mode coding rate, an excitation signal is generated from the resulting adaptive codebook parameters and fixed codebook parameters, and the excitation signal is filtered with this linear prediction synthesis filter to generate a synthesized digital audio signal frame;
voice activity detection is performed on said synthesized digital audio signal frame, and the transmission type of discontinuous transmission is determined from the result of this voice activity detection;
if said transmission type is normal speech SPEECH_GOOD, the AMR-NB coded frame of said input signal frame is coded at said speech-mode coding rate, and the excitation signal of said input signal frame is generated from the adaptive codebook parameters and fixed codebook parameters used in this coded frame; if the transmission type is silence descriptor update SID_UPDATE or silence descriptor first SID_FIRST, the AMR-NB silence descriptor (AMR-NB_SID) frame of said input signal frame is generated by coding at the background-noise coding rate;
if said transmission type is not SPEECH_GOOD, the excitation signal of said input signal frame is reset;
the adjacent next input signal frame is coded at the non-background-noise speech-mode coding rate according to the excitation signal of said input signal frame.
16. The method according to claim 15, characterized in that:
if said transmission type is normal speech SPEECH_GOOD, the AMR-NB frame of said input signal frame at the non-background-noise coding rate is generated, and the quantized energy prediction error is generated from the correction factors in this AMR-NB frame;
if said transmission type is silence descriptor first SID_FIRST or silence descriptor update SID_UPDATE, the quantized energy prediction error of said input digital audio frame is set to the quantized frame-energy logarithmic mean of this input signal frame;
if said transmission type is no data NO_DATA, the quantized energy prediction errors of the subframes of the previous input signal frame adjacent to said input signal frame are used as the quantized energy prediction errors of the subframes of said input signal frame.
17. The method according to claim 15 or 16, characterized in that:
the voice activity detection performed comprises detection of the waveform of the synthesized digital audio signal frame.
18. The method according to claim 17, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises setting said voice activity detection result to speech present when the amplitude of the rising edge of a peak of the waveform in said synthesized digital audio signal frame exceeds a threshold.
19. The method according to claim 17, characterized in that:
said detection of the waveform of the synthesized digital audio signal frame sets a rising-edge threshold and a falling-edge threshold from said synthesized digital audio signal frame, and compares the rising-edge amplitude and the falling-edge amplitude of a peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the rising-edge amplitude and the falling-edge amplitude of the peak exceed said rising-edge threshold and falling-edge threshold respectively, the result of said voice activity detection is set to speech present.
20. The method according to claim 17, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: determining an amplitude threshold and a range from said synthesized digital audio signal frame and, if the number of peaks in the waveform of said synthesized digital audio signal frame whose rising-edge amplitude exceeds this amplitude threshold lies within said range, determining the result of said voice activity detection to be speech present.
21. The method according to claim 17, characterized in that
said detection of the waveform of the synthesized digital audio signal frame comprises: setting a rising-edge threshold, a falling-edge threshold and a range from said synthesized digital audio signal frame, and comparing the rising-edge amplitude and the falling-edge amplitude of each peak in the waveform of said synthesized digital audio signal frame with the set rising-edge threshold and falling-edge threshold respectively; if the number of peaks whose rising-edge amplitude and falling-edge amplitude exceed said rising-edge threshold and falling-edge threshold respectively lies within said range, setting the result of said voice activity detection to speech present.
22. The method according to claim 18, characterized in that said threshold is determined from said synthesized digital audio signal frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100966172A CN101399043A (en) | 2007-07-30 | 2008-04-29 | Self-adapting multi-speed narrowband coding method and coder |
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710044336.8 | 2007-07-30 | ||
CN200710044336 | 2007-07-30 | ||
CN200710045982.6 | 2007-09-14 | ||
CN200710172563.9 | 2007-12-19 | ||
CNA2008100966172A CN101399043A (en) | 2007-07-30 | 2008-04-29 | Self-adapting multi-speed narrowband coding method and coder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101399043A true CN101399043A (en) | 2009-04-01 |
Family
ID=40331904
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200710153938.7A Expired - Fee Related CN101359978B (en) | 2007-07-30 | 2007-09-14 | Method for control of rate variant multi-mode wideband encoding rate |
CNA2008100966172A Pending CN101399043A (en) | 2007-07-30 | 2008-04-29 | Self-adapting multi-speed narrowband coding method and coder |
CNA2008100882656A Pending CN101359474A (en) | 2007-07-30 | 2008-04-29 | AMR-WB coding method and encoder |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200710153938.7A Expired - Fee Related CN101359978B (en) | 2007-07-30 | 2007-09-14 | Method for control of rate variant multi-mode wideband encoding rate |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008100882656A Pending CN101359474A (en) | 2007-07-30 | 2008-04-29 | AMR-WB coding method and encoder |
Country Status (1)
Country | Link |
---|---|
CN (3) | CN101359978B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111429927A (en) * | 2020-03-11 | 2020-07-17 | 云知声智能科技股份有限公司 | Method for improving personalized synthesized voice quality |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102254562B (en) * | 2011-06-29 | 2013-04-03 | 北京理工大学 | Method for coding variable speed audio frequency switching between adjacent high/low speed coding modes |
CN102385863B (en) * | 2011-10-10 | 2013-02-20 | 杭州米加科技有限公司 | Sound coding method based on speech music classification |
CN107978325B (en) * | 2012-03-23 | 2022-01-11 | 杜比实验室特许公司 | Voice communication method and apparatus, method and apparatus for operating jitter buffer |
CN102723968B (en) * | 2012-05-30 | 2017-01-18 | 中兴通讯股份有限公司 | Method and device for increasing capacity of empty hole |
WO2014192604A1 (en) * | 2013-05-31 | 2014-12-04 | ソニー株式会社 | Encoding device and method, decoding device and method, and program |
CN103337243B (en) * | 2013-06-28 | 2017-02-08 | 大连理工大学 | Method for converting AMR code stream into AMR-WB code stream |
KR101621780B1 (en) * | 2014-03-28 | 2016-05-17 | 숭실대학교산학협력단 | Method fomethod for judgment of drinking using differential frequency energy, recording medium and device for performing the method |
CN105609118B (en) * | 2015-12-30 | 2020-02-07 | 生迪智慧科技有限公司 | Voice detection method and device |
CN110444192A (en) * | 2019-08-15 | 2019-11-12 | 广州科粤信息科技有限公司 | A kind of intelligent sound robot based on voice technology |
CN110619881B (en) * | 2019-09-20 | 2022-04-15 | 北京百瑞互联技术有限公司 | Voice coding method, device and equipment |
CN113611325B (en) * | 2021-04-26 | 2023-07-04 | 珠海市杰理科技股份有限公司 | Voice signal speed change method and device based on clear and voiced sound and audio equipment |
CN113345446B (en) * | 2021-06-01 | 2024-02-27 | 广州虎牙科技有限公司 | Audio processing method, device, electronic equipment and computer readable storage medium |
CN115711591B (en) * | 2022-09-29 | 2024-03-15 | 成都飞机工业(集团)有限责任公司 | Gamma factor acquisition method, device, equipment and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1186765C (en) * | 2002-12-19 | 2005-01-26 | 北京工业大学 | Method for encoding 2.3kb/s harmonic wave excidted linear prediction speech |
CN1275223C (en) * | 2004-12-31 | 2006-09-13 | 苏州大学 | A low bit-rate speech coder |
Also Published As
Publication number | Publication date |
---|---|
CN101359978B (en) | 2014-01-29 |
CN101359978A (en) | 2009-02-04 |
CN101359474A (en) | 2009-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101399043A (en) | Self-adapting multi-speed narrowband coding method and coder | |
KR101785885B1 (en) | Adaptive bandwidth extension and apparatus for the same | |
CN102934163B (en) | Systems, methods, apparatus, and computer program products for wideband speech coding | |
CN1307614C (en) | Method and arrangement for synthesizing speech | |
US8090573B2 (en) | Selection of encoding modes and/or encoding rates for speech compression with open loop re-decision | |
CN103258541A (en) | Adaptive time/frequency-based audio encoding and decoding apparatuses and methods | |
CA2188493A1 (en) | Speech encoding/decoding method and apparatus using lpc residuals | |
CN1334952A (en) | Coded enhancement feature for improved performance in coding communication signals | |
JP6262337B2 (en) | Gain shape estimation for improved tracking of high-band temporal characteristics | |
CN105359211A (en) | Unvoiced/voiced decision for speech processing | |
CN104126201A (en) | System and method for mixed codebook excitation for speech coding | |
WO1999065017A1 (en) | Speech coding apparatus and speech decoding apparatus | |
CN101388214B (en) | Speed changing vocoder and coding method thereof | |
EP2132733A1 (en) | Non-causal postfilter | |
CN101572090B (en) | Self-adapting multi-rate narrowband coding method and coder | |
CN101609682B (en) | Encoder and method for self adapting to discontinuous transmission of multi-rate wideband | |
CN101609683B (en) | Encoder and method for self adapting to discontinuous transmission of multi-rate narrowband | |
Sun et al. | Speech compression | |
Yoon et al. | An efficient transcoding algorithm for G. 723.1 and G. 729A speech coders: interoperability between mobile and IP network | |
CN101572091A (en) | Self-adapting multi-rate broadband coding method and coder | |
Li et al. | Basic audio compression techniques | |
Xydeas | An overview of speech coding techniques | |
CN1964244A (en) | A method to receive and transmit digital signal using vocoder | |
CN101373595A (en) | Self-adapting multi-velocity encoder with fixed velocity and coding method thereof | |
EP1212750A1 (en) | Multimode vselp speech coder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20090401 |