
CN102332264A - Robust Active Speech Detection Method - Google Patents


Info

Publication number
CN102332264A
Authority
CN
China
Prior art keywords: voice, time domain, detection method, domain energy, energy sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110281881
Other languages
Chinese (zh)
Inventor
韩纪庆 (Han Jiqing)
游大涛 (You Datao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN 201110281881
Publication of CN102332264A
Legal status: Pending (current)

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract


The invention discloses a robust active speech detection method, which belongs to the field of audio signal processing. It aims to solve the problem that existing active speech detection methods are based on frequency-domain audio features extracted by the Fourier transform, and such features lack robustness to noise. The method comprises: one, sampling a large amount of historical speech data and training a speech dictionary set; two, performing sparse decomposition of the input speech signal over the dictionary set and extracting the sparse coefficients C of the speech; three, reconstructing the sparsely decomposed speech signal S̃ from C; four, obtaining the time-domain energy sequence E of the reconstructed signal S̃; five, designing a short-time window W_1 and computing the score y_n; six, designing a long-time window W_2 and computing the decision threshold β_n; seven, judging whether y_n > β_n holds: if so, the input speech signal S is determined to be speech; if not, it is determined to be non-speech, thereby completing the detection of active speech.

Description

Robust active speech detection method
Technical field
The present invention relates to a robust active speech detection method, and in particular to an active speech detection method that improves coding efficiency and channel utilization; it belongs to the field of audio signal processing.
Background technology
Active speech detection is a technique that exploits the differences between speech signals and noise signals to automatically recognize speech segments and non-speech segments. It is an important technique in audio signal processing, particularly in instant-messaging applications where bandwidth is limited and voice traffic is heavy: without degrading communication quality, active speech detection can remove the silent portions of a voice stream and thereby effectively improve coding efficiency and channel utilization. Although active speech detection has made real progress, some important problems remain unsolved; in particular, under low signal-to-noise-ratio, non-stationary noise conditions its performance still needs improvement. At present, the vast majority of detection methods are based on frequency-domain audio features extracted by the Fourier transform, but this type of feature lacks robustness to noise (particularly non-stationary noise), and this defect is the basic factor limiting further performance gains. To improve the performance of active speech detection, it is necessary to study noise-robust transform techniques and to design new detection methods on that basis.
Summary of the invention
The purpose of the present invention is to solve the problem that existing active speech detection methods are based on frequency-domain audio features extracted by the Fourier transform, features which lack robustness to noise (particularly non-stationary noise) and therefore limit detection performance; to this end, a robust active speech detection method is provided.
The robust active speech detection method according to the invention comprises the following steps:
Step 1: sample a large amount of historical speech data and train from it a speech dictionary set Ψ ∈ R^(L×D), where R denotes the real number space and L and D are natural numbers greater than 0, each denoting a spatial dimension;
Step 2: using the dictionary set Ψ obtained in step 1, perform sparse decomposition of the input speech signal S = {s_1, s_2, …, s_N} ∈ R^(L×N) and extract the sparse coefficients C = {c_1, c_2, …, c_N} ∈ R^(D×N) of the speech, where N is a natural number denoting a spatial dimension;
Step 3: from the sparse coefficients C obtained in step 2, reconstruct the sparsely decomposed speech signal S̃ = {s̃_1, s̃_2, …, s̃_N} ∈ R^(L×N);
Step 4: obtain the time-domain energy sequence E = {e_1, e_2, …, e_N}, e_n ∈ R, of the reconstructed speech signal S̃;
Step 5: design a short-time window W_1, slide it along the time-domain energy sequence E as a convolution, and take each computed result STME_n as the score y_n of the particular frame s_n, where n = 1, …, N; the length of W_1 is taken from the range [2+1, 2×10+1], i.e. [3, 21];
Step 6: design a long-time window W_2, slide it along the time-domain energy sequence E as a convolution, and take each computed result LTME_n as the decision threshold β_n of the particular frame s_n; the length of W_2 is taken from the range [1000, 1000×10]; when n < 6000, n itself is used as the window length;
Step 7: judge whether y_n > β_n holds; if so, the input speech signal S is determined to be speech; if not, it is determined to be non-speech, thereby completing the detection of active speech.
Advantage of the present invention: under low signal-to-noise-ratio, non-stationary noise interference, the method can efficiently discriminate the speech and non-speech segments in an audio sequence.
Description of drawings
Fig. 1 is a flow chart of the method of the present invention.
Specific embodiments
Embodiment one: this embodiment is described below in conjunction with Fig. 1. The robust active speech detection method of this embodiment comprises the following steps:
Step 1: sample a large amount of historical speech data and train from it a speech dictionary set Ψ ∈ R^(L×D), where R denotes the real number space and L and D are natural numbers greater than 0, each denoting a spatial dimension;
Step 2: using the dictionary set Ψ obtained in step 1, perform sparse decomposition of the input speech signal S = {s_1, s_2, …, s_N} ∈ R^(L×N) and extract the sparse coefficients C = {c_1, c_2, …, c_N} ∈ R^(D×N) of the speech, where N is a natural number denoting a spatial dimension;
Step 3: from the sparse coefficients C obtained in step 2, reconstruct the sparsely decomposed speech signal S̃ = {s̃_1, s̃_2, …, s̃_N} ∈ R^(L×N);
Step 4: obtain the time-domain energy sequence E = {e_1, e_2, …, e_N}, e_n ∈ R, of the reconstructed speech signal S̃;
Step 5: design a short-time window W_1, slide it along the time-domain energy sequence E as a convolution, and take each computed result STME_n as the score y_n of the particular frame s_n, where n = 1, …, N; the length of W_1 is taken from the range [2+1, 2×10+1], i.e. [3, 21];
Step 6: design a long-time window W_2, slide it along the time-domain energy sequence E as a convolution, and take each computed result LTME_n as the decision threshold β_n of the particular frame s_n; the length of W_2 is taken from the range [1000, 1000×10]; when n < 6000, n itself is used as the window length;
Step 7: judge whether y_n > β_n holds; if so, the input speech signal S is determined to be speech; if not, it is determined to be non-speech, thereby completing the detection of active speech. A code sketch of steps 2 through 7 follows.
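The decision pipeline of steps 2 through 7 is compact enough to sketch in Python/NumPy. The sketch below assumes a trained dictionary Psi from step 1; the ISTA solver for the sparse-coding step, the default lam, and the window defaults I1=7 and I2=6000 (the recommended values of embodiments seven and nine) are illustrative assumptions, not the patent's prescribed implementation.

```python
import numpy as np

def sparse_code(X, Psi, lam=10.0, n_iter=100):
    # Step 2: C = argmin_C ||C||_1 + lam * ||X - Psi C||_2^2,
    # solved here with ISTA (the solver choice is an assumption).
    C = np.zeros((Psi.shape[1], X.shape[1]))
    t = 1.0 / (2.0 * lam * np.linalg.norm(Psi, 2) ** 2)   # step <= 1/Lipschitz
    for _ in range(n_iter):
        Z = C - t * 2.0 * lam * (Psi.T @ (Psi @ C - X))   # gradient step
        C = np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)   # soft threshold (prox of t*||.||_1)
    return C

def detect_speech(S, Psi, I1=7, I2=6000):
    C = sparse_code(S, Psi)                       # step 2: sparse coefficients
    S_rec = Psi @ C                               # step 3: reconstruction, S~ = Psi C
    E = np.sum(S_rec ** 2, axis=0)                # step 4: time-domain energy per frame
    N = len(E)
    y = np.empty(N)                               # step 5: short-window scores
    beta = np.empty(N)                            # step 6: long-window thresholds
    for n in range(N):
        y[n] = E[max(0, n - I1 + 1): n + 1].mean()   # causal short window
        L2 = min(n + 1, I2)                          # "use n as the length" while n < I2
        beta[n] = E[n + 1 - L2: n + 1].mean()        # causal long window
    return y > beta                               # step 7: True = speech frame
```

The frame matrix S must be column-stacked (L × N), matching the S ∈ R^(L×N) convention of step 2; the returned boolean array marks the frames declared speech.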
Embodiment two: this embodiment further describes embodiment one; the training process of the speech dictionary set of step 1 is:
Step 11: initialize the speech dictionary set Ψ_0 ∈ R^(L×D) with cosine functions, where L equals the length of a speech frame and D is an integer greater than L;
Step 12: train the speech dictionary set; the large amount of historical speech data collected for training comes from an existing corpus, and training cycles through the following steps until convergence:
Step a: from the historical speech data and the current dictionary set, compute the sparse coefficients C of the speech using the SVD algorithm:
C = argmin_C ||C||_1 + λ||X − ΨC||_2^2;
Step b: update the speech dictionary set from the sparse coefficients C obtained in step a:
Ψ̃ = argmin_Ψ ||C||_1 + λ||X − ΨC||_2^2, with Ψ = Ψ_0 in the first training round;
Step c: judge whether the convergence formula holds [the formula appears in the source only as an illegible figure]; if it holds, set Ψ = Ψ̃ and return to step b for the next round of updates; otherwise the update ends and the speech dictionary set is obtained. Here δ is the sparsity threshold, satisfying a relation likewise given only as an illegible figure. A sketch of this training loop follows.
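A minimal sketch of step 12's alternating loop, reusing the sparse_code routine from the sketch after embodiment one. The DCT-like cosine initialisation follows step 11; the least-squares dictionary update with column renormalisation and the change-based stopping test with threshold delta are assumptions standing in for the solver of step b and the unrecoverable formula of step c.

```python
import numpy as np

def train_dictionary(X, D, lam=10.0, delta=1e-3, max_rounds=50):
    # Step 11: cosine (DCT-II-like) initialisation with D > L overcomplete atoms.
    L = X.shape[0]
    n = np.arange(L)[:, None]
    k = np.arange(D)[None, :]
    Psi = np.cos(np.pi * (n + 0.5) * k / D)
    Psi /= np.linalg.norm(Psi, axis=0, keepdims=True)
    for _ in range(max_rounds):
        C = sparse_code(X, Psi, lam)                      # step a: sparse coefficients
        # Step b: with C fixed, minimising ||X - Psi C||_2^2 over Psi is a
        # least-squares fit (the concrete solver is an assumption).
        Psi_new = X @ C.T @ np.linalg.pinv(C @ C.T)
        Psi_new /= np.linalg.norm(Psi_new, axis=0, keepdims=True) + 1e-12
        change = np.linalg.norm(Psi_new - Psi)
        Psi = Psi_new
        if change < delta:                                # step c stand-in: stop on convergence
            break
    return Psi
```

Column renormalisation after each update is a standard dictionary-learning safeguard against atoms growing to absorb the sparsity penalty; the patent text does not state it explicitly.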
Embodiment three: this embodiment further describes embodiment one; in step 2 the sparse coefficients C of the speech are extracted over the dictionary set obtained in step 1 by the formula:
C = argmin_C ||C||_1 + λ||X − ΨC||_2^2.
Embodiment four: this embodiment further describes embodiment one; in step 3 the sparsely decomposed speech signal S̃ is reconstructed by the formula:
S̃ = ΨC.
Embodiment five: this embodiment further describes embodiment one; the short-time window W_1 of step 5 is obtained as follows:
design a short-time window W_1 = {w^1_1, w^1_2, …, w^1_{I1}}, where w^1_1 = w^1_2 = … = w^1_{I1} = 1/I_1.
W_1 is convolved with the segment {e_{n−I1+1}, e_{n−I1+2}, …, e_n} of the time-domain energy sequence E, and the result y_n is taken as the score of the speech frame s_n corresponding to the last element e_n. Afterwards W_1 slides one position forward along E, the next round of convolution is carried out with {e_{n−I1+2}, e_{n−I1+3}, …, e_{n+1}}, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence.
This embodiment is the detection method under high real-time requirements, since the window is causal. A sketch of this causal averaging follows.
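Since all weights of W_1 equal 1/I_1, the convolution of embodiment five is simply a causal moving average. A sketch follows; the uniform window is a direct reading of the text, while the handling of the first I_1 − 1 frames (which see fewer than I_1 energies) is an assumption.

```python
import numpy as np

def short_scores_causal(E, I1=7):
    # y_n = (e_{n-I1+1} + ... + e_n) / I1: trailing average ending at frame n.
    W1 = np.full(I1, 1.0 / I1)
    y = np.convolve(E, W1, mode="full")[:len(E)]
    # Early frames cover fewer than I1 energies; rescale them to true means.
    counts = np.minimum(np.arange(1, len(E) + 1), I1)
    return y * (I1 / counts)
```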
Embodiment six: this embodiment further describes embodiment one; the short-time window W_1 of step 5 is obtained as follows:
design a short-time window W_1 = {w^1_1, w^1_2, …, w^1_{I1}}, where w^1_1 = w^1_2 = … = w^1_{I1} = 1/I_1 and I_1 is an odd number greater than 0.
The short-time window W_1 is convolved with a segment of the time-domain energy sequence centred on e_n (read here, since I_1 is odd, as {e_{n−(I1−1)/2}, …, e_{n+(I1−1)/2}}; the exact span appears in the source only as a figure), and the result y_n is taken as the score of the speech frame s_n corresponding to e_n. Afterwards the window slides one position forward along the energy sequence, the next round of convolution is carried out, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence.
This embodiment is the detection method under low real-time requirements, since the centred window is non-causal. A sketch follows.
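Embodiment six's odd-length window averages a span centred on the current frame, which in NumPy is a mode="same" convolution; the centring itself is the reading of the unrecoverable figure noted above.

```python
import numpy as np

def short_scores_centered(E, I1=7):
    # y_n averages {e_{n-(I1-1)/2}, ..., e_{n+(I1-1)/2}}; it needs future
    # energies, hence the "low real-time" label in the text.
    assert I1 > 0 and I1 % 2 == 1, "I1 must be a positive odd number"
    W1 = np.full(I1, 1.0 / I1)
    return np.convolve(E, W1, mode="same")   # centred moving average
```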
Embodiment seven: this embodiment further describes embodiments five and six; the length of the short-time window W_1 is 7. This value is a recommendation.
Embodiment eight: this embodiment further describes embodiment one; the long-time window W_2 of step 6 is obtained as follows:
design a long-time window W_2 = {w^2_1, w^2_2, …, w^2_{I2}}, where w^2_1 = w^2_2 = … = w^2_{I2} = 1/I_2.
W_2 is convolved with the segment {e_{n−I2+1}, e_{n−I2+2}, …, e_n} of the time-domain energy sequence E, and the result β_n is taken as the decision threshold of the speech frame s_n corresponding to the last element e_n. Afterwards W_2 slides one position forward along the energy sequence, the next round of convolution is carried out with {e_{n−I2+2}, e_{n−I2+3}, …, e_{n+1}}, and the result β_{n+1} is taken as the decision threshold of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence. Here I_2 is a natural number much larger than I_1.
Embodiment nine: this embodiment further describes embodiment eight; the length of the long-time window W_2 is 6000. This value is a recommendation. A sketch of the threshold computation follows.
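Because W_2 is long (recommended length 6000) and n itself is used as the length while n < 6000, the thresholds of embodiments eight and nine reduce to a running mean over a growing-then-fixed window; a cumulative-sum sketch under those assumptions:

```python
import numpy as np

def long_thresholds(E, I2=6000):
    # beta_n = mean(e_{n-I2+1}, ..., e_n); while n < I2 the window
    # length is n itself, i.e. a prefix mean over all energies so far.
    E = np.asarray(E, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(E)))
    n = np.arange(1, len(E) + 1)
    length = np.minimum(n, I2)               # use n as length until n reaches I2
    return (csum[n] - csum[n - length]) / length
```

The cumulative-sum form avoids an O(N·I2) sliding loop, which matters at I2 = 6000.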

Claims (9)

1. A robust active speech detection method, characterized in that the method comprises the following steps:
Step 1: sample a large amount of historical speech data and train from it a speech dictionary set Ψ ∈ R^(L×D), where R denotes the real number space and L and D are natural numbers greater than 0, each denoting a spatial dimension;
Step 2: using the dictionary set Ψ obtained in step 1, perform sparse decomposition of the input speech signal S = {s_1, s_2, …, s_N} ∈ R^(L×N) and extract the sparse coefficients C = {c_1, c_2, …, c_N} ∈ R^(D×N) of the speech, where N is a natural number denoting a spatial dimension;
Step 3: from the sparse coefficients C obtained in step 2, reconstruct the sparsely decomposed speech signal S̃ = {s̃_1, s̃_2, …, s̃_N} ∈ R^(L×N);
Step 4: obtain the time-domain energy sequence E = {e_1, e_2, …, e_N}, e_n ∈ R, of the reconstructed speech signal S̃;
Step 5: design a short-time window W_1, slide it along the time-domain energy sequence E as a convolution, and take each computed result STME_n as the score y_n of the particular frame s_n, where n = 1, …, N; the length of W_1 is taken from the range [2+1, 2×10+1], i.e. [3, 21];
Step 6: design a long-time window W_2, slide it along the time-domain energy sequence E as a convolution, and take each computed result LTME_n as the decision threshold β_n of the particular frame s_n; the length of W_2 is taken from the range [1000, 1000×10]; when n < 6000, n itself is used as the window length;
Step 7: judge whether y_n > β_n holds; if so, the input speech signal S is determined to be speech; if not, it is determined to be non-speech, thereby completing the detection of active speech.
2. The robust active speech detection method according to claim 1, characterized in that the training process of the speech dictionary set of step 1 is:
Step 11: initialize the speech dictionary set Ψ_0 ∈ R^(L×D) with cosine functions, where L equals the length of a speech frame and D is an integer greater than L;
Step 12: train the speech dictionary set; the large amount of historical speech data collected for training comes from an existing corpus, and training cycles through the following steps until convergence:
Step a: from the historical speech data and the current dictionary set, compute the sparse coefficients C of the speech using the SVD algorithm:
C = argmin_C ||C||_1 + λ||X − ΨC||_2^2;
Step b: update the speech dictionary set from the sparse coefficients C obtained in step a:
Ψ̃ = argmin_Ψ ||C||_1 + λ||X − ΨC||_2^2, with Ψ = Ψ_0 in the first training round;
Step c: judge whether the convergence formula holds [the formula appears in the source only as an illegible figure]; if it holds, set Ψ = Ψ̃ and return to step b for the next round of updates; otherwise the update ends and the speech dictionary set is obtained,
where δ is the sparsity threshold, satisfying a relation likewise given only as an illegible figure.
3. The robust active speech detection method according to claim 1, characterized in that in step 2 the sparse coefficients C of the speech are extracted over the dictionary set obtained in step 1 by the formula:
C = argmin_C ||C||_1 + λ||X − ΨC||_2^2.
4. The robust active speech detection method according to claim 1, characterized in that in step 3 the sparsely decomposed speech signal S̃ is reconstructed by the formula:
S̃ = ΨC.
5. The robust active speech detection method according to claim 1, characterized in that the short-time window W_1 of step 5 is obtained as follows:
design a short-time window W_1 = {w^1_1, w^1_2, …, w^1_{I1}}, where w^1_1 = w^1_2 = … = w^1_{I1} = 1/I_1;
W_1 is convolved with the segment {e_{n−I1+1}, e_{n−I1+2}, …, e_n} of the time-domain energy sequence E, and the result y_n is taken as the score of the speech frame s_n corresponding to the last element e_n; afterwards W_1 slides one position forward along E, the next round of convolution is carried out with {e_{n−I1+2}, e_{n−I1+3}, …, e_{n+1}}, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence.
6. The robust active speech detection method according to claim 1, characterized in that the short-time window W_1 of step 5 is obtained as follows:
design a short-time window W_1 = {w^1_1, w^1_2, …, w^1_{I1}}, where w^1_1 = w^1_2 = … = w^1_{I1} = 1/I_1 and I_1 is an odd number greater than 0;
the short-time window W_1 is convolved with a segment of the time-domain energy sequence centred on e_n (read here, since I_1 is odd, as {e_{n−(I1−1)/2}, …, e_{n+(I1−1)/2}}; the exact span appears in the source only as a figure), and the result y_n is taken as the score of the speech frame s_n corresponding to e_n; afterwards the window slides one position forward along the energy sequence, the next round of convolution is carried out, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence.
7. The robust active speech detection method according to claim 5 or 6, characterized in that the length of the short-time window W_1 is 7.
8. The robust active speech detection method according to claim 1, characterized in that the long-time window W_2 of step 6 is obtained as follows:
design a long-time window W_2 = {w^2_1, w^2_2, …, w^2_{I2}}, where w^2_1 = w^2_2 = … = w^2_{I2} = 1/I_2;
W_2 is convolved with the segment {e_{n−I2+1}, e_{n−I2+2}, …, e_n} of the time-domain energy sequence E, and the result β_n is taken as the decision threshold of the speech frame s_n corresponding to the last element e_n; afterwards W_2 slides one position forward along the energy sequence, the next round of convolution is carried out with {e_{n−I2+2}, e_{n−I2+3}, …, e_{n+1}}, and the result β_{n+1} is taken as the decision threshold of the speech frame s_{n+1} corresponding to e_{n+1}; the computation is repeated until the end of the sequence,
where I_2 is a natural number much larger than I_1.
9. The robust active speech detection method according to claim 8, characterized in that the length of the long-time window W_2 is 6000.
CN 201110281881, filed 2011-09-21 (priority date 2011-09-21): Robust Active Speech Detection Method, published as CN102332264A (status: pending).

Priority Applications (1)

CN 201110281881 (published as CN102332264A), priority date 2011-09-21, filing date 2011-09-21: Robust Active Speech Detection Method


Publications (1)

CN102332264A, published 2012-01-25

Family

ID=45484020

Family Applications (1)

CN 201110281881, priority/filing date 2011-09-21: CN102332264A (pending)

Country Status (1)

Country Link
CN (1) CN102332264A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6993481B2 (en) * 2000-12-04 2006-01-31 Global Ip Sound Ab Detection of speech activity using feature model adaptation
CN101009099A * 2007-01-26 2007-08-01 Vimicro Corporation (北京中星微电子有限公司) Digital auto gain control method and device
CN101606196A * 2007-02-14 2009-12-16 Mindspeed Technologies (曼德斯必德技术公司) Embedded silence and background noise compression
US7769585B2 (en) * 2007-04-05 2010-08-03 Avidyne Corporation System and method of voice activity detection in noisy environments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
You Datao (游大涛) et al., "Active speech detection algorithm based on long- and short-time energy means," Intelligent Computer and Applications (《智能计算机与应用》), vol. 1, no. 2, pp. 35-39, 30 August 2011. *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104867495B * 2013-08-28 2020-10-16 Texas Instruments (德州仪器公司) Sound recognition apparatus and method of operating the same
CN107403618A * 2017-07-21 2017-11-28 Shandong Normal University (山东师范大学) Audio event classification method and computer device based on stacked-basis sparse representation
CN107403618B * 2017-07-21 2020-05-05 Shandong Normal University (山东师范大学) Audio event classification method based on stacked-basis sparse representation and computer equipment


Legal Events

  • C06 / PB01: Publication (application publication date: 2012-01-25)
  • C10 / SE01: Entry into substantive examination (entry into force of request for substantive examination)
  • C02 / WD01: Invention patent application deemed withdrawn after publication (Patent Law 2001)