CN102332264A

CN102332264A - Robust Active Speech Detection Method

Info

Publication number: CN102332264A
Application number: CN 201110281881
Authority: CN
Inventors: 韩纪庆; 游大涛
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2011-09-21
Filing date: 2011-09-21
Publication date: 2012-01-25

Abstract

The invention discloses a robust active speech detection method in the field of audio signal processing. It addresses the problem that existing active speech detection methods rely on frequency-domain audio features extracted by the Fourier transform, and such features lack robustness to noise. The method comprises: step 1, sampling a large amount of historical speech data and training a speech dictionary set; step 2, performing sparse decomposition of the input speech signal over the speech dictionary set and extracting the sparse coefficients C of the speech; step 3, reconstructing the sparsely decomposed speech signal S̃ from the sparse coefficients C; step 4, obtaining the time-domain energy sequence E of the reconstructed speech signal; step 5, designing a short-time window W₁ and calculating the score y_n; step 6, designing a long-time window W₂ and calculating the decision threshold β_n; step 7, judging whether y_n > β_n holds: if yes, the input speech signal S is determined to be speech; if not, it is determined to be non-speech, which completes the detection of active speech.

Description

Robust active speech detection method

Technical field

The present invention relates to a robust active speech detection method, and in particular to an active speech detection method that improves coding efficiency and channel utilization. It belongs to the field of audio signal processing.

Background technology

Active speech detection is a technique that automatically identifies speech segments and non-speech segments by exploiting the differences in the information carried by speech signals and noise signals. It is an important technique in audio signal processing. In instant communication in particular, where bandwidth is limited and voice traffic is very large, active speech detection can remove the silent parts of a voice stream without affecting communication quality, and thereby effectively improve coding efficiency and channel utilization.

Although active speech detection has made real progress, some important problems remain unsolved. In particular, its performance under low signal-to-noise ratio and non-stationary noise still needs to be improved. At present, the overwhelming majority of detection methods are based on frequency-domain audio features extracted by the Fourier transform, but such features lack robustness to noise (non-stationary noise in particular), and this defect is the fundamental factor limiting the performance of active speech detection. To improve that performance further, it is necessary to study transform techniques that are robust to noise and to design new detection methods on that basis.

Summary of the invention

The object of the invention is to solve the problem that existing active speech detection methods are based on frequency-domain audio features extracted by the Fourier transform, that such features lack robustness to noise (non-stationary noise in particular), and that the performance of active speech detection suffers as a result. To this end, a robust active speech detection method is provided.

The robust active speech detection method of the invention comprises the following steps:

Step 1: sample a large amount of historical speech data and train a speech dictionary set Ψ ∈ R^{L×D} from it, where R denotes the space of real numbers, and L and D are natural numbers greater than 0, each denoting a spatial dimension;

Step 2: using the speech dictionary set Ψ obtained in step 1, perform sparse decomposition of the input speech signal S = {s_1, s_2, ..., s_N} ∈ R^{L×N} and extract the sparse coefficients C = {c_1, c_2, ..., c_N} ∈ R^{D×N} of the speech, where N is a natural number denoting a spatial dimension;

Step 3: from the sparse coefficients C obtained in step 2, reconstruct the sparsely decomposed speech signal

\tilde{S} = \{\tilde{s}_{1}, \tilde{s}_{2}, \ldots, \tilde{s}_{N}\} \in R^{L \times N};

Step 4: obtain the time-domain energy sequence E = {e_1, e_2, ..., e_N} ∈ R of the speech signal reconstructed in step 3;

Step 5: design a short-time window W₁, slide it along the time-domain energy sequence E in a convolution operation, and take each result STME_n as the score y_n of the corresponding frame s_n, where n = 1, ..., N and the length of W₁ lies in the range [2+1, 2×10+1], that is, [3, 21];

Step 6: design a long-time window W₂, slide it along the time-domain energy sequence E in a convolution operation, and take each result LTME_n as the decision threshold β_n of the corresponding frame s_n, where the length of W₂ lies in the range [1000, 1000×10], and when n < 6000, n is taken as the length;

Step 7: judge whether y_n > β_n holds. If yes, the input speech signal S is determined to be speech; if not, the input speech signal S is determined to be non-speech. This completes the detection of active speech.
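As an illustration only, steps 5 to 7 can be sketched in Python with NumPy once the energy sequence E is available. The names `trailing_mean` and `detect_speech` are hypothetical. Uniform window weights are taken from embodiments five and eight, and shortening both windows at the start of the sequence is an assumption (the text states this rule only for the long-time window).

```python
import numpy as np

def trailing_mean(energy, length):
    """Mean of the last `length` energies ending at each frame (fewer at the start)."""
    energy = np.asarray(energy, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(energy)))  # csum[k] = e_1 + ... + e_k
    n = np.arange(1, energy.size + 1)                  # frame index n = 1..N
    start = np.maximum(n - length, 0)                  # frames before the window
    return (csum[n] - csum[start]) / (n - start)

def detect_speech(energy, short_len=7, long_len=6000):
    """Return a boolean array that is True where y_n > beta_n."""
    y = trailing_mean(energy, short_len)    # step 5: score y_n
    beta = trailing_mean(energy, long_len)  # step 6: decision threshold beta_n
    return y > beta                         # step 7: speech / non-speech decision
```

The default lengths 7 and 6000 are the values recommended in embodiments seven and nine. Cumulative sums are used so that the long window costs the same per frame as the short one.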

Advantage of the invention: the speech detection method of the invention can efficiently discriminate speech segments from non-speech segments in an audio sequence under low signal-to-noise ratio and non-stationary noise interference.

Description of drawings

Fig. 1 is a flow chart of the method of the invention.

Embodiment

Embodiment one: this embodiment is described below with reference to Fig. 1. The robust active speech detection method of this embodiment comprises steps 1 to 7 as set out in the Summary of the invention, in which step 3 reconstructs the sparsely decomposed speech signal

\tilde{S} = \{\tilde{s}_{1}, \tilde{s}_{2}, \ldots, \tilde{s}_{N}\} \in R^{L \times N};

Step 4: obtain the time-domain energy sequence E = {e_1, e_2, ..., e_N} ∈ R of the speech signal reconstructed in step 3.

Embodiment two: this embodiment further describes embodiment one. The speech dictionary set of step 1 is trained as follows:

Step 1.1: initialize the speech dictionary set Ψ₀ ∈ R^{L×D} with cosine functions, where L equals the length of a speech frame and D is an integer greater than L;

Step 1.2: train the speech dictionary set. The large amount of historical speech data used for training comes from an existing dictionary set, and training repeats the following three steps in a loop:

Step a: from the large amount of historical speech data of the existing dictionary set, compute the sparse coefficients C of the speech with the SVD algorithm:

C = \arg\min_{C} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2},

Step b: with the sparse coefficients C obtained in step a, update the speech dictionary set:

\tilde{\Psi} = \arg\min_{\Psi} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2},

where Ψ = Ψ₀ in the first round of training;

Step c: judge whether the formula [formula not reproduced in the source] holds. If yes, return to step b for the next round of updating; otherwise the updating ends and the speech dictionary set is obtained. Here δ is the sparsity threshold and satisfies the relation: [formula not reproduced in the source].
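A minimal sketch of the initialization in the first training step and of the dictionary update in step b is given below, under stated assumptions. The text says only that Ψ₀ is initialized "with cosine functions", so the sampled-cosine layout and the unit-norm columns here are assumptions. With C held fixed, the term ‖C‖₁ is constant, so the update reduces to a least-squares fit of Ψ; the function names are hypothetical, and the stopping test of step c is not sketched because its formula is not available.

```python
import numpy as np

def init_cosine_dictionary(L, D):
    """Psi_0 in R^{L x D}: sampled cosines of increasing frequency, unit-norm columns."""
    l = np.arange(L)[:, None]                  # sample index within a frame
    d = np.arange(D)[None, :]                  # atom (column) index
    psi0 = np.cos(np.pi * (l + 0.5) * d / D)   # assumed cosine layout
    return psi0 / np.linalg.norm(psi0, axis=0, keepdims=True)

def update_dictionary(X, C):
    """Least-squares minimiser of ||X - Psi C||_2^2 over Psi, with C held fixed."""
    # Solve C^T Psi^T = X^T in the least-squares sense, then transpose back.
    psi_t, *_ = np.linalg.lstsq(C.T, X.T, rcond=None)
    return psi_t.T
```

Here X is the L×N matrix of training frames and C the D×N matrix of coefficients from step a. In a full training loop the two steps would alternate until the convergence test of step c is met.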

Embodiment three: this embodiment further describes embodiment one. In step 2, the sparse coefficients C of the speech are extracted over the speech dictionary set obtained in step 1 by the following formula:

C = \arg\min_{C} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2} .
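One way to solve this minimization numerically is the iterative shrinkage-thresholding algorithm (ISTA), sketched below. ISTA is a stand-in chosen for brevity and is not a solver named in this document. The names `soft_threshold`, `sparse_code`, `lam` and `n_iter` are hypothetical, and X is assumed to be the L×N matrix of frames being decomposed.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink every entry towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, Psi, lam=10.0, n_iter=200):
    """ISTA for min_C ||C||_1 + lam * ||X - Psi C||_2^2 (X: L x N, Psi: L x D)."""
    # Step size 1 / Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / (2.0 * lam * np.linalg.norm(Psi, 2) ** 2)
    C = np.zeros((Psi.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * lam * Psi.T @ (Psi @ C - X)    # gradient of lam * ||X - Psi C||^2
        C = soft_threshold(C - step * grad, step)   # gradient step, then shrinkage
    return C
```

A larger `lam` weights the reconstruction error more heavily and so yields less sparse coefficients; the value 10.0 is an arbitrary placeholder, not a value from the source.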

Embodiment four: this embodiment further describes embodiment one. In step 3, the sparsely decomposed speech signal is reconstructed by the following formula:

\tilde{S} = \Psi C .
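The reconstruction, together with the energy sequence that step 4 derives from it, can be sketched as follows. The document does not define the frame energy, so taking e_n as the sum of squared samples of the reconstructed frame is an assumption, and the function names are hypothetical.

```python
import numpy as np

def reconstruct(Psi, C):
    """Step 3: reconstructed frames S~ = Psi C, one frame per column (L x N)."""
    return Psi @ C

def frame_energy(S_rec):
    """Step 4 (assumed definition): e_n = sum of squared samples of frame n."""
    return np.sum(S_rec ** 2, axis=0)
```

Because only the atoms selected by the sparse coefficients contribute to the reconstruction, the energy is measured on the part of each frame that the speech dictionary can represent.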

Embodiment five: this embodiment further describes embodiment one. The short-time window W₁ of step 5 is obtained as follows:

A short-time window is designed as

W_{1} = \{w_{1}^{1}, w_{2}^{1}, \ldots, w_{I_{1}}^{1}\},

where

w_{1}^{1} = w_{2}^{1} = \ldots = w_{I_{1}}^{1} = \frac{1}{I_{1}};

The result y_n of the convolution of W₁ with the segment {e_{n-I₁+1}, e_{n-I₁+2}, ..., e_n} of the time-domain energy sequence E is taken as the score of the speech frame s_n corresponding to the last element e_n;

W₁ then slides forward by one position along the time-domain energy sequence E and is convolved with the segment {e_{n-I₁+2}, e_{n-I₁+3}, ..., e_{n+1}} in the next round, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1}. This operation is repeated until the end of the sequence.

This embodiment is the detection method for conditions with high real-time requirements.
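The sliding convolution described in this embodiment can be illustrated with `numpy.convolve`. This is a sketch rather than a prescribed implementation; the function name `short_time_scores` and the sample energies are made up for the example.

```python
import numpy as np

def short_time_scores(energy, window_len=7):
    """Scores y_n for n = window_len..N: mean of the window_len most recent energies."""
    w1 = np.full(window_len, 1.0 / window_len)  # w_1 = w_2 = ... = 1 / I_1
    # mode="valid" keeps only positions where the window lies fully inside E.
    return np.convolve(np.asarray(energy, dtype=float), w1, mode="valid")

# Eight made-up frame energies and a window of length 3.
scores = short_time_scores([1, 2, 3, 4, 5, 6, 7, 8], window_len=3)
```

The k-th entry of `scores` is the score of the frame at which the k-th window ends, so a frame is scored as soon as its own energy is known. With the energies above, the six scores are the means of consecutive triples, that is 2, 3, 4, 5, 6 and 7 up to floating-point rounding.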

Embodiment six: this embodiment further describes embodiment one. The short-time window W₁ of step 5 is obtained as follows:

A short-time window is designed as

W_{1} = \{w_{1}^{1}, w_{2}^{1}, \ldots, w_{I_{1}}^{1}\},

where [formula not reproduced in the source] and I₁ is an odd number greater than 0;

The result y_n of the convolution of the short-time window W₁ with the time-domain energy sequence segment [formula not reproduced in the source] is taken as the score of the speech frame s_n corresponding to e_n;

The short-time window W₁ then slides forward by one position along the time-domain energy sequence and is convolved with the segment [formula not reproduced in the source] in the next round, and the result y_{n+1} is taken as the score of the speech frame s_{n+1} corresponding to e_{n+1};

This operation is repeated until the end of the sequence.

This embodiment is the detection method for conditions with low real-time requirements.

Embodiment seven: this embodiment further describes embodiment five or six. The length of the short-time window W₁ is 7; this is the recommended value.

Embodiment eight: this embodiment further describes embodiment one. The long-time window W₂ of step 6 is obtained as follows:

A long-time window is designed as

W_{2} = \{w_{1}^{2}, w_{2}^{2}, \ldots, w_{I_{2}}^{2}\},

where

w_{1}^{2} = w_{2}^{2} = \ldots = w_{I_{2}}^{2} = \frac{1}{I_{2}};

The result β_n of the convolution of W₂ with the segment {e_{n-I₂+1}, e_{n-I₂+2}, ..., e_n} of the time-domain energy sequence E is taken as the decision threshold of the speech frame s_n corresponding to the last element e_n;

W₂ then slides forward by one position along the time-domain energy sequence and is convolved with the segment {e_{n-I₂+2}, e_{n-I₂+3}, ..., e_{n+1}} in the next round, and the result β_{n+1} is taken as the decision threshold of the speech frame s_{n+1} corresponding to e_{n+1};

This operation is repeated until the end of the sequence;

where I₂ is a natural number much larger than I₁.
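Written out for the uniform windows of embodiments five and eight, the threshold and the score take the form below. Step 6 states the truncation rule only for n < 6000; applying it whenever n < I₂ is an assumption made here for generality.

```latex
\beta_n =
\begin{cases}
\dfrac{1}{I_2} \sum_{i = n - I_2 + 1}^{n} e_i, & n \ge I_2, \\[6pt]
\dfrac{1}{n} \sum_{i = 1}^{n} e_i, & n < I_2,
\end{cases}
\qquad
y_n = \frac{1}{I_1} \sum_{i = n - I_1 + 1}^{n} e_i .
```

Frame s_n is judged to be speech when y_n > β_n. Since I₂ is much larger than I₁, the test compares the local mean energy around frame n with the mean energy over a much longer stretch of the signal.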

Embodiment nine: this embodiment further describes embodiment eight. The length of the long-time window W₂ is 6000; this is the recommended value.

Claims

1. A robust active speech detection method, characterized in that the method comprises steps 1 to 7 as set out in the Summary of the invention, in which step 3 reconstructs the sparsely decomposed speech signal

\tilde{S} = \{\tilde{s}_{1}, \tilde{s}_{2}, \ldots, \tilde{s}_{N}\} \in R^{L \times N};

Step 4: obtain the time-domain energy sequence E = {e_1, e_2, ..., e_N} ∈ R of the speech signal reconstructed in step 3.

2. The robust active speech detection method according to claim 1, characterized in that the speech dictionary set of step 1 is trained as follows:

Step a: the sparse coefficients C of the speech are computed:

C = \arg\min_{C} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2},

Step b: the speech dictionary set is updated with the sparse coefficients C obtained in step a:

\tilde{\Psi} = \arg\min_{\Psi} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2},

where Ψ = Ψ₀ in the first round of training;

Step c: judge whether the formula [formula not reproduced in the source] holds. If yes, then [formula not reproduced in the source] and return to step b for the next round of updating; otherwise the updating ends and the speech dictionary set is obtained,

where δ is the sparsity threshold and satisfies the relation: [formula not reproduced in the source].

3. The robust active speech detection method according to claim 1, characterized in that in step 2 the sparse coefficients C of the speech are extracted over the speech dictionary set obtained in step 1 by the following formula:

C = \arg\min_{C} \|C\|_{1} + \lambda \|X - \Psi C\|_{2}^{2} .

4. The robust active speech detection method according to claim 1, characterized in that in step 3 the sparsely decomposed speech signal is reconstructed by the following formula:

\tilde{S} = \Psi C .

5. The robust active speech detection method according to claim 1, characterized in that the short-time window W₁ of step 5 is obtained as follows:

A short-time window is designed as

W_{1} = \{w_{1}^{1}, w_{2}^{1}, \ldots, w_{I_{1}}^{1}\},

where

w_{1}^{1} = w_{2}^{1} = \ldots = w_{I_{1}}^{1} = \frac{1}{I_{1}};

The result y_n of the convolution of W₁ with the segment {e_{n-I₁+1}, e_{n-I₁+2}, ..., e_n} of the time-domain energy sequence E is taken as the score of the speech frame s_n corresponding to the last element e_n.

6. The robust active speech detection method according to claim 1, characterized in that the short-time window W₁ of step 5 is obtained as follows:

A short-time window is designed as

W_{1} = \{w_{1}^{1}, w_{2}^{1}, \ldots, w_{I_{1}}^{1}\},

where [formula not reproduced in the source] and I₁ is an odd number greater than 0;

The short-time window W₁ is convolved with the time-domain energy sequence segment [formula not reproduced in the source];

This operation is repeated until the end of the sequence.

7. The robust active speech detection method according to claim 5 or 6, characterized in that the length of the short-time window W₁ is 7.

8. The robust active speech detection method according to claim 1, characterized in that the long-time window W₂ of step 6 is obtained as follows:

A long-time window is designed as

W_{2} = \{w_{1}^{2}, w_{2}^{2}, \ldots, w_{I_{2}}^{2}\},

where

w_{1}^{2} = w_{2}^{2} = \ldots = w_{I_{2}}^{2} = \frac{1}{I_{2}};

This operation is repeated until the end of the sequence;

where I₂ is a natural number much larger than I₁.

9. The robust active speech detection method according to claim 8, characterized in that the length of the long-time window W₂ is 6000.