Summary of the invention
The present invention addresses a problem of existing voice activity detection methods: they rely on frequency-domain audio features extracted by the Fourier transform, and such features lack robustness to noise (particularly non-stationary noise), which in turn degrades voice activity detection performance. To solve this problem, a robust voice activity detection method is provided.
The robust voice activity detection method according to the invention comprises the following steps:
Step 1: collect a large amount of historical speech data and train a speech dictionary $\Psi \in \mathbb{R}^{L \times D}$ on that data, where $\mathbb{R}$ denotes the real number space and L and D are natural numbers greater than 0, each denoting a spatial dimension;
Step 2: using the speech dictionary Ψ obtained in step 1, perform sparse decomposition on the input speech signal $S = \{s_1, s_2, \ldots, s_N\} \in \mathbb{R}^{L \times N}$ and extract the speech sparse coefficients $C = \{c_1, c_2, \ldots, c_N\} \in \mathbb{R}^{D \times N}$, where N is a natural number denoting a spatial dimension;
Step 3: reconstruct the sparsely decomposed speech signal from the sparse coefficients C obtained in step 2;
Step 4: obtain the time-domain energy sequence $E = \{e_1, e_2, \ldots, e_N\} \in \mathbb{R}$ of the speech signal reconstructed in step 3;
Step 5: design a short window $W_1$ and slide it along the time-domain energy sequence E, performing a convolution at each position; each result $STME_n$ is taken as the score $y_n$ of the corresponding frame $s_n$, where $n = 1, \ldots, N$ and the length of $W_1$ lies in the range [2+1, 2×10+1], i.e. from 3 to 21;
Step 6: design a long window $W_2$ and slide it along the time-domain energy sequence E, performing a convolution at each position; each result $LTME_n$ is taken as the decision threshold $\beta_n$ of the corresponding frame $s_n$, where the length of $W_2$ lies in the range [1000, 1000×10], i.e. from 1000 to 10000, and when n < 6000, n is used as the length;
Step 7: determine whether $y_n > \beta_n$ holds; if it does, the input speech signal S is determined to be speech; if it does not, the input speech signal S is determined to be non-speech, which completes the voice activity detection.
Advantage of the present invention: under low signal-to-noise ratio and non-stationary noise interference, the detection method of the invention can efficiently distinguish speech segments from non-speech segments in an audio sequence.
Embodiment
Embodiment one: this embodiment is described below with reference to Fig. 1. The robust voice activity detection method of this embodiment comprises the following steps:
Step 1: collect a large amount of historical speech data and train a speech dictionary $\Psi \in \mathbb{R}^{L \times D}$ on that data, where $\mathbb{R}$ denotes the real number space and L and D are natural numbers greater than 0, each denoting a spatial dimension;
Step 2: using the speech dictionary Ψ obtained in step 1, perform sparse decomposition on the input speech signal $S = \{s_1, s_2, \ldots, s_N\} \in \mathbb{R}^{L \times N}$ and extract the speech sparse coefficients $C = \{c_1, c_2, \ldots, c_N\} \in \mathbb{R}^{D \times N}$, where N is a natural number denoting a spatial dimension;
Step 3: reconstruct the sparsely decomposed speech signal from the sparse coefficients C obtained in step 2;
Step 4: obtain the time-domain energy sequence $E = \{e_1, e_2, \ldots, e_N\} \in \mathbb{R}$ of the speech signal reconstructed in step 3;
Step 5: design a short window $W_1$ and slide it along the time-domain energy sequence E, performing a convolution at each position; each result $STME_n$ is taken as the score $y_n$ of the corresponding frame $s_n$, where $n = 1, \ldots, N$ and the length of $W_1$ lies in the range [2+1, 2×10+1], i.e. from 3 to 21;
Step 6: design a long window $W_2$ and slide it along the time-domain energy sequence E, performing a convolution at each position; each result $LTME_n$ is taken as the decision threshold $\beta_n$ of the corresponding frame $s_n$, where the length of $W_2$ lies in the range [1000, 1000×10], i.e. from 1000 to 10000, and when n < 6000, n is used as the length;
Step 7: determine whether $y_n > \beta_n$ holds; if it does, the input speech signal S is determined to be speech; if it does not, the input speech signal S is determined to be non-speech, which completes the voice activity detection.
Embodiment two: this embodiment further describes embodiment one. The training process of the speech dictionary in step 1 is:
Step 11: initialize the speech dictionary $\Psi_0 \in \mathbb{R}^{L \times D}$ with cosine functions, where L equals the length of a speech frame and D is an integer greater than L;
Step 12: train the speech dictionary. The large amount of historical speech data used in training comes from an existing dictionary set, and training iterates the following three steps:
Step a: from the large amount of historical speech data of the existing dictionary set, compute the speech sparse coefficients C using the SVD algorithm:
Step b: update the speech dictionary according to the sparse coefficients C obtained in step a:
where $\Psi = \Psi_0$ in the first training round;
Step c: determine whether the formula holds; if it does, return to step b for the next round of updating; otherwise the update ends and the speech dictionary is obtained, where δ is the sparsity threshold and satisfies the relation:
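As an illustration of the cosine initialization in step 11 only, the following is a minimal sketch. It assumes a DCT-style cosine family with unit-norm atoms; the exact cosine formula, the function name `cosine_dictionary`, and the storage layout (a list of D atoms, each of length L, i.e. the columns of $\Psi_0$) are assumptions made for this example and are not taken from the source.

```python
import math


def cosine_dictionary(frame_len, num_atoms):
    """Build an overcomplete cosine dictionary (assumed DCT-style form).

    Returns a list of `num_atoms` atoms, each a list of `frame_len` samples.
    The atoms correspond to the columns of Psi_0 in R^(L x D), with
    L = frame_len and D = num_atoms > L.
    """
    if num_atoms <= frame_len:
        raise ValueError("D must be greater than L for an overcomplete dictionary")
    atoms = []
    for k in range(num_atoms):
        # Assumed atom shape: a cosine whose frequency grows with k.
        atom = [
            math.cos(math.pi * k * (2 * i + 1) / (2 * num_atoms))
            for i in range(frame_len)
        ]
        # Normalise each atom to unit Euclidean norm.
        norm = math.sqrt(sum(v * v for v in atom))
        atoms.append([v / norm for v in atom])
    return atoms


if __name__ == "__main__":
    psi0 = cosine_dictionary(frame_len=4, num_atoms=8)
    print(len(psi0), len(psi0[0]))
```

Unit-norm atoms are a common convention because they make the sparse coefficients comparable across atoms; the source does not state whether it normalises.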
Embodiment three: this embodiment further describes embodiment one. In step 2, the speech sparse coefficients C are extracted over the speech dictionary obtained in step 1 according to the following formula:
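To make the idea of sparse decomposition in step 2 concrete, the sketch below uses plain matching pursuit, a standard greedy sparse coding technique. It is a stand-in chosen for illustration and is not claimed to be the formula of this embodiment. The function name `matching_pursuit`, the parameters `max_atoms` and `tol`, and the representation of the dictionary as a list of unit-norm atoms are all assumptions.

```python
def matching_pursuit(frame, atoms, max_atoms=3, tol=1e-6):
    """Greedy sparse code of one frame over unit-norm dictionary atoms.

    `atoms` is a list of D vectors of length L (the dictionary columns).
    Returns a list of D coefficients, most of which stay at zero.
    """
    residual = list(frame)
    coeffs = [0.0] * len(atoms)
    for _ in range(max_atoms):
        # Correlate the current residual with every atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(corr[k]))
        if abs(corr[best]) < tol:
            break  # nothing left that the dictionary can explain
        # Keep the best atom's contribution and remove it from the residual.
        coeffs[best] += corr[best]
        residual = [r - corr[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs


if __name__ == "__main__":
    toy_atoms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    print(matching_pursuit([3.0, 4.0], toy_atoms))
```

In the toy call, the frame is a multiple of the third atom, so a single coefficient is enough to represent it. Noise that does not resemble any speech atom would receive only small coefficients, which is the property the method relies on.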
Embodiment four: this embodiment further describes embodiment one. In step 3, the sparsely decomposed speech signal is reconstructed according to the following formula:
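A minimal sketch of the reconstruction in step 3 and the frame energy in step 4 follows. It assumes the usual linear synthesis model, in which a reconstructed frame is the coefficient-weighted sum of dictionary atoms, and it assumes the time-domain energy of a frame is the sum of its squared samples. Both definitions, and the names `reconstruct_frame` and `frame_energy`, are assumptions for illustration.

```python
def reconstruct_frame(coeffs, atoms):
    """Rebuild one frame as the weighted sum of dictionary atoms.

    `atoms` is a list of D vectors of length L; `coeffs` has D entries.
    """
    frame_len = len(atoms[0])
    return [
        sum(c * atom[i] for c, atom in zip(coeffs, atoms))
        for i in range(frame_len)
    ]


def frame_energy(frame):
    """Time-domain energy of a frame, assumed to be the sum of squares."""
    return sum(v * v for v in frame)


if __name__ == "__main__":
    atoms = [[1.0, 0.0], [0.0, 1.0]]
    rebuilt = reconstruct_frame([2.0, -3.0], atoms)
    print(rebuilt, frame_energy(rebuilt))
```

Applying `frame_energy` to every reconstructed frame yields the sequence E used by the window operations in steps 5 and 6.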
Embodiment five: this embodiment further describes embodiment one. The short window $W_1$ of step 5 is obtained as follows:
The short window is designed as
where
In addition, the result $y_n$ of convolving $W_1$ with the segment $\{e_{n-I_1+1}, e_{n-I_1+2}, \ldots, e_n\}$ of the time-domain energy sequence E is taken as the score of the speech frame $s_n$ corresponding to the last element $e_n$;
Then, $W_1$ slides forward by one position along the time-domain energy sequence E and is convolved with the segment $\{e_{n-I_1+2}, e_{n-I_1+3}, \ldots, e_{n+1}\}$ in the next round, and the result $y_{n+1}$ is taken as the score of the speech frame $s_{n+1}$ corresponding to $e_{n+1}$. The above operation is repeated until the end.
This embodiment is the detection method for conditions with high real-time requirements.
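The sliding operation of this embodiment can be sketched as follows. The window weights are assumed to be uniform (a moving average), since the window definition itself is not available here, and the first frames, which have fewer than $I_1$ predecessors, are assumed to use a truncated window. The name `short_term_scores` and the parameter `window_len` (standing for $I_1$) are hypothetical.

```python
def short_term_scores(energy, window_len=7):
    """Score y_n for each frame from the current and previous energies.

    Uses only e_(n-I1+1) .. e_n, so a score is available as soon as
    frame n arrives. Uniform weights are an assumption of this sketch.
    """
    scores = []
    for n in range(len(energy)):
        start = max(0, n - window_len + 1)  # truncate at the sequence start
        segment = energy[start:n + 1]
        scores.append(sum(segment) / len(segment))
    return scores


if __name__ == "__main__":
    print(short_term_scores([1, 2, 3, 4], window_len=2))
```

Because the window ends at the current frame, no future frames are needed, which is consistent with the high real-time use stated for this embodiment.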
Embodiment six: this embodiment further describes embodiment one. The short window $W_1$ of step 5 is obtained as follows:
The short window is designed as
where
and $I_1$ is an odd number greater than 0;
In addition, the short window $W_1$ is convolved with the time-domain energy sequence
and the result $y_n$ is taken as the score of the speech frame $s_n$ corresponding to $e_n$;
Then, the short window $W_1$ slides forward by one position along the time-domain energy sequence and is convolved with the time-domain energy sequence
in the next round, and the result $y_{n+1}$ is taken as the score of the speech frame $s_{n+1}$ corresponding to $e_{n+1}$;
The above operation is repeated until the end.
This embodiment is the detection method for conditions with low real-time requirements.
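One plausible reading of this embodiment, given that $I_1$ is odd and that real-time requirements are low, is a window centred on frame n that also covers future frames. The sketch below follows that reading. The centred placement, the uniform weights, the truncation at both ends of the sequence, and the name `centered_scores` are all assumptions and are not confirmed by the text.

```python
def centered_scores(energy, window_len=7):
    """Score y_n from a window centred on frame n (assumed reading).

    `window_len` stands for I1 and must be a positive odd number so the
    window has a well-defined centre. Uniform weights are assumed.
    """
    if window_len <= 0 or window_len % 2 == 0:
        raise ValueError("window_len must be a positive odd number")
    half = window_len // 2
    scores = []
    for n in range(len(energy)):
        # Truncate the window where it would run past either end.
        segment = energy[max(0, n - half):n + half + 1]
        scores.append(sum(segment) / len(segment))
    return scores


if __name__ == "__main__":
    print(centered_scores([1, 2, 3, 4, 5], window_len=3))
```

Under this reading the score for frame n needs `window_len // 2` later frames, so the decision is delayed by that many frames.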
Embodiment seven: this embodiment further describes embodiment five or six. The length of the short window $W_1$ is 7. This is the recommended value.
Embodiment eight: this embodiment further describes embodiment one. The long window $W_2$ of step 6 is obtained as follows:
The long window is designed as
where
In addition, the result $\beta_n$ of convolving $W_2$ with the segment $\{e_{n-I_2+1}, e_{n-I_2+2}, \ldots, e_n\}$ of the time-domain energy sequence E is taken as the decision threshold of the speech frame $s_n$ corresponding to the last element $e_n$;
Then, $W_2$ slides forward by one position along the time-domain energy sequence and is convolved with the segment $\{e_{n-I_2+2}, e_{n-I_2+3}, \ldots, e_{n+1}\}$ in the next round, and the result $\beta_{n+1}$ is taken as the decision threshold of the speech frame $s_{n+1}$ corresponding to $e_{n+1}$;
The above operation is repeated until the end;
where $I_2$ is a natural number much larger than $I_1$.
Embodiment nine: this embodiment further describes embodiment eight. The length of the long window $W_2$ is 6000. This is the recommended value.
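Putting steps 5 to 7 together, the following sketch compares a short-term score with a long-term threshold for each frame. It assumes uniform window weights for both $W_1$ and $W_2$ and per-frame decisions. The name `detect_speech` and its parameters are hypothetical; the defaults 7 and 6000 are the recommended lengths from embodiments seven and nine.

```python
def detect_speech(energy, short_len=7, long_len=6000):
    """Return one True/False speech decision per frame.

    y_n    : mean energy over the last `short_len` frames (short window W1).
    beta_n : mean energy over the last `long_len` frames (long window W2);
             while fewer frames are available, all frames so far are used,
             mirroring the rule of taking n as the length when n is small.
    A frame is marked as speech when y_n > beta_n.
    """
    decisions = []
    for n in range(len(energy)):
        short_seg = energy[max(0, n - short_len + 1):n + 1]
        long_seg = energy[max(0, n - long_len + 1):n + 1]
        y_n = sum(short_seg) / len(short_seg)
        beta_n = sum(long_seg) / len(long_seg)
        decisions.append(y_n > beta_n)
    return decisions


if __name__ == "__main__":
    # Toy energy track: background, a burst of activity, background again.
    toy = [1.0] * 10 + [10.0] * 5 + [1.0] * 10
    print(detect_speech(toy, short_len=3, long_len=20))
```

The slices are recomputed for every frame to keep the sketch short; a running sum would avoid the repeated work for long sequences. Because the threshold follows the long-term energy level, it adapts to slow changes in the noise floor instead of relying on a fixed constant.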