Summary of the invention
The objective of the invention is to overcome the defects of the prior art by providing a multi-speaker speech rate estimation method based on speaker segmentation and clustering: speaker segmentation first divides the voice stream into speech segments, and speaker clustering then splices the speech segments of the same speaker together in order; the number of syllables and the duration of each speaker's speech are then estimated, thereby realizing speech rate estimation for multiple speakers.
The technical solution adopted by the present invention to solve the technical problem comprises the following steps:
1) Reading in a voice stream: read in a voice stream containing the speech of multiple speakers;
2) Speaker segmentation: detect the speaker change points in the voice stream, and divide the voice stream into a plurality of speech segments at these change points;
3) Speaker clustering: perform speaker clustering on the segmented speech segments using a spectral clustering algorithm, and splice the speech segments of the same speaker together in order, obtaining the number of speakers and each speaker's speech;
4) Speech rate estimation: extract the energy envelope from each speaker's speech and determine the number of syllables by finding the local maximum points of the energy envelope, thereby estimating each speaker's speech rate (an illustrative end-to-end sketch follows this list).
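For illustration only, the four steps can be chained as in the following Python sketch; every function name in it is hypothetical and merely stands for the corresponding step described above (the individual steps are detailed in the embodiment below):

```python
def estimate_speech_rates(stream_path):
    """Hypothetical end-to-end pipeline mirroring steps 1)-4) above."""
    x, fs = read_voice_stream(stream_path)       # step 1: read in the voice stream
    segments = split_at_change_points(x, fs)     # step 2: speaker segmentation
    speakers = cluster_speakers(segments, fs)    # step 3: speaker clustering
    return {spk: speech_rate(s, fs)              # step 4: per-speaker speech rate
            for spk, s in speakers.items()}
```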
The speaker segmentation of step 2) comprises the following steps:
2.1) Find the silence segments and speech segments in the voice stream read in using a silence detection algorithm based on threshold decision;
2.2) Splice the speech segments in order into one long speech segment, and extract from the long speech segment audio features comprising the Mel Frequency Cepstral Coefficients (MFCCs) and their first-order differences (Delta-MFCCs);
2.3) Using the extracted audio features, detect the speaker change points in the long speech segment by judging the similarity between adjacent data windows according to the Bayesian Information Criterion;
2.4) Divide the voice stream into a plurality of speech segments at the detected speaker change points, such that each speech segment contains only one speaker.
The threshold-based silence detection algorithm of step 2.1) comprises the following steps:
2.1.1) Divide the voice stream into frames and compute the energy of each frame, obtaining the energy feature vector of the voice stream;
2.1.2) Compute the energy threshold;
2.1.3) Compare the energy of each frame with the energy threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silence segment, and adjacent speech frames are spliced in order into a speech segment.
The speech rate estimation of step 4) comprises the following steps:
4.1) Compute the energy of the speaker's speech;
4.2) Filter the computed energy with a low-pass filter to obtain the energy envelope;
4.3) Compute the energy envelope threshold;
4.4) Determine the local maximum points of the energy envelope and obtain their number;
4.5) Take the number of local maximum points in the speaker's energy envelope as the number of syllables, and divide it by the duration of the speaker's speech to obtain the speaker's speech rate;
4.6) Repeat steps 4.1)–4.5) until the speech rate of every speaker's speech has been estimated.
A local maximum point satisfies the following conditions:
a) its value is greater than the energy envelope threshold;
b) its value is greater than all element values within 0.07 seconds before and after it.
The position of each local maximum point is the position of the energy peak of the final (vowel) of a syllable.
The beneficial effects of the invention are as follows: speaker segmentation cuts a voice stream containing multiple speakers into a plurality of speech segments, each containing only one speaker, and speaker clustering then combines the speech segments of the same speaker, so the invention can estimate the speech rate of multi-speaker speech. In addition, the number of syllables is determined by detecting the local maximum points of each speaker's energy envelope, from which each speaker's speech rate is estimated. Compared with speech rate estimation methods based on speech recognition, no complex numerical computation is required (for example, computing the output probabilities of acoustic and language models), which saves computation time and makes the method better suited to real-time speech rate estimation.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 is a flowchart of a method for estimating the speech rates of multiple speakers according to an embodiment of the invention. As shown in Fig. 1, a voice stream is first read in at step 101. The voice stream is speech data containing the speech of multiple speakers, and may be a file of various formats, for example WAV, RAM, MP3 or VOX.
Then, at step 102, the silence segments and speech segments in the voice stream are found using a silence detection method based on threshold decision; the speech segments are spliced in order into one long speech segment, and audio features are extracted from it; using these features, the speaker change points are detected by judging the similarity between adjacent data windows in the long speech segment according to the Bayesian Information Criterion; finally, the voice stream is divided at the detected speaker change points into a plurality of speech segments, each containing only one speaker.
The silence detection method of step 102 specifically comprises the following steps (an illustrative code sketch follows the steps):
1) Divide the voice stream into T frames with a frame length of 32 milliseconds (the number of sampling points per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N sampling points, it is discarded;
2) Compute the energy E_t of the t-th frame speech signal x_t(n), 1 ≤ t ≤ T:

E_t = Σ_{n=1}^{N} x_t²(n)

to obtain the energy vector E = [E_1, E_2, …, E_T] of the voice stream, where T is the total number of frames;
3) Because speech energy varies greatly across environments, judging silence versus speech with a fixed energy threshold has significant limitations, but the relative relationship between the energy of speech and that of silence is constant, so an adaptive energy threshold T_E is defined:

T_E = min(E) + 0.3 × [mean(E) − min(E)]

where min(E) is the minimum and mean(E) the mean of the frame energies;
4) Compare the energy of each frame with the energy threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silence segment, and adjacent speech frames are spliced in order into a speech segment.
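A minimal NumPy sketch of the silence detection just described, assuming a mono signal x sampled at fs; grouping the labeled frames into silence and speech segments is left as a simple run-length step:

```python
import numpy as np

def detect_speech_frames(x, fs):
    """Threshold-based silence detection with the adaptive threshold
    T_E = min(E) + 0.3 * (mean(E) - min(E))."""
    x = np.asarray(x, dtype=float)
    N = int(0.032 * fs)                  # 32 ms frame length in samples
    shift = int(0.016 * fs)              # 16 ms frame shift
    T = 1 + (len(x) - N) // shift        # full frames; a short last frame is discarded
    E = np.array([np.sum(x[t * shift : t * shift + N] ** 2) for t in range(T)])
    T_E = E.min() + 0.3 * (E.mean() - E.min())   # adaptive energy threshold
    return E >= T_E                      # True = speech frame, False = silent frame
```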
The method of determining the speaker change points using the Bayesian Information Criterion in step 102 specifically comprises the following steps (a code sketch follows the steps):
1) Splice the speech segments obtained by silence detection in order into one long speech segment, and cut the long speech segment into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each data window into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract the MFCCs and Delta-MFCCs features from each frame; the dimension M of the MFCCs and of the Delta-MFCCs is 12, the features of each data window constitute a feature matrix F, and the dimension of F is d = 2M = 24;
2) Compute the BIC distance between two adjacent data windows x and y:

ΔBIC = [(n_x + n_y)/2] log det(cov(F_z)) − (n_x/2) log det(cov(F_x)) − (n_y/2) log det(cov(F_y)) − α P, with P = (1/2)[d + d(d+1)/2] log(n_x + n_y)

where z is the data window obtained by merging data windows x and y; n_x and n_y are the numbers of frames in x and y respectively; F_x, F_y and F_z are the feature matrices of x, y and z respectively; cov(F_x), cov(F_y) and cov(F_z) are their covariance matrices; det(·) denotes the matrix determinant; and α is a penalty coefficient whose experimental value is 2.0;
3) If the BIC distance ΔBIC is greater than zero, the two data windows are regarded as belonging to two different speakers (i.e., a speaker change point exists between them); otherwise they are regarded as belonging to the same speaker and are merged;
4) Slide the data window continually, judging whether the BIC distance between adjacent data windows is greater than zero and saving the speaker change points, until the BIC distances between all adjacent data windows of the long speech segment have been judged.
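The ΔBIC test can be sketched as follows; the feature matrices are (frames × d) arrays, and the penalty term is the standard full-covariance BIC penalty assumed in the formula above:

```python
import numpy as np

def delta_bic(Fx, Fy, alpha=2.0):
    """BIC distance between two adjacent data windows.
    Fx, Fy: feature matrices of shape (n_frames, d)."""
    Fz = np.vstack([Fx, Fy])                  # merged window z
    nx, ny, d = len(Fx), len(Fy), Fx.shape[1]

    def logdet_cov(F):
        # log-determinant of the covariance matrix of a feature matrix
        return np.linalg.slogdet(np.cov(F, rowvar=False))[1]

    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return (0.5 * (nx + ny) * logdet_cov(Fz)
            - 0.5 * nx * logdet_cov(Fx)
            - 0.5 * ny * logdet_cov(Fy)
            - alpha * penalty)

# A change point is hypothesised between x and y when delta_bic(Fx, Fy) > 0.
```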
The extraction of the MFCCs and Delta-MFCCs features comprises the following steps (a library-based sketch follows the steps):
1) Divide the speech signal into T frames with a frame length of 32 milliseconds (the number of sampling points per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N sampling points, it is discarded;
2) Apply the Discrete Fourier Transform (DFT) to the t-th frame speech signal x_t(n), 1 ≤ t ≤ T, to obtain the linear spectrum X_t(k):

X_t(k) = Σ_{n=0}^{N−1} x_t(n) e^{−j2πnk/N}, 0 ≤ k ≤ N−1;
3) Pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, and apply the logarithm to obtain the log spectrum S_t(m). The Mel-frequency filter bank is a set of band-pass filters H_m(k), 0 ≤ m < M, where M is the number of filters. Each filter has a triangular filtering characteristic with centre frequency f(m); the spacing between adjacent centre frequencies f(m) is small for small m and grows gradually as m increases. The transfer function of each band-pass filter is:

H_m(k) = 0, for k < f(m−1) or k > f(m+1);
H_m(k) = [k − f(m−1)] / [f(m) − f(m−1)], for f(m−1) ≤ k ≤ f(m);
H_m(k) = [f(m+1) − k] / [f(m+1) − f(m)], for f(m) ≤ k ≤ f(m+1)
where the boundary points f(m) are defined as:

f(m) = (N/f_s) × B⁻¹( B(f_l) + m × [B(f_h) − B(f_l)] / (M+1) )

where f_l and f_h are the lowest and highest frequencies of the filter bank's frequency range, B(f) = 1125 ln(1 + f/700) is the Mel-scale mapping, and B⁻¹ is its inverse: B⁻¹(b) = 700(e^{b/1125} − 1). The log spectrum S_t(m) is therefore obtained from the linear spectrum X_t(k) by:

S_t(m) = ln[ Σ_{k=0}^{N−1} |X_t(k)|² H_m(k) ], 0 ≤ m < M;
4) Transform the log spectrum S_t(m) to the cepstral domain by the Discrete Cosine Transform (DCT) to obtain the t-th frame MFCCs, C_t(p):

C_t(p) = Σ_{m=0}^{M−1} S_t(m) cos( πp(m + 1/2)/M ), 1 ≤ p ≤ M;
5) Compute the first-order difference (Delta-MFCCs) of the t-th frame MFCCs, ΔC_t(p):

ΔC_t(p) = [ Σ_{q=−Q}^{Q} q × C_{t+q}(p) ] / [ Σ_{q=−Q}^{Q} q² ]

where Q is a constant, set to 3 in the experiments;
6) Repeat steps 2)–5) for every frame of the speech signal to obtain the MFCCs and Delta-MFCCs of all T frames; arrange them frame by frame into an MFCC matrix and a Delta-MFCC matrix, and merge the two matrices to constitute the feature matrix F.
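For illustration, equivalent features can be obtained with an off-the-shelf extractor. The sketch below uses librosa, which is an assumption: its default mel filter bank approximates, but does not exactly match, the filter-bank construction specified above.

```python
import numpy as np
import librosa

def extract_features(x, fs, n_mfcc=12):
    """MFCCs + Delta-MFCCs with a 32 ms frame and 16 ms shift,
    stacked into a feature matrix F of dimension d = 2 * n_mfcc."""
    n_fft = int(0.032 * fs)        # 32 ms frame length in samples
    hop = int(0.016 * fs)          # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)        # first-order difference
    return np.vstack([mfcc, delta]).T          # shape: (frames, 2 * n_mfcc)
```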
At step 103, audio features comprising MFCCs and Delta-MFCCs are extracted from each segmented speech segment, and speaker clustering is performed on the single-speaker speech segments using a spectral clustering algorithm, yielding the number of speakers and each speaker's speech. The concrete steps are as follows (a code sketch follows the steps):
1) Divide each speech segment into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract the MFCCs and Delta-MFCCs features from each frame; the dimension M of the MFCCs and of the Delta-MFCCs is 12, and the features of each speech segment constitute a feature matrix F_j of dimension d = 2M = 24;
2) Collect the feature matrices of all speech segments to be clustered into the set F = {F_1, …, F_J}, where J is the total number of speech segments, and construct from F an affinity matrix A ∈ R^{J×J} whose (i, j)-th element A_ij is defined as:

A_ij = exp( −d²(F_i, F_j) / (σ_i σ_j) ) for i ≠ j, and A_ii = 0
where d(F_i, F_j) is the Euclidean distance between the feature matrices F_i and F_j, and σ_i (or σ_j) is a scale parameter, defined as the variance of the vector of Euclidean distances between the i-th (or j-th) feature matrix and the other J − 1 feature matrices;
3) Construct a diagonal matrix D whose (i, i)-th element equals the sum of all elements in the i-th row of the affinity matrix A, and construct from D and A the normalized affinity matrix L = D^{−1/2} A D^{−1/2};
4) Compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_{K_max} of the matrix L and the corresponding eigenvectors v_1, v_2, …, v_{K_max}, where each v_k (1 ≤ k ≤ K_max) is a column vector. Estimate the optimal number of classes (i.e., the number of speakers) K from the differences between adjacent eigenvalues:

K = arg max_{1 ≤ k ≤ K_max − 1} (λ_k − λ_{k+1})

According to the estimated number of speakers K, construct the matrix V = [v_1, v_2, …, v_K] ∈ R^{J×K};
5) Normalize each row of the matrix V to obtain the matrix Y ∈ R^{J×K}, whose (j, k)-th element is:

Y_jk = V_jk / ( Σ_{k′=1}^{K} V_jk′² )^{1/2};
6) Treat each row of the matrix Y as a point in the space R^K, and cluster the J rows (i.e., J points) into K classes using the K-means algorithm;
7) The speech segment corresponding to the feature matrix F_j is assigned to the k-th class (i.e., the k-th speaker) if and only if the j-th row of the matrix Y is clustered into the k-th class;
8) From the above clustering result, obtain the number of speakers, each speaker's speech, and its duration.
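A sketch of clustering steps 2)–6) under the formulas assumed above; the (J × J) matrix of Euclidean distances between segment feature matrices is taken as input, and scikit-learn's KMeans stands in for a hand-written K-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(dist, k_max=10):
    """dist: (J, J) Euclidean distances between segment feature matrices.
    Returns the estimated speaker count K and one label per segment."""
    J = dist.shape[0]
    # scale parameter sigma_i: variance of distances from segment i to the others
    sigma = np.array([np.var(np.delete(dist[i], i)) for i in range(J)])
    A = np.exp(-dist ** 2 / np.outer(sigma, sigma))     # affinity matrix
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                     # D^{-1/2} A D^{-1/2}
    w, v = np.linalg.eigh(L)                            # eigenvalues, ascending
    w, v = w[::-1], v[:, ::-1]                          # reorder descending
    K = int(np.argmax(w[:k_max - 1] - w[1:k_max])) + 1  # eigengap estimate of K
    V = v[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)    # row-normalized
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)
    return K, labels
```

The eigengap step makes the number of speakers an output of the clustering rather than a required input, which is what allows the method to handle an unknown number of speakers.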
Finally, at step 104, the energy envelope is extracted from each speaker's speech, the number of syllables is determined by detecting the local maximum points of the energy envelope, and each speaker's speech rate is estimated. In standard Chinese, essentially every syllable contains a final (vowel); the number of finals equals the number of syllables, the number of syllables equals the number of characters, and the final has the maximum energy within its syllable. The number of characters, and hence the speech rate, can therefore be obtained by detecting the finals at the energy maxima. The concrete steps of the speech rate estimation method based on these considerations are as follows (a code sketch follows the steps):
1) Compute the energy E(n) of each speaker's speech signal s(n):

E(n) = s²(n), 1 ≤ n ≤ Len

where Len is the total number of sampling points of the speech signal;
2) Filter the energy E(n) with a low-pass filter to obtain the energy envelope. The technical specifications of the low-pass filter are: an FIR filter designed by the equiripple method, sampling frequency f_s of 16000 Hz, passband cutoff frequency f_pass of 50 Hz, stopband cutoff frequency f_stop of 100 Hz, maximum passband attenuation A_pass of 1 dB, and minimum stopband attenuation A_stop of 80 dB;
3) Compute the energy envelope threshold T_E:

T_E = 0.4 × mean(E(n))

where mean(E(n)) is the mean value of the energy envelope;
4) Take as local maximum points the elements of the energy envelope that satisfy both of the following conditions:

Condition 1: the element value is greater than the energy envelope threshold T_E;
Condition 2: the element value is greater than all element values within 0.07 seconds before and after it, i.e., greater than the 0.07 × f_s element values on either side.

The position (sampling point) of each such local maximum point is the position of the energy peak of the final of a syllable. The value 0.07 seconds is chosen because the minimum average syllable duration is approximately 0.14 seconds, so the positions in E(n) that exceed T_E and exceed all element values within 0.07 seconds on either side are the positions of the energy peaks of the syllable finals;
5) Take the number of local maximum points in the speaker's energy envelope as the number of syllables (characters), and divide it by the duration of the speaker's speech (in seconds) to obtain the speaker's speech rate (characters/second);
6) Repeat steps 1)–5) until the speech rate of every speaker's speech has been estimated.
Fig. 2(a) shows the waveform of a 5-second speech signal of one speaker, and Fig. 2(b) shows the energy envelope of the speech signal of Fig. 2(a) (solid line), the energy envelope threshold (dotted line), and the local maximum points of the energy envelope obtained by the above speech rate estimation steps (dash-dot lines with circles). As can be seen from Fig. 2, the duration of this speaker's speech signal is 5 seconds and the number of local maximum points is 22, i.e., the number of characters is 22; this speaker's speech rate is therefore 4.4 characters/second (or 264 characters/minute).
Although the multi-speaker speech rate estimation method of the present invention has been described in detail through the above embodiment, this should not be interpreted as limiting the scope of the invention. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all belong to the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the appended claims.