Summary of the invention
The objective of the invention is to overcome the defects of the prior art by providing a multi-speaker speech rate estimation method based on speaker segmentation and clustering: speaker segmentation first divides the voice stream into speech segments, and speaker clustering then splices the speech segments of the same speaker together in order; the number of syllables and the duration of each speaker's speech are then estimated, thereby realizing speech rate estimation for multiple speakers.
The technical solution adopted by the present invention to solve the technical problem comprises the following steps:
1) Reading in a voice stream: read in a voice stream containing the speech of multiple speakers;
2) Speaker segmentation: detect the speaker change points in the voice stream, and divide the voice stream into a plurality of speech segments at these change points;
3) Speaker clustering: perform speaker clustering on the segmented speech segments using a spectral clustering algorithm, and splice the speech segments of the same speaker together in order, obtaining the number of speakers and each speaker's speech;
4) Speech rate estimation: extract the energy envelope from each speaker's speech and determine the number of syllables by finding the local maximum points of the energy envelope, thereby estimating each speaker's speech rate (an illustrative end-to-end sketch follows this list).
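For illustration only, the four steps can be chained as in the following Python sketch; every function name in it is hypothetical and merely stands for the corresponding step described above (the individual steps are detailed in the embodiment below):

```python
def estimate_speech_rates(stream_path):
    """Hypothetical end-to-end pipeline mirroring steps 1)-4) above."""
    x, fs = read_voice_stream(stream_path)       # step 1: read in the voice stream
    segments = split_at_change_points(x, fs)     # step 2: speaker segmentation
    speakers = cluster_speakers(segments, fs)    # step 3: speaker clustering
    return {spk: speech_rate(s, fs)              # step 4: per-speaker speech rate
            for spk, s in speakers.items()}
```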
The speaker segmentation of step 2) comprises the following steps:
2.1) Find the silence segments and speech segments in the voice stream read in using a silence detection algorithm based on threshold decision;
2.2) Splice the speech segments in order into one long speech segment, and extract from the long speech segment audio features comprising the Mel Frequency Cepstral Coefficients (MFCCs) and their first-order differences (Delta-MFCCs);
2.3) Using the extracted audio features, detect the speaker change points in the long speech segment by judging the similarity between adjacent data windows according to the Bayesian Information Criterion;
2.4) Divide the voice stream into a plurality of speech segments at the detected speaker change points, such that each speech segment contains only one speaker.
The threshold-based silence detection algorithm of step 2.1) comprises the following steps:
2.1.1) Divide the voice stream into frames and compute the energy of each frame, obtaining the energy feature vector of the voice stream;
2.1.2) Compute the energy threshold;
2.1.3) Compare the energy of each frame with the energy threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silence segment, and adjacent speech frames are spliced in order into a speech segment.
The speech rate estimation of step 4) comprises the following steps:
4.1) Compute the energy of the speaker's speech;
4.2) Filter the computed energy with a low-pass filter to obtain the energy envelope;
4.3) Compute the energy envelope threshold;
4.4) Determine the local maximum points of the energy envelope and obtain their number;
4.5) Take the number of local maximum points in the speaker's energy envelope as the number of syllables, and divide it by the duration of the speaker's speech to obtain the speaker's speech rate;
4.6) Repeat steps 4.1)–4.5) until the speech rate of every speaker's speech has been estimated.
A local maximum point satisfies the following conditions:
a) its value is greater than the energy envelope threshold;
b) its value is greater than all element values within 0.07 seconds before and after it.
The position of each local maximum point is the position of the energy peak of the final (vowel) of a syllable.
The beneficial effects of the invention are as follows: speaker segmentation cuts a voice stream containing multiple speakers into a plurality of speech segments, each containing only one speaker, and speaker clustering then combines the speech segments of the same speaker, so the invention can estimate the speech rate of multi-speaker speech. In addition, the number of syllables is determined by detecting the local maximum points of each speaker's energy envelope, from which each speaker's speech rate is estimated. Compared with speech rate estimation methods based on speech recognition, no complex numerical computation is required (for example, computing the output probabilities of acoustic and language models), which saves computation time and makes the method better suited to real-time speech rate estimation.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 is a flowchart of a method for estimating the speech rates of multiple speakers according to an embodiment of the invention. As shown in Fig. 1, a voice stream is first read in at step 101. The voice stream is speech data containing the speech of multiple speakers, and may be a file of various formats, for example WAV, RAM, MP3 or VOX.
Then, at step 102, the silence segments and speech segments in the voice stream are found using a silence detection method based on threshold decision; the speech segments are spliced in order into one long speech segment, and audio features are extracted from it; using these features, the speaker change points are detected by judging the similarity between adjacent data windows in the long speech segment according to the Bayesian Information Criterion; finally, the voice stream is divided at the detected speaker change points into a plurality of speech segments, each containing only one speaker.
The silence detection method of step 102 specifically comprises the following steps (an illustrative code sketch follows the steps):
1) Divide the voice stream into T frames with a frame length of 32 milliseconds (the number of sampling points per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N sampling points, it is discarded;
2) Compute the energy E_t of the t-th frame speech signal x_t(n), 1 ≤ t ≤ T:

E_t = Σ_{n=1}^{N} x_t²(n)

to obtain the energy vector E = [E_1, E_2, …, E_T] of the voice stream, where T is the total number of frames;
3) Because speech energy varies greatly across environments, judging silence versus speech with a fixed energy threshold has significant limitations, but the relative relationship between the energy of speech and that of silence is constant, so an adaptive energy threshold T_E is defined:

T_E = min(E) + 0.3 × [mean(E) − min(E)]

where min(E) is the minimum and mean(E) the mean of the frame energies;
4) Compare the energy of each frame with the energy threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silence segment, and adjacent speech frames are spliced in order into a speech segment.
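A minimal NumPy sketch of the silence detection just described, assuming a mono signal x sampled at fs; grouping the labeled frames into silence and speech segments is left as a simple run-length step:

```python
import numpy as np

def detect_speech_frames(x, fs):
    """Threshold-based silence detection with the adaptive threshold
    T_E = min(E) + 0.3 * (mean(E) - min(E))."""
    x = np.asarray(x, dtype=float)
    N = int(0.032 * fs)                  # 32 ms frame length in samples
    shift = int(0.016 * fs)              # 16 ms frame shift
    T = 1 + (len(x) - N) // shift        # full frames; a short last frame is discarded
    E = np.array([np.sum(x[t * shift : t * shift + N] ** 2) for t in range(T)])
    T_E = E.min() + 0.3 * (E.mean() - E.min())   # adaptive energy threshold
    return E >= T_E                      # True = speech frame, False = silent frame
```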
The method of determining the speaker change points using the Bayesian Information Criterion in step 102 specifically comprises the following steps (a code sketch follows the steps):
1) Splice the speech segments obtained by silence detection in order into one long speech segment, and cut the long speech segment into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each data window into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract the MFCCs and Delta-MFCCs features from each frame; the dimension M of the MFCCs and of the Delta-MFCCs is 12, the features of each data window constitute a feature matrix F, and the dimension of F is d = 2M = 24;
2) Compute the BIC distance between two adjacent data windows x and y:

ΔBIC = [(n_x + n_y)/2] log det(cov(F_z)) − (n_x/2) log det(cov(F_x)) − (n_y/2) log det(cov(F_y)) − α P, with P = (1/2)[d + d(d+1)/2] log(n_x + n_y)

where z is the data window obtained by merging data windows x and y; n_x and n_y are the numbers of frames in x and y respectively; F_x, F_y and F_z are the feature matrices of x, y and z respectively; cov(F_x), cov(F_y) and cov(F_z) are their covariance matrices; det(·) denotes the matrix determinant; and α is a penalty coefficient whose experimental value is 2.0;
3) If the BIC distance ΔBIC is greater than zero, the two data windows are regarded as belonging to two different speakers (i.e., a speaker change point exists between them); otherwise they are regarded as belonging to the same speaker and are merged;
4) Slide the data window continually, judging whether the BIC distance between adjacent data windows is greater than zero and saving the speaker change points, until the BIC distances between all adjacent data windows of the long speech segment have been judged.
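The ΔBIC test can be sketched as follows; the feature matrices are (frames × d) arrays, and the penalty term is the standard full-covariance BIC penalty assumed in the formula above:

```python
import numpy as np

def delta_bic(Fx, Fy, alpha=2.0):
    """BIC distance between two adjacent data windows.
    Fx, Fy: feature matrices of shape (n_frames, d)."""
    Fz = np.vstack([Fx, Fy])                  # merged window z
    nx, ny, d = len(Fx), len(Fy), Fx.shape[1]

    def logdet_cov(F):
        # log-determinant of the covariance matrix of a feature matrix
        return np.linalg.slogdet(np.cov(F, rowvar=False))[1]

    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return (0.5 * (nx + ny) * logdet_cov(Fz)
            - 0.5 * nx * logdet_cov(Fx)
            - 0.5 * ny * logdet_cov(Fy)
            - alpha * penalty)

# A change point is hypothesised between x and y when delta_bic(Fx, Fy) > 0.
```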
The extraction of the MFCCs and Delta-MFCCs features comprises the following steps (a library-based sketch follows the steps):
1) Divide the speech signal into T frames with a frame length of 32 milliseconds (the number of sampling points per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N sampling points, it is discarded;
2) Apply the Discrete Fourier Transform (DFT) to the t-th frame speech signal x_t(n), 1 ≤ t ≤ T, to obtain the linear spectrum X_t(k):

X_t(k) = Σ_{n=0}^{N−1} x_t(n) e^{−j2πnk/N}, 0 ≤ k ≤ N−1;
3) Pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, and apply the logarithm to obtain the log spectrum S_t(m). The Mel-frequency filter bank is a set of band-pass filters H_m(k), 0 ≤ m < M, where M is the number of filters. Each filter has a triangular filtering characteristic with centre frequency f(m); the spacing between adjacent centre frequencies f(m) is small for small m and grows gradually as m increases. The transfer function of each band-pass filter is:

H_m(k) = 0, for k < f(m−1) or k > f(m+1);
H_m(k) = [k − f(m−1)] / [f(m) − f(m−1)], for f(m−1) ≤ k ≤ f(m);
H_m(k) = [f(m+1) − k] / [f(m+1) − f(m)], for f(m) ≤ k ≤ f(m+1)
where the boundary points f(m) are defined as:

f(m) = (N/f_s) × B⁻¹( B(f_l) + m × [B(f_h) − B(f_l)] / (M+1) )

where f_l and f_h are the lowest and highest frequencies of the filter bank's frequency range, B(f) = 1125 ln(1 + f/700) is the Mel-scale mapping, and B⁻¹ is its inverse: B⁻¹(b) = 700(e^{b/1125} − 1). The log spectrum S_t(m) is therefore obtained from the linear spectrum X_t(k) by:

S_t(m) = ln[ Σ_{k=0}^{N−1} |X_t(k)|² H_m(k) ], 0 ≤ m < M;
4) Transform the log spectrum S_t(m) to the cepstral domain by the Discrete Cosine Transform (DCT) to obtain the t-th frame MFCCs, C_t(p):

C_t(p) = Σ_{m=0}^{M−1} S_t(m) cos( πp(m + 1/2)/M ), 1 ≤ p ≤ M;
5) Compute the first-order difference (Delta-MFCCs) of the t-th frame MFCCs, ΔC_t(p):

ΔC_t(p) = [ Σ_{q=−Q}^{Q} q × C_{t+q}(p) ] / [ Σ_{q=−Q}^{Q} q² ]

where Q is a constant, set to 3 in the experiments;
6) Repeat steps 2)–5) for every frame of the speech signal to obtain the MFCCs and Delta-MFCCs of all T frames; arrange them frame by frame into an MFCC matrix and a Delta-MFCC matrix, and merge the two matrices to constitute the feature matrix F.
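For illustration, equivalent features can be obtained with an off-the-shelf extractor. The sketch below uses librosa, which is an assumption: its default mel filter bank approximates, but does not exactly match, the filter-bank construction specified above.

```python
import numpy as np
import librosa

def extract_features(x, fs, n_mfcc=12):
    """MFCCs + Delta-MFCCs with a 32 ms frame and 16 ms shift,
    stacked into a feature matrix F of dimension d = 2 * n_mfcc."""
    n_fft = int(0.032 * fs)        # 32 ms frame length in samples
    hop = int(0.016 * fs)          # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)        # first-order difference
    return np.vstack([mfcc, delta]).T          # shape: (frames, 2 * n_mfcc)
```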
At step 103, audio features comprising MFCCs and Delta-MFCCs are extracted from each segmented speech segment, and speaker clustering is performed on the single-speaker speech segments using a spectral clustering algorithm, yielding the number of speakers and each speaker's speech. The concrete steps are as follows (a code sketch follows the steps):
1) Divide each speech segment into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract the MFCCs and Delta-MFCCs features from each frame; the dimension M of the MFCCs and of the Delta-MFCCs is 12, and the features of each speech segment constitute a feature matrix F_j of dimension d = 2M = 24;
2) Collect the feature matrices of all speech segments to be clustered into the set F = {F_1, …, F_J}, where J is the total number of speech segments, and construct from F an affinity matrix A ∈ R^{J×J} whose (i, j)-th element A_ij is defined as:

A_ij = exp( −d²(F_i, F_j) / (σ_i σ_j) ) for i ≠ j, and A_ii = 0
where d(F_i, F_j) is the Euclidean distance between the feature matrices F_i and F_j, and σ_i (or σ_j) is a scale parameter, defined as the variance of the vector of Euclidean distances between the i-th (or j-th) feature matrix and the other J − 1 feature matrices;
3) Construct a diagonal matrix D whose (i, i)-th element equals the sum of all elements in the i-th row of the affinity matrix A, and construct from D and A the normalized affinity matrix L = D^{−1/2} A D^{−1/2};
4) Compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_{K_max} of the matrix L and the corresponding eigenvectors v_1, v_2, …, v_{K_max}, where each v_k (1 ≤ k ≤ K_max) is a column vector. Estimate the optimal number of classes (i.e., the number of speakers) K from the differences between adjacent eigenvalues:

K = arg max_{1 ≤ k ≤ K_max − 1} (λ_k − λ_{k+1})

According to the estimated number of speakers K, construct the matrix V = [v_1, v_2, …, v_K] ∈ R^{J×K};
5) Normalize each row of the matrix V to obtain the matrix Y ∈ R^{J×K}, whose (j, k)-th element is:

Y_jk = V_jk / ( Σ_{k′=1}^{K} V_jk′² )^{1/2};
6) Treat each row of the matrix Y as a point in the space R^K, and cluster the J rows (i.e., J points) into K classes using the K-means algorithm;
7) The speech segment corresponding to the feature matrix F_j is assigned to the k-th class (i.e., the k-th speaker) if and only if the j-th row of the matrix Y is clustered into the k-th class;
8) From the above clustering result, obtain the number of speakers, each speaker's speech, and its duration.
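A sketch of clustering steps 2)–6) under the formulas assumed above; the (J × J) matrix of Euclidean distances between segment feature matrices is taken as input, and scikit-learn's KMeans stands in for a hand-written K-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(dist, k_max=10):
    """dist: (J, J) Euclidean distances between segment feature matrices.
    Returns the estimated speaker count K and one label per segment."""
    J = dist.shape[0]
    # scale parameter sigma_i: variance of distances from segment i to the others
    sigma = np.array([np.var(np.delete(dist[i], i)) for i in range(J)])
    A = np.exp(-dist ** 2 / np.outer(sigma, sigma))     # affinity matrix
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                     # D^{-1/2} A D^{-1/2}
    w, v = np.linalg.eigh(L)                            # eigenvalues, ascending
    w, v = w[::-1], v[:, ::-1]                          # reorder descending
    K = int(np.argmax(w[:k_max - 1] - w[1:k_max])) + 1  # eigengap estimate of K
    V = v[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)    # row-normalized
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)
    return K, labels
```

The eigengap step makes the number of speakers an output of the clustering rather than a required input, which is what allows the method to handle an unknown number of speakers.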
Finally, at step 104, the energy envelope is extracted from each speaker's speech, the number of syllables is determined by detecting the local maximum points of the energy envelope, and each speaker's speech rate is estimated. In standard Chinese, essentially every syllable contains a final (vowel); the number of finals equals the number of syllables, the number of syllables equals the number of characters, and the final has the maximum energy within its syllable. The number of characters, and hence the speech rate, can therefore be obtained by detecting the finals at the energy maxima. The concrete steps of the speech rate estimation method based on these considerations are as follows (a code sketch follows the steps):
1) Compute the energy E(n) of each speaker's speech signal s(n):

E(n) = s²(n), 1 ≤ n ≤ Len

where Len is the total number of sampling points of the speech signal;
2) Filter the energy E(n) with a low-pass filter to obtain the energy envelope. The technical specifications of the low-pass filter are: an FIR filter designed by the equiripple method, sampling frequency f_s of 16000 Hz, passband cutoff frequency f_pass of 50 Hz, stopband cutoff frequency f_stop of 100 Hz, maximum passband attenuation A_pass of 1 dB, and minimum stopband attenuation A_stop of 80 dB;
3) Compute the energy envelope threshold T_E:

T_E = 0.4 × mean(E(n))

where mean(E(n)) is the mean value of the energy envelope;
4) Take as local maximum points the elements of the energy envelope that satisfy both of the following conditions:

Condition 1: the element value is greater than the energy envelope threshold T_E;
Condition 2: the element value is greater than all element values within 0.07 seconds before and after it, i.e., greater than the 0.07 × f_s element values on either side.

The position (sampling point) of each such local maximum point is the position of the energy peak of the final of a syllable. The value 0.07 seconds is chosen because the minimum average syllable duration is approximately 0.14 seconds, so the positions in E(n) that exceed T_E and exceed all element values within 0.07 seconds on either side are the positions of the energy peaks of the syllable finals;
5) Take the number of local maximum points in the speaker's energy envelope as the number of syllables (characters), and divide it by the duration of the speaker's speech (in seconds) to obtain the speaker's speech rate (characters/second);
6) Repeat steps 1)–5) until the speech rate of every speaker's speech has been estimated.
Fig. 2(a) shows the waveform of a 5-second speech signal of one speaker, and Fig. 2(b) shows the energy envelope of the speech signal of Fig. 2(a) (solid line), the energy envelope threshold (dotted line), and the local maximum points of the energy envelope obtained by the above speech rate estimation steps (dash-dot lines with circles). As can be seen from Fig. 2, the duration of this speaker's speech signal is 5 seconds and the number of local maximum points is 22, i.e., the number of characters is 22; this speaker's speech rate is therefore 4.4 characters/second (or 264 characters/minute).
Although the multi-speaker speech rate estimation method of the present invention has been described in detail through the above embodiment, this should not be interpreted as limiting the scope of the invention. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all belong to the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the appended claims.