
CN102543063B - Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers - Google Patents

Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Info

Publication number
CN102543063B
CN102543063B CN2011104035773A CN201110403577A
Authority
CN
China
Prior art keywords
speaker
speech
voice
energy
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011104035773A
Other languages
Chinese (zh)
Other versions
CN102543063A (en)
Inventor
李艳雄
徐鑫
贺前华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN2011104035773A priority Critical patent/CN102543063B/en
Publication of CN102543063A publication Critical patent/CN102543063A/en
Application granted granted Critical
Publication of CN102543063B publication Critical patent/CN102543063B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

A multi-speaker speech rate estimation method based on speaker segmentation and clustering relates to a method for estimating the speech rates of multiple speakers. The method of the present invention: first, a speech stream is read in; next, speaker change points in the stream are detected and the stream is divided into multiple speech segments at these change points; the segments are then clustered by speaker and the segments of the same speaker are concatenated in order, yielding the number of speakers and each speaker's speech; finally, the duration of each speaker's speech and the number of characters it contains are estimated, giving each speaker's speech rate. Compared with current single-speaker speech rate estimation methods based on speech recognition, this method not only can estimate the speech rates of multiple speakers but is also faster.


Description

Multi-speaker speech rate estimation method based on speaker segmentation and clustering
Technical field
The present invention relates to speech signal processing and pattern recognition technology, and in particular to a multi-speaker speech rate estimation method based on speaker segmentation and clustering.
Background technology
With the development of speech processing technology, the object of speech processing is gradually shifting from single-speaker speech to multi-speaker speech (for example conference speech and conversational speech), and it is becoming more and more important to estimate the speech rates of multiple speakers so that the parameters of a speech processing system (for example a speech recognition system) can be adapted to each speaker's rate. In addition, during recording in a studio or laboratory, speakers (for example announcers, hosts, customer-service staff) judge their speech rate subjectively from experience, which is often not accurate enough. Although speech rate can be estimated by manual annotation after recording, this is very time-consuming and hardly feasible when the amount of data is large. Being able to estimate the speech rates of multiple speakers automatically is therefore extremely important.
Existing speech rate estimation methods all target single-speaker speech: they can only estimate one speaker's rate and cannot estimate the rates of multiple speakers. Moreover, they mainly estimate the rate from speech recognition results: a speech recognizer first identifies the phoneme sequence and the time point of each phoneme in the input speech, then identifies the word sequence and the time point of each word, from which the speaker's speech rate is estimated.
The shortcomings of these speech rate estimation methods are:
(1) They can only estimate the speech rate of single-speaker speech. When the input contains the speech of multiple speakers, it is processed as if it came from a single speaker, and no per-speaker speech rate estimates are obtained.
(2) They are slow. Current methods first perform speech recognition on the input and then estimate the speech rate from the recognized phoneme and word sequences. This requires training a large number of phoneme models (typically hidden Markov models) and a large amount of computation during recognition (feature extraction, evaluation of acoustic-model and language-model output probabilities, etc.), so these methods are slow and ill-suited to real-time processing.
Summary of the invention
The objective of the invention is to overcome the above defects of the prior art by providing a multi-speaker speech rate estimation method based on speaker segmentation and clustering: speaker segmentation and clustering first divide the speech stream into segments and concatenate the segments of the same speaker in order; the number of syllables and the duration of each speaker's speech are then estimated separately, realizing multi-speaker speech rate estimation.
The technical solution adopted by the present invention comprises the following steps (an end-to-end code sketch follows the list):
1) Read in the speech stream: read in a speech stream containing the speech of multiple speakers;
2) Speaker segmentation: detect the speaker change points in the speech stream and divide the stream into multiple speech segments at these change points;
3) Speaker clustering: cluster the resulting speech segments by speaker with a spectral clustering algorithm, concatenate the segments of the same speaker in order, and obtain the number of speakers and each speaker's speech;
4) Speech rate estimation: extract the energy envelope of each speaker's speech and determine the number of syllables by finding the local maximum points of the envelope, thereby estimating each speaker's speech rate.
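As an illustration of how the four steps fit together, the sketch below composes them into one Python routine. The helper functions segment_by_speaker and cluster_segments are hypothetical placeholders for the segmentation and clustering procedures detailed later in this description, and estimate_speech_rate is the per-speaker routine sketched under speech rate estimation; their interfaces are assumptions made here for readability.

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any WAV/PCM reader would do

def multi_speaker_speech_rates(path):
    """End-to-end sketch of steps 1)-4); helper interfaces are assumed."""
    x, fs = sf.read(path)                        # 1) read in the speech stream
    segments = segment_by_speaker(x, fs)         # 2) silence detection + BIC change points
    K, labels = cluster_segments(segments, fs)   # 3) spectral clustering by speaker
    rates = {}
    for k in range(K):                           # 4) per-speaker speech rate
        speech_k = np.concatenate(
            [seg for seg, lab in zip(segments, labels) if lab == k])
        rates[k] = estimate_speech_rate(speech_k, fs)
    return rates
```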
The speaker segmentation of step 2) comprises the following steps:
2.1) Use a threshold-based silence detection algorithm to find the silent segments and speech segments in the speech stream that has been read in;
2.2) Concatenate the speech segments in order into one long speech segment, and extract from it audio features comprising Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs);
2.3) Using the extracted audio features and the Bayesian information criterion, evaluate the similarity between adjacent data windows in the long speech segment to detect speaker change points;
2.4) Divide the speech stream into multiple speech segments at these speaker change points, so that each segment contains only one speaker.
The threshold-based silence detection algorithm of step 2.1) comprises the following steps:
2.1.1) Divide the speech stream into frames and compute the energy of each frame, obtaining the energy feature vector of the stream;
2.1.2) Compute the energy threshold;
2.1.3) Compare the energy of each frame with the threshold: frames below the threshold are silent frames, the rest are speech frames; concatenate adjacent silent frames in order into a silent segment, and adjacent speech frames in order into a speech segment.
The speech rate estimation of step 4) comprises the following steps:
4.1) Compute the energy of a speaker's speech;
4.2) Filter the extracted energy with a low-pass filter to obtain the energy envelope;
4.3) Compute the energy envelope threshold;
4.4) Determine the local maximum points of the energy envelope and count them;
4.5) Take the number of local maximum points in this speaker's energy envelope as the number of syllables and divide it by the duration of the speaker's speech to obtain the speaker's speech rate;
4.6) Repeat steps 4.1)–4.5) until the speech rate of every speaker has been estimated.
A local maximum point satisfies the following conditions:
a) the element value is greater than the energy envelope threshold;
b) the element value is greater than all element values within 0.07 seconds before and after it.
The position of each local maximum point is the position of the energy peak of the final (the vowel part) of a syllable.
The beneficial effects of the invention are: speaker segmentation cuts a speech stream containing multiple speakers into segments that each contain only one speaker, and speaker clustering then groups together the segments of the same speaker, so the invention can estimate the speech rates of multiple speakers. In addition, the number of syllables is determined by detecting the local maximum points of each speaker's energy envelope, from which the speech rate is estimated; compared with speech rate estimation based on speech recognition, no complex numerical computation (for example evaluating acoustic-model and language-model output probabilities) is needed, which saves computation time and makes the method better suited to real-time speech rate estimation.
Description of drawings
Fig. 1 is a process flow diagram of the present invention.
Fig. 2 illustrates speech rate estimation in an embodiment of the invention: Fig. 2(a) is the waveform of one speaker's speech, and Fig. 2(b) is the extracted speech energy, where the solid line is the energy envelope, the circled dash-dot marks are the local maximum points of the envelope, and the dashed line is the energy envelope threshold.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 is the flow chart of the multi-speaker speech rate estimation method according to an embodiment of the invention. As shown in Fig. 1, in step 101 the speech stream is first read in. The speech stream is audio data containing the speech of multiple speakers and may be a file in various formats, for example WAV, RAM, MP3 or VOX.
Then, in step 102, a threshold-based silence detection method is used to find the silent segments and speech segments in the stream; the speech segments are concatenated in order into one long speech segment, audio features are extracted from it, and, using these features and the Bayesian information criterion, the similarity between adjacent data windows in the long segment is evaluated to detect speaker change points; finally, the speech stream is divided at these change points into multiple speech segments, each containing only one speaker.
The silence detection method in step 102 specifically comprises the following steps (a code sketch follows the list):
1) Divide the speech stream into T frames with a frame length of 32 ms (corresponding to N = 0.032 × f_s samples, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 ms; if the last frame has fewer than N samples, it is discarded;
2) Compute the energy E_t of the t-th frame signal x_t(n), 1 ≤ t ≤ T:

E_t = \sum_{n=1}^{N} x_t^2(n), \quad 1 \le t \le T

obtaining the energy vector E = [E_1, E_2, ..., E_T] of the stream, where T is the total number of frames;
3) Because speech energy differs greatly across environments, judging speech against silence with a fixed energy threshold has significant limitations; the relative relationship between speech energy and silence energy, however, is stable, so an adaptive energy threshold T_E is defined:

T_E = \min(E) + 0.3 \times [\operatorname{mean}(E) - \min(E)]

where min(E) is the minimum frame energy and mean(E) is the mean frame energy;
4) Compare the energy of each frame with the threshold: frames below the threshold are silent frames, the rest are speech frames; concatenate adjacent silent frames in order into a silent segment, and adjacent speech frames in order into a speech segment.
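A minimal Python sketch of the threshold-based silence detection above is given below; the function and variable names are illustrative, not from the patent text, and the merging of frames into segments is done at frame resolution.

```python
import numpy as np

def detect_speech_segments(x, fs, frame_len=0.032, frame_shift=0.016):
    """Threshold-based silence detection (sketch of steps 1)-4) above).

    Returns a list of (start_frame, end_frame) index pairs for speech segments.
    """
    N = int(frame_len * fs)              # samples per frame
    step = int(frame_shift * fs)         # frame shift in samples
    n_frames = 1 + (len(x) - N) // step  # the incomplete last frame is dropped

    # Per-frame energy E_t = sum_n x_t(n)^2
    energy = np.array([np.sum(x[t * step: t * step + N] ** 2.0)
                       for t in range(n_frames)])

    # Adaptive threshold T_E = min(E) + 0.3 * (mean(E) - min(E))
    T_E = energy.min() + 0.3 * (energy.mean() - energy.min())
    is_speech = energy >= T_E

    # Concatenate adjacent speech frames into speech segments
    segments, start = [], None
    for t, flag in enumerate(is_speech):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, n_frames - 1))
    return segments
```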
The BIC-based method for determining speaker change points in step 102 specifically comprises the following steps (a code sketch of the BIC distance follows the list):
1) Concatenate in order the speech segments obtained by silence detection into one long speech segment, and cut the long segment into data windows with a window length of 2 seconds and a window shift of 0.1 second. Each data window is divided into frames (frame length 32 ms, frame shift 16 ms); MFCCs and Delta-MFCCs are extracted from each frame, the dimension M of the MFCCs and of the Delta-MFCCs is 12, and the features of each data window form a feature matrix F of dimension d = 2M = 24;
2) Compute the BIC distance between two adjacent data windows x and y:

\Delta BIC = (n_x + n_y)\ln\bigl(\lvert\det(\operatorname{cov}(F_z))\rvert\bigr) - n_x\ln\bigl(\lvert\det(\operatorname{cov}(F_x))\rvert\bigr) - n_y\ln\bigl(\lvert\det(\operatorname{cov}(F_y))\rvert\bigr) - \alpha\left(d + \frac{d(d+1)}{2}\right)\ln(n_x + n_y)

where z is the data window obtained by merging x and y, n_x and n_y are the numbers of frames in x and y, F_x, F_y and F_z are the feature matrices of x, y and z, cov(F_x), cov(F_y) and cov(F_z) are their covariance matrices, det(·) denotes the matrix determinant, and α is the penalty coefficient, set experimentally to 2.0;
3) If the BIC distance ΔBIC is greater than zero, the two data windows are regarded as belonging to two different speakers (i.e. there is a speaker change point between them); otherwise they are regarded as belonging to the same speaker and are merged;
4) Keep sliding the data window, testing whether the BIC distance between each pair of adjacent windows is greater than zero and recording the speaker change points, until the BIC distances between all adjacent data windows of the long speech segment have been tested.
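The BIC distance of step 2) can be sketched directly from the formula; the function below is illustrative and assumes each window's features are stored as a (frames × dimensions) NumPy array.

```python
import numpy as np

def bic_distance(F_x, F_y, alpha=2.0):
    """Delta-BIC between two adjacent data windows.

    F_x, F_y: feature matrices of shape (n_frames, d), e.g. d = 24
    (12 MFCCs + 12 Delta-MFCCs per frame). A positive value indicates
    a speaker change point between the two windows.
    """
    n_x, d = F_x.shape
    n_y = F_y.shape[0]
    F_z = np.vstack([F_x, F_y])  # merged window z

    def logdet_cov(F):
        # ln|det(cov(F))|, computed via slogdet to avoid overflow
        _, logdet = np.linalg.slogdet(np.cov(F, rowvar=False))
        return logdet

    penalty = alpha * (d + d * (d + 1) / 2.0) * np.log(n_x + n_y)
    return ((n_x + n_y) * logdet_cov(F_z)
            - n_x * logdet_cov(F_x)
            - n_y * logdet_cov(F_y)
            - penalty)
```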
The extraction of the MFCCs and Delta-MFCCs features comprises the following steps (a library-based sketch follows the list):
1) Divide the speech signal into T frames with a frame length of 32 ms (N = 0.032 × f_s samples, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 ms; if the last frame has fewer than N samples, it is discarded;
2) Apply the discrete Fourier transform (DFT) to the t-th frame signal x_t(n), 1 ≤ t ≤ T, to obtain the linear spectrum X_t(k):

X_t(k) = \sum_{n=0}^{N-1} x_t(n)\, e^{-j2\pi nk/N}, \quad 0 \le n, k \le N-1
3) Pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, then take the logarithm to obtain the log spectrum S_t(m). The Mel-frequency filter bank consists of M band-pass filters H_m(k), 0 ≤ m < M, where M is the number of filters; each filter has a triangular frequency response centred at f(m), and the spacing between adjacent centre frequencies f(m) is small for small m and grows gradually as m increases. The transfer function of each band-pass filter is:

H_m(k) =
\begin{cases}
0 & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\
0 & k > f(m+1)
\end{cases}
\qquad (0 \le m < M)

where f(m) is defined as:

f(m) = \left(\frac{N}{f_s}\right) B^{-1}\!\left(B(f_l) + m\,\frac{B(f_h) - B(f_l)}{M+1}\right)

with f_l and f_h the lowest and highest frequencies of the filters' frequency range and B^{-1} the inverse function of B: B^{-1}(b) = 700\,(e^{b/1125} - 1). The log spectrum S_t(m) is therefore obtained from the linear spectrum X_t(k) as:

S_t(m) = \ln\!\left(\sum_{k=0}^{N-1} \lvert X_t(k)\rvert^2 H_m(k)\right), \quad (0 \le m < M)
4) Transform the log spectrum S_t(m) to the cepstral domain with the discrete cosine transform (DCT) to obtain the MFCCs of the t-th frame, C_t(p):

C_t(p) = \sum_{m=0}^{M-1} S_t(m)\cos\!\left(\frac{(m+0.5)\, p\pi}{M}\right), \quad (0 \le p < M)

5) Compute the first-order differences (Delta-MFCCs) of the t-th frame MFCCs C_t(p):

C_t'(p) = \frac{1}{\sum_{q=-Q}^{Q} q^2} \sum_{q=-Q}^{Q} q \times C_t(p+q), \quad (0 \le p < M)

where Q is a constant, set to 3 in the experiments;
6) Repeat steps 2)–5) for every frame to obtain the MFCCs and Delta-MFCCs of all T frames, stack them frame by frame into an MFCC matrix and a Delta-MFCC matrix, and concatenate the two matrices to form the feature matrix F.
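The feature extraction above can be sketched compactly with librosa, which implements the framing, Mel filter bank, DCT and delta computation of steps 1)–6); the patent's frame length, frame shift and M = 12 are passed in, but librosa's internal defaults (window type, Mel scale variant, delta regression width) may differ slightly from the closed-form expressions above.

```python
import numpy as np
import librosa

def extract_features(x, fs, n_mfcc=12):
    """MFCCs + Delta-MFCCs per frame, stacked into a (n_frames, 24) matrix F."""
    n_fft = int(0.032 * fs)    # 32 ms frame length
    hop = int(0.016 * fs)      # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)  # (12, T)
    delta = librosa.feature.delta(mfcc)                        # (12, T)
    return np.vstack([mfcc, delta]).T                          # (T, 24)
```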
In step 103, audio features comprising MFCCs and Delta-MFCCs are extracted from each of the resulting speech segments, and the segments, each containing one speaker, are clustered by speaker with a spectral clustering algorithm to obtain the number of speakers and each speaker's speech. The concrete steps are as follows (a code sketch follows the list):
1) Divide each speech segment into frames (frame length 32 ms, frame shift 16 ms) and extract MFCCs and Delta-MFCCs from each frame, with dimension M = 12 each; the features of each speech segment form a feature matrix F_j of dimension d = 2M = 24;
2) From the individual feature matrices F_j, form the set F = {F_1, ..., F_J} of all segments to be clustered, where J is the total number of speech segments, and construct the affinity matrix A ∈ R^{J×J}, whose (i, j)-th element A_{ij} is defined as:

A_{ij} =
\begin{cases}
\exp\!\left(-\dfrac{d^2(F_i, F_j)}{2\sigma_i \sigma_j}\right) & i \ne j \\
0 & i = j
\end{cases}

where d(F_i, F_j) is the Euclidean distance between feature matrices F_i and F_j, and σ_i (or σ_j) is a scale parameter, defined as the variance of the vector of Euclidean distances between the i-th (or j-th) feature matrix and the other J − 1 feature matrices;
3) Construct the diagonal matrix D whose (i, i)-th element equals the sum of all elements in the i-th row of the affinity matrix A, and from D and A construct the normalized affinity matrix L = D^{-1/2} A D^{-1/2};
4) Compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_{K_max} of the matrix L and the corresponding eigenvectors v_1, v_2, ..., v_{K_max}, where each v_k (1 ≤ k ≤ K_max) is a column vector, and estimate the optimal number of classes (i.e. the number of speakers) K from the differences between adjacent eigenvalues:

K = \arg\max_{i \in [1, K_{max}-1]} (\lambda_i - \lambda_{i+1})

According to the estimated number of speakers K, construct the matrix V = [v_1, v_2, ..., v_K] ∈ R^{J×K};
5) Normalize each row of the matrix V to obtain the matrix Y ∈ R^{J×K}, whose (j, k)-th element Y_{jk} is:

Y_{jk} = \frac{V_{jk}}{\left(\sum_{k=1}^{K} V_{jk}^2\right)^{1/2}}, \quad 1 \le j \le J;

6) Treat each row of Y as a point in the space R^K and cluster these J rows (i.e. J points) into K classes with the K-means algorithm;
7) The speech segment corresponding to feature matrix F_j is assigned to the k-th class (i.e. the k-th speaker) if and only if the j-th row of Y is clustered into the k-th class;
8) From the clustering result, obtain the number of speakers, each speaker's speech and its duration.
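A sketch of the spectral clustering of steps 1)–8) is given below. Because the patent does not spell out how the Euclidean distance d(F_i, F_j) is evaluated between feature matrices of different lengths, each segment is summarized here by its mean feature vector; that simplification, and the helper names, are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_speaker_clustering(features, k_max=10):
    """Cluster speech segments by speaker; features is a list of J
    per-segment matrices of shape (n_frames, 24).
    Returns (K, labels): estimated speaker count and per-segment labels."""
    J = len(features)
    means = np.array([F.mean(axis=0) for F in features])    # (J, 24)
    dist = cdist(means, means)                               # d(F_i, F_j)

    # Scale parameters sigma_i: variance of distances to the other segments
    sigma = np.array([np.var(np.delete(dist[i], i)) for i in range(J)])

    # Affinity matrix A and normalized affinity L = D^{-1/2} A D^{-1/2}
    A = np.exp(-dist ** 2 / (2.0 * np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(A, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1) + 1e-12)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Largest eigenvalues; the eigengap gives the number of speakers K
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k_max = min(k_max, J)
    K = int(np.argmax(eigvals[:k_max - 1] - eigvals[1:k_max]) + 1) if k_max > 1 else 1

    # Row-normalize the top-K eigenvectors and run K-means on the rows
    V = eigvecs[:, :K]
    Y = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)
    return K, labels
```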
Finally, in step 104, the energy envelope is extracted from each speaker's speech, the number of syllables is determined from the detected local maximum points of the envelope, and each speaker's speech rate is estimated. In standard Chinese, almost every syllable contains a final (the vowel part of the syllable): the number of finals is the number of syllables, the number of syllables is the number of characters, and the final carries the maximum energy within its syllable. The number of characters can therefore be obtained by counting the detected energy maxima of the finals, from which the speech rate is estimated. The concrete steps of the speech rate estimation method based on this observation are as follows (a code sketch follows the list):
1) Compute the energy E(n) of each speaker's speech signal s(n):

E(n) = s^2(n), \quad 1 \le n \le Len

where Len is the total number of samples of the speech signal;
2) Filter the energy E(n) with a low-pass filter to obtain the energy envelope. The specifications of this low-pass filter are: an FIR filter designed with the equiripple method, sampling frequency f_s = 16000 Hz, passband cutoff frequency f_pass = 50 Hz, stopband cutoff frequency f_stop = 100 Hz, maximum passband attenuation A_pass = 1 dB, and minimum stopband attenuation A_stop = 80 dB;
3) Compute the energy envelope threshold T_E:

T_E = 0.4 \times \operatorname{mean}(E(n))

where mean(E(n)) is the mean value of the energy envelope;
4) Take as local maximum points the elements of the energy envelope that satisfy the following two conditions:

Condition 1: the element value is greater than the energy envelope threshold T_E;

Condition 2: the element value is greater than all element values within 0.07 seconds before and after it, i.e. greater than the 0.07 × f_s element values on each side.

The position (sample) of such a local maximum point is the position of the energy peak of the final of a syllable. The reason for choosing 0.07 seconds is that the minimum average duration of a syllable is about 0.14 seconds, so the positions in E(n) that are greater than T_E and greater than the element values within 0.07 seconds on either side are the positions of the energy peaks of the finals;
5) Take the number of local maximum points in a speaker's speech energy envelope as the number of syllables (characters) and divide it by the duration of the speaker's speech (in seconds) to obtain the speaker's speech rate (characters per second);
6) Repeat steps 1)–5) until the speech rate of every speaker has been estimated.
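A minimal Python sketch of steps 1)–6) follows. The patent specifies an equiripple FIR low-pass filter (passband 50 Hz, stopband 100 Hz, 1 dB/80 dB); here a simple windowed FIR design with a 75 Hz cutoff is used as an approximation, and the function name is illustrative.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def estimate_speech_rate(s, fs=16000):
    """Speech rate in syllables (characters) per second from the energy envelope."""
    E = s.astype(float) ** 2                 # E(n) = s^2(n)

    # Low-pass filter the energy to obtain the envelope (approximate design)
    lp = firwin(numtaps=501, cutoff=75.0, fs=fs)
    env = filtfilt(lp, [1.0], E)

    T_E = 0.4 * env.mean()                   # envelope threshold
    half_win = int(0.07 * fs)                # 0.07 s on each side

    # Local maxima: above threshold and strictly larger than all
    # neighbouring values within 0.07 s before and after
    n_syllables = 0
    for n in range(half_win, len(env) - half_win):
        window = env[n - half_win: n + half_win + 1]
        if env[n] > T_E and env[n] == window.max() and (window == env[n]).sum() == 1:
            n_syllables += 1

    duration = len(s) / fs                   # speech duration in seconds
    return n_syllables / duration
```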
Fig. 2(a) shows the waveform of a 5-second speech signal of one speaker, and Fig. 2(b) shows the corresponding energy envelope (solid line), the energy envelope threshold (dashed line), and the local maximum points of the envelope obtained by the speech rate estimation steps above (circled dash-dot marks). As can be seen from Fig. 2, the duration of this speaker's speech is 5 seconds and the number of local maximum points is 22, i.e. the number of characters is 22; this speaker's speech rate is therefore 4.4 characters per second (or 264 characters per minute).
Although the multi-speaker speech rate estimation method of the present invention has been described in detail through the above embodiment, this should not be interpreted as limiting the scope of the invention. It should be pointed out that a person of ordinary skill in the art can make variations and improvements without departing from the inventive concept, and these all fall within the scope of protection of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (5)

1. A multi-speaker speech rate estimation method based on speaker segmentation and clustering, characterized in that it comprises the following steps:

1) Read in the speech stream: read in a speech stream containing the speech of multiple speakers;

2) Speaker segmentation: detect the speaker change points in the speech stream and divide the stream into multiple speech segments at these change points;

3) Speaker clustering: group the speech segments of the same speaker into one class and concatenate them in order, obtaining the number of speakers and each speaker's speech;

4) Speech rate estimation: extract the energy envelope of each speaker's speech and determine the number of syllables by finding the local maximum points of the envelope, thereby estimating each speaker's speech rate; this step specifically comprises:

4.1) Compute the energy of a speaker's speech;

4.2) Filter the extracted energy with a low-pass filter to obtain the energy envelope;

4.3) Compute the energy envelope threshold;

4.4) Determine the local maximum points of the energy envelope and count them; specifically, an element of the energy envelope is a local maximum point if it satisfies the following two conditions:

a) the element value is greater than the energy envelope threshold;

b) the element value is greater than all element values within 0.07 seconds before and after it;

the position of each local maximum point is the position of the energy peak of the final of a syllable;

4.5) Take the number of local maximum points in the speaker's energy envelope as the number of syllables and divide it by the duration of the speaker's speech to obtain the speaker's speech rate;

4.6) Repeat steps 4.1)–4.5) until the speech rate of every speaker has been estimated.

2. The multi-speaker speech rate estimation method according to claim 1, characterized in that the speaker segmentation of step 2) comprises:

2.1) Using a threshold-based silence detection algorithm to find the silent segments and speech segments in the speech stream that has been read in;

2.2) Concatenating the speech segments in order into one long speech segment and extracting audio features from the long speech segment;

2.3) Using the extracted audio features and the Bayesian information criterion to evaluate the similarity between adjacent data windows in the long speech segment and detect speaker change points;

2.4) Dividing the speech stream into multiple speech segments at these speaker change points, each segment containing only one speaker.

3. The multi-speaker speech rate estimation method according to claim 2, characterized in that the threshold-based silence detection algorithm of step 2.1) comprises:

2.1.1) Dividing the speech stream into frames and computing the energy of each frame to obtain the energy feature vector of the stream;

2.1.2) Computing the energy threshold;

2.1.3) Comparing the energy of each frame with the energy threshold, frames below the threshold being silent frames and the rest being speech frames, concatenating adjacent silent frames in order into a silent segment and adjacent speech frames in order into a speech segment.

4. The multi-speaker speech rate estimation method according to claim 2, characterized in that the audio features of step 2.2) comprise Mel-frequency cepstral coefficients and their first-order differences.

5. The multi-speaker speech rate estimation method according to claim 1, characterized in that the speaker clustering of step 3) uses a spectral clustering algorithm.
CN2011104035773A 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers Expired - Fee Related CN102543063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Publications (2)

Publication Number Publication Date
CN102543063A CN102543063A (en) 2012-07-04
CN102543063B true CN102543063B (en) 2013-07-24

Family

ID=46349803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104035773A Expired - Fee Related CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Country Status (1)

Country Link
CN (1) CN102543063B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN103137137B (en) * 2013-02-27 2015-07-01 华南理工大学 Eloquent speaker finding method in conference audio
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
CN104282303B (en) * 2013-07-09 2019-03-29 威盛电子股份有限公司 Method for voice recognition by voiceprint recognition and electronic device thereof
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
CN104851423B (en) * 2014-02-19 2021-04-13 联想(北京)有限公司 Sound information processing method and device
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104183239B (en) * 2014-07-25 2017-04-19 南京邮电大学 Text-independent speaker recognition method based on weighted Bayes hybrid model
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN106971734B (en) * 2016-01-14 2020-10-23 芋头科技(杭州)有限公司 Method and system for training and identifying model according to extraction frequency of model
CN106205610B (en) * 2016-06-29 2019-11-26 联想(北京)有限公司 A kind of voice information identification method and equipment
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
CN106649513B (en) * 2016-10-14 2020-03-31 盐城工学院 Audio data clustering method based on spectral clustering
CN106531195B (en) * 2016-11-08 2019-09-27 北京理工大学 A dialog conflict detection method and device
CN106782496B (en) * 2016-11-15 2019-08-20 北京科技大学 A Crowd Quantity Monitoring Method Based on Speech and Crowd Sensing
CN106782507B (en) * 2016-12-19 2018-03-06 平安科技(深圳)有限公司 The method and device of voice segmentation
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107967912B (en) * 2017-11-28 2022-02-25 广州势必可赢网络科技有限公司 Human voice segmentation method and device
CN109949813A (en) * 2017-12-20 2019-06-28 北京君林科技股份有限公司 A kind of method, apparatus and system converting speech into text
CN108962283B (en) * 2018-01-29 2020-11-06 北京猎户星空科技有限公司 Method and device for determining question end mute time and electronic equipment
CN108683790B (en) * 2018-04-23 2020-09-22 Oppo广东移动通信有限公司 Voice processing method and related product
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109461447B (en) * 2018-09-30 2023-08-18 厦门快商通信息技术有限公司 End-to-end speaker segmentation method and system based on deep learning
CN109859742B (en) * 2019-01-08 2021-04-09 国家计算机网络与信息安全管理中心 Speaker segmentation clustering method and device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for distinguishing conference content
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, readable storage medium storing program for executing
CN110364183A (en) * 2019-07-09 2019-10-22 深圳壹账通智能科技有限公司 Method, apparatus, computer equipment and the storage medium of voice quality inspection
CN111312256B (en) * 2019-10-31 2024-05-10 平安科技(深圳)有限公司 Voice identification method and device and computer equipment
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN112423094A (en) * 2020-10-30 2021-02-26 广州佰锐网络科技有限公司 Double-recording service broadcasting method and device and storage medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112565880B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Method and system for playing explanation videos
CN112565881B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Self-adaptive video playing method and system
CN112802498B (en) * 2020-12-29 2023-11-24 深圳追一科技有限公司 Voice detection method, device, computer equipment and storage medium
CN112289323B (en) * 2020-12-29 2021-05-28 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN115171724B (en) * 2021-04-01 2025-07-01 暗物智能科技(广州)有限公司 A speech rate analysis method and system
CN114067787B (en) * 2021-12-17 2022-07-05 广东讯飞启明科技发展有限公司 Voice speech speed self-adaptive recognition system
CN114464194A (en) * 2022-03-12 2022-05-10 云知声智能科技股份有限公司 Voiceprint clustering method and device, storage medium and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2990693B2 (en) * 1988-02-29 1999-12-13 株式会社明電舎 Speech synthesizer
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
CN100505040C (en) * 2005-07-26 2009-06-24 浙江大学 Audio Segmentation Method Based on Decision Tree and Speaker Change Detection
CN100485780C (en) * 2005-10-31 2009-05-06 浙江大学 Quick audio-frequency separating method based on tonic frequency

Also Published As

Publication number Publication date
CN102543063A (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN102543063B (en) Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN103646649B (en) A kind of speech detection method efficiently
Zhu et al. Combining speaker identification and BIC for speaker diarization
CN103400580A (en) Method for estimating importance degree of speaker in multiuser session voice
Lokhande et al. Voice activity detection algorithm for speech recognition applications
CN103137137B (en) Eloquent speaker finding method in conference audio
CN100505040C (en) Audio Segmentation Method Based on Decision Tree and Speaker Change Detection
Mitra et al. Medium-duration modulation cepstral feature for robust speech recognition
CN103559882B (en) A speech extraction method for conference moderators based on speaker segmentation
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
Vyas A Gaussian mixture model based speech recognition system using Matlab
CN110232933A (en) Audio detection method and device, storage medium and electronic equipment
CN104021785A (en) Method of extracting speech of most important guest in meeting
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
CN110689887B (en) Audio verification method and device, storage medium and electronic equipment
Jaafar et al. Automatic syllables segmentation for frog identification system
CN106910495A (en) Audio classification system and method applied to abnormal sound detection
CN109473102A (en) A kind of robot secretary intelligent meeting recording method and system
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
Chou et al. On the studies of syllable segmentation and improving MFCCs for automatic birdsong recognition
CN112489692A (en) Voice endpoint detection method and device
CN102201230B (en) Voice detection method for emergency
Kitaoka et al. Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance
Raj et al. Classifier-based non-linear projection for adaptive endpointing of continuous speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130724

Termination date: 20181207