CN105261357A - Voice endpoint detection method and device based on a statistical model
- Publication number: CN105261357A (application CN201510587721.1A)
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention provides a voice endpoint detection method and device based on a statistical model. The method comprises the steps of: receiving an input voice signal to be detected; extracting first voice feature information of the voice signal to be detected frame by frame, and carrying out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected; generating a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model; preliminarily detecting voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm; and calculating confidence information of the voice signal to be detected, and adjusting the voice endpoints according to the confidence information. With this statistical-model-based method, the voice endpoints of the voice signal to be detected are accurately located, the accuracy of voice endpoint detection is improved, and in turn the accuracy of speech recognition and the performance of the speech recognition system are improved.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a voice endpoint detection method and device based on a statistical model.
Background technology
With the development of human-machine interaction technology, speech recognition technology has demonstrated its importance. Voice endpoint detection is one of the key technologies in speech recognition: it refers to finding the starting point and the ending point of the speech portion in a continuous voice signal. Whether endpoint detection is accurate directly affects the performance of the speech recognition system. If the endpoints are cut incorrectly, speech may be missed or misrecognized, making the recognition result inaccurate.
At present, traditional voice endpoint detection methods mainly obtain time-domain or frequency-domain energy and compare it with a given threshold to decide the starting point and ending point of speech. The general endpoint detection process is: 1. extract speech features frame by frame and calculate the time-domain or frequency-domain energy; 2. compare the energy value with the threshold to decide the voice starting point; 3. if a voice starting point is found, continue to take energy values backward and compare them with the threshold to decide whether the speech has ended; 4. if a voice ending point is found, stop searching and return the result.
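For concreteness, this prior-art scheme can be sketched as follows (a minimal Python sketch; the frame sizes and the fixed threshold are illustrative assumptions, and the fixed threshold is precisely the weakness discussed below):

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold=1e-3):
    """Minimal energy-threshold endpoint detector (prior-art style).

    Returns (start_frame, end_frame) of the first detected speech
    segment, or None if no speech is found. The threshold is a
    hypothetical fixed value tuned for normalized float audio; in
    practice it must be re-tuned per SNR.
    """
    start = end = None
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy = np.mean(frame.astype(np.float64) ** 2)  # short-time energy
        if start is None and energy > threshold:
            start = i                     # step 2: energy crosses threshold
        elif start is not None and energy <= threshold:
            end = i                       # steps 3-4: energy falls back
            break
    return (start, end) if start is not None else None
```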
However, in the course of making the present invention the inventors found that the above endpoint detection algorithm has at least the following problems: (1) it is suitable for stationary noise and high signal-to-noise ratio (SNR) environments, but under non-stationary noise and lower SNR its detection performance is poor and the accuracy of the detected voice endpoints is low; (2) for voice signals under different SNRs it is difficult to choose a suitable threshold, so the detection accuracy cannot be guaranteed in both quiet and noisy environments.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, a first object of the present invention is to propose a voice endpoint detection method based on a statistical model. By adjusting the preliminarily detected voice endpoints according to confidence information, the method accurately locates the voice endpoints of the voice signal to be detected and improves the accuracy of voice endpoint detection, which in turn improves the accuracy of speech recognition and the performance of the speech recognition system.
A second object of the present invention is to propose a voice endpoint detection device based on a statistical model.
To achieve the above objects, a voice endpoint detection method based on a statistical model according to an embodiment of the first aspect of the present invention comprises: receiving an input voice signal to be detected; extracting first voice feature information of the voice signal to be detected frame by frame, and carrying out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected; generating a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model; preliminarily detecting voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm; and calculating confidence information of the voice signal to be detected, and adjusting the voice endpoints according to the confidence information.
With the voice endpoint detection method based on a statistical model of the embodiment of the present invention, an input voice signal to be detected is received; first voice feature information of the voice signal is extracted frame by frame and anti-noise processing is carried out on it to generate second voice feature information; a recognition result of the voice signal is generated according to the second voice feature information and an acoustic model; the voice endpoints of the voice signal are preliminarily detected according to the recognition result and a preset silence detection algorithm; and the confidence information of the voice signal is calculated and the voice endpoints are adjusted according to it. An endpoint detection scheme is thus provided in which the preliminarily detected voice endpoints are adjusted by confidence information, so that the voice endpoints of the voice signal to be detected are accurately located, the accuracy of voice endpoint detection is improved, and in turn the accuracy of speech recognition and the performance of the speech recognition system are improved.
To achieve the above objects, a voice endpoint detection device based on a statistical model according to an embodiment of the second aspect of the present invention comprises: a receiving module, configured to receive an input voice signal to be detected; an anti-noise module, configured to extract first voice feature information of the voice signal to be detected frame by frame, and to carry out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected; a generating module, configured to generate a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model; a recognition module, configured to preliminarily detect voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm; a calculating module, configured to calculate confidence information of the voice signal to be detected; and an adjusting module, configured to adjust the voice endpoints according to the confidence information.
In the voice endpoint detection device based on a statistical model of the embodiment of the present invention, the receiving module receives an input voice signal to be detected; the anti-noise module extracts the first voice feature information of the voice signal frame by frame and carries out anti-noise processing on it to generate the second voice feature information; the generating module generates the acoustic recognition result of the voice signal according to the second voice feature information and the acoustic model; the recognition module preliminarily detects the voice endpoints of the voice signal according to the acoustic recognition result and the preset silence detection algorithm; the calculating module calculates the confidence information of the voice signal; and the adjusting module adjusts the voice endpoints according to the confidence information. An endpoint detection scheme is thus provided in which the preliminarily detected voice endpoints are adjusted by confidence information, so that the voice endpoints of the voice signal to be detected are accurately located, the accuracy of voice endpoint detection is improved, and in turn the accuracy of speech recognition and the performance of the speech recognition system are improved.
Brief description of the drawings
Fig. 1 is a flow chart of a voice endpoint detection method based on a statistical model according to an embodiment of the present invention.
Fig. 2 is an example diagram of the optimal word sequence of a voice signal to be detected.
Fig. 3 is a schematic structural diagram of a voice endpoint detection device based on a statistical model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a voice endpoint detection device based on a statistical model according to another embodiment of the present invention.
Embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and shall not be construed as limiting it.
The voice endpoint detection method and device based on a statistical model according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a voice endpoint detection method based on a statistical model according to an embodiment of the present invention.
As shown in Fig. 1, the voice endpoint detection method based on a statistical model comprises:
S101: receive an input voice signal to be detected.
S102: extract first voice feature information of the voice signal to be detected frame by frame, and carry out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected.
Specifically, after the voice signal to be detected is received, it is divided into frames by existing techniques, and the first voice feature information of each frame of the voice signal to be detected is extracted.
The first voice feature information includes the Mel Frequency Cepstrum Coefficients (MFCC) and information such as the first-order and second-order differences of the MFCC.
After the first voice feature information of each frame has been extracted, anti-noise processing can be applied to the first voice feature information at the feature level in order to reduce the impact of noise on subsequent recognition; specifically, the anti-noise processing is carried out by a histogram transformation algorithm.
The basic principle of the histogram transformation algorithm is as follows. Suppose the original feature vector is $x$, with probability density function $P_x(x)$ and cumulative distribution function $C_x(x)$; the transformed feature vector is $y$, with reference probability density function $P_{ref}(y)$ and cumulative distribution function $C_{ref}(y)$, and $y = T(x)$. The transformation function of the feature parameters should satisfy $C_x(x) = C_{ref}(y) = C_{ref}(T(x))$, from which $y = T(x) = C_{ref}^{-1}(C_x(x))$ is obtained.
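As an illustration (not part of the patent text), an empirical version of this transform can be sketched as follows, assuming the reference CDF is estimated from pre-saved feature samples and the mapping is applied per feature dimension:

```python
import numpy as np

def histogram_equalize(features, ref_samples):
    """Map features so their empirical CDF matches a reference CDF.

    features:    (n_frames, n_dims) test features x to transform
    ref_samples: (n_ref, n_dims) pre-saved reference feature samples
    Returns y = C_ref^{-1}(C_x(x)), computed per dimension with
    empirical CDFs and linear interpolation.
    """
    out = np.empty_like(features, dtype=np.float64)
    n = features.shape[0]
    for d in range(features.shape[1]):
        x = features[:, d]
        # empirical CDF value C_x(x) of each test feature
        ranks = np.argsort(np.argsort(x))
        cdf_x = (ranks + 1) / (n + 1)
        # invert the reference CDF via the sorted reference samples
        ref_sorted = np.sort(ref_samples[:, d])
        ref_cdf = (np.arange(len(ref_sorted)) + 1) / (len(ref_sorted) + 1)
        out[:, d] = np.interp(cdf_x, ref_cdf, ref_sorted)
    return out
```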
The cumulative distribution function used in the anti-noise processing of the first voice feature information is crucial. However, when the traditional histogram equalization method is applied at the feature level, the following technical problems arise: a) accurately estimating the cumulative distribution function requires enough feature sample data, and in actual test speech the length of the speech segment cannot guarantee this; b) actual ambient noise is very complex, so the consistency between the feature distributions of the training speech data and the test speech data cannot be guaranteed.
To make up for these deficiencies of the traditional histogram equalization algorithm, voice feature data under different noise environments and different SNR environments can be saved in advance, before the anti-noise processing is applied to the first voice feature information.
Specifically, the SNR of the voice signal to be detected is first calculated; the voice feature data under this SNR is then obtained, according to the SNR, from the pre-saved correspondence between different SNRs and voice feature data; the cumulative distribution function is determined according to the obtained voice feature data; and anti-noise processing is carried out on the first voice feature information with this cumulative distribution function to generate the second voice feature information of the voice signal to be detected.
It should be understood that different SNRs of the voice signal to be detected yield different voice feature data and, correspondingly, different cumulative distribution functions. Moreover, as the first voice feature information of the voice signal is transformed at the feature level, the cumulative distribution function changes along with the voice signal to be detected.
In one embodiment of the present invention, after the second voice feature information of the voice signal to be detected is generated, the second voice feature information can be used to update the cumulative distribution function, in order to reduce the difference between the distribution of the pre-saved voice feature data and that of the test data. That is, while the user inputs voice into the voice endpoint detection system, the cumulative distribution function used for anti-noise processing in this embodiment is not fixed, but is continuously updated according to the second voice feature information of the voice.
For example, after voice data 1 input by the user is received, assume the cumulative distribution function corresponding to voice data 1 is determined to be A, i.e. the first voice feature information of voice data 1 is processed with cumulative distribution function A. Anti-noise processing is carried out on the first voice feature information of voice data 1 with A to generate the second voice feature information of voice data 1, and at the same time the second voice feature information of voice data 1 is used to update the cumulative distribution function; assume the updated function is B. If voice data 2 is received after voice data 1, anti-noise processing is carried out on the first voice feature information of voice data 2 with cumulative distribution function B to generate the second voice feature information of voice data 2, and the cumulative distribution function is again updated according to the second voice feature information of voice data 2; assume the updated function is C. The anti-noise processing carried out in this way on the voice feature information effectively alleviates the mismatch between the training-data and test-data feature distributions, increases the distinctiveness between the speech and non-speech portions of the voice data, and thereby improves the accuracy of the subsequent endpoint detection.
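One possible way to organize this SNR-dependent selection and online update is sketched below (illustrative only; the SNR bucketing and the bounded-window update rule are assumptions, not specified by the patent). It pairs with histogram_equalize above: select a reference by SNR, transform, then update the store with the transformed features.

```python
import numpy as np

class AdaptiveCdfStore:
    """Keeps reference feature samples per SNR bucket and updates them
    with each utterance's processed (second) feature information."""

    def __init__(self, ref_by_snr, max_samples=5000):
        # ref_by_snr: dict mapping an SNR bucket in dB to a pre-saved
        # (n, n_dims) array of reference feature samples
        self.ref_by_snr = {k: np.asarray(v) for k, v in ref_by_snr.items()}
        self.max_samples = max_samples

    def select(self, snr_db):
        # pick the pre-saved bucket closest to the measured SNR
        key = min(self.ref_by_snr, key=lambda k: abs(k - snr_db))
        return key, self.ref_by_snr[key]

    def update(self, key, second_features):
        # append the utterance's transformed features, keep a bounded window
        merged = np.vstack([self.ref_by_snr[key], second_features])
        self.ref_by_snr[key] = merged[-self.max_samples:]
```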
S103: generate a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model.
Specifically, after the second voice feature information is generated, the likelihood value of each frame of the voice signal to be detected on each modeling unit is calculated based on the acoustic model; then, by a dynamic programming algorithm, the optimal state transition sequence and its corresponding word sequence can be obtained, and the obtained optimal state transition sequence and its corresponding word sequence are taken as the recognition result.
The modeling units are the triphone states obtained after phoneme decision-tree clustering. Based on the acoustic model, the state output probability of the voice signal to be detected on each modeling unit can be obtained; the state output probabilities and the state transition probabilities are used to calculate the accumulated likelihood value of each path during path expansion. The state transition probabilities are trained in advance in the acoustic model; a state transition probability is the probability of jumping between two states during path expansion.
To improve the accuracy and efficiency of acoustic recognition, an acoustic model based on deep neural networks (DNN, Deep Neural Networks) can be used to recognize the second voice feature information.
The DNN acoustic model is established by training on a large amount of speech data.
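The dynamic-programming search over accumulated likelihoods can be sketched as a standard Viterbi pass (illustrative only; the DNN scoring and the mapping from states back to words via the decoding graph are assumed and omitted):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Find the optimal state transition sequence by dynamic programming.

    log_obs:   (T, S) per-frame log-likelihoods of the S modeling units
               (e.g. DNN outputs converted to scaled log-likelihoods)
    log_trans: (S, S) log state-transition probabilities
    log_prior: (S,)   log initial-state probabilities
    Returns the best state sequence of length T.
    """
    T, S = log_obs.shape
    score = log_prior + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # accumulate path likelihoods
        back[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_obs[t]
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                # trace the best path back
        states[t - 1] = back[t, states[t]]
    return states
```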
S104: preliminarily detect the voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm.
The silence detection algorithm may include, but is not limited to, a silence detection algorithm based on the optimal word sequence of the recognition result.
In one embodiment of the present invention, the detailed process of preliminarily detecting the voice starting point and voice ending point of the voice signal to be detected, according to the recognition result and the silence detection algorithm based on the optimal word sequence of the recognition result, is as follows (a code sketch of these steps is given after the notes below):
S11: determine the optimal word sequence of the voice signal to be detected at the current time according to the recognition result, and detect whether the tail output word of the optimal word sequence at the current time is silence.
Specifically, after the recognition result of the voice signal to be detected is obtained, the optimal word sequence at the current time is obtained according to the accumulated likelihood values of the output word sequences. The form of the optimal word sequence at the current time is shown in Fig. 2, which is only one example of an optimal word sequence; as can be seen from Fig. 2, the output word sequence is composed of silence and speech.
For example, if the current voice to be detected is "we", with silence before and after it, the corresponding output word sequence has the form: silence -> speech -> speech -> silence.
It should be noted that, as more voice is input, the optimal word sequence keeps changing with the changes of the accumulated likelihood values.
S12: if the tail output word is silence, record the ending time point T1 of the output word that precedes and is closest to the silence.
S13: further detect whether the ending time point T1 changes after the subsequent M frames of the voice signal to be detected are input; if the ending time point remains unchanged, enter the intermediate state of tail silence detection.
M is a preset positive integer.
Specifically, if the ending time point T1 is detected to remain unchanged over the input of M frames of voice data, the intermediate state of tail silence detection is entered.
S14: detect whether the current state is the intermediate state; if so, calculate the silence length L after the ending time point T1, and further judge whether the silence length L is greater than a first preset threshold. If it is, silence detection succeeds: the voice starting point T0 of the voice signal to be detected is determined according to the optimal word sequence, and the ending time point T1 is taken as the voice ending point of the voice signal to be detected.
The first preset threshold is set in advance; for example, it may be set to 600 ms. That is, once the tail silence length L is judged to be greater than 600 ms, silence detection can be determined to have succeeded; at this point the voice starting point of the voice signal to be detected can be determined according to the optimal word sequence, and the ending time point T1 is taken as the voice ending point of the voice signal to be detected.
S15: if the ending time point T1 of the output word changes before silence detection succeeds, repeat steps S11 to S14.
Wherein, it should be noted that, have the information that corresponding each word continues duration in optimum word sequence, each word is corresponding with multiframe voice signal, and the duration that each word is corresponding equals the duration sum of multiframe voice signal to be detected.
S105: calculate the confidence information of the voice signal to be detected, and adjust the voice endpoints according to the confidence information.
After the voice starting point and voice ending point of the voice signal to be detected have been preliminarily detected, strong interference from background environmental noise may cause environmental noise to be mistaken for speech (a false alarm), degrading the endpoint detection performance. To improve the accuracy of endpoint detection, the preliminarily detected voice endpoints are adjusted by a second-pass confidence estimation technique.
Specifically, the confidence information of the voice signal to be detected can be calculated according to the calculated likelihood values of the modeling units, the voice endpoints of the voice signal to be detected, and the SNR of the voice signal to be detected.
More specifically, the confidence information of the word sequence between the voice starting point and the voice ending point can be calculated, and the preliminarily detected voice endpoints are adjusted according to this confidence information.
First, the acoustic posterior probability of each word of the word sequence between the voice endpoints can be calculated according to the calculated likelihood values of the modeling units and the voice endpoints of the voice signal to be detected.
The acoustic posterior probability of the k-th word between the voice endpoints is calculated as
$$P_k(X) = \frac{1}{T_k(X)} \sum_{t \in k} \frac{p_t(m_k \mid x)}{\sum_{m} p_t(m \mid x)}$$
where $P_k(X)$ is the acoustic posterior probability of the k-th word in the voice signal to be detected, $p_t(m_k \mid x)$ is the likelihood value at frame $t$ of the modeling unit corresponding to this word, $\sum_{m} p_t(m \mid x)$ is the sum of the likelihood values of all modeling units at frame $t$, $T_k(X)$ is the duration of this word, and the sum runs over the frames of the k-th word.
After the acoustic posterior probabilities are calculated, the confidence information corresponding to the voice signal can be calculated according to the acoustic posterior probabilities and SNRs of the words of the word sequence between the voice endpoints.
Specifically, for each word between the voice endpoints, the SNR of the current word can be calculated based on the short-time energy value $E_k(X)$ of the current word and the noise energy estimate $N(X)$ of the input voice: $SNR_k(X) = E_k(X)/N(X)$.
After the SNR of the current word is calculated, the confidence of the current word can be calculated based on its acoustic posterior probability and SNR: $CM_k(X) = w \cdot P_k(X) + (1-w) \cdot SNR_k(X)$, where $0 \le w \le 1$; $w$ is a weight coefficient whose value is determined by the acoustic posterior probability and the SNR.
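The per-word confidence computation can be sketched as follows (illustrative; the per-frame normalization of the likelihoods and the fixed weight w are assumptions consistent with the formulas above):

```python
import numpy as np

def word_confidences(likelihoods, word_spans, energies, noise_energy, w=0.5):
    """Compute CM_k = w * P_k + (1 - w) * SNR_k for each word.

    likelihoods:  (T, S) per-frame likelihoods of the S modeling units
    word_spans:   list of (unit_index, start_frame, end_frame) per word
    energies:     per-word short-time energy values E_k(X)
    noise_energy: noise energy estimate N(X) of the input voice
    """
    frame_post = likelihoods / likelihoods.sum(axis=1, keepdims=True)
    cms = []
    for (m_k, s, e), E_k in zip(word_spans, energies):
        P_k = frame_post[s:e, m_k].mean()    # posterior averaged over T_k(X)
        SNR_k = E_k / noise_energy           # SNR_k(X) = E_k(X) / N(X)
        cms.append(w * P_k + (1.0 - w) * SNR_k)
    return cms
```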
After the confidences of the word sequence are calculated, the voice endpoints can be adjusted according to the confidence information. Specifically, the word with the highest confidence score is determined first; then, taking the word with the highest confidence score as the center, it is progressively merged with adjacent words, and the average confidence of the word sequence after each merge is calculated, until the calculated average confidence reaches a second preset threshold. When the calculated average confidence reaches the second preset threshold, the beginning word and the ending word used in calculating the current average confidence are determined; the voice starting point is adjusted according to the starting time point of the beginning word, and the voice ending point is adjusted according to the ending time point of the ending word.
The average confidence is calculated as
$$\overline{CM}(X) = \frac{\sum_{n=1}^{N} t_n(x)\,CM_n(x)}{\sum_{n=1}^{N} t_n(x)}$$
where $t_n(x)$ is the duration of the n-th word, $CM_n(x)$ is the confidence of the n-th word, and $N$ is the total number of words of the word sequence in this calculation.
More specifically, it can be judged whether the starting time point of the beginning word is identical with the voice starting point; if not, the starting time point of the beginning word is taken as the voice starting point of the voice signal to be detected.
In the process of adjusting the voice endpoints, it can likewise be judged whether the ending time point of the ending word is identical with the voice ending point; if not, the ending time point of the ending word is taken as the voice ending point of the voice signal to be detected.
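The merge-and-adjust procedure can be sketched as follows (illustrative; the stopping rule, growing the window only while the duration-weighted average stays at or above the threshold, is one reading of the terse description above):

```python
def adjust_endpoints(words, threshold):
    """words: list of (cm, duration_ms, start_ms, end_ms), in time order.
    Grow a window around the highest-confidence word while the
    duration-weighted average confidence stays above the threshold,
    then return the window's span as the adjusted endpoints."""
    best = max(range(len(words)), key=lambda i: words[i][0])
    lo = hi = best

    def avg(a, b):
        seg = words[a:b + 1]
        return sum(cm * d for cm, d, _, _ in seg) / sum(d for _, d, _, _ in seg)

    while lo > 0 or hi < len(words) - 1:
        left = words[lo - 1][0] if lo > 0 else float('-inf')
        right = words[hi + 1][0] if hi < len(words) - 1 else float('-inf')
        # merge the more confident neighbor next
        nlo, nhi = (lo - 1, hi) if left >= right else (lo, hi + 1)
        if avg(nlo, nhi) < threshold:
            break        # merging further would drop below the threshold
        lo, hi = nlo, nhi
    # unmerged words outside [lo, hi] are re-judged as noise
    return words[lo][2], words[hi][3]
```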
For example, for a voice signal to be detected, assume the voice starting point preliminarily detected according to the recognition result and the silence detection algorithm based on the optimal word sequence is A, and the voice ending time point (voice ending point) is B. When the average confidence reaches the second preset threshold, the words that have not been merged can be re-judged as noise. Assume that when the average confidence reaches the second preset threshold the beginning word used in calculating this average confidence is determined to be X and the ending word to be Y; the starting time point A1 corresponding to the beginning word and the ending time point B1 corresponding to the ending word can then be obtained. It is judged whether the starting time point A1 is identical with the voice starting point A of the voice signal to be detected; if not, A1 is taken as the voice starting point of the voice signal to be detected. Similarly, it is judged whether the ending time point B1 is identical with the ending time point B of the voice signal to be detected; if not, B1 is taken as the ending time point of the voice signal to be detected. The preliminarily detected voice endpoints are thus revised by the confidence information, improving the accuracy of voice endpoint detection and in turn the effect of speech recognition.
In summary, this embodiment proposes an endpoint detection scheme in which the voice endpoints are revised by confidence information. The endpoint detection method of this embodiment first finds, as far as possible, the voice starting point and voice ending point of the voice signal to be detected through the preset silence detection algorithm, then calculates the confidence information of the voice signal to be detected, and adjusts the preliminarily detected voice endpoints according to the calculated confidence information. The accuracy of voice endpoint detection can thereby be improved, and in turn the accuracy of speech recognition and the performance of the speech recognition system.
With the voice endpoint detection method based on a statistical model of the embodiment of the present invention, an input voice signal to be detected is received; first voice feature information of the voice signal is extracted frame by frame and anti-noise processing is carried out on it to generate second voice feature information; the recognition result of the voice signal is generated according to the second voice feature information and the acoustic model; the voice endpoints of the voice signal are preliminarily detected according to the recognition result and the preset silence detection algorithm; and the confidence information of the voice signal is calculated and the voice endpoints are adjusted according to it. An endpoint detection scheme is thus provided in which the preliminarily detected voice endpoints are adjusted by confidence information, so that the voice endpoints of the voice signal to be detected are accurately located, the accuracy of voice endpoint detection is improved, and in turn the accuracy of speech recognition and the performance of the speech recognition system are improved.
To realize the above embodiments, the present invention also proposes a voice endpoint detection device based on a statistical model.
Fig. 3 is a schematic structural diagram of a voice endpoint detection device based on a statistical model according to an embodiment of the present invention.
As shown in Fig. 3, the voice endpoint detection device comprises a receiving module 100, an anti-noise module 200, a generating module 300, a recognition module 400, a calculating module 500 and an adjusting module 600, wherein:
the receiving module 100 is configured to receive an input voice signal to be detected; the anti-noise module 200 is configured to extract first voice feature information of the voice signal to be detected frame by frame and to carry out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected; the generating module 300 is configured to generate a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model; the recognition module 400 is configured to preliminarily detect the voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm; the calculating module 500 is configured to calculate the confidence information of the voice signal to be detected; and the adjusting module 600 is configured to adjust the voice endpoints according to the confidence information.
The first voice feature information may include, but is not limited to, the Mel Frequency Cepstrum Coefficients (MFCC) and information such as the first-order and second-order differences of the MFCC.
The anti-noise module 200 is specifically configured to: calculate the SNR of the voice signal to be detected; obtain, according to the SNR, the voice feature data under this SNR from the pre-saved correspondence between different SNRs and voice feature data, and determine the cumulative distribution function according to the voice feature data; and transform the first voice feature information according to the cumulative distribution function to generate the second voice feature information of the voice signal to be detected.
After the generating module 300 generates the second voice feature information, the recognition module 400 calculates the likelihood value of each frame of the voice signal to be detected on each modeling unit based on the acoustic model; then, by a dynamic programming algorithm, the optimal state transition sequence and its corresponding word sequence can be obtained, and the obtained optimal state transition sequence and its corresponding word sequence are taken as the recognition result.
The modeling units are the triphone states obtained after phoneme decision-tree clustering. Based on the acoustic model, the state output probability of the voice signal to be detected on each modeling unit can be obtained; the state output probabilities and the state transition probabilities are used to calculate the accumulated likelihood value of each path during path expansion. The state transition probabilities are trained in advance in the acoustic model; a state transition probability is the probability of jumping between two states during path expansion.
In addition, as shown in Fig. 4, the device may further comprise an updating module 700 configured to update the cumulative distribution function according to the second voice feature information.
Specifically, after the first voice feature information of each frame of the voice signal to be detected has been extracted, the anti-noise module 200 can apply anti-noise processing to the first voice feature information at the feature level in order to reduce the impact of noise on subsequent recognition; specifically, the anti-noise processing is carried out by a histogram transformation algorithm.
To make up for the deficiencies of the traditional histogram equalization algorithm, the correspondence between voice feature data under different noise environments and different SNR environments can also be saved in advance in the endpoint detection device, before the anti-noise module 200 carries out anti-noise processing on the first voice feature information.
It should be understood that different SNRs of the voice signal to be detected yield different voice feature data and, correspondingly, different cumulative distribution functions. Moreover, as the first voice feature information of the voice signal is transformed at the feature level, the cumulative distribution function changes along with the voice signal to be detected.
For example, after voice data 1 input by the user is received, assume the anti-noise module 200 determines the cumulative distribution function corresponding to voice data 1 to be A, i.e. the first voice feature information of voice data 1 is processed with cumulative distribution function A. Anti-noise processing is carried out on the first voice feature information of voice data 1 with A to generate the second voice feature information of voice data 1; the updating module 700 then uses the second voice feature information of voice data 1 to update the cumulative distribution function, the updated function being assumed to be B. If voice data 2 is received after voice data 1, anti-noise processing is carried out on the first voice feature information of voice data 2 with cumulative distribution function B to generate the second voice feature information of voice data 2, and the updating module 700 updates the cumulative distribution function according to the second voice feature information of voice data 2, the updated function being assumed to be C. The anti-noise processing carried out in this way on the voice feature information effectively alleviates the mismatch between the training-data and test-data feature distributions, increases the distinctiveness between the speech and non-speech portions of the voice data, and thereby improves the accuracy of the subsequent endpoint detection.
The preset silence detection algorithm includes, but is not limited to, the silence detection algorithm based on the optimal word sequence of the recognition result.
Specifically, the recognition module 400 preliminarily detects the voice endpoints of the voice signal to be detected through steps S11 to S15, wherein: S11, determine the optimal word sequence of the voice signal to be detected at the current time according to the recognition result, and detect whether the tail output word of the optimal word sequence is silence; S12, if the tail output word is silence, record the ending time point of the output word that precedes and is closest to the silence; S13, further detect whether the ending time point changes after the subsequent M frames of the voice signal to be detected are input, and if the ending time point remains unchanged, enter the intermediate state of tail silence detection, where M is a preset positive integer; S14, detect whether the current state is the intermediate state; if so, calculate the silence length L after the ending time point and further judge whether the silence length L is greater than the first preset threshold, and if it is, silence detection succeeds, the voice starting point of the voice signal to be detected is determined according to the optimal word sequence, and the ending time point is taken as the voice ending point of the voice signal to be detected; S15, if the ending time point changes before silence detection succeeds, repeat steps S11 to S14.
Specifically, the calculating module 500 is configured to calculate the confidence information of the voice signal to be detected according to the acoustic recognition result, the voice endpoints of the voice signal to be detected, and the SNR of the voice signal to be detected.
More specifically, the calculating module 500 can first calculate the acoustic posterior probability of each word between the voice endpoints, and then calculate the confidence information corresponding to each word according to the per-frame acoustic posterior probability and the SNR of each word between the voice endpoints.
The acoustic posterior probability of the k-th word between the voice endpoints is calculated as
$$P_k(X) = \frac{1}{T_k(X)} \sum_{t \in k} \frac{p_t(m_k \mid x)}{\sum_{m} p_t(m \mid x)}$$
where $P_k(X)$ is the acoustic posterior probability of the k-th word in the voice signal to be detected, $p_t(m_k \mid x)$ is the likelihood value at frame $t$ of the modeling unit corresponding to this word, $\sum_{m} p_t(m \mid x)$ is the sum of the likelihood values of all modeling units at frame $t$, $T_k(X)$ is the duration of this word, and the sum runs over the frames of the k-th word.
As shown in Fig. 4, the adjusting module 600 may comprise a first determining submodule 610, a processing submodule 620, a second determining submodule 630 and an adjusting submodule 640, wherein:
the first determining submodule 610 is configured to determine the word with the highest confidence score; the processing submodule 620 is configured to take the word with the highest confidence score as the center, progressively merge it with the confidences of adjacent words, and calculate the average confidence after each merge, until the calculated average confidence reaches the second preset threshold; the second determining submodule 630 is configured to determine, when the calculated average confidence reaches the second preset threshold, the beginning word and the ending word used in calculating the current average confidence; and the adjusting submodule 640 is configured to adjust the voice starting point according to the starting time point of the beginning word and to adjust the voice ending point according to the ending time point of the ending word.
More specifically, the adjusting submodule 640 can judge whether the starting time point of the beginning word is identical with the voice starting point and, if not, take the starting time point of the beginning word as the voice starting point of the voice signal to be detected; and judge whether the ending time point of the ending word is identical with the voice ending point and, if not, take the ending time point of the ending word as the voice ending point of the voice signal to be detected.
It should be noted that, the explanation of the aforementioned sound end detecting method embodiment to Corpus--based Method model illustrates and the speech terminals detection device being also applicable to the Corpus--based Method model of this embodiment repeats no more herein.
The speech terminals detection device of the Corpus--based Method model of the embodiment of the present invention, the voice signal to be detected of input is received by receiver module, the first voice characteristics information of voice signal to be detected is extracted in the framing of anti-noise module, and anti-noise process is carried out to the first voice characteristics information, to generate the second voice characteristics information of voice signal to be detected, generation module is according to the acoustics recognition result of the second voice characteristics information and acoustics model generation voice signal to be detected, identification module is according to acoustics recognition result and preset the sound end that quiet detection algorithm Preliminary detection goes out voice signal to be detected, computing module calculates the confidence information of voice signal to be detected, adjusting module adjusts sound end according to confidence information.Thus, provide a kind of end-point detection mode sound end that Preliminary detection goes out adjusted by confidence information, accurately located the sound end of voice signal to be detected, improve the accuracy rate of speech terminals detection, and then the accuracy of speech recognition can be improved, improve the performance of speech recognition system.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and features of the different embodiments or examples, provided they do not conflict with each other.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood to indicate or imply relative importance or to implicitly indicate the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, for example two or three, unless expressly and specifically limited otherwise.
Any process or method description in a flow chart, or otherwise described herein, can be understood as representing a module, segment or portion of code comprising one or more executable instructions for realizing the steps of a specific logical function or process; and the scope of the preferred embodiments of the present invention includes other realizations, in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flow chart or otherwise described herein, for example a sequenced list of executable instructions for realizing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, device or equipment (such as a computer-based system, a system comprising a processor, or another system that can fetch and execute instructions from an instruction execution system, device or equipment). For the purposes of this specification, a "computer-readable medium" may be any device that can contain, store, communicate, propagate or transmit a program for use by, or in connection with, an instruction execution system, device or equipment. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or the other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be realized with hardware, software, firmware or a combination thereof. In the above embodiments, a plurality of steps or methods may be realized with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if realized with hardware, as in another embodiment, they may be realized by any one of the following technologies known in the art, or a combination thereof: a discrete logic circuit with logic gate circuits for realizing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those skilled in the art can understand that all or part of the steps carried by the method of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and when the program is executed it comprises one of the steps of the method embodiment or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically alone, or two or more units may be integrated in one module. The integrated module may be realized in the form of hardware, or in the form of a software functional module. If the integrated module is realized in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art can change, modify, replace and vary the above embodiments within the scope of the present invention.
Claims (18)
1. A voice endpoint detection method based on a statistical model, characterized by comprising the following steps:
receiving an input voice signal to be detected;
extracting first voice feature information of the voice signal to be detected frame by frame, and carrying out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected;
generating a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model;
preliminarily detecting voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm; and
calculating confidence information of the voice signal to be detected, and adjusting the voice endpoints according to the confidence information.
2. The voice endpoint detection method based on a statistical model according to claim 1, characterized in that carrying out anti-noise processing on the first voice feature information to generate the second voice feature information of the voice signal to be detected specifically comprises:
calculating a signal-to-noise ratio of the voice signal to be detected;
obtaining, according to the signal-to-noise ratio, voice feature data under the signal-to-noise ratio from a pre-saved correspondence between different signal-to-noise ratios and voice feature data, and determining a cumulative distribution function according to the voice feature data; and
carrying out anti-noise processing on the first voice feature information according to the cumulative distribution function, to generate the second voice feature information of the voice signal to be detected.
3. The voice endpoint detection method based on a statistical model according to claim 2, characterized by further comprising:
updating the cumulative distribution function according to the second voice feature information.
4. The voice endpoint detection method based on a statistical model according to claim 1, characterized in that the preset silence detection algorithm comprises a silence detection algorithm based on an optimal word sequence of the recognition result.
5. The voice endpoint detection method based on a statistical model according to claim 4, characterized in that preliminarily detecting the voice endpoints of the voice signal to be detected according to the recognition result and the preset silence detection algorithm specifically comprises:
S11: determining the optimal word sequence of the voice signal to be detected at the current time according to the recognition result, and detecting whether a tail output word of the optimal word sequence is silence;
S12: if the tail output word is silence, recording an ending time point of the output word that precedes and is closest to the silence;
S13: further detecting whether the ending time point changes after subsequent M frames of the voice signal to be detected are input, and if the ending time point remains unchanged, entering an intermediate state of tail silence detection, wherein M is a preset positive integer;
S14: detecting whether a current state is the intermediate state; if so, calculating a silence length L after the ending time point, and further judging whether the silence length L is greater than a first preset threshold; if it is, determining that silence detection succeeds, determining a voice starting point of the voice signal to be detected according to the optimal word sequence, and taking the ending time point as a voice ending point of the voice signal to be detected;
S15: if the ending time point changes before silence detection succeeds, repeating the steps S11 to S14.
6. The voice endpoint detection method based on a statistical model according to claim 2, characterized in that calculating the confidence information of the voice signal to be detected specifically comprises:
calculating the confidence information of the voice signal to be detected according to the recognition result, the voice endpoints of the voice signal to be detected and the signal-to-noise ratio of the voice signal to be detected.
7. The voice endpoint detection method based on a statistical model according to claim 6, characterized in that calculating the confidence information of the voice signal to be detected according to the recognition result, the voice endpoints of the voice signal to be detected and the signal-to-noise ratio of the voice signal to be detected specifically comprises:
calculating, based on the recognition result, an acoustic posterior probability of each word between the voice endpoints;
calculating the confidence information corresponding to each word according to the acoustic posterior probability and signal-to-noise ratio of each word between the voice endpoints.
8. The voice endpoint detection method based on a statistical model according to claim 7, characterized in that adjusting the voice endpoints according to the confidence information specifically comprises:
determining the word with the highest confidence score;
taking the word with the highest confidence score as a center, progressively merging it with the confidences of adjacent words, and calculating an average confidence after each merge, until the calculated average confidence reaches a second preset threshold;
when the calculated average confidence reaches the second preset threshold, determining a beginning word and an ending word used in calculating the current average confidence, adjusting the voice starting point according to a starting time point of the beginning word, and adjusting the voice ending point according to an ending time point of the ending word.
9. The voice endpoint detection method based on a statistical model according to claim 8, characterized in that adjusting the voice starting point according to the starting time point of the beginning word and adjusting the voice ending point according to the ending time point of the ending word specifically comprises:
judging whether the starting time point of the beginning word is identical with the voice starting point, and if not, taking the starting time point of the beginning word as the voice starting point of the voice signal to be detected;
judging whether the ending time point of the ending word is identical with the voice ending point, and if not, taking the ending time point of the ending word as the voice ending point of the voice signal to be detected.
10. A voice endpoint detection device based on a statistical model, characterized by comprising:
a receiving module, configured to receive an input voice signal to be detected;
an anti-noise module, configured to extract first voice feature information of the voice signal to be detected frame by frame, and to carry out anti-noise processing on the first voice feature information to generate second voice feature information of the voice signal to be detected;
a generating module, configured to generate a recognition result of the voice signal to be detected according to the second voice feature information and an acoustic model;
a recognition module, configured to preliminarily detect voice endpoints of the voice signal to be detected according to the recognition result and a preset silence detection algorithm;
a calculating module, configured to calculate confidence information of the voice signal to be detected;
an adjusting module, configured to adjust the voice endpoints according to the confidence information.
11. The voice endpoint detection device based on a statistical model according to claim 10, characterized in that the anti-noise module is specifically configured to:
Calculate the signal-to-noise ratio (SNR) of the voice signal to be detected;
Obtain, from a pre-stored correspondence between different SNRs and voice feature data, the voice feature data under the calculated SNR, and determine a cumulative distribution function according to the voice feature data; and
Perform anti-noise processing on the first voice characteristic information according to the cumulative distribution function, to generate the second voice characteristic information of the voice signal to be detected.
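The CDF-based anti-noise processing of claim 11 resembles histogram-equalization feature compensation. A minimal sketch of that idea, assuming one-dimensional features and a stored clean-speech reference (the function names and the equalization target are assumptions, not the patent's):

```python
import numpy as np

def build_cdf(reference_features: np.ndarray):
    """Empirical CDF of the voice feature data stored for the measured SNR."""
    vals = np.sort(reference_features)
    probs = np.arange(1, len(vals) + 1) / len(vals)
    return vals, probs

def anti_noise(first_features: np.ndarray, cdf,
               clean_features: np.ndarray) -> np.ndarray:
    """Map the first voice characteristic information through the reference
    CDF onto a clean-speech distribution (histogram equalization); the output
    plays the role of the second voice characteristic information."""
    vals, probs = cdf
    p = np.interp(first_features, vals, probs)   # P(X <= x) under the reference
    clean_sorted = np.sort(clean_features)
    q = np.arange(1, len(clean_sorted) + 1) / len(clean_sorted)
    return np.interp(p, q, clean_sorted)         # inverse CDF of the clean data
```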
12. The voice endpoint detection device based on a statistical model according to claim 11, characterized by further comprising:
An updating module, configured to update the cumulative distribution function according to the second voice characteristic information.
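Claim 12's updating module could, for example, fold the newly generated second voice characteristic information back into the reference pool from which the CDF is built. The sliding-pool scheme below is an illustrative assumption:

```python
import numpy as np

def update_reference(reference_features: np.ndarray,
                     second_features: np.ndarray,
                     keep_ratio: float = 0.9) -> np.ndarray:
    """Drop the oldest fraction of the reference pool, append the freshly
    denoised (second) features, and return the new pool; the CDF is then
    rebuilt from it with build_cdf() from the previous sketch."""
    keep = int(keep_ratio * len(reference_features))
    return np.concatenate([reference_features[-keep:], second_features])
```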
13. The voice endpoint detection device based on a statistical model according to claim 10, characterized in that the preset silence detection algorithm comprises a silence detection algorithm based on the optimal word sequence of the recognition result.
14. The voice endpoint detection device based on a statistical model according to claim 13, characterized in that the recognition module preliminarily detects the voice endpoints of the voice signal to be detected specifically through steps S11 to S15, wherein:
S11, determining the optimal word sequence of the voice signal to be detected at the current time according to the recognition result, and detecting whether the tail output word of the optimal word sequence is silence;
S12, if the tail output word is silence, recording the end time point of the output word that precedes the silence and is nearest to it;
S13, further detecting whether the end time point changes after the subsequent M frames of the voice signal to be detected are input, and if the end time point remains unchanged, entering the intermediate state of voice-tail silence detection, where M is a preset positive integer;
S14, detecting whether the current state is the intermediate state; if so, calculating the silence length L after the end time point, and further judging whether the silence length L is greater than a first preset threshold; if it is, the silence detection succeeds, the voice starting point of the voice signal to be detected is determined according to the optimal word sequence, and the end time point is taken as the voice end point of the voice signal to be detected;
S15, if the end time point changes before the silence detection succeeds, repeating steps S11 to S14.
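Steps S11 to S15 amount to a per-frame state machine over the decoder's best partial hypothesis. A minimal sketch under assumed interfaces (the streaming `decoder` object and the `"<sil>"` marker are hypothetical):

```python
def detect_endpoints(decoder, m_frames: int, min_silence: float):
    """Per-frame state machine for steps S11-S15. `decoder.frames()` yields
    the current time (seconds); `decoder.best_words()` returns word objects
    with .text, .start, .end; "<sil>" marks a silence output word."""
    recorded_end = None      # S12: end time of the word nearest before the silence
    stable_frames = 0
    intermediate = False     # S13: intermediate state of tail silence detection
    for t in decoder.frames():
        words = decoder.best_words()                 # S11: optimal word sequence
        if not words or words[-1].text != "<sil>":   # tail word is not silence
            recorded_end, stable_frames, intermediate = None, 0, False
            continue
        end_t = words[-2].end if len(words) > 1 else 0.0
        if end_t != recorded_end:                    # S15: end point changed
            recorded_end, stable_frames, intermediate = end_t, 0, False
            continue
        stable_frames += 1
        if stable_frames >= m_frames:                # S13: stable for M frames
            intermediate = True
        if intermediate and t - recorded_end > min_silence:  # S14: length L
            return words[0].start, recorded_end      # voice start and end points
    return None                                      # no stable tail silence found
```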
15. The voice endpoint detection device based on a statistical model according to claim 11, characterized in that the computing module is specifically configured to:
Calculate the confidence information of the voice signal to be detected according to the recognition result, the voice endpoints of the voice signal to be detected, and the SNR of the voice signal to be detected.
16. The voice endpoint detection device based on a statistical model according to claim 15, characterized in that the computing module is specifically configured to:
Calculate, based on the recognition result, the acoustic posterior probability of each word between the voice endpoints; and
Calculate the confidence information corresponding to each word according to the per-frame acoustic posterior probability of each word between the voice endpoints and the SNR.
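Claim 16 leaves the exact combination of per-frame posteriors and SNR open. One plausible reading, sketched below with an assumed sigmoid SNR weighting:

```python
import math

def word_confidence(frame_log_posteriors: list[float], snr_db: float,
                    alpha: float = 0.1) -> float:
    """Average the word's per-frame acoustic log-posteriors, then weight by
    an SNR-dependent factor (assumed sigmoid): noisier audio yields a lower,
    more cautious confidence score."""
    avg_log_post = sum(frame_log_posteriors) / len(frame_log_posteriors)
    snr_weight = 1.0 / (1.0 + math.exp(-alpha * snr_db))
    return math.exp(avg_log_post) * snr_weight
```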
17. The voice endpoint detection device based on a statistical model according to claim 16, characterized in that the adjusting module specifically comprises:
A first determining submodule, configured to determine the word with the highest confidence score;
A processing submodule, configured to take the word with the highest confidence score as the center, progressively merge its confidence with the confidence of adjacent words, and calculate the average confidence after each merge, until the calculated average confidence reaches the second preset threshold;
A second determining submodule, configured to determine, when the calculated average confidence reaches the second preset threshold, the beginning word and the end word used in calculating the current average confidence; and
An adjusting submodule, configured to adjust the voice starting point according to the start time point of the beginning word, and to adjust the voice end point according to the end time point of the end word.
18. The voice endpoint detection device based on a statistical model according to claim 17, characterized in that the adjusting submodule is specifically configured to:
Judge whether the start time point of the beginning word is identical to the voice starting point, and if not, take the start time point of the beginning word as the voice starting point of the voice signal to be detected; and
Judge whether the end time point of the end word is identical to the voice end point, and if not, take the end time point of the end word as the voice end point of the voice signal to be detected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510587721.1A CN105261357B (en) | 2015-09-15 | 2015-09-15 | Sound end detecting method based on statistical model and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105261357A true CN105261357A (en) | 2016-01-20 |
CN105261357B CN105261357B (en) | 2016-11-23 |
Family
ID=55101017
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510587721.1A Active CN105261357B (en) | 2015-09-15 | 2015-09-15 | Sound end detecting method based on statistical model and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105261357B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101030369A (en) * | 2007-03-30 | 2007-09-05 | 清华大学 | Built-in speech discriminating method based on sub-word hidden Markov model |
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
US20120072211A1 (en) * | 2010-09-16 | 2012-03-22 | Nuance Communications, Inc. | Using codec parameters for endpoint detection in speech recognition |
CN102982811A (en) * | 2012-11-24 | 2013-03-20 | 安徽科大讯飞信息科技股份有限公司 | Voice endpoint detection method based on real-time decoding |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105869628A (en) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | Voice endpoint detection method and device |
CN112581982A (en) * | 2017-06-06 | 2021-03-30 | 谷歌有限责任公司 | End of query detection |
CN108731699A (en) * | 2018-05-09 | 2018-11-02 | 上海博泰悦臻网络技术服务有限公司 | Intelligent terminal and its voice-based navigation routine planing method and vehicle again |
WO2019232833A1 (en) * | 2018-06-04 | 2019-12-12 | 平安科技(深圳)有限公司 | Speech differentiating method and device, computer device and storage medium |
CN108962227A (en) * | 2018-06-08 | 2018-12-07 | 百度在线网络技术(北京)有限公司 | Voice beginning and end detection method, device, computer equipment and storage medium |
US10825470B2 (en) | 2018-06-08 | 2020-11-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for detecting starting point and finishing point of speech, computer device and storage medium |
CN109102824A (en) * | 2018-07-06 | 2018-12-28 | 北京比特智学科技有限公司 | Voice error correction method and device based on human-computer interaction |
CN109102824B (en) * | 2018-07-06 | 2021-04-09 | 北京比特智学科技有限公司 | Voice error correction method and device based on man-machine interaction |
CN110827795A (en) * | 2018-08-07 | 2020-02-21 | 阿里巴巴集团控股有限公司 | Voice input end judgment method, device, equipment, system and storage medium |
CN109036471A (en) * | 2018-08-20 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
CN109346074A (en) * | 2018-10-15 | 2019-02-15 | 百度在线网络技术(北京)有限公司 | A kind of method of speech processing and system |
CN109448705A (en) * | 2018-10-17 | 2019-03-08 | 珠海格力电器股份有限公司 | Voice segmentation method and device, computer device and readable storage medium |
CN111063356A (en) * | 2018-10-17 | 2020-04-24 | 北京京东尚科信息技术有限公司 | Electronic equipment response method and system, sound box and computer readable storage medium |
CN111063356B (en) * | 2018-10-17 | 2023-05-09 | 北京京东尚科信息技术有限公司 | Electronic equipment response method and system, sound box and computer readable storage medium |
CN109448705B (en) * | 2018-10-17 | 2021-01-29 | 珠海格力电器股份有限公司 | Voice segmentation method and device, computer device and readable storage medium |
US11961522B2 (en) | 2018-11-28 | 2024-04-16 | Samsung Electronics Co., Ltd. | Voice recognition device and method |
CN113454717A (en) * | 2018-11-28 | 2021-09-28 | 三星电子株式会社 | Speech recognition apparatus and method |
CN109602333A (en) * | 2018-12-11 | 2019-04-12 | 珠海市微半导体有限公司 | A kind of speech de-noising method and chip based on clean robot |
CN109602333B (en) * | 2018-12-11 | 2020-11-03 | 珠海市一微半导体有限公司 | Voice denoising method and chip based on cleaning robot |
CN110070885A (en) * | 2019-02-28 | 2019-07-30 | 北京字节跳动网络技术有限公司 | Audio originates point detecting method and device |
CN111755025A (en) * | 2019-03-26 | 2020-10-09 | 北京君林科技股份有限公司 | State detection method, device and equipment based on audio features |
CN111755025B (en) * | 2019-03-26 | 2024-02-23 | 苏州君林智能科技有限公司 | State detection method, device and equipment based on audio features |
CN111062486B (en) * | 2019-11-27 | 2023-12-08 | 北京国腾联信科技有限公司 | Method and device for evaluating feature distribution and confidence of data |
CN111062486A (en) * | 2019-11-27 | 2020-04-24 | 北京国腾联信科技有限公司 | Method and device for evaluating feature distribution and confidence coefficient of data |
CN111179975A (en) * | 2020-04-14 | 2020-05-19 | 深圳壹账通智能科技有限公司 | Voice endpoint detection method for emotion recognition, electronic device and storage medium |
WO2022267168A1 (en) * | 2021-06-24 | 2022-12-29 | 未鲲(上海)科技服务有限公司 | Speech recognition method and apparatus, computer device, and storage medium |
CN114038454A (en) * | 2021-10-09 | 2022-02-11 | 珠海亿智电子科技有限公司 | Post-processing method, device and equipment for online endpoint detection and storage medium |
CN114220421A (en) * | 2021-12-16 | 2022-03-22 | 云知声智能科技股份有限公司 | Method and device for generating timestamp at word level, electronic equipment and storage medium |
CN114220421B (en) * | 2021-12-16 | 2025-02-07 | 云知声智能科技股份有限公司 | Method, device, electronic device and storage medium for generating word-level timestamp |
CN114898755A (en) * | 2022-07-14 | 2022-08-12 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN114898755B (en) * | 2022-07-14 | 2023-01-17 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN115273823A (en) * | 2022-07-28 | 2022-11-01 | 杭州鲸道科技有限公司 | Data processing method, device, equipment and medium based on Gaussian mixture probability density |
Also Published As
Publication number | Publication date |
---|---|
CN105261357B (en) | 2016-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105261357A (en) | Voice endpoint detection method and device based on statistics model | |
CN105118502B (en) | End point detection method and system of voice identification system | |
CN105529028A (en) | Voice analytical method and apparatus | |
KR101922776B1 (en) | Method and device for voice wake-up | |
US9792897B1 (en) | Phoneme-expert assisted speech recognition and re-synthesis | |
US9875739B2 (en) | Speaker separation in diarization | |
US6993481B2 (en) | Detection of speech activity using feature model adaptation | |
US8140330B2 (en) | System and method for detecting repeated patterns in dialog systems | |
US11069342B2 (en) | Method for training voice data set, computer device, and computer-readable storage medium | |
CN110364140B (en) | Singing voice synthesis model training method, singing voice synthesis model training device, computer equipment and storage medium | |
EP3989217B1 (en) | Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
CN104299612A (en) | Method and device for detecting imitative sound similarity | |
US20190348032A1 (en) | Methods and apparatus for asr with embedded noise reduction | |
JPWO2010128560A1 (en) | Speech recognition apparatus, speech recognition method, and speech recognition program | |
CN108847218B (en) | Self-adaptive threshold setting voice endpoint detection method, equipment and readable storage medium | |
Yarra et al. | A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection | |
EP1675102A2 (en) | Method for extracting feature vectors for speech recognition | |
KR100940641B1 (en) | Speech Verification Model and Speech Verification System Using Phoneme Level Log Likelihood Ratio Distribution and Phoneme Duration | |
Vavrek et al. | Query-by-example retrieval via fast sequential dynamic time warping algorithm | |
CN106920558B (en) | Keyword recognition method and device | |
CN114242108A (en) | An information processing method and related equipment | |
CN101809652A (en) | Frequency axis elastic coefficient estimation device, system method and program | |
Akhsanta et al. | Text-independent speaker identification using PCA-SVM model | |
US20050246172A1 (en) | Acoustic model training method and system | |
Desplanques et al. | Model-based speech/non-speech segmentation of a heterogeneous multilingual TV broadcast collection |
Legal Events
Code | Title
---|---
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant