Embodiment
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
In order to add the positional information of a speech frame in the acoustic feature space to the decoding process, the present invention, on the basis of the statistical response relation between phonemes and the Gaussian components of a universal background model (UBM), establishes a correspondence between each Gaussian component of the UBM and the phonemes, and thereby obtains the position of a speech frame in the acoustic feature space, expressed as the probabilities that the frame belongs to different local regions of that space; these are the guiding probabilities. During decoding, the positional information of the speech frame to be decoded (namely, the response relation among the phoneme to which the frame belongs, its main Gaussian component, and the Gaussian components of the universal background model) is used to revise the traditional path-score formula, so that the decoding system, on the basis of the traditional acoustic model and language model, further incorporates the positional information of the frame to be decoded into the decoding process.
Fig. 1 is a flowchart of calculating the guiding probabilities from the response relation between phonemes and the Gaussian components of the universal background model, and of adding the resulting guiding probabilities to the path score. Before counting the response frequencies between phonemes and the Gaussian components of the UBM, a UBM must first be trained to describe the whole acoustic feature space. After the UBM is obtained, the main Gaussian component on the UBM is computed for every speech frame in the speech feature training corpus. At the same time, the speech feature training corpus is force-aligned to obtain the phoneme to which each speech frame belongs. Then all speech frames that share the same main Gaussian component and belong to the same phoneme are grouped into one class, and the number of frames in that class is counted; this number is the response frequency between the given phoneme and the given Gaussian component. Normalizing the response frequencies yields the guiding probabilities, which are then fused into the traditional path-score computation to obtain a speech recognition decoding algorithm that incorporates the guiding probabilities. As shown in Fig. 1, the method specifically comprises the following steps:
Step 1: train a universal background model (UBM) to describe the whole acoustic feature space, where the UBM is a Gaussian mixture model;
Step 2: compute the main Gaussian component of each speech frame on the UBM;
Step 3: use the acoustic model of the recognition system to force-align the speech feature training corpus and obtain the phoneme to which each speech frame belongs;
Step 4: count the response frequencies between phonemes and Gaussian components;
Step 5: normalize the response frequencies to obtain the guiding probabilities;
Step 6: fuse the guiding probabilities into the traditional path-score computation, thereby guiding the decoder to strengthen or weaken paths.
Each of the above steps is described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of training the universal background model from the speech features in the speech feature training corpus. The present invention adopts two different UBM training methods: either the whole speech feature training corpus is used directly to train the UBM with the expectation-maximization (EM) algorithm, or the Gaussian components contained in the per-phoneme models are updated with the speech feature data and then pooled into a UBM. The two methods are shown in Fig. 2(a) and Fig. 2(b), respectively.
As shown in Fig. 2(a), the method of training the UBM directly on the whole speech feature training corpus with the EM algorithm specifically comprises the following. First, the UBM is initialized as a multidimensional standard Gaussian distribution whose dimension equals the feature dimension; for example, the present invention extracts Mel-frequency cepstral coefficients (MFCC) from the speech signal as features, comprising 12 cepstral coefficients and 1 energy term together with their first- and second-order differences, for a 39-dimensional feature vector. The mean and variance of the Gaussian distribution are then adjusted with the EM algorithm over the whole speech feature training corpus. The EM algorithm seeks maximum-likelihood or maximum a posteriori estimates of the parameters of a probability model and is used here to estimate the posterior probability density function. In speech recognition the probability density function is represented by a Gaussian mixture model, so the quantities estimated are mainly the mean, variance and related parameters of each Gaussian component. In each iteration, the expectation step (E step) uses the current estimates of the hidden variables to compute the expected likelihood; the maximization step (M step) then maximizes the likelihood obtained in the E step to update the parameter values. The estimates found in the M step are used in the next E step, and the two steps alternate until the parameter estimation of the Gaussian mixture model is complete. Subsequently, to describe the local acoustic feature space more accurately, the algorithm uses the HHEd tool of HTK to split the Gaussian components in the HMM state: any component of the mixture is split into two Gaussian components with equal mean and variance and half the original weight, thereby increasing the number of Gaussian distributions in the UBM. The steps of updating the means and variances of the Gaussians and of increasing their number are iterated in a loop until the number of Gaussian components in the UBM reaches the desired value, yielding the final UBM.
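By way of illustration, the following minimal Python sketch splits one diagonal-covariance mixture component into two half-weight copies. The small ±0.2-standard-deviation mean perturbation is an assumption borrowed from HTK's usual mixture-splitting heuristic, so that the two copies can diverge under subsequent EM updates; the text itself requires only equal means and variances and halved weights.

```python
import numpy as np

def split_component(weights, means, variances, k):
    """Split mixture component k into two half-weight copies.

    Diagonal covariances are assumed. The +/-0.2 standard-deviation
    mean offset follows HTK's mixture-splitting heuristic and is an
    illustrative assumption here, letting EM later separate the copies.
    """
    w, mu, var = weights[k] / 2.0, means[k], variances[k]
    offset = 0.2 * np.sqrt(var)
    weights = np.concatenate([np.delete(weights, k), [w, w]])
    means = np.vstack([np.delete(means, k, axis=0), mu - offset, mu + offset])
    variances = np.vstack([np.delete(variances, k, axis=0), var, var])
    return weights, means, variances
```

Repeating EM updates and such splits in a loop grows the UBM to the desired number of components.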
As shown in Fig. 2(b), the method of building the UBM by updating the Gaussian components contained in each phoneme specifically comprises the following. First, all phonemes occurring in the speech feature training corpus are collected, and a hidden Markov model (HMM) is established for each phoneme. Second, the HMM parameters are updated with the Baum-Welch algorithm to obtain the trained HMM models. The Baum-Welch algorithm uses a set of observation sequences to train the parameters of a continuous-density Gaussian-mixture HMM, including the state transition probability matrix of the model and, for each state, the mean vectors, covariance matrices and mixture weights. In estimating the HMM parameters, Baum-Welch uses maximum-likelihood estimation, and its adjustment of the Gaussian component parameters on each state is consistent with the EM algorithm. After this, all Gaussian components of the HMM models are pooled with equal weights, the weights summing to 1, to form the initial universal background model; the parameters of each Gaussian component of this initial UBM are then adjusted with the EM algorithm over the whole speech feature training corpus, and the updated UBM is the final UBM.
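As a sketch of the pooling step, the following Python fragment collects every Gaussian into one equal-weight mixture. Reducing each phoneme HMM to a list of (mean, variance) pairs is a simplifying assumption standing in for a full HMM definition.

```python
import numpy as np

def pool_ubm(phoneme_hmms):
    """Pool every Gaussian from every phoneme HMM into an equal-weight
    mixture (the initial UBM).

    `phoneme_hmms` is assumed to map each phoneme name to a list of
    (mean, variance) pairs, one per Gaussian component.
    """
    means, variances = [], []
    for gaussians in phoneme_hmms.values():
        for mu, var in gaussians:
            means.append(mu)
            variances.append(var)
    n = len(means)
    weights = np.full(n, 1.0 / n)  # equal weights summing to 1
    return weights, np.vstack(means), np.vstack(variances)
```

The pooled mixture is then refined by EM over the whole corpus, as the text describes.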
Fig. 3 is a flowchart of computing the main Gaussian component of a speech frame on the universal background model UBM. As shown in Fig. 3, for each speech frame in the whole speech feature training corpus, its score under the probability density function of each Gaussian component of the UBM obtained in step 1 is computed, and the Gaussian component with the highest score is taken as its main Gaussian component. Specifically:
Suppose the UBM comprises M Gaussian components, and denote the probability density function of the m-th component by λ_m, or in parametric form N(O; μ_m, Σ_m), where μ_m and Σ_m are the mean and variance of the m-th component. The probability of a speech frame O under λ_m is then computed as P(O|λ_m) = N(O; μ_m, Σ_m). For each speech frame O, the Gaussian component of the UBM whose probability P(O|λ_m) for O is largest is defined as the main Gaussian component of O. That is, the main Gaussian component of frame O is the component m satisfying formula (1):

    m = argmax_{1≤m'≤M} P(O|λ_{m'})        (1)

In other words, for every component m' of the Gaussian mixture, its probability P(O|λ_{m'}) for frame O is computed, and the single component is sought on whose probability density function O scores higher than on any other component; the component found is the main Gaussian component, denoted m.
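A minimal Python sketch of this computation is given below, assuming diagonal covariances stored as NumPy arrays. Note that formula (1) compares the component densities themselves, so no mixture weights appear.

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log density of frame o under a diagonal Gaussian N(o; mu, var)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def main_component(o, means, variances):
    """Index m of the UBM component maximizing P(O | lambda_m), as in (1)."""
    scores = [log_gauss(o, means[m], variances[m]) for m in range(len(means))]
    return int(np.argmax(scores))
```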
Fig. 4 is a flowchart of force-aligning the speech feature training corpus with the acoustic model of the recognition system to obtain the phoneme to which each speech frame belongs. As shown in Fig. 4, the acoustic model is a context-dependent triphone model, and this model serves as the baseline system. The baseline system is the comparison system: the performance improvement of the speech recognition system realized by the present invention is obtained by comparison with the baseline system. The baseline system uses context-dependent initials and tonal finals as the basic modeling units, and considers only the influence of the phonemes immediately before and after the current phoneme, forming triphone models. Each model takes the form "initial−tonal final+initial (or silence, sil)" or "tonal final (or sil)−initial+tonal final". The form "initial−tonal final+initial (or silence, sil)" means that the current unit is a tonal final which follows the pronunciation of one initial and precedes the pronunciation of another initial (or silence). For example, "zh-ong1+g" denotes the model of the tonal final ong1, more precisely the model of ong1 occurring after the pronunciation of zh and before the pronunciation of g. Similarly, "tonal final (or sil)−initial+tonal final" means that the current unit is an initial which follows the pronunciation of one tonal final (or silence) and precedes the pronunciation of another tonal final. For example, "ong1-g+uo2" denotes the model of the initial g, more precisely the model of g occurring after the pronunciation of ong1 and before the pronunciation of uo2. Besides the triphone models of these two forms, the system also establishes a silence model sil, used to describe the portions of the signal where no speech is present.
Except for the silence model sil, all triphone acoustic models, i.e. all models of the forms "initial−tonal final+initial (or silence, sil)" and "tonal final (or sil)−initial+tonal final", adopt a left-to-right continuous-density HMM structure. Each HMM comprises 5 states, of which 3 are emitting states. The silence model sil additionally allows jumps between the emitting states, to capture the widely varying durations of silences in speech. The HMM models are trained with the Baum-Welch algorithm, which is based on expectation maximization. As the system against which the present invention is compared, the baseline system must be completed first; accordingly, the training of the sil model and the triphone models is all finished in the initial stage.
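By way of illustration, plausible transition matrices for these two topologies are sketched below. The probability values, and the exact extra transitions chosen for sil (a forward skip over the middle emitting state and a backward jump to the first), are assumptions; the text specifies only that sil adds jumps between the emitting states.

```python
import numpy as np

# Plain left-to-right triphone topology: 5 states, the middle 3 emitting,
# entry and exit states non-emitting (HTK-style convention assumed).
triphone_trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],  # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],  # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

# Assumed sil topology: extra jumps between emitting states so that both
# very short and very long silences can be traversed.
sil_trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.3, 0.1, 0.0],  # forward skip over the middle state
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.1, 0.0, 0.5, 0.4],  # backward jump to the first emitting state
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
```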
Because the number of acoustic models increases sharply once context is taken into account, the baseline system uses a decision-tree-based state clustering algorithm to reduce the number of parameters that must be trained.
Step 3 uses the acoustic model of the above baseline system, together with the Viterbi algorithm, to force-align the speech features of the training corpus. Initially, the training corpus contains only the corresponding phoneme sequences; the start and end time of each phoneme within a sequence is unknown. That is, although the phoneme sequence of each whole utterance is known, the time at which each phoneme in the sequence begins and ends is not. To obtain the time boundary information of the phonemes, the algorithm uses the acoustic model of the baseline system and the Viterbi algorithm to assign a time boundary to every phoneme in the phoneme sequences corresponding to the training corpus, yielding time boundary information for the annotation text (the phoneme sequences) of the corpus. For each phoneme in the phone set obtained from the training corpus, its start position and end position in the speech features are obtained, and every speech frame lying between the start position and the end position is labeled with that phoneme, so that each speech frame is assigned to some phoneme. Thus the phoneme to which each speech frame belongs is obtained. A sketch of this expansion appears below.
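The following minimal Python sketch expands alignment segments into per-frame labels; the (phoneme, start_frame, end_frame) triple format is an assumed representation of the Viterbi alignment output.

```python
def frames_to_phonemes(alignment, num_frames):
    """Expand a forced alignment into one phoneme label per frame.

    `alignment` is assumed to be a list of (phoneme, start_frame,
    end_frame) triples, with end_frame exclusive and the segments
    covering all frames in order.
    """
    labels = [None] * num_frames
    for phoneme, start, end in alignment:
        for t in range(start, end):
            labels[t] = phoneme
    return labels

# e.g. frames_to_phonemes([("sil", 0, 12), ("zh-ong1+g", 12, 30)], 30)
```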
Fig. 5 is a flowchart of counting the response frequencies between phonemes and Gaussian components. For each speech frame, let m be the main Gaussian component computed in step 2 and p the phoneme obtained in step 3; the response frequency of m and p is denoted C_pm. Counting the response frequency of a given phoneme and a given Gaussian component means grouping all speech frames that share the same main Gaussian component and belong to the same phoneme into one class; the number of frames in that class is the response frequency between that phoneme and that Gaussian component. As shown in Fig. 5, if the first speech frame O in the speech training corpus has main Gaussian component m on the UBM and belongs to phoneme p, then Gaussian component m and phoneme p are said to have responded once, i.e. one frame has appeared in the corpus with m as its main Gaussian component and belonging to phoneme p. Counting all speech frames in the speech training corpus in this way yields the number of responding frames between every Gaussian component of the UBM and every phoneme of the phone set obtained from the corpus.
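The counting itself is a straightforward accumulation; a minimal Python sketch follows, assuming the two per-frame sequences produced by steps 2 and 3 are available in parallel.

```python
from collections import Counter

def count_responses(frame_phonemes, frame_main_components):
    """Count the response frequency C[p, m]: the number of frames whose
    forced-alignment phoneme is p and whose main UBM component is m.
    """
    counts = Counter()
    for p, m in zip(frame_phonemes, frame_main_components):
        counts[(p, m)] += 1
    return counts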
Fig. 6 is a flowchart of normalizing the response frequencies obtained in step 4 to obtain the guiding probabilities. As shown in Fig. 6, step 5 uses two different normalization methods to obtain the guiding probabilities: column normalization, given by formula (2), and row-column normalization, given by formula (3).
The column normalization method is introduced first. For a given phoneme p, the response frequency of every Gaussian component m of the UBM to p has been counted as C_pm (1 ≤ m ≤ M), where M is the number of Gaussian components in the UBM. The responses of all Gaussian components to this phoneme are normalized as in formula (2), giving the guiding probability, denoted r_pm:

    r_pm = C_pm / Σ_{i=1}^{M} C_pi        (2)

where r_pm is the guiding probability, expressing how large a share of the speech data of phoneme p is responded to by Gaussian component m of the UBM, and C_pi is the response frequency of component i (among the M components) to phoneme p. For example, r_pm = 0.3 means that 30% of the speech data of phoneme p has m as its main Gaussian component, or equivalently that 30% of the data of p falls within the corresponding local region of the feature space.
The row-column normalization method is introduced next. Row-column normalization considers the proportion of the whole speech frame training corpus taken up by the frames that belong to phoneme p and have m as their main Gaussian component. The normalization is given by formula (3):

    r_pm = C_pm / Σ_{j=1}^{P} Σ_{i=1}^{M} C_ji        (3)

where M is the number of Gaussian components in the UBM, P is the number of phonemes in the phone set obtained from the speech frame training corpus, and C_ji is the response frequency of component i (among the M components) to phoneme j (among the P phonemes).
In addition, considering that some guiding probabilities may be extremely small or even zero, the present invention sets a lower threshold T to smooth zero probabilities, so as to avoid a path being pruned outright merely because its guiding probability is too small. For column normalization, T = 1/M; for row-column normalization, T = 1/(MP). That is, in the column normalization method, if r_pm < 1/M then r_pm is set to 1/M, and otherwise r_pm is kept unchanged; in the row-column normalization method, if r_pm < 1/(MP) then r_pm is set to 1/(MP), and otherwise r_pm is kept unchanged. The smoothed guiding probabilities are thereby obtained.
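Both normalizations and the flooring step can be stated compactly; the sketch below assumes the response frequencies have been arranged into a P × M NumPy matrix C (rows indexed by phoneme, columns by Gaussian component).

```python
import numpy as np

def guiding_probabilities(C, mode="column"):
    """Normalize a P x M response-frequency matrix C into guiding
    probabilities r, with the zero-probability flooring of step 5.

    mode="column": r[p, m] = C[p, m] / sum_i C[p, i]        (formula (2)),
                   floored at T = 1/M;
    mode="rowcol": r[p, m] = C[p, m] / sum_j sum_i C[j, i]  (formula (3)),
                   floored at T = 1/(M*P).
    """
    P, M = C.shape
    if mode == "column":
        r = C / C.sum(axis=1, keepdims=True)
        T = 1.0 / M
    else:
        r = C / C.sum()
        T = 1.0 / (M * P)
    return np.maximum(r, T)  # smooth: raise any r < T up to T
```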
Fig. 7 is a flowchart of incorporating the guiding probabilities obtained in step 5 into the speech recognition decoding process. As shown in Fig. 7, during decoding, for the current speech frame O_t to be decoded, in addition to the acoustic model probability P_am and the language model probability P_lm, the guiding probability of O_t must also be computed. First, the probability of O_t on each Gaussian component of the UBM is computed by formula (1), and the component with the largest probability is taken as the main Gaussian component of O_t, denoted m. Then, the phoneme p to which O_t belongs is obtained from the position that path extension has reached in the decoding process: for example, if the path has extended into the silence model sil or into a triphone model of the form "a-b+c", the phoneme of the current frame is sil or b, respectively, where a, b and c conform to the forms "initial−tonal final+initial (or silence, sil)" or "tonal final (or sil)−initial+tonal final". Finally, the guiding probability of Gaussian component m for phoneme p is looked up in the statistics of step 5. By adding the guiding probability to the traditional path-score formula (4), the path-score formula (5) fusing the guiding probability is obtained, strengthening the traditional path score. By incorporating the positional information of the speech frame in the acoustic feature space in this way, the current path can be restricted or strengthened in a more targeted manner, while potential paths are preserved as far as possible.
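A sketch of this decode-time lookup follows, reusing the main_component function from the step 2 sketch. The name parsing assumes triphone model names written exactly in the "a-b+c" form described above.

```python
def current_phoneme(model_name):
    """Phoneme under the current frame, from the name of the model the
    path has expanded into: 'sil' stays 'sil'; a triphone such as
    'zh-ong1+g' yields its central unit 'ong1'.
    """
    if model_name == "sil":
        return "sil"
    return model_name.split("-", 1)[1].split("+", 1)[0]

def guiding_prob(r, p, o_t, means, variances):
    """Look up r_pm for frame o_t: find its main component m by formula
    (1), then read the smoothed guiding probability for phoneme row p.
    Mapping a phoneme name to its row index p is assumed to be handled
    elsewhere.
    """
    m = main_component(o_t, means, variances)
    return r[p, m]
```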
The traditional path score is computed as in formula (4):

    P(t) = P(t−1) + α_1·P_am + α_2·P_lm        (4)
Here P(t−1), P_am and P_lm are all probabilities in logarithmic form; t is the current time, t−1 the previous time, and P(t−1) is the total probability from the first instant up to time t−1. P_am is the acoustic probability, i.e. the probability of the current speech frame O_t on the state the path expands into, computed from the Gaussian mixture model of the corresponding state for O_t. P_lm is the language model probability, representing the dependencies at the word level. α_1 and α_2 are the weights of the acoustic probability and the language model probability, respectively. α_1 and α_2 are adjusted according to formula (4) so that the system attains the minimum Chinese character error rate: within a certain range of values, a first pair of values of α_1 and α_2 is chosen and the total probability is computed with these values by formula (4) to obtain one set of recognition results; α_1 and α_2 are then changed and the total probability is recomputed under the new values by formula (4) to obtain a new set of results. The values of α_1 and α_2 are varied in this way, and the pair giving the minimum Chinese character error rate is selected for use in the next step, the fusion of the guiding probability.
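This tuning is a plain grid search; a minimal sketch follows, in which decode and error_rate are hypothetical interfaces to the recognizer and to the character-error-rate scorer.

```python
def tune_weights(alpha1_grid, alpha2_grid, decode, error_rate, dev_set):
    """Grid search for the acoustic and language-model weights.

    `decode(dev_set, a1, a2)` and `error_rate(results)` are assumed
    interfaces standing in for the full recognition system.
    """
    best = (None, None, float("inf"))
    for a1 in alpha1_grid:
        for a2 in alpha2_grid:
            cer = error_rate(decode(dev_set, a1, a2))
            if cer < best[2]:
                best = (a1, a2, cer)
    return best  # (alpha_1, alpha_2, minimum character error rate)
```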
The speech recognition path score fusing the guiding probability is computed as in formula (5):

    P(t) = P(t−1) + α_1·P_am + α_2·P_lm + α_3·r_pm        (5)
where r_pm is the guiding probability of speech frame O_t for phoneme p when m is its main Gaussian component, and α_3 is the weight of the guiding probability.
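The fused score of formula (5) is a one-line computation; the sketch below states it directly, with all arguments taken as already computed (P_am and P_lm in the log domain as stated for formula (4), and r_pm the smoothed guiding probability, entering exactly as written in (5)).

```python
def path_score(prev_score, p_am, p_lm, r_pm, a1, a2, a3):
    """Accumulated path score of formula (5): prev_score is P(t-1),
    p_am and p_lm are the log-domain acoustic and language-model
    probabilities, and r_pm is added with weight a3 as in the formula.
    """
    return prev_score + a1 * p_am + a2 * p_lm + a3 * r_pm
```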
Afterwards, different values are assigned to α_3 in turn, and for each assignment the total probability is computed by formula (5), yielding different recognition results. By repeatedly assigning values to the guiding probability weight α_3 and selecting the best, the speech recognition decoding algorithm of the present invention that fuses the guiding probability is obtained.
The performance of the above algorithm proposed by the present invention was tested on a Chinese large-vocabulary continuous speech recognition system. The hardware platform of the experiments was a PC with an Intel 3.0 GHz CPU and 4 GB of memory; memory usage during operation was about 180-250 MB. The baseline system adopts context-dependent triphone acoustic models. The basic phone set used in the experiments comprises 24 initials and 37 finals, each final carrying 5 tones. In the Mandarin recognition experiments, the initials and finals that do not occur in the speech training corpus were removed, leaving a phone set of 191 basic phonemes in total. After the influence of context is taken into account, the number of acoustic models is 204,388. After the decision-tree-based state clustering algorithm is applied, the models contain 4,575 shared states in total. The Chinese dictionary contains 48,188 entries, and a bigram statistical language model was used in recognition.
The acoustic model, the language model and the guiding probability of the present invention affect recognition performance to different degrees. First, the weights α_1 and α_2 of the acoustic and language model probabilities were tuned in the baseline system, yielding a minimum Chinese character error rate of 12.78%. On this basis, adding the guiding probability clearly reduces the Chinese character error rate, to 11.61%. Compared with the traditional baseline system, this is a relative error-rate reduction of 9.15%.
The specific embodiments described above further illustrate the purpose, technical solutions and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.