Embodiment
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
In order to add the positional information of a speech frame in the acoustic feature space to the decoding process, the present invention, on the basis of the statistical response relation between phonemes and the Gaussian components of a universal background model (UBM), establishes a correspondence between each Gaussian component of the UBM and the phonemes, and thereby obtains the position of a speech frame in the acoustic feature space, expressed as the probabilities that the frame belongs to different local regions of that space; these are the guiding probabilities. During decoding, the positional information of the speech frame to be decoded (namely, the response relation among the phoneme to which the frame belongs, its main Gaussian component, and the Gaussian components of the universal background model) is used to revise the traditional path-score formula, so that the decoding system, on the basis of the traditional acoustic model and language model, further incorporates the positional information of the frame to be decoded into the decoding process.
Fig. 1 is a flowchart of calculating the guiding probabilities from the response relation between phonemes and the Gaussian components of the universal background model, and of adding the resulting guiding probabilities to the path score. Before counting the response frequencies between phonemes and the Gaussian components of the UBM, a UBM must first be trained to describe the whole acoustic feature space. After the UBM is obtained, the main Gaussian component on the UBM is computed for every speech frame in the speech feature training corpus. At the same time, the speech feature training corpus is force-aligned to obtain the phoneme to which each speech frame belongs. Then all speech frames that share the same main Gaussian component and belong to the same phoneme are grouped into one class, and the number of frames in that class is counted; this number is the response frequency between the given phoneme and the given Gaussian component. Normalizing the response frequencies yields the guiding probabilities, which are then fused into the traditional path-score computation to obtain a speech recognition decoding algorithm that incorporates the guiding probabilities. As shown in Fig. 1, the method specifically comprises the following steps:
Step 1: train a universal background model (UBM) to describe the whole acoustic feature space, where the UBM is a Gaussian mixture model;
Step 2: compute the main Gaussian component of each speech frame on the UBM;
Step 3: use the acoustic model of the recognition system to force-align the speech feature training corpus and obtain the phoneme to which each speech frame belongs;
Step 4: count the response frequencies between phonemes and Gaussian components;
Step 5: normalize the response frequencies to obtain the guiding probabilities;
Step 6: fuse the guiding probabilities into the traditional path-score computation, thereby guiding the decoder to strengthen or weaken paths.
Each of the above steps is described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of training the universal background model from the speech features in the speech feature training corpus. The present invention adopts two different UBM training methods: either the whole speech feature training corpus is used directly to train the UBM with the expectation-maximization (EM) algorithm, or the Gaussian components contained in the per-phoneme models are updated with the speech feature data and then pooled into a UBM. The two methods are shown in Fig. 2(a) and Fig. 2(b), respectively.
As shown in Fig. 2(a), the method of training the UBM directly on the whole speech feature training corpus with the EM algorithm specifically comprises the following. First, the UBM is initialized as a multidimensional standard Gaussian distribution whose dimension equals the feature dimension; for example, the present invention extracts Mel-frequency cepstral coefficients (MFCC) from the speech signal as features, comprising 12 cepstral coefficients and 1 energy term together with their first- and second-order differences, for a 39-dimensional feature vector. The mean and variance of the Gaussian distribution are then adjusted with the EM algorithm over the whole speech feature training corpus. The EM algorithm seeks maximum-likelihood or maximum a posteriori estimates of the parameters of a probability model and is used here to estimate the posterior probability density function. In speech recognition the probability density function is represented by a Gaussian mixture model, so the quantities estimated are mainly the mean, variance and related parameters of each Gaussian component. In each iteration, the expectation step (E step) uses the current estimates of the hidden variables to compute the expected likelihood; the maximization step (M step) then maximizes the likelihood obtained in the E step to update the parameter values. The estimates found in the M step are used in the next E step, and the two steps alternate until the parameter estimation of the Gaussian mixture model is complete. Subsequently, to describe the local acoustic feature space more accurately, the algorithm uses the HHEd tool of HTK to split the Gaussian components in the HMM state: any component of the mixture is split into two Gaussian components with equal mean and variance and half the original weight, thereby increasing the number of Gaussian distributions in the UBM. The steps of updating the means and variances of the Gaussians and of increasing their number are iterated in a loop until the number of Gaussian components in the UBM reaches the desired value, yielding the final UBM.
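By way of illustration, the following minimal Python sketch splits one diagonal-covariance mixture component into two half-weight copies. The small ±0.2-standard-deviation mean perturbation is an assumption borrowed from HTK's usual mixture-splitting heuristic, so that the two copies can diverge under subsequent EM updates; the text itself requires only equal means and variances and halved weights.

```python
import numpy as np

def split_component(weights, means, variances, k):
    """Split mixture component k into two half-weight copies.

    Diagonal covariances are assumed. The +/-0.2 standard-deviation
    mean offset follows HTK's mixture-splitting heuristic and is an
    illustrative assumption here, letting EM later separate the copies.
    """
    w, mu, var = weights[k] / 2.0, means[k], variances[k]
    offset = 0.2 * np.sqrt(var)
    weights = np.concatenate([np.delete(weights, k), [w, w]])
    means = np.vstack([np.delete(means, k, axis=0), mu - offset, mu + offset])
    variances = np.vstack([np.delete(variances, k, axis=0), var, var])
    return weights, means, variances
```

Repeating EM updates and such splits in a loop grows the UBM to the desired number of components.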
As shown in Fig. 2(b), the method of building the UBM by updating the Gaussian components contained in each phoneme specifically comprises the following. First, all phonemes occurring in the speech feature training corpus are collected, and a hidden Markov model (HMM) is established for each phoneme. Second, the HMM parameters are updated with the Baum-Welch algorithm to obtain the trained HMM models. The Baum-Welch algorithm uses a set of observation sequences to train the parameters of a continuous-density Gaussian-mixture HMM, including the state transition probability matrix of the model and, for each state, the mean vectors, covariance matrices and mixture weights. In estimating the HMM parameters, Baum-Welch uses maximum-likelihood estimation, and its adjustment of the Gaussian component parameters on each state is consistent with the EM algorithm. After this, all Gaussian components of the HMM models are pooled with equal weights, the weights summing to 1, to form the initial universal background model; the parameters of each Gaussian component of this initial UBM are then adjusted with the EM algorithm over the whole speech feature training corpus, and the updated UBM is the final UBM.
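As a sketch of the pooling step, the following Python fragment collects every Gaussian into one equal-weight mixture. Reducing each phoneme HMM to a list of (mean, variance) pairs is a simplifying assumption standing in for a full HMM definition.

```python
import numpy as np

def pool_ubm(phoneme_hmms):
    """Pool every Gaussian from every phoneme HMM into an equal-weight
    mixture (the initial UBM).

    `phoneme_hmms` is assumed to map each phoneme name to a list of
    (mean, variance) pairs, one per Gaussian component.
    """
    means, variances = [], []
    for gaussians in phoneme_hmms.values():
        for mu, var in gaussians:
            means.append(mu)
            variances.append(var)
    n = len(means)
    weights = np.full(n, 1.0 / n)  # equal weights summing to 1
    return weights, np.vstack(means), np.vstack(variances)
```

The pooled mixture is then refined by EM over the whole corpus, as the text describes.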
Fig. 3 is a flowchart of computing the main Gaussian component of a speech frame on the universal background model UBM. As shown in Fig. 3, for each speech frame in the whole speech feature training corpus, its score under the probability density function of each Gaussian component of the UBM obtained in step 1 is computed, and the Gaussian component with the highest score is taken as its main Gaussian component. Specifically:
Suppose the UBM comprises M Gaussian components, and denote the probability density function of the m-th component by λ_m, or in parametric form N(O; μ_m, Σ_m), where μ_m and Σ_m are the mean and variance of the m-th component. The probability of a speech frame O under λ_m is then computed as P(O|λ_m) = N(O; μ_m, Σ_m). For each speech frame O, the Gaussian component of the UBM whose probability P(O|λ_m) for O is largest is defined as the main Gaussian component of O. That is, the main Gaussian component of frame O is the component m satisfying formula (1):

    m = argmax_{1≤m'≤M} P(O|λ_{m'})        (1)

In other words, for every component m' of the Gaussian mixture, its probability P(O|λ_{m'}) for frame O is computed, and the single component is sought on whose probability density function O scores higher than on any other component; the component found is the main Gaussian component, denoted m.
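A minimal Python sketch of this computation is given below, assuming diagonal covariances stored as NumPy arrays. Note that formula (1) compares the component densities themselves, so no mixture weights appear.

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log density of frame o under a diagonal Gaussian N(o; mu, var)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def main_component(o, means, variances):
    """Index m of the UBM component maximizing P(O | lambda_m), as in (1)."""
    scores = [log_gauss(o, means[m], variances[m]) for m in range(len(means))]
    return int(np.argmax(scores))
```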
Fig. 4 is a flowchart of force-aligning the speech feature training corpus with the acoustic model of the recognition system to obtain the phoneme to which each speech frame belongs. As shown in Fig. 4, the acoustic model is a context-dependent triphone model, and this model serves as the baseline system. The baseline system is the comparison system: the performance improvement of the speech recognition system realized by the present invention is obtained by comparison with the baseline system. The baseline system uses context-dependent initials and tonal finals as the basic modeling units, and considers only the influence of the phonemes immediately before and after the current phoneme, forming triphone models. Each model takes the form "initial−tonal final+initial (or silence, sil)" or "tonal final (or sil)−initial+tonal final". The form "initial−tonal final+initial (or silence, sil)" means that the current unit is a tonal final which follows the pronunciation of one initial and precedes the pronunciation of another initial (or silence). For example, "zh-ong1+g" denotes the model of the tonal final ong1, more precisely the model of ong1 occurring after the pronunciation of zh and before the pronunciation of g. Similarly, "tonal final (or sil)−initial+tonal final" means that the current unit is an initial which follows the pronunciation of one tonal final (or silence) and precedes the pronunciation of another tonal final. For example, "ong1-g+uo2" denotes the model of the initial g, more precisely the model of g occurring after the pronunciation of ong1 and before the pronunciation of uo2. Besides the triphone models of these two forms, the system also establishes a silence model sil, used to describe the portions of the signal where no speech is present.
Except for the silence model sil, all triphone acoustic models, i.e. all models of the forms "initial−tonal final+initial (or silence, sil)" and "tonal final (or sil)−initial+tonal final", adopt a left-to-right continuous-density HMM structure. Each HMM comprises 5 states, of which 3 are emitting states. The silence model sil additionally allows jumps between the emitting states, to capture the widely varying durations of silences in speech. The HMM models are trained with the Baum-Welch algorithm, which is based on expectation maximization. As the system against which the present invention is compared, the baseline system must be completed first; accordingly, the training of the sil model and the triphone models is all finished in the initial stage.
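By way of illustration, plausible transition matrices for these two topologies are sketched below. The probability values, and the exact extra transitions chosen for sil (a forward skip over the middle emitting state and a backward jump to the first), are assumptions; the text specifies only that sil adds jumps between the emitting states.

```python
import numpy as np

# Plain left-to-right triphone topology: 5 states, the middle 3 emitting,
# entry and exit states non-emitting (HTK-style convention assumed).
triphone_trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],  # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],  # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

# Assumed sil topology: extra jumps between emitting states so that both
# very short and very long silences can be traversed.
sil_trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.3, 0.1, 0.0],  # forward skip over the middle state
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.1, 0.0, 0.5, 0.4],  # backward jump to the first emitting state
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
```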
Because the number of acoustic models increases sharply once context is taken into account, the baseline system uses a decision-tree-based state clustering algorithm to reduce the number of parameters that must be trained.
Step 3 uses the acoustic model of the above baseline system, together with the Viterbi algorithm, to force-align the speech features of the training corpus. Initially, the training corpus contains only the corresponding phoneme sequences; the start and end time of each phoneme within a sequence is unknown. That is, although the phoneme sequence of each whole utterance is known, the time at which each phoneme in the sequence begins and ends is not. To obtain the time boundary information of the phonemes, the algorithm uses the acoustic model of the baseline system and the Viterbi algorithm to assign a time boundary to every phoneme in the phoneme sequences corresponding to the training corpus, yielding time boundary information for the annotation text (the phoneme sequences) of the corpus. For each phoneme in the phone set obtained from the training corpus, its start position and end position in the speech features are obtained, and every speech frame lying between the start position and the end position is labeled with that phoneme, so that each speech frame is assigned to some phoneme. Thus the phoneme to which each speech frame belongs is obtained. A sketch of this expansion appears below.
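The following minimal Python sketch expands alignment segments into per-frame labels; the (phoneme, start_frame, end_frame) triple format is an assumed representation of the Viterbi alignment output.

```python
def frames_to_phonemes(alignment, num_frames):
    """Expand a forced alignment into one phoneme label per frame.

    `alignment` is assumed to be a list of (phoneme, start_frame,
    end_frame) triples, with end_frame exclusive and the segments
    covering all frames in order.
    """
    labels = [None] * num_frames
    for phoneme, start, end in alignment:
        for t in range(start, end):
            labels[t] = phoneme
    return labels

# e.g. frames_to_phonemes([("sil", 0, 12), ("zh-ong1+g", 12, 30)], 30)
```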
Fig. 5 is a flowchart of counting the response frequencies between phonemes and Gaussian components. For each speech frame, let m be the main Gaussian component computed in step 2 and p the phoneme obtained in step 3; the response frequency of m and p is denoted C_pm. Counting the response frequency of a given phoneme and a given Gaussian component means grouping all speech frames that share the same main Gaussian component and belong to the same phoneme into one class; the number of frames in that class is the response frequency between that phoneme and that Gaussian component. As shown in Fig. 5, if the first speech frame O in the speech training corpus has main Gaussian component m on the UBM and belongs to phoneme p, then Gaussian component m and phoneme p are said to have responded once, i.e. one frame has appeared in the corpus with m as its main Gaussian component and belonging to phoneme p. Counting all speech frames in the speech training corpus in this way yields the number of responding frames between every Gaussian component of the UBM and every phoneme of the phone set obtained from the corpus.
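The counting itself is a straightforward accumulation; a minimal Python sketch follows, assuming the two per-frame sequences produced by steps 2 and 3 are available in parallel.

```python
from collections import Counter

def count_responses(frame_phonemes, frame_main_components):
    """Count the response frequency C[p, m]: the number of frames whose
    forced-alignment phoneme is p and whose main UBM component is m.
    """
    counts = Counter()
    for p, m in zip(frame_phonemes, frame_main_components):
        counts[(p, m)] += 1
    return counts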
Fig. 6 is a flowchart of normalizing the response frequencies obtained in step 4 to obtain the guiding probabilities. As shown in Fig. 6, step 5 uses two different normalization methods to obtain the guiding probabilities: column normalization, given by formula (2), and row-column normalization, given by formula (3).
The column normalization method is introduced first. For a given phoneme p, the response frequency of every Gaussian component m of the UBM to p has been counted as C_pm (1 ≤ m ≤ M), where M is the number of Gaussian components in the UBM. The responses of all Gaussian components to this phoneme are normalized as in formula (2), giving the guiding probability, denoted r_pm:

    r_pm = C_pm / Σ_{i=1}^{M} C_pi        (2)

where r_pm is the guiding probability, expressing how large a share of the speech data of phoneme p is responded to by Gaussian component m of the UBM, and C_pi is the response frequency of component i (among the M components) to phoneme p. For example, r_pm = 0.3 means that 30% of the speech data of phoneme p has m as its main Gaussian component, or equivalently that 30% of the data of p falls within the corresponding local region of the feature space.
The row-column normalization method is introduced next. Row-column normalization considers the proportion of the whole speech frame training corpus taken up by the frames that belong to phoneme p and have m as their main Gaussian component. The normalization is given by formula (3):

    r_pm = C_pm / Σ_{j=1}^{P} Σ_{i=1}^{M} C_ji        (3)

where M is the number of Gaussian components in the UBM, P is the number of phonemes in the phone set obtained from the speech frame training corpus, and C_ji is the response frequency of component i (among the M components) to phoneme j (among the P phonemes).
In addition, considering that some guiding probabilities may be extremely small or even zero, the present invention sets a lower threshold T to smooth zero probabilities, so as to avoid a path being pruned outright merely because its guiding probability is too small. For column normalization, T = 1/M; for row-column normalization, T = 1/(MP). That is, in the column normalization method, if r_pm < 1/M then r_pm is set to 1/M, and otherwise r_pm is kept unchanged; in the row-column normalization method, if r_pm < 1/(MP) then r_pm is set to 1/(MP), and otherwise r_pm is kept unchanged. The smoothed guiding probabilities are thereby obtained.
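Both normalizations and the flooring step can be stated compactly; the sketch below assumes the response frequencies have been arranged into a P × M NumPy matrix C (rows indexed by phoneme, columns by Gaussian component).

```python
import numpy as np

def guiding_probabilities(C, mode="column"):
    """Normalize a P x M response-frequency matrix C into guiding
    probabilities r, with the zero-probability flooring of step 5.

    mode="column": r[p, m] = C[p, m] / sum_i C[p, i]        (formula (2)),
                   floored at T = 1/M;
    mode="rowcol": r[p, m] = C[p, m] / sum_j sum_i C[j, i]  (formula (3)),
                   floored at T = 1/(M*P).
    """
    P, M = C.shape
    if mode == "column":
        r = C / C.sum(axis=1, keepdims=True)
        T = 1.0 / M
    else:
        r = C / C.sum()
        T = 1.0 / (M * P)
    return np.maximum(r, T)  # smooth: raise any r < T up to T
```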
Fig. 7 is a flowchart of incorporating the guiding probabilities obtained in step 5 into the speech recognition decoding process. As shown in Fig. 7, during decoding, for the current speech frame O_t to be decoded, in addition to the acoustic model probability P_am and the language model probability P_lm, the guiding probability of O_t must also be computed. First, the probability of O_t on each Gaussian component of the UBM is computed by formula (1), and the component with the largest probability is taken as the main Gaussian component of O_t, denoted m. Then, the phoneme p to which O_t belongs is obtained from the position that path extension has reached in the decoding process: for example, if the path has extended into the silence model sil or into a triphone model of the form "a-b+c", the phoneme of the current frame is sil or b, respectively, where a, b and c conform to the forms "initial−tonal final+initial (or silence, sil)" or "tonal final (or sil)−initial+tonal final". Finally, the guiding probability of Gaussian component m for phoneme p is looked up in the statistics of step 5. By adding the guiding probability to the traditional path-score formula (4), the path-score formula (5) fusing the guiding probability is obtained, strengthening the traditional path score. By incorporating the positional information of the speech frame in the acoustic feature space in this way, the current path can be restricted or strengthened in a more targeted manner, while potential paths are preserved as far as possible.
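A sketch of this decode-time lookup follows, reusing the main_component function from the step 2 sketch. The name parsing assumes triphone model names written exactly in the "a-b+c" form described above.

```python
def current_phoneme(model_name):
    """Phoneme under the current frame, from the name of the model the
    path has expanded into: 'sil' stays 'sil'; a triphone such as
    'zh-ong1+g' yields its central unit 'ong1'.
    """
    if model_name == "sil":
        return "sil"
    return model_name.split("-", 1)[1].split("+", 1)[0]

def guiding_prob(r, p, o_t, means, variances):
    """Look up r_pm for frame o_t: find its main component m by formula
    (1), then read the smoothed guiding probability for phoneme row p.
    Mapping a phoneme name to its row index p is assumed to be handled
    elsewhere.
    """
    m = main_component(o_t, means, variances)
    return r[p, m]
```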
The traditional path score is computed as in formula (4):

    P(t) = P(t−1) + α_1·P_am + α_2·P_lm        (4)
Here P(t−1), P_am and P_lm are all probabilities in logarithmic form; t is the current time, t−1 the previous time, and P(t−1) is the total probability from the first instant up to time t−1. P_am is the acoustic probability, i.e. the probability of the current speech frame O_t on the state the path expands into, computed from the Gaussian mixture model of the corresponding state for O_t. P_lm is the language model probability, representing the dependencies at the word level. α_1 and α_2 are the weights of the acoustic probability and the language model probability, respectively. α_1 and α_2 are adjusted according to formula (4) so that the system attains the minimum Chinese character error rate: within a certain range of values, a first pair of values of α_1 and α_2 is chosen and the total probability is computed with these values by formula (4) to obtain one set of recognition results; α_1 and α_2 are then changed and the total probability is recomputed under the new values by formula (4) to obtain a new set of results. The values of α_1 and α_2 are varied in this way, and the pair giving the minimum Chinese character error rate is selected for use in the next step, the fusion of the guiding probability.
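This tuning is a plain grid search; a minimal sketch follows, in which decode and error_rate are hypothetical interfaces to the recognizer and to the character-error-rate scorer.

```python
def tune_weights(alpha1_grid, alpha2_grid, decode, error_rate, dev_set):
    """Grid search for the acoustic and language-model weights.

    `decode(dev_set, a1, a2)` and `error_rate(results)` are assumed
    interfaces standing in for the full recognition system.
    """
    best = (None, None, float("inf"))
    for a1 in alpha1_grid:
        for a2 in alpha2_grid:
            cer = error_rate(decode(dev_set, a1, a2))
            if cer < best[2]:
                best = (a1, a2, cer)
    return best  # (alpha_1, alpha_2, minimum character error rate)
```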
The speech recognition path score fusing the guiding probability is computed as in formula (5):

    P(t) = P(t−1) + α_1·P_am + α_2·P_lm + α_3·r_pm        (5)
where r_pm is the guiding probability of speech frame O_t for phoneme p when m is its main Gaussian component, and α_3 is the weight of the guiding probability.
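The fused score of formula (5) is a one-line computation; the sketch below states it directly, with all arguments taken as already computed (P_am and P_lm in the log domain as stated for formula (4), and r_pm the smoothed guiding probability, entering exactly as written in (5)).

```python
def path_score(prev_score, p_am, p_lm, r_pm, a1, a2, a3):
    """Accumulated path score of formula (5): prev_score is P(t-1),
    p_am and p_lm are the log-domain acoustic and language-model
    probabilities, and r_pm is added with weight a3 as in the formula.
    """
    return prev_score + a1 * p_am + a2 * p_lm + a3 * r_pm
```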
Afterwards, different values are assigned to α_3 in turn, and for each assignment the total probability is computed by formula (5), yielding different recognition results. By repeatedly assigning values to the guiding probability weight α_3 and selecting the best, the speech recognition decoding algorithm of the present invention that fuses the guiding probability is obtained.
The performance of the above algorithm proposed by the present invention was tested on a Chinese large-vocabulary continuous speech recognition system. The hardware platform of the experiments was a PC with an Intel 3.0 GHz CPU and 4 GB of memory; memory usage during operation was about 180-250 MB. The baseline system adopts context-dependent triphone acoustic models. The basic phone set used in the experiments comprises 24 initials and 37 finals, each final carrying 5 tones. In the Mandarin recognition experiments, the initials and finals that do not occur in the speech training corpus were removed, leaving a phone set of 191 basic phonemes in total. After the influence of context is taken into account, the number of acoustic models is 204,388. After the decision-tree-based state clustering algorithm is applied, the models contain 4,575 shared states in total. The Chinese dictionary contains 48,188 entries, and a bigram statistical language model was used in recognition.
The acoustic model, the language model and the guiding probability of the present invention affect recognition performance to different degrees. First, the weights α_1 and α_2 of the acoustic and language model probabilities were tuned in the baseline system, yielding a minimum Chinese character error rate of 12.78%. On this basis, adding the guiding probability clearly reduces the Chinese character error rate, to 11.61%. Compared with the traditional baseline system, this is a relative error-rate reduction of 9.15%.
The specific embodiments described above further illustrate the purpose, technical solutions and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.