
CN102982799A - Speech recognition optimization decoding method integrating guide probability - Google Patents


Info

Publication number
CN102982799A
Authority
CN
China
Prior art keywords
phoneme
model
probability
gaussian component
background model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105607454A
Other languages
Chinese (zh)
Inventor
刘文举
杨占磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2012105607454A priority Critical patent/CN102982799A/en
Publication of CN102982799A publication Critical patent/CN102982799A/en
Pending legal-status Critical Current

Abstract

The invention discloses a speech recognition decoding method that integrates a guide probability. Traditional speech recognition systems fail to exploit the position of a speech frame in the acoustic feature space; the invention therefore proposes a guide probability model that describes the probability that a speech frame belongs to different local regions of the acoustic feature space and uses it to steer the decoding process. The scheme of the invention comprises the following steps: train a universal background model that describes the whole acoustic feature space; compute the main Gaussian component of each speech frame on the universal background model; force-align the training corpus with the acoustic model of the recognition system to obtain the phoneme to which each speech frame belongs; count the response frequencies between phonemes and main Gaussians; normalize the response frequencies to obtain the guide probabilities; and fuse the guide probability into the computation of the total path score of speech recognition, thereby guiding the decoder to strengthen or weaken paths.

Description

Speech recognition optimization decoding method integrating a guide probability
Technical field
The present invention relates to the field of speech recognition, and in particular to acoustic modeling and decoding for speech recognition.
Background art
At present, speech recognition systems generally adopt the Hidden Markov Model (HMM) as the basic model for acoustic modeling and decoding. To account for the influence of pronunciation context on a speech unit, triphone models are widely used to improve the recognition rate. Once context is considered, however, the number of models and the parameter scale increase sharply. Taking a Chinese large-vocabulary continuous speech recognition system as an example, the basic phoneme set contains only 191 initials and tonal finals, yet the total number of corresponding triphone models exceeds 200,000. Even after parameter sharing at the model, state, and Gaussian-component levels, the parameter scale remains huge. This not only leaves too little data to train the parameters but also introduces excessive decoding complexity at the recognition stage. Fully exploiting the useful information in the existing training data is therefore of great significance for improving speech recognition performance, whether by compressing the acoustic model parameter scale or by raising model accuracy.
At a 2009 workshop held at Johns Hopkins University in the United States, with new languages and new domains as the application background, a speech recognition method based on subspace Gaussian mixture models (Subspace GMM) was proposed (Reference 1: D. Povey, "A tutorial-style introduction to subspace gaussian mixture models for speech recognition," Tech. Rep. MSR-TR-2009-111, Microsoft Research, 2009). Unlike the traditional Hidden Markov Model (HMM), in which each state is directly associated with a Gaussian mixture model (GMM), each state of a subspace GMM is directly associated with a vector from which the associated GMM is computed. Because the dimensionality of this vector is far lower than the parameter scale of a GMM, the acoustic model is more compact and can achieve better recognition than a conventional model on limited training data.
Besides compressing and improving acoustic modeling, the path extension and pruning mechanisms of the decoding stage can also be improved so that the most promising paths are retained. A traditional decoding process uses only the acoustic model probability and the language model probability when computing a path score, and takes the overall probability obtained by fusing the two as the basis for extension or pruning.
However, in existing decoding techniques different models may assign identical scores to the same speech segment, so relying only on the acoustic and language models makes it difficult to maximally distinguish different phonemes. This manifests itself as a rapid expansion of search paths during decoding and as pruning errors. For example, in the existing beam search technique, the decoder indiscriminately computes the probability of every path in the whole search space, retains the paths whose probability differs from the maximum by no more than the beam, and deletes the paths whose probability is too small. This traditional decoding method pays no attention to the local structure of the feature space: every path is extended and pruned on an equal footing.
In fact, any frame of speech features lies in some local region of the acoustic feature space. The present invention exploits the position of the speech frame to be recognized in the acoustic feature space: it strengthens the search in that local region, reinforcing the paths that pass through it so that they are retained and extended as far as possible, while paths that do not belong to that region are not reinforced. After the local search is strengthened, the proportion of all paths that pass through this region increases, so that the retained and extended paths are more likely to contain the correct path. Compared with traditional decoding algorithms, the algorithm proposed in the present invention adds as many promising paths as possible to the path set while weakening paths with little potential.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to overcome the deficiency that existing speech recognition decoding techniques neither exploit the position of the speech frame to be recognized in the acoustic feature space nor strengthen the search in particular local regions of that space.
(2) Technical solution
To solve the above problem, the present invention proposes a speech recognition decoding method that integrates a guide probability, characterized by comprising the following steps:
Step a: train a universal background model to describe the whole acoustic feature space;
Step b: compute the main Gaussian component of each speech frame on said universal background model;
Step c: force-align the training corpus with the acoustic model to obtain the phoneme to which each speech frame belongs;
Step d: count the response frequencies between phonemes and the Gaussian components of said universal background model;
Step e: compute the guide probabilities from said response frequencies;
Step f: fuse the guide probability into the computation of the total path score of speech recognition, thereby strengthening or weakening the scores of recognition paths.
(3) Beneficial effects
Addressing the failure of traditional speech recognition systems to exploit the positional information of speech frames in the acoustic feature space, the present invention proposes a guide probability model to describe this positional information and fuses the guide probability into the total path score of the speech recognition decoding process. The new method, which integrates the positional information of speech frames, emphasizes the search of the most promising parts of the acoustic feature space, retaining and extending the paths that pass through such a local region while weakening the paths that do not, so that the scores of different paths become more discriminative. By fusing the guide probability, the decoder can be assisted in screening potential paths, ultimately reducing the Chinese character error rate of the recognition system.
Description of the drawings
Fig. 1 is a flowchart of the guide-probability-based speech recognition decoding algorithm according to the present invention;
Fig. 2(a) is a schematic diagram of training the universal background model directly with the EM algorithm according to the present invention;
Fig. 2(b) is a schematic diagram of training the universal background model by pooling the Gaussian components of all phonemes according to the present invention;
Fig. 3 is a flowchart of computing the main Gaussian component of a speech frame to be decoded on the universal background model according to the present invention;
Fig. 4 is a flowchart of force-aligning the training corpus with the acoustic model to obtain the phoneme to which each speech frame belongs, according to the present invention;
Fig. 5 is a flowchart of counting the response frequencies between phonemes and main Gaussians according to the present invention;
Fig. 6 is a flowchart of normalizing the response frequencies to obtain the guide probabilities according to the present invention;
Fig. 7 is a flowchart of fusing the guide probability into the total path score of speech recognition, thereby guiding the decoder to strengthen or weaken paths, according to the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
To add the positional information of speech frames in the acoustic feature space to the decoding process, the present invention counts the response relation between phonemes and the Gaussian components of a universal background model (UBM). By establishing the correspondence between each Gaussian component of the UBM and the phonemes, the position of a speech frame in the acoustic feature space is obtained and expressed as the probability that the frame belongs to different local regions of that space; this yields the guide probability. During decoding, the positional information of the speech frame to be decoded (i.e., the phoneme to which the frame belongs, its main Gaussian component on the UBM, and the response relation between that phoneme and the Gaussian components of the UBM) is used to modify the traditional formula for the total path score, so that the decoding system, on top of the traditional acoustic and language models, further incorporates the positional information of the frame to be decoded into the decoding process.
Fig. 1 is a flowchart of computing the guide probability from the response relation between phonemes and the Gaussian components of the universal background model, and of adding the obtained guide probability to the total path score. Before counting the correspondence between phonemes and the Gaussian components of the UBM, a UBM describing the whole acoustic feature space must first be trained. Once the UBM is obtained, the main Gaussian component of every speech frame in the training corpus is computed on it. At the same time, the training corpus is force-aligned to obtain the phoneme to which each speech frame belongs. Then all frames that share the same main Gaussian component and belong to the same phoneme are grouped into one class; the number of frames in the class is the response frequency between that phoneme and that Gaussian component. Normalizing the response frequencies yields the guide probabilities, and fusing the guide probability with the traditional computation of the total path score gives the speech recognition decoding algorithm that integrates the guide probability. As shown in Fig. 1, the method comprises the following steps:
Step 1: train a universal background model (UBM) to describe the whole acoustic feature space, where the UBM is a Gaussian mixture model;
Step 2: compute the main Gaussian component of each speech frame on the UBM;
Step 3: force-align the training corpus with the acoustic model of the recognition system to obtain the phoneme to which each speech frame belongs;
Step 4: count the response frequencies between phonemes and Gaussian components;
Step 5: normalize said response frequencies to obtain the guide probabilities;
Step 6: fuse the guide probability into the traditional computation of the total path score, thereby guiding the decoder to strengthen or weaken paths.
Each of the above steps is described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of training the universal background model from the speech features of the training corpus. The present invention adopts two different UBM training methods: either the expectation-maximization (EM) algorithm is applied directly to the whole speech feature training corpus, or the Gaussian components contained in each phoneme model are updated with the speech feature data and pooled to obtain the UBM. The two methods are shown in Fig. 2(a) and Fig. 2(b), respectively.
As shown in Fig. 2(a), the method of training the UBM directly on the whole speech feature training corpus with the EM algorithm proceeds as follows. The UBM is first set to a single multidimensional standard Gaussian whose dimensionality equals that of the speech features. In the present invention, Mel-frequency cepstral coefficients (MFCC) are extracted from the speech signal as features, comprising 12 cepstral coefficients and 1-dimensional speech energy together with their first-order and second-order differences, giving a 39-dimensional feature vector. The EM algorithm is then run on the whole training corpus to adjust the mean and variance of the Gaussian. The expectation-maximization (EM) algorithm finds maximum likelihood or maximum a posteriori estimates of the parameters of a probabilistic model and is used to estimate the posterior probability density function; since speech recognition represents the density with a Gaussian mixture model, the quantities to estimate are chiefly the mean and variance of each Gaussian component. In the E step, the current parameter estimates of the hidden variables are used to compute the maximum likelihood estimate; in the M step, the likelihood obtained in the E step is maximized to compute new parameter values. The estimates found in the M step feed the next E step, and the two steps alternate until the parameter estimation of the mixture is complete. Subsequently, to describe local regions of the acoustic feature space more accurately, the algorithm uses the HHEd tool of HTK to split the Gaussian components in the HMM states: each component of the mixture is split into two components with equal means and variances and halved weights, thereby increasing the number of Gaussians in the UBM. The mean/variance updates and the component splitting are iterated in a loop until the number of Gaussian components in the UBM reaches the desired value, yielding the final UBM.
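As an illustration of this training scheme, the split-and-retrain loop can be sketched in Python — a minimal sketch using scikit-learn in place of HTK's HHEd tool; the diagonal covariances, the target component count, and the small perturbation of the duplicated means (added so EM can separate the two halves of a split) are assumptions of the sketch, not prescriptions of the method:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm_by_splitting(features, target_components=512, em_iters=10):
    """Train a GMM-based UBM by alternating EM updates with component splitting.

    features: (n_frames, 39) array of MFCC + energy + delta features.
    Each split doubles the mixture: every component is duplicated with the
    same mean/variance and half the weight, as in the patent's description.
    """
    n_comp = 1
    gmm = GaussianMixture(n_components=1, covariance_type="diag",
                          max_iter=em_iters).fit(features)
    while n_comp < target_components:
        # Split: duplicate each component with halved weight; the duplicated
        # means are nudged slightly so the EM re-estimation can separate them.
        means = np.vstack([gmm.means_, gmm.means_ + 1e-3])
        covs = np.vstack([gmm.covariances_, gmm.covariances_])
        weights = np.hstack([gmm.weights_, gmm.weights_]) / 2.0
        n_comp *= 2
        gmm = GaussianMixture(n_components=n_comp, covariance_type="diag",
                              weights_init=weights, means_init=means,
                              precisions_init=1.0 / covs, max_iter=em_iters)
        gmm.fit(features)  # EM re-estimation after the split
    return gmm
```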
As shown in Fig. 2(b), the method of obtaining the UBM by updating the Gaussian components contained in each phoneme proceeds as follows. First, all phonemes occurring in the speech feature training corpus are collected, and a Hidden Markov Model (HMM) is built for each phoneme. Second, the HMM parameters are updated with the Baum-Welch algorithm to obtain the trained HMMs. The Baum-Welch algorithm uses a set of observation sequences to train the parameters of a continuous-density Gaussian mixture HMM, including the state transition probability matrix and the mean vectors, covariance matrices, and mixture weights; it performs maximum likelihood estimation, and its adjustment of the Gaussian parameters on each state is consistent with the EM algorithm. After that, the Gaussian components of all trained HMMs are pooled with equal weights (summing to 1) to form an initial universal background model; the EM algorithm is then run on the whole training corpus to adjust the parameters of each Gaussian component of this initial UBM, and the updated model is the final UBM.
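The pooling step of Fig. 2(b) can be sketched as follows; the container format for the trained phoneme HMMs is hypothetical, and the pooled model would still be refined with EM passes over the whole corpus as described above:

```python
import numpy as np

def pool_phoneme_gaussians(phoneme_hmms):
    """Build an initial UBM by pooling the Gaussians of trained phoneme HMMs.

    phoneme_hmms: dict mapping phoneme -> list of (mean, variance) pairs
    gathered from the emitting states (a hypothetical container format).
    Returns equal weights summing to 1, plus stacked means and variances.
    """
    means, variances = [], []
    for gaussians in phoneme_hmms.values():
        for mean, var in gaussians:
            means.append(mean)
            variances.append(var)
    n = len(means)
    weights = np.full(n, 1.0 / n)  # equal weights, summing to 1
    return weights, np.vstack(means), np.vstack(variances)
```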
Fig. 3 is a flowchart of computing the main Gaussian component of a speech frame on the universal background model UBM. As shown in Fig. 3, for each speech frame in the whole training corpus, its score on the probability density function of each Gaussian component of the UBM obtained in step 1 is computed, and the component with the highest score is taken as the frame's main Gaussian component. Specifically:
The UBM contains M Gaussian components. Denote the probability density function of the m-th component by λ_m, or in parametric form N(O; μ_m, Σ_m), where μ_m and Σ_m are the mean and variance of the m-th component. The probability of a speech frame O under λ_m is then computed as follows:
P(O \mid \lambda_m) = \frac{1}{\sqrt{2\pi\Sigma_m}} \exp\left( -\frac{(O-\mu_m)^2}{2\Sigma_m} \right)
For each speech frame O, the Gaussian component of the UBM that maximizes P(O | λ_m) is defined as the main Gaussian component of O; that is, the component m satisfying formula (1) below is the main Gaussian component of O. In other words, for every component m′ of the mixture the probability P(O | λ_m′) is computed, and the component is found on whose density function O scores higher than on any other component. That component is the main Gaussian component, denoted m:
m = \arg\max_{m'} P(O \mid \lambda_{m'}) \qquad (1)
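Assuming a diagonal-covariance UBM, formula (1) amounts to an argmax over the per-component densities, conveniently computed in the log domain; a minimal sketch (the mixture weight is not part of formula (1) and is therefore not included):

```python
import numpy as np

def main_gaussian(frame, means, variances):
    """Pick the main Gaussian component of one frame (formula (1)):
    the component whose density assigns the frame the highest probability.

    frame: (D,) feature vector; means, variances: (M, D) diagonal-covariance
    UBM parameters. Log densities are used for numerical stability.
    """
    log_dens = -0.5 * (np.log(2.0 * np.pi * variances)
                       + (frame - means) ** 2 / variances).sum(axis=1)
    return int(np.argmax(log_dens))
```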
Fig. 4 is a flowchart of force-aligning the speech feature training corpus with the acoustic model of the recognition system to obtain the phoneme to which each speech frame belongs. As shown in Fig. 4, a context-dependent triphone acoustic model is built and used as the baseline system. The baseline system is the comparison system: the performance improvement of the speech recognition system realized by the present invention is obtained by comparison with it. The baseline takes context-dependent initials and tonal finals as basic modeling units, considering only the influence of the phonemes immediately before and after the current phoneme, to form triphone models. Each model takes the form "initial - tonal final + initial (or silence, sil)" or "tonal final (or sil) - initial + tonal final". The form "initial - tonal final + initial (or sil)" means the current unit is a tonal final that follows the pronunciation of one initial and precedes that of another initial (or silence); for example, "zh-ong1+g" is the model of the tonal final ong1 occurring after zh and before g. Similarly, "tonal final (or sil) - initial + tonal final" means the current unit is an initial that follows one tonal final (or silence) and precedes another tonal final; for example, "ong1-g+uo2" is the model of the initial g occurring after ong1 and before uo2. Besides the triphone models of these two structures, the system also builds a silence model sil to describe the portions where no speech signal is present.
Except for the silence model sil, all triphone acoustic models, i.e., all models of the forms "initial - tonal final + initial (or sil)" and "tonal final (or sil) - initial + tonal final", adopt a continuous-density left-to-right HMM structure. Each HMM contains 5 states, of which 3 are emitting. The silence model sil adds transitions between the emitting states to capture the widely varying duration of silence in speech. HMM training is done with the Baum-Welch algorithm based on expectation maximization. As the system against which the present invention is compared, the baseline must be completed at the very beginning; the training of the sil model and the triphone models is therefore finished in the initial stage.
Because the number of acoustic models increases sharply once context is considered, the baseline system adopts a decision-tree-based state clustering algorithm to reduce the number of parameters to be trained.
Step 3 uses the acoustic model of the above baseline system together with the Viterbi algorithm to force-align the speech features of the training corpus. Initially, the training corpus contains only the corresponding phoneme sequences, and the start and end times of each phoneme in a sequence are unknown. That is, although the phoneme sequence of each utterance is known, the onset time and end time of each phoneme in it are not. To obtain the time boundaries of the phonemes, the algorithm uses the baseline acoustic model and the Viterbi algorithm to assign a time boundary to every phoneme in the phoneme sequences corresponding to the training corpus, yielding time-boundary information for the annotation text (the phoneme sequences). For each phoneme in the phone set obtained from the corpus, its start and end positions in the speech features are obtained, and every speech frame lying between them is labeled with that phoneme, so that each speech frame is assigned to some phoneme. The phoneme to which each speech frame belongs is thereby obtained.
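Once the aligner has produced phoneme time boundaries, expanding them into one label per frame is straightforward. The (start, end, phoneme) tuple format below is a hypothetical output format for the alignment, assumed only for the sketch:

```python
def frames_to_phonemes(alignment, n_frames):
    """Expand a forced alignment into one phoneme label per frame.

    alignment: list of (start_frame, end_frame, phoneme) tuples covering the
    utterance, end exclusive (a hypothetical output format of the aligner).
    """
    labels = [None] * n_frames
    for start, end, phone in alignment:
        for t in range(start, min(end, n_frames)):
            labels[t] = phone
    return labels

# e.g. frames_to_phonemes([(0, 12, "sil"), (12, 30, "ong1")], 30)
```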
Fig. 5 is a flowchart of counting the response frequencies between phonemes and Gaussian components. For each speech frame, let m be the main Gaussian component computed in step 2 and p the phoneme obtained in step 3; the response frequency of m and p is denoted C_pm. Counting the response frequency of a phoneme and a Gaussian component means grouping all frames that share the same main Gaussian component and belong to the same phoneme into one class; the number of frames in that class is the response frequency between that phoneme and that Gaussian component. As shown in Fig. 5, if the first speech frame O of the training corpus has main Gaussian component m on the UBM and belongs to phoneme p, then Gaussian component m and phoneme p are said to respond once; that is, the training corpus contains one frame whose main Gaussian is m and which belongs to phoneme p. Counting all speech frames of the training corpus in this way yields the number of responding frames between every Gaussian component of the UBM and every phoneme of the phone set derived from the corpus.
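The counting itself reduces to accumulating a P-by-M matrix over the per-frame results of steps 2 and 3; a minimal sketch:

```python
import numpy as np

def count_responses(main_gaussians, frame_phonemes, phones, n_components):
    """Accumulate the response-frequency matrix C[p, m] (Fig. 5).

    main_gaussians: per-frame main component indices (step 2).
    frame_phonemes: per-frame phoneme labels (step 3).
    phones: ordered list of the phone set, defining the row indices.
    """
    index = {p: i for i, p in enumerate(phones)}
    counts = np.zeros((len(phones), n_components), dtype=np.int64)
    for m, p in zip(main_gaussians, frame_phonemes):
        counts[index[p], m] += 1
    return counts
```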
Fig. 6 is a flowchart of normalizing the response frequencies obtained in step 4 to obtain the guide probabilities. As shown in Fig. 6, step 5 uses two different normalization methods: row normalization according to formula (2), and row-column normalization according to formula (3).
The row normalization method is introduced first. For a phoneme p, the frequency with which any Gaussian component m of the UBM responds to it has been counted as C_pm (1 ≤ m ≤ M), where M is the number of Gaussian components of the UBM. The responses of all components to this phoneme are normalized as shown in formula (2), giving the guide probability, denoted r_pm.
r_{pm} = \frac{C_{pm}}{\sum_{i=1}^{M} C_{pi}} \qquad (2)
Here r_pm is the guide probability, expressing how much of the speech data of a phoneme p a given component m of the UBM has responded to, and C_pi is the response frequency of component i (of the M components) to phoneme p. For example, r_pm = 0.3 means that 30% of the speech data of phoneme p has m as its main Gaussian, or equivalently that 30% of the data of p falls in the corresponding local region.
The row-column normalization method is introduced next. Row-column normalization considers the proportion of the whole speech frame training corpus made up of data that belongs to p and has m as main Gaussian. The normalization is shown in formula (3).
r_{pm} = \frac{C_{pm}}{\sum_{j=1}^{P} \sum_{i=1}^{M} C_{ji}} \qquad (3)
where M is the number of Gaussian components of the UBM, P is the number of phonemes in the phone set obtained from the speech frame training corpus, and C_ji is the response frequency between the i-th of the M Gaussian components and the j-th of the P phonemes.
In addition, considering that a guide probability may be extremely small or even zero, the present invention sets a floor threshold T to smooth zero probabilities and avoid directly pruning a path merely because its guide probability is too small. For row normalization, T = 1/M; for row-column normalization, T = 1/(MP). That is, in row normalization, if r_pm < 1/M then r_pm is set to 1/M, otherwise r_pm is kept unchanged; in row-column normalization, if r_pm < 1/(MP) then r_pm is set to 1/(MP), otherwise r_pm is kept unchanged. This yields the smoothed guide probabilities.
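Both normalizations and the floor T can be expressed compactly over the count matrix; the sketch below assumes every phoneme occurs at least once in the corpus, so that the row sums are nonzero:

```python
import numpy as np

def guide_probabilities(counts, mode="row"):
    """Normalize the response-frequency matrix into guide probabilities and
    apply the floor threshold T (formulas (2) and (3))."""
    counts = counts.astype(np.float64)
    P, M = counts.shape
    if mode == "row":                       # formula (2): per-phoneme rows
        r = counts / counts.sum(axis=1, keepdims=True)
        floor = 1.0 / M
    else:                                   # formula (3): whole-corpus total
        r = counts / counts.sum()
        floor = 1.0 / (M * P)
    return np.maximum(r, floor)             # smooth zeros with threshold T
```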
Fig. 7 is a flowchart of fusing the guide probabilities obtained in step 5 into the speech recognition decoding process. As shown in Fig. 7, during decoding, for the current frame O_t to be decoded, the guide probability of O_t must be computed in addition to the acoustic model probability P_am and the language model probability P_lm. First, the probability of O_t on each Gaussian component of the UBM is computed by formula (1), and the component with the highest probability is taken as the main Gaussian component of O_t, denoted m. Then the phoneme p to which O_t belongs is obtained from the position that path extension has reached during decoding: for example, if a path has been extended into the silence model sil or into a triphone model of the form "a-b+c", the phoneme of the current frame is sil or b respectively, where a, b, c satisfy the forms "initial - tonal final + initial (or sil)" or "tonal final (or sil) - initial + tonal final". Finally, the guide probability of Gaussian component m for phoneme p is looked up in the statistics accumulated in step 5. By adding the guide probability to the traditional total path score formula (4), the fused total path score formula (5) is obtained, which strengthens the traditional path score. By incorporating the positional information of the speech frame in the acoustic feature space in this way, the current path can be restricted or strengthened in a more targeted manner, and potential paths are retained as far as possible.
The traditional total path score is computed as shown in formula (4):
P(t) = P(t-1) + \alpha_1 P_{am} + \alpha_2 P_{lm} \qquad (4)
Here P(t-1), P_am, and P_lm are all probabilities in logarithmic form; t is the current time and t-1 the previous time. P(t-1) is the overall probability from the first instant up to time t-1. P_am is the acoustic probability, i.e., the probability of the current frame O_t on the state being extended to, computed from the Gaussian mixture model of that state. P_lm is the language model probability, representing the word-level dependencies. α_1 and α_2 are the weights of the acoustic and language model probabilities; they are tuned via formula (4) so that the system attains the lowest Chinese character error rate. Within a certain range, one pair of α_1 and α_2 values is fixed, the overall probability is computed by formula (4), and a set of recognition results is obtained; the values are then changed and a new set of results is obtained under the new values. Repeating this, the pair of α_1 and α_2 values giving the lowest Chinese character error rate is selected and used for fusing the guide probability in the next step.
The total path score of speech recognition fused with the guide probability is computed as shown in formula (5):
P(t) = P(t-1) + \alpha_1 P_{am} + \alpha_2 P_{lm} + \alpha_3 r_{pm} \qquad (5)
where r_pm is the guide probability of frame O_t for phoneme p when m is its main Gaussian, and α_3 is the weight of the guide probability.
Afterwards, α_3 is assigned different values in turn, and for each assignment the overall probability is computed by formula (5), yielding different recognition results. By repeatedly assigning values to the guide probability weight α_3 and selecting the best, the speech recognition decoding algorithm of the present invention that integrates the guide probability is obtained.
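A single step of the fused recursion, formula (5), can be sketched as follows. The sketch adds r_pm directly, exactly as formula (5) is written; whether the guide probability should instead enter in logarithmic form like the other two terms is not stated in the patent, so a log-domain variant would be an assumption:

```python
def fused_path_score(prev_score, p_am, p_lm, r_pm, a1=1.0, a2=1.0, a3=1.0):
    """One step of the fused path-score recursion, formula (5).

    prev_score, p_am and p_lm are log-domain scores, as stated for
    formula (4); r_pm is the smoothed guide probability looked up for the
    current frame's main Gaussian m and the phoneme p of the hypothesis
    being extended. The default weights are placeholders: the patent tunes
    a1 and a2 first on the baseline, then sweeps a3.
    """
    return prev_score + a1 * p_am + a2 * p_lm + a3 * r_pm

# Hypothetical tuning sweep over the guide-probability weight:
# for a3 in (0.1, 0.5, 1.0, 2.0):
#     decode the test set with fused_path_score(..., a3=a3) and keep the
#     value giving the lowest Chinese character error rate.
```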
The performance of the above algorithm proposed by the present invention was tested on a Chinese large-vocabulary continuous speech recognition system. The hardware platform of the experiment is a PC with a 3.0 GHz Intel CPU and 4 GB of memory; about 180-250 MB of memory is used during operation. The baseline system adopts context-dependent triphone acoustic models. The basic phone set used in the experiment contains 24 initials and 37 finals, each final carrying 5 tones. In the Mandarin recognition experiment, after removing the initials and finals that do not occur in the training corpus, the phone set contains 191 basic phonemes in total. Once the influence of context is considered, the number of acoustic models is 204,388; after applying the decision-tree-based state clustering algorithm, the models contain 4,575 shared states in total. The Chinese dictionary contains 48,188 entries, and a bigram statistical language model is used during recognition.
The acoustic model, the language model, and the guide probability of the present invention influence recognition performance to different degrees. First, the weights α_1 and α_2 of the acoustic and language model probabilities are tuned in the baseline system, giving a minimum Chinese character error rate of 12.78%. On this basis, adding the guide probability clearly reduces the Chinese character error rate, to 11.61%. Compared with the traditional baseline system, the error rate drops by 9.15% relative.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A speech recognition decoding method integrating a guide probability, characterized by comprising the following steps:
Step a: train a universal background model to describe the whole acoustic feature space;
Step b: compute the main Gaussian component of each speech frame on said universal background model;
Step c: force-align the training corpus with the acoustic model to obtain the phoneme to which each speech frame belongs;
Step d: count the response frequencies between phonemes and the Gaussian components of said universal background model;
Step e: compute the guide probabilities from said response frequencies;
Step f: fuse the guide probability into the computation of the total path score of speech recognition, thereby strengthening or weakening the scores of recognition paths.
2. the method for claim 1 is characterized in that, the described universal background model of one of dual mode training below using among the described step a:
First, a Gaussian mixture model is trained with the expectation-maximization algorithm while the number of Gaussian components in said mixture model is gradually increased, finally yielding the universal background model;
Second, a Hidden Markov Model (HMM) is built for each phoneme of the training corpus; the parameters of said HMMs are then updated with the Baum-Welch algorithm to obtain the trained HMMs; afterwards, the Gaussian components of said HMMs are pooled with equal weights to obtain an initial universal background model, and the EM algorithm is used to adjust the parameters of each Gaussian component of the resulting model, yielding the final universal background model.
3. the method for claim 1 is characterized in that, among the step b, for speech frame O, its main gaussian component is the gaussian component of probable value maximum in described universal background model.
4. The method of claim 3, characterized in that the probability value of said speech frame O in said universal background model is computed as follows:
P(O \mid \lambda_m) = \frac{1}{\sqrt{2\pi\Sigma_m}} \exp\left( -\frac{(O-\mu_m)^2}{2\Sigma_m} \right)
where λ_m is the probability density function of the m-th Gaussian component of said universal background model, and μ_m, Σ_m are the mean and variance of the m-th Gaussian component, respectively.
5. the method for claim 1 is characterized in that, described step c specifically comprises:
Set up the three-tone acoustic model, and utilize described three-tone acoustic model and Viterbi algorithm to each the phoneme time division border in the corresponding aligned phoneme sequence of training corpus, obtain zero-time position and the termination time position of each phoneme in the described aligned phoneme sequence, and will be in speech frame between described zero-time position and the termination time position, be labeled as and belong to this phoneme, obtain phoneme under each speech frame with this.
6. the method for claim 1, it is characterized in that, in the described steps d, the response frequency of gaussian component is in described phoneme and the described universal background model: for each gaussian component and each phoneme, take described gaussian component as main gaussian component and belong to the frame number of the speech frame of described phoneme.
7. the method for claim 1 is characterized in that, uses row normalization to calculate described guiding probability among the described step e:
r_{pm} = \frac{C_{pm}}{\sum_{i=1}^{M} C_{pi}}
where r_pm is said guide probability, C_pm is the response frequency between Gaussian component m of said universal background model and said phoneme p, C_pi is the response frequency between the i-th component of said universal background model and phoneme p, and M is the number of Gaussian components of said universal background model.
8. the method for claim 1 is characterized in that, uses ranks normalization to calculate described guiding probability among the described step e:
r_{pm} = \frac{C_{pm}}{\sum_{j=1}^{P} \sum_{i=1}^{M} C_{ji}}
where r_pm is said guide probability, C_pm is the response frequency between Gaussian component m of said universal background model and said phoneme p, C_ji is the response frequency between the i-th Gaussian component of said universal background model and the j-th phoneme, M is the number of Gaussian components of said universal background model, and P is the number of phonemes in said speech training corpus.
9. The method of claim 7 or 8, characterized in that a floor threshold T is set to smooth zero probabilities as follows:
in row normalization, if r_pm < 1/M then r_pm is set to 1/M, otherwise r_pm is kept unchanged; in row-column normalization, if r_pm < 1/(MP) then r_pm is set to 1/(MP), otherwise r_pm is kept unchanged.
10. the method for claim 1 is characterized in that, described step f comprises the PTS that uses path in the described guiding probability calculation speech recognition process:
P(t) = P(t-1) + \alpha_1 P_{am} + \alpha_2 P_{lm} + \alpha_3 r_{pm}
where P(t-1) is the historical path score, P_am is the acoustic model probability of the current frame, P_lm is the language model probability, α_1 and α_2 are the weights of the acoustic model probability and the language model probability respectively, r_pm is the guide probability of speech frame O for phoneme p when m is its main Gaussian component, and α_3 is the weight of the guide probability.
CN2012105607454A 2012-12-20 2012-12-20 Speech recognition optimization decoding method integrating guide probability Pending CN102982799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105607454A CN102982799A (en) 2012-12-20 2012-12-20 Speech recognition optimization decoding method integrating guide probability


Publications (1)

Publication Number Publication Date
CN102982799A true CN102982799A (en) 2013-03-20

Family

ID=47856710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105607454A Pending CN102982799A (en) 2012-12-20 2012-12-20 Speech recognition optimization decoding method integrating guide probability

Country Status (1)

Country Link
CN (1) CN102982799A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1655232A (en) * 2004-02-13 2005-08-17 松下电器产业株式会社 Context-sensitive Chinese Speech Recognition Modeling Method
US7203652B1 (en) * 2002-02-21 2007-04-10 Nuance Communications Method and system for improving robustness in a speech system
US20100318355A1 (en) * 2009-06-10 2010-12-16 Microsoft Corporation Model training for automatic speech recognition from imperfect transcription data
CN102237082A (en) * 2010-05-05 2011-11-09 三星电子株式会社 Self-adaption method of speech recognition system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨占磊 (Yang Zhanlei), "融合引导概率的语音识别解码算法研究" [Research on a speech recognition decoding algorithm fusing guide probability], 《声学学报》 (Acta Acustica), vol. 37, no. 2, 31 March 2012 (2012-03-31), pages 209-217 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103915092A (en) * 2014-04-01 2014-07-09 百度在线网络技术(北京)有限公司 Voice identification method and device
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN105654944B (en) * 2015-12-30 2019-11-01 中国科学院自动化研究所 It is a kind of merged in short-term with it is long when feature modeling ambient sound recognition methods and device
CN105654944A (en) * 2015-12-30 2016-06-08 中国科学院自动化研究所 Short-time and long-time feature modeling fusion-based environmental sound recognition method and device
CN105845128A (en) * 2016-04-06 2016-08-10 中国科学技术大学 Voice identification efficiency optimization method based on dynamic pruning beam prediction
CN105845128B (en) * 2016-04-06 2020-01-03 中国科学技术大学 Voice recognition efficiency optimization method based on dynamic pruning beam width prediction
CN106409283A (en) * 2016-08-31 2017-02-15 上海交通大学 Audio frequency-based man-machine mixed interaction system and method
CN106409283B (en) * 2016-08-31 2020-01-10 上海交通大学 Man-machine mixed interaction system and method based on audio
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN110490213B (en) * 2017-09-11 2021-10-29 腾讯科技(深圳)有限公司 Image recognition method, device and storage medium
CN110490213A (en) * 2017-09-11 2019-11-22 腾讯科技(深圳)有限公司 Image-recognizing method, device and storage medium
CN109377981A (en) * 2018-11-22 2019-02-22 四川长虹电器股份有限公司 The method and device of phoneme alignment
CN109377981B (en) * 2018-11-22 2021-07-23 四川长虹电器股份有限公司 Phoneme alignment method and device
CN111583906A (en) * 2019-02-18 2020-08-25 中国移动通信有限公司研究院 Character recognition method, device and terminal for voice conversation
CN111583906B (en) * 2019-02-18 2023-08-15 中国移动通信有限公司研究院 Method, device and terminal for role recognition of voice conversation
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN113888777B (en) * 2021-09-08 2023-08-18 南京金盾公共安全技术研究院有限公司 A voiceprint unlocking method and device based on cloud machine learning
CN114566155A (en) * 2022-03-14 2022-05-31 成都启英泰伦科技有限公司 Feature reduction method for continuous speech recognition


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130320