Summary of the invention
The objective of the invention is to design a portable spoken-language translation system based on the WinCE platform that, under the resource constraints of an embedded system, achieves large-vocabulary recognition with a high recognition rate and realizes two-way spoken translation from Chinese to English and from English to Chinese.
Another object of the present invention is to provide the speech recognition method used by this translation system.
To achieve the foregoing objectives, the present invention comprises the following technical features. A portable spoken-language translation system based on the WinCE platform is characterized in that it comprises a voice acquisition device, a speech preprocessing module, a feature extraction and modeling module, a model bank, a recognition module, a corpus, and a translation and speech synthesis module, all of which are built on the embedded platform. The voice acquisition device is connected to the speech preprocessing module; the speech preprocessing module is connected to the feature extraction and modeling module; the feature extraction and modeling module is connected to either the model bank or the recognition module: when the training state is selected it is connected to the model bank, and when the recognition state is selected it is connected to the recognition module. The recognition module is connected to the translation and speech synthesis module, which in turn is connected to the corpus. After decision scoring, the optimal result from the recognition module is translated into text by the translation and speech synthesis module and output in speech form. Through language selection, two-way spoken translation from Chinese to English or from English to Chinese is realized.
The speech preprocessing module comprises, connected in sequence, a pre-emphasis unit, a frame-splitting unit, a windowing unit and an endpoint detection unit. The pre-emphasis unit is connected to the voice acquisition device, and the endpoint detection unit is connected to the feature extraction and modeling module.
The pre-emphasis unit is a high-frequency-boost pre-emphasis digital filter.
The frame-splitting unit splits the signal into frames using overlapping frames.
The windowing unit applies a Hamming window function to each frame.
The endpoint detection unit uses a double-threshold comparison based on short-time energy E and short-time average zero-crossing rate Z as features, and computes the zero-crossing threshold Z_cT and the high and low energy thresholds from the leading silent segment to serve as detection thresholds.
The feature extraction and modeling module extracts MFCC speech features as the recognition features and adopts a hidden Markov model for training and recognition; the hidden Markov model consists of a Markov chain and a general stochastic process.
The hidden Markov model uses the forward-backward probability algorithm to solve the evaluation problem, the Viterbi algorithm to solve the decoding problem, and the Baum-Welch iterative algorithm to solve the learning problem.
Specifically, the forward-backward probability algorithm solves the evaluation problem: given the hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, compute the likelihood P(O|λ).
The Viterbi algorithm solves the decoding problem: given the hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, find the state sequence S = q_1, q_2, ..., q_T that most probably generated this observation sequence.
For an unknown hidden Markov model, the Baum-Welch iterative algorithm is used to estimate the model parameters.
The present invention also comprises a speech recognition method for the portable spoken-language translation system based on the WinCE platform, characterized by the following steps:
(1) Train the hidden Markov model to obtain the model parameters.
(2) Take the speech features produced by the feature extraction module as the observation sequence of the hidden Markov model; the speech units obtained by training form the state sequence, and the state transition sequence is solved by the Viterbi algorithm.
(3) Apply a decision step to obtain the state transition sequence with the highest probability.
(4) Map the optimal state sequence to candidate phonemes or syllables, and finally form words and sentences through the language model.
In step (1), the hidden Markov model parameters are first initialized, and the Baum-Welch iterative algorithm is then used to estimate the model parameters.
In step (1), the training algorithm is iterated repeatedly to obtain the result, and a termination condition is also provided: when the relative change of the likelihood is less than ε, the iteration ends; in addition, a maximum iteration count N is set, and when the number of iterations exceeds N the iteration also stops. The Baum-Welch algorithm further adopts scale factors to correct the data underflow problem of the algorithm.
The present invention is a portable spoken-language translation system based on the WinCE platform and a speech recognition method for it. Its hardware core is an embedded processor, and the embedded system offers low cost, low power consumption, high performance and strong portability. The speech preprocessing module comprises a pre-emphasis unit, a frame-splitting unit, a windowing unit and an endpoint detection unit; by preprocessing the collected speech signal, the embedded system becomes more efficient during later recognition and the recognition accuracy is higher. A hidden Markov model is adopted: the model bank is trained and then used for recognition, which makes the recognition process more accurate and efficient. Compared with the prior art, the present invention offers two-way translation, low cost, low power consumption, high performance and strong portability, and has a large consumer market in the field of speech recognition systems.
Embodiment
The present invention is a portable spoken-language translation system based on the WinCE platform; a speech recognition system based on WinCE has been designed and implemented. Embedded systems offer low cost, low power consumption and high performance, and their core is the embedded processor. At present, ARM microprocessors mainly include the ARM7, ARM9, ARM9E, ARM10E and ARM11 series, with ever-increasing capability. The present invention uses the embedded scientific research platform UP-CPU 6410 with Samsung's latest S3C6410X (ARM11) embedded microprocessor, which runs at up to 633 MHz and is based on the ARM1176JZF-S core with the ARM v6 architecture.
The module block diagram of the present invention is shown in Figure 6. The speech signal is captured by the microphone of the voice acquisition device 1 and passed to the speech preprocessing module 2, which performs pre-emphasis, frame splitting, windowing and endpoint detection; these functions are realized by the pre-emphasis unit 21, the frame-splitting unit 22, the windowing unit 23 and the endpoint detection unit 24. The feature extraction and modeling module 3 then extracts features from the speech and trains the speech models; it is connected to the model bank 4 or the recognition module 5. The translation and speech synthesis module 7 reads the corpus 6, translates the result into text and outputs synthesized speech.
Each of the modules and units involved is described below:
One, pre-emphasis unit 21
The average power spectrum of the speech signal is affected by glottal excitation and lip/nostril radiation: above roughly 800 Hz the high-frequency end falls off at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component. For this reason the high-frequency part must be boosted before the speech signal is analyzed. A 6 dB/oct high-frequency-boost pre-emphasis digital filter is therefore usually applied to the speech signal before analysis; it boosts the high-frequency part so that the spectrum of the signal becomes flatter and remains comparable across the whole band from low to high frequency, allowing the spectrum to be estimated with the same signal-to-noise ratio. The filter response function is:

H(z) = 1 − αz⁻¹, 0.9 ≤ α ≤ 1.0

where α is the pre-emphasis factor, usually taken as 0.9375. The output of the pre-emphasis network is then related to the input speech signal s(n) by the difference equation y(n) = s(n) − αs(n−1).
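As an illustration, a minimal Python sketch of this pre-emphasis step (assuming the speech samples are available as a NumPy array; the function name is chosen here only for illustration):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.9375) -> np.ndarray:
    """Apply the high-frequency-boost filter H(z) = 1 - alpha*z^-1,
    i.e. y(n) = s(n) - alpha*s(n-1)."""
    out = np.empty_like(signal, dtype=float)
    out[0] = signal[0]                        # first sample has no predecessor
    out[1:] = signal[1:] - alpha * signal[:-1]
    return out
```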
Two, frame-splitting unit 22
A speech signal is time-varying, but within a short interval its characteristics remain essentially unchanged, i.e. relatively stable; this property is called the "short-time property", and the short interval is generally 10–30 ms. The analysis and processing of speech signals is therefore generally based on this short-time property, i.e. "short-time analysis" is performed and the speech stream is split into frames. The number of frames per second depends on the actual conditions. Frames may be taken contiguously or with overlap; since successive speech samples are correlated, the present invention adopts overlapping frames. In this way the whole speech signal is represented by the time series of characteristic parameters obtained from each frame.
Three, windowing unit 23
Because the speech signal is short-time stationary, it can be processed frame by frame. To emphasize the speech waveform around sample n and attenuate the rest of the waveform, each frame is also windowed. Processing each short segment of speech amounts to applying some transform or operation to it; the general expression is

Q_n = Σ_m T[s(m)] w(n − m)

where T[·] denotes some transform, which may be linear or non-linear, s(n) is the input speech signal sequence, w(n) is the window, and Q_n is the time series obtained after processing each segment. The present invention uses the Hamming window

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1.
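A sketch of the framing and Hamming-windowing steps might look as follows; the frame length and frame shift are illustrative values rather than values fixed by the text:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 256,
                     frame_shift: int = 128) -> np.ndarray:
    """Split the signal into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame.
    Assumes len(signal) >= frame_len."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```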
Four, endpoint detection unit 24
Endpoint detection in speech processing serves mainly to detect the start point and end point of speech automatically. The present invention uses the double-threshold comparison method for endpoint detection. This method takes the short-time energy E and the short-time average zero-crossing rate Z as features; by combining the advantages of Z and E it makes detection more accurate, effectively reduces the processing time of the system, improves the real-time performance of the processing, and can exclude the noise of unvoiced segments, thereby improving recognition performance.
In the double-threshold method, the short-time energy E and the short-time average zero-crossing rate Z are computed as follows:
(1) short-time energy E
The short-time energy of the speech signal s(n) is defined as

E_n = Σ_m [s(m) w(n − m)]²

where w(n) is the Hamming window function. If we let h(n) = w²(n), then

E_n = Σ_m s²(m) h(n − m).

This shows that the short-time energy is equivalent to passing the squared signal s²(n) through a linear filter whose unit-sample response is h(n). Its realization block diagram is as follows:

(Realization block diagram of the short-time energy.)

E_n is then the short-time average energy of the speech frame indexed by n.
(2) Short-time average zero-crossing rate Z

The short-time average zero-crossing rate is defined as

Z_n = (1/2) Σ_m |sgn[s(m)] − sgn[s(m − 1)]| w(n − m)

where sgn[·] is the sign function, i.e. sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0, s(n) is the speech signal and w(n) is a window function. Its realization block diagram is constructed in the same way as that of the short-time energy.
The short interval at the beginning of the speech signal is uniformly distributed background noise. When the double-threshold method is used for endpoint detection, the zero-crossing threshold Z_cT and the low and high energy thresholds ETL and ETU must be computed from this leading "silent" segment and used as detection thresholds, after which the endpoints can be detected accurately.

The zero-crossing threshold is Z_cT = min(IF, Z_c + 2σ_Zc), where IF is an empirical value (the present invention takes IF = 25), and Z_c and σ_Zc are respectively the mean and the standard deviation of the zero-crossing rate of the initial "silent" segment.
For ETL (the low energy threshold) and ETU (the high energy threshold), the short-time average energy of the "silent" segment is computed first; its maximum is denoted E_max and its minimum E_min. Let:
I1 = 0.03 × (E_max − E_min) + E_min
I2 = 4 × E_min

Then:

ETL = min(I1, I2)
ETU = 5 × ETL
When Z_cT, ETL and ETU are used as thresholds for detection, let N1 be the start frame; then the energy E_N1 and zero-crossing rate Z_N1 at frame N1 simultaneously satisfy ETU > E_N1 > ETL, E_{N1+1} > ETU and Z_N1 > Z_cT. At the end frame N2, the energy E_N2 and zero-crossing rate Z_N2 simultaneously satisfy the corresponding energy condition (with adjustment coefficient k = 4) and Z_N2 < Z_cT.
By adopting the double-threshold method and taking neighbouring frames into account, the influence of noise can be effectively avoided and the detection accuracy improved, which makes feature extraction efficient and benefits the recognition rate.
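The double-threshold detection with the thresholds defined above can be sketched roughly as follows. Since the exact end-frame energy condition is not reproduced in the text, the condition E < ETL/k (with k = 4) used here is only an assumption, and the number of leading silent frames is likewise illustrative:

```python
import numpy as np

def short_time_features(frames: np.ndarray):
    """Per-frame short-time energy E and zero-crossing count Z
    for already framed (and windowed) speech."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    signs[signs == 0] = 1
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    return energy, zcr

def double_threshold_endpoints(frames: np.ndarray, silence_frames: int = 10,
                               IF: float = 25.0, k: float = 4.0):
    """Double-threshold endpoint detection: Z_cT, ETL and ETU are estimated
    from the leading 'silent' frames and used as detection thresholds."""
    E, Z = short_time_features(frames)
    # zero-crossing threshold from the silent segment: Z_cT = min(IF, Zc + 2*sigma)
    ZcT = min(IF, Z[:silence_frames].mean() + 2.0 * Z[:silence_frames].std())
    # low/high energy thresholds from the silent segment
    Emax, Emin = E[:silence_frames].max(), E[:silence_frames].min()
    ETL = min(0.03 * (Emax - Emin) + Emin, 4.0 * Emin)
    ETU = 5.0 * ETL
    # start frame N1: ETU > E > ETL, next-frame energy above ETU, Z above Z_cT
    start = next((i for i in range(silence_frames, len(E) - 1)
                  if ETL < E[i] < ETU and E[i + 1] > ETU and Z[i] > ZcT), None)
    # end frame N2: assumed energy condition E < ETL/k and Z below Z_cT
    end = None
    if start is not None:
        end = next((i for i in range(start + 1, len(E))
                    if E[i] < ETL / k and Z[i] < ZcT), None)
    return start, end
```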
Five, feature extraction and modeling module 3
The present invention extracts MFCC speech features, which are based on auditory characteristics, as the recognition features. Mel-frequency cepstral coefficients (MFCC) are derived from the characteristics of the human auditory system and imitate the human ear's perception of speech at different frequencies. The ear's discrimination of sound frequency behaves roughly like a logarithmic operation: on the Mel frequency scale, human perception of pitch is linear; for example, if the Mel frequencies of two speech segments differ by a factor of two, the perceived pitch also differs by a factor of two.
The MFCC algorithm in the feature extraction module 3 proceeds as follows:
1. Fast Fourier transform (FFT): x[n] (n = 0, 1, 2, ..., N − 1) is one frame of the sampled discrete speech sequence and N is the frame length. X[k] is the resulting N-point complex sequence, from which the magnitude spectrum |X[k]| is taken.
2. Conversion of the actual frequency scale to the Mel frequency scale:

Mel(f) = 2595 lg(1 + f/700)

where Mel(f) is the Mel frequency and f is the actual frequency in Hz.
3. A triangular filter bank is configured and the output of each triangular filter applied to the magnitude spectrum |X[k]| is computed:

F(l) = Σ_k w_l(k) |X[k]|

where w_l(k) are the coefficients of the corresponding filter; o(l), c(l) and h(l) are the lower cut-off, centre and upper cut-off frequencies of that filter on the actual frequency axis; f_s is the sampling rate; L is the number of filters; and F(l) is the filter output.
4. The logarithm of every filter output is taken and a discrete cosine transform (DCT) is then applied to obtain the MFCC:

M(i) = √(2/L) Σ_{l=1}^{L} ln F(l) · cos(πi(l − 0.5)/L), i = 1, ..., Q

where Q is the order of the MFCC parameters, generally 12, and M(i) are the resulting MFCC parameters.
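The four MFCC steps can be sketched as follows; the sampling rate, number of filters and MFCC order are illustrative defaults, and the DCT normalization follows the common convention rather than anything fixed by the text:

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame: np.ndarray, fs: int = 8000, n_filters: int = 24, q: int = 12):
    """MFCC of one windowed frame: |FFT| -> Mel triangular filter bank
    -> log -> DCT, keeping the first q coefficients M(1..q)."""
    spectrum = np.abs(np.fft.rfft(frame))                 # |X[k]|
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # filter edges o(l), c(l), h(l): equally spaced on the Mel scale
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    fbank = np.zeros(n_filters)
    for l in range(n_filters):
        o, c, h = edges[l], edges[l + 1], edges[l + 2]
        rising = np.clip((freqs - o) / (c - o), 0.0, 1.0)
        falling = np.clip((h - freqs) / (h - c), 0.0, 1.0)
        fbank[l] = np.dot(np.minimum(rising, falling), spectrum)   # F(l)
    log_f = np.log(fbank + 1e-10)
    # DCT of the log filter-bank outputs
    i = np.arange(1, q + 1).reshape(-1, 1)
    l = np.arange(n_filters)
    dct = np.cos(np.pi * i * (l + 0.5) / n_filters)
    return np.sqrt(2.0 / n_filters) * dct.dot(log_f)      # M(i), i = 1..q
```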
The speech model of the present invention is the hidden Markov model. A hidden Markov model (HMM, Hidden Markov Model) is a statistical signal processing model that describes the statistical properties of a random process with probability model parameters; it developed from the Markov chain. An HMM has two components: a Markov chain, which describes the state transitions by means of transition probabilities, and a general stochastic process, which describes the relation between the states and the observation sequence by means of observation probabilities. Its composition is shown in Figure 1.
The HMM can be expressed as λ = (N, M, π, A, B), where:

N: the number of states of the Markov chain in the model. The N states are denoted θ_1, ..., θ_N, and the state of the Markov chain at time t is denoted q_t, with q_t ∈ (θ_1, ..., θ_N).

M: the number of possible observations for each state. The M observations are denoted V_1, ..., V_M, and the observation vector at time t is denoted O_t, with O_t ∈ (V_1, ..., V_M).

π: the initial state probability vector, π = (π_1, ..., π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N.

A: the state transition probability matrix, A = (a_ij)_{N×N}, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N, is the probability of moving from state i to state j.

B: the output probability matrix, B = (b_ik)_{N×M}, where b_ik = P(O_t = V_k | q_t = θ_i), 1 ≤ i ≤ N, 1 ≤ k ≤ M, is the probability of producing output V_k while in state i.

Since a_ij, b_ik and π_i are all probabilities, they must satisfy the normalization conditions: a_ij ≥ 0, b_ik ≥ 0, π_i ≥ 0, and

Σ_j a_ij = 1,  Σ_k b_ik = 1,  Σ_i π_i = 1.
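The parameter set λ = (π, A, B) maps naturally onto a small data structure; the sketch below (with names chosen for illustration) also checks the normalization constraints:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    """Discrete HMM lambda = (pi, A, B) with N states and M observation symbols."""
    pi: np.ndarray   # initial state probabilities, shape (N,)
    A: np.ndarray    # a_ij = P(q_{t+1} = j | q_t = i), shape (N, N)
    B: np.ndarray    # b_ik = P(O_t = V_k | q_t = i), shape (N, M)

    def check(self, tol: float = 1e-6) -> bool:
        """Verify the normalization constraints: pi and every row of A and B sum to 1."""
        return (abs(self.pi.sum() - 1.0) < tol
                and np.allclose(self.A.sum(axis=1), 1.0, atol=tol)
                and np.allclose(self.B.sum(axis=1), 1.0, atol=tol))
```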
An HMM involves three problems:

1. The evaluation problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, compute the likelihood P(O|λ). The most basic theoretical method is to add up the probabilities over every possible state sequence S = q_1, q_2, ..., q_T, i.e.

P(O|λ) = Σ_S P(O|S, λ) P(S|λ).

However, the complexity of this method is of the order T·N^T, so the amount of computation is enormous; the forward-backward algorithm solves this evaluation problem effectively in recognition, with a computation of the order N²·T.
Define the forward variable α_t(i) = P(o_1 o_2 ... o_t, q_t = i | λ): under model λ, the probability of having observed o_1 ... o_t up to time t with the state at time t being i. The forward variable at the next time step is computed by

α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}).
A schematic diagram of the forward-backward algorithm is shown in Figure 2.
Define the backward variable β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ): the probability of the observation sequence (o_{t+1} o_{t+2} ... o_T) from the final time T backward to time t + 1, given that the state at time t is i. The backward variable at the previous time step is computed by

β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j).

The schematic diagram of the backward algorithm is similar to that of the forward method, only with the direction reversed.
When the forward and backward probabilities are used together to solve the evaluation problem, the computation is

P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i).
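A sketch of a scaled forward-backward computation for the evaluation problem, using the HMM structure sketched above; rescaling at every frame corresponds to the scale factors used later to avoid data underflow:

```python
import numpy as np

def forward_backward(hmm, obs):
    """Scaled forward-backward pass for the evaluation problem.
    Returns alpha, beta, the per-frame scale factors and log P(O | lambda)."""
    T, N = len(obs), len(hmm.pi)
    alpha, beta, scale = np.zeros((T, N)), np.zeros((T, N)), np.zeros(T)

    # forward: alpha_t(i) = P(o_1..o_t, q_t = i | lambda), rescaled at every frame
    alpha[0] = hmm.pi * hmm.B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = alpha[t - 1].dot(hmm.A) * hmm.B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # backward: beta_t(i) = P(o_{t+1}..o_T | q_t = i, lambda), same scale factors
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = hmm.A.dot(hmm.B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    return alpha, beta, scale, np.sum(np.log(scale))   # log P(O | lambda)
```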
2. The decoding problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, find the state sequence S = q_1, q_2, ..., q_T that the system most probably passed through to produce this observation sequence, i.e. solve for the state sequence S that maximizes P(S|O, λ). Since

P(S|O, λ) = P(S, O|λ) / P(O|λ)

and P(O|λ) is the same for every S, the decoding problem is equivalent to finding the state sequence S that maximizes P(S, O|λ). The decoding problem is solved with the Viterbi algorithm.
Define δ_t(i) as the maximum probability over all state sequences that are in state i at time t, taken jointly with the first t − 1 states and the observations; the recursion of the algorithm is

δ_{t+1}(j) = max_i [δ_t(i) a_ij] b_j(o_{t+1}).
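A sketch of the Viterbi recursion and back-tracking for the decoding problem; working with log probabilities here is only a convenience to avoid underflow and is not prescribed by the text:

```python
import numpy as np

def viterbi(hmm, obs):
    """Viterbi decoding: the state sequence S maximising P(S, O | lambda)."""
    T, N = len(obs), len(hmm.pi)
    log_A = np.log(hmm.A + 1e-300)
    log_B = np.log(hmm.B + 1e-300)
    delta = np.zeros((T, N))            # delta_t(i): best log-probability ending in state i
    psi = np.zeros((T, N), dtype=int)   # back-pointers

    delta[0] = np.log(hmm.pi + 1e-300) + log_B[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_A     # cand[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_B[:, obs[t]]

    path = np.zeros(T, dtype=int)                # backtrack the optimal state sequence
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()                 # state sequence and its log score
```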
3. The learning problem
For an unknown HMM, given an observation sequence O = O_1, O_2, ..., O_T produced by the system, determine the model λ = (π, A, B), i.e. find the model parameters π, A, B that maximize the joint probability P(O|λ). The learning problem corresponds to the parameter training process of the HMM: only the observed data are available and there is no description of the states, so the maximum likelihood is usually chosen as the optimization target and, building on expectation maximization (EM), the Baum-Welch iterative algorithm is used to estimate the model parameters. Let ξ_t(i, j) denote the probability that the state is i at time t and j at time t + 1:

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ).

Let γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) denote the probability that the state is i at time t; then Σ_{t=1}^{T−1} γ_t(i) is the expected number of transitions out of state i, and Σ_{t=1}^{T−1} ξ_t(i, j) is the expected number of transitions from state i to state j.

The re-estimation formula for the state transition matrix is therefore

a_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i),

and the re-estimation formula for the output probability matrix is

b_ik = Σ_{t: O_t = V_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i).
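One Baum-Welch re-estimation step based on the scaled forward/backward variables sketched earlier might look as follows; wrapping it in a loop that stops when the relative change of the log-likelihood falls below ε or the iteration count exceeds N reproduces the training procedure described above:

```python
import numpy as np

def baum_welch_step(hmm, obs, alpha, beta):
    """One Baum-Welch re-estimation step from the scaled alpha/beta variables.
    gamma_t(i): probability of being in state i at time t;
    xi_t(i,j): probability of being in i at t and j at t+1."""
    T, N = len(obs), len(hmm.pi)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        x = (alpha[t][:, None] * hmm.A *
             hmm.B[:, obs[t + 1]][None, :] * beta[t + 1][None, :])
        xi[t] = x / x.sum()

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]      # a_ij re-estimate
    new_B = np.zeros_like(hmm.B)
    obs = np.asarray(obs)
    for k in range(hmm.B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)                 # b_ik numerator
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```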
The HMM speech recognition process of the present invention is as follows:
In speech recognition, the MFCC speech features obtained by the feature extraction module form the observation sequence of the HMM, while the states are the speech units obtained by training. Therefore, to build the HMM and perform recognition, the HMM parameters must be obtained by training the model; the training process of the present invention, shown in Figure 3, achieves a good training effect.
In the training process, the HMM parameters are first initialized and the Baum-Welch iterative algorithm is then used to estimate the model parameters. In practice, the training algorithm must be iterated several times before a result is obtained, so a termination condition must also be given: when the relative change of the likelihood is less than ε, the iteration ends; in addition, a maximum iteration count N is set, and the iteration also stops when the number of iterations exceeds N. The Baum-Welch algorithm further adopts scale factors to correct its data underflow problem. As shown in Figure 4, the present invention adopts a left-to-right HMM structure without skips.
As shown in Figure 5, after the HMM models have been trained, the MFCC features are used together with the Viterbi algorithm to solve for the state transition sequences and their scores P(O|λ_n) (n = 1, ..., M); finally, a decision step selects the state transition sequence with the highest probability. The candidate syllables or initials/finals are then provided according to the λ corresponding to the optimal state sequence, and words and sentences are finally formed through the language model.
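A sketch of this decision step over a set of trained models; it assumes the MFCC observations have already been quantized to the discrete symbols expected by the output matrix B, a detail the text does not spell out:

```python
def recognise(obs, models):
    """Decision step: score the observation against every trained HMM lambda_n
    with the Viterbi algorithm and keep the model with the highest score."""
    best_name, best_score = None, float("-inf")
    for name, hmm in models.items():
        _, score = viterbi(hmm, obs)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```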
The concrete realization of each module is described as follows:
Six, recognition module 5:
As shown in Figure 7, the recognition module uses the HMM: it calls the trained speech models in the model bank and matches them against the input speech. The output of each HMM template is a transition probability value P_i (i = 0, 1, ..., I, where I is the number of templates); the transition probabilities P_i are compared, the maximum value P is obtained, and the corresponding text information is output as the recognition result.
Because a large-vocabulary speech recognition system contains many near-homophones and homophones, the recognition rate of the system can drop. To overcome the influence of near-homophones and homophones, the system processes the transition probabilities produced by the matching step, as shown in Figure 1. A transition probability threshold P_T is set: when P_i > P_T, the corresponding text is output; otherwise the result is discarded.
The transition probability threshold processing effectively improves the recognition rate of the system.
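The threshold processing can be expressed as a small wrapper around the decision step; p_t here stands for the transition probability threshold P_T of the text, on the same scale as the scores produced by the decision step:

```python
def accept_result(best_name, best_score, p_t):
    """Output the text only when the score exceeds the threshold P_T;
    otherwise discard the result as a likely near-homophone confusion."""
    return best_name if best_score > p_t else None
```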
Seven, translation and speech synthesis module:
The translation and speech synthesis module mainly performs a matching query between the hidden states output by the recognition module and the corpus, translates the result into text, and, using TTS technology, outputs it in speech form.
Figure 8 shows the structure of the corpus. The corpus is built with composite feature vectors. The phoneme feature vector V_phoneme is defined as

V_phoneme = (No., Phoneme)

where No. is the phoneme number and Phoneme is the phoneme content.
The syllable feature vector V_syllable is defined as

V_syllable = (No., Syllable, No._Word, G_P)

where No. is the syllable number, Syllable is the syllable content, No._Word is the word number, and G_P is the phoneme sequence set.
The word feature vector V_word is defined as

V_word = (No., Word, Vector_W, Num_Phrase, No._Phrase)

where No. is the word number, Word is the word content, Vector_W is the part-of-speech feature vector with Vector_W = (n, v, num, pron, adj, adv), Num_Phrase is the number of phrases based on this word, and No._Phrase is the phrase number.
The translation vector V_tran is defined as

V_tran = (No., Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj, Tran_adv)

where No. is the translation number and Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj and Tran_adv are the translations for the parts of speech n, v, num, pron, adj and adv respectively.
In the corpus, certain association relations exist between features of the different vectors, so cross-level queries between vectors can be made through the linked features, which improves retrieval efficiency.
In the translation process, the information associated with the syllable feature vector V_syllable is first obtained from the phoneme feature vector V_phoneme, the word feature vector V_word is then queried, and finally the translation vector V_tran is taken as the result.
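A sketch of the corpus feature vectors and the cross-level query described above; the field names and the lookup chain are illustrative renderings of V_phoneme, V_syllable, V_word and V_tran, not the actual corpus format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhonemeVec:                 # V_phoneme = (No., Phoneme)
    no: int
    phoneme: str

@dataclass
class SyllableVec:                # V_syllable = (No., Syllable, No._Word, G_P)
    no: int
    syllable: str
    word_no: int
    phonemes: List[int]           # G_P: phoneme-number sequence

@dataclass
class WordVec:                    # V_word = (No., Word, Vector_W, Num_Phrase, No._Phrase)
    no: int
    word: str
    pos: dict                     # Vector_W: part-of-speech features (n, v, num, pron, adj, adv)
    phrase_nos: List[int]

@dataclass
class TranVec:                    # V_tran = (No., Tran_n, ..., Tran_adv)
    no: int
    trans: dict                   # translations keyed by part of speech

def translate(phoneme_nos: List[int], syllables: List[SyllableVec],
              words: List[WordVec], tran_vecs: List[TranVec]) -> Optional[dict]:
    """Cross-level query: phoneme numbers -> syllable -> word -> translation."""
    syl = next((s for s in syllables if s.phonemes == list(phoneme_nos)), None)
    if syl is None:
        return None
    word = next((w for w in words if w.no == syl.word_no), None)
    tran = next((t for t in tran_vecs if word and t.no == word.no), None)
    return tran.trans if tran else None
```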
The fundamental purpose of speech synthesis is to output the text obtained from translation in speech form. It has three main components: a text analysis module, a prosody generation module and an acoustic module. The synthesis process is:

text analysis → prosody generation → acoustic module
In view of the above description, the present invention, compared with the prior art, offers two-way translation, low cost, low power consumption, high performance and strong portability, and has a large consumer market in the field of speech recognition systems.