Summary of the invention
The objective of the invention is to design a portable spoken-language translation system based on the WinCE platform that, under the resource constraints of an embedded system, achieves large-vocabulary recognition with a high recognition rate and realizes two-way spoken translation from Chinese to English and from English to Chinese.
Another object of the present invention is to provide the speech recognition method used by this translation system.
To achieve the foregoing objectives, the present invention comprises the following technical features. A portable spoken-language translation system based on the WinCE platform is characterized in that it comprises a voice acquisition device, a speech preprocessing module, a feature extraction and modeling module, a model bank, a recognition module, a corpus, and a translation and speech synthesis module, all of which are built on the embedded platform. The voice acquisition device is connected to the speech preprocessing module; the speech preprocessing module is connected to the feature extraction and modeling module; the feature extraction and modeling module is connected to either the model bank or the recognition module: when the training state is selected it is connected to the model bank, and when the recognition state is selected it is connected to the recognition module. The recognition module is connected to the translation and speech synthesis module, which in turn is connected to the corpus. After decision scoring, the optimal result from the recognition module is translated into text by the translation and speech synthesis module and output in speech form. Through language selection, two-way spoken translation from Chinese to English or from English to Chinese is realized.
The speech preprocessing module comprises, connected in sequence, a pre-emphasis unit, a frame-splitting unit, a windowing unit and an endpoint detection unit. The pre-emphasis unit is connected to the voice acquisition device, and the endpoint detection unit is connected to the feature extraction and modeling module.
The pre-emphasis unit is a high-frequency-boost pre-emphasis digital filter.
The frame-splitting unit splits the signal into frames using overlapping frames.
The windowing unit applies a Hamming window function to each frame.
The endpoint detection unit uses a double-threshold comparison based on short-time energy E and short-time average zero-crossing rate Z as features, and computes the zero-crossing threshold Z_cT and the high and low energy thresholds from the leading silent segment to serve as detection thresholds.
The feature extraction and modeling module extracts MFCC speech features as the recognition features and adopts a hidden Markov model for training and recognition; the hidden Markov model consists of a Markov chain and a general stochastic process.
The hidden Markov model uses the forward-backward probability algorithm to solve the evaluation problem, the Viterbi algorithm to solve the decoding problem, and the Baum-Welch iterative algorithm to solve the learning problem.
Specifically, the forward-backward probability algorithm solves the evaluation problem: given the hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, compute the likelihood P(O|λ).
The Viterbi algorithm solves the decoding problem: given the hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, find the state sequence S = q_1, q_2, ..., q_T that most probably generated this observation sequence.
For an unknown hidden Markov model, the Baum-Welch iterative algorithm is used to estimate the model parameters.
The present invention also comprises a speech recognition method for the portable spoken-language translation system based on the WinCE platform, characterized by the following steps:
(1) Train the hidden Markov model to obtain the model parameters.
(2) Take the speech features produced by the feature extraction module as the observation sequence of the hidden Markov model; the speech units obtained by training form the state sequence, and the state transition sequence is solved by the Viterbi algorithm.
(3) Apply a decision step to obtain the state transition sequence with the highest probability.
(4) Map the optimal state sequence to candidate phonemes or syllables, and finally form words and sentences through the language model.
In step (1), the hidden Markov model parameters are first initialized, and the Baum-Welch iterative algorithm is then used to estimate the model parameters.
In step (1), the training algorithm is iterated repeatedly to obtain the result, and a termination condition is also provided: when the relative change of the likelihood is less than ε, the iteration ends; in addition, a maximum iteration count N is set, and when the number of iterations exceeds N the iteration also stops. The Baum-Welch algorithm further adopts scale factors to correct the data underflow problem of the algorithm.
The present invention is a portable spoken-language translation system based on the WinCE platform and a speech recognition method for it. Its hardware core is an embedded processor, and the embedded system offers low cost, low power consumption, high performance and strong portability. The speech preprocessing module comprises a pre-emphasis unit, a frame-splitting unit, a windowing unit and an endpoint detection unit; by preprocessing the collected speech signal, the embedded system becomes more efficient during later recognition and the recognition accuracy is higher. A hidden Markov model is adopted: the model bank is trained and then used for recognition, which makes the recognition process more accurate and efficient. Compared with the prior art, the present invention offers two-way translation, low cost, low power consumption, high performance and strong portability, and has a large consumer market in the field of speech recognition systems.
Embodiment
The present invention is a portable spoken-language translation system based on the WinCE platform; a speech recognition system based on WinCE has been designed and implemented. Embedded systems offer low cost, low power consumption and high performance, and their core is the embedded processor. At present, ARM microprocessors mainly include the ARM7, ARM9, ARM9E, ARM10E and ARM11 series, with ever-increasing capability. The present invention uses the embedded scientific research platform UP-CPU 6410 with Samsung's latest S3C6410X (ARM11) embedded microprocessor, which runs at up to 633 MHz and is based on the ARM1176JZF-S core with the ARM v6 architecture.
The module block diagram of the present invention is shown in Figure 6. The speech signal is captured by the microphone of the voice acquisition device 1 and passed to the speech preprocessing module 2, which performs pre-emphasis, frame splitting, windowing and endpoint detection; these functions are realized by the pre-emphasis unit 21, the frame-splitting unit 22, the windowing unit 23 and the endpoint detection unit 24. The feature extraction and modeling module 3 then extracts features from the speech and trains the speech models; it is connected to the model bank 4 or the recognition module 5. The translation and speech synthesis module 7 reads the corpus 6, translates the result into text and outputs synthesized speech.
Each of the modules and units involved is described below:
One, pre-emphasis unit 21
The average power spectrum of the speech signal is affected by glottal excitation and lip/nostril radiation: above roughly 800 Hz the high-frequency end falls off at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component. For this reason the high-frequency part must be boosted before the speech signal is analyzed. A 6 dB/oct high-frequency-boost pre-emphasis digital filter is therefore usually applied to the speech signal before analysis; it boosts the high-frequency part so that the spectrum of the signal becomes flatter and remains comparable across the whole band from low to high frequency, allowing the spectrum to be estimated with the same signal-to-noise ratio. The filter response function is:

H(z) = 1 − αz⁻¹, 0.9 ≤ α ≤ 1.0

where α is the pre-emphasis factor, usually taken as 0.9375. The output of the pre-emphasis network is then related to the input speech signal s(n) by the difference equation y(n) = s(n) − αs(n−1).
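As an illustration, a minimal Python sketch of this pre-emphasis step (assuming the speech samples are available as a NumPy array; the function name is chosen here only for illustration):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.9375) -> np.ndarray:
    """Apply the high-frequency-boost filter H(z) = 1 - alpha*z^-1,
    i.e. y(n) = s(n) - alpha*s(n-1)."""
    out = np.empty_like(signal, dtype=float)
    out[0] = signal[0]                        # first sample has no predecessor
    out[1:] = signal[1:] - alpha * signal[:-1]
    return out
```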
Two, frame-splitting unit 22
A speech signal is time-varying, but within a short interval its characteristics remain essentially unchanged, i.e. relatively stable; this property is called the "short-time property", and the short interval is generally 10–30 ms. The analysis and processing of speech signals is therefore generally based on this short-time property, i.e. "short-time analysis" is performed and the speech stream is split into frames. The number of frames per second depends on the actual conditions. Frames may be taken contiguously or with overlap; since successive speech samples are correlated, the present invention adopts overlapping frames. In this way the whole speech signal is represented by the time series of characteristic parameters obtained from each frame.
Three, windowing unit 23
Because the speech signal is short-time stationary, it can be processed frame by frame. To emphasize the speech waveform around sample n and attenuate the rest of the waveform, each frame is also windowed. Processing each short segment of speech amounts to applying some transform or operation to it; the general expression is

Q_n = Σ_m T[s(m)] w(n − m)

where T[·] denotes some transform, which may be linear or non-linear, s(n) is the input speech signal sequence, w(n) is the window, and Q_n is the time series obtained after processing each segment. The present invention uses the Hamming window

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1.
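A sketch of the framing and Hamming-windowing steps might look as follows; the frame length and frame shift are illustrative values rather than values fixed by the text:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 256,
                     frame_shift: int = 128) -> np.ndarray:
    """Split the signal into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame.
    Assumes len(signal) >= frame_len."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```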
Four, endpoint detection unit 24
Endpoint detection in speech processing serves mainly to detect the start point and end point of speech automatically. The present invention uses the double-threshold comparison method for endpoint detection. This method takes the short-time energy E and the short-time average zero-crossing rate Z as features; by combining the advantages of Z and E it makes detection more accurate, effectively reduces the processing time of the system, improves the real-time performance of the processing, and can exclude the noise of unvoiced segments, thereby improving recognition performance.
In the double-threshold method, the short-time energy E and the short-time average zero-crossing rate Z are computed as follows:
(1) short-time energy E
The short-time energy of the speech signal s(n) is defined as

E_n = Σ_m [s(m) w(n − m)]²

where w(n) is the Hamming window function. If we let h(n) = w²(n), then

E_n = Σ_m s²(m) h(n − m).

This shows that the short-time energy is equivalent to passing the squared signal s²(n) through a linear filter whose unit-sample response is h(n). Its realization block diagram is as follows:

(Realization block diagram of the short-time energy.)

E_n is then the short-time average energy of the speech frame indexed by n.
(2) Short-time average zero-crossing rate Z

The short-time average zero-crossing rate is defined as

Z_n = (1/2) Σ_m |sgn[s(m)] − sgn[s(m − 1)]| w(n − m)

where sgn[·] is the sign function, i.e. sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0, s(n) is the speech signal and w(n) is a window function. Its realization block diagram is constructed in the same way as that of the short-time energy.
The short interval at the beginning of the speech signal is uniformly distributed background noise. When the double-threshold method is used for endpoint detection, the zero-crossing threshold Z_cT and the low and high energy thresholds ETL and ETU must be computed from this leading "silent" segment and used as detection thresholds, after which the endpoints can be detected accurately.

The zero-crossing threshold is Z_cT = min(IF, Z_c + 2σ_Zc), where IF is an empirical value (the present invention takes IF = 25), and Z_c and σ_Zc are respectively the mean and the standard deviation of the zero-crossing rate of the initial "silent" segment.
For ETL (the low energy threshold) and ETU (the high energy threshold), the short-time average energy of the "silent" segment is computed first; its maximum is denoted E_max and its minimum E_min. Let:
I1 = 0.03 × (E_max − E_min) + E_min
I2 = 4 × E_min

Then:

ETL = min(I1, I2)
ETU = 5 × ETL
When Z_cT, ETL and ETU are used as thresholds for detection, let N1 be the start frame; then the energy E_N1 and zero-crossing rate Z_N1 at frame N1 simultaneously satisfy ETU > E_N1 > ETL, E_{N1+1} > ETU and Z_N1 > Z_cT. At the end frame N2, the energy E_N2 and zero-crossing rate Z_N2 simultaneously satisfy the corresponding energy condition (with adjustment coefficient k = 4) and Z_N2 < Z_cT.
By adopting the double-threshold method and taking neighbouring frames into account, the influence of noise can be effectively avoided and the detection accuracy improved, which makes feature extraction efficient and benefits the recognition rate.
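The double-threshold detection with the thresholds defined above can be sketched roughly as follows. Since the exact end-frame energy condition is not reproduced in the text, the condition E < ETL/k (with k = 4) used here is only an assumption, and the number of leading silent frames is likewise illustrative:

```python
import numpy as np

def short_time_features(frames: np.ndarray):
    """Per-frame short-time energy E and zero-crossing count Z
    for already framed (and windowed) speech."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    signs[signs == 0] = 1
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    return energy, zcr

def double_threshold_endpoints(frames: np.ndarray, silence_frames: int = 10,
                               IF: float = 25.0, k: float = 4.0):
    """Double-threshold endpoint detection: Z_cT, ETL and ETU are estimated
    from the leading 'silent' frames and used as detection thresholds."""
    E, Z = short_time_features(frames)
    # zero-crossing threshold from the silent segment: Z_cT = min(IF, Zc + 2*sigma)
    ZcT = min(IF, Z[:silence_frames].mean() + 2.0 * Z[:silence_frames].std())
    # low/high energy thresholds from the silent segment
    Emax, Emin = E[:silence_frames].max(), E[:silence_frames].min()
    ETL = min(0.03 * (Emax - Emin) + Emin, 4.0 * Emin)
    ETU = 5.0 * ETL
    # start frame N1: ETU > E > ETL, next-frame energy above ETU, Z above Z_cT
    start = next((i for i in range(silence_frames, len(E) - 1)
                  if ETL < E[i] < ETU and E[i + 1] > ETU and Z[i] > ZcT), None)
    # end frame N2: assumed energy condition E < ETL/k and Z below Z_cT
    end = None
    if start is not None:
        end = next((i for i in range(start + 1, len(E))
                    if E[i] < ETL / k and Z[i] < ZcT), None)
    return start, end
```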
Five, feature extraction and modeling module 3
The present invention extracts MFCC speech features, which are based on auditory characteristics, as the recognition features. Mel-frequency cepstral coefficients (MFCC) are derived from the characteristics of the human auditory system and imitate the human ear's perception of speech at different frequencies. The ear's discrimination of sound frequency behaves roughly like a logarithmic operation: on the Mel frequency scale, human perception of pitch is linear; for example, if the Mel frequencies of two speech segments differ by a factor of two, the perceived pitch also differs by a factor of two.
The MFCC algorithm in the feature extraction module 3 proceeds as follows:
1. Fast Fourier transform (FFT): x[n] (n = 0, 1, 2, ..., N − 1) is one frame of the sampled discrete speech sequence and N is the frame length. X[k] is the resulting N-point complex sequence, from which the magnitude spectrum |X[k]| is taken.
2. Conversion of the actual frequency scale to the Mel frequency scale:

Mel(f) = 2595 lg(1 + f/700)

where Mel(f) is the Mel frequency and f is the actual frequency in Hz.
3. A triangular filter bank is configured and the output of each triangular filter applied to the magnitude spectrum |X[k]| is computed:

F(l) = Σ_k w_l(k) |X[k]|

where w_l(k) are the coefficients of the corresponding filter; o(l), c(l) and h(l) are the lower cut-off, centre and upper cut-off frequencies of that filter on the actual frequency axis; f_s is the sampling rate; L is the number of filters; and F(l) is the filter output.
4. The logarithm of every filter output is taken and a discrete cosine transform (DCT) is then applied to obtain the MFCC:

M(i) = √(2/L) Σ_{l=1}^{L} ln F(l) · cos(πi(l − 0.5)/L), i = 1, ..., Q

where Q is the order of the MFCC parameters, generally 12, and M(i) are the resulting MFCC parameters.
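The four MFCC steps can be sketched as follows; the sampling rate, number of filters and MFCC order are illustrative defaults, and the DCT normalization follows the common convention rather than anything fixed by the text:

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame: np.ndarray, fs: int = 8000, n_filters: int = 24, q: int = 12):
    """MFCC of one windowed frame: |FFT| -> Mel triangular filter bank
    -> log -> DCT, keeping the first q coefficients M(1..q)."""
    spectrum = np.abs(np.fft.rfft(frame))                 # |X[k]|
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # filter edges o(l), c(l), h(l): equally spaced on the Mel scale
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    fbank = np.zeros(n_filters)
    for l in range(n_filters):
        o, c, h = edges[l], edges[l + 1], edges[l + 2]
        rising = np.clip((freqs - o) / (c - o), 0.0, 1.0)
        falling = np.clip((h - freqs) / (h - c), 0.0, 1.0)
        fbank[l] = np.dot(np.minimum(rising, falling), spectrum)   # F(l)
    log_f = np.log(fbank + 1e-10)
    # DCT of the log filter-bank outputs
    i = np.arange(1, q + 1).reshape(-1, 1)
    l = np.arange(n_filters)
    dct = np.cos(np.pi * i * (l + 0.5) / n_filters)
    return np.sqrt(2.0 / n_filters) * dct.dot(log_f)      # M(i), i = 1..q
```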
The speech model of the present invention is the hidden Markov model. A hidden Markov model (HMM, Hidden Markov Model) is a statistical signal processing model that describes the statistical properties of a random process with probability model parameters; it developed from the Markov chain. An HMM has two components: a Markov chain, which describes the state transitions by means of transition probabilities, and a general stochastic process, which describes the relation between the states and the observation sequence by means of observation probabilities. Its composition is shown in Figure 1.
The HMM can be expressed as λ = (N, M, π, A, B), where:

N: the number of states of the Markov chain in the model. The N states are denoted θ_1, ..., θ_N, and the state of the Markov chain at time t is denoted q_t, with q_t ∈ (θ_1, ..., θ_N).

M: the number of possible observations for each state. The M observations are denoted V_1, ..., V_M, and the observation vector at time t is denoted O_t, with O_t ∈ (V_1, ..., V_M).

π: the initial state probability vector, π = (π_1, ..., π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N.

A: the state transition probability matrix, A = (a_ij)_{N×N}, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N, is the probability of moving from state i to state j.

B: the output probability matrix, B = (b_ik)_{N×M}, where b_ik = P(O_t = V_k | q_t = θ_i), 1 ≤ i ≤ N, 1 ≤ k ≤ M, is the probability of producing output V_k while in state i.

Since a_ij, b_ik and π_i are all probabilities, they must satisfy the normalization conditions: a_ij ≥ 0, b_ik ≥ 0, π_i ≥ 0, and

Σ_j a_ij = 1,  Σ_k b_ik = 1,  Σ_i π_i = 1.
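The parameter set λ = (π, A, B) maps naturally onto a small data structure; the sketch below (with names chosen for illustration) also checks the normalization constraints:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    """Discrete HMM lambda = (pi, A, B) with N states and M observation symbols."""
    pi: np.ndarray   # initial state probabilities, shape (N,)
    A: np.ndarray    # a_ij = P(q_{t+1} = j | q_t = i), shape (N, N)
    B: np.ndarray    # b_ik = P(O_t = V_k | q_t = i), shape (N, M)

    def check(self, tol: float = 1e-6) -> bool:
        """Verify the normalization constraints: pi and every row of A and B sum to 1."""
        return (abs(self.pi.sum() - 1.0) < tol
                and np.allclose(self.A.sum(axis=1), 1.0, atol=tol)
                and np.allclose(self.B.sum(axis=1), 1.0, atol=tol))
```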
An HMM involves three problems:

1. The evaluation problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, compute the likelihood P(O|λ). The most basic theoretical method is to add up the probabilities over every possible state sequence S = q_1, q_2, ..., q_T, i.e.

P(O|λ) = Σ_S P(O|S, λ) P(S|λ).

However, the complexity of this method is of the order T·N^T, so the amount of computation is enormous; the forward-backward algorithm solves this evaluation problem effectively in recognition, with a computation of the order N²·T.
Define the forward variable α_t(i) = P(o_1 o_2 ... o_t, q_t = i | λ): under model λ, the probability of having observed o_1 ... o_t up to time t with the state at time t being i. The forward variable at the next time step is computed by

α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}).
A schematic diagram of the forward-backward algorithm is shown in Figure 2.
Define the backward variable β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ): the probability of the observation sequence (o_{t+1} o_{t+2} ... o_T) from the final time T backward to time t + 1, given that the state at time t is i. The backward variable at the previous time step is computed by

β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j).

The schematic diagram of the backward algorithm is similar to that of the forward method, only with the direction reversed.
When the forward and backward probabilities are used together to solve the evaluation problem, the computation is

P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i).
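A sketch of a scaled forward-backward computation for the evaluation problem, using the HMM structure sketched above; rescaling at every frame corresponds to the scale factors used later to avoid data underflow:

```python
import numpy as np

def forward_backward(hmm, obs):
    """Scaled forward-backward pass for the evaluation problem.
    Returns alpha, beta, the per-frame scale factors and log P(O | lambda)."""
    T, N = len(obs), len(hmm.pi)
    alpha, beta, scale = np.zeros((T, N)), np.zeros((T, N)), np.zeros(T)

    # forward: alpha_t(i) = P(o_1..o_t, q_t = i | lambda), rescaled at every frame
    alpha[0] = hmm.pi * hmm.B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = alpha[t - 1].dot(hmm.A) * hmm.B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # backward: beta_t(i) = P(o_{t+1}..o_T | q_t = i, lambda), same scale factors
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = hmm.A.dot(hmm.B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    return alpha, beta, scale, np.sum(np.log(scale))   # log P(O | lambda)
```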
2. The decoding problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, find the state sequence S = q_1, q_2, ..., q_T that the system most probably passed through to produce this observation sequence, i.e. solve for the state sequence S that maximizes P(S|O, λ). Since

P(S|O, λ) = P(S, O|λ) / P(O|λ)

and P(O|λ) is the same for every S, the decoding problem is equivalent to finding the state sequence S that maximizes P(S, O|λ). The decoding problem is solved with the Viterbi algorithm.
Define δ_t(i) as the maximum probability over all state sequences that are in state i at time t, taken jointly with the first t − 1 states and the observations; the recursion of the algorithm is

δ_{t+1}(j) = max_i [δ_t(i) a_ij] b_j(o_{t+1}).
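A sketch of the Viterbi recursion and back-tracking for the decoding problem; working with log probabilities here is only a convenience to avoid underflow and is not prescribed by the text:

```python
import numpy as np

def viterbi(hmm, obs):
    """Viterbi decoding: the state sequence S maximising P(S, O | lambda)."""
    T, N = len(obs), len(hmm.pi)
    log_A = np.log(hmm.A + 1e-300)
    log_B = np.log(hmm.B + 1e-300)
    delta = np.zeros((T, N))            # delta_t(i): best log-probability ending in state i
    psi = np.zeros((T, N), dtype=int)   # back-pointers

    delta[0] = np.log(hmm.pi + 1e-300) + log_B[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_A     # cand[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_B[:, obs[t]]

    path = np.zeros(T, dtype=int)                # backtrack the optimal state sequence
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()                 # state sequence and its log score
```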
3. The learning problem
For an unknown HMM, given an observation sequence O = O_1, O_2, ..., O_T produced by the system, determine the model λ = (π, A, B), i.e. find the model parameters π, A, B that maximize the joint probability P(O|λ). The learning problem corresponds to the parameter training process of the HMM: only the observed data are available and there is no description of the states, so the maximum likelihood is usually chosen as the optimization target and, building on expectation maximization (EM), the Baum-Welch iterative algorithm is used to estimate the model parameters. Let ξ_t(i, j) denote the probability that the state is i at time t and j at time t + 1:

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ).

Let γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) denote the probability that the state is i at time t; then Σ_{t=1}^{T−1} γ_t(i) is the expected number of transitions out of state i, and Σ_{t=1}^{T−1} ξ_t(i, j) is the expected number of transitions from state i to state j.

The re-estimation formula for the state transition matrix is therefore

a_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i),

and the re-estimation formula for the output probability matrix is

b_ik = Σ_{t: O_t = V_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i).
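One Baum-Welch re-estimation step based on the scaled forward/backward variables sketched earlier might look as follows; wrapping it in a loop that stops when the relative change of the log-likelihood falls below ε or the iteration count exceeds N reproduces the training procedure described above:

```python
import numpy as np

def baum_welch_step(hmm, obs, alpha, beta):
    """One Baum-Welch re-estimation step from the scaled alpha/beta variables.
    gamma_t(i): probability of being in state i at time t;
    xi_t(i,j): probability of being in i at t and j at t+1."""
    T, N = len(obs), len(hmm.pi)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        x = (alpha[t][:, None] * hmm.A *
             hmm.B[:, obs[t + 1]][None, :] * beta[t + 1][None, :])
        xi[t] = x / x.sum()

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]      # a_ij re-estimate
    new_B = np.zeros_like(hmm.B)
    obs = np.asarray(obs)
    for k in range(hmm.B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)                 # b_ik numerator
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```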
The HMM speech recognition process of the present invention is as follows:
In speech recognition, the MFCC speech features obtained by the feature extraction module form the observation sequence of the HMM, while the states are the speech units obtained by training. Therefore, to build the HMM and perform recognition, the HMM parameters must be obtained by training the model; the training process of the present invention, shown in Figure 3, achieves a good training effect.
In the training process, the HMM parameters are first initialized and the Baum-Welch iterative algorithm is then used to estimate the model parameters. In practice, the training algorithm must be iterated several times before a result is obtained, so a termination condition must also be given: when the relative change of the likelihood is less than ε, the iteration ends; in addition, a maximum iteration count N is set, and the iteration also stops when the number of iterations exceeds N. The Baum-Welch algorithm further adopts scale factors to correct its data underflow problem. As shown in Figure 4, the present invention adopts a left-to-right HMM structure without skips.
As shown in Figure 5, after the HMM models have been trained, the MFCC features are used together with the Viterbi algorithm to solve for the state transition sequences and their scores P(O|λ_n) (n = 1, ..., M); finally, a decision step selects the state transition sequence with the highest probability. The candidate syllables or initials/finals are then provided according to the λ corresponding to the optimal state sequence, and words and sentences are finally formed through the language model.
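A sketch of this decision step over a set of trained models; it assumes the MFCC observations have already been quantized to the discrete symbols expected by the output matrix B, a detail the text does not spell out:

```python
def recognise(obs, models):
    """Decision step: score the observation against every trained HMM lambda_n
    with the Viterbi algorithm and keep the model with the highest score."""
    best_name, best_score = None, float("-inf")
    for name, hmm in models.items():
        _, score = viterbi(hmm, obs)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```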
The concrete realization of each module is described as follows:
Six, recognition module 5:
As shown in Figure 7, the recognition module uses the HMM: it calls the trained speech models in the model bank and matches them against the input speech. The output of each HMM template is a transition probability value P_i (i = 0, 1, ..., I, where I is the number of templates); the transition probabilities P_i are compared, the maximum value P is obtained, and the corresponding text information is output as the recognition result.
Because a large-vocabulary speech recognition system contains many near-homophones and homophones, the recognition rate of the system can drop. To overcome the influence of near-homophones and homophones, the system processes the transition probabilities produced by the matching step, as shown in Figure 1. A transition probability threshold P_T is set: when P_i > P_T, the corresponding text is output; otherwise the result is discarded.
The transition probability threshold processing effectively improves the recognition rate of the system.
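The threshold processing can be expressed as a small wrapper around the decision step; p_t here stands for the transition probability threshold P_T of the text, on the same scale as the scores produced by the decision step:

```python
def accept_result(best_name, best_score, p_t):
    """Output the text only when the score exceeds the threshold P_T;
    otherwise discard the result as a likely near-homophone confusion."""
    return best_name if best_score > p_t else None
```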
Seven, translation and speech synthesis module:
The translation and speech synthesis module mainly performs a matching query between the hidden states output by the recognition module and the corpus, translates the result into text, and, using TTS technology, outputs it in speech form.
Figure 8 shows the structure of the corpus. The corpus is built with composite feature vectors. The phoneme feature vector V_phoneme is defined as

V_phoneme = (No., Phoneme)

where No. is the phoneme number and Phoneme is the phoneme content.
The syllable feature vector V_syllable is defined as

V_syllable = (No., Syllable, No._Word, G_P)

where No. is the syllable number, Syllable is the syllable content, No._Word is the word number, and G_P is the phoneme sequence set.
The word feature vector V_word is defined as

V_word = (No., Word, Vector_W, Num_Phrase, No._Phrase)

where No. is the word number, Word is the word content, Vector_W is the part-of-speech feature vector with Vector_W = (n, v, num, pron, adj, adv), Num_Phrase is the number of phrases based on this word, and No._Phrase is the phrase number.
The translation vector V_tran is defined as

V_tran = (No., Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj, Tran_adv)

where No. is the translation number and Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj and Tran_adv are the translations for the parts of speech n, v, num, pron, adj and adv respectively.
In the corpus, certain association relations exist between features of the different vectors, so cross-level queries between vectors can be made through the linked features, which improves retrieval efficiency.
In the translation process, the information associated with the syllable feature vector V_syllable is first obtained from the phoneme feature vector V_phoneme, the word feature vector V_word is then queried, and finally the translation vector V_tran is taken as the result.
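A sketch of the corpus feature vectors and the cross-level query described above; the field names and the lookup chain are illustrative renderings of V_phoneme, V_syllable, V_word and V_tran, not the actual corpus format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhonemeVec:                 # V_phoneme = (No., Phoneme)
    no: int
    phoneme: str

@dataclass
class SyllableVec:                # V_syllable = (No., Syllable, No._Word, G_P)
    no: int
    syllable: str
    word_no: int
    phonemes: List[int]           # G_P: phoneme-number sequence

@dataclass
class WordVec:                    # V_word = (No., Word, Vector_W, Num_Phrase, No._Phrase)
    no: int
    word: str
    pos: dict                     # Vector_W: part-of-speech features (n, v, num, pron, adj, adv)
    phrase_nos: List[int]

@dataclass
class TranVec:                    # V_tran = (No., Tran_n, ..., Tran_adv)
    no: int
    trans: dict                   # translations keyed by part of speech

def translate(phoneme_nos: List[int], syllables: List[SyllableVec],
              words: List[WordVec], tran_vecs: List[TranVec]) -> Optional[dict]:
    """Cross-level query: phoneme numbers -> syllable -> word -> translation."""
    syl = next((s for s in syllables if s.phonemes == list(phoneme_nos)), None)
    if syl is None:
        return None
    word = next((w for w in words if w.no == syl.word_no), None)
    tran = next((t for t in tran_vecs if word and t.no == word.no), None)
    return tran.trans if tran else None
```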
The fundamental purpose of speech synthesis is to output the text obtained from translation in speech form. It has three main components: a text analysis module, a prosody generation module and an acoustic module. The synthesis process is:

text analysis → prosody generation → acoustic module
In view of the above description, the present invention, compared with the prior art, offers two-way translation, low cost, low power consumption, high performance and strong portability, and has a large consumer market in the field of speech recognition systems.