Summary of the invention
In view of the above shortcomings of the prior art, it is an object of the invention to propose a courtesy-phrase scoring system for highway toll booths, so as to realize intelligent monitoring of the toll collectors' speech and to facilitate the administrators' supervision and assessment of the toll collectors' work.
To achieve the above object, the present invention comprises the following steps:
(1) Select m courtesy phrases used by highway toll collectors as keywords and select n people as speakers; each speaker utters each keyword completely and clearly x times, yielding a total of m × n × x .wav files as corpus files;
(2) Construct a parallel network model of keyword models and a Filler model:
2a) For the corpus files of each keyword, successively perform pre-emphasis, framing, and Hamming-window preprocessing to obtain frame-by-frame speech data; extract 24-dimensional Mel-frequency cepstral coefficients (MFCC) from the speech data as characteristic parameters; train these characteristic parameters with the Baum-Welch algorithm to obtain the hidden Markov model (HMM) parameter model of the keyword;
2b) Take the non-courtesy speech syllables that can be expected at a highway toll booth as non-keywords and establish a non-keyword HMM model with the same method as in 2a); establish a single-state HMM model for silence with the same method as in 2a); the non-keyword model and the silence model together form the Filler model;
2c) Arrange the keyword models and the Filler model in parallel to form a network model without linguistic constraints;
(3) Choose k people as test speakers; each speaker utters once each of m speech segments containing 1 to m keywords, yielding a total of k × m .wav files as speech test files;
(4) Apply to the speech test files the same preprocessing and MFCC feature extraction as in 2a) to obtain the test speech characteristic parameters; after adjusting the weights of the network edges in the parallel network model of keyword models and Filler model obtained in (2), use the Viterbi algorithm to compute the matching score of the test speech characteristic parameters against each model in the network model, and retain the s models with the higher matching scores as the initial keyword retrieval result;
(5) Use the Viterbi algorithm to compute the matching scores between the s higher-scoring models from the network model obtained in (4) and the keyword models obtained in 2a), normalize the s matching scores by time length, and take the results as the s confidence values corresponding to the s models; set a threshold and compare the confidence of each model with the threshold in turn, s times in total; if the confidence is lower than the threshold the model is discarded, and if the confidence is higher than the threshold the model is retained; the retained models serve as the final keyword retrieval result;
(6) Pass a random speech test file of a person from the files obtained in (3) through (4) and (5); if all m keywords to be retrieved are contained, the score is 100 points; if y keywords are missing, the score is 100 - y × 100/m points; the work score of the assessed person is thereby obtained.
The present invention has the following advantages:
1) The application scenario of the invention is the toll station at a highway entrance; a courtesy-phrase scoring system is built to realize intelligent monitoring of the toll collectors' speech and to facilitate the administrators' supervision and evaluation of the toll collectors;
2) The present invention uses an HMM-based speech keyword retrieval method, which has good robustness;
3) The present invention applies confidence-based keyword verification to the initial keyword retrieval results, and therefore achieves higher retrieval accuracy;
4) The present invention scores the assessed personnel strictly according to the scoring rules, does not miss a courtesy phrase, and has a low miss rate.
Specific embodiment
With reference to the accompanying drawings and the detailed description, the present invention is further elucidated.
Referring to Fig. 1, the specific steps of the present embodiment are as follows:
Step 1. Acquire corpus files.
Select m courtesy phrases used by highway toll collectors as keywords and choose n (n ≥ 20) people as speakers; each speaker utters each keyword completely and clearly x times, yielding a total of m × n × x .wav files as corpus files.
Step 2. Construct the parallel network model of keyword models and the Filler model.
2a) For the corpus files of each keyword, successively carry out pre-emphasis, framing, and Hamming-window preprocessing:
2a1) Apply pre-emphasis to the original signal x(n) with a first-order high-pass digital filter, obtaining the pre-emphasized signal:
y(n) = x(n) - 0.98·x(n-1);
2a2) Frame the pre-emphasized signal y(n) and apply a Hamming window to each frame, obtaining the framed, Hamming-windowed signal.
2b) After the preprocessing in 2a), frame-by-frame speech data are obtained; extract 24-dimensional Mel-frequency cepstral coefficients (MFCC) from the speech data as characteristic parameters (a preprocessing and feature-extraction sketch is given below);
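Purely as an illustrative sketch of 2a) and 2b), the preprocessing and MFCC extraction could look roughly as follows in Python; the 16 kHz sampling rate, the 25 ms/10 ms framing, and the use of the librosa library are assumptions of this sketch and are not specified by the embodiment:

```python
import numpy as np
import librosa  # assumed third-party dependency for MFCC extraction

PRE_EMPHASIS = 0.98   # coefficient from step 2a1)
FRAME_LEN = 400       # assumed 25 ms frames at 16 kHz
FRAME_SHIFT = 160     # assumed 10 ms frame shift
N_MFCC = 24           # 24-dimensional MFCC as in step 2b)

def frame_and_window(y):
    """Framing and Hamming windowing of step 2a2); assumes len(y) >= FRAME_LEN."""
    n_frames = 1 + (len(y) - FRAME_LEN) // FRAME_SHIFT
    window = np.hamming(FRAME_LEN)
    return np.stack([y[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * window
                     for i in range(n_frames)])

def extract_mfcc(wav_path, sr=16000):
    """Return a (n_frames, 24) MFCC matrix for one corpus .wav file."""
    x, _ = librosa.load(wav_path, sr=sr)
    # Pre-emphasis of step 2a1): y(n) = x(n) - 0.98 * x(n-1)
    y = np.append(x[0], x[1:] - PRE_EMPHASIS * x[:-1])
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC,
                                n_fft=FRAME_LEN, hop_length=FRAME_SHIFT,
                                window="hamming")
    return mfcc.T  # one 24-dimensional feature vector per frame
```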
2c) Train the characteristic parameters obtained in 2b) with the Baum-Welch algorithm:
2c1) Assume the observation sequence is O = {O_t, t = 1, 2, ..., T} and the initial model is λ = (π, A, B); let the state set of the initial model be {S_i, i = 1, 2, ..., N}, let the state at time t be q_t, and let the observation symbols be:
V = {v_k, k = 1, 2, ..., M};
in the initial model λ:
π = {π_i, i = 1, 2, ..., N},
A = {a_ij, i = 1, 2, ..., N, j = 1, 2, ..., N},
B = {b_j(k), j = 1, 2, ..., N, k = 1, 2, ..., M},
where π_i denotes the initial state probability, a_ij denotes the probability of being in state S_i at time t and transitioning to state S_j at time t+1, and b_j(k) denotes the probability of observing symbol v_k in state S_j;
2c2) Under the assumptions of 2c1), introduce two groups of probability variables: ε_t(i, j), the probability of being in state S_i at time t and in state S_j at time t+1, and γ_t(i), the probability of being in state S_i at time t, namely:
ε_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ),
γ_t(i) = P(q_t = S_i | O, λ);
2c3) From the two groups of probability variables introduced in 2c2), compute a group of new parameters:
π'_i = γ_1(i),
a'_ij = [Σ_{t=1}^{T-1} ε_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)],
b'_j(k) = [Σ_{t: O_t = v_k} γ_t(j)] / [Σ_{t=1}^{T} γ_t(j)];
2c4) Re-estimating with the group of new parameters π'_i, a'_ij, b'_j(k) obtained in 2c3) yields a new model:
λ' = (π', A', B'),
and the probability P(O | λ') that model λ' generates the observation sequence is larger than the probability P(O | λ) that the initial model λ generates the observation sequence;
2c5) Repeat 2c3) and 2c4) to continuously improve the model parameters until P(O | λ') no longer increases significantly; the model λ' = (π', A', B') at that point is the trained hidden Markov model (HMM) parameter template of the keyword (a training sketch is given below);
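The Baum-Welch training of 2c) could be sketched as follows; here the hmmlearn library's EM training, which implements Baum-Welch re-estimation for HMMs, stands in for the loop of 2c3)-2c5), and the number of states per keyword model is an assumed value:

```python
import numpy as np
from hmmlearn import hmm  # assumed third-party dependency; fit() runs Baum-Welch (EM)

def train_keyword_hmm(mfcc_list, n_states=6, n_iter=50):
    """Train one keyword HMM from the MFCC matrices of its corpus files.

    mfcc_list: list of (n_frames, 24) arrays from extract_mfcc().
    n_states and the use of diagonal-covariance Gaussian emissions are
    assumptions of this sketch, not values given in the embodiment.
    """
    X = np.concatenate(mfcc_list)               # stack all observation sequences
    lengths = [m.shape[0] for m in mfcc_list]   # per-file sequence lengths
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=n_iter)
    model.fit(X, lengths)                       # iterates until the likelihood stops improving
    return model
```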
2d) Take the non-courtesy speech syllables that can be expected at a highway toll booth as non-keywords and establish a non-keyword HMM model with the same method as in 2a)-2c); establish a single-state HMM model for silence with the same method; the non-keyword model and the silence model together form the Filler model;
2e) Arrange the keyword models and the Filler model in parallel to form the network model without linguistic constraints, as shown in Fig. 2 (a construction sketch is given below).
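As a rough illustration of 2d)-2e), the parallel network may be represented as a flat collection of competing models; the dictionary layout and the model names "<filler>" and "<silence>" are assumptions of this sketch:

```python
def build_parallel_network(keyword_mfccs, nonkeyword_mfccs, silence_mfccs):
    """Assemble the unconstrained parallel network of step 2e).

    keyword_mfccs maps each courtesy phrase to its list of MFCC matrices;
    nonkeyword_mfccs and silence_mfccs are lists of MFCC matrices for
    non-keyword speech and silence. The models are simply placed side by
    side; there are no linguistic constraints between them.
    """
    network = {}
    for word, feats in keyword_mfccs.items():
        network[word] = train_keyword_hmm(feats)                  # one HMM per courtesy phrase
    network["<filler>"] = train_keyword_hmm(nonkeyword_mfccs)     # non-keyword model
    network["<silence>"] = train_keyword_hmm(silence_mfccs, n_states=1)  # single-state silence model
    return network
```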
Step 3. Acquire speech test files.
Choose k people as test speakers, where k > 5; each speaker utters once each of m speech segments containing 1 to m keywords, yielding a total of k × m .wav files as speech test files.
Step 4. Initial keyword retrieval.
4a) Pass the speech test files successively through the same preprocessing as in 2a) and the same MFCC feature extraction as in 2b) to obtain the test speech characteristic parameters;
4b) After adjusting the weights of the network edges in the parallel network model obtained in step 2, use the Viterbi algorithm to compute the matching score of the test speech characteristic parameters obtained in 4a) against each model in the network model:
4b1) Under the same assumptions as in 2c1), let δ_t(S_i) denote the maximum probability, over all path sequences Q = {q_1, q_2, ..., q_t} with q_t = S_i, of generating the observation sequence up to time t, and introduce a group of intermediate (backtracking) variables ψ_t(S_i);
4b2) Initialize the probability variables δ_t(S_i) and the intermediate variables ψ_t(S_i) set in 4b1) as:
δ_1(S_i) = π_i · b_i(O_1), ψ_1(S_i) = 0, 1 ≤ i ≤ N;
4b3) Let δ_t(S_j) denote the maximum probability, over all path sequences Q = {q_1, q_2, ..., q_t} with q_t = S_j, of generating the observation sequence up to time t, and introduce the corresponding intermediate variables ψ_t(S_j); on the basis of the probability variables and intermediate variables obtained in 4b2), the maximum probability δ_t(S_j) and the intermediate variable ψ_t(S_j) are obtained recursively as:
δ_t(S_j) = max_{1≤i≤N} [δ_{t-1}(S_i) · a_ij] · b_j(O_t),
ψ_t(S_j) = argmax_{1≤i≤N} [δ_{t-1}(S_i) · a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N;
4b4) From the group of intermediate variables ψ_t(S_j) obtained in 4b3), recursively compute the states q'_1, q'_2, ..., q'_{T-1} by backtracking:
q'_t = ψ_{t+1}(q'_{t+1}), t = T-1, T-2, ..., 1;
4b5) From δ_T(S_i) obtained in 4b3), compute the probability P'(Q, O | λ) that the observation sequence matches the model at time T, and the state q'_T:
P'(Q, O | λ) = max_{1≤i≤N} [δ_T(S_i)],
q'_T = argmax_{1≤i≤N} [δ_T(S_i)];
P'(Q, O | λ) is the matching score of the observation sequence against the model;
4b6) Merge the group q'_1, q'_2, ..., q'_{T-1} obtained in 4b4) with q'_T obtained in 4b5) to obtain the optimal state path sequence:
Q' = {q'_1, q'_2, ..., q'_T};
4c) Retain the s models with the higher matching scores from 4b) as the initial keyword retrieval result (a decoding sketch is given below).
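A minimal sketch of the initial retrieval of step 4, assuming the models and network dictionary from the earlier sketches; hmmlearn's decode() runs the Viterbi algorithm of 4b) and returns the log matching score together with the optimal state path. The explicit adjustment of edge weights and the value of s are simplifications of this sketch:

```python
def initial_retrieval(network, test_mfcc, s=5):
    """Score one test utterance against every model in the parallel network
    and keep the s best-matching models (steps 4b)-4c)).

    test_mfcc: (n_frames, 24) MFCC matrix of one speech test file.
    """
    scores = {}
    for name, model in network.items():
        log_prob, state_path = model.decode(test_mfcc)   # Viterbi: P'(Q, O | λ) and Q'
        scores[name] = log_prob
    # Keep the s models with the highest Viterbi matching score
    best = sorted(scores, key=scores.get, reverse=True)[:s]
    return best, scores
```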
Step 5. Perform keyword verification with confidence values to obtain the final keyword retrieval result.
5a) Use the same Viterbi algorithm as in 4b) to compute the matching scores of the s higher-scoring keyword candidates obtained in 4c) against the corresponding isolated word models, normalize the s matching scores by time length, and take the normalized results as the s confidence values corresponding to the s models;
5b) Set a threshold and compare the confidence of each model obtained in 5a) with the threshold in turn, s times in total; if the confidence is lower than the threshold the model is discarded, and if the confidence is higher than the threshold the model is retained; the retained models serve as the final keyword retrieval result (see the sketch below).
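Step 5 could be sketched as follows, assuming that the keyword models of the network double as the isolated word models and that the frame-normalized Viterbi log score serves as the confidence; the threshold value is purely illustrative:

```python
def verify_keywords(network, test_mfcc, candidates, threshold=-40.0):
    """Confidence-based verification of step 5.

    candidates: the s model names returned by initial_retrieval().
    The threshold and the normalization by frame count are assumptions
    of this sketch; the embodiment only specifies normalization by time length.
    """
    n_frames = test_mfcc.shape[0]
    retained = []
    for name in candidates:
        log_prob, _ = network[name].decode(test_mfcc)   # Viterbi match against the isolated model
        confidence = log_prob / n_frames                # normalize the score by time length
        if confidence > threshold:                      # keep only confident keyword hypotheses
            retained.append(name)
    return retained
```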
Step 6. Complete the scoring.
Take a random speech test file of a person from the files obtained in step 3 and pass it through step 4 and step 5; if all m keywords to be retrieved are contained, the score is 100 points; if y keywords are missing, the score is 100 - y × 100/m points, where 0 ≤ y ≤ m; the work score of the assessed person is thereby obtained (see the sketch below).
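The scoring rule of step 6 reduces to simple arithmetic; a minimal sketch with an illustrative example:

```python
def score_recording(expected_keywords, retained_keywords):
    """Scoring rule of step 6: 100 points minus 100/m for each missing keyword."""
    m = len(expected_keywords)
    y = len(set(expected_keywords) - set(retained_keywords))   # number of missing keywords
    return 100 - y * 100 / m

# Illustrative example: m = 5 courtesy phrases expected, 1 missing
# -> 100 - 1 * 100 / 5 = 80 points
```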
The above is only an example of the present invention and does not constitute any limitation of the invention; it is clear that, after understanding the content and principle of the present invention, a person skilled in the art may make various modifications and changes in form and detail without departing from the principle and structure of the invention, but such modifications and changes based on the inventive concept still fall within the scope of the claims of the present invention.