
CN106448685B - Voiceprint authentication system and method based on phoneme information - Google Patents

Voiceprint authentication system and method based on phoneme information

Info

Publication number
CN106448685B
CN106448685B
Authority
CN
China
Prior art keywords
phoneme
numeric string
correlation model
digital
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610880776.6A
Other languages
Chinese (zh)
Other versions
CN106448685A (en)
Inventor
郑榕
张策
王黎明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuanjian Technologies Co ltd
Priority to CN201610880776.6A
Publication of CN106448685A
Application granted
Publication of CN106448685B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voiceprint authentication system and method based on phoneme information. The system includes a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural network classifier module based on a dropout strategy. The method includes: defining the 16 phoneme classes of the Mandarin digit-string voiceprint, so that the pronunciation class information of each digit in the string is used explicitly; obtaining the phoneme boundaries of the text content of each digit string with the Viterbi forced-alignment algorithm based on the Mandarin Chinese speech recognizer; building phoneme-dependent models with a text-independent algorithm; and scoring the phoneme-dependent models to obtain a score vector. Beneficial effects of the present invention: while achieving phoneme segmentation, phoneme modeling, and analysis of the discriminative power of the phoneme-dependent models, the invention proposes a neural network training method using a dropout strategy, which solves the problem of missing phonemes in digit strings and improves the performance of the digit-string voiceprint authentication system.

Description

Voiceprint authentication system and method based on phoneme information
Technical field
The present invention relates to the technical field of voiceprint authentication systems, and in particular to a voiceprint authentication system and method based on phoneme information.
Background art
Biometric recognition is a technology that identifies a person's identity from intrinsic physiological and behavioral characteristics of the human body. It has advantages such as being impossible to forget, good anti-counterfeiting performance, being hard to forge or steal, always being carried with the person, and being usable anytime and anywhere. With the rapid development of the Internet, traditional identity authentication techniques are increasingly unable to meet the demands of user experience and security. Voiceprint recognition technology, which is easy to use, has attracted wide attention and great importance in all walks of life because of its broad application prospects and huge social and economic benefits.
Voiceprint recognition, also known as speaker recognition, is a type of biometric identification. It identifies the speaker from speech parameters in the speech waveform that reflect the speaker's physiology and speaking behavior. It features high security and convenient data acquisition.
In recent years, text-dependent speaker recognition has become a hot topic in the field of user authentication. Owing to the major progress in the field of text-independent speaker recognition, many researchers have tried to apply text-independent speaker recognition algorithms to text-dependent applications such as digit-string voiceprint recognition.
Under digit-string authentication conditions, researchers have compared Joint Factor Analysis (JFA), Gaussian Mixture Models with Nuisance Attribute Projection (GMM-NAP), and Hidden Markov Models with Nuisance Attribute Projection (HMM-NAP). The NAP-based algorithms perform better than JFA, because training JFA requires a large amount of labeled data and there is a mismatch between the training data of the JFA matrices and the digit-string test data.
In text-independent speaker recognition, both JFA and the total-variability (iVector) algorithm with Probabilistic Linear Discriminant Analysis (PLDA) rely on large amounts of development data. More and more work is devoted to handling limited in-domain development data and transferring out-of-domain data, for example adaptation and back-off algorithms for lexical mismatch.
Digit-string speech of 536 people was recorded with Android and Apple (iOS) mobile phones to build the data set. Two scenarios are distinguished: the global condition and the rand-n condition. Under the global condition, enrollment and verification use the same digit-string content; under the rand-n condition, every digit string is a random string of length n, which is safer than the global condition in application systems that must resist replay (recording) attacks. The present invention involves the three enrollment/verification conditions shown in Table 1: fixed full digit password, dynamic 8-digit password, and dynamic 6-digit password. Each scenario is split into a development set and an evaluation set. The development set is used to train the Universal Background Model (UBM), the total-variability matrix (iVector T matrix), the Linear Discriminant Analysis (LDA) matrix, and so on. Under the three conditions of the evaluation set, each person has three enrollment utterances and one test utterance, and every test utterance is compared against all speaker models.
Table 1: Examples of the digit password formats
Table 2 compares the Equal Error Rates (EER) of the GMM-NAP and iVector voiceprint authentication systems. The results show that the performance of the voiceprint authentication systems improves significantly and consistently as the digit-string length increases. However, neither the GMM-NAP nor the iVector system exploits phoneme (phone) information; both are direct applications of text-independent voiceprint recognition to a text-dependent scenario. In digit-string voiceprint applications, ignoring phoneme information, or failing to use it effectively, limits the practical effectiveness of text-independent recognition algorithms.
Table 2: EER comparison of the GMM-NAP and iVector systems under different test conditions

              Fixed full digit password    Dynamic 8-digit password    Dynamic 6-digit password
GMM-NAP       2.09%                        2.64%                       3.76%
iVector       1.87%                        2.40%                       3.32%
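The EER metric used in Table 2 (and in Table 4 below) can be computed from raw trial scores with a short routine such as the following sketch; it assumes NumPy and a plain threshold sweep rather than any particular evaluation toolkit.

    import numpy as np

    def equal_error_rate(target_scores, impostor_scores):
        """Equal Error Rate: the error at the decision threshold where the
        false-rejection rate on target trials equals the false-acceptance
        rate on impostor trials (approximated on the discrete score grid)."""
        target_scores = np.asarray(target_scores, dtype=float)
        impostor_scores = np.asarray(impostor_scores, dtype=float)
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        frr = np.array([np.mean(target_scores < t) for t in thresholds])     # targets rejected
        far = np.array([np.mean(impostor_scores >= t) for t in thresholds])  # impostors accepted
        idx = int(np.argmin(np.abs(frr - far)))
        return (frr[idx] + far[idx]) / 2.0

    # Toy example: well-separated target and impostor scores give an EER of 0.0
    print(equal_error_rate([2.1, 1.8, 2.5, 1.9], [-0.5, 0.2, -1.0, 0.4]))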
Summary of the invention
The object of the present invention is to propose a voiceprint authentication system and method based on phoneme information that, while achieving phoneme segmentation, phoneme-dependent (phone-dependent) modeling, and analysis of the discriminative power of the phoneme-dependent models, solves the problem of missing phonemes in digit strings and improves the performance of the digit-string voiceprint authentication system.
To achieve the above technical object, the technical solution of the present invention is realized as follows:
A voiceprint authentication system based on phoneme information includes a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural network classifier module based on a dropout strategy;
the phoneme forced-alignment module based on the Mandarin Chinese speech recognizer is used to segment the digit string into the 16 phoneme classes;
the phoneme-dependent model creation module is used to build the phoneme-dependent models and to analyze the discriminative power of each phoneme-dependent model for voiceprint authentication; these models characterize the distinguishing features of the speaker rather than the differences between words;
the neural network classifier module based on the dropout strategy is used to fuse the complementary information of the phoneme-dependent models.
A voiceprint authentication method based on phoneme information includes the following steps:
S01: define the 16 phoneme classes of the Mandarin digit-string voiceprint, so that the pronunciation class information of each digit in the string is used explicitly;
S02: based on the Mandarin Chinese speech recognizer, obtain the phoneme boundaries of the text content of each digit string with the Viterbi forced-alignment algorithm, and complete the phoneme segmentation of the speech content, i.e. the mapping from speech feature vectors to phonemes, yielding the feature-vector subsets that belong to each phoneme; each subset is treated as an independent data stream for subsequent processing;
S03: build the phoneme-dependent models with a text-independent algorithm; the phoneme-dependent modeling process reduces the number of parameters of each phoneme-dependent model and avoids over-training the models;
S04: score the phoneme-dependent models to obtain the score vector.
Further, in step S04 the back-end fusion classifier is trained with the dropout strategy of the neural network algorithm.
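Purely as an illustration of the data flow in steps S01 to S04 (a sketch under stated assumptions, not the patented implementation), the following Python fragment groups frame-level features by their forced-alignment labels, enrolls one phoneme-dependent model per class, and produces the 16-dimensional score vector; train_phoneme_model and score_fn stand for a hypothetical text-independent trainer and scorer supplied by the caller.

    import numpy as np

    NUM_PHONEME_CLASSES = 16  # S01: the 16 Mandarin digit phoneme classes (see Table 3 / claim 1)

    def split_by_phoneme(features, alignment):
        """S02: group frame-level feature vectors by the phoneme class that
        Viterbi forced alignment assigned to each frame.
        features  : (T, D) array of acoustic feature vectors
        alignment : length-T sequence of phoneme-class indices in [0, 16)"""
        subsets = {i: [] for i in range(NUM_PHONEME_CLASSES)}
        for frame, ph in zip(features, alignment):
            subsets[ph].append(frame)
        return {i: np.asarray(v) for i, v in subsets.items() if v}

    def enroll(features, alignment, train_phoneme_model):
        """S03: build one phoneme-dependent model per phoneme class present in
        the enrollment speech, using a text-independent training function."""
        return {i: train_phoneme_model(x)
                for i, x in split_by_phoneme(features, alignment).items()}

    def score_vector(features, alignment, models, score_fn):
        """S04: score each phoneme-dependent model on its matching test frames;
        phonemes absent from the test utterance are marked NaN and handled
        later by the dropout-trained back-end classifier."""
        subsets = split_by_phoneme(features, alignment)
        xi = np.full(NUM_PHONEME_CLASSES, np.nan)
        for i, model in models.items():
            if i in subsets:
                xi[i] = score_fn(model, subsets[i])
        return xi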
Beneficial effects of the present invention:
(1) The present invention uses a typical Mandarin Chinese speech recognizer and the Viterbi forced-alignment algorithm to obtain the phoneme boundaries of the text content of each digit string and complete the phoneme segmentation of the speech content; this segmentation is more accurate than that of common algorithms such as Dynamic Time Warping (DTW);
(2) The present invention defines 16 pronunciation classes for the Mandarin digit-string pronunciations, which avoids the over-training of models that too few feature vectors per phoneme class would cause; it builds the phoneme-dependent models and analyzes the discriminative power of each phoneme-dependent model for voiceprint authentication; the phoneme-dependent models characterize the distinguishing features of the speaker rather than the differences between words;
(3) To further improve the exploitation of the information in the phoneme-dependent models, and considering that in practical applications the authentication speech contains only part of the phoneme set so that score-vector dimensions may be missing, a neural network back-end classifier is trained with the dropout strategy to fuse the phoneme-dependent score vectors, which markedly improves the system performance of voiceprint authentication.
Description of the drawings
Fig. 1 is the processing flow chart of the back-end classifier based on the phoneme-dependent score vector in the present invention;
Fig. 2 shows the experimental equal error rates of the different phoneme-dependent models in the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
The present invention proposes a digit-string voiceprint authentication method that explicitly uses phoneme information in combination with neural network classification. For every digit string, the Viterbi forced-alignment algorithm of a Mandarin Chinese speech recognizer completes the phoneme segmentation of the speech content. The number of training parameters of each phoneme-dependent model is reduced to avoid the over-training that the small amount of training speech per phoneme model could cause, and the discriminative power of each phoneme model for voiceprint recognition is analyzed. Because the score vector of the phoneme-dependent models may have missing dimensions, the back-end fusion classifier is trained with the dropout strategy of the neural network algorithm, which improves the exploitation of phoneme-dependent information and further improves the system performance of digit-string voiceprint authentication.
Table 3 gives the phoneme representation of the ten Mandarin digit pronunciations. Note that digit "1" has two pronunciations, "y i" and "y ao", so the ten Mandarin digit pronunciations share 16 phonemes in total.
Table 3: Mandarin Chinese pronunciation phonemes of the ten digits
Under the "fixed full digit password" condition, the phoneme content is fixed. The phoneme content under the "dynamic 8-digit password" and "dynamic 6-digit password" conditions is also known, because the digit text is usually pushed by a random algorithm of the background system or generated as a One-time Password (OTP) according to a specific algorithm.
Based on the Mandarin Chinese speech recognition system, the Viterbi forced-alignment algorithm is used to obtain the phoneme boundaries of each text, completing the phoneme segmentation of the speech content, i.e. the mapping from speech feature vectors to phonemes.
Therefore, given the acoustic feature vector sequence χ = x_1, ..., x_T of a digit string, it can be partitioned into discrete subsets χ_1, ..., χ_16, where x ∈ χ_i denotes a feature vector belonging to the i-th phoneme. Each subset is treated as an independent data stream for subsequent processing. In the voiceprint enrollment stage, the 16 phoneme-dependent models λ_i^s (for the i-th phoneme subset of speaker s) are trained with a text-independent algorithm. Note that the enrollment speech must cover all ten digits; in the present invention, three digit strings are used for enrollment, which guarantees that every digit appears at least once in each person's enrollment speech.
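As a small aside on the enrollment constraint just described, a hypothetical helper can check that a set of enrollment digit strings jointly contains every digit at least once (and therefore covers all 16 phoneme classes):

    def covers_all_digits(enrollment_strings):
        """True if the enrollment digit strings jointly contain each of the
        ten digits 0-9 at least once, as the enrollment stage requires."""
        seen = set()
        for s in enrollment_strings:
            seen.update(ch for ch in s if ch.isdigit())
        return seen >= set("0123456789")

    print(covers_all_digits(["0123456789", "13579", "24680"]))  # True
    print(covers_all_digits(["1234", "5678"]))                  # False: 0 and 9 are missing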
During voiceprint authentication, under the "fixed full digit password" condition a 16-dimensional score vector ξ = (ξ_1, ..., ξ_16) is obtained, where ξ_i is the score of the test frames aligned to the i-th phoneme against the model λ_i^s. The decision can then be made by averaging the score vector ξ or by training a back-end classifier such as logistic regression. Under the rand-n conditions such as "dynamic 8-digit password" and "dynamic 6-digit password", however, entries of the score vector ξ may be missing, because the test speech contains only part of the phoneme set. To solve this problem, the dropout strategy of neural network algorithms is adopted, which is an effective way of improving generalization.
The dropout training algorithm of the neural network is standard Stochastic Gradient Descent, except that during the forward computation certain input units and hidden-layer units are ignored at random with probability γ, and only the active units participate in back-propagation and gradient computation. Because dropout is not applied at verification time, the output of every layer is rescaled during training:

    y^l = (1 / (1 - γ)) · m^l * δ(W^l y^(l-1) + b^l)

where δ(·), W^l, and b^l are the activation function, the weights of layer l, and the bias of layer l, respectively; m^l is the binary mask indicating which dimensions are dropped, and * denotes element-wise multiplication.
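A minimal NumPy sketch of the rescaled (inverted) dropout forward pass written above; the sigmoid activation, the hidden-layer size, and γ = 0.2 are illustrative assumptions rather than values fixed by the invention.

    import numpy as np

    def dropout_layer(y_prev, W, b, gamma, training=True, rng=np.random.default_rng(0)):
        """One layer with dropout: zero each output unit with probability gamma
        during training and rescale the survivors by 1/(1 - gamma); at
        verification time the layer is used unchanged, so expected activations match."""
        y = 1.0 / (1.0 + np.exp(-(W @ y_prev + b)))  # delta(W^l y^(l-1) + b^l), sigmoid here
        if training:
            mask = (rng.random(y.shape) >= gamma).astype(y.dtype)  # binary mask m^l
            y = mask * y / (1.0 - gamma)  # dropped units also do not receive gradients
        return y

    # 16-dimensional score vector -> 32 hidden units, dropout probability gamma = 0.2
    xi = np.zeros(16)
    h = dropout_layer(xi, W=np.zeros((32, 16)), b=np.zeros(32), gamma=0.2)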
The above process can be regarded as an effective model-averaging method, i.e. training yields the average representation of the different networks, with missing inputs, that share a large number of weights. As shown in Fig. 1, a neural network classifier containing one hidden layer is trained. Its input is the score vector, and its output contains two units representing the target-verification class and the impostor-verification class. To handle the missing vector dimensions under the rand-n conditions such as "dynamic 8-digit password" and "dynamic 6-digit password", the network is trained with the dropout strategy applied to the input layer with probability γ. In the verification stage, the following log-likelihood ratio is computed as the system output:

    LLR(ξ) = log p(ξ | target-verification class) - log p(ξ | impostor-verification class)

where p(ξ | target-verification class) and p(ξ | impostor-verification class) are the likelihoods of the score vector ξ. Through Bayes' formula, the likelihoods can be expressed in terms of posteriors:

    p(ξ | target-verification class) = p(target-verification class | ξ) · p(ξ) / P(target-verification class)
    p(ξ | impostor-verification class) = p(impostor-verification class | ξ) · p(ξ) / P(impostor-verification class)

where p(target-verification class | ξ) and p(impostor-verification class | ξ) are the posteriors of the score vector ξ obtained by the forward computation of the network, and P(target-verification class) and P(impostor-verification class) are the priors of the target-verification class and the impostor-verification class estimated from the training set. p(ξ) is independent of either class and can be ignored when computing the LLR.
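The verification-time output can be sketched as follows, assuming the two softmax outputs of the trained network are the posteriors p(target | ξ) and p(impostor | ξ) and that the class priors were estimated on the training set; the function and argument names are illustrative.

    import numpy as np

    def llr_from_posteriors(p_target_given_xi, p_impostor_given_xi,
                            prior_target, prior_impostor, eps=1e-12):
        """LLR(xi) = log p(xi | target) - log p(xi | impostor)
                   = [log p(target | xi) - log P(target)]
                     - [log p(impostor | xi) - log P(impostor)],
        since p(xi) cancels in the Bayes conversion."""
        return (np.log(p_target_given_xi + eps) - np.log(prior_target)) \
             - (np.log(p_impostor_given_xi + eps) - np.log(prior_impostor))

    # Example: posteriors from the network's two output units, equal class priors
    print(llr_from_posteriors(0.92, 0.08, prior_target=0.5, prior_impostor=0.5))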
The discriminative power of each phoneme model for voiceprint recognition is analyzed first. Considering that the training speech features of each phoneme model are few, the number of training parameters of each phoneme-dependent model is reduced in order to avoid over-training. Fig. 2 compares the equal error rates of the individual phoneme-dependent models.
As can be seen from Fig. 2, first, iVector outperforms the GMM-NAP model by a small margin on all phoneme-dependent models. Second, the EER of the worst-performing consonant "w" is about five times the EER of the best-performing vowel "an". This experimental result is instructive for practical applications: an online system can avoid pushing digits with poor performance, such as "5 [wu]".
A dropout neural network back-end classifier is trained to fuse the phoneme-dependent score vectors into a single output. Table 4 compares the equal error rates obtained by the phoneme-dependent models with different back-end classifiers. For convenience of comparison, the authentication performance obtained by averaging the phoneme-dependent scores of the GMM-NAP and iVector systems is also given. The score-averaging formula is:

    ξ_avg(χ_test) = (1/16) · Σ_{i=1}^{16} ξ_i
Table 4: EER comparison of the phoneme-dependent models with different back-end classifiers
As can be seen from Table 4, the proposed algorithm, which explicitly uses phoneme information together with neural-network back-end fusion, effectively improves the system performance of digit-string voiceprint authentication. Compared with score averaging, the neural network back-end classifier achieves a lower equal error rate and better performance. Compared with the GMM-NAP and iVector results in Table 2, the algorithm combining phoneme-dependent models with the neural network back-end classifier achieves a relative EER reduction of about 20% under the three different enrollment/verification conditions.
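For the score-averaging baseline referred to in Table 4, a minimal sketch is given below; averaging only over the entries that are present (NaN marking a phoneme absent from the test utterance) is an assumption made here for the rand-n conditions rather than something specified by the invention.

    import numpy as np

    def average_score(xi):
        """Score-averaging baseline: mean of the phoneme-dependent scores,
        skipping NaN entries for phonemes the test utterance does not contain."""
        return float(np.nanmean(np.asarray(xi, dtype=float)))

    print(average_score([1.2, float("nan"), 0.8] + [1.0] * 13))  # mean over the 15 observed scores

Unlike this fixed average, the dropout-trained back end can weight discriminative phonemes (such as the vowel "an") more heavily than weak ones (such as the consonant "w"), which is consistent with the gains reported in Table 4.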
The above description covers only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall be included in the scope of protection of the present invention.

Claims (2)

1. A voiceprint authentication system based on phoneme information, characterized in that it comprises a phoneme forced-alignment module based on a Mandarin Chinese speech recognizer, a phoneme-dependent model creation module, and a neural network classifier module based on a dropout strategy;
the phoneme forced-alignment module based on the Mandarin Chinese speech recognizer is used to segment the digit string into the 16 phoneme classes; after segmentation, the pronunciation phonemes corresponding to digit 0 are li, ig; to digit 1 are yi, i, ao; to digit 2 is er; to digit 3 are s, an; to digit 4 are s, ih; to digit 5 are w, u; to digit 6 are li, ou; to digit 7 are qi, i; to digit 8 are b, a; and to digit 9 are ji, ou;
the phoneme-dependent model creation module is used to build the phoneme-dependent models and to analyze the discriminative power of each phoneme-dependent model for voiceprint authentication;
the neural network classifier module based on the dropout strategy is used to fuse the complementary information of the phoneme-dependent models: a dropout neural network back-end classifier is trained to fuse the phoneme-dependent score vector and produce the output, and the authentication performance obtained by averaging the phoneme-dependent scores is also given, with the score-averaging formula:

    ξ_avg(χ_test) = (1/16) · Σ_{i=1}^{16} ξ_i

where ξ_avg is the average of the 16-dimensional score vector, ξ_i is a component of the 16-dimensional score vector, and χ_test is the acoustic feature vector sequence of the test speech.
2. A voiceprint authentication method using the voiceprint authentication system based on phoneme information according to claim 1, characterized by comprising the following steps:
S01: define the 16 phoneme classes of the Mandarin digit-string voiceprint, so that the pronunciation class information of each digit in the string is used explicitly;
S02: based on the Mandarin Chinese speech recognizer, obtain the phoneme boundaries of the text content of each digit string with the Viterbi forced-alignment algorithm, complete the phoneme segmentation of the speech content, and obtain the feature-vector subsets belonging to each phoneme;
S03: build the phoneme-dependent models with a text-independent algorithm;
S04: score the phoneme-dependent models to obtain the score vector, wherein the back-end fusion classifier is trained with the dropout strategy of the neural network algorithm.
CN201610880776.6A 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information Active CN106448685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610880776.6A CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610880776.6A CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Publications (2)

Publication Number Publication Date
CN106448685A CN106448685A (en) 2017-02-22
CN106448685B (en) 2019-11-22

Family

ID=58172115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610880776.6A Active CN106448685B (en) 2016-10-09 2016-10-09 Voiceprint authentication system and method based on phoneme information

Country Status (1)

Country Link
CN (1) CN106448685B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198574B (en) * 2017-12-29 2020-12-08 科大讯飞股份有限公司 Sound change detection method and device
CN108648760B (en) * 2018-04-17 2020-04-28 四川长虹电器股份有限公司 Real-time voiceprint identification system and method
CN109065023A (en) * 2018-08-23 2018-12-21 广州势必可赢网络科技有限公司 A kind of voice identification method, device, equipment and computer readable storage medium
CN110875044B (en) * 2018-08-30 2022-05-03 中国科学院声学研究所 A speaker recognition method based on word correlation score calculation
CN110111798B (en) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 Method, terminal and computer readable storage medium for identifying speaker
CN110689895B (en) * 2019-09-06 2021-04-02 北京捷通华声科技股份有限公司 Voice verification method and device, electronic equipment and readable storage medium
CN111243603B (en) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111341320B (en) * 2020-02-28 2023-04-14 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111785284B (en) * 2020-08-19 2024-04-30 科大讯飞股份有限公司 Text-independent voiceprint recognition method, device and equipment based on phoneme assistance
CN114093371B (en) * 2021-10-11 2024-12-03 浙江大学 Phoneme-level voiceprint recognition adversarial sample construction system and method based on neural network generation model
CN114299921B (en) * 2021-12-07 2022-11-18 浙江大学 Voiceprint security scoring method and system for voice command
CN115966210A (en) * 2022-12-06 2023-04-14 四川启睿克科技有限公司 Phoneme selection method and device for voiceprint recognition
CN115831120B (en) * 2023-02-03 2023-06-16 北京探境科技有限公司 Corpus data acquisition method and device, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101467204A (en) * 2005-05-27 2009-06-24 普提克斯科技股份有限公司 Method and system for bio-metric voice print authentication
CN204465555U (en) * 2015-04-14 2015-07-08 时代亿宝(北京)科技有限公司 Based on the voiceprint authentication apparatus of time type dynamic password
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033041A1 (en) * 2004-07-12 2007-02-08 Norton Jeffrey W Method of identifying a person based upon voice analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101467204A (en) * 2005-05-27 2009-06-24 普提克斯科技股份有限公司 Method and system for bio-metric voice print authentication
CN204465555U (en) * 2015-04-14 2015-07-08 时代亿宝(北京)科技有限公司 Based on the voiceprint authentication apparatus of time type dynamic password
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Taotao, "Research on speech voiceprint password verification technology", China Master's Theses Full-text Database, Information Science and Technology, 2016-09-15, No. 9, I136-43, pages 9-17 *

Also Published As

Publication number Publication date
CN106448685A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN106448685B (en) Voiceprint authentication system and method based on phoneme information
Hansen et al. Speaker recognition by machines and humans: A tutorial review
CN103456304B (en) For the dual methods of marking and system with the relevant speaker verification of text
Rúa et al. Online signature verification based on generative models
CN108766445A (en) Method for recognizing sound-groove and system
KR20060070603A (en) Method and device for verifying two-stage speech in speech recognition system
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
CN108091326A (en) A kind of method for recognizing sound-groove and system based on linear regression
Safavi et al. Fraud detection in voice-based identity authentication applications and services
CN101540170A (en) Voiceprint recognition method based on biomimetic pattern recognition
Maghsoodi et al. Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors
CN114360553A (en) Method for improving voiceprint safety
CN110390948A (en) A kind of method and system of Rapid Speech identification
Li et al. Cost‐Sensitive Learning for Emotion Robust Speaker Recognition
KR20240132372A (en) Speaker Verification Using Multi-Task Speech Models
Büyük Sentence‐HMM state‐based i‐vector/PLDA modelling for improved performance in text dependent single utterance speaker verification
Zhonghua et al. An overview of modeling technology of speaker recognition
Georgescu et al. GMM-UBM modeling for speaker recognition on a Romanian large speech corpora
Saleema et al. Voice biometrics: the promising future of authentication in the internet of things
Laskar et al. HiLAM-aligned kernel discriminant analysis for text-dependent speaker verification
CN111081261B (en) Text-independent voiceprint recognition method based on LDA
Siu et al. Discriminatively trained GMMs for language classification using boosting methods
Al-Tekreeti et al. Speaker voice recognition using a hybrid PSO/fuzzy logic system
Mohamed et al. An Overview of the Development of Speaker Recognition Techniques for Various Applications.
WO2009110613A1 (en) Personal collation device and speaker registration device, and method and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee after: Beijing Yuan Jian Polytron Technologies Inc.

Address before: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee before: Beijing Yuanjian Technologies Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231010

Address after: No. 016, Xiaocuigezhuang Village, Gaolou Town, Sanhe City, Langfang City, Hebei Province, 065200

Patentee after: Liu Xuefeng

Address before: East Zone 9A, 9th Floor, Building 1, No. 158 West Fourth Ring North Road, Haidian District, Beijing, 100142

Patentee before: Beijing Yuan Jian Polytron Technologies Inc.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240218

Address after: Room 320, 3rd Floor, Building A, No. 119 West Fourth Ring North Road, Haidian District, Beijing, 100000

Patentee after: Beijing Yuanjian Information Technology Co.,Ltd.

Country or region after: China

Address before: No. 016, Xiaocuigezhuang Village, Gaolou Town, Sanhe City, Langfang City, Hebei Province, 065200

Patentee before: Liu Xuefeng

Country or region before: China