
CN113488073A - Multi-feature fusion based counterfeit voice detection method and device - Google Patents

Multi-feature fusion based counterfeit voice detection method and device Download PDF

Info

Publication number
CN113488073A
Authority
CN
China
Prior art keywords
speech
features
fusion
feature
forged
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110762591.6A
Other languages
Chinese (zh)
Other versions
CN113488073B (en)
Inventor
陈晋音
叶林辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110762591.6A priority Critical patent/CN113488073B/en
Publication of CN113488073A publication Critical patent/CN113488073A/en
Application granted granted Critical
Publication of CN113488073B publication Critical patent/CN113488073B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses a method and device for detecting forged voice based on multi-feature fusion, comprising: extracting multiple classes of features from the voice, and fusing the extracted features through feature scaling and a feature balance matrix to obtain fusion features that incorporate the information in the voice as fully as possible; the fusion features are then used to train a forged voice detection model based on a long short-term memory network, so as to detect forged voice generated by various voice forgery methods.

Description

Multi-feature fusion based counterfeit voice detection method and device
Technical Field
The invention belongs to the field of deep learning security, and particularly relates to a method and a device for detecting forged voice based on multi-feature fusion.
Background
Voice forgery generates the voice of a specific speaker through technical means; compared with video forgery, it is harder to detect and applicable in a wider range of scenarios. Voiceprint locks, such as WeChat's, can be broken by forged voice, which raises security concerns over property and privacy.
Forged voice can be synthesized by parameter generation, waveform splicing, voice imitation, and other techniques. Most methods for detecting parameter-generated forged voice rely on specific parameters of the forged voice: the dynamic variation of the parameters of voice generated from parameters is often smaller than that of natural voice. In particular, the higher-order cepstral coefficients, which reflect the details of the spectral envelope, tend to be smoothed during the training and generation of hidden Markov model parameters, so the higher-order cepstral components of parameter-generated voice vary less than those of natural voice. Although this difference provides a way to distinguish real voice from parameter-generated voice, it presupposes full knowledge of a particular hidden Markov voice parameter generation system; the same countermeasure may therefore not apply to forged voice produced by generators that use different acoustic parameters.
Because waveform-splicing voice forgery is simple to perform, it is widely used. For forged voice produced by waveform concatenation, a detector compares a new access sample with stored instances of past access attempts. With the rapid development of deep learning, voice anti-spoofing detection systems based on deep learning have attracted attention and can detect imitated voice. Deep neural networks, a special class of machine learning systems, have become widely used models in recent years and achieve nearly the best results when applied to tasks such as biometric recognition. However, existing deep-learning-based forged voice detection techniques handle only a single type of forgery: they can detect only voice generated by one particular forgery method.
Disclosure of Invention
Aiming at the weak generalization of conventional forged voice detection methods, which can detect only one particular kind of forged voice, the invention provides a forged voice detection method based on multi-feature fusion: multiple voice features are fused through a feature balance matrix to construct balanced fusion features, and the constructed fusion features are used to train a forged voice detection model based on a long short-term memory network, realizing the detection of various kinds of forged voice.
In a first aspect, an embodiment of the present invention provides a method for detecting a forged voice based on multi-feature fusion, including the following steps:
acquiring the forged voice and the corresponding normal voice, and constructing labels of the forged voice and the normal voice;
performing the following process on the forged voice and the normal voice respectively: extracting features from the voice to obtain multi-class features of different dimensions, scaling the multi-class features of different dimensions to the same dimension to obtain multi-class base features, and then fusing the multi-class base features with a feature balance matrix to obtain fusion features;
constructing a forged voice detection model composed of a long short-term memory network and a fully connected neural network, and performing supervised learning on the forged voice detection model with the fusion features and labels of the forged voice and of the normal voice, so as to optimize the model parameters of the forged voice detection model and the feature balance matrix;
in application, after the fusion features of the voice to be detected are obtained with the feature balance matrix determined by the fusion weight parameters, the fusion features of the voice to be detected are detected with the parameter-optimized forged voice detection model to output a detection result.
Preferably, the multi-class features obtained by feature extraction of the voice include: fundamental frequency, mel cepstral coefficients, aperiodic components, mel spectrum, energy spectrum, spectrum, linear prediction coefficients, and linear prediction cepstral coefficients.
Preferably, nearest neighbor interpolation is used to scale the multi-class features of different dimensions to the same dimension to obtain the multi-class base features.
Preferably, the feature balance matrix is composed of fusion weight parameters, and the fusion weight parameters are initialized with random numbers conforming to a normal distribution.
Preferably, when supervised learning is performed on the forged voice detection model, a cross-entropy loss function is adopted as the optimization target of the model.
In a second aspect, an embodiment further provides a device for detecting a forged voice based on multi-feature fusion, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above-mentioned method for detecting a forged voice based on multi-feature fusion when executing the computer program.
The technical scheme provided by the embodiments has at least the following beneficial effects:
Multiple features are extracted from the voice and fused through feature scaling and a feature balance matrix to obtain fusion features that incorporate the information in the voice as fully as possible; a forged voice detection model based on a long short-term memory network is then trained with the fusion features, realizing the detection of forged voice generated by various voice forgery methods.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for detecting forged voice based on multi-feature fusion provided by an embodiment;
fig. 2 is a flow diagram of multi-feature fusion provided by an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Voice forgery techniques usually model only one or a few characteristics of the voice to synthesize forged voice, and conventional forged voice detection methods can detect only specific forgeries and generalize poorly. Accordingly, this embodiment provides a forged voice detection method based on multi-feature fusion: multiple features are extracted from the voice and fused through feature scaling and a feature balance matrix to obtain fusion features, and the fusion features are used to train a forged voice detection model based on a long short-term memory network, so as to detect forged voice generated by various voice forgery methods.
FIG. 1 is a flowchart of a method for detecting forged voice based on multi-feature fusion provided by an embodiment; fig. 2 is a flow diagram of multi-feature fusion provided by an embodiment. As shown in fig. 1 and fig. 2, the method for detecting a forged voice based on multi-feature fusion provided by the embodiment includes the following steps:
step1, acquiring the fake voice and the corresponding normal voice, and constructing labels of the fake voice and the normal voice.
In the embodiment, forged voice generated by methods such as parameter generation, waveform splicing, and deep learning, together with the corresponding normal voice, can be collected. The forged voice and the normal voice form the data set for training the forged voice detection model based on the long short-term memory network; the class of normal voice is labeled 0 and the class of forged voice is labeled 1.
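As a minimal sketch of this step (the data/normal and data/forged directory names are assumptions for illustration, not part of the patent), the labeled data set can be assembled as follows:

```python
from pathlib import Path

# Illustrative sketch only: the patent does not prescribe a file layout.
# Hypothetical folders data/normal and data/forged hold the two classes;
# normal voice is labeled 0 and forged voice is labeled 1.
dataset = [(str(p), 0) for p in sorted(Path("data/normal").glob("*.wav"))] \
        + [(str(p), 1) for p in sorted(Path("data/forged").glob("*.wav"))]
```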
Step 2: extract features from the voice to obtain multi-class features of different dimensions.
In an embodiment, features such as fundamental frequency, mel cepstral coefficients, aperiodic components, mel spectrum, energy spectrum, spectrum, linear prediction coefficients, and linear prediction cepstral coefficients may be extracted. The number of features is chosen according to the actual situation; the more features are extracted, the wider the range of forged voice the model can detect. Feature extraction does not require a deep model and can be performed with conventional voice feature extraction methods. For example, mel cepstral coefficients are extracted as follows:
Step 1: pre-emphasize the voice signal x(n). Pre-emphasis is implemented with the transfer function H(z) = 1 - αz^(-1), where α is the pre-emphasis coefficient and 0.9 < α < 1.0; the pre-emphasized signal is y(n) = x(n) - α·x(n-1).
Step 2: frame and window the pre-emphasized voice. Framing of the voice signal is realized by weighting with a movable window of finite length. Typically there are about 33-100 frames per second. Framing generally uses overlapping segmentation; the overlap between adjacent frames is called the frame shift, and the ratio of frame shift to frame length usually lies between 0 and 0.5. The window used is the Hamming window:

w(n) = 0.54 - 0.46·cos(2πn/(N-1)),  0 ≤ n ≤ N-1   (1)

where N is the frame length.
Step 3: apply the discrete Fourier transform to the pre-processed signal to obtain the discrete spectrum X(k):

X(k) = Σ_{n=0}^{N-1} y(n)·e^(-j2πnk/N),  0 ≤ k ≤ N-1   (2)
step 4: inputting X (k) into a Mel filter bank, and then taking logarithm to obtain a logarithmic spectrum:
Figure BDA0003150514100000061
wherein Hm(k) Is a band pass filter.
Step 5: transform S(m) to the cepstral domain by the discrete cosine transform; the resulting mel cepstral coefficients are:

C(n) = Σ_{m=0}^{M-1} S(m)·cos( πn(m + 0.5)/M ),  n = 1, 2, …, L   (4)

where L is the order of the mel cepstral coefficients.
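The steps above can be sketched with librosa; the sampling rate, the frame parameters (n_fft=512, hop_length=256), the coefficient count, and α = 0.97 are illustrative assumptions within the ranges given above, not values fixed by the patent:

```python
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=13, alpha=0.97):
    y, sr = librosa.load(path, sr=16000)            # assumed sampling rate
    y = np.append(y[0], y[1:] - alpha * y[:-1])     # step 1: y(n) = x(n) - a*x(n-1)
    # steps 2-5: Hamming-windowed frames -> DFT -> mel filter bank -> log -> DCT
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256, window="hamming")
```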
and 3, respectively zooming the multi-class features with different dimensions to the same dimension to obtain the multi-class base features.
As shown in fig. 2, the extracted multi-class features are fused by concatenation. If the features were concatenated directly, a feature imbalance problem would arise: features of larger dimension would overwhelm or mask features of smaller dimension, so that small-dimension features contribute too little to discrimination while large-dimension features dominate. Therefore, the extracted voice features are first scaled so that every feature becomes a base feature of the same dimension.
In an embodiment, nearest neighbor interpolation is used to scale the features; the matrix obtained by scaling the features extracted from the input voice is called the base matrix. The specific process is as follows:

Feature 1, feature 2, …, feature n in the figure denote the n features extracted from the voice, such as the fundamental frequency and the mel cepstrum. Suppose an original feature matrix has dimensions A = [H_src, W_src] and the scaled feature has dimensions B = [H_dst, W_dst]. The width scaling factor f_x and the height scaling factor f_y are then

f_x = W_src / W_dst,  f_y = H_src / H_dst   (5)

During scaling, the source indices

i_src = round(i·f_y),  j_src = round(j·f_x)

must be rounded to integers, where i ∈ [0, H_dst] and j ∈ [0, W_dst].
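A minimal NumPy sketch of this nearest-neighbor scaling (the function and variable names are illustrative):

```python
import numpy as np

def nn_scale(feat, h_dst, w_dst):
    """Scale a 2-D feature matrix to (h_dst, w_dst) by nearest-neighbor interpolation."""
    h_src, w_src = feat.shape
    f_y, f_x = h_src / h_dst, w_src / w_dst    # scaling factors, formula (5)
    rows = np.minimum(np.round(np.arange(h_dst) * f_y).astype(int), h_src - 1)
    cols = np.minimum(np.round(np.arange(w_dst) * f_x).astype(int), w_src - 1)
    return feat[np.ix_(rows, cols)]            # rounded source indices
```

Each extracted feature matrix is passed through nn_scale with the same target dimensions, yielding the stack of same-sized base features.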
Step 4: fuse the multi-class base features with the feature balance matrix to obtain the fusion features.
In an embodiment, the base matrix is further processed to obtain the fusion matrix. Feature scaling makes the features dimensionally consistent, but concatenating them directly would still leave a certain feature imbalance: the contribution of each base feature to the model's discrimination remains unbalanced, in that the higher the original dimension of a feature before scaling, the greater its contribution during discrimination. A feature balance matrix is therefore introduced. This matrix is a trainable parameter, so the trained model can assign fusion weights according to each feature's contribution to classification. The specific steps are as follows:

Step 1: initialize the feature balance matrix. The feature balance matrix is a trainable parameter matrix whose entries are initialized with random numbers conforming to a normal distribution.

Step 2: place the feature balance matrix into the model for training, so that during training the matrix balances the contribution of each base feature to the model's classification. The base features are fused through the feature balance matrix as follows:

F_u = F × [w_1, w_2, …, w_n]^T   (6)

where W = [w_1, w_2, …, w_n] denotes the feature balance matrix, F denotes the base features to be fused, and F_u denotes the fusion feature. On the basis of the fusion features, a forged voice discrimination model is built to judge whether the voice is genuine or forged.
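In a framework such as PyTorch, the feature balance matrix can be realized as a trainable weight vector initialized from a normal distribution; this is a sketch of formula (6) under that assumption, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class FeatureBalanceFusion(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        # W = [w1, ..., wn], initialized with normally distributed random numbers
        self.w = nn.Parameter(torch.randn(n_features))

    def forward(self, base_feats):
        # base_feats: (n, T, D) stack of the n scaled base features
        # Fu = sum_k w_k * F_k, i.e. formula (6) applied across the stack
        return torch.einsum("n,ntd->td", self.w, base_feats)
```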
Step 5: construct and train the forged voice detection model.
Since voice is a time-series signal and adjacent frames are correlated, a forged voice detection model based on a long short-term memory (LSTM) network is constructed; the LSTM network can model the time series and extract features along the time dimension to detect forged voice. The constructed model is shown in fig. 1, and the specific construction steps are as follows:

Step 1: build the forged voice detection model based on the LSTM network. As shown in fig. 1, the model consists of an LSTM network and a fully connected neural network, and its input is the extracted fusion features. The LSTM network extracts features from the input fusion features along the time dimension, and the extracted features are classified by the fully connected neural network. Because the input voice is judged to be either normal or forged, the last layer of the model consists of two neurons that output the discrimination result.
Step 2: train the forged voice detection model based on the LSTM network with the data set constructed in step 1. The LSTM network contains three gating units: an input gate, a forget gate, and an output gate. Each gate is in fact a fully connected layer, and the network parameters are updated as in equations (7) to (11):

g_i^(t) = f(U_i·x^(t) + W_i·h^(t-1) + b_i)   (7)

g_f^(t) = f(U_f·x^(t) + W_f·h^(t-1) + b_f)   (8)

g_o^(t) = f(U_o·x^(t) + W_o·h^(t-1) + b_o)   (9)

s^(t) = g_f^(t) ⊙ s^(t-1) + g_i^(t) ⊙ f(U_s·x^(t) + W_s·h^(t-1) + b_s)   (10)

h^(t) = g_o^(t) ⊙ f(s^(t))   (11)
where s^(t) denotes the cell state, h^(t) the hidden state, g_i the input gate, g_f the forget gate, g_o the output gate, and f the activation function; t denotes the current time step, b the bias, U the input-to-hidden weights, W the hidden-to-hidden weights, and ⊙ element-wise multiplication.
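Equations (7) to (11) correspond to one LSTM time step, sketched below in NumPy; the dictionary-of-weights layout is an illustrative choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, U, W, b):
    """One LSTM time step; U, W, b are dicts keyed by 'i', 'f', 'o', 's'."""
    g_i = sigmoid(U['i'] @ x + W['i'] @ h_prev + b['i'])   # input gate, eq. (7)
    g_f = sigmoid(U['f'] @ x + W['f'] @ h_prev + b['f'])   # forget gate, eq. (8)
    g_o = sigmoid(U['o'] @ x + W['o'] @ h_prev + b['o'])   # output gate, eq. (9)
    s = g_f * s_prev + g_i * np.tanh(U['s'] @ x + W['s'] @ h_prev + b['s'])  # eq. (10)
    h = g_o * np.tanh(s)                                   # eq. (11)
    return h, s
```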
Step 3: a cross-entropy loss function is adopted as the optimization target of the model:

L = -(1/n) Σ_{i=1}^{n} p(x_i)·log q(x_i)
where x_i denotes the i-th input sample, n the total number of samples in the training set, p(x_i) the true class label of sample x_i, and q(x_i) the class label predicted by the model for input sample x_i.
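A minimal training loop under these choices; torch.nn.CrossEntropyLoss combines the softmax and the loss above, and the optimizer and learning rate are assumptions:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # cross-entropy optimization target
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:         # fusion features and 0/1 labels
            optimizer.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            # updates the LSTM, the classifier and, if registered as a
            # submodule of the model, the feature balance weights as well
            optimizer.step()
```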
Step 6: apply the forged voice detection model.
In application, after the fusion features of the voice to be detected are obtained with the feature balance matrix determined by the trained fusion weight parameters, the fusion features of the voice to be detected are detected with the parameter-optimized forged voice detection model to output the detection result.
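Detection of an utterance under test then reduces to a single forward pass (a sketch; model refers to the illustrative detector above):

```python
import torch

@torch.no_grad()
def detect(model, fused_feat):
    """fused_feat: (T, D) fusion features of the voice under test."""
    model.eval()
    logits = model(fused_feat.unsqueeze(0))   # add a batch dimension
    return logits.argmax(dim=1).item()        # 0 = normal voice, 1 = forged voice
```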
The embodiment also provides a device for forged voice detection based on multi-feature fusion, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the above method for detecting forged voice based on multi-feature fusion, comprising the following steps:
step1, acquiring the fake voice and the corresponding normal voice, and constructing labels of the fake voice and the normal voice.
And 2, performing feature extraction on the voice to obtain various features with different dimensionalities.
And 3, respectively zooming the multi-class features with different dimensions to the same dimension to obtain the multi-class base features.
And 4, fusing the multi-class base characteristics by using the abnormal balance matrix to obtain fused characteristics.
And 5, constructing and training a forged voice detection model.
In the above method and device for forged voice detection based on multi-feature fusion, multiple features are extracted from the voice and fused through feature scaling and a feature balance matrix to obtain fusion features that incorporate the information in the voice as fully as possible; a forged voice detection model based on a long short-term memory network is then trained with the fusion features, realizing the detection of forged voice generated by various voice forgery methods.
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the invention; any modification, supplement, or equivalent substitution made within the scope of the principles of the present invention shall be included in the protection scope of the present invention.

Claims (6)

1. A forged voice detection method based on multi-feature fusion, characterized by comprising the following steps:

acquiring forged voice and corresponding normal voice, and constructing labels for the forged voice and the normal voice;

performing the following process on the forged voice and the normal voice respectively: extracting features from the voice to obtain multi-class features of different dimensions, scaling the multi-class features of different dimensions to the same dimension to obtain multi-class base features, and then fusing the multi-class base features with a feature balance matrix to obtain fusion features;

constructing a forged voice detection model composed of a long short-term memory network and a fully connected neural network, and performing supervised learning on the forged voice detection model with the fusion features and labels of the forged voice and of the normal voice, so as to optimize the model parameters of the forged voice detection model and the feature balance matrix;

in application, after the fusion features of the voice to be tested are obtained with the feature balance matrix determined by the fusion weight parameters, detecting the fusion features of the voice to be tested with the parameter-optimized forged voice detection model to output a detection result.

2. The forged voice detection method based on multi-feature fusion according to claim 1, characterized in that the multi-class features obtained by feature extraction of the voice include: fundamental frequency, mel cepstral coefficients, aperiodic components, mel spectrum, energy spectrum, spectrum, linear prediction coefficients, and linear prediction cepstral coefficients.

3. The forged voice detection method based on multi-feature fusion according to claim 1, characterized in that nearest neighbor interpolation is used to scale the multi-class features of different dimensions to the same dimension to obtain the multi-class base features.

4. The forged voice detection method based on multi-feature fusion according to claim 1, characterized in that the feature balance matrix is composed of fusion weight parameters, and the fusion weight parameters are initialized with random numbers conforming to a normal distribution.

5. The forged voice detection method based on multi-feature fusion according to claim 1, characterized in that, when supervised learning is performed on the forged voice detection model, a cross-entropy loss function is adopted as the optimization target of the model.

6. A forged voice detection device based on multi-feature fusion, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the forged voice detection method based on multi-feature fusion according to any one of claims 1 to 5.
CN202110762591.6A 2021-07-06 2021-07-06 A method and device for forged speech detection based on multi-feature fusion Active CN113488073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110762591.6A CN113488073B (en) 2021-07-06 2021-07-06 A method and device for forged speech detection based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110762591.6A CN113488073B (en) 2021-07-06 2021-07-06 A method and device for forged speech detection based on multi-feature fusion

Publications (2)

Publication Number Publication Date
CN113488073A true CN113488073A (en) 2021-10-08
CN113488073B CN113488073B (en) 2023-11-24

Family

ID=77940641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110762591.6A Active CN113488073B (en) 2021-07-06 2021-07-06 A method and device for forged speech detection based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN113488073B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724693A (en) * 2021-11-01 2021-11-30 中国科学院自动化研究所 Voice judging method and device, electronic equipment and storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method, device, electronic device and storage medium for generating speech
CN116153336A (en) * 2023-04-19 2023-05-23 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion
CN117059131A (en) * 2023-10-13 2023-11-14 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition
CN117690455A (en) * 2023-12-21 2024-03-12 合肥工业大学 Partially synthesized forged speech detection method and system based on sliding window
CN119517008A (en) * 2025-01-21 2025-02-25 蚂蚁智信(杭州)信息技术有限公司 Synthetic voice detection method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019173304A1 (en) * 2018-03-05 2019-09-12 The Trustees Of Indiana University Method and system for enhancing security in a voice-controlled system
CN110444208A (en) * 2019-08-12 2019-11-12 浙江工业大学 A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
WO2019232833A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Speech differentiating method and device, computer device and storage medium
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 An RNN-based voice detection method for multiple forgery operations
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 A camouflaged speech detection method based on attention mechanism and Bi-LSTM
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019173304A1 (en) * 2018-03-05 2019-09-12 The Trustees Of Indiana University Method and system for enhancing security in a voice-controlled system
WO2019232833A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Speech differentiating method and device, computer device and storage medium
CN110444208A (en) * 2019-08-12 2019-11-12 浙江工业大学 A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 An RNN-based voice detection method for multiple forgery operations
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 A camouflaged speech detection method based on attention mechanism and Bi-LSTM
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Jinyin: "Black-box adversarial attack method for speech recognition systems", Journal of Chinese Computer Systems, pages 1 - 11 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724693A (en) * 2021-11-01 2021-11-30 中国科学院自动化研究所 Voice judging method and device, electronic equipment and storage medium
CN113724693B (en) * 2021-11-01 2022-04-01 中国科学院自动化研究所 Voice judging method and device, electronic equipment and storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method, device, electronic device and storage medium for generating speech
CN116153336A (en) * 2023-04-19 2023-05-23 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion
CN117059131A (en) * 2023-10-13 2023-11-14 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition
CN117059131B (en) * 2023-10-13 2024-03-29 南京龙垣信息科技有限公司 False audio detection method based on emotion recognition
CN117690455A (en) * 2023-12-21 2024-03-12 合肥工业大学 Partially synthesized forged speech detection method and system based on sliding window
CN117690455B (en) * 2023-12-21 2024-05-28 合肥工业大学 Partially synthesized forged speech detection method and system based on sliding window
CN119517008A (en) * 2025-01-21 2025-02-25 蚂蚁智信(杭州)信息技术有限公司 Synthetic voice detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113488073B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN113488073A (en) Multi-feature fusion based counterfeit voice detection method and device
EP3955246B1 (en) Voiceprint recognition method and device based on memory bottleneck feature
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
CN107731233B (en) Voiceprint recognition method based on RNN
KR102198273B1 (en) Machine learning based voice data analysis method, device and program
CN112541533B (en) A modified car recognition method based on neural network and feature fusion
CN114495950A (en) Voice deception detection method based on deep residual shrinkage network
Dawood et al. A robust voice spoofing detection system using novel CLS-LBP features and LSTM
Chakravarty et al. Spoof detection using sequentially integrated image and audio features
Singh A text independent speaker identification system using ANN, RNN, and CNN classification technique
Chakravarty et al. A lightweight feature extraction technique for deepfake audio detection
Wani et al. Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks
Imran et al. An analysis of audio classification techniques using deep learning architectures
Nirmal et al. A hybrid bald eagle-crow search algorithm for gaussian mixture model optimisation in the speaker verification framework
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Woubie et al. Voice quality features for replay attack detection
Namburi Speaker recognition based on mutated monarch butterfly optimization configured artificial neural network
Panda et al. Study of speaker recognition systems
CN118098247A (en) Voiceprint recognition method and system based on parallel feature extraction model
CN117976006A (en) Audio processing method, device, computer equipment and storage medium
Wang et al. Revealing the processing history of pitch-shifted voice using CNNs
CN111310836A (en) Method and device for defending voiceprint recognition integrated model based on spectrogram
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
Nguyen et al. Vietnamese speaker authentication using deep models
CN115620731A (en) A Speech Feature Extraction and Detection Method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant