CN111564163B

CN111564163B - RNN-based multiple fake operation voice detection method

Info

Publication number: CN111564163B
Application number: CN202010382185.2A
Authority: CN
Inventors: 严迪群; 乌婷婷; 王让定
Original assignee: Ningbo University
Current assignee: Ningbo University
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2023-12-15
Anticipated expiration: 2040-05-08
Also published as: CN111564163A

Abstract

The invention discloses a speech detection method for multiple forgery operations based on RNN, which includes the following steps: 1) Obtain original speech samples, perform M kinds of forgery processing on the original speech samples, and obtain the speech after M forgery operations and 1 Unprocessed original speech, perform feature extraction on the above speech, obtain the LFCC matrix of the training speech sample, and send it to the RNN classifier network for training to obtain a multi-class training model; 2) Obtain a test speech, and Perform feature extraction on the test voice, obtain the LFCC matrix of the test voice data, and send it to the RNN classifier trained in step 1) for classification. Each test voice gets an output probability, and all output probabilities are combined as the final prediction result: If If the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the forged speech that has undergone the corresponding forgery operation.

Description

A speech detection method for multiple forgery operations based on RNN

技术领域Technical field

本发明涉及语音检测方法，尤其是一种基于RNN的多种伪造操作语音检测方法。The present invention relates to a speech detection method, in particular to a speech detection method for multiple forgery operations based on RNN.

背景技术Background technique

随着语音编辑软件功能的不断增强，非专业人士也可以很容易对语音内容进行修改。如果有不法分子恶意地对语音进行伪造修改，甚至将修改后的语音用于新闻报道、司法取证以及科学研究等领域，这将会带来巨大的威胁，甚至会对社会稳定造成不可估计的影响。数字语音取证方法是对伪造操作的检测，对于鉴别音频材料的原始性与真实性有至关重要的作用，是当前多媒体取证领域的重点研究课题。With the continuous enhancement of the functions of voice editing software, non-professionals can easily modify the voice content. If some criminals maliciously forge and modify the voice, and even use the modified voice for news reporting, judicial evidence collection, scientific research and other fields, this will pose a huge threat and even have an immeasurable impact on social stability. . Digital voice forensics method is the detection of forgery operations, which plays a vital role in identifying the originality and authenticity of audio materials. It is a key research topic in the current field of multimedia forensics.

现有的数字语音取证检测技术大部分都是检测到单一的伪造操作，即取证人员假设待检测语音会经过某种特定的伪造操作。Mengyu Qiao等人提出了一种基于量化的MDCT系数及其导数的统计特征的检测算法，检测上转换和下转换的MP3音频文件，通过重新压缩和校准音频来生成参考音频信号，然后用支持向量机进行分类，实验结果表明该方法有效地检测了MP3双重压缩并可以检测数字取证的音频处理历史。如王丽华等人提出了一种基于卷积神经网络的变调语音处理历史检测，通过对三种语音库应用四种不同的变调软件进行变调，并使用CNN对语音的变调因子进行语音库内和库间以及变调方法之间的检测，其检测率达到了90％以上。Most of the existing digital voice forensic detection technologies detect a single forgery operation, that is, the forensic personnel assume that the voice to be detected will undergo a specific forgery operation. Mengyu Qiao et al. proposed a detection algorithm based on the statistical characteristics of quantized MDCT coefficients and their derivatives to detect up-converted and down-converted MP3 audio files, generate a reference audio signal by recompressing and calibrating the audio, and then use support vectors Machine classification, experimental results show that this method effectively detects MP3 double compression and can detect audio processing history for digital forensics. For example, Wang Lihua and others proposed a history detection of pitch-shifted speech processing based on convolutional neural network. They applied four different pitch-shifting software to three speech libraries to perform pitch-shifting, and used CNN to perform intra-bank and library-based integration of the pitch-shifting factors of speech. The detection rate reaches more than 90% between time and pitch modification methods.

现有的数字语音取证检测技术能够检测到单一的伪造操作，且检测率可以达到很高的水平。但在实际应用中，取证者常常无法预测伪造的具体操作，使用某一特定操作分类器进行检测可能会出现误判。Existing digital voice forensic detection technology can detect a single forgery operation, and the detection rate can reach a very high level. However, in practical applications, evidence collectors often cannot predict the specific operation of forgery, and using a certain operation classifier for detection may result in misjudgment.

目前，大多数现有适用多种伪造操作的数字取证工作都集中在数字图像领域上，针对数字语音取证的研究仍较少。在数字语音领域，骆伟祺团队设计了一个卷积神经网络模型，可以用于检测两个不同音频编辑软件中默认设置的音频处理操作，并提供了较好的结果，该方法可以显着优于现有的基于手工特征的取证方法。然而，该实验虽然开创性地对语音的多种伪造操作检测进行了研究，却存在着一些不容忽视的问题，如计算的复杂度过高、所针对伪造操作的应用场景过于理想等。At present, most existing digital forensics work that is applicable to various forgery operations focuses on the field of digital images, and there is still little research on digital voice forensics. In the field of digital speech, Luo Weiqi's team designed a convolutional neural network model that can be used to detect audio processing operations with default settings in two different audio editing software, and provides better results. This method can be significantly better than existing methods. Some forensic methods are based on manual features. However, although this experiment conducted pioneering research on the detection of various forgery operations in speech, there are still some problems that cannot be ignored, such as the complexity of the calculation is too high and the application scenarios for the forgery operations are too ideal.

发明内容Contents of the invention

本发明所要解决的技术问题是针对上述现有技术存在的不足，提供一种基于RNN的多种伪造操作语音检测方法，能够提高检测准确率。The technical problem to be solved by the present invention is to provide an RNN-based voice detection method for multiple forgery operations in view of the deficiencies in the above-mentioned existing technologies, which can improve the detection accuracy.

本发明解决上述技术问题所采用的技术方案为：一种基于RNN的多种伪造操作语音检测方法，其特征在于：包括如下步骤：The technical solution adopted by the present invention to solve the above technical problems is: a multiple forgery operation speech detection method based on RNN, which is characterized by: including the following steps:

1)训练网络：获取原始语音样本，对所述原始语音样本进行M种伪造处理，得到M个伪造操作后的语音和1个未经处理的原始语音，对上述M个伪造后的语音和1个原始语音进行特征提取，得到训练语音样本的LFCC矩阵，送入RNN分类器网络中进行训练，得到一个多分类的训练模型；1) Training network: Obtain original voice samples, perform M forgery processes on the original voice samples, and obtain M forged voices and 1 unprocessed original voice. Features are extracted from the original speech to obtain the LFCC matrix of the training speech sample, which is sent to the RNN classifier network for training to obtain a multi-classification training model;

2)语音识别：得到一段测试语音，对该测试语音进行特征提取，得到测试语音数据的LFCC矩阵，送入由步骤1)训练好的RNN分类器中进行分类，每一个测试语音得到一个输出概率，合并所有输出概率作为最后的预测结果：如果预测结果是原始语音，则测试语音被识别为原始语音；如果预测结果是经过某一伪造操作的语音，则测试语音被识别为进行相应伪造操作的伪造语音。2) Speech recognition: Obtain a test speech, perform feature extraction on the test speech, obtain the LFCC matrix of the test speech data, and send it to the RNN classifier trained in step 1) for classification. Each test speech gets an output probability. , combine all output probabilities as the final prediction result: if the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the corresponding forgery operation. Fake voice.

优选的，在步骤1)和2)中，得到LFCC矩阵的步骤为：Preferably, in steps 1) and 2), the steps to obtain the LFCC matrix are:

1)FFT：首先先对语音进行预处理，计算每一个语音帧经过FFT后的频谱能量E(i,k)：1) FFT: First, preprocess the speech and calculate the spectrum energy E(i,k) of each speech frame after FFT:

其中，i为语音帧数，k为频率分量，x_i(m)为第i帧的语音信号数据，N为傅立叶变换的数量；Among them, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the i-th frame, and N is the number of Fourier transforms;

然后计算每帧的频谱能量E(i,k)经过三角滤波器组后的能量：Then calculate the energy of the spectrum energy E(i,k) of each frame after passing through the triangular filter bank:

其中，H_i(k)表示三角滤波器的频率响应，f(l)为第l个三角滤波器的滤波函数，S(i,l)为经过三角滤波器组后的谱线能量，l表示三角滤波器的编号，L为三角滤波器总数；Among them, H _i (k) represents the frequency response of the triangular filter, f (l) is the filter function of the l-th triangular filter, S (i, l) is the spectral line energy after passing through the triangular filter group, and l represents The number of the triangular filter, L is the total number of triangular filters;

2)DCT：利用DCT计算每个三角滤波器组的输出数据lfcc(i,n)：2) DCT: Use DCT to calculate the output data lfcc(i,n) of each triangular filter bank:

其中，n代表第i帧DCT后的谱线；Among them, n represents the spectral line after DCT of the i-th frame;

3)得到LFCC统计矩：将lfcc(i,n)取12阶LFCC系数，计算均值和相关系数，得到一段语音提取出的LFCC矩阵为：3) Obtain the LFCC statistical moment: take the 12th order LFCC coefficient of lfcc(i,n), calculate the mean and correlation coefficient, and obtain the LFCC matrix extracted from a segment of speech:

其中x_s,1…x_1,n为计算得到的第s帧语音数据的n个LFCC。Among them, x _s,1 ...x _1,n are the calculated n LFCCs of the s-th frame speech data.

优选的，所述RNN分类器包括LSTM网络，依次连接的Dropout层、全连接层和Softmax层，所述Dropout层与最后一个LSTM网络连接。Preferably, the RNN classifier includes an LSTM network, a Dropout layer, a fully connected layer and a Softmax layer connected in sequence, and the Dropout layer is connected to the last LSTM network.

优选的，所述LSTM网络具有两个，参数分别设置为(64,128)和(128,64)。Preferably, the LSTM network has two, and the parameters are set to (64,128) and (128,64) respectively.

优选的，所述LSTM网络使用tanh激活函数。Preferably, the LSTM network uses tanh activation function.

优选的，所述Dropout层的Dropout函数值为0.5。Preferably, the Dropout function value of the Dropout layer is 0.5.

优选的，所述原始语音为WAV格式。Preferably, the original voice is in WAV format.

与现有技术相比，本发明的优点在于：采用语音倒谱特征，通过循环神经网络分类输出结果概率，提高语音检测的准确率，更适用于数字语音载体，能识别不同的伪造痕迹；通过RNN网络中的共享参数较现有的基于深度学习的方法计算复杂度大大降低。Compared with the existing technology, the advantage of the present invention is that: it adopts the speech cepstrum feature, classifies the output result probability through the circular neural network, improves the accuracy of speech detection, is more suitable for digital speech carriers, and can identify different forgery traces; The computational complexity of shared parameters in the RNN network is greatly reduced compared with existing deep learning-based methods.

附图说明Description of drawings

图1为本发明实施例的语音检测方法的LFCC统计矩的提取过程图；Figure 1 is a process diagram of the extraction process of LFCC statistical moments of the speech detection method according to the embodiment of the present invention;

图2为本发明实施例的语音检测方法的总体框架原理图；Figure 2 is a schematic diagram of the overall framework of the speech detection method according to the embodiment of the present invention;

图3为本发明实施例的语音检测方法的网络结构图。Figure 3 is a network structure diagram of the speech detection method according to the embodiment of the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。The embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals represent the same or similar elements or elements with the same or similar functions.

在本发明的描述中，需要理解的是，术语“中心”、“纵向”、“横向”、“长度”、“宽度”、“厚度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”“内”、“外”、“顺时针”、“逆时针”、“轴向”、“径向”、“周向”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，由于本发明所公开的实施例可以按照不同的方向设置，所以这些表示方向的术语只是作为说明而不应视作为限制，比如“上”、“下”并不一定被限定为与重力方向相反或一致的方向。此外，限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。In the description of the present invention, it should be understood that the terms "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", " "Back", "Left", "Right", "Vertical", "Horizontal", "Top", "Bottom", "Inside", "Outside", "Clockwise", "Counterclockwise", "Axis", The orientation or positional relationship indicated by "radial direction", "circumferential direction", etc. are based on the orientation or positional relationship shown in the drawings. They are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply the device or element to which they are referred. It must have a specific orientation, be constructed and operate in a specific orientation. Since the disclosed embodiments of the present invention can be arranged in different directions, these terms indicating directions are only for illustration and should not be regarded as limiting, such as "up", "up", "Down" is not necessarily limited to a direction opposite or consistent with the direction of gravity. In addition, features defined as “first” and “second” may explicitly or implicitly include one or more of these features.

一种基于RNN(循环神经网络)的多种伪造操作语音检测方法，通过构建一个基于倒谱特征的循环神经网络框架而实现。参见图2，框架由两个部分组成：首先先提取语音样本的倒谱特征，再送入设计好的网络框架中做分类，达到多种伪造操作鉴别的任务。A speech detection method for multiple forgery operations based on RNN (Recurrent Neural Network) is implemented by constructing a recurrent neural network framework based on cepstrum features. Referring to Figure 2, the framework consists of two parts: first, the cepstral features of the speech samples are extracted, and then sent to the designed network framework for classification to achieve the task of identifying various forgery operations.

具体的，在本发明，语音的特征提取通过以下方式实现。本发明采用的倒谱特征为线性频率倒谱系数(Linear Frequency Cepstral Coefficients,LFCC)。语音倒谱特征是语音技术中最常用的特征参数之一，它表征了人类的听觉特征，且被广泛应用于说话人识别。Specifically, in the present invention, speech feature extraction is implemented in the following manner. The cepstral feature used in the present invention is Linear Frequency Cepstral Coefficients (LFCC). Speech cepstrum feature is one of the most commonly used feature parameters in speech technology. It characterizes human auditory characteristics and is widely used in speaker recognition.

LFCC是从低频到高频带通滤波器的平均分配。本发明的LFCC统计矩的提取过程可以参见图1：LFCC is an even distribution of bandpass filters from low frequencies to high frequencies. The extraction process of LFCC statistical moments of the present invention can be seen in Figure 1:

1)FFT：首先先对语音进行预处理，计算每一个语音帧经过快速傅里叶变换(FastFourier Transform，FFT)后的频谱能量E(i,k)：1) FFT: First, preprocess the speech and calculate the spectral energy E(i,k) of each speech frame after fast Fourier Transform (FFT):

其中，i为语音帧数，k为频率分量，x_i(m)为第i帧的语音信号数据，N为傅立叶变换的数量。Among them, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the i-th frame, and N is the number of Fourier transforms.

计算每帧的频谱能量E(i,k)经过三角滤波器组后的能量：Calculate the energy of the spectrum energy E(i,k) of each frame after passing through the triangular filter bank:

其中，H_i(k)表示三角滤波器的频率响应，f(l)为第l个三角滤波器的滤波函数，S(i,l)为经过三角滤波器组后的谱线能量，l表示三角滤波器的编号，L为三角滤波器总数。Among them, H _i (k) represents the frequency response of the triangular filter, f (l) is the filter function of the l-th triangular filter, S (i, l) is the spectral line energy after passing through the triangular filter group, and l represents The number of the triangular filter, L is the total number of triangular filters.

2)DCT：然后，利用离散余弦变换(Discrete Cosine Transform，DCT)计算每个三角滤波器组的输出数据lfcc(i,n)：2) DCT: Then, use Discrete Cosine Transform (DCT) to calculate the output data lfcc(i,n) of each triangular filter bank:

其中，n代表第i帧DCT后的谱线。Among them, n represents the spectral line after DCT of the i-th frame.

3)得到LFCC统计矩：将lfcc(i,n)取12阶LFCC系数，计算均值和相关系数，上述步骤可通过现有的matlab函数实现，假定某段经预处理后的语音一共有s帧，则该段语音提取出的LFCC矩阵为：3) Obtain the LFCC statistical moment: take the 12th order LFCC coefficient of lfcc(i,n), calculate the mean and correlation coefficient, the above steps can be realized through the existing matlab function, assuming that a certain segment of preprocessed speech has a total of s frames , then the LFCC matrix extracted from this speech segment is:

参见图3，网络框架采用RNN分类器，RNN分类器的网络层数的选择对于优化算法来说至关重要，更深的网络可以学到更多知识，但同时训练也需要花费很长时间而且可能会过度拟合。因此，在本发明中，提出RNN分类器的网络结构如图3所示。网络结构中包含2个LSTM网络，参数分别设置为(64,128)和(128,64)，使用tanh激活函数提高模型性能。还包括依次连接的Dropout层、全连接层(dense)和Softmax层，Dropout层与最后一个LSTM网络连接。设置Dropout函数的值为0.5，这有助于减少过拟合，经过全连接层降维后，使用Softmax层(Softmax分类器)输出概率。网络框架的总体迭代训练设置为50圈。在具体训练时可进行一定的调整。See Figure 3. The network framework uses an RNN classifier. The choice of the number of network layers of the RNN classifier is crucial to the optimization algorithm. A deeper network can learn more knowledge, but at the same time training also takes a long time and may Will overfit. Therefore, in the present invention, the network structure of the RNN classifier is proposed as shown in Figure 3. The network structure contains two LSTM networks, with parameters set to (64,128) and (128,64) respectively. The tanh activation function is used to improve model performance. It also includes the Dropout layer, the fully connected layer (dense) and the Softmax layer that are connected in sequence. The Dropout layer is connected to the last LSTM network. Set the value of the Dropout function to 0.5, which helps to reduce overfitting. After dimensionality reduction through the fully connected layer, use the Softmax layer (Softmax classifier) to output the probability. The overall iterative training of the network framework is set to 50 laps. Certain adjustments can be made during specific training.

再参见图2，语音检测方法中，包括如下步骤：Referring again to Figure 2, the speech detection method includes the following steps:

1)首先需要先对网络框架进行训练。假设有M种伪造操作，对原始语音分别进行M种伪造处理，得到M+1种语音样本，包括M个伪造操作后的语音和1个未经处理的原始语音。本发明中，对于原始语音的输入有一定的约束，需要提供一定量的WAV格式音频样本库作为网络框架的训练数据。对上述M+1种语音样本进行特征提取，得到训练语音样本的LFCC矩阵，送入设计好的RNN分类器网络中进行训练，得到一个多分类的训练模型；可在数据库中存储多个原始语音样本，对每个原始语音样本都进行特征提取并送入RNN分类器中进行训练；1) First, the network framework needs to be trained. Suppose there are M types of forgery operations, M types of forgery processing are performed on the original speech, and M+1 types of speech samples are obtained, including M speech after the forgery operation and 1 unprocessed original speech. In the present invention, there are certain constraints on the input of original speech, and a certain amount of WAV format audio sample library needs to be provided as training data for the network framework. Perform feature extraction on the above M+1 speech samples to obtain the LFCC matrix of the training speech samples, and send them to the designed RNN classifier network for training to obtain a multi-classification training model; multiple original speech sounds can be stored in the database Samples, feature extraction is performed on each original speech sample and sent to the RNN classifier for training;

2)然后，再通过训练后的网络框架得到检测识别结果：当得到一段测试语音时，对其进行特征提取，得到测试语音数据的LFCC矩阵，将其送入已经训练好的RNN分类器中进行分类。每一个测试语音会得到一个输出概率，合并所有输出概率作为最后的预测结果。若预测结果是原始语音，则测试语音被识别为原始语音；若预测结果是经过某一伪造操作的语音，则测试语音被识别为该伪造语音。取证者即可根据实验结果从而判断某段语音是否经过伪造操作。2) Then, the detection and recognition results are obtained through the trained network framework: when a test speech is obtained, feature extraction is performed on it, the LFCC matrix of the test speech data is obtained, and the result is sent to the already trained RNN classifier. Classification. Each test speech will get an output probability, and all output probabilities are combined as the final prediction result. If the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the forged speech. The forensic examiner can judge whether a certain piece of speech has been forged based on the experimental results.

Claims

1. A speech detection method for multiple forgery operations based on RNN, which is characterized by: including the following steps:

1) Training network: Obtain original voice samples, perform M forgery processes on the original voice samples, and obtain M forged voices and 1 unprocessed original voice. Features are extracted from the original speech to obtain the LFCC matrix of the training speech sample, which is sent to the RNN classifier network for training to obtain a multi-classification training model;

2) Speech recognition: Obtain a test speech, perform feature extraction on the test speech, obtain the LFCC matrix of the test speech data, and send it to the RNN classifier trained in step 1) for classification. Each test speech gets an output probability. , combine all output probabilities as the final prediction result: if the prediction result is the original speech, the test speech is recognized as the original speech; if the prediction result is the speech that has undergone a certain forgery operation, the test speech is recognized as the corresponding forgery operation. fake speech;

In steps 1) and 2), the steps to obtain the LFCC matrix are:

1) FFT: First, preprocess the speech and calculate the spectrum energy E(i,k) of each speech frame after FFT:

Among them, i is the number of speech frames, k is the frequency component, x _i (m) is the speech signal data of the i-th frame, and N is the number of Fourier transforms;

Then calculate the energy of the spectrum energy E(i,k) of each frame after passing through the triangular filter bank:

Among them, H _i (k) represents the frequency response of the triangular filter, f (l) is the filter function of the l-th triangular filter, S (i, l) is the spectral line energy after passing through the triangular filter group, and l represents The number of the triangular filter, L is the total number of triangular filters;

2) DCT: Use DCT to calculate the output data lfcc(i,n) of each triangular filter bank:

Among them, n represents the spectral line after DCT of the i-th frame;

3) Obtain the LFCC statistical moment: take the 12th order LFCC coefficient of lfcc(i,n), calculate the mean and correlation coefficient, and obtain the LFCC matrix extracted from a segment of speech:

Among them, x _s,1 ...x _1,n are the calculated n LFCCs of the s-th frame speech data.

2. The multiple forgery operation speech detection method based on RNN according to claim 1, characterized in that: the RNN classifier includes an LSTM network, a Dropout layer, a fully connected layer and a Softmax layer connected in sequence, and the Dropout layer Connect with the last LSTM network.

3. The multiple forgery operation speech detection method based on RNN according to claim 2, characterized in that: the LSTM network has two, and the parameters are set to (64, 128) and (128, 64) respectively.

4. The multiple forgery operation speech detection method based on RNN according to claim 2, characterized in that: the LSTM network uses a tanh activation function.

5. The multiple forgery operation speech detection method based on RNN according to claim 2, characterized in that: the Dropout function value of the Dropout layer is 0.5.

6. The multiple forgery operation speech detection method based on RNN according to claim 1, characterized in that: the original speech is in WAV format.