
CN110782901B - Method, storage medium and device for identifying voice of network telephone - Google Patents

Info

Publication number: CN110782901B (application CN201911071415.7A)
Authority: CN (China)
Prior art keywords: neural network structure, signal, convolutional neural, classifier
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN110782901A
Inventors: 黄远坤 (Huang Yuankun), 李斌 (Li Bin), 黄继武 (Huang Jiwu)
Original and current assignee: Shenzhen University
Application filed by Shenzhen University; priority to CN201911071415.7A
Publication of application CN110782901A; application granted; publication of granted patent CN110782901B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques


Abstract

The invention provides a method, a storage medium, and a device for identifying the voice of a network telephone (VoIP). The method comprises the steps of: converting a filtered speech signal into a standardized mel-scale spectrogram, a standardized inverse-mel-scale spectrogram, and stacked waveform frames, respectively; taking the stacked waveform frames as network input and extracting the time-domain information of the speech signal through a first convolutional neural network structure; taking the standardized mel-scale spectrogram and the standardized inverse-mel-scale spectrogram as network inputs and extracting the frequency-domain information of the speech signal through a second convolutional neural network structure; and inputting the time-domain and frequency-domain information of the speech signal into a trained classification module, which outputs the classification result. The invention can not only effectively identify VoIP speech from a fixed source and fixed terminal, but also quickly and efficiently identify VoIP speech produced by unknown sources and unknown terminals.

Description

A method, storage medium, and device for identifying the voice of a network telephone

Technical Field

The present invention relates to the field of network telephone identification, and in particular to a method, storage medium, and device for identifying the voice of a network telephone (VoIP).

Background

With the development of the Internet and the maturing of audio compression technology, communication methods have diversified. VoIP technology lets people communicate more conveniently and at lower cost. It has therefore attracted a large number of users and is gradually replacing traditional landline and mobile telephony as one of the main means of communication. Unlike users of traditional fixed-line and mobile phones, VoIP users have no fixed telephone number and need no SIM card to dial; a user simply dials the other party's number through a particular VoIP application. For this very reason, some VoIP applications allow users to set their own caller number arbitrarily, and the modified number is never verified. Criminals exploit this loophole: using number-spoofing VoIP software, they set their caller ID to a specific number (for example, that of a public security bureau, a bank, or a government department) to disguise their identity and commit fraud. Identifying whether an incoming call is a VoIP call therefore helps the called party judge the caller's identity and, to some extent, avoid such fraud.

Summary of the Invention

In view of the above deficiencies of the prior art, the object of the present invention is to provide a method, storage medium, and device for identifying VoIP speech, aiming to solve the problem that the prior art cannot efficiently identify VoIP speech of unknown origin.

The technical scheme adopted by the present invention to solve the above technical problem is as follows:

A method for identifying the voice of a network telephone, comprising the steps of:

decompressing, resampling, and high-pass filtering the received speech signal to obtain a filtered speech signal;

converting the filtered speech signal into a standardized mel-scale spectrogram, a standardized inverse-mel-scale spectrogram, and stacked waveform frames, respectively;

taking the stacked waveform frames as network input and extracting the time-domain information of the speech signal through a first convolutional neural network structure;

taking the standardized mel-scale spectrogram and the standardized inverse-mel-scale spectrogram as network inputs and extracting the frequency-domain information of the speech signal through a second convolutional neural network structure;

inputting the time-domain and frequency-domain information of the speech signal into a trained classification module, which outputs the classification result.

In the method for identifying the voice of a network telephone, the step of decompressing, resampling, and high-pass filtering the received speech signal to obtain the filtered speech signal comprises:

decompressing the received speech signal and resampling it to an 8 kHz, 16-bit waveform signal;

high-pass filtering the resampled waveform signal with a second-order difference filter to obtain the filtered speech signal x[n] = -0.5·s[n-1] + s[n] - 0.5·s[n+1], where n indexes the sample points of the time-domain signal.

In the method for identifying the voice of a network telephone, the step of converting the filtered speech signal into a standardized mel-scale spectrogram, a standardized inverse-mel-scale spectrogram, and stacked waveform frames comprises:

framing the filtered speech signal to obtain the n-th sample of the i-th frame, y_i[n] = w[n]·x[(i-1)·S + n], where S is the frame shift and w[n] is the window function over a frame of length N (a Hamming window, w[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1);

converting the framed, filtered speech signal into a standardized mel-scale spectrogram,

MS_std[m, i] = (MS[m, i] - mean(MS)) / std(MS),

where MS[m, i] = log(Σ_{k=0}^{N/2} H_m[k]·E_i[k]) is the log mel-scale spectrogram, E_i[k] = |Y_i[k]|² is the spectral energy of the i-th frame, H_m[k] is the m-th triangular mel sub-band filter defined over the boundary points K_b, F_mel(f) = 2595·log10(1 + f/700) is the mel-scale frequency, f is the hertz-scale frequency, f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, F_s is the sampling frequency, ΔF is the frequency resolution, f_k = k·ΔF is the frequency of the k-th discrete Fourier transform data point, and K_m (m ∈ {1, 2, …, M}) is the m-th mel sub-band filter;

converting the framed, filtered speech signal into a standardized inverse-mel-scale spectrogram,

IMS_std[m, i] = (IMS[m, i] - mean(IMS)) / std(IMS),

computed in the same way but with the triangular filters placed on the inverse-mel scale F_mel^{-1}(m) = 700·(10^{m/2595} - 1);

converting the framed, filtered speech signal into stacked waveform frames by arranging the L frames column by column into the N×L matrix Y = [y_1, y_2, …, y_L].

In the method for identifying the voice of a network telephone, the step of taking the stacked waveform frames as network input and extracting the time-domain information of the speech signal through the first convolutional neural network structure comprises:

constructing a first convolutional neural network structure in advance, the structure comprising 6 convolution modules connected in series from the input end to the output end, where each of the first 5 modules comprises a convolutional layer, a max-pooling layer, a linear rectification function, and two batch-normalization layers, and the last module comprises a convolutional layer, a batch-normalization layer, a linear rectification function, and a global average-pooling layer;

training the network parameters of the first convolutional neural network structure through a main classifier connected at its output and an auxiliary classifier connected to the 4th convolution module, the total cost function of the two classifiers being LossA = α·loss_0 + β·loss_1, where loss_0 and loss_1 denote the cross-entropy loss functions of the main and auxiliary classifiers respectively, and α and β are weights satisfying α + β = 1;

after training of the first convolutional neural network structure is finished, removing the main and auxiliary classifiers, inputting the stacked waveform frames into the first convolutional neural network structure, and extracting the time-domain information of the speech signal.

In the method for identifying the voice of a network telephone, the step of taking the mel-scale spectrogram and the inverse-mel-scale spectrogram as network inputs and extracting the frequency-domain information of the speech signal through the second convolutional neural network structure comprises:

constructing a second convolutional neural network structure in advance, the structure comprising 6 convolution modules connected in series from the input end to the output end, where each of the first 5 modules comprises a convolutional layer, a max-pooling layer, a two-dimensional convolution kernel, and two batch-normalization layers, and the last module comprises a convolutional layer, a batch-normalization layer, a linear rectification function, and a global average-pooling layer;

training the network parameters of the second convolutional neural network structure through a main classifier connected at its output and an auxiliary classifier connected to the 3rd convolution module, the total cost function of the two classifiers being LossB = γ·loss_0 + δ·loss_1, where loss_0 and loss_1 denote the cross-entropy loss functions of the main and auxiliary classifiers respectively, and γ and δ are weights satisfying γ + δ = 1;

after training of the second convolutional neural network structure is finished, removing the main and auxiliary classifiers, taking the standardized mel-scale spectrogram and the standardized inverse-mel-scale spectrogram respectively as network inputs, and extracting the frequency-domain information of the speech signal through the second convolutional neural network structure.

In the method for identifying the voice of a network telephone, the main classifier comprises a fully connected layer and a softmax function.

In the method for identifying the voice of a network telephone, the classification module is a fully connected neural-network classifier comprising a first fully connected layer, a second fully connected layer, and a softmax function.

In the method for identifying the voice of a network telephone, the step of inputting the time-domain and frequency-domain information of the speech signal into the trained classification module and outputting the classification result comprises:

inputting the time-domain and frequency-domain information of the speech signal into the fully connected neural-network classifier, the numbers of nodes of the first and second fully connected layers being set to 768 and 2 respectively;

if the fully connected neural-network classifier outputs [0, 1], judging the speech signal to be VoIP speech;

if the fully connected neural-network classifier outputs [1, 0], judging the speech signal to be mobile-phone speech.

A storage medium storing a plurality of instructions, the instructions adapted to be loaded by a processor to execute the steps of the method for identifying the voice of a network telephone according to the present invention.

A device for identifying the voice of a network telephone, comprising a processor adapted to implement the instructions, and a storage medium adapted to store a plurality of instructions, the instructions adapted to be loaded by the processor to execute the steps of the method for identifying the voice of a network telephone according to the present invention.

Beneficial effects: the present invention proposes a method for identifying the voice of a network telephone that uses a trained first convolutional neural network structure and a trained second convolutional neural network structure to extract multi-domain deep features of the speech and uses the extracted features to identify VoIP speech. Compared with existing methods, the invention can not only effectively identify VoIP speech from a fixed source and fixed terminal, but also quickly and efficiently identify VoIP speech produced by unknown sources and unknown terminals.

Brief Description of the Drawings

FIG. 1 is a flowchart of a preferred embodiment of the method for identifying the voice of a network telephone provided by an embodiment of the present invention.

FIG. 2a is the spectrogram of a mobile-phone call recording at a 48 kHz sampling rate.

FIG. 2b is the spectrogram of a VoIP call recording at a 16 kHz sampling rate.

FIG. 2c is the spectrogram of a mobile-phone call recording resampled at 8 kHz.

FIG. 2d is the spectrogram of a VoIP call recording resampled at 8 kHz.

FIG. 3a is a histogram of the zero-crossing rate of mobile-phone call recordings before and after high-pass filtering.

FIG. 3b is a histogram of the zero-crossing rate of VoIP call recordings before and after high-pass filtering.

FIG. 4 is a flowchart of the framing and format conversion of the speech signal.

FIG. 5 shows the normalized log spectrogram, the normalized log mel-scale spectrogram, and the normalized log inverse-mel-scale spectrogram of high-pass-filtered mobile-phone and VoIP call recordings.

FIG. 6a is a schematic diagram of the first convolutional neural network structure.

FIG. 6b is a schematic diagram of the second convolutional neural network structure.

FIG. 7 is a structural diagram of the fully connected neural-network classifier designed by the present invention.

FIG. 8 is a structural block diagram of a device for identifying the voice of a network telephone according to the present invention.

FIG. 9 shows the detection accuracy for speech segments of different lengths in scenario 0 and scenario 5.

Detailed Description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The existing literature has proposed a VoIP fingerprint for identifying network telephone calls. The designed fingerprint consists of two parts: the first part comprises features such as the packet loss rate of the voice frames, inter-packet correlation, statistical characteristics of the noise spectrum, and the voice quality of the call; the second part is a path-traversal signature formed by feeding the features of the first part into a decision tree. Because this method relies on packet-loss characteristics, packet loss may become a less effective feature as network quality improves. In addition, the method assumes that every call source has a fixed and unique identifying fingerprint, so it only applies to detecting VoIP calls placed from one fixed source to another designated terminal (e.g., landline to landline) and cannot identify VoIP calls from unknown sources.

The existing literature has also proposed detecting VoIP speech by identifying the codec. The method first re-encodes the speech with known codecs to form a multi-dimensional feature model comprising the noise spectrum and the time-domain histogram of the speech signal, then compares the resulting feature model with the reference feature models of candidate codecs to determine which codec processed a given piece of speech, and finally judges from the codec type whether the speech under test passed through a VoIP network. However, because audio data travels through different telephone networks and devices, the encoded speech stream is transcoded as it crosses different types of communication networks, and the last codec used in the public switched telephone network (PSTN) may mask most traces of the codec used in the VoIP network. Moreover, VoIP networks and the PSTN may use the same codec, such as G.711. Identifying the codec alone is therefore not sufficient to identify VoIP speech.

Based on the problems in the prior art, this embodiment provides a method for identifying the voice of a network telephone, as shown in FIG. 1, comprising the steps of:

decompressing, resampling, and high-pass filtering the received speech signal to obtain a filtered speech signal;

converting the filtered speech signal into a standardized mel-scale spectrogram, a standardized inverse-mel-scale spectrogram, and stacked waveform frames, respectively;

taking the stacked waveform frames as network input and extracting the time-domain information of the speech signal through a first convolutional neural network structure;

taking the standardized mel-scale spectrogram and the standardized inverse-mel-scale spectrogram as network inputs and extracting the frequency-domain information of the speech signal through a second convolutional neural network structure;

inputting the time-domain and frequency-domain information of the speech signal into a trained classification module, which outputs the classification result.

This embodiment uses a trained first convolutional neural network structure and a trained second convolutional neural network structure to extract multi-domain deep features of the speech and uses the extracted features to identify VoIP speech. Compared with existing methods, the invention can not only effectively identify VoIP speech from a fixed source and fixed terminal, but also quickly and efficiently identify VoIP speech produced by unknown sources and unknown terminals.

In some embodiments, the step of decompressing, resampling, and high-pass filtering the received speech signal to obtain the filtered speech signal comprises: decompressing the received speech signal and resampling it to an 8 kHz, 16-bit waveform signal; and high-pass filtering the resampled waveform signal with a second-order difference filter to obtain the filtered speech signal x[n] = -0.5·s[n-1] + s[n] - 0.5·s[n+1], where n indexes the sample points of the time-domain signal.

Specifically, when a mobile phone records the received speech signal, different phones sample the speech at different default rates, e.g. 16 kHz, 44.1 kHz, or 48 kHz, so the captured speech contains many useless high-frequency components. Because the bandwidth of VoIP and PSTN networks is limited, this embodiment resamples the decompressed speech signal to an 8 kHz, 16-bit waveform, which retains most of the speech content while avoiding the bandwidth problem. FIGS. 2a-2d show the spectrogram of a 48 kHz mobile-phone call recording, the spectrogram of a 16 kHz VoIP call recording, and their spectrograms after resampling to 8 kHz; the high-frequency parts of the 48 kHz and 16 kHz signals contain very little information, while the resampled signals still retain most of the information of the original speech.

This embodiment also high-pass filters the resampled speech signal. The human ear is not sensitive to the high-frequency part of a speech signal, which is why it is hard to tell VoIP speech from mobile-phone speech by listening; this embodiment therefore uses a high-pass filter to suppress the low-frequency part of the signal and amplify its high-frequency information. FIGS. 3a-3b show the zero-crossing rates of 12,000 25-millisecond mobile-phone and VoIP recordings before and after high-pass filtering; after filtering, the difference in high-frequency content between the two kinds of recordings is amplified. For a signal s[n], this embodiment applies a second-order difference filter, and the high-pass-filtered signal is x[n] = -0.5·s[n-1] + s[n] - 0.5·s[n+1], where n indexes the sample points of the time-domain signal.
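A minimal NumPy sketch of this filtering step; the vectorized difference implements the formula above, while the boundary handling is an assumption the patent does not specify:

```python
import numpy as np

def high_pass(s: np.ndarray) -> np.ndarray:
    """Second-order difference filter: x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1]."""
    s = s.astype(np.float64)
    x = s.copy()
    x[1:-1] = -0.5 * s[:-2] + s[1:-1] - 0.5 * s[2:]
    # The first and last samples have only one neighbour; leaving them
    # unfiltered is an assumed convention.
    return x
```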

In some embodiments, to make it easier to extract features with deep learning later, this embodiment, after decompressing, resampling, and high-pass filtering the speech signal, further divides it into frames and converts it into three forms: the mel-scale spectrogram, the inverse-mel-scale spectrogram, and stacked waveform frames, as shown in FIG. 4.

Specifically, let N denote the frame length and S the frame shift; the n-th sample of the i-th frame is obtained as y_i[n] = w[n]·x[(i-1)·S + n], where w[n] is the Hamming window, w[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1. The mel-scale spectrogram, the inverse-mel-scale spectrogram, and the stacked waveform frames are then computed from the framed speech.
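A sketch of the framing step under these definitions (the Hamming coefficients 0.54/0.46 are the textbook values, assumed here because the patent's window formula appears only as an image):

```python
import numpy as np

def frame_signal(x: np.ndarray, N: int = 240, S: int = 120) -> np.ndarray:
    """Split x into overlapping Hamming-windowed frames, one frame per row."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming window
    L = (len(x) - N) // S + 1                      # number of whole frames
    return np.stack([x[i * S : i * S + N] * w for i in range(L)])  # shape (L, N)

# For a 2-second clip at 8 kHz: L = (16000 - 240) // 120 + 1 = 132 frames,
# matching the 132-column network inputs used in the experiments below.
```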

This embodiment writes F_mel(·) for the conversion from the hertz scale to the mel scale and F_mel^{-1}(·) for the inverse conversion. Relative to the hertz-scale frequency f, the mel-scale frequency is computed as F_mel(f) = 2595·log10(1 + f/700), and the inverse-mel scale as F_mel^{-1}(m) = 700·(10^{m/2595} - 1), where f_L, f_H, and F_s denote the lowest frequency, the highest frequency, and the sampling frequency, respectively. In this embodiment the sampling frequency is set to 8000 Hz, the lowest frequency to 0, and the highest frequency to half the sampling frequency, i.e. 4000 Hz. N is the number of discrete Fourier transform points, equal to the frame length. For the i-th speech frame we first take its discrete Fourier transform, Y_i[k] = Σ_{n=0}^{N-1} y_i[n]·e^{-j2πkn/N}, where the frequency of the k-th DFT data point is f_k = k·ΔF with frequency resolution ΔF = F_s/N. The spectral energy of each frame is computed as E_i[k] = |Y_i[k]|²; since the spectrum is symmetric, this embodiment keeps only half of it, k = 0, 1, …, N/2. Next, this embodiment filters the energy spectrum with M triangular mel filters: the m-th triangular filter of the bank rises linearly from the boundary point K_{m-1} to its center K_m and falls linearly to K_{m+1}, where the boundary points K_b are obtained by mapping points equally spaced on the mel scale between F_mel(f_L) and F_mel(f_H) back to DFT bins, f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, and K_m (m ∈ {1, 2, …, M}) is the center frequency of the m-th mel sub-band filter. The mel-scale spectrogram is obtained as the logarithm of the filterbank energies, MS[m, i] = log(Σ_k H_m[k]·E_i[k]), and the final standardized mel-scale spectrogram is obtained by normalizing it to zero mean and unit variance, MS_std[m, i] = (MS[m, i] - mean(MS)) / std(MS).
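A compact sketch of this computation, assuming the standard mel mapping above, triangular filters built from equally spaced mel points, and z-score standardization (the patent's exact filter weights appear only as images, so the normalization details here are assumptions):

```python
import numpy as np

def mel_filterbank(M: int = 48, N: int = 240, fs: float = 8000.0,
                   invert: bool = False) -> np.ndarray:
    """M triangular filters over DFT bins 0..N//2 on the (inverse-)mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), M + 2))  # boundary points K_b
    if invert:
        edges = np.sort(fs / 2.0 - edges)   # mirror the centers for the inverse scale
    bins = np.floor((N + 1) * edges / fs).astype(int)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ce):
            H[m - 1, k] = (k - lo) / max(ce - lo, 1)          # rising slope
        for k in range(ce, min(hi, N // 2) + 1):
            H[m - 1, k] = (hi - k) / max(hi - ce, 1)          # falling slope
    return H

def standardized_spectrogram(frames: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Log filterbank energies per frame, z-scored over the whole M x L map."""
    E = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (L, N//2+1) spectral energies
    MS = np.log(E @ H.T + 1e-12).T                 # (M, L) log filterbank energies
    return (MS - MS.mean()) / (MS.std() + 1e-12)
```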

In some embodiments, unlike the computation of the mel-scale spectrogram, the computation of the inverse-mel-scale spectrogram filters the energy spectrum with an inverse-mel triangular filter bank, in which the m-th filter is defined on the inverse-mel scale: the filters mirror the mel filter bank, dense at high frequencies and sparse at low frequencies. The inverse-mel-scale spectrogram is computed as the logarithm of the filterbank energies, IMS[m, i] = log(Σ_k H'_m[k]·E_i[k]), and the final standardized inverse-mel-scale spectrogram is obtained by normalizing it to zero mean and unit variance, IMS_std[m, i] = (IMS[m, i] - mean(IMS)) / std(IMS).
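With the sketch above, the inverse-mel case is just the mirrored filter bank (an assumed construction; `frames` comes from the earlier framing sketch):

```python
H_mel = mel_filterbank(M=48, N=240, fs=8000.0)                # mel filter bank
H_imel = mel_filterbank(M=48, N=240, fs=8000.0, invert=True)  # inverse-mel bank
MS_std = standardized_spectrogram(frames, H_mel)    # 48 x L mel spectrogram
IMS_std = standardized_spectrogram(frames, H_imel)  # 48 x L inverse-mel spectrogram
```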

FIG. 5 shows standardized spectrograms of the high-pass-filtered signals on different scales. Compared with mobile-phone call recordings, the energy of the high-frequency part of VoIP call recordings is weaker. Compared with the hertz-scale spectrogram, the mel-scale and inverse-mel-scale spectrograms amplify the differences between mobile-phone and VoIP call recordings while reducing the differences among different VoIP recordings.

In some embodiments, unlike the mel-scale and inverse-mel-scale spectrograms, stacked waveform frames retain the phase information of the speech signal, so using them as the input of a deep network can capture information that the spectrograms cannot. Arranging L waveform frames column by column yields the N×L stacked-waveform-frame matrix Y = [y_1, y_2, …, y_L].
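Continuing the sketch, the stacking is a transpose of the framing output:

```python
# Stacked waveform frames: one N-sample windowed frame per column (N x L).
Y = frame_signal(x, N=240, S=120).T   # shape (240, L); L = 132 for 2 s at 8 kHz
```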

In some embodiments, two different convolutional neural network structures are designed: the first convolutional neural network structure extracts the time-domain information of the speech signal and the second extracts its frequency-domain information, and the finally extracted deep feature is a 2304-dimensional vector.

In some embodiments, the step of taking the stacked waveform frames as network input and extracting the time-domain information of the speech signal through the first convolutional neural network structure comprises: as shown in FIG. 6a, constructing a first convolutional neural network structure in advance, the structure comprising 6 convolution modules connected in series from the input end to the output end; each of the first 5 modules comprises a convolutional layer, a max-pooling layer, a linear rectification function (ReLU), and two batch-normalization (BN) layers, and the last module comprises a convolutional layer, a batch-normalization layer, a linear rectification function, and a global average-pooling layer. In this embodiment the first 3 modules use one-dimensional convolutions to extract intra-frame features from each stacked waveform frame, the 4th module uses a two-dimensional convolution to extract intra-frame features and fuse the information extracted across frames, and the last two modules again use one-dimensional convolutions to fuse inter-frame features.

In this embodiment, a main classifier consisting of a fully connected layer and a softmax function is connected at the output of the first convolutional neural network, and an auxiliary classifier is connected at the output of the 4th convolution module. The network parameters of the first convolutional neural network structure are trained through the main and auxiliary classifiers, with the total cost function of the two classifiers LossA = α·loss_0 + β·loss_1, where loss_0 and loss_1 denote the cross-entropy loss functions of the main and auxiliary classifiers respectively, and α and β are weights satisfying α + β = 1. After training of the first convolutional neural network structure is finished, the main and auxiliary classifiers are removed, the stacked waveform frames are input into the first convolutional neural network structure (SWF-net), and the time-domain information of the speech signal is extracted.
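A minimal PyTorch-style sketch of this training setup. The module ordering, the auxiliary tap after the 4th module, and the weighted two-classifier loss follow the text; the channel widths, kernel sizes, and flattened one-dimensional layout are illustrative assumptions (the 768-wide output is chosen so that three such sub-networks concatenate to the 2304-dimensional deep feature mentioned later):

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """Conv -> BN -> ReLU -> BN -> MaxPool, as in the first five modules."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out), nn.ReLU(), nn.BatchNorm1d(c_out),
        nn.MaxPool1d(2))

class SWFNet(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        chans = [1, 32, 64, 128, 256, 512, 768]           # assumed channel widths
        self.blocks = nn.ModuleList(
            [conv_block(chans[i], chans[i + 1]) for i in range(5)])
        self.last = nn.Sequential(                         # 6th module
            nn.Conv1d(chans[5], chans[6], kernel_size=3, padding=1),
            nn.BatchNorm1d(chans[6]), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.main_head = nn.Linear(chans[6], n_classes)    # main classifier
        self.aux_head = nn.Linear(chans[4], n_classes)     # auxiliary classifier

    def forward(self, x):
        aux_logits = None
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == 3:                                     # tap after the 4th module
                aux_logits = self.aux_head(x.mean(dim=-1))
        feat = self.last(x).squeeze(-1)                    # deep feature vector
        return self.main_head(feat), aux_logits, feat

# Total cost LossA = alpha*loss0 + beta*loss1, with alpha + beta = 1.
ce = nn.CrossEntropyLoss()
def loss_a(main_logits, aux_logits, target, alpha: float = 0.9):
    return alpha * ce(main_logits, target) + (1 - alpha) * ce(aux_logits, target)
```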

In some embodiments, the step of taking the mel-scale spectrogram and the inverse-mel-scale spectrogram as network inputs and extracting the frequency-domain information of the speech signal through the second convolutional neural network structure comprises: as shown in FIG. 6b, constructing a second convolutional neural network structure in advance, the structure comprising 6 convolution modules connected in series from the input end to the output end; each of the first 5 modules comprises a convolutional layer, a max-pooling layer, a two-dimensional convolution kernel, and two batch-normalization layers, and the last module comprises a convolutional layer, a batch-normalization layer, a linear rectification function, and a global average-pooling layer. The second structure differs from the first in two respects. First, the kernel dimensionality of the convolution modules differs: since adjacent data points of a spectrogram are correlated, this embodiment directly uses two-dimensional convolution kernels in the first 5 convolution modules to extract features, with the pooling sizes and convolution strides designed so that the second structure has the same output dimension as the first. Second, the auxiliary classifier sits elsewhere: in this embodiment it is connected at the output of the 3rd convolution module. The network parameters of the second convolutional neural network structure are trained through the main and auxiliary classifiers, with the total cost function of the two classifiers LossB = γ·loss_0 + δ·loss_1, where loss_0 and loss_1 denote the cross-entropy loss functions of the main and auxiliary classifiers respectively, and γ and δ are weights satisfying γ + δ = 1. After training of the second convolutional neural network structure is finished, the main and auxiliary classifiers are removed, giving the convolutional neural network models named MS-net and IMS-net; the standardized mel-scale and inverse-mel-scale spectrograms are then taken as network inputs respectively, and the frequency-domain information of the speech signal is extracted through the second convolutional neural network structure.
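Once the three sub-networks are trained and their classifier heads removed, they act purely as feature extractors. A sketch of the fusion step, continuing the PyTorch sketch above, with the network and tensor names assumed and each sub-network returning a 768-dimensional feature:

```python
# Concatenate the three 768-dim sub-network features into the 2304-dim vector.
with torch.no_grad():
    f_time = swf_net(stacked_frames)[2]   # from SWF-net (stacked waveform frames)
    f_mel  = ms_net(mel_spec)[2]          # from MS-net (mel-scale spectrogram)
    f_imel = ims_net(imel_spec)[2]        # from IMS-net (inverse-mel spectrogram)
    deep_feature = torch.cat([f_time, f_mel, f_imel], dim=1)   # (batch, 2304)
```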

In some embodiments, after the deep features are extracted, this embodiment uses a classification module to fuse the extracted features and make the final decision. Many classifiers can serve as the classification module at this stage; this embodiment uses a fully connected neural-network classifier. As shown in FIG. 7, the classifier consists of two fully connected layers and a softmax function, with 768 nodes in the first fully connected layer and 2 in the second. After the extracted 2304-dimensional feature vector is input into the trained classifier, it outputs [0, 1] or [1, 0]; in this embodiment, an output of [0, 1] means the classifier judges the speech signal under test to be a VoIP call recording, and [1, 0] means it judges the signal to be a mobile-phone call recording.
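A sketch of this classification module; the hidden-layer activation is an assumption, since the text only specifies the two layer widths and the softmax:

```python
class FusionClassifier(nn.Module):
    """Two fully connected layers (768 and 2 nodes) followed by softmax."""
    def __init__(self, feat_dim: int = 2304):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 768)
        self.fc2 = nn.Linear(768, 2)

    def forward(self, feat):
        return torch.softmax(self.fc2(torch.relu(self.fc1(feat))), dim=1)

# Output near [0, 1] -> VoIP call recording; near [1, 0] -> mobile-phone recording.
```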

In some embodiments, a storage medium is also provided that stores a plurality of instructions, the instructions adapted to be loaded by a processor to execute the steps of the method for identifying the voice of a network telephone according to the present invention.

In some embodiments, a device for identifying network speech signals is also provided, which, as shown in FIG. 8, comprises a processor 10 adapted to implement the instructions, and a storage medium 20 adapted to store a plurality of instructions, the instructions adapted to be loaded by the processor 10 to execute the steps of the method for identifying the voice of a network telephone.

The identification performance of the present invention on network speech signals is tested below on the existing VPCID database:

In this embodiment the frame length N is set to 240, the frame shift S to 120, the number of triangular filters M to 48, and the number of frames L to 132, yielding a 48×132 mel-scale spectrogram, a 48×132 inverse-mel-scale spectrogram, and 240×132 stacked waveform frames. As evaluation metrics, this embodiment uses the true positive rate, the true negative rate, and accuracy, with VoIP speech as positive samples and mobile-phone speech as negative samples.

1. Verifying the effect of the auxiliary classifier:

Table 1 shows the detection performance on 2-second speech segments when classifiers with different weights are used in SWF-net, and Table 2 shows the corresponding performance for MS-net and IMS-net.

Table 1 Detection performance of classifiers using different weights in SWF-net

Figure BDA0002261069560000171

Table 2 Detection performance of classifiers using different weights in MS-net and IMS-net

Figure BDA0002261069560000172

As Tables 1 and 2 show, detection performance with the auxiliary classifier is better than without it in every case, and the weights of the auxiliary classifier leave room for further optimization.

2. Overall effect of different classification modules:

The classification module of the present invention can be composed of different classifiers or even decision methods, and good performance is achieved with any of them. Table 3 shows the detection performance on 2-second speech segments when a universal background model-Gaussian mixture model (UBM-GMM), an FLD ensemble classifier, majority voting, or the fully connected neural-network classifier (NN-based classifier) is used as the classification module.

Table 3 Detection performance using different classifiers

Figure BDA0002261069560000181

The table shows that good detection performance is achieved with all of the classification methods, while the fully connected neural-network classifier designed here outperforms the existing classifiers and decision methods.

3. Effect on speech signals from different sources:

Because the present invention starts from real call recordings, the design of the algorithm does not depend on data from any particular source, so it can effectively detect call speech from different sources. The present invention therefore sets up different scenarios; Table 4 shows, for each scenario, the degree of source mismatch between the training data and the test data, where a check mark indicates that the training and test data differ in that factor. Table 5 shows the detection performance of the present invention on 2-second speech segments in the different scenarios; the proposed method performs well on speech from many different sources.

Table 4 Mismatch factors in each detection scenario

Figure BDA0002261069560000191

Table 5 Detection performance of the present invention under the various mismatch scenarios

Figure BDA0002261069560000192

4. Effect on speech segments of different lengths:

In extracting features, the algorithm designed by the present invention extracts intra-frame features from short audio frames and fuses inter-frame features, so it can detect speech segments of different lengths. FIG. 9 shows the detection accuracy of the present invention in scenario 0 and scenario 5 on segments 1, 2, 4, 6, and 8 seconds long. The detection performance essentially saturates for segments longer than 6 seconds, and the accuracy remains good for segments shorter than 6 seconds.

In summary, based on the statistical characteristics of speech signals, the present invention finds that the energy of VoIP speech in the high-frequency band is weaker than that of mobile-phone speech, so the speech signal is first high-pass filtered. An analysis of spectrograms on different scales shows that, compared with a linear-scale spectrogram, nonlinear-scale spectrograms amplify the differences between the two classes of call recordings while reducing the differences within a class, so the algorithm uses the mel-scale and inverse-mel-scale spectrograms as network inputs. Compared with the prior art, the present invention abandons hand-crafted features and lets deep learning extract features from the network inputs by itself: the first convolutional neural network structure extracts deep features by first extracting intra-frame features and then fusing inter-frame features, while the second convolutional neural network structure extracts frequency-domain features directly from adjacent data points of the spectrograms. The information extracted by the three trained sub-networks can be fused effectively, and by combining the above techniques the present invention can effectively detect VoIP call recordings from the same source and robustly detect VoIP call recordings from different sources.

It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or transformations based on the above description, and all such improvements and transformations shall fall within the protection scope of the appended claims of the present invention.

Claims (7)

1. A method for recognizing voice of a network telephone, comprising the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the pile-up waveform frame signal as network input, and extracting time domain information of a voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result;
the step of converting the filtered speech signal into a normalized mel scale spectrogram, a normalized anti-mel scale spectrogram and a stacked waveform frame signal respectively comprises:
performing framing processing on the filtered voice signal to obtain an nth sample of an ith frame: y isi[n]=w[n]*x[(i-1)*S+n]Wherein
Figure FDA0003338022220000011
converting the filtered speech signal into a standardized Mel-time scaleDegree language spectrogram:
Figure FDA0003338022220000012
wherein,
Figure FDA0003338022220000013
Figure FDA0003338022220000014
Figure FDA0003338022220000021
Figure FDA0003338022220000022
Figure FDA0003338022220000023
Figure FDA0003338022220000024
is the Mel scale frequency, f is the Hertz scale frequency, KbAs a boundary point, fL0 is the lowest frequency, fH=Fs2 is the highest frequency, FsFor the sampling frequency,. DELTA.F for the frequency resolution, FkFor the frequency of the kth discrete Fourier transform data point, Km(M ∈ {1, 2, …, M }) is the mth Mel-band filter;
converting the filtered voice signal subjected to framing into a standardized anti-Mel scale spectrogram:
Figure FDA0003338022220000025
wherein,
Figure FDA0003338022220000026
Figure FDA0003338022220000027
Figure FDA0003338022220000028
Figure FDA0003338022220000029
Figure FDA00033380222200000210
Figure FDA00033380222200000211
is reverse Mel scale;
converting the filtered voice signal subjected to framing into a pile-up waveform frame signal:
Figure FDA0003338022220000031
the step of extracting the time domain information of the speech signal by using the heap waveform frame signal as a network input through a first convolutional neural network structure comprises:
the method comprises the steps that a first convolution neural network structure is constructed in advance, the first convolution neural network structure comprises 6 groups of convolution modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolution modules from the input end comprise a convolution layer, a maximum pooling layer, a linear rectification function and two batch normalization layers, and the last group of convolution modules comprise a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the first convolutional neural network structure through a main classifier connected to the output end of the first convolutional neural network structure and an auxiliary classifier connected to a fourth group of convolutional modules, wherein the total cost functions of the two classifiers are as follows: LossA ═ α loss0+βloss1Wherein, loss0And loss1Representing primary and secondary classifiers respectivelyThe cross entropy loss function of the classifier, wherein alpha and beta are weights, and alpha + beta is 1;
after the training of the first convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, inputting the stacked waveform frame signal into the first convolutional neural network structure, and extracting time domain information of a voice signal;
the step of extracting the frequency domain information of the voice signal through a second convolutional neural network structure by respectively taking the normalized Mel scale spectrogram and the normalized inverse Mel scale spectrogram as network inputs comprises:
constructing a second convolutional neural network structure in advance, the structure comprising 6 groups of convolution modules connected in series from the input end to the output end, wherein each of the first 5 groups of convolution modules from the input end comprises a convolution layer, a max-pooling layer, a two-dimensional convolution kernel function and two batch normalization layers, and the last group of convolution modules comprises a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the second convolutional neural network structure through a main classifier connected to the output end of the second convolutional neural network structure and an auxiliary classifier connected to the third group of convolution modules, the total cost function of the two classifiers being: Loss_B = γ * loss_0 + δ * loss_1, where loss_0 and loss_1 represent the cross-entropy loss functions of the main classifier and the auxiliary classifier respectively, and γ and δ are weights satisfying γ + δ = 1;
and after the training of the second convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, taking the normalized Mel-scale spectrogram and the normalized inverse Mel-scale spectrogram respectively as network inputs, and extracting the frequency domain information of the voice signal through the second convolutional neural network structure.
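Once both trunks are trained and their classifier heads removed, they act as pure feature extractors. A sketch of the extraction pass, where time_net is the TimeDomainCNN above and freq_net, mel_t and inv_mel_t are assumed counterparts of the second structure and its two spectrogram inputs:

```python
time_net.eval()
freq_net.eval()
with torch.no_grad():
    t_feat, _ = time_net(net_input)     # time domain information
    f_mel, _ = freq_net(mel_t)          # frequency domain: Mel-scale input
    f_imel, _ = freq_net(inv_mel_t)     # frequency domain: inverse-Mel input
    features = torch.cat([t_feat, f_mel, f_imel], dim=1)   # fused feature vector
```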
2. The method of claim 1, wherein the step of decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal comprises:
decompressing the received voice signal, and resampling it into an 8 kHz, 16-bit waveform signal;
and performing high-pass filtering on the resampled waveform signal with a second-order differential filter to obtain the filtered voice signal: x[n] = -0.5 * s[n-1] + s[n] - 0.5 * s[n+1], where n denotes a sample point of the time domain signal.
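A numpy/scipy sketch of this preprocessing; the polyphase resampler is one reasonable choice (fs_in must be an integer sampling rate), and the 16-bit quantization step is omitted:

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(s, fs_in):
    # Resample the decompressed signal to 8 kHz, then apply the second-order
    # differential high-pass x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1].
    s8k = resample_poly(s, 8000, fs_in)
    return np.convolve(s8k, [-0.5, 1.0, -0.5], mode='same')
```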
3. The method of claim 1, wherein the main classifier comprises a fully-connected layer and a softmax function.
4. The method for recognizing voice over internet phone of claim 1, wherein the classification module is a fully-connected neural network classifier comprising a first fully-connected layer, a second fully-connected layer and a softmax function.
5. The method of claim 4, wherein the step of inputting the time domain information and the frequency domain information of the voice signal into the trained classification module and outputting the classification result comprises:
inputting the time domain information and the frequency domain information of the voice signal into the fully-connected neural network classifier, the node numbers of the first fully-connected layer and the second fully-connected layer being set to 768 and 2 respectively;
if the fully-connected neural network classifier outputs [0, 1], judging that the voice signal is network telephone voice;
and if the fully-connected neural network classifier outputs [1, 0], judging that the voice signal is mobile phone voice.
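A sketch of this fusion classifier and decision rule; only the node counts (768 and 2) come from the claim, while the 768-dimensional input is an assumption that happens to fit three concatenated 256-dimensional feature streams (the features tensor from the extraction sketch above):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, in_dim=768):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 768)   # first fully-connected layer: 768 nodes
        self.fc2 = nn.Linear(768, 2)        # second fully-connected layer: 2 nodes

    def forward(self, feats):
        return torch.softmax(self.fc2(torch.relu(self.fc1(feats))), dim=1)

clf = FusionClassifier()
probs = clf(features)
# Output close to [0, 1] (argmax index 1): network telephone (VoIP) voice;
# output close to [1, 0] (argmax index 0): mobile phone voice.
is_voip = probs.argmax(dim=1).item() == 1
```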
6. A storage medium having a plurality of instructions stored thereon, the instructions being adapted to be loaded by a processor to perform the steps of the method for recognizing network telephone voice of any one of claims 1-5.
7. An apparatus for recognizing network telephone voice, comprising: a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor to perform the steps of the method for recognizing network telephone voice of any one of claims 1-5.
CN201911071415.7A 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone Active CN110782901B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911071415.7A | 2019-11-05 | 2019-11-05 | Method, storage medium and device for identifying voice of network telephone


Publications (2)

Publication Number | Publication Date
CN110782901A (en) | 2020-02-11
CN110782901B (en) | 2021-12-24

Family
ID: 69389087

Family Applications (1)

Application Number: CN201911071415.7A (Active, CN110782901B (en)); Title: Method, storage medium and device for identifying voice of network telephone

Country Status (1): CN, CN110782901B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party

Publication | Priority date | Publication date | Assignee | Title
CN112566170B * | 2020-11-25 | 2022-07-29 | China Mobile (Hangzhou) Information Technology Co., Ltd. | Network quality assessment method, device, server and storage medium

Citations (4)

Publication | Priority date | Publication date | Assignee | Title
CN106504768A * | 2016-10-21 | 2017-03-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Phone-testing audio classification method and device based on artificial intelligence
CN108766419A * | 2018-05-04 | 2018-11-06 | South China University of Technology | Abnormal speech detection method based on deep learning
CN109102005A * | 2018-07-23 | 2018-12-28 | Hangzhou Dianzi University | Small-sample deep learning method based on shallow-model knowledge migration
CN109493874A * | 2018-11-23 | 2019-03-19 | Northeast Agricultural University | Live pig cough sound recognition method based on convolutional neural networks

Family Cites Families (1)

Publication | Priority date | Publication date | Assignee | Title
US9978374B2 * | 2015-09-04 | 2018-05-22 | Google Llc | Neural networks for speaker verification


Non-Patent Citations (2)

Shawn Hershey et al.; "CNN architectures for large-scale audio classification"; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2017-01-10; pp. 1-5 *
Fu Wei et al.; "Audio classification method based on convolutional neural networks and random forests" (基于卷积神经网络和随机森林的音频分类方法); Journal of Computer Applications (《计算机应用》); 2018-12-25; pp. 58-62 *


Similar Documents

Publication Publication Date Title
US11488605B2 (en) Method and apparatus for detecting spoofing conditions
US20230290357A1 (en) Channel-compensated low-level features for speaker recognition
JP6535706B2 (en) Method for creating a ternary bitmap of a data set
CN109036382B (en) An audio feature extraction method based on KL divergence
CN109326299B (en) Speech enhancement method, device and storage medium based on full convolution neural network
US6038528A (en) Robust speech processing with affine transform replicated data
CN108597496A (en) Voice generation method and device based on generation type countermeasure network
CN113823293B (en) Speaker recognition method and system based on voice enhancement
Thakur et al. Speech recognition using euclidean distance
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
Ayoub et al. Gammatone frequency cepstral coefficients for speaker identification over VoIP networks
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN111508524B (en) Method and system for identifying voice source equipment
CN108108357A (en) Accent conversion method and device, electronic equipment
CN116469395A (en) A speaker recognition method based on Fca-Res2Net fusion self-attention
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN110782901B (en) Method, storage medium and device for identifying voice of network telephone
CN113593580B (en) Voiceprint recognition method and device
Jahanirad et al. Blind source mobile device identification based on recorded call
CN117457008A (en) Multi-person voiceprint recognition method and device based on telephone channel
CN114093372B (en) End-to-end talker confirmation method based on double convolution capsule network
CN112151070B (en) Voice detection method and device and electronic equipment
Sahidullah et al. Improving performance of speaker identification system using complementary information fusion
CN115410594A (en) Speech enhancement method and device
Malik et al. Wavelet transform based automatic speaker recognition

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant