CN114550751A - Voice double-speed attack detection method based on prosodic features and random forest classifier

Info

Publication number: CN114550751A
Application number: CN202210127689.9A
Authority: CN
Inventors: 徐文渊; 冀晓宇; 闫琛; 何睿文; 石卓扬; 李超豪
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2022-05-27

Abstract

The invention discloses a speech double-speed attack detection method based on prosodic features and a random forest classifier, belonging to the technical fields of speech recognition and security. The method obtains an audio data set that includes normal audio and double-speed adversarial audio; extracts the jitter, shimmer, and harmonic-to-noise ratio features of all audio in the data set to form feature vectors; trains a random forest classifier with the feature vectors of the normal audio and the double-speed adversarial audio; and uses the trained random forest classifier for speech double-speed attack detection. The invention can efficiently detect speech double-speed spoofing attacks using the microphone and voice hardware already present in a speech recognition system. It offers low cost and high detection accuracy, can be used to protect speech recognition systems on smart devices such as mobile phones, and has broad demand and application prospects.

Description

Speech double-speed attack detection method based on prosodic features and random forest classifier

Technical field

The invention belongs to the fields of speech recognition technology and security technology, and in particular relates to a speech double-speed attack detection method based on prosodic features and a random forest classifier.

Background

Automatic speech recognition (ASR) systems recognize speech and output the recognized text. Popular ASR systems include open-source systems (such as Kaldi and DeepSpeech) and commercial systems (such as Google Cloud Speech-to-Text, Baidu ASR, and iFlytek). For input audio, an ASR system first performs signal processing to reduce noise and remove irrelevant frequency components; it then divides the processed audio signal into short segments and extracts features such as Mel-frequency cepstral coefficients (MFCC); finally, it uses the extracted features to infer the most likely word sequence with a pre-trained speech recognition model.

A double-speed operation, or time-scale modification (TSM), makes an audio clip play faster or slower. Common audio players and audio editing software use it to change the playback speed of audio without changing its pitch. The operation consists of three main steps: 1) signal decomposition, 2) frame relocation and modification, and 3) signal reconstruction. It first decomposes the input audio into shorter, mutually overlapping analysis frames; an analysis frame is typically milliseconds long and preserves the pitch content of that frame in the original audio. Depending on the frame-processing algorithm, there are various implementations, such as FFmpeg, SoundTouch, Waveform Similarity Overlap-Add (WSOLA), and the phase vocoder (PV-TSM). Among existing implementations, FFmpeg is the most commonly used in commercial audio players and audio editing software.
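The three steps above can be illustrated with a minimal overlap-add (OLA) sketch in Python. This is a simplified stand-in, not the algorithm used by FFmpeg, SoundTouch, WSOLA, or the phase vocoder: plain OLA does not align waveforms between frames, so it can introduce phase artifacts that those methods are designed to reduce. The function names and the default frame length and hop size are assumptions chosen for illustration.

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def ola_time_stretch(x, speed, frame_len=512, synth_hop=128):
    """Naive overlap-add TSM: speed > 1 shortens the signal, speed < 1 lengthens it."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Step 2 (frame relocation): frames are read every analysis_hop samples
    # but written every synth_hop samples.
    analysis_hop = synth_hop * speed
    window = hann(frame_len)
    n_frames = max(1, int((len(x) - frame_len) / analysis_hop) + 1)
    out_len = (n_frames - 1) * synth_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for m in range(n_frames):
        # Step 1 (decomposition): cut an overlapping analysis frame.
        start = int(round(m * analysis_hop))
        frame = x[start:start + frame_len]
        # Step 3 (reconstruction): window the frame and overlap-add it.
        pos = m * synth_hop
        for i, sample in enumerate(frame):
            out[pos + i] += sample * window[i]
            norm[pos + i] += window[i]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

With a 16 kHz input, `frame_len=512` corresponds to 32 ms frames, in line with the millisecond-scale analysis frames described above.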

However, studies have shown that existing speech recognition systems are vulnerable to speech double-speed attacks. A speech double-speed attack causes ASR recognition errors simply by speeding up or slowing down the original audio. Modifying the speed of a single segment (for example, 20 ms) is enough for an untargeted attack, and a targeted attack needs only a few rounds of optimization to achieve a malicious target (for example, opening a door) while the audio remains clearly audible. As speech technology and electronic devices develop, the barrier to speech double-speed attacks keeps falling, their effectiveness keeps improving, and the harm they cause keeps growing. An efficient, low-cost speech double-speed attack detection method is therefore urgently needed.

Many existing studies defend against speech attacks by detecting the noise and distortion introduced when speech adversarial examples are generated. However, such detection methods struggle to detect speech double-speed attacks, which add no noise.

Summary of the invention

To solve the technical problems in the background art described above, the present invention provides a speech double-speed attack detection method based on prosodic features and a random forest classifier. Double-speed adversarial audio is obtained through a speech double-speed attack based on the particle swarm algorithm; three types of features that effectively and faithfully reflect the prosodic differences between normal audio and double-speed adversarial audio are extracted; and attack detection is performed with a random forest classifier. The method can accurately and effectively detect speech spoofing attacks against speech recognition systems, as represented by double-speed attacks.

To achieve the above object, the technical solution adopted by the present invention is:

A speech double-speed attack detection method based on prosodic features and a random forest classifier, comprising:

obtaining an audio data set that includes normal audio and double-speed adversarial audio;

extracting the jitter features, shimmer features, and harmonic-to-noise ratio features of all audio in the audio data set to form feature vectors;

training a random forest classifier with the feature vectors of the normal audio and the double-speed adversarial audio, and using the trained random forest classifier for speech double-speed attack detection.

Further, the double-speed adversarial audio is obtained by applying a double-speed operation to the normal audio, with no additional noise added.

Further, the jitter features comprise the jitt feature, the jitta feature, the rap feature, and the ppq5 feature, calculated as:

where T_i denotes the duration of the i-th jitter period in the audio, N denotes the total number of jitter periods in the audio, and jitta, jitt, jitt_rap, and jitt_ppq5 denote the jitta feature, the jitt feature, the rap feature, and the ppq5 feature, respectively.
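The formula images for these four features are not reproduced in this text. As a hedged reconstruction from the verbal definitions in the detailed description, and assuming the conventional Praat-style forms (in particular, that the local average in rap and ppq5 is taken over a window centered on T_i that includes T_i itself, and that no percentage scaling is applied), they could be written as:

```latex
\begin{aligned}
\mathrm{jitta} &= \frac{1}{N-1}\sum_{i=1}^{N-1}\left|T_i - T_{i+1}\right| \\
\mathrm{jitt} &= \frac{\mathrm{jitta}}{\frac{1}{N}\sum_{i=1}^{N} T_i} \\
\mathrm{jitt}_{rap} &= \frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|T_i - \frac{1}{3}\sum_{n=i-1}^{i+1} T_n\right|}{\frac{1}{N}\sum_{i=1}^{N} T_i} \\
\mathrm{jitt}_{ppq5} &= \frac{\frac{1}{N-4}\sum_{i=3}^{N-2}\left|T_i - \frac{1}{5}\sum_{n=i-2}^{i+2} T_n\right|}{\frac{1}{N}\sum_{i=1}^{N} T_i}
\end{aligned}
```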

Further, the shimmer features comprise the Shim feature, the ShdB feature, the apq5 feature, and the apq11 feature, calculated as:

where A_i denotes the amplitude of the i-th shimmer period in the audio, M denotes the total number of shimmer periods in the audio, and shim, ShdB, apq5, and apq11 denote the Shim feature, the ShdB feature, the apq5 feature, and the apq11 feature, respectively.
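The formula images for these four features are likewise not reproduced in this text. A hedged reconstruction from the verbal definitions in the detailed description, assuming that A_i is a peak amplitude, that ShdB uses the base-10 logarithm of the ratio of consecutive amplitudes multiplied by 20, and that the local averages in apq5 and apq11 include A_i itself, could read:

```latex
\begin{aligned}
\mathrm{shim} &= \frac{\frac{1}{M-1}\sum_{i=1}^{M-1}\left|A_i - A_{i+1}\right|}{\frac{1}{M}\sum_{i=1}^{M} A_i} \\
\mathrm{ShdB} &= \frac{1}{M-1}\sum_{i=1}^{M-1}\left|20\log_{10}\frac{A_{i+1}}{A_i}\right| \\
\mathrm{apq5} &= \frac{\frac{1}{M-4}\sum_{i=3}^{M-2}\left|A_i - \frac{1}{5}\sum_{n=i-2}^{i+2} A_n\right|}{\frac{1}{M}\sum_{i=1}^{M} A_i} \\
\mathrm{apq11} &= \frac{\frac{1}{M-10}\sum_{i=6}^{M-5}\left|A_i - \frac{1}{11}\sum_{n=i-5}^{i+5} A_n\right|}{\frac{1}{M}\sum_{i=1}^{M} A_i}
\end{aligned}
```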

Further, the harmonic-to-noise ratio feature is calculated as:

where sig_per is the proportion of the periodic signal in the audio, sig_noise is the proportion of noise in the audio signal, and hnr denotes the harmonic-to-noise ratio feature. Using different analysis window lengths yields two harmonic-to-noise ratio features, hnr05 and hnr15, where the analysis window length of hnr15 is 3 times that of hnr05.
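The formula image is not reproduced in this text. Given that the detailed description states hnr is expressed in dB and is the ratio of the periodic proportion to the noise proportion, the standard decibel form is the natural reconstruction (an assumption, not a quotation of the original formula):

```latex
\mathrm{hnr} = 10\log_{10}\frac{sig_{per}}{sig_{noise}}
```

For example, a signal that is 99% periodic and 1% noise would have hnr = 10 log10(99), about 20 dB.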

Further, when the trained random forest classifier is used for speech double-speed attack detection, the jitter features, shimmer features, and harmonic-to-noise ratio features of the audio under test are extracted to form a feature vector, and the feature vector is fed to the trained random forest classifier to obtain the detection result.
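A minimal sketch of this detection step in Python, assuming the ten feature values have already been computed and that the classifier exposes a `predict` method taking a list of vectors (an assumed interface, similar in spirit to common machine learning libraries). The names `FEATURE_ORDER`, `build_feature_vector`, and `detect` are hypothetical. The point is that training and detection must use the same feature order.

```python
# Fixed feature order shared by training and detection (order is an assumption).
FEATURE_ORDER = ("jitt", "jitta", "rap", "ppq5",
                 "shim", "shdb", "apq5", "apq11",
                 "hnr05", "hnr15")

def build_feature_vector(features):
    """Turn a dict of named prosodic features into a 10-dimensional vector."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise KeyError("missing features: " + ", ".join(missing))
    return [float(features[name]) for name in FEATURE_ORDER]

def detect(classifier, features):
    """Return the classifier's label for one audio clip.

    `classifier` is any object with predict(list_of_vectors) -> list_of_labels.
    """
    return classifier.predict([build_feature_vector(features)])[0]
```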

The beneficial effects of the present invention are:

Double-speed attack audio that adds no noise is difficult for traditional methods to identify. Targeting the differences in prosodic features between double-speed attack audio and normal audio, the present invention proposes using 10-dimensional features covering jitter, shimmer, and harmonic-to-noise ratio to provide effective feature data for attack detection, combined with the random forest algorithm for replay attack detection. In a speech spoofing attack, even if the attacker produces a sound very similar to the real user's voice, applying a speed change to that sound produces differences in jitter, shimmer, and harmonic-to-noise ratio, so the method can be used to detect speech double-speed spoofing attacks.

The invention can efficiently detect speech double-speed spoofing attacks using the microphone and voice hardware already present in a speech recognition system. It offers low cost and high detection accuracy, can be used to protect speech recognition systems on smart devices such as mobile phones, and has broad demand and application prospects.

Description of drawings

FIG. 1 is a schematic flowchart of a speech double-speed attack detection method based on prosodic features and a random forest classifier according to an embodiment of the present invention.

Detailed description

The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings. The flowchart shown in the drawings is merely illustrative and need not include all steps. For example, some steps may be further decomposed and others may be combined or partially combined, so the actual order of execution may change according to the actual situation.

A speech double-speed attack is a speech adversarial attack that simply speeds up or slows down the original audio rather than adding perturbations. To distinguish normal audio from double-speed adversarial audio that adds no noise, the present invention exploits the "unnatural" distortion introduced by the double-speed operation and proposes a speech double-speed attack detection method based on prosodic features and a random forest classifier. Specific features are extracted from the speech signals received by the microphone of an electronic device and labeled; the labeled features are used to train a random forest classifier; and the trained classifier performs speech double-speed attack detection on the speech signal under test, outputting whether it is double-speed adversarial speech. The method mainly includes the following steps:

1) Feature extraction:

The overall process is shown in FIG. 1. The input audio first goes through the feature extraction stage. A change in speech rate causes pronunciation differences between double-speed adversarial audio and normal audio, and the key to using prosodic features for attack detection is to extract the features that differ most between the two. Prosodic features are one of the important forms of human language and emotional expression and include pitch, intonation, voice jitter, and so on. The present invention uses 4 jitter features (jitta, jitt, rap, ppq5), 4 shimmer features (Shim, ShdB, apq5, apq11), and 2 harmonic-to-noise ratio features (hnr05, hnr15) as the combination of prosodic features that distinguishes normal audio from double-speed adversarial audio, specifically:

1.1) Jitter

Jitter characterizes the frequency variation between signal cycles and is mainly caused by a lack of control over vocal fold vibration. The jitter value indicates how hoarse a particular speaker's voice is, and speakers with voice disorders generally show stronger jitter. The present invention selects four jitter features: jitt, jitta, rap, and ppq5. Let T_i denote the duration of the i-th jitter period in the audio, and N denote the total number of jitter periods in the audio.

The jitta feature is the mean absolute difference between consecutive jitter periods, calculated as:

The jitt feature is the mean absolute difference between consecutive jitter periods divided by the mean jitter period, calculated as:

The rap feature is the relative average perturbation: the mean absolute difference between a jitter period and the average of its two neighboring periods, divided by the mean jitter period. It is denoted jitt_rap and calculated as:

The ppq5 feature is the mean absolute difference between a jitter period and the average of its four neighboring periods, divided by the mean jitter period. It is denoted jitt_ppq5 and calculated as:
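The formulas referred to above are not reproduced in this text. The following is a minimal Python sketch of the four verbal definitions, assuming the period durations T_i have already been measured by a pitch tracker (not shown). It also assumes, following the common Praat-style convention, that the local average in rap and ppq5 is taken over a 3-point or 5-point window centered on T_i and including T_i itself. The function name `jitter_features` is hypothetical.

```python
def jitter_features(periods):
    """Return jitta, jitt, rap and ppq5 from period durations T_i (in seconds)."""
    n = len(periods)
    if n < 5:
        raise ValueError("need at least 5 periods to compute ppq5")
    mean_period = sum(periods) / n

    # jitta: mean absolute difference between consecutive periods.
    jitta = sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)

    def perturbation(k):
        # Mean absolute difference between T_i and the average of the k-point
        # window centered on T_i (assumption: the window includes T_i itself),
        # divided by the mean period.
        half = k // 2
        diffs = [abs(periods[i] - sum(periods[i - half:i + half + 1]) / k)
                 for i in range(half, n - half)]
        return (sum(diffs) / len(diffs)) / mean_period

    return {
        "jitta": jitta,
        "jitt": jitta / mean_period,   # jitta relative to the mean period
        "rap": perturbation(3),        # a period and its two neighbors
        "ppq5": perturbation(5),       # a period and its four neighbors
    }
```

A perfectly regular sequence of periods gives zero for all four features; any cycle-to-cycle irregularity makes them positive.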

1.2) Shimmer

Shimmer characterizes the amplitude fluctuation of the sound wave signal. It is related to human breathing and the noise produced while speaking, and is generally regarded as a pathological prosodic feature. The present invention selects four shimmer features: Shim, ShdB, apq5, and apq11. Let A_i denote the amplitude of the i-th shimmer period in the audio, and M denote the total number of shimmer periods in the audio.

The Shim feature is the mean absolute difference between consecutive shimmer amplitudes divided by the mean shimmer amplitude, calculated as:

The ShdB feature is the mean absolute logarithm of the shimmer amplitudes, calculated as:

The apq5 feature is analogous to ppq5: the mean absolute difference between a shimmer amplitude and the average of its four neighboring amplitudes, divided by the mean shimmer amplitude, calculated as:

The apq11 feature is the mean absolute difference between a shimmer amplitude and the average of its ten neighboring amplitudes, divided by the mean shimmer amplitude, calculated as:

1.3) Harmonic-to-noise ratio

The harmonic-to-noise ratio characterizes the ratio of periodic to non-periodic components in the audio signal and reflects the texture of human speech, such as a soft or a forceful tone. The present invention selects two harmonic-to-noise ratio features: hnr05 and hnr15.

hnr, the harmonic-to-noise ratio, expresses the degree of periodicity of the sound wave in dB. hnr05 and hnr15 differ in the length of the analysis window. The calculation formula is:

where sig_per is the proportion of the periodic signal and sig_noise is the proportion of noise in the signal. The analysis window length is the length of data used to calculate the harmonic-to-noise ratio; the analysis window length of hnr15 is 3 times that of hnr05.
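The formula referred to above is not reproduced in this text. The sketch below assumes the standard decibel form, hnr = 10 log10(sig_per / sig_noise), and estimates the periodic proportion from the normalized autocorrelation r of one analysis window at the pitch lag, with 1 - r taken as the noise proportion. That estimator is a common approach but an assumption here, not something the text specifies. All function names are hypothetical, and the pitch lag is assumed to be known.

```python
import math

def hnr_db(sig_per, sig_noise):
    """Harmonic-to-noise ratio in dB from periodic and noise proportions."""
    if sig_per <= 0.0 or sig_noise <= 0.0:
        raise ValueError("both proportions must be positive")
    return 10.0 * math.log10(sig_per / sig_noise)

def normalized_autocorrelation(window, lag):
    """Normalized autocorrelation of one analysis window at `lag` samples."""
    a = window[:len(window) - lag]
    b = window[lag:]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def hnr_of_window(window, lag):
    """Estimate hnr for one window: r is the periodic share, 1 - r the noise share."""
    r = normalized_autocorrelation(window, lag)
    r = min(max(r, 1e-6), 1.0 - 1e-6)  # keep both shares strictly positive
    return hnr_db(r, 1.0 - r)
```

Under these assumptions, hnr05 and hnr15 would correspond to calling `hnr_of_window` on windows whose lengths differ by a factor of 3; the text does not give the absolute window lengths.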

2) Attack detection:

For detecting speech double-speed attacks, prosodic features better characterize the pronunciation distortion caused by changes in speech rate, which makes it possible to detect double-speed adversarial audio.

The present invention uses the Google TTS interface to convert text into speech, generating one hundred normal audio clips, and uses the particle swarm optimization algorithm to generate one hundred double-speed adversarial audio clips based on the normal audio. As described above, a total of 10 prosodic features are extracted: 4 jitter features (jitt, jitta, rap, ppq5), 4 shimmer features (Shim, ShdB, apq5, apq11), and 2 harmonic-to-noise ratio features (hnr05, hnr15). A random forest classifier then learns from the 10-dimensional features of the normal audio and the double-speed adversarial audio. Each decision tree outputs the probability that the input is normal audio or double-speed adversarial audio, and the random forest determines the final classification result by a vote among the decision trees.
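To make the training and voting step concrete, here is a compact random forest sketch in pure Python: each tree is grown on a bootstrap sample, considers a random subset of the features at every split, and the forest takes a majority vote. It is a teaching sketch under stated assumptions, not the implementation used in the patent. The hyperparameters (25 trees, depth 4, 3 features per split) and all function names are illustrative, and class labels are assumed to be integers such as 0 for normal and 1 for adversarial.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Best (score, feature, threshold) over the given features, or None."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best

def build_tree(X, y, rng, max_depth=4, n_sub=3):
    """Grow one tree; leaves are labels, inner nodes are (feature, thr, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if max_depth == 0 or len(set(y)) == 1:
        return majority
    # Random feature subset per split, as in a random forest.
    features = rng.sample(range(len(X[0])), min(n_sub, len(X[0])))
    split = best_split(X, y, features)
    if split is None:
        return majority
    _, f, thr = split
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    if not left or not right:
        return majority
    return (f, thr,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, max_depth - 1, n_sub),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, max_depth - 1, n_sub))

def tree_predict(node, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node

def train_forest(X, y, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def forest_predict(forest, row):
    """Majority vote of all trees."""
    return Counter(tree_predict(t, row) for t in forest).most_common(1)[0][0]
```

In use, `X` would hold one 10-dimensional prosodic feature vector per audio clip and `y` the matching labels; `forest_predict` then classifies the feature vector of a new clip.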

In this embodiment, the detection results of the random forest are compared with those of a decision tree and an SVM classifier, as shown in Table 1:

Table 1. Detection results of different classifiers for double-speed attacks

Classifier      Accuracy  Recall  Precision  F1 score  Equal error rate  AUC
Random forest   0.921     0.897   0.903      0.916     0.192             0.838
Decision tree   0.865     0.866   0.864      0.872     0.14              0.866
SVM             0.783     0.741   0.864      0.842     0.241             0.849

As shown, the random forest, decision tree, and SVM classifiers all distinguish normal audio from double-speed adversarial audio, with the random forest reaching an accuracy of 92.1%.
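For reference, the threshold-based metrics of the kind reported in Table 1 (accuracy, recall, precision, F1 score) can be computed from predicted and true labels as sketched below. The sketch assumes that double-speed adversarial audio is the positive class (label 1); the text does not state which class was treated as positive. Equal error rate and AUC need classifier scores rather than hard labels and are not covered here. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, precision and F1 for a binary detection task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }
```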

The present invention obtains double-speed adversarial audio through a speech double-speed attack based on the particle swarm algorithm and extracts three types of features that effectively and faithfully reflect the prosodic differences between normal audio and double-speed adversarial audio. Based on the systematic differences in pronunciation prosody between the two, the features are fed into a random forest classifier, which then separates normal audio from double-speed adversarial audio. In a speech spoofing attack, even if the attacker produces a sound very similar to the real user's voice, applying a speed change to that sound inevitably causes some degree of pronunciation difference. The difference is so small that traditional feature extraction methods struggle to tell the two apart, but the detection method proposed here, based on prosodic features and a random forest classifier, can effectively detect such noise-free double-speed adversarial audio and provides guidance for protecting speech recognition systems.

Those skilled in the art should understand that the embodiments of the present invention shown in the above description and the accompanying drawings are examples only and do not limit the present invention. The objects of the present invention have been fully and effectively achieved. The functional and structural principles of the present invention have been shown and described in the embodiments, and the embodiments may be varied or modified in any way without departing from those principles.

Claims

1. A speech double-speed attack detection method based on prosodic features and a random forest classifier, characterized by comprising:

Obtain audio datasets, including normal audio and double-speed adversarial audio;

Extract the jitter features, vibrato features and harmonic-to-noise ratio features of all audios in the audio data set to form feature vectors;

The random forest classifier is trained using the feature vectors of normal audio and double-speed adversarial audio, and the trained random forest classifier is used for speech double-speed attack detection.

2. The speech double-speed attack detection method based on prosodic features and a random forest classifier according to claim 1, characterized in that the double-speed adversarial audio is obtained by applying a double-speed operation to the normal audio, with no additional noise added.

3. The speech double-speed attack detection method based on prosodic features and a random forest classifier according to claim 1, characterized in that the jitter features comprise the jitt feature, the jitta feature, the rap feature, and the ppq5 feature, calculated as:

where T_i denotes the duration of the i-th jitter period in the audio, N denotes the total number of jitter periods in the audio, and jitta, jitt, jitt_rap, and jitt_ppq5 denote the jitta feature, the jitt feature, the rap feature, and the ppq5 feature, respectively.

4. The speech double-speed attack detection method based on prosodic features and a random forest classifier according to claim 1, characterized in that the shimmer features comprise the Shim feature, the ShdB feature, the apq5 feature, and the apq11 feature, calculated as:

where A_i denotes the amplitude of the i-th shimmer period in the audio, M denotes the total number of shimmer periods in the audio, and shim, ShdB, apq5, and apq11 denote the Shim feature, the ShdB feature, the apq5 feature, and the apq11 feature, respectively.

5. The speech double-speed attack detection method based on prosodic features and a random forest classifier according to claim 1, characterized in that the harmonic-to-noise ratio feature is calculated as:

where sig_per is the proportion of the periodic signal in the audio, sig_noise is the proportion of noise in the audio signal, and hnr denotes the harmonic-to-noise ratio feature; using different analysis window lengths yields two harmonic-to-noise ratio features, hnr05 and hnr15, where the analysis window length of hnr15 is 3 times that of hnr05.

6. The speech double-speed attack detection method based on prosodic features and a random forest classifier according to claim 1, characterized in that, when the trained random forest classifier is used for speech double-speed attack detection, the jitter features, shimmer features, and harmonic-to-noise ratio features of the audio under test are extracted to form a feature vector, and the feature vector is used as the input of the trained random forest classifier to obtain the detection result.