CN108074585A - Speech abnormality detection method based on sound source features - Google Patents
Speech abnormality detection method based on sound source features
- Publication number
- CN108074585A CN108074585A CN201810126670.6A CN201810126670A CN108074585A CN 108074585 A CN108074585 A CN 108074585A CN 201810126670 A CN201810126670 A CN 201810126670A CN 108074585 A CN108074585 A CN 108074585A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Emergency Alarm Devices (AREA)
Abstract
The invention discloses a speech abnormality detection method based on sound source features, comprising the following steps: collecting speech data in real time through a sensor; preprocessing the resulting speech-segment data; applying iterative adaptive inverse filtering to the speech-segment data to obtain the glottal wave signal; extracting feature parameters from the glottal wave signal, namely the normalized amplitude quotient and the glottal closure time ratio; feeding the extracted feature data into a trained SVM model for classification; and obtaining a classification label that indicates the speaker's state, which is output to an execution module for feedback. The distinguishing feature of the invention is that, for speech altered by mental stress, it departs from recognition methods based on the traditional linear speech production model and on acoustic feature parameters lacking physical meaning: it establishes a sound source estimation model and uses the inverse filtering of speech production to analyze and extract feature parameters based on human vocal fold vibration for detecting abnormal speech.
Description
Technical Field
The invention relates to a speech abnormality detection method based on sound source features, and belongs to the technical field of intelligent speech.
Background Art
Stress is the body's natural response to physical, mental, or emotional stimuli. When we are exposed to such stimuli, the brain releases alcohol- and peptide-class substances into the body, triggering a stress response. Persistent anxiety about work is reflected in the vocal organs, causing changes in a series of parameters such as phonation frequency and speaking rate. These changes are highly significant in many areas of speech signal processing, such as stressed-speech recognition and emotion recognition.
One important manifestation of stress is the speaker's voice, and stress has become a very important factor affecting speech production. When the surrounding environment or the speaker's own condition changes abnormally, or when the user is focused primarily on another task so that speech interaction is only secondary, work-related stress places the speaker under mental pressure and strongly affects pronunciation. This produces an abnormal state and speech variation, and the abnormal state is often reflected in the speaker's voice, yielding a speech signal produced under abnormal stress.
However, speech altered by mental stress, and especially by multi-task cognitive load, is relatively hard to distinguish by ear, and common acoustic features cannot classify it reliably, lacking stability and robustness. In addition, during the generation of such variant speech, its sound source characteristics differ significantly from those of normal speech. Therefore, in the detection process, sound source features are used to improve the reliability of variant-speech classification. Improving the labeling efficiency of variant speech lays the foundation for a strongly robust speech recognition system.
Summary of the Invention
The problem to be solved by the invention is to detect the stress state from the perspective of the sound source of speech production, proposing a stress detection method based on speech production modeling. The distinguishing feature of the invention is that it departs from recognition methods based on the traditional linear speech production model and on acoustic feature parameters lacking physical meaning: it establishes a sound source estimation model and uses the inverse filtering of speech production to analyze and extract feature parameters based on human vocal fold vibration for detecting abnormal speech.
The technical scheme of the invention is as follows:
A speech abnormality detection method based on sound source features comprises the following steps:
(1) collecting speech data in real time through a sensor;
(2) distinguishing the speech segments and noise segments of the speech data by endpoint detection, to decide whether to proceed with further speech signal processing;
(3) framing and windowing the speech data of the obtained speech segments, and applying high-frequency pre-emphasis to each frame;
(4) applying iterative adaptive inverse filtering to the speech-segment data to obtain the glottal wave signal;
(5) extracting the glottal wave feature parameters, namely the normalized amplitude quotient and the glottal closure time ratio;
(6) feeding the extracted data into a trained SVM model for classification;
(7) obtaining a classification label used to judge the speaker's state, outputting the speaker state label, and passing it to an execution module for feedback.
In step (3), the windowing applies a Hamming window to each frame of speech.
In step (3), the high-frequency pre-emphasis boosts the high-frequency part through a first-order finite impulse response (FIR) high-pass filter.
The steps for obtaining the glottal wave signal in step (4) are as follows:
(a) a vocal tract model is built using iterative adaptive inverse filtering;
iterative adaptive inverse filtering removes the influence of the glottal excitation from the spectrum of the original speech signal;
(b) the influence of the formants is then removed by inverse filtering;
the acoustic model is built accurately through linear predictive coding and discrete all-pole modeling, and finally inverse filtering is used to obtain the glottal wave signal.
The normalized amplitude quotient, a glottal wave feature parameter in step (5), is extracted as follows:
NAQ = AQ / T (1)
where NAQ is the normalized amplitude quotient, T is the pitch period, and AQ is the amplitude quotient, i.e., the ratio of the maximum amplitude of the glottal wave to the maximum negative peak of its first derivative;
AQ = f_ac / d_peak (2)
where f_ac is the maximum peak of the glottal pulse and d_peak is the maximum negative peak of the first derivative of the glottal pulse.
The glottal closure time ratio in step (5) is extracted as follows:
CPR = CP / O (3)
where CPR is the glottal closure time ratio, CP is the glottal closed phase, and O is the total glottal open time.
Beneficial effects achieved by the invention:
Building on the study of how the phonatory physiological system changes under the influence of stress, the invention investigates the intrinsic relationship between physiological characteristics and sound source parameters and identifies the important factors in the glottal wave features that reflect the stress state, so that the derived glottal wave parameters are both theoretically grounded and physically meaningful. Glottal wave parameters that describe the stress-related sound source characteristics of the phonatory system are found, and the intrinsic link between sound source features and physiological features is established. These features mark the correlation with stress-induced variation, characterize the vibration mode of the vocal folds with clear physical meaning, and are finally used to detect abnormal speech states, improving the accuracy and reliability of the speech recognition system.
The invention can be applied in an in-vehicle environment: the stress states of the driver and passengers are judged by detecting their speech data, the state information is fed back to the execution module through a transmission device, and the execution module then automatically takes effective measures, such as reminding the driver to drive safely or using the Internet of Vehicles to notify nearby vehicles to give way, thereby protecting life and property.
Description of the Drawings
Figure 1 is the basic flow chart of the invention;
Figure 2 is the basic flow chart for obtaining the SVM classification model;
Figure 3 is the structure diagram of the iterative adaptive inverse filtering (IAIF) technique established by the invention;
Figure 4 shows the ROC curves of the five parameters in Embodiment 1;
Figure 5 lists the ROC curve parameter values of the five features in Embodiment 1, where AUC is the area under the curve, SE is the standard deviation, and CL is the confidence interval;
Figure 6 shows the average recognition rate of the classifier after 50 rounds of experiments in Embodiment 1.
Detailed Description
The invention is further described below in conjunction with the drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and do not limit its scope of protection.
As shown in Figure 1, a speech abnormality detection method based on sound source features comprises the following steps:
(1) collecting speech data in real time through a sensor;
(2) distinguishing the speech segments and noise segments of the speech data by endpoint detection, to decide whether to proceed with further speech signal processing;
The invention uses a speech endpoint detection method based on energy and short-time zero-crossing rate to distinguish speech segments effectively. These are existing, mature detection methods and are not elaborated here.
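As a rough illustration, the energy plus short-time zero-crossing-rate endpoint detection described above can be sketched as follows. The frame geometry and the two thresholds are illustrative assumptions, not values given in the patent:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def endpoint_detect(x, frame_len=400, hop=160,
                    energy_thresh=0.01, zcr_thresh=0.3):
    """Mark frames as speech when the short-time energy is high and the
    zero-crossing rate is low (characteristic of voiced speech)."""
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)                      # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thresh) & (zcr < zcr_thresh)
```

Applied to a recording that starts with silence and continues with a voiced tone, the mask is False over the silent frames and True over the voiced ones.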
(3) The speech data of the obtained speech segments are framed and windowed, and high-frequency pre-emphasis is applied to each frame;
Pre-emphasis: the average power spectrum of the speech signal is shaped by the glottal excitation and by radiation from the mouth and nose; above roughly 800 Hz the high-frequency end falls off at about 6 dB/octave, so the higher the frequency, the smaller the corresponding component. The high-frequency part is therefore boosted through a first-order FIR high-pass filter before the speech signal is analyzed.
Framing: because the speech signal is short-time stationary, it can be processed frame by frame. Macroscopically, a frame must be short enough that the signal within it is stationary, i.e., shorter than one phoneme. At a normal speaking rate a phoneme lasts roughly 50-200 ms, so the frame length is generally under 50 ms. Microscopically, a frame must contain enough vibration periods, since the Fourier transform analyzes frequency and needs enough repetitions to do so. The fundamental frequency of speech is around 100 Hz for male voices and around 200 Hz for female voices, corresponding to periods of 10 ms and 5 ms. Since a frame should contain several periods, it is generally at least 20 ms long.
Windowing: a Hamming window is applied to each frame of speech. It offers good frequency resolution and reduces spectral leakage, thereby lessening the influence of the Gibbs effect.
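The pre-emphasis, framing, and Hamming-windowing chain described above can be sketched as a minimal preprocessing routine. The 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame geometry are common conventions assumed here, not values specified in the patent:

```python
import numpy as np

def preprocess(x, fs=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis (first-order FIR high-pass), framing, Hamming windowing."""
    # Boost the high-frequency part: y[n] = x[n] - alpha * x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    frame_len = int(fs * frame_ms / 1000)     # 25 ms covers several pitch periods
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    frames = np.stack([y[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)     # reduce spectral leakage
```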
(4) As shown in Figure 3, iterative adaptive inverse filtering is applied to the speech-segment data to obtain the glottal wave signal:
(a) a vocal tract model is built using iterative adaptive inverse filtering (IAIF);
iterative adaptive inverse filtering removes the influence of the glottal excitation from the spectrum of the original speech signal;
(b) the influence of the formants is then removed by inverse filtering (IF);
the acoustic model is built accurately through linear predictive coding (LPC) and discrete all-pole (DAP) modeling, and finally inverse filtering (IF) is used to obtain the glottal wave signal, as shown in Figure 2.
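The IAIF chain of steps (a) and (b) can be outlined roughly as below. This is a strongly simplified sketch: full IAIF iterates the glottis and vocal-tract estimation twice and uses DAP modeling rather than plain autocorrelation LPC, and the model orders here are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """All-pole coefficients via the autocorrelation (least-squares) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.lstsq(R, r[1:order + 1], rcond=None)[0]
    return np.concatenate(([1.0], -a))            # A(z) = 1 - sum a_k z^-k

def iaif(frame, vt_order=10, glottis_order=2):
    """Simplified IAIF: cancel a coarse glottal contribution, model the
    vocal tract, inverse-filter the speech, then integrate (cancelling
    lip radiation) to estimate the glottal flow."""
    g1 = lpc(frame, glottis_order)                # coarse glottal spectrum
    deglottalized = lfilter(g1, [1.0], frame)     # remove glottal tilt
    vt = lpc(deglottalized, vt_order)             # vocal tract model
    residual = lfilter(vt, [1.0], frame)          # remove formant structure
    return np.cumsum(residual)                    # integrate -> glottal flow
```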
(5) The glottal wave feature parameters, the normalized amplitude quotient and the glottal closure time ratio, are extracted;
Under work-related stress, contraction of the vocal fold muscles makes vocal fold vibration irregular. This changes the airflow through the glottis and thus alters the speech signal. These changes in the vocal folds are reflected in the characteristics of the glottal wave, so the glottal wave can reflect work stress to some extent. The normalized amplitude quotient (NAQ) and the glottal closure time ratio (CPR) are used to characterize the essential properties of the glottal wave; the proposed features have a clear physical meaning and reflect the different vibration modes of the vocal folds during speech production.
First glottal wave feature parameter: the normalized amplitude quotient, which mainly reflects how the vocal folds close. It is extracted as follows:
NAQ = AQ / T (1)
where NAQ is the normalized amplitude quotient, T is the pitch period, and AQ is the amplitude quotient, i.e., the ratio of the maximum amplitude of the glottal wave to the maximum negative peak of its first derivative;
AQ = f_ac / d_peak (2)
where f_ac is the maximum peak of the glottal pulse and d_peak is the maximum negative peak of the first derivative of the glottal pulse. Because the instants of glottal opening and closure need not be measured, AQ is relatively easy to obtain; however, its value depends on measuring the fundamental frequency (F0) of the signal. In equation (1), NAQ is therefore obtained by normalizing by the pitch period, removing the dependence on the fundamental-frequency measurement.
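Under the definitions above, NAQ can be computed from an estimated glottal flow as sketched below. Treating the input as a single pitch period and passing f0 explicitly are simplifying assumptions made for illustration:

```python
import numpy as np

def naq(glottal, fs, f0):
    """Normalized amplitude quotient of one glottal-flow period.
    AQ = f_ac / |d_peak|; NAQ = AQ / T, with T the pitch period."""
    f_ac = np.max(glottal)                 # maximum flow amplitude
    d = np.diff(glottal) * fs              # first derivative of the flow
    d_peak = np.abs(np.min(d))             # largest negative derivative peak
    T = 1.0 / f0                           # pitch period in seconds
    return (f_ac / d_peak) / T
```

For a half-rectified sine pulse of frequency f0 the analytic value is 1/(2*pi), which the discrete estimate approaches.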
Second glottal wave feature parameter: the glottal closure time ratio (CPR). The CPR parameter reflects the ratio of the glottal closed phase to the total glottal open time, which appears in the glottal wave mainly as the skewness of the glottal signal.
The glottal closure time ratio is extracted as follows:
CPR = CP / O (3)
where CPR is the glottal closure time ratio, CP is the glottal closed phase, and O is the total glottal open time.
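A hedged sketch of a CPR estimate follows. Since the patent does not specify how the closed phase is located, a simple amplitude threshold (5% of the peak, an assumption made here) splits one period into closed (CP) and open (O) portions:

```python
import numpy as np

def cpr(glottal, threshold=0.05):
    """Glottal closure time ratio over one glottal-flow period.
    Samples below `threshold` of the peak count as the closed phase (CP);
    the remaining samples form the open phase (O). CPR = CP / O."""
    level = threshold * np.max(glottal)
    closed = np.sum(glottal <= level)      # closed-phase samples (CP)
    open_ = len(glottal) - closed          # open-phase samples (O)
    return closed / open_
```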
(6) The extracted data are fed into the trained SVM model for classification;
Support vector machines (SVMs) have long played an important role in pattern recognition [8]; the support vectors are the training samples that lie on the margin boundaries. An SVM classifies with linear or nonlinear hyperplanes. SVMs are built on the VC-dimension theory of statistical learning and on the principle of structural risk minimization: given limited sample information, they seek the best trade-off between model complexity (accuracy on the given training samples) and learning capacity (the ability to classify arbitrary samples without error), so as to obtain the best generalization ability. The SVM is essentially a binary nonlinear classifier, which suits the distinctive characteristics of variant-speech recognition: (1) since a speaker is not under stress at every moment, stress appears in continuous speech only as brief transients, so only a small number of samples can be labeled as stressed variant speech, making variant-speech recognition generally a small-sample problem; (2) recognizing emotionally stressed variant speech is a typical two-class recognition problem. A classification model based on the SVM is established; in the speaker-dependent case, the relatively small number of samples per test speaker makes this a typical small-sample problem, and under these conditions the SVM model achieves good recognition results.
Through the SVM classification model, the invention recognizes and classifies variant speech and normal speech, evaluating the sensitivity of the proposed sound source parameters to the variant state and thereby verifying the effectiveness of the proposed method.
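Using [NAQ, CPR] pairs as two-dimensional feature vectors, the SVM training stage can be sketched with scikit-learn as below. The cluster centers and spreads are synthetic stand-ins for real labeled data, and the label convention (0 = normal, 1 = stressed) is an assumption:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic [NAQ, CPR] features: label 0 = normal, 1 = stressed (assumed).
normal = rng.normal([0.15, 0.9], 0.02, size=(125, 2))
stressed = rng.normal([0.10, 1.3], 0.02, size=(125, 2))
X = np.vstack([normal, stressed])
y = np.array([0] * 125 + [1] * 125)

# RBF kernel gives a nonlinear separating hyperplane, suited to small samples.
clf = SVC(kernel="rbf").fit(X, y)
```

At run time, `clf.predict([[naq_value, cpr_value]])` yields the speaker-state label that is passed on to the execution module.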
(7) A classification label is obtained and used to judge the speaker's state; the speaker state label is output and passed to an execution module for feedback.
Embodiment 1
A database collected by Fujitsu was used, containing speech samples from 11 speakers (4 male and 7 female). To simulate the specific conditions that produce psychological stress, three different tasks were set for the speakers, carried out during telephone conversations with an operator, so as to simulate stress over the phone.
The three tasks were: (A) high concentration (the speaker completes tasks that include solving logic puzzles and finding the differences between two pictures); (B) time pressure (the speaker answers questions under time pressure); (C) risk-taking (the speaker takes gambling-style tasks that assess the desire for monetary gain). For each speaker there were four dialogues with different tasks. In two of the dialogues the speaker was asked to complete tasks within a limited time, while the other dialogues had no task and consisted of relaxed chat.
The portions excerpted from the speech are the vowels /a/, /i/, /u/, /e/, /o/. The experiments were performed per speaker, and all results are speaker-dependent. The experiments were carried out with the 11 selected subjects over the telephone system; the number of samples depends on the speaker, and the total number of speech samples is 700.
The verification data used in the invention all come from telephone communication data, with 100 subjects (50 male, 50 female) participating in the experiment. An operator chatted with each subject over the phone, four dialogues per person on average and 10 minutes per dialogue, and realistic speech communication data were recorded. Of the four dialogues, two were casual chat in a relaxed state; in the other two, the subjects were placed under different types of stress: (1) multiple work tasks; (2) time pressure; (3) risky speculation. Details are given in Table 1. The real speech data of the subjects speaking under stress were recorded and used to verify the effectiveness of the stress detection method.
Table 1
To verify the effectiveness of the proposed method, the receiver operating characteristic (ROC) curve is used to evaluate the recognition performance of the different parameters, as shown in Figures 4 and 5. The ROC curve is drawn from a series of binary classification settings (cut-off values or decision thresholds), with the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis. The closer the ROC curve lies to the upper-left corner and the larger the area under the curve (AUC), the better the recognition performance of the method and the higher its accuracy.
True positive rate (TPR):
TPR = TP / (TP + FN)
False positive rate (FPR):
FPR = FP / (FP + TN)
where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
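The TPR/FPR threshold sweep behind the ROC curve can be sketched as follows. The sorting-and-accumulation scheme and trapezoidal AUC are standard; the scores in the test are illustrative:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points and AUC: sort by descending score, sweep the threshold,
    accumulate TPR = TP/(TP+FN) and FPR = FP/(FP+TN), integrate by trapezoid."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    P = labels.sum()                       # total positives
    N = len(labels) - P                    # total negatives
    tpr = np.concatenate([[0.0], np.cumsum(labels) / P])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / N])
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
    return fpr, tpr, auc
```

A classifier that ranks all positives above all negatives yields AUC = 1; one that ranks them all below yields AUC = 0.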
本发明将提出的声源模型参数与传统的参数进行比较,通过在压力检测的平均识别率上相比较,说明基于语音生成建模的方法在压力检测方法有着明显的优势,从而达到区分正常状态和异常状态的目的。三个传统的语音参数包括,基频、梅尔频率倒谱系数、抛物线频谱参数(F0、MFCC、PSP)作为实验对照组。The present invention compares the parameters of the proposed sound source model with the traditional parameters, and compares the average recognition rate of pressure detection, indicating that the method based on voice generation modeling has obvious advantages in the pressure detection method, so as to distinguish the normal state and exception status purposes. Three traditional speech parameters including fundamental frequency, Mel-frequency cepstral coefficient, and parabolic spectral parameters (F0, MFCC, PSP) were used as the experimental control group.
在分类阶段,NAQ和CPR被作为二维向量建立SVM模型,选取125组样本作为训练样本,125组样本作为测试组,实验从数据库中选用了7个不同说话人(4男3女)的语音样本是,旨在消除由于个体特异性而引起的语音参数的变异,同时把F0、MFCC、PSP以一维样本的形式进行了训练,作为实验对照组。如图6所示,经过50轮实验后,计算出参数在SV分类模型的平均识别率。可以看出,NAQ与CPR声源特征与传统参数相比,体现出了在异常状态下良好异常语音的识别性能。In the classification stage, NAQ and CPR are used as two-dimensional vectors to establish the SVM model, 125 sets of samples are selected as training samples, and 125 sets of samples are used as test groups. The experiment selects the voices of 7 different speakers (4 males and 3 females) from the database. The sample is designed to eliminate the variation of speech parameters due to individual specificity. At the same time, F0, MFCC, and PSP are trained in the form of one-dimensional samples as the experimental control group. As shown in Figure 6, after 50 rounds of experiments, the average recognition rate of the parameters in the SV classification model is calculated. It can be seen that, compared with traditional parameters, NAQ and CPR sound source characteristics reflect good abnormal speech recognition performance in abnormal conditions.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make further improvements and modifications without departing from the technical principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810126670.6A CN108074585A (en) | 2018-02-08 | 2018-02-08 | A kind of voice method for detecting abnormality based on sound source characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108074585A true CN108074585A (en) | 2018-05-25 |
Family
ID=62155229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810126670.6A Pending CN108074585A (en) | 2018-02-08 | 2018-02-08 | A kind of voice method for detecting abnormality based on sound source characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108074585A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011149558A2 (en) * | 2010-05-28 | 2011-12-01 | Abelow Daniel H | Reality alternate |
CN102324229A (en) * | 2011-09-08 | 2012-01-18 | 中国科学院自动化研究所 | Method and system for detecting abnormal use of voice input equipment |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
US9338547B2 (en) * | 2012-06-26 | 2016-05-10 | Parrot | Method for denoising an acoustic signal for a multi-microphone audio device operating in a noisy environment |
Non-Patent Citations (1)
Title |
---|
Li Ning, "Research on Pathological Noise Classification Based on Acoustic Parameters and Support Vector Machines", East China Normal University * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113824843A (en) * | 2020-06-19 | 2021-12-21 | 大众问问(北京)信息科技有限公司 | Voice call quality detection method, device, equipment and storage medium |
CN113824843B (en) * | 2020-06-19 | 2023-11-21 | 大众问问(北京)信息科技有限公司 | Voice call quality detection method, device, equipment and storage medium |
CN111951824A (en) * | 2020-08-14 | 2020-11-17 | 苏州国岭技研智能科技有限公司 | A detection method for discriminating depression based on sound |
CN112735386A (en) * | 2021-01-18 | 2021-04-30 | 苏州大学 | Voice recognition method based on glottal wave information |
CN112735386B (en) * | 2021-01-18 | 2023-03-24 | 苏州大学 | Voice recognition method based on glottal wave information |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180525 |