CN110956981B

CN110956981B - Speech emotion recognition method, device, device and storage medium

Info

Publication number: CN110956981B
Application number: CN201911246544.5A
Authority: CN
Inventors: 孙亚新; 叶青
Original assignee: Hubei University of Arts and Science
Current assignee: Hubei University of Arts and Science
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2022-04-26
Anticipated expiration: 2039-12-06
Also published as: CN110956981A

Abstract

The invention belongs to the technical field of voice signal processing and mode recognition, and discloses a voice emotion recognition method, a device, equipment and a storage medium. The method comprises the following steps: acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples; extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed; performing feature statistics on the feature data of the voice signal to be processed through a preset statistical function to obtain a feature statistical result to be confirmed; obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed; and inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result. Through the method, the speech emotion fragments form speech emotion data, and the speech emotion data are input into the preset Softmax classification model, so that speech emotion can be better recognized.

Description

Speech emotion recognition method, device, device and storage medium

技术领域technical field

本发明涉及语音信号处理和模式识别技术领域，尤其涉及一种语音情感识别方法、装置、设备及存储介质。The present invention relates to the technical field of speech signal processing and pattern recognition, and in particular, to a speech emotion recognition method, device, device and storage medium.

背景技术Background technique

目前有较多种语音情感识别方法，但是这些方法没有注意到人类的语音情感表达具有短时性和局部性。比如语音情感识别中，前半句、一个词愤怒就可认为整句话愤怒。会出现以下几个问题：一、使用整句话识别情感，经常会稀释情感的特征变化。比如，“我们明天去北京，你觉得可行吗？”，这句话往往后半句才体现较大的情感差别。导致在深度学习中使用针对时间的均值池化、卷积和针对所有特征的全连接层会稀释情感的特征变化；二、局部组合成句子时，经常会中和情感的特征变化。众所周知，汉语语调有一至四声，其中二声和四声在时间变化上的特点完全相反。导致在深度学习中使用针对时间的均值池化，针对时间序列的注意层等均会中和情感的特征变化；三、组成情感的字词在语句中的位置不固定，会造成同情感的特征差异很大。比如，“这样可行吗？”和“可行吗？这样！”表达了相同意思，但是现有卷积神经网络，输出的特征却完全不同。At present, there are many speech emotion recognition methods, but these methods do not pay attention to the short-term and locality of human speech emotion expression. For example, in speech emotion recognition, the first half of a sentence or a word of anger can be regarded as anger in the whole sentence. The following problems will arise: First, the use of whole sentences to identify emotions often dilutes the characteristic changes of emotions. For example, "We will go to Beijing tomorrow, do you think it's feasible?" The second half of this sentence often reflects a greater emotional difference. As a result, the use of temporal mean pooling, convolution, and fully-connected layers for all features in deep learning will dilute the feature changes of emotions; second, when locally combined into sentences, the feature changes of emotions are often neutralized. As we all know, Chinese intonation has one to four tones, and the second and fourth tones have completely opposite characteristics in time change. This leads to the use of time-based mean pooling in deep learning, and the attention layer for time series will neutralize the feature changes of emotions; 3. The position of the words that make up the emotion is not fixed in the sentence, which will cause the characteristics of sympathy big difference. For example, "Is this possible?" and "Is this possible? This!" express the same meaning, but the output features of existing convolutional neural networks are completely different.

上述内容仅用于辅助理解本发明的技术方案，并不代表承认上述内容是现有技术。The above content is only used to assist the understanding of the technical solutions of the present invention, and does not mean that the above content is the prior art.

发明内容SUMMARY OF THE INVENTION

本发明的主要目的在于提供一种语音情感识别方法、装置、设备及存储介质，旨在解决如何准确语音情感的技术问题。The main purpose of the present invention is to provide a speech emotion recognition method, device, equipment and storage medium, aiming at solving the technical problem of how to accurately speech emotion.

为实现上述目的，本发明提供了一种语音情感识别方法，所述方法包括以下步骤:To achieve the above object, the invention provides a kind of speech emotion recognition method, and described method may further comprise the steps:

获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本；Acquire a test voice sample of a preset dimension, and perform segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples;

对所述初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据；Extracting the signal feature data of the initial voice sample to obtain the feature data of the voice signal to be processed;

通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果；Perform feature statistics on the feature data of the speech signal to be processed by using a preset statistical function to obtain a feature statistics result to be confirmed;

根据所述待确认特征统计结果，通过预设多目标优化算法获得特征目标数据；According to the statistical result of the feature to be confirmed, the feature target data is obtained through a preset multi-objective optimization algorithm;

将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。Inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result.

优选地，所述获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本的步骤之前，还包括：Preferably, before the step of obtaining a test voice sample of a preset dimension, and performing segmentation processing on the test voice sample according to a preset rule, and obtaining a plurality of initial voice samples, the method further includes:

获取预设维度的训练语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始训练语音样本；Acquiring training voice samples of preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples;

对所述初始训练语音样本进行特征提取，获得待处理训练语音信号特征；Perform feature extraction on the initial training voice sample to obtain the features of the training voice signal to be processed;

通过预设统计函数对所述待处理训练语音信号特征进行特征统计，获取待确认训练特征统计结果；Perform feature statistics on the features of the to-be-processed training speech signal by using a preset statistical function, and obtain a statistical result of the to-be-confirmed training features;

根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据；According to the statistical result of the training feature to be confirmed, the target training feature data is obtained through a preset multi-objective optimization algorithm;

根据所述目标训练特征数据获取所述目标训练特征数据对应的情感类别；Obtain the emotion category corresponding to the target training feature data according to the target training feature data;

根据所述情感类别和所述情感类别对应的目标训练特征数据建立预设Softmax分类模型。A preset Softmax classification model is established according to the emotion category and target training feature data corresponding to the emotion category.

优选地，所述根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据的步骤，包括：Preferably, the step of obtaining target training feature data through a preset multi-objective optimization algorithm according to the statistical results of the training features to be confirmed includes:

对所述待确认训练特征统计结果进行情感类别划分，获得不同情感类别对应的待优化训练特征数据；Perform emotional classification on the statistical results of the training features to be confirmed, and obtain training feature data to be optimized corresponding to different emotional categories;

根据所述待优化训练特征数据，通过预设多目标优化算法获得目标训练特征数据。According to the training feature data to be optimized, target training feature data is obtained through a preset multi-objective optimization algorithm.

优选地，所述将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果的步骤，包括：Preferably, the step of inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result includes:

将所述特征目标数据输入至所述预设Softmax分类模型中，获得语音情感类别数据；Inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data;

对所述语音情感类别数据进行数据统计，获得语音情感类别数据值；performing data statistics on the voice emotion category data to obtain a voice emotion category data value;

根据所述语音情感类别数据值获得语音情感识别结果。A speech emotion recognition result is obtained according to the speech emotion category data value.

优选地，所述根据所述语音情感类别数据值获得语音情感识别结果的步骤，包括：Preferably, the step of obtaining a speech emotion recognition result according to the speech emotion category data value includes:

判断所述语音情感类别数据值是否属于预设语音情感类别阈值范围；Judging whether the voice emotion category data value belongs to the preset voice emotion category threshold range;

若所述语音情感类别数据值属于所述预设语音情感类别阈值范围，则根据所述语音情感类别数据值获得语音情感识别结果。If the voice emotion category data value belongs to the preset voice emotion category threshold range, a voice emotion recognition result is obtained according to the voice emotion category data value.

优选地，所述判断所述语音情感类别数据值是否属于预设语音情感类别阈值范围的步骤之后，还包括：Preferably, after the step of judging whether the voice emotion category data value belongs to the preset voice emotion category threshold range, the step further includes:

若所述语音情感类别数据值不属于所述预设语音情感类别阈值范围，则返回所述将所述特征目标数据输入至所述预设Softmax分类模型中，获得语音情感类别数据的步骤。If the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.

优选地，所述通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果的步骤，包括：Preferably, the step of performing feature statistics on the feature data of the to-be-processed voice signal by using a preset statistical function to obtain the feature statistics result to be confirmed includes:

对所述待处理语音信号特征数据进行筛选，获得标签样本特征数据；Screening the feature data of the to-be-processed voice signal to obtain label sample feature data;

通过预设统计函数对所述标签样本特征数据进行特征统计，获得待确认特征统计结果。Feature statistics are performed on the label sample feature data through a preset statistical function to obtain a feature statistics result to be confirmed.

此外，为实现上述目的，本发明还提出一种语音情感识别装置，所述装置包括：获取模块，用于获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本；In addition, in order to achieve the above object, the present invention also provides a voice emotion recognition device, the device includes: an acquisition module for acquiring test voice samples of preset dimensions, and classifying the test voice samples according to preset rules Segment processing to obtain multiple initial speech samples;

提取模块，用于对所述初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据；an extraction module, used for extracting the signal feature data of the initial voice sample to obtain the feature data of the voice signal to be processed;

统计模块，用于通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果；a statistical module, configured to perform feature statistics on the feature data of the to-be-processed voice signal through a preset statistical function, and obtain a feature-to-be-confirmed statistical result;

计算模块，用于根据所述待确认特征统计结果，通过预设多目标优化算法获得特征目标数据；a calculation module, configured to obtain characteristic target data through a preset multi-objective optimization algorithm according to the statistical result of the to-be-confirmed characteristic;

确定模块，用于将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。A determination module, configured to input the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result.

此外，为实现上述目的，本发明还提出一种电子设备，所述设备包括：存储器、处理器及存储在所述存储器上并可在所述处理器上运行的语音情感识别程序，所述语音情感识别程序配置为实现如上文中任一项所述的语音情感识别方法的步骤。In addition, in order to achieve the above object, the present invention also proposes an electronic device, the device includes: a memory, a processor, and a speech emotion recognition program stored in the memory and running on the processor, the speech The emotion recognition program is configured to implement the steps of the speech emotion recognition method as described in any of the above.

此外，为实现上述目的，本发明还提出一种存储介质，所述存储介质上存储有语音情感识别程序，所述语音情感识别程序被处理器执行时实现如上文中任一项所述的语音情感识别方法的步骤。In addition, in order to achieve the above object, the present invention also provides a storage medium, on which a speech emotion recognition program is stored, and when the speech emotion recognition program is executed by a processor, the speech emotion as described in any one of the above is realized. Identify the steps of the method.

本发明通过先获取预设维度的测试语音样本，并通过预设规则对测试语音样本进行分段处理，获得多个初始语音样本，然后对初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据，并对所述待处理语音信号特征数据进行筛选，获得标签样本特征数据，通过预设统计函数对所述标签样本特征数据进行特征统计，获得待确认特征统计结果，之后根据待确认特征统计结果，通过预设多目标优化算法获得特征目标数据，最后将特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。通过上述方法，能够充分利用语音情感片段，以及语句与片段之间的情感关系，并转化为语音情感数据，从而提升语音情感识别效果。The present invention obtains the test voice samples with preset dimensions firstly, and performs segmentation processing on the test voice samples according to the preset rules to obtain a plurality of initial voice samples, and then extracts the signal feature data of the initial voice samples to obtain the to-be-processed voice signal. feature data, and filter the feature data of the to-be-processed voice signal to obtain the feature data of the label sample, perform feature statistics on the feature data of the label sample through a preset statistical function, obtain the feature statistics result to be confirmed, and then according to the feature to be confirmed Statistical results are obtained through a preset multi-objective optimization algorithm to obtain characteristic target data, and finally the characteristic target data is input into a preset Softmax classification model to obtain a speech emotion recognition result. Through the above method, speech emotion segments and the emotional relationship between sentences and segments can be fully utilized, and converted into speech emotion data, thereby improving the speech emotion recognition effect.

附图说明Description of drawings

图1是本发明实施例方案涉及的硬件运行环境的电子设备的结构示意图；1 is a schematic structural diagram of an electronic device of a hardware operating environment involved in an embodiment of the present invention;

图2为本发明语音情感识别方法第一实施例的流程示意图；2 is a schematic flowchart of a first embodiment of a speech emotion recognition method according to the present invention;

图3为本发明语音情感识别方法第二实施例的流程示意图；3 is a schematic flowchart of a second embodiment of a speech emotion recognition method according to the present invention;

图4为本发明语音情感识别装置第一实施例的结构框图。FIG. 4 is a structural block diagram of a first embodiment of a speech emotion recognition apparatus according to the present invention.

本发明目的的实现、功能特点及优点将结合实施例，参照附图做进一步说明。The realization, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

具体实施方式Detailed ways

应当理解，此处所描述的具体实施例仅用以解释本发明，并不用于限定本发明。It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

参照图1，图1为本发明实施例方案涉及的硬件运行环境的电子设备结构示意图。Referring to FIG. 1 , FIG. 1 is a schematic structural diagram of an electronic device of a hardware operating environment involved in an embodiment of the present invention.

如图1所示，该电子设备可以包括：处理器1001，例如中央处理器(CentralProcessing Unit，CPU)，通信总线1002、用户接口1003，网络接口1004，存储器1005。其中，通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard)，可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如无线保真(WIreless-FIdelity，WI-FI)接口)。存储器1005可以是高速的随机存取存储器(RandomAccess Memory，RAM)存储器，也可以是稳定的非易失性存储器(Non-Volatile Memory，NVM)，例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。As shown in FIG. 1 , the electronic device may include: a processor 1001 , such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002 , a user interface 1003 , a network interface 1004 , and a memory 1005 . Among them, the communication bus 1002 is used to realize the connection and communication between these components. The user interface 1003 may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (such as a wireless fidelity (WIreless-FIdelity, WI-FI) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) memory, or may be a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001 .

本领域技术人员可以理解，图1中示出的结构并不构成对电子设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the electronic device, and may include more or less components than the one shown, or combine some components, or arrange different components.

如图1所示，作为一种存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及语音情感识别程序。As shown in FIG. 1 , the memory 1005 as a storage medium may include an operating system, a network communication module, a user interface module and a speech emotion recognition program.

在图1所示的电子设备中，网络接口1004主要用于与网络服务器进行数据通信；用户接口1003主要用于与用户进行数据交互；本发明电子设备中的处理器1001、存储器1005可以设置在电子设备中，所述电子设备通过处理器1001调用存储器1005中存储的语音情感识别程序，并执行本发明实施例提供的语音情感识别方法。In the electronic device shown in FIG. 1, the network interface 1004 is mainly used for data communication with the network server; the user interface 1003 is mainly used for data interaction with the user; the processor 1001 and the memory 1005 in the electronic device of the present invention can be set in In the electronic device, the electronic device invokes the speech emotion recognition program stored in the memory 1005 through the processor 1001, and executes the speech emotion recognition method provided by the embodiment of the present invention.

本发明实施例提供了一种语音情感识别方法，参照图2，图2为本发明一种语音情感识别方法第一实施例的流程示意图。An embodiment of the present invention provides a speech emotion recognition method. Referring to FIG. 2 , FIG. 2 is a schematic flowchart of a first embodiment of a speech emotion recognition method of the present invention.

本实施例中，所述语音情感识别方法包括以下步骤：In this embodiment, the speech emotion recognition method includes the following steps:

步骤S10：获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本。Step S10: Acquire a test voice sample of a preset dimension, and perform segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples.

需要说明的是，在获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本的步骤之前，需要获取预设维度的训练语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始训练语音样本，对所述初始训练语音样本进行特征提取，获得待处理训练语音信号特征，通过预设统计函数对所述待处理训练语音信号特征进行特征统计，获取待确认训练特征统计结果，根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据，根据所述目标训练特征数据获取所述目标训练特征数据对应的情感类别，根据所述情感类别和所述情感类别对应的目标训练特征数据建立预设Softmax分类模型。It should be noted that, before the steps of obtaining a test voice sample of a preset dimension, and performing segmentation processing on the test voice sample according to a preset rule, and obtaining a plurality of initial voice samples, it is necessary to obtain a training voice sample of a preset dimension. , and perform segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples, perform feature extraction on the initial training voice samples, and obtain the characteristics of the training voice signal to be processed. Perform feature statistics on the characteristics of the training voice signal to be processed, obtain the statistical results of the training characteristics to be confirmed, obtain target training characteristic data through a preset multi-objective optimization algorithm according to the statistical results of the training characteristics to be confirmed, and obtain the target training characteristic data according to the target training characteristic data Obtain the emotion category corresponding to the target training feature data, and establish a preset Softmax classification model according to the emotion category and the target training feature data corresponding to the emotion category.

此外，应理解的是，上述所说的预设规则为用户自定义的样本划分规则，也就是说，假如获取的预设维度的测试语音样本对应的时长为5s，将预设规则设定为0.2s，则按照预设规则划分后得到25段0.2s的初始语音样本。In addition, it should be understood that the above-mentioned preset rules are user-defined sample division rules, that is to say, if the acquired test voice samples of the preset dimension have a corresponding duration of 5s, the preset rules are set as 0.2s, then 25 segments of initial speech samples of 0.2s are obtained after division according to preset rules.

此外，需要说明的是，上述所说的预设维度可以是时间维度，也可以是非时间维度等等，本实施例并不加以限制。In addition, it should be noted that the preset dimension mentioned above may be a time dimension or a non-time dimension, etc., which is not limited in this embodiment.

步骤S20：对所述初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据。Step S20 : extracting the signal feature data of the initial speech sample to obtain the feature data of the speech signal to be processed.

此外，应理解的是，对所述初始语音样本进行信号特征数据提取Mel频率倒谱系数(Mel Frequency Cepstrum Coefficient，MFCC)、对数频率功率系数(Log FrequencyPower Coefficients，LFPC)、线性预测倒谱系数(Linear Predictive Cepstral Coding，LPCC)、过零峰值幅度(Zero Crossing with Peak Amplitude，ZCPA)、感知线性预测(Perceptual Linear Predictive，PLP)、拉斯塔滤波器感知线性预测(Rasta PerceptualLinear Predictiv，R-PLP)。In addition, it should be understood that, performing signal feature data extraction on the initial speech sample to extract Mel Frequency Cepstrum Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), and Linear Prediction Cepstrum Coefficients (Linear Predictive Cepstral Coding, LPCC), Zero Crossing with Peak Amplitude (ZCPA), Perceptual Linear Predictive (Perceptual Linear Predictive, PLP), Rasta Filter Perceptual Linear Predictiv (Rasta Perceptual Linear Predictiv, R-PLP) ).

应理解的是，上述所说的每类特征的特征提取结果均为二维矩阵，其中一个维度为时间维度，然后计算每类特征F_i在时间维度上的一阶导数ΔF_i、二阶导数ΔΔF_i，并将原始特征、一阶导数结果、二阶导数结果在非时间维度上串接，形成每一类特征的最终特征提取结果；将上述所有类的特征的最终特征提取结果在非时间维度上串接即为该样本的特征提取结果。It should be understood that the feature extraction result of each type of feature mentioned above is a two-dimensional matrix, one of which is the time dimension, and then the first-order derivative ΔF _i and the second-order derivative of each type of feature F _i on the time dimension are calculated. ΔΔF _i , and concatenate the original features, first-order derivative results, and second-order derivative results in the non-time dimension to form the final feature extraction result of each type of feature; Concatenation in dimension is the feature extraction result of the sample.

此外，为了便于理解，以下进行举例说明：In addition, for ease of understanding, the following examples are provided:

假设，MFCC对应的F_MFCC∈R^39×z，ΔF_MFCC∈R^39×z，ΔΔF_i∈R^39×z，其中z为帧数，即时间维度数，在非时间维度上的串接结果

Suppose, the corresponding F _MFCC ∈ R ^39×z , ΔF _MFCC ∈ R ^39×z , ΔΔF _i ∈ R ^39×z , where z is the number of frames, that is, the number of time dimensions, the concatenation result in the non-time dimension

在MFCC和LPCC连接时，假如

串接后为

When MFCC and LPCC are connected, if

concatenated as

此外，应理解的是，在进行每一次语音信号特征提取时，提取MFCC，LFPC，LPCC，ZCPA，PLP，R-PLP特征，其中MFCC、LFPC的Mel滤波器个数为40；LPCC、PLP、R-PLP的线性预测阶数分别为12、16、16；ZCPA的频率分段为：0，106，223，352，495，655，829，1022，1236，1473，1734，2024，2344，2689，3089，3522，4000。从而每条语句的每类特征的维度分别为：ti*39，ti*40，ti*12，ti*16，ti*16，ti*16，其中ti为第i条语句的帧数，乘号后面的数字为每帧特征的维度。为了获得语音信号在时间维度上的变化，还对上述特征在时间维度上计算一阶导数，二阶导数。最后每类特征的维度分别为：ti*117，ti*140，ti*36，ti*48，ti*48，ti*48。第i样本的提取到的语音信号特征由上述所有特征组合而成，维度为ti*(117+140+36+48+48+48)。In addition, it should be understood that MFCC, LFPC, LPCC, ZCPA, PLP, R-PLP features are extracted during each speech signal feature extraction, wherein the number of Mel filters of MFCC and LFPC is 40; LPCC, PLP, The linear prediction orders of R-PLP are 12, 16, and 16 respectively; the frequency segments of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689 , 3089, 3522, 4000. Therefore, the dimensions of each type of feature of each statement are: ti*39, ti*40, ti*12, ti*16, ti*16, ti*16, where ti is the frame number of the i-th statement, and the multiplication sign The following numbers are the dimensions of each frame feature. In order to obtain the change of the speech signal in the time dimension, the first-order derivative and the second-order derivative are also calculated for the above-mentioned features in the time dimension. Finally, the dimensions of each type of feature are: ti*117, ti*140, ti*36, ti*48, ti*48, ti*48. The extracted speech signal feature of the ith sample is composed of all the above features, and the dimension is ti*(117+140+36+48+48+48).

步骤S30：通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果。Step S30: Perform feature statistics on the feature data of the to-be-processed voice signal by using a preset statistical function to obtain a feature statistics result to be confirmed.

需要说明的是，使用统计函数，利用均值(mean)、标准方差(standarddeviation)、最小值(min)、最大值(max)、峭度(kurtosis)、偏度(skewness)获得上述特征在时间维度上的统计结果。It should be noted that the statistical functions are used to obtain the above features in the time dimension by using the mean, standard deviation, minimum value (min), maximum value (max), kurtosis (kurtosis) and skewness (skewness). Statistical results above.

此外，应理解的是，从上述得到的统计结果中进行筛选，获得标签样本特征数据，并通过预设统计函数对所述标签样本特征数据进行特征统计，获得待确认特征统计结果，并将有标签样本的特征统计结果记为{x₁,x₂,...,x_n}，其中n为有标签标本的个数。In addition, it should be understood that the characteristic data of the label samples are obtained by screening from the statistical results obtained above, and the characteristic statistics of the label samples are carried out through a preset statistical function to obtain the statistical results of the characteristics to be confirmed, and there will be The feature statistics of labeled samples are recorded as {x ₁ , x ₂ ,...,x _n }, where n is the number of labeled samples.

步骤S40：根据所述待确认特征统计结果，通过预设多目标优化算法获得特征目标数据。Step S40: According to the statistical result of the feature to be confirmed, obtain feature target data through a preset multi-target optimization algorithm.

此外，需要说明的是，将上步中的{x₁,x₂,...,x_n}，按语句的标签分成X_A＝[x₁,x₂,…,x_m]，X_B＝[x_m+1,x_m+2,…,x_n]，其中X_A是A类情感的片段，X_B是B类情感的片段，训练基于倾向性认知学习的语句片段情感分类方法步骤如下：In addition, it should be noted that {x ₁ ,x ₂ ,...,x _n } in the previous step is divided into X _A =[x ₁ ,x ₂ ,...,x _m ] according to the label of the statement, X _B =[x _m+1 ,x _m+2 ,...,x _n ], where X _A is a segment of type A emotion, and X _B is a segment of type B emotion, training a sentence segment emotion classification method based on propensity cognitive learning Proceed as follows:

(1)对x∈X_A，将X_A中以x为中心的Parzen窗内的样本与中心样本x的角度划分成多个箱子，然后使用下式计算x在X_A中周围数据的分布特征。(1) For x∈X _A , divide the angle between the sample in the Parzen window centered on x in X _A and the center sample x into multiple bins, and then use the following formula to calculate the distribution characteristics of the data around x in X _A .

b_x＝[b₁,b₂,…,b_k]b _x =[b ₁ ,b ₂ ,...,b _k ]

式中b_j表示第j个箱子，1(x_i∈X_j)在x_i属于X_j时的值为1否则为0，X_j是X_A的子集，X_j内的样本与x之间的角度分布在第j个箱子中。where b _j represents the jth box, 1(x _i ∈ X _j ) is 1 when x _i belongs to X _j , otherwise it is 0, X _j is a subset of X _A , and the difference between the samples in X _j and x is The angles between are distributed in the jth bin.

(2)对x∈X_A，将X_B中以x为中心的Parzen窗内的样本与中心样本x的角度划分成多个箱子，然后使用下式计算x在X_A中周围数据的分布特征。(2) For x∈X _A , divide the angle between the sample in the Parzen window centered on x in X _B and the center sample x into multiple bins, and then use the following formula to calculate the distribution characteristics of the data around x in X _A .

式中

表示第

个箱子，1(x_i∈X_j)在x_i属于X_j时的值为1否则为0，X_j是X_B的子集，X_j内的样本与x之间的角度分布在第j个箱子中。in the formula

means the first

box, 1(x _i ∈ X _j ) is 1 when _xi belongs to X _j , otherwise 0, X _j is a subset of X _B , and the angle between the samples in X _j and x is distributed in the jth in a box.

(3)使用下式计算两数据集在x点附近的数据分布差异：(3) Use the following formula to calculate the difference in the data distribution of the two data sets near the x point:

式中

表示两向量之间的距离，此处使用欧氏距离。in the formula

Represents the distance between two vectors, where Euclidean distance is used.

(4)根据上一步骤的计算结果可以得到倾向于A情感的片段集合

倾向于B情感的片段集合

以及倾向于中性情感的片段集合

其中

为d_x>T的x组成的集合。

为d_x<-T的x组成的集合。

为T>d_x>-T组成的集合。T是自主设置的阈值。对每个集合，再使用谱聚类的方法聚成多个区域，得到每个片段xi的区域标签

(4) According to the calculation result of the previous step, the set of fragments tending to A emotion can be obtained

A collection of snippets leaning towards B sentiment

and a collection of clips that tend towards neutral emotions

in

is the set of x where d _x > T.

is the set of x where d _x <-T.

is a set consisting of T>d _x >-T. T is an autonomously set threshold. For each set, the spectral clustering method is used to cluster into multiple regions, and the region label of each segment xi is obtained.

(5)定义

L＝[L_A,L_B,L_C]，其中L_A∈R^p、L_B∈R^q、L_C∈R^u，p、q和u分别为

和

样本的个数，L_A、L_B和L_C中的元素值分别为1、2、3。使用下述目标方程，学习片段的特征子空间：(5) Definition

L=[L _A , L _B , L _C ], where L _A ∈ R ^p , L _B ∈ R ^q , L _C ∈ R ^u , p, q and u are respectively

and

The number of samples, the element values in L _A , L _B and L _C are 1, 2, and 3, respectively. The feature subspace of the segment is learned using the following objective equation:

J＝J₁(o_i,o_j)+β*J₂(o_i,o_j)J=J ₁ (o _i ,o _j )+β*J ₂ (o _i ,o _j )

β是平衡参数。其中J₁(o_i,o_j)可以实现

和

三个类之间的类内距离较小，类间距离较大，定义如下：β is the equilibrium parameter. where J ₁ (o _i ,o _j ) can realize

and

The intra-class distance between the three classes is small, and the inter-class distance is large, which is defined as follows:

式中o_i和o_j为

和

映射到子空间之后的结果。l_i和l_j对应o_i和o_j在L中的值。m是一个阈值，调整类间距离的范围，本发明中取1。G_ij为xi和x_j之间的高斯距离。计算公式如下：where o _i and o _j are

and

The result after mapping to the subspace. l _i and l _j correspond to the values of o _i and o _j in L. m is a threshold, and the range of the distance between classes is adjusted, which is taken as 1 in the present invention. G _ij is the Gaussian distance between xi and x _j . Calculated as follows:

J2(o_i,o_j)可以尽量保持每个区域内的相对关系不变，以及属于同一类的区域相对靠近，但是并不重叠。定义如下：J2(o _i ,o _j ) can try to keep the relative relationship in each region unchanged, and the regions belonging to the same class are relatively close, but do not overlap. Defined as follows:

式中

和

是xi和xj的区域标签。G_li是l_i类所有G_ij中的最大值。实现当两个片段属于同一区域时，保持他们之间的关系，当两者不属于同一区域但是属于同一类别时，以一个小的权重最小化他们之间的距离，可使两个区域尽量不重叠。in the formula

and

are the region labels for xi and xj. G _li is the maximum value _{among all G ij} _of class li. Realize that when two fragments belong to the same area, keep the relationship between them, and when the two do not belong to the same area but belong to the same category, minimize the distance between them with a small weight, so that the two areas can be as different as possible. overlapping.

为了优化目标方程J，我们定义o_i＝φ(W_qφ(…W₃φ(W₂φ(W₁x_i+b₁)+b₂)+b₃)+b₄)式中φ(·)为sigmoid函数，W₁,W₂,…,W_q为映射矩阵，b₁,b₂,…,b_q为偏移量。通过求

和

可得到W₁,W₂,…,W_q和b₁,b₂,…,b_q的值，

是求J对W的导数，

是求J对b的导数。To optimize the objective equation J, we define o _i =φ(W _q φ(…W ₃ φ(W ₂ φ(W ₁ x _i +b ₁ )+b ₂ )+b ₃ )+b ₄ ) where φ( ·) is the sigmoid function, W ₁ , W ₂ ,...,W _q are the mapping matrices, and b ₁ ,b ₂ ,...,b _q are the offsets. by asking

and

The values of W ₁ , W ₂ ,...,W _q and b ₁ ,b ₂ ,...,b _q can be obtained,

is the derivative of J with respect to W,

is the derivative of J with respect to b.

步骤S50：将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。Step S50: Input the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result.

此外，应理解的是，根据上述步骤获得W₁,W₂,…,W_q和b₁,b₂,…,b_q，计算{x₁,x₂,...,x_m}的特征选择结果z。In addition, it should be understood that, according to the above steps to obtain W ₁ , W ₂ ,...,W _q and b ₁ ,b ₂ ,...,b _q , calculate the features of {x ₁ ,x ₂ ,...,x _m } Select result z.

此外，需要说明的是，上述所说的W₁,W₂,…,W_q和b₁,b₂,…,b_q为本申请中的特征目标数据。In addition, it should be noted that the above-mentioned W ₁ , W ₂ ,...,W _q and b ₁ ,b ₂ ,...,b _q are characteristic target data in this application.

此外，应理解的是，使用训练过程中获得的预设Softmax分类器，分别获得{x₁,x₂,...,x_m}的语音情感类别{l₁,l₂,...,l_m}。然后根据{l₁,l₂,...,l_m}投票获得该语句的情感。Furthermore, it should be understood that the speech emotion categories { _{l 1} _, _{l 2} _, ..., x _m } are respectively obtained using the preset Softmax classifier obtained during the training process. l _m }. Then vote for the sentiment of the statement according to {l ₁ ,l ₂ ,...,l _m }.

此外，需要说明的是，将所述特征目标数据输入至所述预设Softmax分类模型中，获得语音情感类别数据，对所述语音情感类别数据进行数据统计，获得语音情感类别数据值，根据所述语音情感类别数据值获得语音情感识别结果。In addition, it should be noted that the feature target data is input into the preset Softmax classification model to obtain voice emotion category data, and data statistics are performed on the voice emotion category data to obtain the voice emotion category data value. The speech emotion category data value is used to obtain speech emotion recognition results.

此外，上述所说的根据所述语音情感类别数据值获得语音情感识别结果的步骤为判断所述语音情感类别数据值是否属于预设语音情感类别阈值范围，若所述语音情感类别数据值属于所述预设语音情感类别阈值范围，则根据所述语音情感类别数据值获得语音情感识别结果；若所述语音情感类别数据值不属于所述预设语音情感类别阈值范围，则返回所述将所述特征目标数据输入至所述预设Softmax分类模型中，获得语音情感类别数据的步骤。In addition, the above-mentioned step of obtaining the voice emotion recognition result according to the voice emotion category data value is to judge whether the voice emotion category data value belongs to the preset voice emotion category threshold range, and if the voice emotion category data value belongs to all If the preset voice emotion category threshold range is selected, the voice emotion recognition result is obtained according to the voice emotion category data value; if the voice emotion category data value does not belong to the preset voice emotion category threshold value range, return the The feature target data is input into the preset Softmax classification model to obtain the voice emotion category data.

此外，还需要说明的是，本发明的情感识别的效果评价采用的语料库是语音情感识别领域的标准数据库。首先完成训练过程，然后进行识别测试。测试模式按5倍交叉方式进行。可以识别愤怒、恐惧、烦躁、厌恶、开心、中性、悲伤7种情感，在说话人依赖的情况下平均分类正确率为94.65％，除了开心与愤怒比较容易混淆以外，其它情绪之间区分度较好。在说话人独立的情况下平均分类正确率为89.30％。In addition, it should also be noted that the corpus used in the effect evaluation of emotion recognition of the present invention is a standard database in the field of speech emotion recognition. The training process is completed first, followed by the recognition test. The test pattern is performed in a 5-fold crossover. It can identify 7 emotions of anger, fear, irritability, disgust, happiness, neutrality, and sadness. The average classification accuracy rate is 94.65% in the case of speaker dependence. Except for happiness and anger, which are more easily confused, the degree of discrimination between other emotions better. The average classification accuracy is 89.30% in the case of speaker independence.

本实施例通过先获取预设维度的测试语音样本，并通过预设规则对测试语音样本进行分段处理，获得多个初始语音样本，然后对初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据，并对所述待处理语音信号特征数据进行筛选，获得标签样本特征数据，通过预设统计函数对所述标签样本特征数据进行特征统计，获得待确认特征统计结果，之后对所述待确认训练特征统计结果进行情感类别划分，获得不同情感类别对应的待优化训练特征数据，根据所述待优化训练特征数据，通过预设多目标优化算法获得目标训练特征数据，最后将特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。通过上述方法，能够充分利用语音情感片段，以及语句与片段之间的情感关系形成一种倾向性的数据，从而可以模拟人类处理倾向性的过程，利用数据的不平衡信息，相互比较，互为约束条件，将不同情感的片段分离开，从而增加样本规模和提高样本多样性。In this embodiment, a test voice sample of a preset dimension is obtained first, and a plurality of initial voice samples are obtained by segmenting the test voice sample according to a preset rule, and then signal feature data is extracted for the initial voice sample to obtain the voice to be processed. signal feature data, and filter the feature data of the to-be-processed voice signal to obtain label sample feature data, perform feature statistics on the label sample feature data through a preset statistical function, obtain feature statistics results to be confirmed, and then The statistical results of the training features to be confirmed are divided into emotion categories, and the training feature data to be optimized corresponding to different emotion categories is obtained. According to the training feature data to be optimized, the target training feature data is obtained through a preset multi-objective optimization algorithm, and finally the feature target data Input into the preset Softmax classification model to obtain speech emotion recognition results. Through the above method, it is possible to make full use of speech emotional fragments and the emotional relationship between sentences and fragments to form a kind of tendency data, so that the process of human processing tendency can be simulated, and the unbalanced information of the data can be used to compare with each other. Constraints to separate segments of different sentiments, thereby increasing sample size and increasing sample diversity.

参考图3，图3为本发明一种语音情感识别方法第二实施例的流程示意图。Referring to FIG. 3 , FIG. 3 is a schematic flowchart of a second embodiment of a speech emotion recognition method according to the present invention.

基于上述第一实施例，本实施例语音情感识别方法在所述步骤S10之前，还包括：Based on the above-mentioned first embodiment, before the step S10, the speech emotion recognition method of this embodiment further includes:

步骤S000：获取预设维度的训练语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始训练语音样本。Step S000: Acquire a training voice sample of a preset dimension, and perform segment processing on the test voice sample according to a preset rule to obtain a plurality of initial training voice samples.

步骤S001：对所述初始训练语音样本进行特征提取，获得待处理训练语音信号特征。Step S001: Perform feature extraction on the initial training voice sample to obtain the features of the training voice signal to be processed.

步骤S002：通过预设统计函数对所述待处理训练语音信号特征进行特征统计，获取待确认训练特征统计结果。Step S002: Perform feature statistics on the features of the to-be-processed training speech signal by using a preset statistical function, and obtain a statistical result of the to-be-confirmed training features.

步骤S003：根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据。Step S003: Obtain target training feature data through a preset multi-objective optimization algorithm according to the statistical result of the training feature to be confirmed.

步骤S004：根据所述目标训练特征数据获取所述目标训练特征数据对应的情感类别。Step S004: Acquire an emotion category corresponding to the target training feature data according to the target training feature data.

步骤S005：根据所述情感类别和所述情感类别对应的目标训练特征数据建立预设Softmax分类模型。Step S005: Establish a preset Softmax classification model according to the emotion category and target training feature data corresponding to the emotion category.

此外，需要说明的是，上述所说的根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据的步骤为，对所述待确认训练特征统计结果进行情感类别划分，获得不同情感类别对应的待优化训练特征数据，根据所述待优化训练特征数据，通过预设多目标优化算法获得目标训练特征数据。In addition, it should be noted that the above-mentioned step of obtaining target training feature data through a preset multi-objective optimization algorithm according to the statistical results of the training features to be confirmed is to classify the statistical results of the training features to be confirmed by emotion category , obtain training feature data to be optimized corresponding to different emotion categories, and obtain target training feature data through a preset multi-objective optimization algorithm according to the training feature data to be optimized.

此外，还需要说明的是，上述所说的步骤为建立预设Softmax分类模型，在这一阶段中，针对所有说话人均分别进行训练，得到每个说话人所对应的分类器，具体过程如下：In addition, it should be noted that the above-mentioned steps are to establish a preset Softmax classification model. In this stage, all speakers are trained separately to obtain the classifier corresponding to each speaker. The specific process is as follows:

步骤(1-1)对每条语句分段；Step (1-1) segment each statement;

步骤(1-2)提取各个分段的特征；Step (1-2) extracts the feature of each segment;

步骤(1-3)对所有特征执行特征统计；Step (1-3) performs feature statistics on all features;

步骤(1-4)训练基于倾向性认知学习的语句片段情感分类方法；Step (1-4) training a sentence fragment sentiment classification method based on tendency cognitive learning;

步骤(1-5)对每个特征子空间训练支持向量机；Step (1-5) trains a support vector machine for each feature subspace;

步骤(1-6)分类结果由所有支持向量机的结果投票获得；Step (1-6) The classification result is obtained by voting of the results of all support vector machines;

此外，需要说明的是，所述步骤(1-1)中，将语音信号以0.2秒为间隔分段。In addition, it should be noted that, in the step (1-1), the voice signal is segmented at intervals of 0.2 seconds.

所述步骤(1-2)中，对每段提取语音信号特征包括：MFCC(Mel Frequ encyCepstrum Coefficient，Mel频率倒谱系数)、LFPC(Log Frequency Pow er Coefficients，对数频率功率系数)、LPCC(Linear Predictive Cepstral Codin g，线性预测倒谱系数)、ZCPA(Zero Crossing with Peak Amplitude，过零峰值幅度)、PLP(Perceptual LinearPredictive，感知线性预测)、R-PLP(Ras ta Perceptual Linear Predictiv，拉斯塔滤波器感知线性预测)，每类特征的特征提取结果均为二维矩阵，其中一个维度为时间维度；然后计算每类特征F_i在时间维度上的一阶导数ΔF_i、二阶导数ΔΔF_i，并将原始特征、一阶导数结果、二阶导数结果在非时间维度上串接，形成每一类特征的最终特征提取结果；将上述所有类的特征的最终特征提取结果在非时间维度上串接即为该样本的特征提取结果。In the step (1-2), the extracted speech signal features for each segment include: MFCC (Mel FrequencyCepstrum Coefficient, Mel frequency cepstrum coefficient), LFPC (Log Frequency Power Coefficients, logarithmic frequency power coefficient), LPCC ( Linear Predictive Cepstral Coding, linear prediction cepstral coefficient), ZCPA (Zero Crossing with Peak Amplitude, zero-crossing peak amplitude), PLP (Perceptual LinearPredictive, perceptual linear prediction), R-PLP (Ras ta Perceptual Linear Predictiv, Rasta Filter-aware linear prediction), the feature extraction result of each type of feature is a two-dimensional matrix, one of which is the time dimension; then the first-order derivative ΔF _i and the second-order derivative ΔΔF _i of each type of feature F _i in the time dimension are calculated. , and concatenate the original features, first-order derivative results, and second-order derivative results in the non-time dimension to form the final feature extraction result of each type of feature; the final feature extraction result of all the above-mentioned classes of features is in the non-time dimension. The concatenation is the feature extraction result of the sample.

所述步骤(1-3)中对特征进行特征统计为：获得特征在时间维度上的均值、标准方差、最小值、最大值、峭度、偏度统计结果，有标签样本的特征统计结果记为{x₁,x₂,...,x_n}，对应的标签记为Y＝[y₁,y₂,...,y_n]∈Rⁿ。The feature statistics of the features in the step (1-3) are: obtaining the statistical results of the mean, standard deviation, minimum value, maximum value, kurtosis, and skewness of the features in the time dimension, and the feature statistics results of the labeled samples are recorded. is {x ₁ , x ₂ ,...,x _n }, and the corresponding label is denoted as Y=[y ₁ , y ₂ ,..., y _n ]∈R ⁿ .

所述步骤(1-4)中，给定数据集X_A＝[x₁,x₂,…,x_m]，X_B＝[x_m+1,x_m+2,…,x_n]，其中X_A是A类情感的片段，X_B是B类情感的片段，训练基于倾向性认知学习的语句片段情感分类方法步骤如下：In the step (1-4), given a data set X _A =[x ₁ ,x ₂ ,...,x _m ], X _B =[x _m+1 ,x _m+2 ,...,x _n ], Among them, X _A is a segment of A-type emotion, and X _B is a segment of B-type emotion. The steps of training a sentence segment emotion classification method based on propensity cognitive learning are as follows:

步骤(1-4-1)对x∈X_A，将X_A中以x为中心的Parzen窗内的样本与中心样本x的角度划分成多个箱子，然后使用下式计算x在X_A中周围数据的分布特征。Step (1-4-1) For x∈X _A , divide the angle between the sample in the Parzen window centered at x in X _A and the center sample x into multiple bins, and then use the following formula to calculate x in X _A Distribution characteristics of surrounding data.

b_x＝[b₁,b₂,…,b_k]b _x =[b ₁ ,b ₂ ,...,b _k ]

步骤(1-4-2)对x∈X_A，将X_B中以x为中心的Parzen窗内的样本与中心样本x的角度划分成多个箱子，然后使用下式计算x在X_A中周围数据的分布特征。Step (1-4-2) For x ∈ X _A , divide the angle between the sample in the Parzen window centered at x in X _B and the center sample x into multiple bins, and then use the following formula to calculate x in X _A Distribution characteristics of surrounding data.

式中

表示第

means the first

步骤(1-4-3)使用下式计算两数据集在x点附近的数据分布差异：Step (1-4-3) uses the following formula to calculate the difference in the data distribution of the two datasets near the x point:

式中

表示两向量之间的距离，可使用多种距离计算方法。in the formula

Represents the distance between two vectors, and various distance calculation methods can be used.

步骤(1-4-4)根据步骤(1-4-3)的计算结果可以得到倾向于A情感的片段集合

倾向于B情感的片段集合

以及倾向于中性情感的片段集合

其中

为d_x>T的x组成的集合。

为d_x<-T的x组成的集合。

Step (1-4-4) According to the calculation result of step (1-4-3), a set of fragments tending to A sentiment can be obtained

A collection of snippets leaning towards B sentiment

and a collection of clips that tend towards neutral emotions

in

is the set of x where d _x > T.

is the set of x where d _x <-T.

步骤(1-4-5)定义

和

样本的个数，L_A、L_B和L_C中的元素值分别为1、2、3。使用下述目标方程，学习片段的特征子空间：Step (1-4-5) Definition

and

J＝J₁(o_i,o_j)+β*J₂(o_i,o_j)J=J ₁ (o _i ,o _j )+β*J ₂ (o _i ,o _j )

β是平衡参数。其中J₁(o_i,o_j)可以实现

和

and

式中o_i和o_j为

和

映射到子空间之后的结果。l_i和l_j对应o_i和o_j在L中的值。m是一个阈值，调整类间距离的范围。G_ij为x_i和x_j之间的高斯距离。计算公式如下：where o _i and o _j are

and

The result after mapping to the subspace. l _i and l _j correspond to the values of o _i and o _j in L. m is a threshold that adjusts the range of inter-class distances. G _ij is the Gaussian distance between x _i and x _j . Calculated as follows:

J₂(o_i,o_j)可以尽量保持每个区域内的相对关系不变，以及属于同一类的区域相对靠近，但是并不重叠。定义如下：J ₂ (o _i ,o _j ) can try to keep the relative relationship in each region unchanged, and the regions belonging to the same class are relatively close, but do not overlap. Defined as follows:

式中

和

and

为了优化目标方程J，定义o_i＝φ(W_qφ(…W₃φ(W₂φ(W₁x_i+b₁)+b₂)+b₃)+b₄)，式中φ(·)为sigmoid函数，W₁,W₂,…,W_q为映射矩阵，b₁,b₂,…,b_q为偏移量。通过求

和

可得到W₁,W₂,…,W_q和b₁,b₂,…,b_q的值，

是求J对W的导数，

是求J对b的导数。In order to optimize the objective equation J, define o _i =φ(W _q φ(…W ₃ φ(W ₂ φ(W ₁ x _i +b ₁ )+b ₂ )+b ₃ )+b ₄ ), where φ( ·) is the sigmoid function, W ₁ , W ₂ ,...,W _q are the mapping matrices, and b ₁ ,b ₂ ,...,b _q are the offsets. by asking

and

is the derivative of J with respect to W,

is the derivative of J with respect to b.

步骤(1-4-6)对步骤(1-4-5)获得的

和

的特征子空间，训练Softmax分类器将情感A，情感B和中性情感C分开。Step (1-4-6) to step (1-4-5) obtained

and

The feature subspace of , trains a Softmax classifier to separate sentiment A, sentiment B and neutral sentiment C.

步骤(1-4-7)依照步骤(1-4-5)和步骤(1-4-6)的操作过程，训练能识别所有情绪对的softmax分类器。Step (1-4-7) According to the operation process of step (1-4-5) and step (1-4-6), train a softmax classifier capable of identifying all emotion pairs.

此外，应理解的是，以下为上述内容总结：In addition, it should be understood that the following is a summary of the above:

第一步：对所有的训练样本语音以0.2秒为间隔分段。Step 1: Segment all training sample speech at intervals of 0.2 seconds.

第二步：对所有的语音片段训练信号提取MFCC，LFPC，LPCC，ZCPA，PLP，R-PLP特征，其中MFCC、LFPC的Mel滤波器个数为40；LPCC、PLP、R-PLP的线性预测阶数分别为12、16、16；ZCPA的频率分段为：0，106，223，352，495，655，829，1022，1236，1473，1734，2024，2344，2689，3089，3522，4000。从而每条语句的每类特征的维度分别为：ti*39，ti*40，ti*12，ti*16，ti*16，ti*16，其中ti为第i条语句的帧数，乘号后面的数字为每帧特征的维度。为了获得语音信号在时间维度上的变化，还对上述特征在时间维度上计算一阶导数，二阶导数。最后每类特征的维度分别为：ti*117，ti*140，ti*36，ti*48，ti*48，ti*48。第i样本的提取到的语音信号特征由上述所有特征组合而成，维度为ti*(117+140+36+48+48+48).Step 2: Extract MFCC, LFPC, LPCC, ZCPA, PLP, R-PLP features from all speech segment training signals, where the number of Mel filters for MFCC and LFPC is 40; linear prediction for LPCC, PLP, and R-PLP The orders are 12, 16, and 16 respectively; the frequency segments of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000 . Therefore, the dimensions of each type of feature of each statement are: ti*39, ti*40, ti*12, ti*16, ti*16, ti*16, where ti is the frame number of the i-th statement, and the multiplication sign The following numbers are the dimensions of each frame feature. In order to obtain the change of the speech signal in the time dimension, the first-order derivative and the second-order derivative are also calculated for the above-mentioned features in the time dimension. Finally, the dimensions of each type of feature are: ti*117, ti*140, ti*36, ti*48, ti*48, ti*48. The extracted speech signal feature of the ith sample is composed of all the above features, and the dimension is ti*(117+140+36+48+48+48).

第三步：使用如下统计函数：均值(mean)、标准方差(standard deviation)、最小值(min)、最大值(max)、峭度(kurtosis)、偏度(skewness)获得上述特征在时间维度上的统计结果。有标签样本的特征统计结果记为{x₁,x₂,...,x_n}，其中n为有标签标本的个数。Step 3: Use the following statistical functions: mean, standard deviation, minimum value (min), maximum value (max), kurtosis (kurtosis), and skewness (skewness) to obtain the above features in the time dimension Statistical results above. The feature statistics of labeled samples are recorded as {x ₁ , x ₂ ,...,x _n }, where n is the number of labeled samples.

第四步：将上步中的{x₁,x₂,...,x_n}，按语句的标签分成X_A＝[x₁,x₂,…,x_m]，X_B＝[x_m+1,x_m+2,…,x_n]，其中X_A是A类情感的片段，X_B是B类情感的片段，训练基于倾向性认知学习的语句片段情感分类方法步骤如下：Step 4: Divide {x ₁ ,x ₂ ,...,x _n } in the previous step into X _A =[x ₁ ,x ₂ ,...,x _m ] according to the label of the statement, X _B =[x _m+1 ,x _m+2 ,…,x _n ], where X _A is a segment of type A emotion, and X _B is a segment of type B emotion. The steps of training a sentence segment emotion classification method based on propensity cognitive learning are as follows:

b_x＝[b₁,b₂,…,b_k]b _x =[b ₁ ,b ₂ ,...,b _k ]

式中

表示第

means the first

式中

表示两向量之间的距离，此处使用欧氏距离。in the formula

Represents the distance between two vectors, where Euclidean distance is used.

(4)根据上述步骤(1-4-3)的计算结果可以得到倾向于A情感的片段集合

倾向于B情感的片段集合

以及倾向于中性情感的片段集合

其中

为d_x>T的x组成的集合。

为d_x<-T的x组成的集合。

(4) According to the calculation result of the above step (1-4-3), the set of fragments tending to A emotion can be obtained

A collection of snippets leaning towards B sentiment

and a collection of clips that tend towards neutral emotions

in

is the set of x where d _x > T.

is the set of x where d _x <-T.

(5)定义

和

and

J＝J₁(o_i,o_j)+β*J₂(o_i,o_j)J=J ₁ (o _i ,o _j )+β*J ₂ (o _i ,o _j )

β是平衡参数。其中J₁(o_i,o_j)可以实现

和

and

式中o_i和o_j为

和

映射到子空间之后的结果。l_i和l_j对应o_i和o_j在L中的值。m是一个阈值，调整类间距离的范围，本发明中取1。G_ij为x_i和x_j之间的高斯距离。计算公式如下：where o _i and o _j are

and

The result after mapping to the subspace. l _i and l _j correspond to the values of o _i and o _j in L. m is a threshold, and the range of the distance between classes is adjusted, which is taken as 1 in the present invention. G _ij is the Gaussian distance between x _i and x _j . Calculated as follows:

式中

和

and

和

可得到W₁,W₂,…,W_q和b₁,b₂,…,b_q的值，

是求J对W的导数，

and

is the derivative of J with respect to W,

is the derivative of J with respect to b.

(6)对上述步骤(1-4-5)获得的

和

的特征子空间，训练Sof tmax分类器将情感A，情感B和中性情感C分开。(6) for the above-mentioned steps (1-4-5) obtained

and

(7)依照上述步骤(1-4-5)和上述步骤(1-4-6)的操作过程，训练能识别所有情绪对的Softmax分类器。(7) According to the operation process of the above step (1-4-5) and the above step (1-4-6), train a Softmax classifier capable of identifying all emotion pairs.

本实施例通过获取预设维度的训练语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始训练语音样本，然后对所述初始训练语音样本进行特征提取，获得待处理训练语音信号特征，并通过预设统计函数对所述待处理训练语音信号特征进行特征统计，获取待确认训练特征统计结果，根据所述待确认训练特征统计结果，通过预设多目标优化算法获得目标训练特征数据，之后根据所述目标训练特征数据获取所述目标训练特征数据对应的情感类别，并根据所述情感类别和所述情感类别对应的目标训练特征数据建立预设Softmax分类模型。通过上述方法，可以针对语句局部片段训练模型，可避免一句话中不同局部片段含有不同情感、或者同一情感不同局部片段相互冲突，从而降低深度学习物理含义与语音情感识别特性之间的差异。In this embodiment, the training voice samples of preset dimensions are obtained, and the test voice samples are segmented according to preset rules to obtain a plurality of initial training voice samples, and then feature extraction is performed on the initial training voice samples to obtain Features of the training voice signal to be processed, and perform feature statistics on the features of the training voice signal to be processed through a preset statistical function, obtain the statistical results of the training features to be confirmed, and perform preset multi-objective optimization according to the statistical results of the training features to be confirmed. The algorithm obtains target training feature data, then obtains the emotion category corresponding to the target training feature data according to the target training feature data, and establishes a preset Softmax classification model according to the emotion category and the target training feature data corresponding to the emotion category . Through the above method, the model can be trained for local segments of sentences, which can avoid different local segments in a sentence containing different emotions, or different local segments of the same emotion conflicting with each other, thereby reducing the difference between deep learning physical meaning and speech emotion recognition characteristics.

此外，本发明实施例还提出一种存储介质，所述存储介质上存储有语音情感识别程序，所述语音情感识别程序被处理器执行时实现如上文所述的语音情感识别方法的步骤。In addition, an embodiment of the present invention further provides a storage medium, where a speech emotion recognition program is stored, and when the speech emotion recognition program is executed by a processor, the steps of the speech emotion recognition method described above are implemented.

参照图4，图4为本发明语音情感识别装置第一实施例的结构框图。Referring to FIG. 4 , FIG. 4 is a structural block diagram of a first embodiment of a voice emotion recognition apparatus according to the present invention.

如图4所示，本发明实施例提出的语音情感识别装置包括：获取模块4001，用于获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本；提取模块4002，用于对所述初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据；统计模块4003，用于通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果；计算模块4004，用于根据所述待确认特征统计结果，通过预设多目标优化算法获得特征目标数据；确定模块4005，用于将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果。As shown in FIG. 4 , the voice emotion recognition device provided by the embodiment of the present invention includes: an acquisition module 4001, configured to acquire a test voice sample of a preset dimension, and perform segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples; the extraction module 4002 is used to extract the signal feature data of the initial voice samples to obtain the feature data of the to-be-processed voice signal; the statistics module 4003 is used to analyze the to-be-processed voice signal by a preset statistical function Feature statistics are performed on the feature data to obtain the feature statistics results to be confirmed; the calculation module 4004 is used to obtain feature target data through a preset multi-objective optimization algorithm according to the feature statistics results to be confirmed; the determination module 4005 is used to calculate the features The target data is input into the preset Softmax classification model to obtain speech emotion recognition results.

所述获取模块4001获取预设维度的测试语音样本，并通过预设规则对所述测试语音样本进行分段处理，获得多个初始语音样本的操作。The obtaining module 4001 obtains a test voice sample of a preset dimension, and performs segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples.

所述提取模块4002对所述初始语音样本进行信号特征数据提取，获得待处理语音信号特征数据的操作。The extraction module 4002 performs an operation of extracting signal feature data on the initial speech sample to obtain the feature data of the speech signal to be processed.

应理解的是，上述所说的每类特征的特征提取结果均为二维矩阵，其中一个维度为时间维度，然后计算每类特征F_i在时间维度上的一阶导数ΔF_i、二阶导数ΔΔFi，并将原始特征、一阶导数结果、二阶导数结果在非时间维度上串接，形成每一类特征的最终特征提取结果；将上述所有类的特征的最终特征提取结果在非时间维度上串接即为该样本的特征提取结果。It should be understood that the feature extraction result of each type of feature mentioned above is a two-dimensional matrix, one of which is the time dimension, and then the first-order derivative ΔF _i and the second-order derivative of each type of feature F _i on the time dimension are calculated. ΔΔFi, and concatenate the original features, first-order derivative results, and second-order derivative results in the non-time dimension to form the final feature extraction result of each type of feature; the final feature extraction result of the above-mentioned features of all classes is in the non-time dimension. The above concatenation is the feature extraction result of the sample.

在MFCC和LPCC连接时，假如

串接后为

When MFCC and LPCC are connected, if

concatenated as

所述统计模块4003通过预设统计函数对所述待处理语音信号特征数据进行特征统计，获得待确认特征统计结果的操作。The statistics module 4003 performs feature statistics on the feature data of the to-be-processed voice signal through a preset statistical function, and obtains the feature statistics result to be confirmed.

所述计算模块4004根据所述待确认特征统计结果，通过预设多目标优化算法获得特征目标数据的操作。The computing module 4004 is an operation of obtaining feature target data through a preset multi-objective optimization algorithm according to the statistical result of the feature to be confirmed.

b_x＝[b₁,b₂,…,b_k]b _x =[b ₁ ,b ₂ ,...,b _k ]

式中

表示第

means the first

式中

表示两向量之间的距离，此处使用欧氏距离。in the formula

Represents the distance between two vectors, where Euclidean distance is used.

(4)根据上一步骤的计算结果可以得到倾向于A情感的片段集合

倾向于B情感的片段集合

以及倾向于中性情感的片段集合

其中

为d_x>T的x组成的集合。

为d_x<-T的x组成的集合。

A collection of snippets leaning towards B sentiment

and a collection of clips that tend towards neutral emotions

in

is the set of x where d _x > T.

is the set of x where d _x <-T.

(5)定义

和

and

J＝J₁(o_i,o_j)+β*J₂(o_i,o_j)J=J ₁ (o _i ,o _j )+β*J ₂ (o _i ,o _j )

β是平衡参数。其中J₁(o_i,o_j)可以实现

和

and

式中o_i和o_j为

和

and

式中

和

and

和

可得到W₁,W₂,…,W_q和b₁,b₂,…,b_q的值，

是求J对W的导数，

and

is the derivative of J with respect to W,

is the derivative of J with respect to b.

所述确定模块4005将所述特征目标数据输入至预设Softmax分类模型中，获得语音情感识别结果的操作。The determining module 4005 inputs the feature target data into the preset Softmax classification model, and obtains the voice emotion recognition result.

应当理解的是，以上仅为举例说明，对本发明的技术方案并不构成任何限定，在具体应用中，本领域的技术人员可以根据需要进行设置，本发明对此不做限制。It should be understood that the above are only examples, and do not constitute any limitation to the technical solutions of the present invention. In specific applications, those skilled in the art can make settings as required, which is not limited by the present invention.

需要说明的是，以上所描述的工作流程仅仅是示意性的，并不对本发明的保护范围构成限定，在实际应用中，本领域的技术人员可以根据实际的需要选择其中的部分或者全部来实现本实施例方案的目的，此处不做限制。It should be noted that the above-described workflow is only illustrative, and does not limit the protection scope of the present invention. In practical applications, those skilled in the art can select some or all of them to implement according to actual needs. The purpose of the solution in this embodiment is not limited here.

另外，未在本实施例中详尽描述的技术细节，可参见本发明任意实施例所提供的语音情感识别方法，此处不再赘述。In addition, for technical details that are not described in detail in this embodiment, reference may be made to the speech emotion recognition method provided by any embodiment of the present invention, and details are not repeated here.

此外，需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。Furthermore, it should be noted that, herein, the terms "comprising", "comprising" or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article or system comprising a series of elements includes not only those elements, but also other elements not expressly listed or inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or system that includes the element.

上述本发明实施例序号仅仅为了描述，不代表实施例的优劣。The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages or disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如只读存储器(Read Only Memory，ROM)/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，或者网络设备等)执行本发明各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is better implementation. Based on such understanding, the technical solutions of the present invention can be embodied in the form of software products in essence or the parts that make contributions to the prior art, and the computer software products are stored in a storage medium (such as a read-only memory). , ROM)/RAM, magnetic disk, optical disk), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) execute the methods described in the various embodiments of the present invention.

以上仅为本发明的优选实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本发明的专利保护范围内。The above are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present invention, or directly or indirectly applied in other related technical fields , are similarly included in the scope of patent protection of the present invention.

Claims

1. A speech emotion recognition method, characterized in that the method comprises:

acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;

extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;

performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;

obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;

inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;

the method for obtaining the initial voice samples comprises the following steps of obtaining a test voice sample with a preset dimensionality, carrying out segmentation processing on the test voice sample through a preset rule, and obtaining a plurality of initial voice samples, wherein before the step of obtaining the initial voice samples, the method further comprises the following steps:

acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples;

performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed;

performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed;

carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories;

determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized;

determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set;

obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference;

determining a feature subspace corresponding to each emotion fragment set;

and establishing a preset Softmax classification model based on the plurality of feature subspaces.

2. The method of claim 1, wherein the step of inputting the feature target data into a preset Softmax classification model to obtain the speech emotion recognition result comprises:

inputting the characteristic target data into the preset Softmax classification model to obtain speech emotion category data;

performing data statistics on the voice emotion type data to obtain a voice emotion type data value;

and obtaining a voice emotion recognition result according to the voice emotion category data value.

3. The method of claim 2, wherein the step of obtaining speech emotion recognition results according to the speech emotion classification data values comprises:

judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;

and if the voice emotion type data value belongs to the preset voice emotion type threshold range, acquiring a voice emotion recognition result according to the voice emotion type data value.

4. The method of claim 3, wherein the step of determining whether the speech emotion classification data value falls within a preset speech emotion classification threshold range further comprises:

and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.

5. The method according to claim 1, wherein the step of performing feature statistics on the feature data of the speech signal to be processed by using a preset statistical function to obtain a feature statistical result to be confirmed comprises:

screening the voice signal characteristic data to be processed to obtain label sample characteristic data;

and carrying out feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed.

6. An apparatus for speech emotion recognition, the apparatus comprising:

the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for acquiring a test voice sample with a preset dimensionality and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;

the extraction module is used for extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;

the statistical module is used for carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;

the calculation module is used for obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;

the determining module is used for inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;

the speech emotion recognition apparatus further includes: acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples; performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed; performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories; determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized; determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set; obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference; determining a feature subspace corresponding to each emotion fragment set; and establishing a preset Softmax classification model based on the plurality of feature subspaces.

7. An electronic device, characterized in that the device comprises: a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, the speech emotion recognition program being configured to implement the steps of the speech emotion recognition method as claimed in any of claims 1 to 5.

8. A storage medium having stored thereon a speech emotion recognition program, which when executed by a processor implements the steps of the speech emotion recognition method as claimed in any one of claims 1 to 5.