
CN118136047A - Voice emotion analysis method based on semantic intonation - Google Patents


Info

Publication number
CN118136047A
CN118136047A (application CN202410545108.2A)
Authority
CN
China
Prior art keywords
representing
emotion
output
semantic
word
Prior art date
Legal status
Granted
Application number
CN202410545108.2A
Other languages
Chinese (zh)
Other versions
CN118136047B (en)
Inventor
翁文娟
杨程越
Current Assignee
Anhui Wuyu Security Technology Co ltd
Original Assignee
Anhui Wuyu Security Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Anhui Wuyu Security Technology Co ltd filed Critical Anhui Wuyu Security Technology Co ltd
Priority to CN202410545108.2A priority Critical patent/CN118136047B/en
Publication of CN118136047A publication Critical patent/CN118136047A/en
Application granted granted Critical
Publication of CN118136047B publication Critical patent/CN118136047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/1815 — Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/26 — Speech to text systems
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice emotion analysis method based on semantic intonation, which comprises the following steps: S1, collecting Chinese voice data and converting it into a digital signal; S2, processing the digital signal with a long-short-time memory network to extract the acoustic properties of the voice; S3, converting the voice data into text form through natural language processing technology and analyzing the text with a Transformer model; S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3 and processing the fused features with a convolutional neural network; S5, analyzing the integrated data with deep learning techniques on the basis of the processing result of S4 to determine the emotional tendency and intensity in the voice; S6, outputting the final emotion analysis result. The invention greatly improves the accuracy and efficiency of Chinese voice emotion analysis and provides a solid technical foundation for automated, intelligent voice emotion analysis.

Description

Voice emotion analysis method based on semantic intonation
Technical Field
The invention relates to the technical field of emotion analysis, in particular to a voice emotion analysis method based on semantic intonation.
Background
In the field of Chinese speech emotion analysis, the traditional method is mainly based on two approaches: acoustic feature analysis and text semantic analysis. Acoustic feature analysis focuses on extracting sound attributes such as pitch, intensity, and rhythm from speech signals, while text semantic analysis converts speech into text to extract and analyze emotional tendency from the language content. However, each of these methods has limitations.
First, acoustic feature analysis tends to ignore the semantic information of the language content, which can lead to inaccurate emotion analysis when processing speech with complex context or implicit meaning. For example, in a sentence carrying irony, the hidden emotional tendency may not be accurately identified from acoustic features alone. Second, text semantic analysis, while capable of capturing the emotional coloring of the language content, generally does not consider the pitch fluctuations and intensity variations of the speech, which are also important in expressing emotion. For example, even when the text content is the same, different tones and intonations may convey different emotions. The core difficulties faced by the prior art include: 1. Separation of acoustic and semantic information: when dealing with complex Chinese contexts, relying on acoustic features alone or on text semantic analysis alone often makes it difficult to identify emotion accurately, because the two types of information are complementary. 2. Limited handling of complex contexts: traditional approaches perform poorly on complex language expressions such as puns, irony, or metaphor. 3. High dependence on data quality: traditional methods depend heavily on the quality and quantity of data, and insufficient or low-quality data can seriously affect the accuracy of the analysis results.
Therefore, how to effectively combine the acoustic features and the semantic information to improve the accuracy and adaptability of the Chinese voice emotion analysis is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a voice emotion analysis method based on semantic intonation, which can efficiently process large amounts of voice data, meets the requirement of rapid processing, and is suitable for large-scale data analysis scenarios.
According to the embodiment of the invention, the voice emotion analysis method based on semantic intonation comprises the following steps:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
S6, outputting a final emotion analysis result.
Optionally, the S1 specifically includes:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), where t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
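The sampling and quantization of S12–S14 can be illustrated with a short numerical sketch. The Python snippet below is only an illustration of the principle and not part of the claimed method; the 16 kHz sampling rate, 16-bit depth, the 220 Hz test tone and the function name sample_and_quantize are assumptions chosen for the example.
```python
import numpy as np

def sample_and_quantize(x_t, duration=1.0, sample_rate=16000, bit_depth=16):
    """Sample a continuous signal x(t) at `sample_rate` and quantize it to `bit_depth` bits (S12-S14)."""
    t = np.arange(0.0, duration, 1.0 / sample_rate)        # discrete sampling instants
    samples = x_t(t)                                        # sample the "analog" signal
    levels = 2 ** (bit_depth - 1)                           # signed quantization levels
    quantized = np.clip(np.round(samples * levels), -levels, levels - 1) / levels
    return t, quantized.astype(np.float32)                  # discrete data points (S15 would store these)

# usage: a 220 Hz tone standing in for the microphone input of S11
t, x = sample_and_quantize(lambda t: 0.5 * np.sin(2 * np.pi * 220 * t))
```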
Optionally, the S2 specifically includes:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
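A minimal sketch of S21–S27, using librosa for the MFCC computation and a PyTorch LSTM for the acoustic representation. The frame length (25 ms / 400 samples), hop length (10 ms / 160 samples), 13 coefficients, the two-layer 128-unit LSTM and the file name utterance.wav are illustrative assumptions, not values fixed by the invention.
```python
import librosa
import torch
import torch.nn as nn

# S21-S23: frame the digitized signal and compute per-frame MFCCs (illustrative parameters)
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)   # shape (13, T)
frames = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)                 # shape (1, T, 13)

# S24-S27: feed the per-frame coefficients to a multi-layer LSTM and keep all hidden states
lstm = nn.LSTM(input_size=13, hidden_size=128, num_layers=2, batch_first=True)
h_seq, (h_n, c_n) = lstm(frames)    # h_seq: (1, T, 128), one hidden state h_t per time step
acoustic_repr = h_seq               # comprehensive representation of the acoustic properties
```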
Optionally, the step S3 specifically includes:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, where each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, where each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
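The self-attention computation of S34 can be sketched directly from the formulas above. Everything in the snippet (64-dimensional embeddings, a 32-dimensional query/key/value space, random weights) is an illustrative assumption; it only demonstrates the scaled dot-product attention itself, not the full Transformer encoder.
```python
import torch
import torch.nn.functional as F

def self_attention(E, W_q, W_k, W_v, b_q, b_k, b_v):
    """Scaled dot-product self-attention over word embeddings E of shape (n, d_model)."""
    Q = E @ W_q + b_q                      # query vectors, one per word
    K = E @ W_k + b_k                      # key vectors
    V = E @ W_v + b_v                      # value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k ** 0.5          # attention scores, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)    # normalized importance of every other word
    return weights @ V                     # context-aware representation of each word

# usage: 6 words with 64-dim embeddings, 32-dim queries/keys/values (illustrative sizes)
n, d_model, d_k = 6, 64, 32
E = torch.randn(n, d_model)
params = [torch.randn(d_model, d_k) * 0.1 for _ in range(3)] + [torch.zeros(d_k) for _ in range(3)]
out = self_attention(E, *params)           # shape (6, 32)
```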
Optionally, the step S4 specifically includes:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
Optionally, the step S43 specifically includes:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
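The fusion of S41–S46 and S431–S433 can be sketched as follows. The sequence lengths, the 256-dimensional common feature space, equal weights α = β = 0.5 and the single convolution/pooling stage are assumptions made for the example; the snippet merely illustrates mapping-then-weighted-sum fusion followed by a convolutional layer.
```python
import torch
import torch.nn as nn

# assumed shapes: H from the LSTM (T=50 states of 128 dims), S from the Transformer (n=20 states of 256 dims)
H = torch.randn(50, 128)
S = torch.randn(20, 256)

d_common = 256
M_a = nn.Linear(50 * 128, d_common)      # mapping matrix for the concatenated acoustic states (S431)
M_s = nn.Linear(20 * 256, d_common)      # mapping matrix for the concatenated semantic states (S432)

H_prime = M_a(H.flatten())               # acoustic properties in the common feature space
S_prime = M_s(S.flatten())               # semantic information in the common feature space

alpha, beta = 0.5, 0.5                   # illustrative importance weights (S433)
F_fused = alpha * H_prime + beta * S_prime

# S43/S45: one convolution + pooling stage over the fused feature vector
conv = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
)
O = conv(F_fused.view(1, 1, -1))         # high-level features combining both modalities (S46)
```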
Optionally, the step S5 specifically includes:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, where each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
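A minimal sketch of the fully connected feature extraction (S52) and the softmax classifier (S53). The layer sizes, the flattened input length and the six emotion categories are illustrative assumptions rather than values fixed by the invention.
```python
import torch
import torch.nn as nn

# assumed input: the flattened CNN output from S4 (here 2048 values per sample)
z0 = torch.randn(1, 2048)

# S52: two fully connected layers with non-linear activations (layer sizes are illustrative)
feature_extractor = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
)

# S53: classifier producing a probability distribution over emotion categories
num_emotions = 6                          # assumed label set size, e.g. happy/angry/sad/fearful/surprised/neutral
classifier = nn.Sequential(nn.Linear(128, num_emotions), nn.Softmax(dim=-1))

z = feature_extractor(z0)                 # deeply extracted high-level features
y = classifier(z)                         # probability of each emotion category (rows sum to 1)
```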
Optionally, the step S6 specifically includes:
S61, receiving the classifier output y from S5, where y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
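S61–S64 reduce the classifier output to a tendency and an intensity; the small sketch below uses an assumed label set and an example probability vector purely for illustration.
```python
import torch

def emotion_result(y, labels):
    """S61-S64: pick the most probable category and report its probability as the intensity."""
    idx = int(torch.argmax(y, dim=-1))            # index of the emotion with the highest probability
    return {"emotion": labels[idx],               # final emotional tendency E
            "intensity": round(float(y[idx]), 3)} # emotional intensity I = max probability

labels = ["happy", "angry", "sad", "fearful", "surprised", "neutral"]   # assumed label set
y = torch.tensor([0.05, 0.62, 0.10, 0.08, 0.05, 0.10])                  # example classifier output
print(emotion_result(y, labels))                  # -> {'emotion': 'angry', 'intensity': 0.62}
```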
The beneficial effects of the invention are as follows:
The invention achieves a substantial innovation in the field of Chinese voice emotion analysis: by fusing a long-short-time memory network, a convolutional neural network and a Transformer model, it effectively combines the acoustic characteristics of the voice with the semantic information of the text. This integrated method significantly improves the handling of complex contexts and shows particularly high accuracy and adaptability when processing speech containing complex emotional expressions such as puns, irony or metaphor. Compared with traditional emotion analysis methods that rely on a single type of information, the method maintains high accuracy while reducing the dependence on data quality and strengthening the robustness of the model to data of varying quality. In addition, by optimizing the deep learning model structure, the method can efficiently process large amounts of voice data, meets the requirement of rapid processing, and is suitable for large-scale data analysis scenarios. In summary, the invention not only greatly improves the accuracy and efficiency of Chinese voice emotion analysis, but also provides a solid technical basis for automated, intelligent voice emotion analysis, with broad application prospects and practical value.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
Fig. 1 is a general architecture diagram of the deep-learning-based Chinese speech emotion analysis system according to the present invention. The figure shows the whole flow of the system from voice data input to emotion analysis output, including the key steps of acoustic feature extraction, semantic information extraction, feature fusion, feature processing and emotion classification.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, a voice emotion analysis method based on semantic intonation includes the steps of:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
In this embodiment, the S1 specifically includes:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), where t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
in this embodiment, the S2 specifically includes:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
in this embodiment, the step S3 specifically includes:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, where each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, where each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
in this embodiment, the S4 specifically includes:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
In this embodiment, the step S43 specifically includes:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
in this embodiment, the step S5 specifically includes:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, where each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
S6, outputting a final emotion analysis result.
In this embodiment, the step S6 specifically includes:
S61, receiving the classifier output y from S5, where y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
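For orientation, the overall architecture of Fig. 1 can be sketched end to end as a single PyTorch module. This is an illustrative assembly only: the layer sizes, the use of nn.TransformerEncoder for the semantic branch, mean pooling in place of the concatenation-and-mapping fusion of S431–S433, and the six output classes are assumptions chosen for brevity rather than details fixed by the embodiment.
```python
import torch
import torch.nn as nn

class SemanticIntonationSER(nn.Module):
    """Illustrative end-to-end model: acoustic LSTM + text Transformer + fusion CNN + classifier."""
    def __init__(self, n_mfcc=13, vocab=5000, d_model=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, d_model, num_layers=2, batch_first=True)        # S2: acoustic branch
        self.embed = nn.Embedding(vocab, d_model)                                    # S33: word embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)                # S34-S36: semantic branch
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveMaxPool1d(64))                          # S43-S46: CNN over fusion
        self.head = nn.Sequential(nn.Linear(16 * 64, 128), nn.ReLU(),
                                  nn.Linear(128, n_emotions))                        # S52-S53: FC + classifier
        self.alpha, self.beta = 0.5, 0.5                                             # S433: fusion weights

    def forward(self, mfcc, tokens):
        h, _ = self.lstm(mfcc)                                    # (B, T, d_model) acoustic hidden states
        s = self.encoder(self.embed(tokens))                      # (B, n, d_model) semantic representations
        fused = self.alpha * h.mean(1) + self.beta * s.mean(1)    # weighted-sum fusion (S42/S433)
        feats = self.conv(fused.unsqueeze(1)).flatten(1)          # high-level fused features
        return torch.softmax(self.head(feats), dim=-1)            # emotion probability distribution

model = SemanticIntonationSER()
probs = model(torch.randn(2, 100, 13), torch.randint(0, 5000, (2, 20)))  # shape (2, 6)
```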
Examples:
In this embodiment, we describe in detail an application scenario of the present invention in July 2024 at a large customer service center in Beijing. The service center handles more than 5,000 customer calls every day, and these telephone recordings contain rich emotional information that is important for improving customer experience and service efficiency.
First, the deep learning model of the present invention was used to analyze 5,000 telephone recordings collected over one week (July 1 to July 7). Each recording is first processed by converting the speech into a digital signal, and a long-short-time memory network (LSTM) is then used to extract acoustic features of the speech such as pitch, rhythm and volume. At the same time, the speech is converted into text through natural language processing technology, and a Transformer model is applied to extract the semantic information of the text.
Next, the acoustic properties and semantic information are combined by a weighted sum fusion function and further feature processing is performed using a Convolutional Neural Network (CNN). The fused feature set is processed by multiple convolution layers and pooling layers of the CNN to extract high-level features useful for emotion analysis.
Our system then uses fully connected layers and a classifier to determine the emotional tendency and intensity of each telephone recording. In this step, the probability distribution output by the classifier is used to identify the different emotion categories. We select the emotion category with the highest probability as the final emotional tendency and estimate the emotional intensity from the magnitude of that probability value.
Among the 5,000 telephone recordings analyzed, the present system successfully identified positive and negative emotions with an accuracy of about 92%, a significant improvement over the traditional approach without semantic information fusion (about 75% accuracy). The method of the invention shows particularly high accuracy when dealing with complex emotional expressions, including irony and puns. For example, in a recording in which a customer expressed dissatisfaction in a humorous tone, the present invention accurately recognized the negative emotion, whereas the conventional method misjudged it as positive.
After applying the invention, the customer service center can understand the emotional state of customers more accurately, improving the speed and quality of its responses to customer needs. For example, the July customer satisfaction survey shows an improvement in satisfaction of about 10% compared with June, which is largely attributable to the application of the present invention in emotion analysis.
TABLE 1 comparison of Speech emotion analysis effects for customer service centers
Referring to Table 1 above, the accuracy of emotion analysis using the method of the present invention is significantly higher than that of conventional methods. At the same time, after implementing the invention, customer satisfaction in July improved markedly compared with June. These data demonstrate the effectiveness and benefits of the present invention in practical applications.
In general, the invention not only improves the accuracy obviously by combining the acoustic characteristics and the semantic information, but also enhances the processing capacity of complex contexts, and provides effective technical support for application scenes such as customer service centers and the like which need to process a large amount of voice data rapidly and accurately.

Claims (8)

1. A voice emotion analysis method based on semantic intonation, characterized by comprising the following steps:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
S6, outputting a final emotion analysis result.
2. The speech emotion analysis method based on semantic intonation according to claim 1, wherein S1 specifically comprises:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), wherein t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
3. The speech emotion analysis method based on semantic intonation according to claim 2, wherein S2 specifically comprises:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
4. The speech emotion analysis method based on semantic intonation according to claim 3, wherein S3 specifically comprises:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, wherein each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, wherein each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
5. The speech emotion analysis method based on semantic intonation according to claim 4, wherein S4 specifically comprises:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
6. The speech emotion analysis method based on semantic intonation according to claim 5, wherein S43 specifically comprises:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
7. The speech emotion analysis method based on semantic intonation according to claim 6, wherein S5 specifically comprises:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, wherein each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
8. The speech emotion analysis method based on semantic intonation according to claim 7, wherein S6 specifically comprises:
S61, receiving the classifier output y from S5, wherein y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
CN202410545108.2A 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation Active CN118136047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410545108.2A CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410545108.2A CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Publications (2)

Publication Number Publication Date
CN118136047A true CN118136047A (en) 2024-06-04
CN118136047B CN118136047B (en) 2024-07-19

Family

ID=91232795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410545108.2A Active CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Country Status (1)

Country Link
CN (1) CN118136047B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118824296A (en) * 2024-08-02 2024-10-22 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN119358542A (en) * 2024-12-23 2025-01-24 苏州大学 A sentiment analysis system and method based on artificial intelligence drive

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111627462A (en) * 2020-05-22 2020-09-04 云知声(上海)智能科技有限公司 A method and device for emotion recognition based on semantic analysis
CN114446324A (en) * 2022-01-28 2022-05-06 江苏师范大学 Multi-mode emotion recognition method based on acoustic and text features

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111627462A (en) * 2020-05-22 2020-09-04 云知声(上海)智能科技有限公司 A method and device for emotion recognition based on semantic analysis
CN114446324A (en) * 2022-01-28 2022-05-06 江苏师范大学 Multi-mode emotion recognition method based on acoustic and text features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FELICIA ANDAYANI 等: "Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files", IEEE ACCESS, 31 March 2022 (2022-03-31), pages 1 - 4 *
HAIYANG XU: "Learning Alignment for Multimodal Emotion Recognition from Speech", ARXIV, 3 April 2020 (2020-04-03), pages 1 - 4 *
王锐 (WANG Rui): "Research on speech emotion recognition methods based on deep learning" (基于深度学习的语音情绪识别方法研究), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库), 15 April 2024 (2024-04-15), pages 8 - 50 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118824296A (en) * 2024-08-02 2024-10-22 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN118824296B (en) * 2024-08-02 2025-03-28 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN119358542A (en) * 2024-12-23 2025-01-24 苏州大学 A sentiment analysis system and method based on artificial intelligence drive

Also Published As

Publication number Publication date
CN118136047B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN118136047B (en) Voice emotion analysis method based on semantic intonation
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
WO2021114841A1 (en) User report generating method and terminal device
Gupta et al. Two-stream emotion recognition for call center monitoring.
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN115662435B (en) A method and terminal for generating realistic speech of a virtual teacher
CN112562669A (en) Intelligent digital newspaper automatic summarization and voice interaction news chat method and system
CN111414513A (en) Music genre classification method and device and storage medium
CN112581964A (en) Multi-domain oriented intelligent voice interaction method
CN117851871A (en) Multi-mode data identification method for overseas Internet social network site
CN116431806A (en) Natural language understanding method and refrigerator
CN114298019A (en) Emotion recognition method, emotion recognition apparatus, emotion recognition device, storage medium, and program product
CN110931016A (en) Voice recognition method and system for offline quality inspection
Anguraj et al. Analysis of influencing features with spectral feature extraction and multi-class classification using deep neural network for speech recognition system
CN119004381A (en) Multi-mode large model synchronous training and semantic association construction system and training method thereof
CN117829166A (en) Intention recognition system and method suitable for telephone recording and chat recording
CN117116251A (en) Repayment probability assessment method and device based on collection-accelerating record
Brucal et al. Filipino speech to text system using Convolutional Neural Network
Jing et al. Acquisition of english corpus machine translation based on speech recognition technology
Chethan et al. Comprehensive Approach to Multi Model Speech Emotion Recognition System
CN117857599B (en) Digital person dialogue intelligent management system based on Internet of things
CN112002306A (en) Voice category identification method and device, electronic equipment and readable storage medium
CN112820274B (en) Voice information recognition correction method and system
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant