
CN118136047A - Voice emotion analysis method based on semantic intonation - Google Patents


Info

Publication number
CN118136047A
CN118136047A (application CN202410545108.2A)
Authority
CN
China
Prior art keywords
representing
emotion
output
semantic
word
Prior art date
Legal status
Granted
Application number
CN202410545108.2A
Other languages
Chinese (zh)
Other versions
CN118136047B (en)
Inventor
翁文娟
杨程越
Current Assignee
Anhui Wuyu Security Technology Co ltd
Original Assignee
Anhui Wuyu Security Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Anhui Wuyu Security Technology Co ltd filed Critical Anhui Wuyu Security Technology Co ltd
Priority to CN202410545108.2A priority Critical patent/CN118136047B/en
Publication of CN118136047A publication Critical patent/CN118136047A/en
Application granted granted Critical
Publication of CN118136047B publication Critical patent/CN118136047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/1815 — Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/26 — Speech to text systems
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice emotion analysis method based on semantic intonation, which comprises the following steps: S1, collecting Chinese voice data and converting it into a digital signal; S2, processing the digital signal with a long-short-time memory network to extract the acoustic properties of the voice; S3, converting the voice data into text form through natural language processing technology and analyzing the text with a Transformer model; S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3 and processing the fused features with a convolutional neural network; S5, analyzing the integrated data with deep learning techniques on the basis of the processing result of S4 to determine the emotional tendency and intensity in the voice; S6, outputting the final emotion analysis result. The invention greatly improves the accuracy and efficiency of Chinese voice emotion analysis and provides a solid technical foundation for automated, intelligent voice emotion analysis.

Description

Voice emotion analysis method based on semantic intonation
Technical Field
The invention relates to the technical field of emotion analysis, in particular to a voice emotion analysis method based on semantic intonation.
Background
In the field of Chinese speech emotion analysis, the traditional method is mainly based on two approaches: acoustic feature analysis and text semantic analysis. Acoustic feature analysis focuses on extracting sound attributes such as pitch, intensity, and rhythm from speech signals, while text semantic analysis converts speech into text to extract and analyze emotional tendency from the language content. However, each of these methods has limitations.
First, acoustic feature analysis tends to ignore the semantic information of the language content, which can lead to inaccurate emotion analysis when processing speech with complex context or implicit meaning. For example, in a sentence carrying irony, the hidden emotional tendency may not be accurately identified from acoustic features alone. Second, text semantic analysis, while capable of capturing the emotional coloring of the language content, generally does not consider the pitch fluctuations and intensity variations of the speech, which are also important in expressing emotion. For example, even when the text content is the same, different tones and intonations may convey different emotions. The core difficulties faced by the prior art include: 1. Separation of acoustic and semantic information: when dealing with complex Chinese contexts, relying on acoustic features alone or on text semantic analysis alone often makes it difficult to identify emotion accurately, because the two types of information are complementary. 2. Limited handling of complex contexts: traditional approaches perform poorly on complex language expressions such as puns, irony, or metaphor. 3. High dependence on data quality: traditional methods depend heavily on the quality and quantity of data, and insufficient or low-quality data can seriously affect the accuracy of the analysis results.
Therefore, how to effectively combine the acoustic features and the semantic information to improve the accuracy and adaptability of the Chinese voice emotion analysis is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a voice emotion analysis method based on semantic intonation, which can efficiently process large amounts of voice data, meets the requirement of rapid processing, and is suitable for large-scale data analysis scenarios.
According to the embodiment of the invention, the voice emotion analysis method based on semantic intonation comprises the following steps:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
S6, outputting a final emotion analysis result.
Optionally, the S1 specifically includes:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), where t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
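The sampling and quantization of S12–S14 can be illustrated with a short numerical sketch. The Python snippet below is only an illustration of the principle and not part of the claimed method; the 16 kHz sampling rate, 16-bit depth, the 220 Hz test tone and the function name sample_and_quantize are assumptions chosen for the example.
```python
import numpy as np

def sample_and_quantize(x_t, duration=1.0, sample_rate=16000, bit_depth=16):
    """Sample a continuous signal x(t) at `sample_rate` and quantize it to `bit_depth` bits (S12-S14)."""
    t = np.arange(0.0, duration, 1.0 / sample_rate)        # discrete sampling instants
    samples = x_t(t)                                        # sample the "analog" signal
    levels = 2 ** (bit_depth - 1)                           # signed quantization levels
    quantized = np.clip(np.round(samples * levels), -levels, levels - 1) / levels
    return t, quantized.astype(np.float32)                  # discrete data points (S15 would store these)

# usage: a 220 Hz tone standing in for the microphone input of S11
t, x = sample_and_quantize(lambda t: 0.5 * np.sin(2 * np.pi * 220 * t))
```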
Optionally, the S2 specifically includes:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
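A minimal sketch of S21–S27, using librosa for the MFCC computation and a PyTorch LSTM for the acoustic representation. The frame length (25 ms / 400 samples), hop length (10 ms / 160 samples), 13 coefficients, the two-layer 128-unit LSTM and the file name utterance.wav are illustrative assumptions, not values fixed by the invention.
```python
import librosa
import torch
import torch.nn as nn

# S21-S23: frame the digitized signal and compute per-frame MFCCs (illustrative parameters)
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)   # shape (13, T)
frames = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)                 # shape (1, T, 13)

# S24-S27: feed the per-frame coefficients to a multi-layer LSTM and keep all hidden states
lstm = nn.LSTM(input_size=13, hidden_size=128, num_layers=2, batch_first=True)
h_seq, (h_n, c_n) = lstm(frames)    # h_seq: (1, T, 128), one hidden state h_t per time step
acoustic_repr = h_seq               # comprehensive representation of the acoustic properties
```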
Optionally, the step S3 specifically includes:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, where each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, where each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
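The self-attention computation of S34 can be sketched directly from the formulas above. Everything in the snippet (64-dimensional embeddings, a 32-dimensional query/key/value space, random weights) is an illustrative assumption; it only demonstrates the scaled dot-product attention itself, not the full Transformer encoder.
```python
import torch
import torch.nn.functional as F

def self_attention(E, W_q, W_k, W_v, b_q, b_k, b_v):
    """Scaled dot-product self-attention over word embeddings E of shape (n, d_model)."""
    Q = E @ W_q + b_q                      # query vectors, one per word
    K = E @ W_k + b_k                      # key vectors
    V = E @ W_v + b_v                      # value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k ** 0.5          # attention scores, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)    # normalized importance of every other word
    return weights @ V                     # context-aware representation of each word

# usage: 6 words with 64-dim embeddings, 32-dim queries/keys/values (illustrative sizes)
n, d_model, d_k = 6, 64, 32
E = torch.randn(n, d_model)
params = [torch.randn(d_model, d_k) * 0.1 for _ in range(3)] + [torch.zeros(d_k) for _ in range(3)]
out = self_attention(E, *params)           # shape (6, 32)
```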
Optionally, the step S4 specifically includes:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
Optionally, the step S43 specifically includes:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
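The fusion of S41–S46 and S431–S433 can be sketched as follows. The sequence lengths, the 256-dimensional common feature space, equal weights α = β = 0.5 and the single convolution/pooling stage are assumptions made for the example; the snippet merely illustrates mapping-then-weighted-sum fusion followed by a convolutional layer.
```python
import torch
import torch.nn as nn

# assumed shapes: H from the LSTM (T=50 states of 128 dims), S from the Transformer (n=20 states of 256 dims)
H = torch.randn(50, 128)
S = torch.randn(20, 256)

d_common = 256
M_a = nn.Linear(50 * 128, d_common)      # mapping matrix for the concatenated acoustic states (S431)
M_s = nn.Linear(20 * 256, d_common)      # mapping matrix for the concatenated semantic states (S432)

H_prime = M_a(H.flatten())               # acoustic properties in the common feature space
S_prime = M_s(S.flatten())               # semantic information in the common feature space

alpha, beta = 0.5, 0.5                   # illustrative importance weights (S433)
F_fused = alpha * H_prime + beta * S_prime

# S43/S45: one convolution + pooling stage over the fused feature vector
conv = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
)
O = conv(F_fused.view(1, 1, -1))         # high-level features combining both modalities (S46)
```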
Optionally, the step S5 specifically includes:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, where each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
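A minimal sketch of the fully connected feature extraction (S52) and the softmax classifier (S53). The layer sizes, the flattened input length and the six emotion categories are illustrative assumptions rather than values fixed by the invention.
```python
import torch
import torch.nn as nn

# assumed input: the flattened CNN output from S4 (here 2048 values per sample)
z0 = torch.randn(1, 2048)

# S52: two fully connected layers with non-linear activations (layer sizes are illustrative)
feature_extractor = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
)

# S53: classifier producing a probability distribution over emotion categories
num_emotions = 6                          # assumed label set size, e.g. happy/angry/sad/fearful/surprised/neutral
classifier = nn.Sequential(nn.Linear(128, num_emotions), nn.Softmax(dim=-1))

z = feature_extractor(z0)                 # deeply extracted high-level features
y = classifier(z)                         # probability of each emotion category (rows sum to 1)
```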
Optionally, the step S6 specifically includes:
S61, receiving the classifier output y from S5, where y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
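S61–S64 reduce the classifier output to a tendency and an intensity; the small sketch below uses an assumed label set and an example probability vector purely for illustration.
```python
import torch

def emotion_result(y, labels):
    """S61-S64: pick the most probable category and report its probability as the intensity."""
    idx = int(torch.argmax(y, dim=-1))            # index of the emotion with the highest probability
    return {"emotion": labels[idx],               # final emotional tendency E
            "intensity": round(float(y[idx]), 3)} # emotional intensity I = max probability

labels = ["happy", "angry", "sad", "fearful", "surprised", "neutral"]   # assumed label set
y = torch.tensor([0.05, 0.62, 0.10, 0.08, 0.05, 0.10])                  # example classifier output
print(emotion_result(y, labels))                  # -> {'emotion': 'angry', 'intensity': 0.62}
```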
The beneficial effects of the invention are as follows:
The invention achieves a substantial innovation in the field of Chinese voice emotion analysis: by fusing a long-short-time memory network, a convolutional neural network and a Transformer model, it effectively combines the acoustic characteristics of the voice with the semantic information of the text. This integrated method significantly improves the handling of complex contexts and shows particularly high accuracy and adaptability when processing speech containing complex emotional expressions such as puns, irony or metaphor. Compared with traditional emotion analysis methods that rely on a single type of information, the method maintains high accuracy while reducing the dependence on data quality and strengthening the robustness of the model to data of varying quality. In addition, by optimizing the deep learning model structure, the method can efficiently process large amounts of voice data, meets the requirement of rapid processing, and is suitable for large-scale data analysis scenarios. In summary, the invention not only greatly improves the accuracy and efficiency of Chinese voice emotion analysis, but also provides a solid technical basis for automated, intelligent voice emotion analysis, with broad application prospects and practical value.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
Fig. 1 is a general architecture diagram of the deep-learning-based Chinese speech emotion analysis system according to the present invention. The figure shows the whole flow of the system from voice data input to emotion analysis output, including the key steps of acoustic feature extraction, semantic information extraction, feature fusion, feature processing and emotion classification.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, a voice emotion analysis method based on semantic intonation includes the steps of:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
In this embodiment, the S1 specifically includes:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), where t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
in this embodiment, the S2 specifically includes:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
in this embodiment, the step S3 specifically includes:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, where each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, where each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
in this embodiment, the S4 specifically includes:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
In this embodiment, the step S43 specifically includes:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
in this embodiment, the step S5 specifically includes:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, where each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
S6, outputting a final emotion analysis result.
In this embodiment, the step S6 specifically includes:
S61, receiving the classifier output y from S5, where y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
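For orientation, the overall architecture of Fig. 1 can be sketched end to end as a single PyTorch module. This is an illustrative assembly only: the layer sizes, the use of nn.TransformerEncoder for the semantic branch, mean pooling in place of the concatenation-and-mapping fusion of S431–S433, and the six output classes are assumptions chosen for brevity rather than details fixed by the embodiment.
```python
import torch
import torch.nn as nn

class SemanticIntonationSER(nn.Module):
    """Illustrative end-to-end model: acoustic LSTM + text Transformer + fusion CNN + classifier."""
    def __init__(self, n_mfcc=13, vocab=5000, d_model=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, d_model, num_layers=2, batch_first=True)        # S2: acoustic branch
        self.embed = nn.Embedding(vocab, d_model)                                    # S33: word embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)                # S34-S36: semantic branch
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveMaxPool1d(64))                          # S43-S46: CNN over fusion
        self.head = nn.Sequential(nn.Linear(16 * 64, 128), nn.ReLU(),
                                  nn.Linear(128, n_emotions))                        # S52-S53: FC + classifier
        self.alpha, self.beta = 0.5, 0.5                                             # S433: fusion weights

    def forward(self, mfcc, tokens):
        h, _ = self.lstm(mfcc)                                    # (B, T, d_model) acoustic hidden states
        s = self.encoder(self.embed(tokens))                      # (B, n, d_model) semantic representations
        fused = self.alpha * h.mean(1) + self.beta * s.mean(1)    # weighted-sum fusion (S42/S433)
        feats = self.conv(fused.unsqueeze(1)).flatten(1)          # high-level fused features
        return torch.softmax(self.head(feats), dim=-1)            # emotion probability distribution

model = SemanticIntonationSER()
probs = model(torch.randn(2, 100, 13), torch.randint(0, 5000, (2, 20)))  # shape (2, 6)
```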
Examples:
In this embodiment, we describe in detail an application scenario of the present invention in July 2024 at a large customer service center in Beijing. The service center handles more than 5,000 customer calls every day, and these telephone recordings contain rich emotional information that is important for improving customer experience and service efficiency.
First, the deep learning model of the present invention was used to analyze 5,000 telephone recordings collected over one week (July 1 to July 7). Each recording is first processed by converting the speech into a digital signal, and a long-short-time memory network (LSTM) is then used to extract acoustic features of the speech such as pitch, rhythm and volume. At the same time, the speech is converted into text through natural language processing technology, and a Transformer model is applied to extract the semantic information of the text.
Next, the acoustic properties and semantic information are combined by a weighted sum fusion function and further feature processing is performed using a Convolutional Neural Network (CNN). The fused feature set is processed by multiple convolution layers and pooling layers of the CNN to extract high-level features useful for emotion analysis.
Our system then uses fully connected layers and a classifier to determine the emotional tendency and intensity of each telephone recording. In this step, the probability distribution output by the classifier is used to identify the different emotion categories. We select the emotion category with the highest probability as the final emotional tendency and estimate the emotional intensity from the magnitude of that probability value.
Among the 5,000 telephone recordings analyzed, the present system successfully identified positive and negative emotions with an accuracy of about 92%, a significant improvement over the traditional approach without semantic information fusion (about 75% accuracy). The method of the invention shows particularly high accuracy when dealing with complex emotional expressions, including irony and puns. For example, in a recording in which a customer expressed dissatisfaction in a humorous tone, the present invention accurately recognized the negative emotion, whereas the conventional method misjudged it as positive.
After applying the invention, the customer service center can understand the emotional state of customers more accurately, improving the speed and quality of its responses to customer needs. For example, the July customer satisfaction survey shows an improvement in satisfaction of about 10% compared with June, which is largely attributable to the application of the present invention in emotion analysis.
TABLE 1 comparison of Speech emotion analysis effects for customer service centers
Referring to Table 1 above, the accuracy of emotion analysis using the method of the present invention is significantly higher than that of conventional methods. At the same time, after implementing the invention, customer satisfaction in July improved markedly compared with June. These data demonstrate the effectiveness and benefits of the present invention in practical applications.
In general, the invention not only improves the accuracy obviously by combining the acoustic characteristics and the semantic information, but also enhances the processing capacity of complex contexts, and provides effective technical support for application scenes such as customer service centers and the like which need to process a large amount of voice data rapidly and accurately.

Claims (8)

1. A voice emotion analysis method based on semantic intonation, characterized by comprising the following steps:
S1, collecting Chinese voice data and converting the Chinese voice data into digital signals;
S2, processing the digital signals by using a long-short-time memory network, and extracting acoustic properties of voice;
S3, converting voice data into text form through natural language processing technology, and analyzing the text by using a Transformer model;
S4, combining the acoustic properties extracted in S2 with the semantic information extracted in S3, and processing the fused features by using a convolutional neural network;
S5, analyzing the integrated data by applying a deep learning technology based on the processing result in the S4, and determining emotion tendency and strength in the voice;
S6, outputting a final emotion analysis result.
2. The speech emotion analysis method based on semantic intonation according to claim 1, wherein S1 specifically comprises:
s11, collecting Chinese voice input by using a microphone or similar audio capturing equipment;
S12, converting the collected voice input into an analog signal, and converting the analog signal into a digital signal through an analog-to-digital converter, the digital signal being expressed as x(t), wherein t represents time;
S13, preprocessing the digital signal x(t);
S14, sampling and quantizing the preprocessed digital signal x(t), obtaining the characteristics of the Chinese voice by converting the continuous signal into discrete data points;
S15, storing the processed digital signals.
3. The speech emotion analysis method based on semantic intonation according to claim 2, wherein S2 specifically comprises:
S21, receiving the digital voice signal x(t) preprocessed in S1 and dividing it into a series of fixed-length frames f_1, f_2, …, f_T, each frame containing N sampling points;
S22, applying a window function to each frame to reduce discontinuities at the frame edges;
S23, for the windowed data of each frame f_i, calculating the mel-frequency cepstral coefficients: each frame is converted into a frequency-domain representation by applying a fast Fourier transform and mapped onto a mel scale that reflects human auditory perception:
c_i = DCT( log( Σ_k |FFT(f_i)(k)|² · H_m(k) ) );
wherein DCT is the discrete cosine transform, FFT represents the conversion of the time-domain signal into a frequency-domain signal, and H_m(k) is the response of the m-th mel filter at frequency point k;
S24, inputting the calculated MFCC coefficients c_i of each frame f_i as features into the long-short-time memory network;
S25, the long-short-time memory network consists of several layers, each layer containing several LSTM units; each unit receives one feature vector x_t and the hidden state h_{t−1} of the previous time step as input and computes the hidden state h_t of the current time step; the computation involves three gating structures, namely the forget gate f_t, the input gate i_t and the output gate o_t, together with the cell state C_t:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_C · [h_{t−1}, x_t] + b_C);
h_t = o_t ⊙ tanh(C_t);
wherein σ is the sigmoid (S-shaped) activation function, tanh is the hyperbolic tangent activation function, W and b are the respective weight matrices and bias terms, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state with the current feature vector;
S26, the long-short-time memory network processes the entire MFCC feature sequence step by step and outputs a hidden state h_t for each time step;
S27, the outputs h_1, …, h_T of the LSTM network serve as a comprehensive representation of the acoustic properties for the subsequent emotion analysis.
4. The speech emotion analysis method based on semantic intonation according to claim 3, wherein S3 specifically comprises:
S31, receiving the text sequence W = {w_1, w_2, …, w_n} converted by the automatic speech recognition system in S1, wherein each w_i represents a character or word;
S32, preprocessing the text sequence, including removing stop words, part-of-speech tagging and semantic disambiguation;
S33, inputting the preprocessed text sequence into a Transformer model; the model first converts each word w_i into a high-dimensional word embedding vector e_i through a word embedding layer, capturing the semantic features of each word;
S34, applying the self-attention mechanism in the Transformer model to process each word embedding vector e_i and capture the relationships and dependencies between words; for each word embedding vector e_i, three different vectors are generated: a query vector Q_i, a key vector K_i and a value vector V_i, each obtained from the original word embedding vector e_i by a different linear transformation:
Q_i = W_Q · e_i + b_Q;
K_i = W_K · e_i + b_K;
V_i = W_V · e_i + b_V;
wherein e_i denotes the high-dimensional word embedding vector of the i-th word, W_Q, W_K and W_V are the matrices that transform the word embedding vector into the corresponding query, key and value vectors, and b_Q, b_K and b_V are the bias terms of the query, key and value respectively; for a given word w_i, its attention score is obtained by computing the dot products of its query vector Q_i with the key vectors K of all other words in the sequence, scaling by the factor √d_k and normalizing:
Attention(Q_i, K, V) = softmax( Q_i · Kᵀ / √d_k ) · V;
wherein Q_i denotes the query vector of the i-th word, K denotes the set of key vectors of all words in the text sequence, V denotes the set of value vectors of all words in the text sequence, d_k denotes the dimension of the key vectors, softmax is the normalization function, and Attention(Q_i, K, V) is the self-attention output for the query vector Q_i, i.e. the degree of importance given to the information of the other words in the sequence when processing the i-th word;
S35, the output of the self-attention layer is further processed by a series of encoder layers, each encoder layer comprising a self-attention sub-layer and a feed-forward neural network, the feed-forward neural network applying a non-linear transformation to the output of the self-attention sub-layer;
S36, the final output of the Transformer model, s_1, …, s_n, is a comprehensive semantic representation of each word in the text sequence, containing the semantic information and the relationships between words;
S37, the output of the Transformer model is used for the subsequent speech emotion analysis, wherein each output s_i is used to evaluate the emotional tendency and intensity of the corresponding character or word.
5. The speech emotion analysis method based on semantic intonation according to claim 4, wherein S4 specifically comprises:
S41, obtaining the output H = {h_1, …, h_T} of the long-short-time memory network and the output S = {s_1, …, s_n} of the Transformer model;
S42, combining the two groups of outputs into a fused feature set F through a fusion operation:
F = Fuse(H, S);
wherein Fuse is the fusion function, representing the combination of the acoustic properties and the semantic information into a unified feature representation;
S43, inputting the fused feature set F into a convolutional neural network and performing the convolution operation:
O_l = φ(W_l ∗ O_{l−1} + b_l), with O_0 = F;
wherein O_l is the output of the l-th convolution layer, W_l is the weight matrix of the l-th convolution layer, b_l is a bias term, ∗ represents the convolution operation, and φ is an activation function;
S45, applying pooling to the convolution layers of S43;
S46, after processing by several convolution and pooling layers, the output O of the convolutional neural network represents high-level features that combine the acoustic properties and the semantic information.
6. The speech emotion analysis method based on semantic intonation according to claim 5, wherein S43 specifically comprises:
S431, the representation of the acoustic properties H is converted through a mapping matrix M_a so that it is spatially aligned with the representation of the semantic information; the converted acoustic properties are expressed as H′:
H′ = M_a · concat(h_1, …, h_T);
wherein M_a is the mapping matrix from the acoustic properties to the common feature space and concat(h_1, …, h_T) denotes the concatenation of the hidden states of all acoustic properties into one vector;
S432, the representation of the semantic information S is converted through the corresponding mapping matrix M_s; the converted semantic information is expressed as S′:
S′ = M_s · concat(s_1, …, s_n);
wherein M_s is the mapping matrix from the semantic information to the common feature space and concat(s_1, …, s_n) denotes the concatenation of the representations of all semantic information into one vector;
S433, the converted acoustic properties H′ and semantic information S′ are fused by weighted summation to compute the fused feature set F:
F = α · H′ + β · S′;
wherein α and β represent the importance of the acoustic properties and of the semantic information in the fused feature set.
7. The speech emotion analysis method based on semantic intonation according to claim 6, wherein S5 specifically comprises:
S51, receiving the output O = {O_1, …, O_m} of the convolutional neural network of S4, wherein each O_j represents a processed feature map;
S52, further analyzing the feature maps through an advanced feature extraction layer:
all feature maps O output by the convolutional neural network are flattened into a one-dimensional vector z_0 and input into a network of multiple fully connected layers; each layer transforms the features through a weight matrix W_k, a bias term b_k and an activation function φ and applies non-linear processing:
z_k = φ(W_k · z_{k−1} + b_k);
wherein z_k denotes the output features of the k-th fully connected layer, W_k denotes the weight matrix of the k-th fully connected layer, z_0 denotes the flattened feature vector, b_k denotes the bias term of the k-th fully connected layer, and φ denotes the activation function used to introduce non-linearity;
after processing by the multi-layer fully connected network, the final output z is obtained, representing the deeply extracted high-level features;
S53, inputting the extracted high-level features z into one or more classifiers for emotion tendency classification; through training and learning, the classifier identifies different emotion categories from the features, and the classifier output y is computed as:
y = softmax(W_c · z + b_c);
wherein y represents the output of the classifier, i.e. the probability distribution over the different emotion categories, W_c represents the weight matrix of the classifier, z represents the high-level features extracted by the fully connected network, b_c represents the bias term of the classifier, and softmax is the activation function that converts the classifier output into a probability distribution;
S54, determining the emotional tendency and intensity in the speech according to the classifier output y.
8. The speech emotion analysis method based on semantic intonation according to claim 7, wherein S6 specifically comprises:
S61, receiving the classifier output y from S5, wherein y represents the probability distribution over the emotion categories and each element y_i corresponds to the probability of a particular emotion category;
S62, selecting the emotion category with the highest probability value as the final emotional tendency; the emotional tendency E is determined as:
E = argmax_i(y_i);
wherein E represents the determined emotional tendency and argmax_i(y_i) finds the index corresponding to the maximum value of the vector y, this index corresponding to an emotion category;
S63, analyzing the highest probability value in the probability distribution y to compute the emotional intensity; the emotional intensity I is calculated as:
I = max_i(y_i);
wherein max_i(y_i) is the maximum value of the vector y, i.e. the probability of the most likely emotion category, reflecting the model's degree of certainty about that emotional tendency;
S64, formatting and outputting the final emotional tendency E and emotional intensity I.
CN202410545108.2A 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation Active CN118136047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410545108.2A CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410545108.2A CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Publications (2)

Publication Number Publication Date
CN118136047A true CN118136047A (en) 2024-06-04
CN118136047B CN118136047B (en) 2024-07-19

Family

ID=91232795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410545108.2A Active CN118136047B (en) 2024-05-06 2024-05-06 Voice emotion analysis method based on semantic intonation

Country Status (1)

Country Link
CN (1) CN118136047B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118824296A (en) * 2024-08-02 2024-10-22 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN119358542A (en) * 2024-12-23 2025-01-24 苏州大学 A sentiment analysis system and method based on artificial intelligence drive

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111627462A (en) * 2020-05-22 2020-09-04 云知声(上海)智能科技有限公司 A method and device for emotion recognition based on semantic analysis
CN114446324A (en) * 2022-01-28 2022-05-06 江苏师范大学 Multi-mode emotion recognition method based on acoustic and text features

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111627462A (en) * 2020-05-22 2020-09-04 云知声(上海)智能科技有限公司 A method and device for emotion recognition based on semantic analysis
CN114446324A (en) * 2022-01-28 2022-05-06 江苏师范大学 Multi-mode emotion recognition method based on acoustic and text features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FELICIA ANDAYANI 等: "Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files", IEEE ACCESS, 31 March 2022 (2022-03-31), pages 1 - 4 *
HAIYANG XU: "Learning Alignment for Multimodal Emotion Recognition from Speech", ARXIV, 3 April 2020 (2020-04-03), pages 1 - 4 *
王锐 (WANG Rui): "Research on speech emotion recognition methods based on deep learning" (基于深度学习的语音情绪识别方法研究), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库), 15 April 2024 (2024-04-15), pages 8 - 50 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118824296A (en) * 2024-08-02 2024-10-22 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN118824296B (en) * 2024-08-02 2025-03-28 广州市升谱达音响科技有限公司 A digital conference data processing method and system
CN119358542A (en) * 2024-12-23 2025-01-24 苏州大学 A sentiment analysis system and method based on artificial intelligence drive

Also Published As

Publication number Publication date
CN118136047B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN118136047B (en) Voice emotion analysis method based on semantic intonation
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
WO2021114841A1 (en) User report generating method and terminal device
Gupta et al. Two-stream emotion recognition for call center monitoring.
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN115662435B (en) A method and terminal for generating realistic speech of a virtual teacher
CN112562669A (en) Intelligent digital newspaper automatic summarization and voice interaction news chat method and system
CN111414513A (en) Music genre classification method and device and storage medium
CN112581964A (en) Multi-domain oriented intelligent voice interaction method
CN117851871A (en) Multi-mode data identification method for overseas Internet social network site
CN116431806A (en) Natural language understanding method and refrigerator
CN114298019A (en) Emotion recognition method, emotion recognition apparatus, emotion recognition device, storage medium, and program product
CN110931016A (en) Voice recognition method and system for offline quality inspection
Anguraj et al. Analysis of influencing features with spectral feature extraction and multi-class classification using deep neural network for speech recognition system
CN119004381A (en) Multi-mode large model synchronous training and semantic association construction system and training method thereof
CN117829166A (en) Intention recognition system and method suitable for telephone recording and chat recording
CN117116251A (en) Repayment probability assessment method and device based on collection-accelerating record
Brucal et al. Filipino speech to text system using Convolutional Neural Network
Jing et al. Acquisition of english corpus machine translation based on speech recognition technology
Chethan et al. Comprehensive Approach to Multi Model Speech Emotion Recognition System
CN117857599B (en) Digital person dialogue intelligent management system based on Internet of things
CN112002306A (en) Voice category identification method and device, electronic equipment and readable storage medium
CN112820274B (en) Voice information recognition correction method and system
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant