CN115938393A - Speech emotion recognition method, device and equipment based on multiple features and storage medium - Google Patents
Speech emotion recognition method, device and equipment based on multiple features and storage medium
- Publication number
- CN115938393A (application CN202211508842.9A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- speech
- recognized
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- User Interface Of Digital Computer (AREA)
Abstract
The embodiments of the present application provide a method, an apparatus, a device and a storage medium for speech emotion recognition based on multiple features, wherein the method includes: acquiring a voice to be recognized that carries emotion information; extracting, from the voice to be recognized, the emotional characteristics used for representing the emotion information, the emotional characteristics comprising one or more of voice quality characteristics, prosodic characteristics and speech characteristics; and inputting the emotional characteristics into a trained voice recognition model and obtaining the emotion recognition result output by the model. In this way, the model can output an emotion recognition result for the speech to be recognized, so that the emotion shown when the user speaks is analyzed and speech recognition applications become more intelligent.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech emotion recognition based on multiple features.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. With the improvement of computing power, speech recognition has developed rapidly and is widely applied in fields such as smart speakers, voice commands and human-machine dialogue, increasingly improving people's daily lives. In the related art, however, speech recognition often stops at converting the user's speech into text and does not analyze the emotion conveyed by that speech, which limits speech recognition from becoming further intelligent.
Disclosure of Invention
The embodiments of the present application aim to provide a speech emotion recognition method, apparatus, device and storage medium based on multiple features, so as to achieve the technical effect of recognizing the emotion carried in speech.
The first aspect of the embodiments of the present application provides a speech emotion recognition method based on multiple features, where the method includes:
acquiring a voice to be recognized carrying emotion information;
extracting emotional characteristics used for representing the emotional information in the voice to be recognized; the emotional characteristics comprise one or more of voice quality characteristics, prosodic characteristics and speech characteristics;
and inputting the emotion characteristics into the trained voice recognition model, and acquiring an emotion recognition result output by the voice recognition model.
In the implementation process, one or more of the voice quality characteristics, prosodic characteristics and speech characteristics that represent the emotion information carried by the voice to be recognized are extracted from the voice to be recognized as the emotional characteristics, and the extracted emotional characteristics are input into the trained voice recognition model. The model can then output the emotion recognition result of the voice to be recognized, so that the emotion expressed when the user speaks is analyzed and the speech recognition application becomes more intelligent.
Further, the acquiring the to-be-recognized voice carrying the emotion information includes:
acquiring original voice of a user; the original voice carries emotion information when a user speaks;
and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
In the implementation process, a voice segment within a preset time region is cut out of the user's original voice and used as the voice to be recognized, and the emotion recognized from this segment represents the emotion when the user speaks. This reduces the amount of input data for the voice recognition model, lightens the recognition burden on the model and saves computing resources.
Further, the speech recognition model is a classification model; the emotion recognition result is an emotion type; the emotion categories include no emotion, positive emotion and negative emotion.
In the implementation process, a classification model is used as the voice recognition model and the output emotion recognition results are different emotion categories, so the emotion of the user while speaking can be understood and judged more intuitively and corresponding measures can be taken subsequently.
Further, the method further comprises:
and if the output emotion recognition result is negative emotion, executing a soothing measure.
In the implementation process, if a negative emotion is identified while the user speaks, a corresponding soothing measure is executed, which makes human-machine dialogue considerably more humane, makes the voice recognition technology more intelligent, and brings a better user experience.
Further, the prosodic features include one or more of speech duration, fundamental frequency, short-term energy, and zero-crossing rate;
the voice quality features include one or more of the frequency, bandwidth, frequency perturbation and amplitude perturbation of the formants.
In the implementation process, any one or more of the speech duration, fundamental frequency, short-term energy and zero-crossing rate can be selected as the prosodic features, and one or more of the frequency, bandwidth, frequency perturbation and amplitude perturbation of the formants can be selected as the voice quality features. Recognizing the emotion information carried by the voice from emotion features that represent that information makes it possible to analyze the emotion expressed when a user speaks and makes the voice recognition technology more intelligent.
A second aspect of the embodiments of the present application provides a device for recognizing speech emotion based on multiple features, the device comprising:
the obtaining module is used for obtaining the voice to be recognized carrying the emotion information;
the extraction module is used for extracting the emotional characteristics used for representing the emotional information in the voice to be recognized; the emotional characteristics comprise voice quality characteristics, prosodic characteristics and speech characteristics;
and the recognition module is used for inputting the emotion characteristics into the trained voice recognition model and acquiring an emotion recognition result output by the voice recognition model.
Further, the obtaining module is specifically configured to:
acquiring original voice of a user; the original voice carries emotion information when a user speaks;
and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
Further, the speech recognition model is a classification model; the emotion recognition result is an emotion type; the emotion categories include no emotion, positive emotion and negative emotion.
Further, the apparatus further comprises:
and the pacifying module is used for executing a pacifying measure if the output emotion recognition result is negative emotion.
Further, the prosodic features include one or more of speech duration, fundamental frequency, short-term energy, and zero-crossing rate;
the voice quality features include one or more of the frequency, bandwidth, frequency perturbation and amplitude perturbation of the formants.
A third aspect of an embodiment of the present application provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor, when invoking the executable instructions, implements the operations of any of the methods of the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, on which computer instructions are stored, which when executed by a processor implement the steps of any one of the above-mentioned methods of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flowchart of a method for recognizing speech emotion based on multiple features according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for recognizing speech emotion based on multiple features according to an embodiment of the present application;
FIG. 3 is a flow chart of speech feature extraction provided by an embodiment of the present application;
FIG. 4 is a block diagram of a speech recognition model provided by an embodiment of the present application;
FIG. 5 is a block diagram illustrating an exemplary embodiment of a multi-feature-based speech emotion recognition apparatus;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. With the improvement of computing power, speech recognition has developed rapidly and is widely applied in fields such as smart speakers, voice commands and human-machine dialogue, increasingly improving people's daily lives. In the related art, however, speech recognition tends to stop at converting the user's speech into text and lacks any analysis of the emotion the user shows while speaking. To analyze that emotion, the user's speech first has to be converted into text and an emotion judgment is then made from the content expressed by the text. Because the text content only infers the speaker's emotion indirectly from the semantics, its accuracy is low. This also prevents speech recognition from becoming further intelligent.
Therefore, the application provides a speech emotion recognition method based on multiple features, which comprises the following steps as shown in fig. 1:
step 110: acquiring a voice to be recognized carrying emotion information;
step 120: extracting emotional characteristics used for representing the emotional information in the voice to be recognized;
wherein the emotional features comprise one or more of voice quality features, prosodic features, and speech features;
step 130: and inputting the emotion characteristics into the trained voice recognition model, and acquiring an emotion recognition result output by the voice recognition model.
The above method may be performed by an electronic device. Illustratively, the electronic device may include, but is not limited to, a server, a smart phone/handset, a Personal Digital Assistant (PDA), a media content player, a video game station/system, a virtual reality system, an augmented reality system, a wearable device (e.g., a watch, a bracelet, a glove, a hat, a helmet, a virtual reality headset, an augmented reality headset, a Head Mounted Device (HMD), a headband, a pendant, an armband, a leg loop, a shoe, or a vest, etc.), among other devices that require speech recognition.
The method can be applied to human-machine voice scenarios, that is, scenarios in which a human and a machine hold a spoken conversation, which may include, but are not limited to, telephone customer service, library guide robots, shopping-mall guide robots and the like. In such a scenario, the user talks to the machine by speaking, and the machine can perform emotion recognition on the user's voice using the speech recognition method provided by this application, so as to analyze and judge the user's current emotion and respond in the form of speech, displayed text and the like.
Because the speech to be recognized carries emotion information, one or more of its voice quality features, prosodic features and speech features can represent that emotion information. By extracting these emotion features from the speech to be recognized, the trained voice recognition model can be used to obtain an emotion recognition result for the speech, so that the emotion shown when the user speaks is analyzed and the speech recognition application becomes more intelligent.
In some embodiments, the speech to be recognized may be from speech spoken by the user. The step 110 of acquiring the speech to be recognized may include the steps shown in fig. 2:
step 111: acquiring original voice of a user;
wherein, the original voice carries the emotional information when the user speaks;
step 112: and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
Taking a telephone robot customer-service scenario as an example, the user's original voice can be the conversational speech the user addresses to the robot customer service. The original voice carries the emotion information of the user while speaking, so the voice segments extracted from it carry the same emotion information and can be used as the voice to be recognized for emotion recognition by the voice recognition model.
The duration of the original speech varies. If an entire original speech is input into the speech recognition model, the amount of data the model has to process may be very large and a long processing time is required. To meet the high real-time requirements of a conversation scenario, a voice segment within a preset time region of the original voice can therefore be extracted as the voice to be recognized.
Alternatively, the preset time zone may include a preset duration. Therefore, the voice segment with the preset duration in the original voice can be extracted as the voice to be recognized. The specific value of the preset duration can be determined according to the computing capability of the model and the requirement of the application scene on the real-time performance. As an example, the preset time period may be 5 seconds. If the voice duration of the original voice is less than the preset duration, the whole original voice can be used as the voice to be recognized.
Alternatively, the preset time region may include a position of the speech to be recognized in the original speech. As an example, the start time point and the end time point may be used to characterize the position of the speech to be recognized in the original speech. Namely, extracting the voice segment of the specific position in the original voice as the voice to be recognized. Wherein the voice segment of the specific position should be a voice segment capable of showing the emotion revealed when the user speaks. The specific location referred to by the preset time zone can be determined by those skilled in the art according to actual situations.
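A minimal sketch of the segment extraction described above, assuming mono audio loaded with librosa; the 16 kHz sampling rate, the start offset and the 5-second default duration are illustrative assumptions, not values fixed by this application:

```python
# Minimal sketch: cut a voice segment in a preset time region out of the
# original voice; falls back to the whole utterance if it is too short.
import numpy as np
import librosa

def extract_segment(path: str, start_s: float = 0.0, duration_s: float = 5.0,
                    sr: int = 16000) -> np.ndarray:
    """Return the segment used as the 'voice to be recognized'."""
    y, _ = librosa.load(path, sr=sr, mono=True)        # original voice of the user
    start = int(start_s * sr)
    segment = y[start:start + int(duration_s * sr)]
    # If the original voice is shorter than the preset duration, use all of it,
    # as described above.
    return segment if segment.size > 0 else y
```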
As described above, the emotion features may include one or more of voice quality features, prosodic features, and speech features.
Voice quality reflects whether the voice is clear and pure and how easily it can be recognized. Acoustic manifestations contained in speech such as breathiness, choking and trembling all affect voice quality, and these manifestations are more pronounced in emotionally aroused speech. Voice quality features can therefore be used as emotion features that represent the emotion information carried by the voice. In some embodiments, the voice quality features may include one or more of the frequency, bandwidth, frequency perturbation (jitter) and amplitude perturbation (shimmer) of the formants. Optionally, the formants may be extracted by LPC (Linear Predictive Coding) root finding; for the detailed extraction steps, reference may be made to the related art, which is not expanded upon here.
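A minimal sketch of one way to realize the LPC root finding mentioned above; the 16 kHz sampling rate and LPC order 12 are assumptions, and jitter and shimmer would be computed separately from cycle-to-cycle pitch and amplitude variation:

```python
# Sketch: estimate formant frequencies and bandwidths of a windowed speech
# frame by finding the roots of the LPC polynomial.
import numpy as np
import librosa

def formants(frame: np.ndarray, sr: int = 16000, order: int = 12):
    a = librosa.lpc(frame, order=order)          # LPC coefficients [1, a1, ..., ap]
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # root angle  -> frequency (Hz)
    bws = -(sr / np.pi) * np.log(np.abs(roots))  # root radius -> bandwidth (Hz)
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]
```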
Prosody refers to characteristics such as pitch, duration and speaking rate that go beyond the semantic symbols themselves. Although prosody does not help in understanding the words and phrases in speech, it determines whether a segment of speech sounds pleasant. Prosodic features can therefore be used to characterize the emotion information of a speech sample. In some embodiments, the prosodic features may include one or more of the speech duration, fundamental frequency, short-time energy and zero-crossing rate.
The speech duration is a measure of the length in time of the speech signal. When different emotions are used in speaking, the speech samples carry different emotion information and the corresponding speech signals also have different durations. The speech duration T is calculated as:
T = length(frame) / sr
where length(frame) is the larger of the number of rows and the number of columns of the frame matrix of the speech signal, and sr is the sampling frequency of the speech signal.
The pitch frequency is also known as the fundamental frequency. When a person produces voiced sound, the airflow excites the vocal cords into vibration, generating a periodic pulse signal, and the frequency of the vocal-cord vibration at that moment is called the fundamental frequency. The fundamental frequency can be estimated from the short-time autocorrelation function R_x(v), calculated as:
R_x(v) = Σ_{n=0}^{N−1−v} x(n) · x(n+v)
where x(n) is a discrete signal of length N and v is the delay (lag) of the speech signal; the fundamental frequency corresponds to the lag at which R_x(v) peaks.
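A minimal sketch of the autocorrelation-based estimate, assuming one voiced frame of at least sr/fmin samples and an illustrative 50–400 Hz search range:

```python
# Sketch: compute R_x(v) for one frame and pick the lag of its strongest peak.
import numpy as np

def autocorr_pitch(x: np.ndarray, sr: int, fmin: float = 50.0, fmax: float = 400.0) -> float:
    N = len(x)
    r = np.array([np.sum(x[:N - v] * x[v:]) for v in range(N)])  # R_x(v)
    lo, hi = int(sr / fmax), int(sr / fmin)                      # lag search range
    v_star = lo + int(np.argmax(r[lo:hi]))                       # best lag
    return sr / v_star                                           # fundamental frequency (Hz)
```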
Short-time energy is usually perceived as the loudness of the voice. When different emotions are used in speaking, the speech samples carry different emotion information and the corresponding speech signals also have different short-time energy. For example, the average amplitude energy of speech uttered with a happy, angry or surprised emotion is greater than that of speech uttered with a neutral emotion. The short-time energy E_n at time n is calculated as:
E_n = Σ_m x(m)^2 · h(n − m)
where h(n) = w(n)^2 and w(n) is a window function that takes the value 1 for 0 ≤ n ≤ N − 1 and 0 otherwise. Since short-time energy is a time-domain feature of speech, no Fourier transform is applied and w(n) here is a rectangular (square) window. The short-time energy of the speech can therefore be expressed as the sum of the squares of the speech samples contained in each frame.
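A minimal sketch of E_n with the rectangular window above; the frame length and hop are illustrative assumptions:

```python
# Sketch: short-time energy per frame = sum of squared samples in the window.
import numpy as np

def short_time_energy(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)])
```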
The zero-crossing rate is the rate at which the sampled signal crosses zero within a frame of the speech signal. For a discrete-time speech signal, a zero crossing occurs when the signal changes sign between adjacent sample points, so the zero-crossing rate is the number of zero crossings per unit time. The zero-crossing rate can reflect the speed of speech and is used to distinguish silence, noise and human voice. The zero-crossing rate ZCR(n) of frame n is calculated as:
ZCR(n) = Σ_{m=1}^{N−1} | sgn[x(m)] − sgn[x(m−1)] |
where m indexes the sampling points of a frame of the speech signal of length N, and sgn[·] is a sign function that takes the value 1 when its argument is greater than or equal to 0 and 0 otherwise.
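A minimal sketch matching the sgn[·] convention above (1 for arguments ≥ 0, 0 otherwise); the frame parameters are assumptions:

```python
# Sketch: count zero crossings per frame as the sum of |sgn[x(m)] - sgn[x(m-1)]|.
import numpy as np

def zero_crossing_rate(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    s = (x >= 0).astype(int)           # sgn[x(m)] in {0, 1}
    d = np.abs(np.diff(s))             # 1 wherever the sign flips
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(d[i * hop:i * hop + frame_len - 1]) for i in range(n_frames)])
```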
In some embodiments, the speech features may be one or more of fBank (Filter Bank) features, MFCC (Mel Frequency Cepstral Coefficients), or LPC features, among others.
As shown in FIG. 3, taking MFCC and fBank as examples, the extraction process is as follows: the speech signal of the speech sample is first pre-processed, where the preprocessing comprises pre-emphasis, framing and windowing. The fBank features are then obtained by applying a Discrete Fourier Transform (DFT), a Mel filter bank and a logarithm operation to the preprocessed speech signal. On the basis of the fBank features, an inverse Discrete Fourier Transform (IDFT) is performed to obtain the MFCC.
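A minimal sketch of this pipeline using librosa, under assumptions rather than the exact implementation of this application: the pre-emphasis coefficient, frame sizes and Mel-filter count are common defaults, and librosa derives the MFCC from the log-Mel energies with a DCT, which plays the role of the inverse-transform step described above.

```python
# Sketch: pre-emphasis -> framing/windowing + DFT -> Mel filter bank -> log
# gives fBank; a further transform of the log-Mel energies gives MFCC.
import numpy as np
import librosa

def fbank_and_mfcc(y: np.ndarray, sr: int = 16000, n_mels: int = 40, n_mfcc: int = 13):
    y = librosa.effects.preemphasis(y, coef=0.97)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = np.log(mel + 1e-10)                          # log-Mel energies (fBank)
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # cepstral coefficients
    return fbank, mfcc
```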
After the emotion features of the speech to be recognized are extracted, they can be input into the trained speech recognition model. For the training process of the speech recognition model, reference may be made to the related art; it is not expanded upon here.
Further, in some embodiments, the speech recognition model may be a classification model and the emotion recognition result may be an emotion category. As an example, the speech recognition model may be a CNN (Convolutional Neural Network) model. As shown in FIG. 4, the CNN model includes 3 convolutional layers (Convolution), 3 max-pooling layers (MaxPooling) and 2 fully connected layers (Fully Connected Layer). The first convolutional layer has 256 convolution kernels, the second and third convolutional layers each have 128 convolution kernels, and each convolution kernel is 3 × 3. The pooling kernel of the first max-pooling layer is 8 × 8, and those of the second and third max-pooling layers are 3 × 3. The first fully connected layer has 256 neurons and the second fully connected layer has 4 neurons. Of course, this CNN structure is only an example, and the speech recognition model of the present application is not limited to it.
In this embodiment, the extracted emotion features may be treated as image-like features and input into the CNN model to obtain a classification result.
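A minimal Keras sketch of the structure described for FIG. 4 (3 convolutional layers with 256/128/128 kernels of size 3 × 3, max pooling of 8 × 8 and then 3 × 3, and fully connected layers with 256 and 4 units); the 128 × 128 × 1 input shape, the activations and the optimizer are assumptions, not values given by this application:

```python
# Sketch: CNN classifier that takes the emotion features as an image-like input.
from tensorflow.keras import layers, models

def build_emotion_cnn(input_shape=(128, 128, 1), n_outputs=4):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=8),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_outputs, activation="softmax"),   # per-category scores
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```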
Alternatively, the speech recognition model may be a multi-classification model, and the classification results, i.e., emotion categories, may include no emotion, positive emotion, and negative emotion.
Positive emotions may include, but are not limited to, calm (natural) and happy. Negative emotions may include, but are not limited to, anger, disgust and sadness.
For example, after the emotion features are input, the speech recognition model may automatically score the candidate emotion categories and take the category with the highest score, or a category whose score exceeds a preset threshold, as the emotion recognition result.
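A minimal sketch of this selection step; the category names come from the embodiment above, while the threshold value and the mapping from model outputs to categories are assumptions:

```python
# Sketch: turn per-category scores into an emotion recognition result by
# highest score, or by a preset threshold when one is given.
import numpy as np
from typing import Optional

CATEGORIES = ["no emotion", "positive emotion", "negative emotion"]

def pick_emotion(scores: np.ndarray, threshold: Optional[float] = None) -> str:
    if threshold is not None:
        candidates = np.flatnonzero(scores > threshold)
        if candidates.size:                            # a category clears the threshold
            best = candidates[np.argmax(scores[candidates])]
            return CATEGORIES[int(best)]
    return CATEGORIES[int(np.argmax(scores))]          # otherwise: highest score wins
```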
In this way, the emotion information carried by the voice to be recognized is recognized and classified by the classification model, and the emotion of the user while speaking can be judged intuitively so that corresponding measures can be taken subsequently.
Further, in some embodiments, if the output emotion recognition result is a negative emotion, a soothing measure is performed.
There are various kinds of soothing measures, and different measures can be selected in different application scenarios to soothe the user's emotion. For example, soothing background music may be played; in a robot customer-service scenario, the robot may switch to a relaxed tone and intonation when answering the user, automatically transfer the user to a human agent, or play a comforting sentence. As described above, negative emotions may include several types, and the comforting sentence differs for different negative emotions.
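A minimal sketch of dispatching a soothing measure per negative-emotion subtype; the subtype names and the concrete actions are illustrative assumptions chosen to match the examples above:

```python
# Sketch: choose a soothing measure for a recognized negative emotion.
from typing import Optional

SOOTHING_ACTIONS = {
    "anger":   "switch to a relaxed tone and offer to transfer to a human agent",
    "disgust": "apologize and rephrase the previous answer",
    "sadness": "play soft background music and a comforting sentence",
}

def soothe(result: str, subtype: Optional[str] = None) -> str:
    if result != "negative emotion":
        return "no soothing measure needed"
    return SOOTHING_ACTIONS.get(subtype, "play a generic comforting sentence")
```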
Actively executing a soothing measure when a negative emotion is identified soothes the user's emotion, makes human-machine dialogue considerably more humane, makes the voice recognition technology more intelligent, and brings a better user experience.
Based on the multi-feature-based speech emotion recognition method provided by any one of the embodiments, the application further provides a multi-feature-based speech emotion recognition device. As shown in FIG. 5, the apparatus 500 for recognizing speech emotion based on multiple features includes:
an obtaining module 510, configured to obtain a to-be-recognized voice carrying emotion information;
an extracting module 520, configured to extract the emotion features used for representing the emotion information in the speech to be recognized; the emotion features comprise voice quality features, prosodic features and speech features;
and the recognition module 530 is configured to input the emotion feature into the trained speech recognition model, and obtain an emotion recognition result output by the speech recognition model.
In some embodiments, the obtaining module 510 is specifically configured to:
acquiring original voice of a user; the original voice carries emotion information when a user speaks;
and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
In some embodiments, the speech recognition model is a classification model; the emotion recognition result is an emotion type; the emotion categories include no emotion, positive emotion and negative emotion.
In some embodiments, the speech recognition apparatus 500 further comprises:
and the soothing module is used for executing a soothing measure if the output emotion recognition result is negative emotion.
In some embodiments, the prosodic features include one or more of speech duration, fundamental frequency, short-term energy, and zero-crossing rate;
the voice quality features include one or more of the frequency, bandwidth, frequency perturbation, and amplitude perturbation of the formants.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
Based on the multi-feature-based speech emotion recognition method described in any of the above embodiments, the present application further provides a schematic structural diagram of an electronic device shown in fig. 6. As shown in fig. 6, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to implement the multi-feature-based speech emotion recognition method according to any of the embodiments.
The present application further provides a computer storage medium, which stores a computer program, and the computer program, when executed by a processor, can be used to execute a multi-feature-based speech emotion recognition method according to any of the above embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
Claims (12)
1. A speech emotion recognition method based on multiple features is characterized by comprising the following steps:
acquiring a voice to be recognized carrying emotion information;
extracting emotional characteristics used for representing the emotional information in the voice to be recognized; the emotional characteristics comprise one or more of voice quality characteristics, prosodic characteristics and speech characteristics;
and inputting the emotion characteristics into the trained voice recognition model, and acquiring an emotion recognition result output by the voice recognition model.
2. The method of claim 1, wherein the obtaining the speech to be recognized carrying emotion information comprises:
acquiring original voice of a user; the original voice carries emotion information when a user speaks;
and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
3. The method of claim 1, wherein the speech recognition model is a classification model; the emotion recognition result is an emotion type; the emotion categories include no emotion, positive emotion and negative emotion.
4. The method of claim 3, further comprising:
and if the output emotion recognition result is negative emotion, executing a soothing measure.
5. The method of claim 1, wherein the prosodic features include one or more of speech duration, fundamental frequency, short-term energy, and zero-crossing rate;
the voice quality features include one or more of the frequency, bandwidth, frequency perturbation, and amplitude perturbation of the formants.
6. An apparatus for multi-feature based speech emotion recognition, the apparatus comprising:
the obtaining module is used for obtaining the voice to be recognized carrying the emotion information;
the extraction module is used for extracting the emotional characteristics used for representing the emotional information in the voice to be recognized; the emotional characteristics comprise voice quality characteristics, prosodic characteristics and speech characteristics;
and the recognition module is used for inputting the emotion characteristics into the trained voice recognition model and acquiring an emotion recognition result output by the voice recognition model.
7. The apparatus of claim 6, wherein the obtaining module is specifically configured to:
acquiring original voice of a user; the original voice carries emotion information when a user speaks;
and extracting a voice segment in a preset time region in the original voice to be used as the voice to be recognized.
8. The apparatus of claim 6, wherein the speech recognition model is a classification model; the emotion recognition result is an emotion type; the emotion categories include no emotion, positive emotion and negative emotion.
9. The apparatus of claim 8, further comprising:
and the soothing module is used for executing a soothing measure if the output emotion recognition result is negative emotion.
10. The apparatus of claim 6, wherein the prosodic features include one or more of speech duration, fundamental frequency, short-term energy, and zero-crossing rate;
the voice quality features include one or more of the frequency, bandwidth, frequency perturbation, and amplitude perturbation of the formants.
11. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor, when invoking the executable instructions, implements the operations of the method of any of claims 1-5.
12. A computer-readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, perform the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211508842.9A CN115938393A (en) | 2022-11-29 | 2022-11-29 | Speech emotion recognition method, device and equipment based on multiple features and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211508842.9A CN115938393A (en) | 2022-11-29 | 2022-11-29 | Speech emotion recognition method, device and equipment based on multiple features and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115938393A true CN115938393A (en) | 2023-04-07 |
Family
ID=86650023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211508842.9A Pending CN115938393A (en) | 2022-11-29 | 2022-11-29 | Speech emotion recognition method, device and equipment based on multiple features and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115938393A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119007757A (en) * | 2024-08-14 | 2024-11-22 | 镁佳(北京)科技有限公司 | Speech emotion recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110211565B (en) | Dialect identification method and device and computer readable storage medium | |
JP4085130B2 (en) | Emotion recognition device | |
US12027165B2 (en) | Computer program, server, terminal, and speech signal processing method | |
US5995928A (en) | Method and apparatus for continuous spelling speech recognition with early identification | |
Sinith et al. | Emotion recognition from audio signals using Support Vector Machine | |
JP4914295B2 (en) | Force voice detector | |
JP6440967B2 (en) | End-of-sentence estimation apparatus, method and program thereof | |
CN111508498A (en) | Conversational speech recognition method, system, electronic device and storage medium | |
CN106548775B (en) | Voice recognition method and system | |
CN112102850A (en) | Processing method, device and medium for emotion recognition and electronic equipment | |
WO2022100691A1 (en) | Audio recognition method and device | |
WO2019119279A1 (en) | Method and apparatus for emotion recognition from speech | |
Pervaiz et al. | Emotion recognition from speech using prosodic and linguistic features | |
CN113823323A (en) | Audio processing method and device based on convolutional neural network and related equipment | |
CN112382310A (en) | Human voice audio recording method and device | |
CN112735404A (en) | Ironic detection method, system, terminal device and storage medium | |
CN115938393A (en) | Speech emotion recognition method, device and equipment based on multiple features and storage medium | |
CN112216270B (en) | Speech phoneme recognition method and system, electronic equipment and storage medium | |
CN114171004A (en) | Voice interaction method and device, electronic equipment and storage medium | |
CN116959421B (en) | Method and device for processing audio data, audio data processing equipment and media | |
CN111312216B (en) | Voice marking method containing multiple speakers and computer readable storage medium | |
CN112599114A (en) | Voice recognition method and device | |
Furui | Robust methods in automatic speech recognition and understanding. | |
Razak et al. | Towards automatic recognition of emotion in speech | |
CN115547296B (en) | Voice synthesis method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||