CN118486297B - Response method based on voice emotion recognition and intelligent voice assistant system - Google Patents
Response method based on voice emotion recognition and intelligent voice assistant system
- Publication number
- CN118486297B CN118486297B CN202410937524.7A CN202410937524A CN118486297B CN 118486297 B CN118486297 B CN 118486297B CN 202410937524 A CN202410937524 A CN 202410937524A CN 118486297 B CN118486297 B CN 118486297B
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- user
- signal
- response
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000004044 response Effects 0.000 title claims abstract description 79
- 230000008909 emotion recognition Effects 0.000 title claims abstract description 70
- 238000000034 method Methods 0.000 title claims abstract description 34
- 230000008451 emotion Effects 0.000 claims abstract description 100
- 230000009471 action Effects 0.000 claims abstract description 35
- 238000000605 extraction Methods 0.000 claims abstract description 26
- 230000006397 emotional response Effects 0.000 claims abstract description 12
- 238000007781 pre-processing Methods 0.000 claims abstract description 8
- 238000000354 decomposition reaction Methods 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 5
- 238000006243 chemical reaction Methods 0.000 claims description 4
- 230000007787 long-term memory Effects 0.000 claims description 3
- 230000004298 light response Effects 0.000 claims description 2
- 230000002452 interceptive effect Effects 0.000 abstract description 3
- 230000015654 memory Effects 0.000 description 10
- 238000004590 computer program Methods 0.000 description 8
- 230000008569 process Effects 0.000 description 8
- 208000019901 Anxiety disease Diseases 0.000 description 5
- 230000036506 anxiety Effects 0.000 description 5
- 230000002996 emotional effect Effects 0.000 description 5
- 238000012545 processing Methods 0.000 description 5
- 238000004891 communication Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 230000003993 interaction Effects 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 239000004973 liquid crystal related substance Substances 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000003064 k means clustering Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 229940068196 placebo Drugs 0.000 description 1
- 239000000902 placebo Substances 0.000 description 1
- 230000033764 rhythmic process Effects 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 230000003595 spectral effect Effects 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Hospice & Palliative Care (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention provides a response method based on voice emotion recognition and an intelligent voice assistant system, which relate to the field of voice recognition. The system comprises: a voice acquisition module configured to: acquiring a voice signal of a user, and preprocessing the voice signal of the user; a feature extraction module configured to: extracting emotion characteristics of the preprocessed voice signal to obtain at least one voice feature; an emotion recognition module configured to: identifying the current emotion of the user based on the at least one voice feature through a voice emotion recognition model; an emotion response module configured to: determining an optimal response strategy based on the current emotion of the user and response decision logic, and executing at least one response action; a feedback learning module configured to: adjusting parameters of the voice emotion recognition model and/or adjusting the response decision logic based on feedback information of the user on the at least one response action. The method improves the accuracy of voice emotion recognition, further improves the intelligence of the intelligent voice assistant, and improves the interactive experience of the user.
Description
Technical Field
The invention relates to the field of voice recognition, in particular to a response method based on voice emotion recognition and an intelligent voice assistant system.
Background
In the field of intelligent voice assistants, conventional voice assistant systems understand the instructions of a user primarily through voice recognition techniques, but these systems often lack the ability to recognize and understand the emotional state of the user. As a result, the emotional demands of the user cannot be satisfied during interaction, which degrades the user's interaction experience. Although there have been studies attempting to infer the emotional state of a user from characteristics of the voice such as pitch, rhythm, and volume, these methods suffer from low accuracy and low adaptability, making it difficult to maintain stable recognition performance across different users and scenarios.
Accordingly, there is a need to provide an intelligent voice assistant system based on voice emotion recognition that improves the accuracy of voice emotion recognition, further raises the intelligence of the intelligent voice assistant, and improves the interactive experience of the user.
Disclosure of Invention
The invention provides an intelligent voice assistant system based on voice emotion recognition, which comprises: a voice acquisition module configured to: acquiring a voice signal of a user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal; a feature extraction module configured to: extracting emotion characteristics of the preprocessed voice signal to obtain at least one voice feature; an emotion recognition module configured to: identifying the current emotion of the user based on the at least one voice feature through a voice emotion recognition model; an emotion response module configured to: determining an optimal response strategy based on the current emotion of the user and response decision logic, and executing at least one response action based on the optimal response strategy; and a feedback learning module configured to: acquiring feedback information of the user on the at least one response action, and adjusting parameters of the voice emotion recognition model and/or adjusting the response decision logic based on the feedback information of the user on the at least one response action.
Further, the voice acquisition module performs preprocessing on the voice signal of the user, including: denoising the voice signal of the user to obtain a denoised voice signal; and carrying out analog-to-digital conversion on the denoised voice signal to obtain the preprocessed voice signal.
Further, the feature extraction module performs emotion feature extraction on the preprocessed voice signal to obtain at least one voice feature, including: decomposing the preprocessed voice signal into multi-frame voice signals; for each frame of the voice signal, converting the voice signal from a time domain signal to a frequency domain signal by a fast Fourier transform; and extracting a mel-frequency cepstrum coefficient of the frequency domain signal, wherein the at least one voice feature at least comprises the mel-frequency cepstrum coefficient. Further, the feature extraction module extracts the mel-frequency cepstrum coefficient of the frequency domain signal based on the following formula: C_i = Σ_{n=1}^{N} log(X_i(n)) · cos[(π/N)(n − 1/2)], where C_i is the mel-frequency cepstrum coefficient of the frequency domain signal corresponding to the i-th frame of the voice signal, N is the number of sampling points of the fast Fourier transform, and X_i(n) is the amplitude of the n-th frequency-domain component of the i-th frame of the voice signal.
Further, the feature extraction module performs emotion feature extraction on the preprocessed voice signal to obtain at least one voice feature, including: performing empirical mode decomposition on the preprocessed voice signal to generate a plurality of intrinsic mode functions (IMFs) and a residual corresponding to the preprocessed voice signal; for each intrinsic mode function, extracting time domain features and frequency domain features of the intrinsic mode function; and extracting the time domain features and the frequency domain features of the residual, wherein the at least one voice feature at least comprises the time domain features of each intrinsic mode function, the frequency domain features of each intrinsic mode function, the time domain features of the residual, and the frequency domain features of the residual.
Further, the at least one speech feature also includes at least a speech rate feature, a pitch feature, and a volume fluctuation feature.
Further, the voice emotion recognition model is a long short-term memory (LSTM) network model.
Further, the emotion recognition module recognizes the current emotion of the user based on the at least one voice feature through a voice emotion recognition model, including: establishing a sample database, wherein the sample database comprises background information, emotion expression characteristics and emotion fluctuation characteristics of a plurality of sample users; for any two sample users, calculating the user similarity of the two sample users based on the background information and emotion expression characteristics of the two sample users; clustering the plurality of sample users based on the user similarity of the two sample users through a clustering algorithm to determine a plurality of user clusters; for each user cluster, establishing a voice emotion recognition model corresponding to the user cluster; acquiring the historical recognition emotion of the user, and determining emotion fluctuation characteristics of the user based on the historical recognition emotion of the user; for each user cluster, determining a target user cluster from the plurality of user clusters based on the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the user and the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the cluster center of the user cluster; and identifying the current emotion of the user based on the at least one voice characteristic through the voice emotion identification model corresponding to the target user cluster.
Further, the at least one response action also includes at least a dialogue response action, a music response action, and a light response action.
The invention provides a response method based on voice emotion recognition, which comprises the following steps: acquiring a voice signal of a user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal; extracting emotion characteristics of the preprocessed voice signals to obtain at least one voice characteristic; identifying the current emotion of the user based on the at least one voice feature through a voice emotion identification model; determining an optimal response strategy based on the current emotion of the user and response decision logic, and executing at least one response action based on the optimal response strategy; and acquiring feedback information of the user on the at least one response action, and adjusting parameters of the voice emotion recognition model and/or adjusting the response decision logic based on the feedback information of the user on the at least one response action.
The invention provides computer equipment, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the instructions, when executed by the processor, enable the processor to execute the response method based on voice emotion recognition.
The present invention provides one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform a speech emotion recognition-based response method as described above.
Compared with the prior art, the response method and the intelligent voice assistant system based on voice emotion recognition have the following beneficial effects:
1. By introducing an advanced voice emotion recognition model, emotion characteristics in the voice of the user can be analyzed, so that emotion requirements of the user can be better understood, and more personalized and humanized service can be provided. Through improving emotion interaction experience, the system is expected to improve user satisfaction and widen the application range of the intelligent voice assistant.
2. Through the voice emotion recognition model corresponding to the target user cluster, the current emotion of the user is recognized from more dimensions, based on the mel-frequency cepstrum coefficient of the frequency domain signal corresponding to each frame of the voice signal, the time domain and frequency domain features of each intrinsic mode function, the time domain and frequency domain features of the residual, the speech rate feature, the pitch feature, and the volume fluctuation feature, further improving the accuracy of emotion recognition.
3. Classifying users, training a voice emotion recognition model corresponding to each user cluster, and improving emotion recognition accuracy and personalized service level;
4. Based on the analysis of the feedback information, the feedback learning module may automatically adjust parameters of the emotion recognition algorithm or update rules in the response strategy. For example, the weight of the voice emotion recognition model is adjusted, or response decision logic is modified to better adapt to the personalized needs of the user, so as to realize continuous optimization of emotion response.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail by means of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is a block diagram of an intelligent voice assistant system based on speech emotion recognition according to some embodiments of the present disclosure;
FIG. 2 is a schematic flow chart of deriving at least one speech feature according to some embodiments of the present disclosure;
FIG. 3 is a schematic flow chart for identifying a user's current emotion according to some embodiments of the present specification;
FIG. 4 is a flow chart of a response method based on speech emotion recognition according to some embodiments of the present disclosure;
fig. 5 is a schematic diagram of a computer device according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings required in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and those of ordinary skill in the art may apply the present specification to other similar scenarios according to these drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
FIG. 1 is a schematic block diagram of an intelligent voice assistant system based on voice emotion recognition according to some embodiments of the present disclosure, as shown in FIG. 1, an intelligent voice assistant system based on voice emotion recognition may include a voice acquisition module, a feature extraction module, an emotion recognition module, an emotion response module, and a feedback learning module.
The voice acquisition module may be configured to: and acquiring a voice signal of the user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal.
In particular, the voice acquisition module may include a high sensitivity microphone array capable of clearly capturing a user's voice signal in various environments.
In some embodiments, the voice acquisition module pre-processes the voice signal of the user, including:
denoising the voice signal of the user to obtain a denoised voice signal so as to eliminate background noise and echo and ensure the definition of the voice signal;
And performing analog-to-digital conversion on the denoised voice signal to obtain a preprocessed voice signal.
In particular, the sampling rate and bit depth of the analog-to-digital converter are critical to the conversion process, as they directly affect the quality of the speech signal and the efficiency of subsequent processing. Typically, the speech acquisition module may employ a sampling rate high enough to cover the full range of human hearing (e.g., 44.1 kHz) and a sufficient bit depth (e.g., 16 bits) to ensure the accuracy and rich dynamic range of the speech data.
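For example only, a minimal sketch of such a preprocessing stage is shown below, assuming NumPy and SciPy are available and that the waveform has already been captured as floating-point samples; the 80 Hz high-pass cutoff, 44.1 kHz sampling rate, and 16-bit quantization are illustrative choices rather than values mandated by this embodiment:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_speech(raw, sr=44100):
    """Simple denoising (high-pass filtering) followed by 16-bit quantization."""
    # Suppress low-frequency background rumble; a real system might add echo cancellation here.
    b, a = butter(4, 80, btype="highpass", fs=sr)
    denoised = filtfilt(b, a, raw)
    # Normalize and quantize to 16-bit samples, mirroring the analog-to-digital conversion step.
    denoised = denoised / (np.max(np.abs(denoised)) + 1e-12)
    return (denoised * 32767).astype(np.int16)
```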
The feature extraction module may be configured to: and extracting emotion characteristics of the preprocessed voice signals to obtain at least one voice characteristic.
Fig. 2 is a schematic flow chart of obtaining at least one speech feature according to some embodiments of the present disclosure, as shown in fig. 2, and in some embodiments, the feature extraction module performs emotion feature extraction on the preprocessed speech signal to obtain at least one speech feature, where the method includes:
Decomposing the preprocessed voice signal into multi-frame voice signals;
for each frame of speech signal, converting the speech signal from a time domain signal to a frequency domain signal by a fast fourier transform;
and extracting a Mel frequency cepstrum coefficient of the frequency domain signal, wherein the at least one voice feature at least comprises the Mel frequency cepstrum coefficient.
In some embodiments, the feature extraction module extracts the mel-frequency cepstrum coefficient of the frequency domain signal based on the following formula:
C_i = Σ_{n=1}^{N} log(X_i(n)) · cos[(π/N)(n − 1/2)]
where C_i is the mel-frequency cepstrum coefficient of the frequency domain signal corresponding to the i-th frame of the voice signal, N is the number of sampling points of the fast Fourier transform, and X_i(n) is the amplitude of the n-th frequency-domain component of the i-th frame of the voice signal.
The feature extraction module may capture the energy distribution characteristics of the speech by extracting mel-frequency cepstrum coefficients of the frequency domain signal.
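As an illustrative sketch only, the framing, FFT, and mel-frequency cepstrum steps described above can be realized with an off-the-shelf library such as librosa; the frame size, hop length, and number of coefficients below are assumptions, not values specified by this embodiment:

```python
import librosa

def extract_mfcc(signal, sr=16000, n_mfcc=13):
    """Frame the signal, convert each frame to the frequency domain, and return per-frame MFCCs."""
    # librosa performs framing, windowing, FFT, mel filtering, log compression, and the DCT internally.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc, n_fft=512, hop_length=256)
    return mfcc.T  # shape: (number of frames, n_mfcc)
```

Each row of the returned matrix corresponds to one frame of the speech signal and can be supplied to the emotion recognition module as part of the at least one voice feature.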
In some embodiments, the feature extraction module performs emotion feature extraction on the preprocessed voice signal to obtain at least one voice feature, including:
Performing empirical mode decomposition on the preprocessed voice signal to generate a plurality of intrinsic mode functions and a residual corresponding to the preprocessed voice signal;
for each intrinsic mode function, extracting time domain features and frequency domain features of the intrinsic mode function;
extracting the time domain features and the frequency domain features of the residual, wherein the at least one voice feature at least comprises the time domain features of each intrinsic mode function, the frequency domain features of each intrinsic mode function, the time domain features of the residual, and the frequency domain features of the residual.
Empirical mode decomposition may include the steps of:
S11, identifying extreme points: firstly, scanning an input signal, finding out all local maximum value points and local minimum value points, and executing S12;
S12, constructing upper and lower envelopes: interpolation is carried out on the extreme points by using a cubic spline interpolation method, an upper envelope line and a lower envelope line of the signal are constructed, and S13 is executed;
S13, calculating a mean envelope curve: calculating the average value of the upper envelope line and the lower envelope line to obtain an average value envelope line, and executing S14;
s14, obtaining an intermediate signal: subtracting the mean envelope curve from the original signal to obtain a new signal, namely an intermediate signal, and executing S15;
S15, judging whether the intermediate signal satisfies the intrinsic mode function conditions: over the whole data segment, the number of extreme points and the number of zero crossings must be equal or differ by at most one; and at any point, the mean of the upper envelope formed by the local maxima and the lower envelope formed by the local minima is zero, that is, the upper and lower envelopes are locally symmetric about the time axis; then execute S16;
S16, iterative sifting process: if the intermediate signal does not satisfy the intrinsic mode function conditions, the intermediate signal is taken as a new input signal and steps S12 to S15 are repeated until a signal satisfying the intrinsic mode function conditions is obtained; the sifting stops when the remaining signal contains no more than two extrema; execute S17;
S17, obtaining an intrinsic mode function by decomposition: the signal finally obtained through the iterative sifting process that satisfies the intrinsic mode function conditions is the first intrinsic mode function; execute S18;
S18, extracting the remaining intrinsic mode functions: subtracting the first intrinsic mode function from the original signal to obtain a residual signal. The residual signal is then taken as a new input signal and steps S11 to S17 are repeated to sequentially extract the other intrinsic mode functions; execute S19;
S19, decomposition complete: when no component satisfying the intrinsic mode function conditions can be extracted from the residual signal, the decomposition process ends. At this point, the original signal has been decomposed into a series of intrinsic mode functions and a residual.
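For illustration only, a compact sketch of the sifting procedure in steps S11 through S19 is given below. It replaces the exact intrinsic-mode-function conditions of S15 with a simpler energy-based stopping proxy, so it is a simplification of the described decomposition rather than a faithful implementation; a production system might instead rely on a dedicated EMD library.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def mean_envelope(x):
    """Steps S11-S13: locate extrema, build cubic-spline envelopes, and average them."""
    t = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 4 or len(minima) < 4:
        return None  # too few extrema to build envelopes: decomposition should stop
    upper = CubicSpline(maxima, x[maxima])(t)
    lower = CubicSpline(minima, x[minima])(t)
    return (upper + lower) / 2.0

def sift_imf(x, max_iter=100, tol=0.05):
    """Steps S14-S17: repeatedly subtract the mean envelope until an IMF-like signal remains."""
    h = x.copy()
    for _ in range(max_iter):
        m = mean_envelope(h)
        if m is None:
            return None
        h = h - m
        # The relative energy of the mean envelope is used as a proxy for the IMF conditions of S15.
        if np.sum(m ** 2) / (np.sum(h ** 2) + 1e-12) < tol:
            break
    return h

def emd(x, max_imfs=8):
    """Steps S18-S19: peel off intrinsic mode functions one by one; whatever remains is the residual."""
    imfs, residual = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        imf = sift_imf(residual)
        if imf is None:          # the residual no longer yields an IMF: decomposition finished
            break
        imfs.append(imf)
        residual = residual - imf
    return imfs, residual
```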
In some embodiments, the at least one speech feature further comprises at least a speech rate feature, a pitch feature, and a volume fluctuation feature.
For example only, the feature extraction module may extract the speech rate features according to the following procedure:
S21, pre-emphasizing through a first-order Finite Impulse Response (FIR) filter, and enhancing a high-frequency part in a voice signal without increasing low-frequency noise;
s22, dividing the continuous voice signal into shorter frames (or segments), typically 20-30 milliseconds, because the voice signal is stationary in short time and can be treated as a stationary signal in short time;
s23, in order to reduce spectrum leakage at the beginning and the end of the frame, windowing is carried out on the frame. Common window functions include hamming windows and rectangular windows.
S24, for each frame, determining the time length of the frame, and calculating the number of samples in the frame and dividing the number by the sampling rate.
S25, calculating the total frame number of the whole voice signal, dividing the total frame number by the total duration of the voice signal, and obtaining the speech speed characteristics.
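By way of example only, steps S21 through S25 might be sketched as follows; the 0.97 pre-emphasis coefficient, 25 ms frame length, and 10 ms hop are common but assumed values, and the frames-per-second figure is only the crude rate measure described above, not a syllable-level speaking-rate estimate:

```python
import numpy as np

def speech_rate_features(signal, sr, alpha=0.97, frame_ms=25, hop_ms=10):
    """S21 pre-emphasis, S22 framing, S23 windowing, S24-S25 frames-per-second figure."""
    # S21: first-order FIR pre-emphasis boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # S22-S23: split into short frames and apply a Hamming window to each frame
    # (the sketch assumes the signal is at least one frame long).
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # S24-S25: total number of frames divided by the total signal duration.
    speech_rate = n_frames / (len(signal) / sr)
    return frames, speech_rate
```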
In some embodiments, the feature extraction module estimates the pitch frequency by analyzing the spectral structure of the speech signal, in particular the harmonic content, and the pitch feature may comprise the pitch frequency.
In some embodiments, the feature extraction module may determine the volume fluctuation feature based on the following procedure:
S31, dividing the continuous voice signal into multi-frame voice signals;
S32, for each frame of voice signal, calculating an amplitude mean value and an amplitude fluctuation parameter corresponding to the frame of voice signal based on the amplitude value of each time point in the frame of voice signal, wherein the volume fluctuation characteristic can comprise the amplitude mean value and the amplitude fluctuation parameter corresponding to each frame of voice signal.
For example, the amplitude fluctuation parameter may be calculated based on the following formula:
F_i = sqrt( (1/T) · Σ_{t=1}^{T} (A_i(t) − Ā_i)² )
where F_i is the amplitude fluctuation parameter of the i-th frame of the speech signal, Ā_i is the amplitude mean corresponding to the i-th frame of the speech signal, A_i(t) is the amplitude value of the i-th frame of the speech signal at the t-th time point, and T is the total number of time points included in the i-th frame of the speech signal.
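A minimal sketch of S31-S32 is shown below, assuming the framing helper above and reading the amplitude fluctuation parameter as the per-frame standard deviation of the absolute amplitude (one reasonable interpretation of the formula):

```python
import numpy as np

def volume_fluctuation(frames):
    """Per-frame amplitude mean and amplitude fluctuation parameter (standard deviation)."""
    amp = np.abs(frames)                       # amplitude at each time point of each frame
    return amp.mean(axis=1), amp.std(axis=1)   # mean amplitude and fluctuation parameter per frame
```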
The emotion recognition module may be configured to: the current emotion of the user is identified based on at least one speech feature by means of a speech emotion recognition model.
FIG. 3 is a schematic flow chart of identifying a current emotion of a user according to some embodiments of the present disclosure, as shown in FIG. 3, in some embodiments, the emotion recognition module identifies the current emotion of the user based on at least one voice feature through a voice emotion recognition model, including:
Establishing a sample database, wherein the sample database comprises background information, emotion expression characteristics and emotion fluctuation characteristics of a plurality of sample users;
For any two sample users, calculating the user similarity of the two sample users based on the background information and emotion expression characteristics of the two sample users;
Clustering a plurality of sample users based on the user similarity of the two sample users through a clustering algorithm (such as a K-means clustering algorithm and the like) to determine a plurality of user clusters;
for each user cluster, establishing a voice emotion recognition model corresponding to the user cluster;
acquiring a history identification emotion of a user, and determining emotion fluctuation characteristics of the user based on the history identification emotion of the user;
For each user cluster, determining a target user cluster from a plurality of user clusters based on the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the user and the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the cluster center of the user cluster;
And identifying the current emotion of the user based on at least one voice characteristic through a voice emotion identification model corresponding to the target user cluster.
Specifically, the emotion fluctuation characteristics of the user can be extracted from pre-acquired registered voice information of the user under different emotions. The emotion fluctuation features of the user can comprise the mel-frequency cepstrum coefficients of the frequency domain signals corresponding to the registered voice information of the user under different emotions, the time domain and frequency domain features of each intrinsic mode function, the time domain and frequency domain features of the residual, the speech rate feature, the pitch feature, and the volume fluctuation feature.
For example only, when registering a user, emotion prompt information and text information may be displayed to instruct the user to record registered voice information corresponding to a certain emotion, where the emotion prompt information indicates the emotion in which the text information is to be read. For example, the user is instructed to read the text information "the weather is good today" in a "happy" emotion and to perform voice recording to acquire the registered voice information corresponding to "happy"; as another example, the user is instructed to read the text information "the weather is bad today" in a "sad" emotion and to perform voice recording to acquire the registered voice information corresponding to "sad".
The user similarity of two sample users can be calculated, based on the background information and emotion expression characteristics of the two sample users, through a similarity determination model, where the similarity determination model may be a convolutional neural network (CNN) model. The user similarity between the user and a user cluster can likewise be calculated through the similarity determination model based on the background information, emotion expression characteristics, and emotion fluctuation characteristics of the user and those of the cluster center of the user cluster, and the user cluster with the maximum user similarity is taken as the target user cluster.
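For illustration only, the clustering and target-cluster selection could be sketched as follows, assuming each sample user has already been encoded as a numeric profile vector and substituting plain cosine similarity for the CNN-based similarity determination model described above; the number of clusters is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_user_clusters(profile_vectors, n_clusters=5):
    """Cluster sample users by their background / emotion-expression profile vectors."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(np.asarray(profile_vectors))
    return km, labels

def select_target_cluster(km, user_vector):
    """Pick the cluster whose center is most similar (cosine similarity) to the current user's profile."""
    centers = km.cluster_centers_
    sims = centers @ user_vector / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(user_vector) + 1e-12)
    return int(np.argmax(sims))  # index of the target user cluster
```

A separate speech emotion recognition model would then be trained for each cluster, and the model belonging to the selected target cluster is the one used at recognition time.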
As an example, the current emotion of the user is identified, through the voice emotion recognition model corresponding to the target user cluster, based on the mel-frequency cepstrum coefficient of the frequency domain signal corresponding to each frame of the voice signal, the time domain and frequency domain features of each intrinsic mode function, the time domain and frequency domain features of the residual, the speech rate feature, the pitch feature, and the volume fluctuation feature.
In some embodiments, the speech emotion recognition model is a long short-term memory (LSTM) network model, expressed as follows:
(h_t, c_t) = LSTM(x_t, h_{t−1}, c_{t−1})
where x_t is the speech-feature input at time step t, LSTM(·) denotes the speech emotion recognition model, h_{t−1} is the hidden state of the previous time step, c_{t−1} is the cell state of the previous time step, and h_t and c_t are the hidden state and cell state of the current time step, respectively. In this way, the speech emotion recognition model is able to capture dynamic changes in emotion characteristics over time. The speech emotion recognition model may output a probability distribution over the user's emotional states.
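A minimal PyTorch sketch of such an LSTM-based recognizer is shown below; the hidden size, feature dimension, and emotion set are assumptions, and the softmax output corresponds to the probability distribution over emotional states mentioned above:

```python
import torch
import torch.nn as nn

class SpeechEmotionLSTM(nn.Module):
    """LSTM over per-frame feature vectors, producing a probability distribution over emotions."""
    def __init__(self, n_features, n_emotions, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_emotions)

    def forward(self, x):                 # x: (batch, n_frames, n_features)
        _, (h_n, _) = self.lstm(x)        # h_n: final hidden state, shape (1, batch, hidden_size)
        return torch.softmax(self.classifier(h_n[-1]), dim=-1)
```

For instance, SpeechEmotionLSTM(n_features=13, n_emotions=6) would map a batch of MFCC sequences to six emotion probabilities.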
The emotion response module may be configured to: an optimal response policy is determined based on the current emotion of the user and the response decision logic, and at least one response action is performed based on the optimal response policy.
In some embodiments, the at least one responsive action further comprises at least a dialog responsive action, a music responsive action, and a light responsive action.
For example, if the current emotion of the user is "happy", the emotion response module can, according to the response decision logic, provide corresponding services for that state, such as playing cheerful music or telling a humorous joke, so as to improve the user's experience.
As another example, if the user's current emotion is "sad", the emotion response module may choose, according to the response decision logic, to play soft music or provide a comforting conversation.
For another example, if the current emotion of the user is "anxious", the emotion response module can, according to the response decision logic, play a series of light music aimed at reducing anxiety; at the same time, the emotion response module can adjust the color temperature of the intelligent bulbs in the room and switch to a softer blue tone so as to create a more soothing environment.
In this way, the intelligent voice assistant system is able to more accurately identify the emotional state of the user and provide a more personalized interactive experience.
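As an example only, the response decision logic can be pictured as a small rule table; the emotion labels and action names below are hypothetical placeholders, not identifiers defined by this embodiment:

```python
# Hypothetical rule table mapping a recognized emotion to an ordered list of response actions.
RESPONSE_RULES = {
    "happy":   [("dialogue", "tell_joke"), ("music", "play_cheerful")],
    "sad":     [("music", "play_soft"), ("dialogue", "comfort")],
    "anxious": [("music", "play_light"), ("light", "warm_blue_dim")],
}

def decide_response(emotion_probs, rules=RESPONSE_RULES):
    """Pick the most probable emotion and return the response actions of its strategy."""
    emotion = max(emotion_probs, key=emotion_probs.get)
    return emotion, rules.get(emotion, [("dialogue", "neutral_reply")])
```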
The feedback learning module may be configured to: and acquiring feedback information of the user on at least one response action, and adjusting parameters of the voice emotion recognition model and/or adjusting response decision logic based on the feedback information of the user on the at least one response action.
For example, after the intelligent voice assistant performs an emotion response action, the system may ask the user whether they are satisfied, or implicitly gather feedback through the user's further interaction. For example, if the user turns off the music player after hearing soft music, the system may interpret this as dissatisfaction with the current emotional response. The user's feedback is used as training data and entered into the machine learning model. Through analysis of this data, deficiencies in the speech emotion recognition model and the response decision logic can be identified. For example, when the user's feedback indicates that a response action did not meet expectations, the feedback learning module may reduce the bias of the speech emotion recognition model by adding new training samples. Based on the analysis of the feedback information, the feedback learning module may automatically adjust parameters of the emotion recognition algorithm or update rules in the response strategy, for example by adjusting the weights of the speech emotion recognition model or modifying the response decision logic to better accommodate the personalized needs of the user.
By way of example only, assume that a user is not satisfied after the system plays a piece of music intended to alleviate anxiety. The feedback learning module will record this feedback and add this data point to subsequent model training. By comparing the user's emotional state with the feedback, the system may find that, when anxious, this user prefers silence to music. It therefore adjusts the response strategy: the next time anxiety is detected for this user, it may choose to reduce ambient noise rather than play music. Through continuous learning and adjustment, the intelligent voice assistant system can more accurately perceive the needs of different users and provide more attentive service.
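By way of illustration, a simplified feedback learner might keep a preference score per (emotion, action) pair and collect feedback records for later retraining; the reward convention (+1 satisfied, −1 dissatisfied) and the learning rate are assumptions:

```python
class FeedbackLearner:
    """Adjusts response preferences from user feedback and stores samples for model retraining."""
    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate
        self.scores = {}    # (emotion, action) -> preference score used by the response strategy
        self.samples = []   # feedback records that can be added to later training of the recognizer

    def record(self, emotion, action, reward, features=None):
        """reward: +1 if the user was satisfied with the action, -1 otherwise."""
        key = (emotion, action)
        self.scores[key] = self.scores.get(key, 0.0) + self.lr * reward
        if features is not None:
            self.samples.append((features, emotion, reward))

    def best_action(self, emotion, candidate_actions):
        """Choose the candidate action with the highest learned preference for this emotion."""
        return max(candidate_actions, key=lambda a: self.scores.get((emotion, a), 0.0))
```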
FIG. 4 is a flow chart of a response method based on speech emotion recognition according to some embodiments of the present disclosure, and as shown in FIG. 4, a response method based on speech emotion recognition may include the following steps.
Step 410, obtaining a voice signal of a user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal;
step 420, extracting emotion characteristics of the preprocessed voice signal to obtain at least one voice characteristic;
Step 430, recognizing the current emotion of the user based on at least one voice feature through the voice emotion recognition model;
Step 440, determining an optimal response strategy based on the current emotion of the user and the response decision logic, and executing at least one response action based on the optimal response strategy;
Step 450, obtaining feedback information of the user for the at least one responsive action, and adjusting parameters of the speech emotion recognition model and/or adjusting response decision logic based on the feedback information of the user for the at least one responsive action.
A response method based on voice emotion recognition can be performed by an intelligent voice assistant system based on voice emotion recognition, and further description of a response method based on voice emotion recognition can be found in a related description of an intelligent voice assistant system based on voice emotion recognition, which is not described here again.
Fig. 5 is a schematic diagram of a computer device according to some embodiments of the present disclosure, where the computer device includes a memory and a processor, and the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform a response method based on speech emotion recognition as described above. It should be noted that, the computer device 500 shown in fig. 5 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 5, the computer device includes a central processing unit (CPU) 501, which can perform various appropriate actions and processes, such as the methods in the above-described embodiments, according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 505 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for system operation are also stored. The CPU 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 505 including a hard disk or the like; and a communication section 509 including a network interface card such as a local area network (LAN) card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 505 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 509, and/or installed from the removable medium 511. When the computer program is executed by the central processing unit (CPU) 501, the various functions defined in the system of the present application are performed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with a computer-readable program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, and the like, or any suitable combination of the foregoing.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.
Claims (6)
1. An intelligent voice assistant system based on voice emotion recognition, comprising:
a voice acquisition module configured to: acquiring a voice signal of a user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal;
a feature extraction module configured to: extracting emotion characteristics of the preprocessed voice signals to obtain at least one voice characteristic;
An emotion recognition module configured to: identifying the current emotion of the user based on the at least one voice feature through a voice emotion identification model;
an emotion response module configured to: determining an optimal response strategy based on the current emotion of the user and response decision logic, and executing at least one response action based on the optimal response strategy;
A feedback learning module configured to: acquiring feedback information of a user on the at least one response action, and adjusting parameters of the voice emotion recognition model and/or adjusting the response decision logic based on the feedback information of the user on the at least one response action;
The voice acquisition module preprocesses the voice signal of the user to obtain a preprocessed voice signal, and the voice acquisition module comprises:
Denoising the voice signal of the user to obtain a denoised voice signal;
performing analog-to-digital conversion on the denoised voice signal to obtain the preprocessed voice signal;
The feature extraction module performs emotion feature extraction on the preprocessed voice signal to obtain at least one voice feature, and the feature extraction module comprises:
decomposing the preprocessed voice signal into multi-frame voice signals;
for each frame of speech signal, converting the speech signal from a time domain signal to a frequency domain signal by a fast fourier transform;
Extracting a mel-frequency cepstrum coefficient of the frequency domain signal, wherein the at least one voice feature at least comprises the mel-frequency cepstrum coefficient;
The feature extraction module extracts the mel-frequency cepstrum coefficient of the frequency domain signal based on the following formula: C_i = Σ_{n=1}^{N} log(X_i(n)) · cos[(π/N)(n − 1/2)], where C_i is the mel-frequency cepstrum coefficient of the frequency domain signal corresponding to the i-th frame of the voice signal, N is the number of sampling points of the fast Fourier transform, and X_i(n) is the amplitude of the n-th frequency-domain component of the i-th frame of the voice signal;
The feature extraction module performs emotion feature extraction on the preprocessed voice signal to obtain at least one voice feature, and the feature extraction module further comprises:
performing empirical mode decomposition on the preprocessed voice signal to generate a plurality of intrinsic mode functions and a residual corresponding to the preprocessed voice signal;
for each intrinsic mode function, extracting time domain features and frequency domain features of the intrinsic mode function;
extracting the time domain features and the frequency domain features of the residual, wherein the at least one voice feature at least comprises the time domain features of each intrinsic mode function, the frequency domain features of each intrinsic mode function, the time domain features of the residual and the frequency domain features of the residual.
2. The intelligent voice assistant system according to claim 1, wherein the at least one voice feature further comprises at least a speech rate feature, a pitch feature, and a volume fluctuation feature.
3. The intelligent voice assistant system based on voice emotion recognition of claim 1, wherein the voice emotion recognition model is a long short-term memory network model.
4. The intelligent voice assistant system based on voice emotion recognition of claim 3, wherein the emotion recognition module recognizes a current emotion of a user based on the at least one voice feature through a voice emotion recognition model, comprising:
Establishing a sample database, wherein the sample database comprises background information, emotion expression characteristics and emotion fluctuation characteristics of a plurality of sample users;
For any two sample users, calculating the user similarity of the two sample users based on the background information and emotion expression characteristics of the two sample users;
Clustering the plurality of sample users based on the user similarity of the two sample users through a clustering algorithm to determine a plurality of user clusters;
For each user cluster, establishing a voice emotion recognition model corresponding to the user cluster;
Acquiring the historical recognition emotion of the user, and determining emotion fluctuation characteristics of the user based on the historical recognition emotion of the user;
For each user cluster, determining a target user cluster from the plurality of user clusters based on the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the user and the background information, the emotion expression characteristics and the emotion fluctuation characteristics of the cluster center of the user cluster;
And identifying the current emotion of the user based on the at least one voice characteristic through the voice emotion identification model corresponding to the target user cluster.
5. The intelligent voice assistant system based on voice emotion recognition of claim 1, wherein the at least one response action further comprises at least a dialogue response action, a music response action, and a light response action.
6. A response method based on speech emotion recognition, wherein the response method based on speech emotion recognition is performed based on the intelligent voice assistant system based on speech emotion recognition according to any one of claims 1 to 5, and comprises:
acquiring a voice signal of a user, and preprocessing the voice signal of the user to obtain a preprocessed voice signal;
extracting emotion characteristics of the preprocessed voice signals to obtain at least one voice characteristic;
identifying the current emotion of the user based on the at least one voice feature through a voice emotion identification model;
determining an optimal response strategy based on the current emotion of the user and response decision logic, and executing at least one response action based on the optimal response strategy;
and acquiring feedback information of the user on the at least one response action, and adjusting parameters of the voice emotion recognition model and/or adjusting the response decision logic based on the feedback information of the user on the at least one response action.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410937524.7A CN118486297B (en) | 2024-07-12 | 2024-07-12 | Response method based on voice emotion recognition and intelligent voice assistant system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410937524.7A CN118486297B (en) | 2024-07-12 | 2024-07-12 | Response method based on voice emotion recognition and intelligent voice assistant system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118486297A CN118486297A (en) | 2024-08-13 |
CN118486297B true CN118486297B (en) | 2024-09-27 |
Family
ID=92198490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410937524.7A Active CN118486297B (en) | 2024-07-12 | 2024-07-12 | Response method based on voice emotion recognition and intelligent voice assistant system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118486297B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118658498B (en) * | 2024-08-16 | 2024-10-18 | 沈阳康泰电子科技股份有限公司 | Training method and system for speech emotion recognition model based on deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817246A (en) * | 2019-02-27 | 2019-05-28 | 平安科技(深圳)有限公司 | Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model |
CN110534091A (en) * | 2019-08-16 | 2019-12-03 | 广州威尔森信息科技有限公司 | A kind of people-car interaction method identified based on microserver and intelligent sound |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019119279A1 (en) * | 2017-12-19 | 2019-06-27 | Wonder Group Technologies Ltd. | Method and apparatus for emotion recognition from speech |
CN108197115B (en) * | 2018-01-26 | 2022-04-22 | 上海智臻智能网络科技股份有限公司 | Intelligent interaction method and device, computer equipment and computer readable storage medium |
CN112735477B (en) * | 2020-12-31 | 2023-03-17 | 沈阳康慧类脑智能协同创新中心有限公司 | Voice emotion analysis method and device |
CN113096680A (en) * | 2021-04-07 | 2021-07-09 | 深圳市轻生活科技有限公司 | Far-field speech recognition method |
CN117612567A (en) * | 2023-04-03 | 2024-02-27 | 浪潮通信信息系统有限公司 | Home-wide assembly dimension satisfaction reasoning method and system based on voice emotion recognition |
CN117953920A (en) * | 2023-12-27 | 2024-04-30 | 杭州艺兴科技有限公司 | Emotion perception monitoring method and system based on voice and voiceprint recognition |
-
2024
- 2024-07-12 CN CN202410937524.7A patent/CN118486297B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817246A (en) * | 2019-02-27 | 2019-05-28 | 平安科技(深圳)有限公司 | Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model |
CN110534091A (en) * | 2019-08-16 | 2019-12-03 | 广州威尔森信息科技有限公司 | A kind of people-car interaction method identified based on microserver and intelligent sound |
Also Published As
Publication number | Publication date |
---|---|
CN118486297A (en) | 2024-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109817213B (en) | Method, device and equipment for performing voice recognition on self-adaptive language | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
CN108962255B (en) | Emotion recognition method, emotion recognition device, server and storage medium for voice conversation | |
CN112951259B (en) | Audio noise reduction method and device, electronic equipment and computer readable storage medium | |
US20230401338A1 (en) | Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
US7133826B2 (en) | Method and apparatus using spectral addition for speaker recognition | |
CN109036470B (en) | Voice distinguishing method, device, computer equipment and storage medium | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium, and terminal | |
CN118173092A (en) | Online customer service platform based on AI voice interaction | |
CN118486297B (en) | Response method based on voice emotion recognition and intelligent voice assistant system | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN108682432B (en) | Voice emotion recognition device | |
CN113192535A (en) | Voice keyword retrieval method, system and electronic device | |
CN110827853A (en) | Voice feature information extraction method, terminal and readable storage medium | |
CN119204030B (en) | Voice translation method and device for solving voice ambiguity | |
CN116564275A (en) | Dialect recognition method and system for intelligent voice | |
CN119964573A (en) | An intelligent voice automatic translation system based on AI recognition | |
CN114302301B (en) | Frequency response correction method and related product | |
CN117877510A (en) | Voice automatic test method, device, electronic equipment and storage medium | |
CN114678040B (en) | Voice consistency detection method, device, equipment and storage medium | |
CN114822531A (en) | Liquid crystal television based on AI voice intelligent control | |
CN112420022A (en) | Noise extraction method, device, equipment and storage medium | |
CN110689875A (en) | Language identification method and device and readable storage medium | |
CN117935865B (en) | User emotion analysis method and system for personalized marketing | |
CN112885380B (en) | Method, device, equipment and medium for detecting clear and voiced sounds |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |