
WO2020098107A1 - Detection model-based emotions analysis method, apparatus and terminal device - Google Patents

Detection model-based emotions analysis method, apparatus and terminal device Download PDF

Info

Publication number
WO2020098107A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice
emotion
speech
parameters
Prior art date
Application number
PCT/CN2018/124629
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
彭俊清
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020098107A1 publication Critical patent/WO2020098107A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building

Definitions

  • The combining unit is configured to combine the sound intensity value, loudness value, pitch value, signal energy value and signal period of the weighted sub-voice signal into the voice parameters.
  • An establishing unit is configured to establish a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and to train that initial model with the plurality of parameter vectors and the corresponding plurality of emotion levels in the feature parameter set to obtain the emotion model.
  • The input unit 53 includes:
  • a determining unit configured to determine the emotion model corresponding to the attribute feature according to the mapping relationship, and to input the voice parameters of the voice signal to be tested into that emotion model.
  • A frequency determining unit is configured to acquire a preset frequency threshold and determine the signal frequency of the pre-stored voice signal based on the signal period;
  • a first setting unit is configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female if the signal frequency is higher than the frequency threshold;
  • a second setting unit is configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male if the signal frequency is not higher than the frequency threshold.
  • The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on it; the terminal device may include more or fewer components than illustrated, combine certain components, or use different components.
  • The terminal device may further include input and output devices, a network access device, a bus, and the like.
  • The processor 60 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • The general-purpose processor may be a microprocessor, any conventional processor, or the like.
  • The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6.
  • The memory 61 may also be an external storage device of the terminal device 6, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 6.
  • The memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6.
  • The memory 61 is used to store the computer-readable instructions and the other programs and data required by the terminal device.
  • The memory 61 may also be used to temporarily store data that has been output or will be output.
  • When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present application is applicable to the technical field of data processing, and provides a detection model-based emotion analysis method, apparatus and terminal device. The method comprises: obtaining a plurality of pre-stored voice signals and analyzing them to obtain voice parameters comprising sound intensity values, loudness values, pitch values and signal periods, wherein each pre-stored voice signal corresponds to one preset emotion level; constructing parameter vectors on the basis of the voice parameters, and training an initial model by means of the plurality of parameter vectors and the plurality of corresponding emotion levels to obtain an emotion model; and inputting the voice parameters of a voice signal to be tested into the emotion model, and determining the output result of the emotion model as the emotion level corresponding to the voice signal to be tested. In the present application, emotions are analyzed at an objective level on the basis of quantized values of pre-stored voice signals and a model trained on emotion levels, thereby improving the objectivity and accuracy of emotion analysis.

Description

Detection model-based emotion analysis method, apparatus and terminal device
This application claims priority to Chinese patent application No. 201811340781.3, entitled "Detection model-based emotion analysis method, apparatus and terminal device", filed with the Chinese Patent Office on November 12, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application belongs to the technical field of data processing, and in particular relates to a detection model-based emotion analysis method, apparatus and terminal device.
Background
Emotion analysis is a current research hotspot, applicable to scenarios such as in-person interviews, consultations and sales. One emotion analysis technique analyzes an interviewee's speech to obtain the interviewee's current emotional state, so that the interviewer can adjust wording and conversational style accordingly.
In the prior art, the interviewer usually makes a human judgment based on the interviewee's speech at the current moment, that is, infers the interviewee's emotional state from vocal characteristics. Because human judgment is highly subjective and easily influenced by the interviewer, the emotional state obtained is not an objective result, and the accuracy of the emotion analysis is low.
Technical problem
In view of this, the embodiments of the present application provide a detection model-based emotion analysis method, apparatus and terminal device, to solve the prior-art problem that emotion analysis relies on subjective judgment and has low accuracy.
Technical solution
A first aspect of the embodiments of the present application provides a detection model-based emotion analysis method, including:
obtaining a plurality of pre-stored voice signals, and analyzing the pre-stored voice signals to obtain voice parameters, the voice parameters including a sound intensity value, a loudness value, a pitch value and a signal period, where each pre-stored voice signal corresponds to a preset emotion level;
constructing parameter vectors based on the voice parameters, and training an initial model with the plurality of parameter vectors and the corresponding plurality of emotion levels to obtain an emotion model; and
inputting the voice parameters of a voice signal to be tested into the emotion model, and determining the output result of the emotion model as the emotion level corresponding to the voice signal to be tested.
A second aspect of the embodiments of the present application provides a detection model-based emotion analysis apparatus, which may include units for implementing the steps of the above detection model-based emotion analysis method.
A third aspect of the embodiments of the present application provides a terminal device including a memory and a processor, the memory storing computer-readable instructions executable on the processor, where the processor, when executing the computer-readable instructions, implements the steps of the above detection model-based emotion analysis method.
A fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above detection model-based emotion analysis method.
Beneficial effects
In the embodiments of the present application, a plurality of pre-stored voice signals, each corresponding to an emotion level, are obtained, and the voice parameters of the pre-stored voice signals are obtained by analysis; an initial model is then trained on the voice parameters and the corresponding emotion levels to obtain an emotion model; finally, the voice parameters of the voice signal to be analyzed are input into the emotion model, and the output result of the emotion model is determined as the emotion level corresponding to that signal. By using the preset emotion levels and the extracted voice parameters as input parameters to train the emotion model, the embodiments of the present application improve the objectivity and accuracy of emotion analysis.
Brief description of the drawings
FIG. 1 is an implementation flowchart of the detection model-based emotion analysis method provided in Embodiment 1 of the present application;
FIG. 2 is an implementation flowchart of the detection model-based emotion analysis method provided in Embodiment 2 of the present application;
FIG. 3 is an implementation flowchart of the detection model-based emotion analysis method provided in Embodiment 3 of the present application;
FIG. 4 is an implementation flowchart of the detection model-based emotion analysis method provided in Embodiment 4 of the present application;
FIG. 5 is a structural block diagram of the detection model-based emotion analysis apparatus provided in Embodiment 5 of the present application;
FIG. 6 is a schematic diagram of the terminal device provided in Embodiment 6 of the present application;
FIG. 7 is an architecture diagram of one model unit in the initial model provided in Embodiment 7 of the present application.
Embodiments of the invention
For a clearer understanding of the technical features, objectives and effects of the present application, specific embodiments of the present application are now described in detail with reference to the accompanying drawings.
Please refer to FIG. 1, which is an implementation flowchart of a detection model-based emotion analysis method provided by an embodiment of the present application. As shown in FIG. 1, the emotion analysis method includes the following steps:
S101: Obtain a plurality of pre-stored voice signals, and analyze the pre-stored voice signals to obtain voice parameters, the voice parameters including a sound intensity value, a loudness value, a pitch value and a signal period, where each pre-stored voice signal corresponds to a preset emotion level.
In the embodiments of the present application, in order to analyze the emotion carried by a voice signal at an objective level, a plurality of pre-stored voice signals are first obtained and used as the data basis for emotion analysis. The pre-stored voice signals are preferably continuous voice signals; they can be obtained in advance from an open-source speech library and stored locally, and each pre-stored voice signal corresponds to a preset emotion level. The emotion level may be determined manually in advance, for example by a dedicated emotion analyst who analyzes each pre-stored voice signal and assigns its emotion level. It is worth mentioning that the specific rules for determining the emotion level can be formulated according to the actual application scenario and are not limited by the embodiments of the present application; one approach, for example, is to assign each pre-stored voice signal an integer emotion level between 1 and 10, where a larger value indicates a more intense emotion. In addition, to facilitate subsequent training, the pre-stored voice signals are truncated after acquisition. The truncation duration can be set uniformly in advance, for example to 2 minutes: a pre-stored voice signal longer than 2 minutes is truncated, starting from its beginning, to a duration of 2 minutes, while a signal no longer than 2 minutes is left untouched. Because truncating to a preset duration may still leave the pre-stored voice signals with inconsistent durations, another truncation method can be applied: first obtain the durations of all pre-stored voice signals and use the shortest one as the truncation duration, after which all pre-stored voice signals have the same duration.
Each truncated pre-stored voice signal is analyzed to obtain voice parameters, including a sound intensity value, a loudness value, a pitch value and a signal period. Specifically, sound intensity is the energy passing per unit time through a unit area perpendicular to the direction of sound-wave propagation, in watts per square meter. The sound intensity value is obtained by comparing the sound intensity of the voice signal with a reference sound intensity, taking the common logarithm, and multiplying by 10: L = 10·lg(I/I₀), where L is the sound intensity value in decibels (dB), I is the sound intensity of the voice signal, I₀ is the reference sound intensity of 10⁻¹² W/m², and lg() is the base-10 common logarithm. Loudness indicates how strong the voice signal sounds; it depends on both the sound intensity and the frequency of the signal, and at a fixed frequency the loudness grows with the sound intensity. The loudness value, in phon (PHON), is defined such that a voice signal at 1000 Hz with a sound intensity value of n decibels has a loudness value of n phon, where n is an integer greater than zero. The pitch value indicates how high the sound frequency of the voice signal is, in mel; a voice signal with a loudness value of 40 phon and a frequency of 1000 Hz has a pitch value of 1000 mel. The signal period is the time taken by the vocal folds of the speaker producing the voice signal to open and close once.
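The decibel computation above can be illustrated with a minimal sketch; the function name and the example intensity are illustrative, not from the original:

```python
import math

I0 = 1e-12  # reference sound intensity in W/m^2, as defined above

def sound_intensity_value(intensity: float) -> float:
    """Sound intensity value L = 10 * lg(I / I0), in decibels."""
    return 10 * math.log10(intensity / I0)

print(sound_intensity_value(1e-6))  # a 10^-6 W/m^2 signal -> 60.0 dB
```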
The sound intensity value, loudness value and pitch value above can be obtained by analyzing the pre-stored voice signal with an open-source speech analysis component. Because a pre-stored voice signal is not limited to a single moment, the sound intensity, loudness and pitch may change at different times within its duration; therefore the averages of all sound intensity values, all loudness values and all pitch values collected over the duration of the pre-stored voice signal are taken as the final sound intensity value, loudness value and pitch value. Of course, this does not limit the embodiments of the present application: depending on the actual application scenario, these values may be obtained in other ways. In addition, the signal period is calculated with the following formula:
$$\mathrm{Measure}(m)=\sum_{n=0}^{N-m}x(n)\,x(n+m),\qquad m>0$$
In the above formula, Measure(m) is the measurement function, x(n) is the pre-stored voice signal at time n, N is the duration of the pre-stored voice signal, and m > 0, where N may be taken as the largest time index of the pre-stored voice signal; for example, if the duration of the pre-stored voice signal is 3 minutes and n is measured in seconds, then N = 3 × 60 = 180. When calculating the signal period with this formula, the Measure(m) values obtained for different m are compared numerically, the Measure(m) with the largest value is identified, and the value of m in that largest Measure(m) is taken as the signal period.
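A minimal sketch of this period search follows, assuming the autocorrelation-style measure reconstructed above. The min_lag lower bound is an implementation detail added here (raw autocorrelation also peaks near m = 0), and the sample signal is illustrative:

```python
import numpy as np

def estimate_signal_period(x: np.ndarray, min_lag: int, max_lag: int) -> int:
    """Return the lag m that maximizes Measure(m) = sum_n x[n] * x[n+m]."""
    N = len(x)
    best_m, best_measure = min_lag, -np.inf
    for m in range(min_lag, min(max_lag, N - 1) + 1):
        measure = float(np.dot(x[: N - m], x[m:]))  # sum over n of x(n) * x(n+m)
        if measure > best_measure:
            best_m, best_measure = m, measure
    return best_m

# Example: a signal with period 20 samples
t = np.arange(200)
x = np.sin(2 * np.pi * t / 20)
print(estimate_signal_period(x, min_lag=10, max_lag=60))  # 20
```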
S102: Construct parameter vectors based on the voice parameters, and train an initial model with the plurality of parameter vectors and the corresponding plurality of emotion levels to obtain an emotion model.
In order to quantify the relationship between the voice parameters and the emotion levels, after the voice parameters are obtained, a parameter vector is constructed from them. The parameter vector is multi-dimensional, with the sound intensity value, loudness value, pitch value and signal period each occupying one dimension. Since each parameter vector is derived from a particular pre-stored voice signal, it corresponds to that signal's emotion level; the parameter vectors and their corresponding emotion levels are therefore fed in sequence into the initial model for training, and the trained initial model is determined as the emotion model.
For ease of explanation, assume that the t-th parameter vector x_t is (Value_sound-t, Value_volume-t, Value_loudness-t, Period_signal-t) and that the emotion level corresponding to this parameter vector is grade_t, where t is an integer greater than zero. The initial model includes multiple model units; FIG. 7 is an architecture diagram of the t-th model unit, which contains four levels: a vector level, a first level, a second level and a third level. A circle in FIG. 7 represents an operation: a plus sign denotes vector addition, and a multiplication sign denotes vector multiplication. Each model unit maintains a unit state in vector form; since FIG. 7 shows the t-th model unit, its unit state is denoted State_t. Further assuming that the output parameter of this model unit is output_t, the computation within the model unit is described level by level as follows:
(1) The vector level of the model unit takes as input output_{t-1} (the output parameter of the previous model unit) and x_t, and creates the maintenance vector for this model unit: State_t-support = tanh(W_support·[output_{t-1}, x_t] + b_support), where tanh is the hyperbolic tangent function.
(2) The first level of the model unit sets a gate that determines which parameters in the unit state State_t need to be updated: First_t = σ(W_First·[output_{t-1}, x_t] + b_First). The output First_t is a value between 0 and 1: a value of 1 means a parameter is fully retained, and 0 means it is fully discarded, where σ is the sigmoid function of the neural network. After the vector level and the first level have been computed, First_t and State_t-support are multiplied together for the subsequent update of this model unit's state.
(3) The second level of the model unit determines what information is discarded from the previous model unit's state State_{t-1}: Second_t = σ(W_Second·[output_{t-1}, x_t] + b_Second). Likewise, Second_t is a value between 0 and 1, with 1 meaning fully retain and 0 meaning fully discard. Since the second level decides what to discard from State_{t-1}, once Second_t is obtained, State_{t-1} and Second_t are multiplied together, which corresponds to the self-loop portion of FIG. 7. After the vector level, first level and second level have been computed, the unit state of this model unit is updated: State_t = Second_t·State_{t-1} + First_t·State_t-support. In addition, as shown in FIG. 7, the computed unit state is also used to maintain the first, second and third levels of the model unit, facilitating the computation of subsequent model units.
(4) The third level of the model unit computes the output parameter of this model unit: Third_t = σ(W_Third·[output_{t-1}, x_t] + b_Third) and output_t = Third_t·tanh(State_t).
After the input parameter x_t is fed into the t-th model unit of the initial model, the output parameter output_t is obtained by computation. Since the emotion level grade_t corresponding to x_t is known, the difference between output_t and grade_t is taken as the error value, and based on this error value the backpropagation algorithm is used to adjust the parameters in each level of the model unit, including W_First, b_First, W_Second, b_Second, W_Third and b_Third, so that the output parameter of the model unit approaches the emotion level as closely as possible. It should be noted that W_First, W_Second and W_Third are the level weights of the first, second and third levels respectively, and b_First, b_Second and b_Third are balance variables; at initialization the level weights and balance variables may be set randomly, and after the error value is obtained their values are adjusted by backpropagation. After all parameter vectors and emotion levels have been fed into the initial model and the parameter adjustment is complete, the trained emotion model is obtained.
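The four levels in (1)-(4) have the gating structure of an LSTM cell. Below is a minimal NumPy sketch of one forward pass through a single model unit; the unit-state size, the random initialization and the example input values are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 8          # parameter-vector size; unit-state size (assumed)
concat = dim_h + dim_x       # [output_{t-1}, x_t] is concatenated

# Level weights W_* and balance variables b_*, randomly initialized as described
W_support, b_support = rng.normal(size=(dim_h, concat)), np.zeros(dim_h)
W_first,   b_first   = rng.normal(size=(dim_h, concat)), np.zeros(dim_h)
W_second,  b_second  = rng.normal(size=(dim_h, concat)), np.zeros(dim_h)
W_third,   b_third   = rng.normal(size=(dim_h, concat)), np.zeros(dim_h)

def model_unit(x_t, output_prev, state_prev):
    z = np.concatenate([output_prev, x_t])
    state_support = np.tanh(W_support @ z + b_support)  # vector level
    first  = sigmoid(W_first  @ z + b_first)            # first level (update gate)
    second = sigmoid(W_second @ z + b_second)           # second level (forget gate)
    state_t = second * state_prev + first * state_support
    third  = sigmoid(W_third  @ z + b_third)            # third level (output gate)
    output_t = third * np.tanh(state_t)
    return output_t, state_t

# One pass with an illustrative parameter vector (intensity, loudness, pitch, period)
x_t = np.array([60.0, 60.0, 1000.0, 0.005])
output_t, state_t = model_unit(x_t, np.zeros(dim_h), np.zeros(dim_h))
print(output_t.shape)  # (8,)
```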
S103: Input the voice parameters of the voice signal to be tested into the emotion model, and determine the output result of the emotion model as the emotion level corresponding to the voice signal to be tested.
After the training of the emotion model is complete, emotion analysis can begin. Specifically, the voice signal to be tested is obtained and analyzed to obtain its voice parameters, where the signal to be tested is analyzed in the same way as the pre-stored voice signals. The voice parameters obtained from the signal to be tested are likewise input into the emotion model in vector form, and the output result (output parameter) of the emotion model is determined as the emotion level corresponding to the voice signal to be tested. If the emotion level subsequently needs to be output to the outside, it can be presented as text, graphics or speech.
As can be seen from the embodiment shown in FIG. 1, the embodiment of the present application improves the objectivity and accuracy of emotion analysis by decomposing the features in the pre-stored voice signals and training the emotion model on those features and the emotion levels.
Please refer to FIG. 2, which is an implementation flowchart of a detection model-based emotion analysis method provided by an embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, this embodiment refines S101 into S201-S202 on the basis that the voice parameters further include a signal energy value, detailed as follows:
S201: Split the pre-stored voice signal into multiple sub-voice signals in the time dimension, and multiply each sub-voice signal by a weighting coefficient, where the weighting coefficient is generated by a preset weighting formula.
When the pre-stored voice signal is a continuous voice signal, in order to improve the accuracy of training the initial model, the embodiment of the present application splits the pre-stored voice signal into multiple sub-voice signals in the time dimension. Specifically, since a voice signal is stationary over short durations, a split duration can be determined in advance, and starting from the beginning of the pre-stored voice signal, one segment is cut out every split duration, each cut-out segment serving as a sub-voice signal. For example, with a preset split duration of 30 milliseconds and a pre-stored voice signal of 120 milliseconds, 4 sub-voice signals can be cut out.
Optionally, a preset overlap duration is obtained; after one sub-voice signal has been cut out, the cut position moves back by one split duration and then forward by one overlap duration, and the next sub-voice signal is cut out with a width of one split duration. In the embodiment of the present application, since the pre-stored voice signal is a continuous signal, the overlap duration is set in advance to avoid losing dynamic information in the pre-stored voice signal, and the pre-stored voice signal is cut according to the split duration and the overlap duration, where the overlap duration is smaller than the split duration. During cutting, each sub-voice signal overlaps the previous one, and the width of the overlapping region equals the overlap duration. For example, with a split duration of 30 milliseconds, an overlap duration of 10 milliseconds and a pre-stored voice signal of 120 milliseconds, the first sub-voice signal spans the 0th to 30th millisecond of the pre-stored voice signal, the second spans the 20th to 50th millisecond, the third spans the 40th to 70th millisecond, and so on, where width refers to extent in the time dimension.
After the pre-stored voice signal has been cut, in order to enhance the periodicity of each sub-voice signal, each sub-voice signal is multiplied by a weighting coefficient so as to attenuate its left and right ends. The weighting coefficient is generated by a preset weighting formula, for example a Hamming-type window:
$$\omega(n)=0.54-0.46\cos\!\left(\frac{2\pi n}{N-1}\right),\qquad 0\le n\le N-1$$
The product operation is x_new(n) = x(n)·ω(n), where x(n) in this step is the sub-voice signal at time n and x_new(n) is the weighted sub-voice signal at time n.
Optionally, as noted above, the preset overlap duration is used when cutting: after one sub-voice signal has been cut out, the cut position moves back by one split duration and then forward by one overlap duration before the next sub-voice signal is cut with a width of one split duration. Because multiplying a sub-voice signal by the weighting coefficient attenuates both of its ends, the overlap between adjacent sub-voice signals (whose width, the overlap duration, is smaller than the split duration) ensures that, after the sub-voice signals are generated in this way, the attenuation of the two ends has a reduced impact on the content of the sub-voice signals themselves, as sketched below.
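A minimal sketch of the splitting and weighting of S201, assuming one sample per millisecond and the Hamming-type window above (both assumptions for illustration):

```python
import numpy as np

def split_into_frames(signal: np.ndarray, frame_len: int, overlap: int) -> list:
    """Split a signal into frames of frame_len samples; consecutive frames
    overlap by `overlap` samples (hop = frame_len - overlap)."""
    hop = frame_len - overlap
    frames, start = [], 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

def weight_frame(frame: np.ndarray) -> np.ndarray:
    """x_new(n) = x(n) * w(n); the Hamming-style window attenuating both
    ends of the frame is an assumption here."""
    n = np.arange(len(frame))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (len(frame) - 1))
    return frame * w

# 120 ms signal at 1 sample/ms: 30 ms frames with 10 ms overlap start at
# 0 ms, 20 ms, 40 ms, ...
signal = np.random.default_rng(1).normal(size=120)
frames = [weight_frame(f) for f in split_into_frames(signal, frame_len=30, overlap=10)]
print(len(frames))  # 5
```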
S202: Combine the sound intensity value, loudness value, pitch value, signal energy value and signal period of the weighted sub-voice signal into the voice parameters.
For each weighted sub-voice signal, its sound intensity value, loudness value, pitch value, signal energy value and signal period are obtained. The sound intensity, loudness and pitch values can likewise be obtained automatically by running a speech analysis component; specifically, all sound intensity, loudness and pitch values obtained within the split duration of the sub-voice signal are averaged to produce the final sound intensity value, loudness value and pitch value. As for the signal energy value, its calculation formula is:
$$E_n=\sum_{m=0}^{N-1}x_{\mathrm{new}}^{2}(n+m)$$
In the above formula, E_n is the signal energy value at time n. During the calculation, multiple sampling points can be set within the split duration of the sub-voice signal, and the average of the signal energy values obtained at these sampling points is determined as the final signal energy value corresponding to the sub-voice signal.
As for the signal period, since the split duration of a sub-voice signal is shorter and its periodicity stronger, the calculation method is updated as follows:
$$\mathrm{Measure}(\theta)=\sum_{m=0}^{N-\theta}x(n+m)\,x(n+m+\theta),\qquad 0<\theta<60$$
In the above formula, x(n+m) is the sub-voice signal at time n+m, and θ ranges over 0 < θ < 60. When calculating the signal period of a sub-voice signal, the Measure(θ) values obtained for different θ are compared numerically, the Measure(θ) with the largest value is identified, and the value of θ in that largest Measure(θ) is taken as the signal period. After these calculations, the sound intensity value, loudness value, pitch value, signal energy value and signal period of the weighted sub-voice signal are combined into one set of voice parameters. Since one pre-stored voice signal corresponds to multiple sub-voice signals, after the above calculations one pre-stored voice signal corresponds to multiple sets of voice parameters; in the subsequent training of the initial model this improves the training accuracy, while also increasing the amount of training computation.
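A minimal sketch of computing the signal energy value for one weighted sub-voice signal and assembling the five quantities into a voice-parameter tuple; the energy form follows the reconstruction above, and the example frame, intensity, loudness, pitch and period values are illustrative:

```python
import numpy as np

def frame_energy(weighted_frame: np.ndarray) -> float:
    """Short-time signal energy of one weighted sub-voice signal:
    the sum of squared samples over the frame."""
    return float(np.sum(weighted_frame ** 2))

def frame_voice_parameters(weighted_frame, intensity, loudness, pitch, period):
    """Combine the five quantities for one sub-voice signal into a single
    voice-parameter tuple, as described above."""
    return (intensity, loudness, pitch, frame_energy(weighted_frame), period)

frame = np.sin(2 * np.pi * np.arange(30) / 15)  # stands in for a weighted frame
print(frame_voice_parameters(frame, 60.0, 60.0, 1000.0, 15))
# (60.0, 60.0, 1000.0, ~15.0, 15)
```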
As can be seen from the embodiment shown in FIG. 2, the embodiment of the present application improves the accuracy of training the initial model, and thereby the accuracy of the trained emotion model, by splitting the pre-stored voice signal in the time dimension and generating voice parameters corresponding to each sub-voice signal.
Please refer to FIG. 3, which is an implementation flowchart of a detection model-based emotion analysis method provided by an embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, this embodiment refines S102 into S301-S302 on the basis that there are multiple initial models and that each pre-stored voice signal further corresponds to an attribute feature, detailed as follows:
S301: Divide the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and construct the parameter vectors based on the voice parameters in the feature parameter set.
A pre-stored voice signal may correspond to an attribute feature, which is related to some attribute of the pre-stored voice signal. For example, the attribute feature may be related to the age of the speaker of the pre-stored voice signal, with ages 0 to 10 constituting one attribute feature and ages 11 to 20 constituting another; of course, depending on the actual application scenario, the attribute features may cover other content. In the embodiment of the present application, pre-stored voice signals corresponding to different attribute features are processed separately: the voice parameters of the pre-stored voice signals corresponding to the same attribute feature are divided into a separate feature parameter set, and the parameter vectors are constructed based on the voice parameters in that feature parameter set.
In S302, a mapping relationship is established between the attribute feature corresponding to the feature parameter set and one of the initial models, and that initial model is trained with the plurality of parameter vectors and the corresponding plurality of emotion levels in the feature parameter set to obtain the emotion model.
Since each feature parameter set is related to one attribute feature of the pre-stored voice signals, a mapping relationship is established between that attribute feature and one of the multiple initial models (which may be chosen at random), and the parameter vectors in the feature parameter set, together with their corresponding emotion levels, are used as input parameters to train the mapped initial model into an emotion model. With this method, if there are y attribute features, y emotion models are eventually obtained, each having a mapping relationship with one attribute feature, where y is an integer greater than zero.
Optionally, the attribute feature corresponding to the voice signal to be tested is obtained; the emotion model corresponding to that attribute feature is determined according to the mapping relationship, and the voice parameters of the voice signal to be tested are input into that emotion model for emotion analysis. For example, suppose the pre-stored voice signals cover three attribute features: ages 0 to 10, named feature one; ages 11 to 20, named feature two; and ages 21 to 30, named feature three. Then three emotion models are ultimately trained, mapped respectively to features one, two and three. For a voice signal to be tested, its corresponding attribute feature is obtained; for instance, if the speaker of the signal is 22 years old, the corresponding attribute feature is determined to be feature three, and the voice parameters of the signal are input into the emotion model mapped to feature three.
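The grouping and per-attribute training of S301-S302 can be sketched as follows; the sample data, attribute names and the train_initial_model placeholder are hypothetical, standing in for the training procedure of Embodiment 1:

```python
from collections import defaultdict

# Hypothetical samples: (attribute_feature, parameter_vector, emotion_grade)
samples = [
    ("age_0_10",  (55.0, 52.0,  900.0, 0.004), 3),
    ("age_11_20", (62.0, 61.0, 1100.0, 0.005), 7),
    ("age_11_20", (58.0, 57.0, 1050.0, 0.005), 5),
]

# S301: divide voice parameters by attribute feature into feature parameter sets
feature_sets = defaultdict(list)
for attribute, vector, grade in samples:
    feature_sets[attribute].append((vector, grade))

# S302: map each attribute feature to its own initial model and train it
def train_initial_model(pairs):
    return {"trained_on": len(pairs)}   # placeholder for a trained emotion model

emotion_models = {attr: train_initial_model(pairs)
                  for attr, pairs in feature_sets.items()}

# Inference: pick the model mapped to the test signal's attribute feature
test_attribute = "age_11_20"
print(emotion_models[test_attribute])  # {'trained_on': 2}
```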
As can be seen from the embodiment shown in FIG. 3, in the embodiment of the present application the voice parameters of the pre-stored voice signals corresponding to the same attribute feature are divided into a feature parameter set, parameter vectors are constructed from the voice parameters in that set, a mapping relationship is established between the attribute feature of the set and one of the initial models, and that initial model is trained with the parameter vectors and corresponding emotion levels in the set to obtain an emotion model. By training multiple emotion models for different attribute features, the embodiment of the present application improves the specificity of emotion analysis.
Please refer to FIG. 4, which is an implementation flowchart of a detection model-based emotion analysis method provided by an embodiment of the present application. Compared with the embodiment corresponding to FIG. 3, this embodiment extends the process before S301 into S401-S403 on the basis that the attribute features include male and female, detailed as follows:
S401: Obtain a preset frequency threshold, and determine the signal frequency of the pre-stored voice signal based on the signal period.
In the embodiment of the present application, the attribute features include male and female. Because female vocal folds vibrate at a higher frequency, the vocal frequencies of males and females differ. Therefore, when the attribute feature corresponding to a pre-stored voice signal is unknown, the reciprocal of the signal period is determined as the signal frequency of the pre-stored voice signal, and the attribute feature corresponding to the pre-stored voice signal is judged to be male or female according to the signal frequency.
S402: If the signal frequency is higher than the frequency threshold, set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female.
In the embodiment of the present application, a frequency threshold is set in advance; if the signal frequency is higher than the frequency threshold, the attribute feature of the pre-stored voice signal corresponding to that signal frequency is set to female. The frequency threshold can be customized, for example set to 500 Hz; alternatively, multiple frequency thresholds can be set on the basis of a large number of sample voice signals with known attribute features, the threshold with the highest accuracy identified, and that threshold used as the judgment condition in this step.
S403:若所述信号频率不高于所述频率阈值,则将所述信号频率对应的所述预存语音信号的所述属性特征设置为男性。S403: If the signal frequency is not higher than the frequency threshold, set the attribute characteristic of the pre-stored voice signal corresponding to the signal frequency to male.
If the signal frequency is not higher than the frequency threshold, the attribute feature of the pre-stored voice signal corresponding to the signal frequency is set to male. After the attribute features of all pre-stored voice signals have been set, the voice parameters of the pre-stored voice signals corresponding to male are grouped into one feature parameter set and those corresponding to female into another, so that pre-stored voice signals with different attribute features are processed separately. A sketch of this classification and grouping step follows.
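Combining S401 to S403 with the grouping step, a minimal sketch (Python; the dictionary keys and the function name are hypothetical) might look as follows:

```python
# Hypothetical sketch of S401-S403 plus grouping: signal period -> frequency,
# frequency vs. threshold -> attribute feature, then split parameters by feature.
def classify_and_group(prestored_signals, frequency_threshold):
    """prestored_signals: list of dicts with keys "signal_period" (seconds)
    and "voice_params" (the parameter vector of that signal)."""
    feature_sets = {"male": [], "female": []}
    for signal in prestored_signals:
        frequency = 1.0 / signal["signal_period"]          # S401: reciprocal of the period
        attribute = "female" if frequency > frequency_threshold else "male"  # S402/S403
        feature_sets[attribute].append(signal["voice_params"])
    return feature_sets
```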
As can be seen from the embodiment shown in FIG. 4, in this embodiment of the present application, a preset frequency threshold is acquired and the signal frequency of the pre-stored voice signal is determined based on the signal period. If the signal frequency is higher than the frequency threshold, the attribute feature of the corresponding pre-stored voice signal is set to female; if the signal frequency is not higher than the frequency threshold, it is set to male. By determining the attribute feature of a pre-stored voice signal from its signal frequency, this embodiment facilitates the subsequent classification of pre-stored voice signals with different attribute features and improves the pertinence of the emotion analysis.
Corresponding to the detection model-based emotion analysis method described in the foregoing embodiments, FIG. 5 shows a structural block diagram of a detection model-based emotion analysis apparatus provided by an embodiment of the present application. Referring to FIG. 5, the apparatus includes:
an analysis unit 51, configured to acquire multiple pre-stored voice signals and analyze the pre-stored voice signals to obtain voice parameters, the voice parameters including a sound intensity value, a loudness value, a pitch value, and a signal period, where each pre-stored voice signal corresponds to a preset emotion level;
a training unit 52, configured to construct parameter vectors based on the voice parameters, and to train an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model;
an input unit 53, configured to input the voice parameters of a voice signal to be tested into the emotion model, and to determine the output of the emotion model as the emotion level corresponding to the voice signal to be tested. An end-to-end sketch of this three-stage pipeline is shown after this list.
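Since the disclosure does not fix the type of the initial model, the following sketch (Python) uses a multinomial logistic regression from scikit-learn purely as a stand-in; the data layout and all function names are assumptions:

```python
# Hypothetical pipeline sketch for units 51-53. The patent does not specify
# the model type; LogisticRegression stands in for the "initial model".
from sklearn.linear_model import LogisticRegression

def analyze(prestored_signal):
    """Unit 51 (assumed interface): returns [intensity, loudness, pitch, period]."""
    return prestored_signal["voice_params"]

def train_emotion_model(prestored_signals, emotion_levels):
    """Unit 52: parameter vectors plus emotion levels -> trained emotion model."""
    vectors = [analyze(s) for s in prestored_signals]
    model = LogisticRegression(max_iter=1000)
    model.fit(vectors, emotion_levels)
    return model

def detect_emotion(model, test_signal):
    """Unit 53: the model's output is taken as the emotion level."""
    return model.predict([analyze(test_signal)])[0]
```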
Optionally, the voice parameters further include a signal energy value, and the analysis unit 51 includes:
a splitting unit, configured to split the pre-stored voice signal into multiple sub-voice signals in the time dimension, and to multiply each sub-voice signal by a weighting coefficient, where the weighting coefficient is generated by a preset weighting formula;
a combination unit, configured to combine the sound intensity value, the loudness value, the pitch value, the signal energy value, and the signal period of the weighted sub-voice signals into the voice parameters. A sketch of this splitting-and-weighting step is shown after this list.
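The preset weighting formula is not given in the disclosure; a Hamming window is a common choice, so the sketch below (Python with NumPy; the frame length and hop size are arbitrary assumptions) uses it to weight each sub-voice signal before computing per-frame signal energy:

```python
import numpy as np

def split_and_weight(signal, frame_len=400, hop=200):
    """Split a 1-D voice signal into sub-voice signals along the time
    dimension and multiply each one by weighting coefficients.
    A Hamming window stands in for the unspecified weighting formula."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)  # assumed "preset weighting formula"
    return [
        signal[start:start + frame_len] * window
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def frame_energy(frame):
    """Signal energy of one weighted sub-voice signal (sum of squares, a
    common definition; the patent's exact formula is not reproduced here)."""
    return float(np.sum(frame ** 2))
```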
Optionally, multiple initial models are included, and each pre-stored voice signal further corresponds to an attribute feature. The training unit 52 includes:
a dividing unit, configured to group the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and to construct the parameter vectors based on the voice parameters in the feature parameter set;
an establishing unit, configured to establish a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and to train that initial model with the multiple parameter vectors in the feature parameter set and the corresponding emotion levels to obtain the emotion model. A sketch of this per-attribute training is shown after this list.
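A minimal sketch of the per-attribute mapping and training (Python; the data layout, the logistic-regression stand-in for the initial model, and all names are assumptions):

```python
# Hypothetical sketch: one emotion model per attribute feature, with a
# mapping from attribute feature to the trained model.
from sklearn.linear_model import LogisticRegression

def train_per_attribute(feature_sets):
    """feature_sets: dict mapping an attribute feature (e.g. "male") to a
    list of (parameter_vector, emotion_level) pairs."""
    attribute_to_model = {}
    for attribute, labeled_vectors in feature_sets.items():
        vectors = [vector for vector, _ in labeled_vectors]
        levels = [level for _, level in labeled_vectors]
        model = LogisticRegression(max_iter=1000)  # stand-in initial model
        model.fit(vectors, levels)
        attribute_to_model[attribute] = model      # the mapping relationship
    return attribute_to_model
```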
Optionally, the input unit 53 includes:
an acquisition unit, configured to acquire the attribute feature corresponding to the voice signal to be tested;
a determining unit, configured to determine, according to the mapping relationship, the emotion model corresponding to the attribute feature, and to input the voice parameters of the voice signal to be tested into that emotion model, as sketched after this list.
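At inference time, the model selection could look like the sketch below (Python; hypothetical names, building on the attribute_to_model mapping from the previous sketch and assuming the same gender-by-frequency rule for the attribute feature):

```python
# Hypothetical sketch of the input unit: pick the emotion model that the
# test signal's attribute feature maps to, then classify.
def analyze_emotion(attribute_to_model, test_signal, frequency_threshold):
    frequency = 1.0 / test_signal["signal_period"]
    attribute = "female" if frequency > frequency_threshold else "male"
    model = attribute_to_model[attribute]  # follow the mapping relationship
    return model.predict([test_signal["voice_params"]])[0]
```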
Optionally, the attribute features include male and female, and the dividing unit further includes:
a frequency determining unit, configured to acquire a preset frequency threshold and determine the signal frequency of the pre-stored voice signal based on the signal period;
a first setting unit, configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female if the signal frequency is higher than the frequency threshold;
a second setting unit, configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male if the signal frequency is not higher than the frequency threshold.
FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes a processor 60 and a memory 61, where the memory 61 stores computer-readable instructions 62 executable on the processor 60, for example a detection model-based emotion analysis program. When executing the computer-readable instructions 62, the processor 60 implements the steps in the foregoing embodiments of the detection model-based emotion analysis method, for example steps S101 to S103 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 62, the processor 60 implements the functions of the units in the foregoing embodiment of the detection model-based emotion analysis apparatus, for example the functions of units 51 to 53 shown in FIG. 5.
Exemplarily, the computer-readable instructions 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the segments being used to describe the execution process of the computer-readable instructions 62 in the terminal device 6. For example, the computer-readable instructions 62 may be divided into an analysis unit, a training unit, and an input unit, whose specific functions are as described above.
The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that FIG. 6 is merely an example of the terminal device 6 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, a combination of certain components, or different components. For example, the terminal device may further include input/output devices, network access devices, buses, and the like.
The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used to store the computer-readable instructions and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

1. A detection model-based emotion analysis method, characterized in that it comprises:
acquiring multiple pre-stored voice signals, and analyzing the pre-stored voice signals to obtain voice parameters, the voice parameters comprising a sound intensity value, a loudness value, a pitch value, and a signal period, wherein each pre-stored voice signal corresponds to a preset emotion level;
constructing parameter vectors based on the voice parameters, and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model; and
inputting the voice parameters of a voice signal to be tested into the emotion model, and determining the output of the emotion model as the emotion level corresponding to the voice signal to be tested.

2. The emotion analysis method according to claim 1, characterized in that the voice parameters further comprise a signal energy value, and analyzing the pre-stored voice signals to obtain the voice parameters comprises:
splitting the pre-stored voice signal into multiple sub-voice signals in the time dimension, and multiplying each sub-voice signal by a weighting coefficient, wherein the weighting coefficient is generated by a preset weighting formula; and
combining the sound intensity value, the loudness value, the pitch value, the signal energy value, and the signal period of the weighted sub-voice signals into the voice parameters.

3. The emotion analysis method according to claim 1, characterized in that multiple initial models are comprised, each pre-stored voice signal further corresponds to an attribute feature, and constructing parameter vectors based on the voice parameters and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model comprises:
grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and constructing the parameter vectors based on the voice parameters in the feature parameter set; and
establishing a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and training that initial model with the multiple parameter vectors in the feature parameter set and the corresponding emotion levels to obtain the emotion model.

4. The emotion analysis method according to claim 3, characterized in that inputting the voice parameters of the voice signal to be tested into the emotion model comprises:
acquiring the attribute feature corresponding to the voice signal to be tested; and
determining, according to the mapping relationship, the emotion model corresponding to the attribute feature, and inputting the voice parameters of the voice signal to be tested into that emotion model.

5. The emotion analysis method according to claim 3, characterized in that the attribute features comprise male and female, and before grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, the method further comprises:
acquiring a preset frequency threshold, and determining the signal frequency of the pre-stored voice signal based on the signal period;
if the signal frequency is higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female; and
if the signal frequency is not higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male.
6. A detection model-based emotion analysis apparatus, characterized in that it comprises:
an analysis unit, configured to acquire multiple pre-stored voice signals and analyze the pre-stored voice signals to obtain voice parameters, the voice parameters comprising a sound intensity value, a loudness value, a pitch value, and a signal period, wherein each pre-stored voice signal corresponds to a preset emotion level;
a training unit, configured to construct parameter vectors based on the voice parameters, and to train an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model; and
an input unit, configured to input the voice parameters of a voice signal to be tested into the emotion model, and to determine the output of the emotion model as the emotion level corresponding to the voice signal to be tested.

7. The emotion analysis apparatus according to claim 6, characterized in that the voice parameters further comprise a signal energy value, and the analysis unit comprises:
a splitting unit, configured to split the pre-stored voice signal into multiple sub-voice signals in the time dimension, and to multiply each sub-voice signal by a weighting coefficient, wherein the weighting coefficient is generated by a preset weighting formula; and
a combination unit, configured to combine the sound intensity value, the loudness value, the pitch value, the signal energy value, and the signal period of the weighted sub-voice signals into the voice parameters.

8. The emotion analysis apparatus according to claim 6, characterized in that multiple initial models are comprised, each pre-stored voice signal further corresponds to an attribute feature, and the training unit comprises:
a dividing unit, configured to group the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and to construct the parameter vectors based on the voice parameters in the feature parameter set; and
an establishing unit, configured to establish a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and to train that initial model with the multiple parameter vectors in the feature parameter set and the corresponding emotion levels to obtain the emotion model.

9. The emotion analysis apparatus according to claim 8, characterized in that the input unit comprises:
an acquisition unit, configured to acquire the attribute feature corresponding to the voice signal to be tested; and
a determining unit, configured to determine, according to the mapping relationship, the emotion model corresponding to the attribute feature, and to input the voice parameters of the voice signal to be tested into that emotion model.

10. The emotion analysis apparatus according to claim 8, characterized in that the attribute features comprise male and female, and the dividing unit further comprises:
a frequency determining unit, configured to acquire a preset frequency threshold and determine the signal frequency of the pre-stored voice signal based on the signal period;
a first setting unit, configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female if the signal frequency is higher than the frequency threshold; and
a second setting unit, configured to set the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male if the signal frequency is not higher than the frequency threshold.
11. A terminal device, characterized in that it comprises a memory and a processor, the memory storing computer-readable instructions executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
acquiring multiple pre-stored voice signals, and analyzing the pre-stored voice signals to obtain voice parameters, the voice parameters comprising a sound intensity value, a loudness value, a pitch value, and a signal period, wherein each pre-stored voice signal corresponds to a preset emotion level;
constructing parameter vectors based on the voice parameters, and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model; and
inputting the voice parameters of a voice signal to be tested into the emotion model, and determining the output of the emotion model as the emotion level corresponding to the voice signal to be tested.

12. The terminal device according to claim 11, characterized in that the voice parameters further comprise a signal energy value, and analyzing the pre-stored voice signals to obtain the voice parameters comprises:
splitting the pre-stored voice signal into multiple sub-voice signals in the time dimension, and multiplying each sub-voice signal by a weighting coefficient, wherein the weighting coefficient is generated by a preset weighting formula; and
combining the sound intensity value, the loudness value, the pitch value, the signal energy value, and the signal period of the weighted sub-voice signals into the voice parameters.

13. The terminal device according to claim 11, characterized in that multiple initial models are comprised, each pre-stored voice signal further corresponds to an attribute feature, and constructing parameter vectors based on the voice parameters and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model comprises:
grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and constructing the parameter vectors based on the voice parameters in the feature parameter set; and
establishing a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and training that initial model with the multiple parameter vectors in the feature parameter set and the corresponding emotion levels to obtain the emotion model.

14. The terminal device according to claim 13, characterized in that inputting the voice parameters of the voice signal to be tested into the emotion model comprises:
acquiring the attribute feature corresponding to the voice signal to be tested; and
determining, according to the mapping relationship, the emotion model corresponding to the attribute feature, and inputting the voice parameters of the voice signal to be tested into that emotion model.

15. The terminal device according to claim 13, characterized in that the attribute features comprise male and female, and before grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, the steps further comprise:
acquiring a preset frequency threshold, and determining the signal frequency of the pre-stored voice signal based on the signal period;
if the signal frequency is higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female; and
if the signal frequency is not higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male.
16. A computer non-volatile readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by at least one processor, the following steps are implemented:
acquiring multiple pre-stored voice signals, and analyzing the pre-stored voice signals to obtain voice parameters, the voice parameters comprising a sound intensity value, a loudness value, a pitch value, and a signal period, wherein each pre-stored voice signal corresponds to a preset emotion level;
constructing parameter vectors based on the voice parameters, and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model; and
inputting the voice parameters of a voice signal to be tested into the emotion model, and determining the output of the emotion model as the emotion level corresponding to the voice signal to be tested.

17. The computer non-volatile readable storage medium according to claim 16, characterized in that the voice parameters further comprise a signal energy value, and analyzing the pre-stored voice signals to obtain the voice parameters comprises:
splitting the pre-stored voice signal into multiple sub-voice signals in the time dimension, and multiplying each sub-voice signal by a weighting coefficient, wherein the weighting coefficient is generated by a preset weighting formula; and
combining the sound intensity value, the loudness value, the pitch value, the signal energy value, and the signal period of the weighted sub-voice signals into the voice parameters.

18. The computer non-volatile readable storage medium according to claim 16, characterized in that multiple initial models are comprised, each pre-stored voice signal further corresponds to an attribute feature, and constructing parameter vectors based on the voice parameters and training an initial model with multiple parameter vectors and the corresponding emotion levels to obtain an emotion model comprises:
grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, and constructing the parameter vectors based on the voice parameters in the feature parameter set; and
establishing a mapping relationship between the attribute feature corresponding to the feature parameter set and one of the initial models, and training that initial model with the multiple parameter vectors in the feature parameter set and the corresponding emotion levels to obtain the emotion model.

19. The computer non-volatile readable storage medium according to claim 18, characterized in that inputting the voice parameters of the voice signal to be tested into the emotion model comprises:
acquiring the attribute feature corresponding to the voice signal to be tested; and
determining, according to the mapping relationship, the emotion model corresponding to the attribute feature, and inputting the voice parameters of the voice signal to be tested into that emotion model.

20. The computer non-volatile readable storage medium according to claim 18, characterized in that the attribute features comprise male and female, and before grouping the voice parameters of the pre-stored voice signals corresponding to the same attribute feature into a feature parameter set, the steps further comprise:
acquiring a preset frequency threshold, and determining the signal frequency of the pre-stored voice signal based on the signal period;
if the signal frequency is higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to female; and
if the signal frequency is not higher than the frequency threshold, setting the attribute feature of the pre-stored voice signal corresponding to the signal frequency to male.