
CN112185389B - Voice generation method, device, storage medium and electronic equipment - Google Patents

Voice generation method, device, storage medium and electronic equipment

Info

Publication number
CN112185389B
CN112185389B (application CN202011003603.9A)
Authority
CN
China
Prior art keywords
text
voice
emotion
model
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011003603.9A
Other languages
Chinese (zh)
Other versions
CN112185389A (en)
Inventor
魏晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202011003603.9A
Publication of CN112185389A
Application granted
Publication of CN112185389B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure relates to a voice generation method, an apparatus, a storage medium and an electronic device. The method includes: determining, through a preset trained emotion classification model, a voice emotion label corresponding to an input voice according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; extracting cognitive information from the semantic text; determining a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information; and generating a reply voice for the input voice according to the intonation determined by the response emotion label and the reply text. Since both the voice emotion and the semantic text of the input voice are obtained, and the reply voice is generated from the response emotion corresponding to that voice emotion and the reply text corresponding to that semantic text, the degree of intelligence of intelligent voice interaction is improved.

Description

Voice generation method, device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and apparatus for generating speech, a storage medium, and an electronic device.
Background
Since Apple's intelligent voice assistant Siri pioneered the field, intelligent voice interaction systems from other technology companies, such as voice interaction systems and intelligent voice chat systems, have sprung up rapidly. An intelligent voice interaction system is typically built into an electronic device such as a mobile terminal or a smart home appliance. In the related art of intelligent voice interaction, when an input voice of a user is received, the intelligent voice interaction system generally analyzes the semantics of the input voice and generates a reply voice corresponding to those semantics, so as to carry on voice communication with the user or help the user control the mobile terminal or smart home appliance through the reply voice.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a voice generating method, apparatus, storage medium, and electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a speech generation method, the method comprising:
Receiving input speech;
Determining a voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
extracting cognitive information from the semantic text, wherein the cognitive information comprises: at least one of user profile information, event flow information, and event decision information;
Determining a response emotion label corresponding to the voice emotion label and a response text corresponding to the semantic text according to a preset emotion correlation model, a preset text correlation model, the voice emotion label and the cognitive information; the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the response text and the cognitive information;
Generating a reply voice aiming at the input voice according to the reply emotion label and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is the intonation feature determined according to the reply emotion label;
And outputting the reply voice.
Optionally, the emotion classification model includes: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, where the classification prediction model includes a connection layer and a Softmax layer; and the determining, through the preset trained emotion classification model, the voice emotion label corresponding to the input voice according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice includes:
acquiring sound spectrum characteristics corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristics into the voice decoder to acquire corresponding first feature vectors;
Recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
Optionally, before the determining, by the emotion classification model after the training, a speech emotion label corresponding to the input speech according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech, the method further includes:
training a preset classification prediction model through preset voice emotion training data to obtain a trained classification prediction model;
constructing the emotion classification model by the speech decoder, the text decoder, the audio processing model, the speech recognition model and the trained classification prediction model; wherein,
The output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Optionally, the extracting cognitive information from the semantic text includes:
Extracting a first text element for describing personal information and/or interest information from the semantic text, and taking a text feature corresponding to the first text element as the user portrait information;
Extracting a second text element for describing an event processing flow and/or a thing development rule from the semantic text, and taking text features corresponding to the second text element as the event flow information; and/or,
Identifying a third text element in the semantic text for describing event decision conditions;
Determining event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the determining, according to a preset emotion correlation model, a preset text correlation model, the speech emotion tag and the cognitive information, a response emotion tag corresponding to the speech emotion tag and a response text corresponding to the semantic text includes:
respectively inputting the voice emotion label and the cognitive information into the emotion correlation model and the text correlation model, and acquiring a first probability set output by the emotion correlation model and a second probability set output by the text correlation model; the first probability set comprises a plurality of emotion tags and first probabilities corresponding to the emotion tags, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts;
Taking the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
And taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the generating a reply voice for the input voice according to the reply emotion label and the reply text includes:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
According to a second aspect of embodiments of the present disclosure, there is provided a speech generating apparatus, the apparatus comprising:
a voice receiving module configured to receive an input voice;
The label determining module is configured to determine a voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
an information extraction module configured to extract cognitive information from the semantic text, the cognitive information comprising: at least one of user profile information, event flow information, and event decision information;
The information determining module is configured to determine a response emotion label corresponding to the voice emotion label and a response text corresponding to the semantic text according to a preset emotion correlation model, a preset text correlation model, the voice emotion label and the cognitive information; the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the response text and the cognitive information;
The voice synthesis module is configured to generate a reply voice for the input voice according to the reply emotion label and the reply text, the semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is the intonation feature determined according to the reply emotion label;
and the voice output module is configured to output the reply voice.
Optionally, the emotion classification model includes: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, the tag determination module configured to:
acquiring sound spectrum characteristics corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristics into the voice decoder to acquire corresponding first feature vectors;
Recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
Optionally, the apparatus further includes:
The model training module is configured to train a preset classification prediction model through preset voice emotion training data so as to obtain a trained classification prediction model;
A model building module configured to build the emotion classification model from the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model; wherein,
The output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Optionally, the information extraction module is configured to:
Extracting a first text element for describing personal information and/or interest information from the semantic text, and taking a text feature corresponding to the first text element as the user portrait information;
Extracting a second text element for describing an event processing flow and/or a thing development rule from the semantic text, and taking text features corresponding to the second text element as the event flow information; and/or,
Identifying a third text element in the semantic text for describing event decision conditions;
Determining event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the information determining module is configured to:
respectively inputting the voice emotion label and the cognitive information into the emotion correlation model and the text correlation model, and acquiring a first probability set output by the emotion correlation model and a second probability set output by the text correlation model; the first probability set comprises a plurality of emotion tags and first probabilities corresponding to the emotion tags, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts;
Taking the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
And taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the voice synthesis module is configured to:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech generation method provided by the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device in which a voice interaction system is provided; the electronic device includes the speech generating apparatus provided by the second aspect of the present disclosure.
According to the technical solution provided by the embodiments of the present disclosure, a voice emotion label corresponding to the input voice can be determined, through a preset trained emotion classification model, according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user profile information, event flow information, and event decision information; a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information, where the emotion association model represents the association among the voice emotion label, the response emotion label and the cognitive information, and the text association model represents the association among the voice emotion label, the reply text and the cognitive information; and a reply voice for the input voice is generated according to the voice emotion corresponding to the response emotion label and the reply text, where the semantic text corresponding to the reply voice is the reply text and the intonation of the reply voice is the intonation determined according to the response emotion label. Since both the voice emotion and the semantic text of the input voice are obtained, and the corresponding reply voice is generated from the response emotion corresponding to that voice emotion and the reply text corresponding to that semantic text, the degree of intelligence of intelligent voice interaction is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart illustrating a method of speech generation according to an exemplary embodiment;
FIG. 2 is a flow chart of a method of determining a speech emotion tag according to the one shown in FIG. 1;
FIG. 3 is a flow chart of another speech generation method according to the one shown in FIG. 2;
FIG. 4 is a flow chart of yet another speech generation method according to the one shown in FIG. 1;
FIG. 5 is a flow chart of a method of determining a response emotion tag and a response text according to the method shown in FIG. 1;
FIG. 6 is a flow chart of a method of speech synthesis according to the one shown in FIG. 1;
FIG. 7 is a block diagram of a speech generating device, according to an example embodiment;
FIG. 8 is a block diagram of another speech generating device according to the one shown in FIG. 7;
fig. 9 is a block diagram illustrating an apparatus for speech generation according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
Before describing the voice generation method provided by the present disclosure, the target application scenario involved in the embodiments of the present disclosure is first described. The target application scenario includes an electronic device provided with an audio input and output device. The electronic device may be, for example, a personal computer, a notebook computer, a smart phone, a tablet computer, a smart television, a smart watch, a PDA (Personal Digital Assistant), or the like. An intelligent voice interaction system based on a brain-like cognitive model is provided in the electronic device.
Illustratively, such a brain-like cognitive model generally includes a sensing unit, a memory unit, a learning unit, and a decision unit. The sensing unit is used for sensing voice audio information, image information and even smell information that is actively input by the user or actively monitored by the electronic device, and for extracting and analyzing this information, so as to simulate human vision, hearing, smell, touch, and so on. In an embodiment of the present disclosure, the sensing unit includes an emotion analysis model capable of determining, from the audio features of the input voice, both the semantics of the input voice itself and the emotion information contained in the voice. The memory unit is used for extracting and memorizing, from the acquired information, user personal information, interest information and the like of different dimensions that characterize the personal features of the user. The learning unit is used for extracting, from the acquired information, event flow information that characterizes the whole flow of an event the user participates in (such as buying a train ticket or hailing a ride through a ride-hailing service). The decision unit is mainly implemented by constructing a Bayesian network: it extracts, from the information acquired by the sensing unit, different entities used for event decisions, and constructs a corresponding Bayesian network according to the causal relationships among the entities. The probabilities with which the different entities cause one another are stored in a conditional probability table corresponding to the Bayesian network; when a decision based on several entity conditions is needed, the resulting entity caused by those entities is determined according to the trigger probabilities corresponding to the entity conditions in the conditional probability table.
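As a purely illustrative sketch (not part of the disclosure), the four units could be wired together roughly as follows; all class and method names are invented placeholders:

```python
class BrainLikeCognitiveModel:
    """Illustrative wiring of the four units; every component here is a placeholder."""
    def __init__(self, sensing_unit, memory_unit, learning_unit, decision_unit):
        self.sensing = sensing_unit      # emotion analysis + semantic extraction from inputs
        self.memory = memory_unit        # user personal/interest information across dimensions
        self.learning = learning_unit    # event flow information (e.g. buying a train ticket)
        self.decision = decision_unit    # Bayesian network over decision conditions/results

    def handle(self, input_speech):
        emotion, semantic_text = self.sensing.analyze(input_speech)
        self.memory.update(semantic_text)              # remember user-profile facts
        self.learning.update(semantic_text)            # accumulate event flow knowledge
        decision = self.decision.infer(semantic_text)  # query the conditional probability table
        return emotion, semantic_text, decision
```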
Fig. 1 is a flowchart of a voice generating method according to an exemplary embodiment, as shown in fig. 1, applied to an electronic device described in the above application scenario, the method includes the following steps:
Step 101, an input speech is received.
Step 102, determining a voice emotion label corresponding to the input voice according to the voice spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model.
For example, a piece of speech contains semantic text, and the intonation (or tone) of the speech is the basis for determining the actual meaning of that piece of speech: the same semantic text may express exactly opposite actual meanings under different intonations, and the intonation is determined by the emotion the user wants to express when speaking. Based on this, in the embodiments of the present disclosure, the reply voice for the input voice needs to be determined from two kinds of information contained in the input voice: the semantic text of the input voice and the emotion corresponding to the input voice, where the emotion may be, for example, pain, being moved, happiness, and the like. In step 102, after the input voice is received through the voice acquisition unit in the sensing unit, a preset trained emotion classification model is used to determine the emotion label corresponding to the input voice. The emotion classification model contains two parts: one part extracts a feature vector from the audio features of the input voice, and the other part extracts a feature vector from the text features of the semantic text of the input voice. The feature vectors of the audio features and the text features are then used as the input of the trained classification prediction model to obtain the voice emotion label corresponding to the input voice, which characterizes the emotion contained in that piece of input voice. In practice the voice emotion label can be recorded and transmitted in numbered form.
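A minimal sketch of this two-branch flow of step 102 is shown below; the function and model objects (audio_model, asr_model, speech_decoder, text_decoder, classifier) are hypothetical placeholders standing in for the components named above:

```python
import numpy as np

def classify_speech_emotion(waveform: np.ndarray, sample_rate: int,
                            audio_model, asr_model,
                            speech_decoder, text_decoder, classifier) -> str:
    """Hypothetical sketch of step 102: two feature branches feeding one classifier."""
    # Branch 1: audio processing model -> sound spectrum features -> speech decoder.
    spectrum = audio_model(waveform, sample_rate)      # e.g. a magnitude spectrogram
    first_vec = speech_decoder(spectrum)               # first feature vector (audio features)

    # Branch 2: speech recognition model -> semantic text -> text decoder.
    semantic_text = asr_model(waveform, sample_rate)   # e.g. "I missed my train again"
    second_vec = text_decoder(semantic_text)           # second feature vector (text features)

    # Classification prediction model: predict the voice emotion label from both vectors.
    return classifier(first_vec, second_vec)           # e.g. "sad"
```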
And step 103, extracting cognitive information from the semantic text.
For example, in addition to acquiring a speech emotion tag corresponding to an input speech, analysis of content contained in a semantic text of the input speech is required. The text features of the semantic text of the input voice extracted through the emotion classification model in the step 102 may be used in the step 103, so as to obtain the cognitive information capable of expressing the semantics from the text features. The cognitive information includes: at least one of user profile information, event flow information, and event decision information.
For example, in actual execution the semantic text may not contain any cognitive information. Therefore, in addition to extracting cognitive information from the semantic text of the received input voice, the cognitive information can also be determined from information acquired by other information acquisition units in the sensing unit. Specifically, the other information acquisition units, such as an image acquisition unit, a date and time information acquisition unit, and a historical behavior acquisition unit, may be activated while the input voice is received. The image information, date and time information, historical behavior information and so on acquired by these units are then converted into recognizable feature vectors, and the cognitive information is determined according to these feature vectors.
Step 104, determining a response emotion label corresponding to the voice emotion label and a response text corresponding to the semantic text according to the preset emotion correlation model, the preset text correlation model, the voice emotion label and the cognitive information.
The emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the response text and the cognitive information.
For example, the emotion association model (or the text association model) may be a pre-trained classification prediction model: the voice emotion label and the cognitive information are used as inputs of the classification prediction model to obtain a plurality of emotion labels (or a plurality of texts) output by the model together with the prediction probability corresponding to each emotion label (or each text), and the response emotion label (or the reply text) is then determined from the plurality of emotion labels (or texts) according to the prediction probabilities. Alternatively, the emotion association model (or the text association model) may be an association look-up table that records the association among the voice emotion label, the response emotion label (or the reply text) and the cognitive information. After the voice emotion label and the cognitive information are determined, the look-up table can be consulted directly to determine the response emotion label (or the reply text).
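As a rough illustration of the look-up-table variant, the sketch below keys both tables on the voice emotion label and a simplified cognitive-information key; all keys, labels and reply strings are invented for illustration only:

```python
# Hypothetical association look-up tables; keys, labels and reply strings are illustrative only.
EMOTION_ASSOCIATION = {
    ("sad", "missed_train"): "comforting",
    ("happy", "got_promotion"): "happy",
}
TEXT_ASSOCIATION = {
    ("sad", "missed_train"): "Don't worry, the next train leaves in twenty minutes.",
    ("happy", "got_promotion"): "Congratulations, that is great news!",
}

def respond(voice_emotion_label: str, cognitive_key: str):
    response_emotion = EMOTION_ASSOCIATION.get((voice_emotion_label, cognitive_key), "neutral")
    reply_text = TEXT_ASSOCIATION.get((voice_emotion_label, cognitive_key),
                                      "I see. Tell me more about it.")
    return response_emotion, reply_text

print(respond("sad", "missed_train"))  # ('comforting', "Don't worry, the next train leaves ...")
```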
Step 105, generating a reply voice for the input voice according to the reply emotion label and the reply text.
And step 106, outputting the reply voice.
The semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is the intonation feature determined according to the reply emotion label.
Illustratively, the intonation feature and the text feature are variables in a TTS (Text To Speech) algorithm. In step 105, the intonation feature corresponding to the response emotion label is determined, and that intonation feature and the text feature corresponding to the reply text are input as variables into the TTS algorithm to obtain the synthesized reply voice. After the reply voice is obtained, in step 106 the reply voice can be output through the sound output device of the electronic device in the above application scenario, so as to interact with the user.
In summary, according to the technical solution provided by the embodiments of the present disclosure, a voice emotion label corresponding to the input voice can be determined, through a preset trained emotion classification model, according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user profile information, event flow information, and event decision information; a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information, where the emotion association model represents the association among the voice emotion label, the response emotion label and the cognitive information, and the text association model represents the association among the voice emotion label, the reply text and the cognitive information; and a reply voice for the input voice is generated according to the voice emotion corresponding to the response emotion label and the reply text, where the semantic text corresponding to the reply voice is the reply text and the intonation of the reply voice is the intonation determined according to the response emotion label. Since both the voice emotion and the semantic text of the input voice are obtained, and the corresponding reply voice is generated from the response emotion corresponding to that voice emotion and the reply text corresponding to that semantic text, the degree of intelligence of intelligent voice interaction is improved.
FIG. 2 is a flow chart of a method of determining a speech emotion tag according to the one shown in FIG. 1, as shown in FIG. 2, the emotion classification model comprising: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model being a Softmax logistic regression model comprising a connection layer and a Softmax layer, the step 102 may comprise:
in step 1021, the audio processing model is used to obtain the sound spectrum feature corresponding to the input voice, and the sound spectrum feature is input to the voice decoder to obtain the corresponding first feature vector.
Illustratively, the audio processing model is used to pre-process the speech, and the pre-processing may include: pre-emphasis, framing, windowing, and FFT (Fast Fourier Transform) processing. The pre-emphasis step emphasizes the high-frequency part of the input voice and removes the effect of lip radiation, so as to increase the high-frequency resolution of the speech. In addition, because human speech has short-time stationarity, the speech signal can be considered stable within a range of 10-30 ms; in the framing step the speech signal can therefore be divided into frames of no less than 20 ms, with a frame shift of about half a frame. The frame shift is the overlapping area between two adjacent frames and is used to avoid excessive change between them. The beginning and end of each frame become discontinuous after framing, so the more frames there are, the larger the error relative to the original signal. The windowing step reduces this error, making the framed signal continuous again so that each frame of the speech signal can be characterized by a periodic function. The FFT step transforms the time-domain audio signal into a frequency-domain audio signal. The final output of the audio processing model is the sound spectrum features corresponding to the input voice. In addition, the speech decoder may comprise a group of convolutional neural networks, which also include a convolutional layer and a pooling layer.
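A minimal numpy sketch of this preprocessing chain (pre-emphasis, framing with roughly half-frame shift, Hamming windowing, FFT) is given below; the 25 ms frame length and 0.97 pre-emphasis coefficient are illustrative assumptions rather than values fixed by the disclosure:

```python
import numpy as np

def sound_spectrum_features(signal: np.ndarray, sample_rate: int,
                            frame_ms: float = 25.0, alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis -> framing -> Hamming windowing -> FFT magnitude spectrum.
    Assumes the signal is at least one frame long."""
    # Pre-emphasis boosts the high-frequency part and compensates for lip radiation.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1]).astype(np.float64)

    # Frame length of at least 20 ms, frame shift of about half a frame so frames overlap.
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = frame_len // 2
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # Windowing smooths the frame boundaries so each frame behaves like a periodic signal.
    frames *= np.hamming(frame_len)

    # FFT maps each time-domain frame to a frequency-domain magnitude spectrum.
    return np.abs(np.fft.rfft(frames, axis=1))
```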
The step 1022 is to recognize the semantic text in the input speech through the speech recognition model, and input the semantic text into the text decoder to obtain the corresponding second feature vector.
The speech recognition model is illustratively an end-to-end ASR (Automatic Speech Recognition) model, in which the input voice is converted, after encoding and decoding, into a piece of text, i.e. the semantic text of the input voice. The text decoder may comprise two groups of convolutional neural networks, each group including a convolutional layer and a pooling layer, where the output of the pooling layer of the preceding group is the input of the convolutional layer of the following group.
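A rough PyTorch-style sketch of such a decoder built from two stacked convolution-plus-pooling groups follows; the vocabulary size, channel widths, kernel sizes and the embedding layer are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Two stacked conv+pool groups; the pooled output of the first group feeds the second."""
    def __init__(self, vocab_size: int = 5000, embed_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.group1 = nn.Sequential(nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1),
                                    nn.ReLU(), nn.MaxPool1d(2))
        self.group2 = nn.Sequential(nn.Conv1d(128, out_dim, kernel_size=3, padding=1),
                                    nn.ReLU(), nn.AdaptiveMaxPool1d(1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: LongTensor of shape (batch, seq_len) holding tokenized semantic text.
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        x = self.group1(x)                          # first convolution + pooling group
        x = self.group2(x)                          # second convolution + pooling group
        return x.squeeze(-1)                        # second feature vector, (batch, out_dim)
```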
In step 1023, the first feature vector and the second feature vector are spliced into a third feature vector by the connection layer.
In step 1024, the third feature vector is input into the Softmax layer, and the emotion tag corresponding to the third feature vector is obtained as the voice emotion tag.
For example, step 1021 may be performed simultaneously with step 1022, so that the first feature vector and the second feature vector are generated at the same time. After the first feature vector and the second feature vector are obtained in steps 1021 and 1022, the two feature vectors can be combined into a single feature vector (i.e., the third feature vector). The third feature vector reflects both the semantic properties and the audio spectral properties of the original audio of the input voice. The third feature vector is then input into the pre-trained Softmax layer, and the emotion label output by the Softmax layer is obtained as the voice emotion label.
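The connection layer and Softmax layer of steps 1023 and 1024 can be sketched as follows; the feature dimensions and the emotion label set are illustrative assumptions, and the sketch assumes a single utterance (batch size 1):

```python
import torch
import torch.nn as nn

EMOTION_LABELS = ["happy", "sad", "angry", "moved", "neutral"]  # illustrative label set

class ClassificationPredictionModel(nn.Module):
    def __init__(self, audio_dim: int = 256, text_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(audio_dim + text_dim, len(EMOTION_LABELS))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, first_vec: torch.Tensor, second_vec: torch.Tensor):
        # Connection layer: splice the two feature vectors into the third feature vector.
        third_vec = torch.cat([first_vec, second_vec], dim=-1)
        # Softmax layer: probability for each candidate emotion label.
        probs = self.softmax(self.fc(third_vec))
        # Assumes a single utterance (no batch dimension or batch size 1).
        return EMOTION_LABELS[int(probs.argmax(dim=-1))], probs
```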
Fig. 3 is a flow chart of another speech generation method according to fig. 2, as shown in fig. 3, before this step 101, the method may further comprise:
In step 107, training the preset classification prediction model according to the preset speech emotion training data to obtain a trained classification prediction model.
The step 108 is to construct the emotion classification model by the speech decoder, the text decoder, the audio processing model, the speech recognition model and the trained classification prediction model.
Wherein the output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
For example, the emotion classification model can be constructed in two ways. One is the way described in steps 107 and 108 above: first, the classification prediction model is trained on a preset training data set containing a large amount of speech emotion training data; after the training of the classification prediction model is completed, the emotion classification model is constructed from the speech decoder, the text decoder, the audio processing model, the speech recognition model and the trained classification prediction model. Each piece of speech emotion training data is a tuple consisting of two speech feature vectors (of the same form as the first feature vector and the second feature vector) and an emotion label.
By way of example, the other way of constructing the emotion classification model includes: step a, constructing an initial emotion classification model from the speech decoder, the text decoder, the audio processing model, the speech recognition model and a preset classification prediction model; and step b, inputting a large number of (speech audio, emotion label) pairs into the initial emotion classification model as training data to obtain a trained emotion classification model. It can be understood that, in this construction mode, the speech audio in each training pair is input into the audio processing model and the speech recognition model respectively; after two feature vectors are obtained through the speech decoder, the text decoder, the audio processing model and the speech recognition model, the two feature vectors and the emotion label of the pair are input into the preset classification prediction model simultaneously to train the classification prediction model. It can also be understood that, in this construction mode, completing the training of the classification prediction model also means completing the construction of the whole emotion classification model.
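A sketch of the first construction mode, training the classification prediction model on pre-extracted feature-vector pairs bound to emotion labels, might look like the following; it reuses the hypothetical ClassificationPredictionModel above, and the optimizer and dataset format are assumptions:

```python
import torch
import torch.nn as nn

def train_classifier(model: nn.Module, training_data, epochs: int = 10, lr: float = 1e-3):
    """training_data: iterable of (first_vec, second_vec, label_idx) tuples, where the
    vectors have shape (batch, dim) and label_idx is a LongTensor of shape (batch,)."""
    criterion = nn.CrossEntropyLoss()                       # works on raw logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for first_vec, second_vec, label_idx in training_data:
            # Concatenate the two feature vectors and score them before the Softmax layer.
            logits = model.fc(torch.cat([first_vec, second_vec], dim=-1))
            loss = criterion(logits, label_idx)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```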
Fig. 4 is a flowchart of yet another speech generation method according to fig. 1, and as shown in fig. 4, the step 103 may include: the step 1031, the step 1032, and/or the step 1033 and the step 1034.
The step 1031 extracts a first text element for describing personal information and/or interest information from the semantic text, and uses a text feature corresponding to the first text element as the user portrait information.
In step 1032, a second text element for describing the event processing flow and/or the object development rule is extracted from the semantic text, so as to use the text feature corresponding to the second text element as the event flow information.
The step 1033 identifies a third text element in the semantic text for describing event decision conditions.
In step 1034, determining, according to the third text element, an event probability that the event decision condition results in occurrence of each event result by using a preset bayesian network model, and taking each event result and a corresponding event probability of each event result as the event decision information.
Illustratively, the user portrait information may include: age, emotional state, gender, place of birth, occupation, favorite person, most-listened-to song, favorite sport, and so on. The event flow information may include: flow information for cooking a dish, flow information for a train-ticket purchase event, flow information for natural processes such as the rotation of the four seasons and the alternation of day and night, and the like; in other words, flow information of the various social activities humans engage in, or flow information describing the development rules of natural things. Taking a train-ticket purchase event as an example, the flow information may include an information tree consisting of node information such as ticket purchase time, ticket purchase amount, departure station, destination, riding time and arrival time, where each node of the information tree is one piece of node information. The event decision information differs from the event flow information in that the event decision information contains causal information about a certain event result being caused by certain decision conditions in the course of human activity. The event decision information may include, for example, decision information on whether to take an umbrella today, or whether to go out or watch television. Taking the decision about whether to take an umbrella today as an example, a first decision condition may be that many people on the road have opened umbrellas, and a second decision condition may be that the user feels it may rain today; the event result corresponding to these decision conditions, namely that an umbrella needs to be taken, can be determined based on the first decision condition and the second decision condition. The first text element, the second text element and the third text element may each be a word or a piece of text in the semantic text. For the user portrait information, the event flow information and the event decision conditions, the brain-like cognitive model has corresponding corpora. In steps 1031, 1032 and 1033, the first text element, the second text element and the third text element can be identified through the corresponding corpus and a preset text recognition algorithm.
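The umbrella example can be sketched with a hand-written conditional probability table; the condition names and probability values below are invented purely for illustration and do not come from the disclosure:

```python
# Hypothetical conditional probability table: P(need_umbrella | condition_1, condition_2)
# condition_1: many people on the road have opened umbrellas
# condition_2: the user feels it may rain today
CPT_NEED_UMBRELLA = {
    (True, True): 0.95,
    (True, False): 0.70,
    (False, True): 0.40,
    (False, False): 0.05,
}

def umbrella_decision(umbrellas_open_on_road: bool, feels_like_rain: bool) -> dict:
    p = CPT_NEED_UMBRELLA[(umbrellas_open_on_road, feels_like_rain)]
    # Event decision information: each event result paired with its event probability.
    return {"take_umbrella": p, "no_umbrella": round(1.0 - p, 2)}

print(umbrella_decision(True, True))   # {'take_umbrella': 0.95, 'no_umbrella': 0.05}
```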
FIG. 5 is a flowchart of a method of determining a response emotion tag and a response text according to the method shown in FIG. 1, as shown in FIG. 5, the step 104 may include:
In step 1041, the speech emotion label and the cognitive information are respectively input into the emotion correlation model and the text correlation model, and a first probability set output by the emotion correlation model and a second probability set output by the text correlation model are obtained.
The first probability set comprises a plurality of emotion tags and first probabilities corresponding to the emotion tags, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts.
In step 1042, the emotion tag corresponding to the highest first probability among the plurality of emotion tags is used as the response emotion tag.
In step 1043, a text corresponding to the highest second probability among the plurality of texts is taken as the reply text.
For example, the emotion association model and the text association model may both be classification prediction models, and may be implemented as neural network models with different structures. Taking the emotion association model as an example, the voice emotion label and the cognitive information can be used as the input-side training data of the neural network model, and the emotion label bound to that voice emotion label and cognitive information can be used as the output-side training data to train the neural network model. Each emotion label corresponds to one class, and the probabilities output by the trained neural network model (the emotion association model) are the classification prediction probabilities of the current voice emotion label and cognitive information for each class (i.e., for each emotion label).
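Treating the emotion association model as a classification network whose classes are candidate response emotion labels could look roughly like this; the label set, embedding size and cognitive-information encoding are assumptions:

```python
import torch
import torch.nn as nn

RESPONSE_EMOTIONS = ["comforting", "encouraging", "happy", "neutral"]  # illustrative classes

class EmotionAssociationModel(nn.Module):
    def __init__(self, n_emotion_tags: int = 8, cognitive_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.tag_embed = nn.Embedding(n_emotion_tags, 16)   # voice emotion label as an index
        self.net = nn.Sequential(nn.Linear(16 + cognitive_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, len(RESPONSE_EMOTIONS)))

    def forward(self, tag_idx: torch.Tensor, cognitive_vec: torch.Tensor):
        # tag_idx: LongTensor of shape (1,); cognitive_vec: FloatTensor of shape (1, cognitive_dim).
        x = torch.cat([self.tag_embed(tag_idx), cognitive_vec], dim=-1)
        probs = torch.softmax(self.net(x), dim=-1)           # the "first probability set"
        best = RESPONSE_EMOTIONS[int(probs.argmax(dim=-1))]  # response emotion label
        return best, probs
```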
Fig. 6 is a flowchart of a method of speech synthesis according to the one shown in fig. 1, as shown in fig. 6, the step 105 may include:
In step 1051, the response emotion label is input into a preset intonation association model, and the intonation feature corresponding to the response emotion label output by the intonation association model is obtained.
The step 1052 synthesizes the intonation feature and the reply text into the reply speech by a predetermined text-to-speech TTS algorithm.
For example, the TTS algorithm takes the intonation feature and the reply text as the basis for synthesizing the reply voice, so before step 1052 the intonation feature corresponding to the response emotion label needs to be determined through an intonation association model that reflects the correspondence between emotion labels and intonation features. The intonation association model may be an association look-up table or a pre-trained classification prediction model. After the response emotion label is determined, the intonation feature can be determined in step 1051 either by consulting the look-up table or by inputting the response emotion label into the model for classification prediction.
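A minimal sketch of steps 1051 and 1052 with the intonation association model implemented as a look-up table; the intonation feature names are invented, and synthesize_speech is a placeholder rather than a real TTS API:

```python
# Hypothetical intonation look-up table: response emotion label -> intonation features.
INTONATION_TABLE = {
    "comforting": {"pitch_scale": 0.9, "speed_scale": 0.85, "energy": 0.8},
    "happy":      {"pitch_scale": 1.2, "speed_scale": 1.10, "energy": 1.2},
    "neutral":    {"pitch_scale": 1.0, "speed_scale": 1.00, "energy": 1.0},
}

def synthesize_speech(text: str, intonation: dict) -> bytes:
    """Placeholder for a TTS back end that accepts intonation variables; not a real API."""
    raise NotImplementedError

def generate_reply_voice(response_emotion_label: str, reply_text: str) -> bytes:
    # Step 1051: look up the intonation features corresponding to the response emotion label.
    intonation = INTONATION_TABLE.get(response_emotion_label, INTONATION_TABLE["neutral"])
    # Step 1052: feed the intonation features and the reply text into the TTS algorithm.
    return synthesize_speech(reply_text, intonation)
```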
In summary, according to the technical solution provided by the embodiments of the present disclosure, a voice emotion label corresponding to the input voice can be determined, through a preset trained emotion classification model, according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user profile information, event flow information, and event decision information; a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information, where the emotion association model represents the association among the voice emotion label, the response emotion label and the cognitive information, and the text association model represents the association among the voice emotion label, the reply text and the cognitive information; and a reply voice for the input voice is generated according to the voice emotion corresponding to the response emotion label and the reply text, where the semantic text corresponding to the reply voice is the reply text and the intonation of the reply voice is the intonation determined according to the response emotion label. Since both the voice emotion and the semantic text of the input voice are obtained, and the corresponding reply voice is generated from the response emotion corresponding to that voice emotion and the reply text corresponding to that semantic text, the degree of intelligence of intelligent voice interaction is improved.
Fig. 7 is a block diagram of a voice generating apparatus according to an exemplary embodiment, and as shown in fig. 7, the apparatus 700 may include:
a voice receiving module 710 configured to receive an input voice;
The tag determining module 720 is configured to determine, according to the voice spectrum characteristics of the input voice and the semantic text corresponding to the input voice, a voice emotion tag corresponding to the input voice through a preset trained emotion classification model;
An information extraction module 730 configured to extract cognitive information from the semantic text, the cognitive information comprising: at least one of user profile information, event flow information, and event decision information;
the information determining module 740 is configured to determine a response emotion tag corresponding to the voice emotion tag and a response text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion tag and the cognitive information, wherein the emotion association model is used for representing an association relationship among the voice emotion tag, the response emotion tag and the cognitive information, and the text association model is used for representing an association relationship among the voice emotion tag, the response text and the cognitive information;
A speech synthesis module 750 configured to generate a reply speech for the input speech according to the reply emotion tag and the reply text, the semantic text corresponding to the reply speech being the reply text, and the intonation features of the reply speech being intonation features determined according to the reply emotion tag;
A voice output module 760 configured to output the reply voice.
Optionally, the emotion classification model includes: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, the tag determination module 720 configured to:
Acquiring sound spectrum characteristics corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristics into the voice decoder to acquire corresponding first feature vectors;
Recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
Splicing the first feature vector and the second feature vector into a third feature vector through the connecting layer;
inputting the third feature vector into the Softmax layer, and obtaining an emotion label corresponding to the third feature vector as the voice emotion label.
FIG. 8 is a block diagram of another speech generating apparatus according to the one shown in FIG. 7. As shown in FIG. 8, the apparatus 700 may further include:
The model training module 770 is configured to train a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
a model building module 780 configured to build the emotion classification model from the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model; wherein,
The output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Optionally, the information extraction module 730 is configured to:
Extracting a first text element for describing personal information and/or interest information from the semantic text, and taking a text feature corresponding to the first text element as the user portrait information;
Extracting a second text element for describing an event processing flow and/or a thing development rule from the semantic text, and taking text features corresponding to the second text element as the event flow information; and/or,
Identifying a third text element in the semantic text for describing event decision conditions;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the information determining module 740 is configured to:
respectively inputting the voice emotion label and the cognitive information into the emotion correlation model and the text correlation model, and acquiring a first probability set output by the emotion correlation model and a second probability set output by the text correlation model; the first probability set comprises a plurality of emotion tags and first probabilities corresponding to the emotion tags, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts;
Taking the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
And taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the speech synthesis module 750 is configured to:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
In summary, according to the technical solution provided by the embodiments of the present disclosure, a voice emotion label corresponding to the input voice can be determined, through a preset trained emotion classification model, according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user profile information, event flow information, and event decision information; a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information, where the emotion association model represents the association among the voice emotion label, the response emotion label and the cognitive information, and the text association model represents the association among the voice emotion label, the reply text and the cognitive information; and a reply voice for the input voice is generated according to the voice emotion corresponding to the response emotion label and the reply text, where the semantic text corresponding to the reply voice is the reply text and the intonation of the reply voice is the intonation determined according to the response emotion label. Since both the voice emotion and the semantic text of the input voice are obtained, and the corresponding reply voice is generated from the response emotion corresponding to that voice emotion and the reply text corresponding to that semantic text, the degree of intelligence of intelligent voice interaction is improved.
Fig. 9 is a block diagram illustrating an apparatus 900 for speech generation according to an example embodiment. For example, apparatus 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 9, apparatus 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operations of the apparatus 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 can include one or more processors 920 to execute instructions to perform all or part of the steps of the speech generation method described above. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the apparatus 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 906 provides power to the various components of the device 900. Power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 900.
The multimedia component 908 includes a screen that provides an output interface between the device 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the apparatus 900 is in an operational mode, such as a photographing mode or a video mode. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor assembly 914 may detect the on/off state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900. The sensor assembly 914 may also detect a change in position of the device 900 or of a component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a change in temperature of the device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the apparatus 900 and other devices in a wired or wireless manner. The device 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described speech generation method.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 904 including instructions executable by the processor 920 of the apparatus 900 to perform the above-described speech generation method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described speech generating method when executed by the programmable apparatus.
The voice generation device provided by the embodiments of the present disclosure can acquire the voice emotion and the semantic text of the input voice, and generate a corresponding reply voice from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, thereby improving the degree of intelligence of intelligent voice interaction.
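Claims 4 and 9 below describe deriving event decision information from a decision condition through a Bayesian network. As a minimal sketch of that idea only, the conditional probability table, the condition names and the event results below are invented stand-ins for a trained network:

# P(event result | event decision condition); illustrative numbers only.
EVENT_CPT = {
    "forgot_umbrella_and_rain_forecast": {"gets_wet": 0.7, "stays_dry": 0.3},
    "left_before_the_rain": {"gets_wet": 0.1, "stays_dry": 0.9},
}

def event_decision_information(decision_condition):
    # Returns (event result, event probability) pairs, most likely first,
    # which serve as the event decision information.
    distribution = EVENT_CPT.get(decision_condition, {})
    return sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)

print(event_decision_information("forgot_umbrella_and_rain_forecast"))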
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method of speech generation, the method comprising:
Receiving input speech;
Determining a voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
extracting cognitive information from the semantic text, wherein the cognitive information comprises: at least one of user profile information, event flow information, and event decision information;
Determining a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion correlation model, a preset text correlation model, the voice emotion label and the cognitive information; the emotion correlation model is used for representing the correlation among the voice emotion label, the response emotion label and the cognitive information, and the text correlation model is used for representing the correlation among the voice emotion label, the reply text and the cognitive information;
Generating a reply voice for the input voice according to the response emotion label and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is the intonation feature determined according to the response emotion label;
Outputting the reply voice;
The determining the response emotion label corresponding to the voice emotion label and the reply text corresponding to the semantic text according to the preset emotion correlation model, the preset text correlation model, the voice emotion label and the cognitive information comprises the following steps:
Inputting the voice emotion label and the cognitive information into the emotion correlation model, inputting the voice emotion label and the cognitive information into the text correlation model, and obtaining a first probability set output by the emotion correlation model and a second probability set output by the text correlation model; the first probability set comprises a plurality of emotion labels and first probabilities corresponding to the emotion labels, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts;
Taking the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
And taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
2. The method of claim 1, wherein the emotion classification model comprises: a speech decoder, a text decoder, an audio processing model, a speech recognition model and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, and the determining, through the preset trained emotion classification model, the voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice comprises:
acquiring sound spectrum characteristics corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristics into the speech decoder to obtain a corresponding first feature vector;
Recognizing the semantic text in the input voice through the speech recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
3. The method of claim 1, wherein before the determining, through the preset trained emotion classification model, the voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice, the method further comprises:
training a preset classification prediction model through preset voice emotion training data to obtain a trained classification prediction model;
constructing the emotion classification model by the speech decoder, the text decoder, the audio processing model, the speech recognition model and the trained classification prediction model; wherein,
The output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
4. The method of claim 1, wherein the extracting cognitive information from the semantic text comprises:
Extracting a first text element for describing personal information and/or interest information from the semantic text, and taking a text feature corresponding to the first text element as the user profile information;
Extracting a second text element for describing an event processing flow and/or a development rule of things from the semantic text, and taking text features corresponding to the second text element as the event flow information; and/or,
Identifying a third text element in the semantic text for describing event decision conditions;
Determining event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
5. The method of claim 1, wherein the generating the reply voice for the input voice according to the response emotion label and the reply text comprises:
Inputting the response emotion label into a preset intonation association model, and acquiring the intonation features corresponding to the response emotion label output by the intonation association model;
And synthesizing the intonation features and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
6. A speech generating apparatus, the apparatus comprising:
a voice receiving module configured to receive an input voice;
The label determining module is configured to determine a voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
an information extraction module configured to extract cognitive information from the semantic text, the cognitive information comprising: at least one of user profile information, event flow information, and event decision information;
The information determining module is configured to determine a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion correlation model, a preset text correlation model, the voice emotion label and the cognitive information; the emotion correlation model is used for representing the correlation among the voice emotion label, the response emotion label and the cognitive information, and the text correlation model is used for representing the correlation among the voice emotion label, the reply text and the cognitive information;
A speech synthesis module configured to generate a reply voice for the input voice according to the response emotion label and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation features of the reply voice are the intonation features determined according to the response emotion label;
A voice output module configured to output the reply voice;
the information determination module is configured to:
Inputting the voice emotion label and the cognitive information into the emotion correlation model, inputting the voice emotion label and the cognitive information into the text correlation model, and obtaining a first probability set output by the emotion correlation model and a second probability set output by the text correlation model; the first probability set comprises a plurality of emotion labels and first probabilities corresponding to the emotion labels, and the second probability set comprises a plurality of texts and second probabilities corresponding to the texts;
Taking the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
And taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
7. The apparatus of claim 6, wherein the emotion classification model comprises: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, and the label determining module is configured to:
acquire sound spectrum characteristics corresponding to the input voice through the audio processing model, and input the sound spectrum characteristics into the speech decoder to obtain a corresponding first feature vector;
recognize the semantic text in the input voice through the speech recognition model, and input the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
8. The apparatus of claim 6, wherein the apparatus further comprises:
The model training module is configured to train a preset classification prediction model through preset voice emotion training data so as to obtain a trained classification prediction model;
A model building module configured to build the emotion classification model from the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model; wherein,
The output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
9. The apparatus of claim 6, wherein the information extraction module is configured to:
Extracting a first text element for describing personal information and/or interest information from the semantic text, and taking a text feature corresponding to the first text element as the user profile information;
Extracting a second text element for describing an event processing flow and/or a development rule of things from the semantic text, and taking text features corresponding to the second text element as the event flow information; and/or,
Identifying a third text element in the semantic text for describing event decision conditions;
Determining event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
10. The apparatus of claim 6, wherein the speech synthesis module is configured to:
Inputting the response emotion label into a preset intonation association model, and acquiring the intonation features corresponding to the response emotion label output by the intonation association model;
And synthesizing the intonation features and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
11. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-5.
12. An electronic device is characterized in that a voice interaction system is arranged in the electronic device;
The electronic device includes: the speech generating device of any of claims 6-10.
CN202011003603.9A 2020-09-22 2020-09-22 Voice generation method, device, storage medium and electronic equipment Active CN112185389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003603.9A CN112185389B (en) 2020-09-22 2020-09-22 Voice generation method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011003603.9A CN112185389B (en) 2020-09-22 2020-09-22 Voice generation method, device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112185389A (en) 2021-01-05
CN112185389B (en) 2024-06-18

Family

ID=73955767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011003603.9A Active CN112185389B (en) 2020-09-22 2020-09-22 Voice generation method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112185389B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765981A (en) * 2021-02-09 2021-05-07 珠海格力电器股份有限公司 Text information generation method and device
CN112992147A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Voice processing method, device, computer equipment and storage medium
CN112967725A (en) * 2021-02-26 2021-06-15 平安科技(深圳)有限公司 Voice conversation data processing method and device, computer equipment and storage medium
CN115312079A (en) * 2021-04-20 2022-11-08 北京沃东天骏信息技术有限公司 Information display method and device, electronic equipment and computer readable medium
CN113506586B (en) * 2021-06-18 2023-06-20 杭州摸象大数据科技有限公司 Method and system for identifying emotion of user
CN113645364B (en) * 2021-06-21 2023-08-22 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method for power dispatching
CN113539261B (en) * 2021-06-30 2024-10-08 大众问问(北京)信息科技有限公司 Man-machine voice interaction method, device, computer equipment and storage medium
CN113593521B (en) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN115424606A (en) * 2022-09-01 2022-12-02 北京捷通华声科技股份有限公司 Voice interaction method, voice interaction device and computer readable storage medium
CN116758908B (en) * 2023-08-18 2023-11-07 中国工业互联网研究院 Interaction method, device, equipment and storage medium based on artificial intelligence

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784414A (en) * 2019-01-24 2019-05-21 出门问问信息科技有限公司 Customer anger detection method, device and electronic equipment in a kind of phone customer service

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004090109A (en) * 2002-08-29 2004-03-25 Sony Corp Robot device and interactive method for robot device
US20050144002A1 (en) * 2003-12-09 2005-06-30 Hewlett-Packard Development Company, L.P. Text-to-speech conversion with associated mood tag
CN102385858B (en) * 2010-08-31 2013-06-05 国际商业机器公司 Emotional voice synthesis method and system
US20140365208A1 (en) * 2013-06-05 2014-12-11 Microsoft Corporation Classification of affective states in social media
KR102222122B1 (en) * 2014-01-21 2021-03-03 엘지전자 주식회사 Mobile terminal and method for controlling the same
CN104063427A (en) * 2014-06-06 2014-09-24 北京搜狗科技发展有限公司 Expression input method and device based on semantic understanding
WO2019102884A1 (en) * 2017-11-21 2019-05-31 日本電信電話株式会社 Label generation device, model learning device, emotion recognition device, and method, program, and storage medium for said devices
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium
CN111368609B (en) * 2018-12-26 2023-10-17 深圳Tcl新技术有限公司 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN109859772B (en) * 2019-03-22 2023-03-28 平安科技(深圳)有限公司 Emotion recognition method, emotion recognition device and computer-readable storage medium
CN110827797B (en) * 2019-11-06 2022-04-12 北京沃东天骏信息技术有限公司 Voice response event classification processing method and device
CN111028827B (en) * 2019-12-10 2023-01-24 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784414A (en) * 2019-01-24 2019-05-21 出门问问信息科技有限公司 Customer anger detection method, device and electronic equipment in a kind of phone customer service

Also Published As

Publication number Publication date
CN112185389A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112185389B (en) Voice generation method, device, storage medium and electronic equipment
CN113362812B (en) Voice recognition method and device and electronic equipment
CN107705783B (en) Voice synthesis method and device
CN109599128B (en) Speech emotion recognition method and device, electronic equipment and readable medium
CN111612070B (en) Image description generation method and device based on scene graph
CN111524521A (en) Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device, voiceprint recognition device and voiceprint recognition medium
CN113362813B (en) Voice recognition method and device and electronic equipment
JP7116088B2 (en) Speech information processing method, device, program and recording medium
CN107274903B (en) Text processing method and device for text processing
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113488022B (en) Speech synthesis method and device
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN110633470A (en) Named entity recognition method, device and storage medium
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN113539233B (en) Voice processing method and device and electronic equipment
CN114333804B (en) Audio classification recognition method and device, electronic equipment and storage medium
CN114155849A (en) Method, device and medium for processing virtual objects
CN114154485A (en) Text error correction method and device
CN112948565A (en) Man-machine conversation method, device, electronic equipment and storage medium
CN110930977A (en) Data processing method and device and electronic equipment
CN113807540B (en) A data processing method and device
CN108346423B (en) Method and device for processing speech synthesis model
CN113901832A (en) Man-machine conversation method, device, storage medium and electronic equipment
CN113254611A (en) Question recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant