Disclosure of Invention
The invention provides a text-to-speech method, apparatus, storage medium and computer device, which can convert text entered during communication into voice information, thereby avoiding inconvenience to the information receiver.
According to a first aspect of the present invention, there is provided a text-to-speech method, comprising:
acquiring text information to be converted;
carrying out multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information;
and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information.
Optionally, the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information includes:
acquiring an initial voice conversion model and a plurality of corresponding groups of model parameters;
determining, from the multiple groups of model parameters, target model parameters that jointly correspond to the intention recognition result, the emotion recognition result and the mood recognition result, wherein each group of recognition results corresponds to one group of model parameters, and a group of recognition results comprises an intention recognition result, an emotion recognition result and a mood recognition result;
adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result;
and calling the matched preset voice conversion model to convert the text information into voice information.
Optionally, the obtaining the text information to be converted includes:
receiving text information input by a sender;
the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information comprises:
if the text information contains special characters, verifying the identity of a sender, and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender to convert the text information input by the sender into voice information.
Optionally, the receiving text information input by the sender includes:
detecting the sound decibel level of the sender's current environment;
if the sound decibel level is greater than a preset sound decibel level, providing a text input interface and receiving text information input by the sender based on the text input interface; or
acquiring a historical dialogue record between the sender and the receiver;
determining, based on the historical dialogue record, the input mode the sender selected when last conversing with the receiver;
and if the input mode selected by the sender when last conversing with the receiver is text mode, outputting a text input interface and receiving text information input by the sender based on the text input interface.
Optionally, before the receiving the text information input by the sender, the method further includes:
collecting multiple groups of voice information in which the sender reads scene sentences aloud, wherein different groups of scene sentences correspond to different intentions, emotions or moods;
and determining the text information respectively corresponding to the multiple groups of voice information read by the sender, and training, based on the multiple groups of voice information and their corresponding text information, an initial voice conversion model bound to the identity information of the sender and multiple groups of model parameters corresponding to the initial voice conversion model.
Optionally, after training the initial speech conversion model and the corresponding multiple sets of model parameters bound to the identity information of the sender based on the multiple sets of speech information and the corresponding text information, the method further includes:
collecting real-time voice information of the sender through a preset device;
optimizing multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the collected real-time voice information; and/or
acquiring specific text input by the sender and voice information input by the sender for the specific text;
and optimizing multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the specific text and its corresponding voice information.
Optionally, after the collecting, by the predetermined device, the real-time voice information of the sender, the method further includes:
performing text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information;
respectively carrying out intention, emotion and mood recognition on the real-time text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the real-time text information;
and interrupting collection of the sender's real-time voice information if it is determined, according to the intention recognition result corresponding to the real-time text information, that the sender's emotion is abnormal, or if the emotion recognition result corresponding to the real-time text information does not match the mood recognition result.
Optionally, after training the initial speech conversion model and the corresponding multiple sets of model parameters bound to the identity information of the sender based on the multiple sets of speech information and the corresponding text information, the method further includes:
acquiring a test reading sentence input by the sender;
converting the test reading sentence into voice information by using the trained multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender, playing the voice information to the sender, and outputting a selection-correction interface corresponding to the test reading sentence;
acquiring revised voice corresponding to a target character in the test reading sentence selected by the sender based on the selection-correction interface;
and optimizing multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the revised voice.
Optionally, after the receiving the text information input by the sender, the method further includes:
when the text information is in dialect, acquiring the output voice type selected by the sender;
if the output voice type is dialect voice, calling a preset dialect voice conversion model bound to the identity information of the sender to convert the dialect input by the sender into dialect voice information;
if the output voice type is standard voice, converting the dialect into Mandarin by using a preset dialect lexicon to obtain standard text information corresponding to the dialect;
carrying out multidimensional emotion recognition on the standard text information to obtain emotion recognition results of the standard text information in multiple dimensions;
and calling a preset voice conversion model matched with the multidimensional emotion recognition results and the identity information of the sender to convert the standard text information into standard voice information.
Optionally, after the invoking of the preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information, the method further includes:
converting, in response to a received sound wave conversion instruction, the audio sound wave in the voice information into a bone conduction sound wave.
Optionally, after the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender, the method further includes:
in response to a received background sound adding instruction, outputting and displaying a background sound list, acquiring a selection instruction for selecting a target background sound from the background sound list, and adding the target background sound to the voice information by superposing the sound waves of the target background sound onto the sound waves of the voice information; or
acquiring the current dialogue record of the sender and the receiver, determining the sender's current scene according to the dialogue record, determining a target background sound matched with the sender's current scene, and adding the target background sound to the voice information by superposing the sound waves of the target background sound onto the sound waves of the voice information.
Optionally, after the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender, the method further includes:
in response to an encrypted voice adding instruction triggered by the sender, acquiring the encrypted voice of the sender;
adjusting the frequency of the sound waves in the encrypted voice to a specific frequency that cannot be perceived by the human ear;
and adding the adjusted encrypted voice to the voice information by superposing the sound waves of the adjusted encrypted voice onto the sound waves of the voice information.
Optionally, the method further comprises:
collecting the sender's operation data on the communication device during communication;
matching the operation data collected during communication against historical operation data;
and if the operation data collected during communication does not match the historical operation data, verifying the sender's identity information by activating the camera device.
According to a second aspect of the present invention, there is provided a text-to-speech apparatus comprising:
the acquisition unit is configured to acquire the text information to be converted;
the recognition unit is configured to carry out multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information;
and the conversion unit is configured to call a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information.
Optionally, the conversion unit comprises a first acquisition module, a first determination module, an addition module and a conversion module,
the first acquisition module is configured to acquire an initial voice conversion model and a plurality of corresponding groups of model parameters;
the first determining module is configured to determine, from the multiple groups of model parameters, target model parameters that jointly correspond to the intention recognition result, the emotion recognition result and the mood recognition result, where each group of recognition results corresponds to one group of model parameters, and a group of recognition results comprises an intention recognition result, an emotion recognition result and a mood recognition result;
the adding module is configured to add the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result;
and the conversion module is configured to call the matched preset voice conversion model to convert the text information into voice information.
Optionally, the acquiring unit is specifically configured to receive text information input by a sender;
the conversion unit is specifically configured to verify the identity of the sender if the text information includes special characters, and call a preset voice conversion model that is matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender, so as to convert the text information input by the sender into voice information.
Optionally, the acquisition unit comprises a detection module, a receiving module, a second acquisition module and a second determination module,
the detection module is configured to detect the sound decibel level of the sender's current environment;
the receiving module is configured to provide a text input interface and receive text information input by the sender based on the text input interface if the sound decibel level is greater than a preset sound decibel level;
the second acquisition module is configured to acquire a historical dialogue record between the sender and the receiver;
the second determining module is configured to determine, based on the historical dialogue record, the input mode the sender selected when last conversing with the receiver;
and the receiving module is further configured to output a text input interface and receive text information input by the sender based on the text input interface if the input mode selected by the sender when last conversing with the receiver is text mode.
Optionally, the device further comprises an acquisition unit and a training unit,
the collecting unit is configured to collect multiple groups of voice information in which the sender reads scene sentences aloud, where different groups of scene sentences correspond to different intentions, emotions or moods;
and the training unit is configured to determine the text information corresponding to the multiple groups of voice information read by the sender, and to train, based on the multiple groups of voice information and their corresponding text information, an initial voice conversion model bound to the identity information of the sender and multiple groups of model parameters corresponding to the initial voice conversion model.
Optionally, the apparatus further comprises an optimization unit,
the collecting unit is further configured to collect real-time voice information of the sender through a preset device;
the optimizing unit is configured to optimize multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the collected real-time voice information;
the acquisition unit is further configured to acquire specific text input by the sender and voice information input by the sender for the specific text;
and the optimizing unit is further configured to optimize multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the specific text and its corresponding voice information.
Optionally, the apparatus further comprises an interrupt unit,
the conversion unit is further configured to perform text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information;
the recognition unit is further configured to respectively carry out intention, emotion and mood recognition on the real-time text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the real-time text information;
and the interruption unit is configured to interrupt collection of the sender's real-time voice information if it is determined, according to the intention recognition result corresponding to the real-time text information, that the sender's emotion is abnormal, or if the emotion recognition result corresponding to the real-time text information does not match the mood recognition result.
Optionally, the acquisition unit is further configured to acquire a test reading sentence input by the sender;
the conversion unit is further configured to convert the test reading sentence into voice information by using the trained multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender, play the voice information to the sender, and output a selection-correction interface corresponding to the test reading sentence;
the acquisition unit is further configured to acquire revised voice corresponding to a target character in the test reading sentence selected by the sender based on the selection-correction interface;
and the optimizing unit is further configured to optimize multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the revised voice.
Optionally, the acquisition unit is further configured to acquire the output voice type selected by the sender when the text information is in dialect;
the conversion unit is further configured to call, if the output voice type is dialect voice, a preset dialect voice conversion model bound to the identity information of the sender to convert the dialect input by the sender into dialect voice information;
the conversion unit is further configured to convert, if the output voice type is standard voice, the dialect into Mandarin by using a preset dialect lexicon to obtain standard text information corresponding to the dialect;
the recognition unit is further configured to carry out multidimensional emotion recognition on the standard text information to obtain emotion recognition results of the standard text information in multiple dimensions;
and the conversion unit is further configured to call a preset voice conversion model matched with the multidimensional emotion recognition results and the identity information of the sender to convert the standard text information into standard voice information.
Optionally, the conversion unit is further configured to convert the audio sound wave in the voice information into the bone conduction sound wave in response to the received sound wave conversion instruction.
Optionally, the apparatus further comprises a superposition unit,
the superposition unit is configured to, in response to a received background sound adding instruction, output and display a background sound list, acquire a selection instruction for selecting a target background sound from the background sound list, and add the target background sound to the voice information by superposing the sound waves of the target background sound onto the sound waves of the voice information; or to acquire the current dialogue record of the sender and the receiver, determine the sender's current scene according to the dialogue record, determine a target background sound matched with the sender's current scene, and add the target background sound to the voice information by superposing the sound waves of the target background sound onto the sound waves of the voice information.
Optionally, the obtaining unit is further configured to obtain the encrypted voice of the sender in response to an encrypted voice adding instruction triggered by the sender;
the superposition unit is further configured to adjust the frequency of the sound waves in the encrypted voice to a specific frequency that cannot be perceived by the human ear, and to add the adjusted encrypted voice to the voice information by superposing the sound waves of the adjusted encrypted voice onto the sound waves of the voice information.
Optionally, the apparatus further comprises a matching unit,
the acquisition unit is further configured to collect the sender's operation data on the communication device during communication;
and the matching unit is configured to match the operation data collected during communication against historical operation data, and to verify the sender's identity information by activating the camera device if the operation data collected during communication does not match the historical operation data.
According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text-to-speech method.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described text-to-speech method when executing the program.
Compared with existing text-based communication, the text-to-speech method, apparatus, storage medium and computer device provided by the invention acquire the text information to be converted and carry out multidimensional emotion recognition on it, obtaining an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information; a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result is then called to convert the text information into voice information. Through the converted voice information, the receiver can perceive the sender's emotion, mood and intention at the moment of sending.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
At present, text-based communication cannot convey the emotion or mood of the information sender at the moment of sending. In addition, if the receiver has a low literacy level and cannot read the text information, or has a visual impairment, this mode of communication causes inconvenience for the receiver.
In order to solve the above problems, an embodiment of the present invention provides a text-to-speech method, as shown in fig. 1, where the method includes:
101. Acquiring the text information to be converted.
The text information to be converted is text information sent by the sender to the receiver through communication software. The embodiment of the invention is mainly applicable to scenarios in which text information is converted into voice information during communication. The execution subject of the embodiment of the invention is an apparatus or device capable of performing voice conversion on text information; specifically, it may be a client or a server.
In a specific application scenario, the sender's client has two input modes: a text input mode and a voice input mode. When the sender inputs text information at the client and chooses to perform voice conversion, the sender's client acquires the text information input by the sender and treats it as the text information to be converted; the sender's client then converts the text information directly into voice information and sends the voice information to the receiver's client. Alternatively, the sender can send the input text information directly to the receiver. After the receiver's client receives the text information, the receiver can choose to perform voice conversion in the client; the receiver's client then acquires the received text information, converts it into voice information, and plays it to the receiver.
Moreover, the sender's client can also send the text information input by the sender to the server. After receiving the text information, the server sends an information prompt to the receiver and, on the server side, directly converts the text information into voice information, storing the voice information and the text information in correspondence. When the receiver sees the prompt and learns that there is communication information from the sender, the receiver sends an information acquisition request to the server, and the server sends the text information or the voice information to the receiver's client based on the information receiving mode selected by the receiver.
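The server-side flow described above can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the class and method names (`MessageServer`, `receive_text`, `fetch`) and the placeholder `tts` callable are all assumptions made for the example.

```python
# Minimal sketch of the server-side flow: the server converts the text once,
# stores text and voice side by side, records a prompt for the receiver, and
# serves whichever form the receiver later requests.
class MessageServer:
    def __init__(self, tts):
        self.tts = tts            # any callable mapping text -> voice payload
        self.messages = {}        # message id -> (text, voice)
        self.prompts = []         # pending (receiver, message id) prompts

    def receive_text(self, msg_id, receiver, text):
        voice = self.tts(text)                    # convert on the server side
        self.messages[msg_id] = (text, voice)     # store both forms together
        self.prompts.append((receiver, msg_id))   # notify the receiver

    def fetch(self, msg_id, mode):
        text, voice = self.messages[msg_id]
        return text if mode == "text" else voice  # honor the selected mode

server = MessageServer(tts=lambda t: f"<voice:{t}>")  # stand-in converter
server.receive_text("m1", "receiver-1", "see you at 8")
print(server.fetch("m1", "voice"))
```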
It can be seen that both the acquisition of the text information to be converted and the voice conversion process can be performed at the client or at the server; the embodiment of the present invention places no particular limitation on this.
102. Carrying out multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information.
The multidimensional emotion recognition comprises intention recognition, emotion recognition and mood recognition. Intention recognition results include asking for help, daily chat, requests made of the user, and the like; mood recognition results include statement, doubt, entreaty, exaggeration, assumption, emphasis, rhetorical question, transition, and the like; emotion recognition results include gratitude, pleasure, love, complaint, anger, aversion, fear, and the like.
For the embodiment of the invention, in order to enable the receiver to feel the sender's current mood, emotion and intention, intention recognition, emotion recognition and mood recognition must be performed on the text information during voice conversion. As an optional implementation, the specific process of intention recognition comprises: determining the semantic information vector of each word corresponding to the text information, and inputting the semantic information vectors into a preset intention recognition model for intention recognition to obtain the intention recognition result corresponding to the text information. Further, determining the semantic information vector of each word comprises: determining the query vector, key vector and value vector corresponding to any one word; multiplying the query vector of that word by the key vectors of all the words to obtain the attention score of each word with respect to that word; and multiplying the attention scores by the corresponding value vectors and summing to obtain the semantic information vector of that word. The preset intention recognition model may be a multi-layer perceptron.
Specifically, word segmentation can first be performed on the text information to obtain the word segments corresponding to the text information. The embedded vector of each word segment is then determined using word2vec, and the embedded vectors are input into the attention layer of an encoder for feature extraction. During the processing of the attention layer, different linear transformations are applied to each embedded vector to obtain the query vector, key vector and value vector of each word segment, and the semantic information vector of each word segment is determined from these query, key and value vectors. After the semantic information vector of each word segment is determined, it is input into a multi-layer perceptron for intention recognition. The multi-layer perceptron in effect performs a classification process: it outputs probability values for the different intentions, and the intention with the maximum probability value is determined as the target intention corresponding to the text information.
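The query/key/value computation described above can be sketched numerically. This is a minimal illustration of the attention arithmetic only, with toy random embeddings and hypothetical projection matrices `W_q`, `W_k`, `W_v`; it is not the patented model.

```python
import numpy as np

def semantic_vectors(embeddings, W_q, W_k, W_v):
    """For each word segment: score its query against every key, normalize
    the scores, and sum the value vectors with those weights."""
    Q = embeddings @ W_q                        # query vector per word segment
    K = embeddings @ W_k                        # key vector per word segment
    V = embeddings @ W_v                        # value vector per word segment
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # attention score of each word for each word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                          # weighted sum of value vectors

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))                   # 4 word segments, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
sem = semantic_vectors(emb, W_q, W_k, W_v)
print(sem.shape)                                # one semantic vector per word segment
```

In the method described above, the resulting `sem` matrix would then be fed to the multi-layer perceptron classifier that outputs intention probabilities.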
Further, mood recognition can be performed according to the text content, punctuation and input speed of the text information. For example, if the text information contains an exclamation mark, the mood recognition result is exclamation; if the text information contains an emphatic expression such as "be sure to", the mood recognition result is emphasis; if the input speed of the text information is relatively slow, the mood recognition result is statement.
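These heuristics can be expressed as a simple rule cascade. The function below is an illustrative sketch: the cue-word list and the speed threshold are assumed values, not figures from the invention.

```python
def recognize_mood(text: str, chars_per_second: float) -> str:
    """Toy rule-based mood recognition from punctuation, emphatic words,
    and input speed, mirroring the examples in the description."""
    emphatic_words = ("be sure to", "must", "remember")  # hypothetical emphasis cues
    if "!" in text:
        return "exclamation"
    if any(word in text.lower() for word in emphatic_words):
        return "emphasis"
    if "?" in text:
        return "doubt"
    if chars_per_second < 2.0:  # assumed threshold for "relatively slow" input
        return "statement"
    return "statement"

print(recognize_mood("Call me back!", 5.0))
print(recognize_mood("Be sure to lock the door", 5.0))
```

A production system would of course combine such surface cues with the model-based recognition described above rather than rely on rules alone.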
As a further optional implementation, the method comprises: carrying out word segmentation on the text information to obtain the word segments corresponding to it; inputting the embedded vector of any one word segment into different attention subspaces of the attention layer of an encoder for feature extraction, to obtain the first feature vectors of that word segment in the different attention subspaces; multiplying the first feature vectors of that word segment in the different attention subspaces by the weights of the respective subspaces and summing, to obtain the attention layer output vector of that word segment; and adding the attention layer output vector and the first feature vector to obtain the second feature vector of that word segment, which is then used for recognition.
Further, inputting the embedded vector of any one word segment into different attention subspaces of the attention layer of the encoder for feature extraction, to obtain the first feature vectors of that word segment in the different attention subspaces, comprises: determining, from the embedded vector of that word segment, its query vector, key vector and value vector in each attention subspace; multiplying the query vector of that word segment in each subspace by the key vectors of all the word segments in the same subspace to obtain the attention scores of that word segment in that subspace; and multiplying the attention scores by the corresponding value vectors and summing to obtain the first feature vector of that word segment in that subspace.
The intention recognition result, the emotion recognition result and the mood recognition result corresponding to the text information are thus obtained in the manner described above. It should be noted that the multidimensional emotion recognition in the embodiment of the present invention is not limited to intention recognition, emotion recognition and mood recognition; it may also include emotion recognition in other dimensions.
103. Calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information.
In order to enable the receiver to feel the current emotion, intention and mood of the sender, the embodiment of the invention calls a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result of the text information to perform voice conversion, so as to ensure that the converted voice information contains the current emotion, intention and mood of the sender. Based on this, step 103 specifically comprises: obtaining an initial voice conversion model and multiple sets of model parameters corresponding to the initial voice conversion model; determining, from the multiple sets of model parameters, target model parameters jointly corresponding to the intention recognition result, the emotion recognition result and the mood recognition result; adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result; and calling the matched preset voice conversion model to convert the text information into voice information. Each group of recognition results corresponds to one set of model parameters, and a group of recognition results comprises an intention recognition result, an emotion recognition result and a mood recognition result.
For example, suppose the initial speech conversion model does not include model parameters, the intention recognition result corresponding to the text information is "have request", the mood recognition result is "pray" and the emotion recognition result is "thank you". The target model parameters jointly corresponding to "have request", "pray" and "thank you" are determined from the multiple sets of model parameters, the target model parameters are then added into the initial speech conversion model to obtain the preset speech conversion model, and the text information is converted into voice information by using the preset speech conversion model. According to the converted voice information, the receiver can feel that the intention of the sender is "have request", the mood is "pray" and the emotion is "thank you".
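The selection of target model parameters from multiple sets keyed by a group of recognition results can be sketched as a table lookup; the labels and parameter values below are illustrative, not taken from the embodiment.

```python
# Hypothetical lookup table: each group of recognition results
# (intention, emotion, mood) corresponds to one set of model
# parameters. The labels and parameter values are illustrative only.
PARAMETER_SETS = {
    ("have request", "thank you", "pray"): {"pitch": 1.1, "rate": 0.9},
    ("inform", "neutral", "statement"): {"pitch": 1.0, "rate": 1.0},
}

def select_target_parameters(intention, emotion, mood):
    """Return the target model parameters jointly corresponding to the
    three recognition results, or None when no matching set exists."""
    return PARAMETER_SETS.get((intention, emotion, mood))
```

The returned parameter set would then be loaded into the initial voice conversion model to form the matched preset voice conversion model.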
Further, in the embodiment of the present invention, the emotion recognition result, the intention recognition result and the mood recognition result listed in step 102 may be further refined according to degree. For example, the emotion recognition result includes "thank you" and "no thank you", and "thank you" may be refined according to degree into "special thank you", "comparative thank you" and "general thank you". Specifically, the text is recognized as "thank you" when the output value is greater than 0.5: the emotion recognition result is "general thank you" when the output value is between 0.5 and 0.6, "comparative thank you" when the output value is between 0.6 and 0.8, and "special thank you" when the output value is above 0.8. Different degrees of emotion recognition results can also correspond to different model parameters; for example, "have request", "pray" and "special thank you" correspond to the group A model parameters, while "have request", "pray" and "general thank you" correspond to the group B model parameters. Therefore, according to the method, the emotion recognition result can be divided at a finer granularity, and the accuracy of determining the target model parameters is improved.
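The degree refinement of the "thank you" emotion can be sketched as a simple threshold function over the recogniser's output value, following the ranges given above.

```python
def refine_thank_you(output_value):
    """Map the recogniser's output value to the degree-refined emotion
    label: above 0.5 the text counts as "thank you", refined further
    by the 0.6 and 0.8 thresholds."""
    if output_value > 0.8:
        return "special thank you"
    if output_value > 0.6:
        return "comparative thank you"
    if output_value > 0.5:
        return "general thank you"
    return "no thank you"
```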
After the preset voice conversion model containing the target model parameters is obtained, the text information is input into the preset voice conversion model for voice conversion to obtain voice information. The preset voice conversion model may be a Tacotron model, which mainly comprises an encoder, an attention-based decoder and a post-processing network. Specifically, the encoder first extracts the feature vectors corresponding to the text information, the decoder then converts the feature vectors into spectrogram data, and the post-processing network finally converts the spectrogram data into a waveform, so that the voice information can be output.
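The three-stage flow just described can be shown structurally as follows; the stage bodies are placeholders only, not a real Tacotron implementation, and serve just to illustrate how encoder output feeds the decoder and then the post-processing network.

```python
class TextToSpeechPipeline:
    """Structural sketch of the encoder -> attention decoder ->
    post-processing flow. The stage bodies are placeholders, not a
    real Tacotron implementation."""

    def encode(self, text):
        # Encoder: text -> feature vectors (placeholder: one code per char).
        return [ord(ch) for ch in text]

    def decode(self, features):
        # Attention-based decoder: features -> spectrogram frames.
        return [[f / 255.0] for f in features]

    def postprocess(self, spectrogram):
        # Post-processing network: spectrogram -> waveform samples.
        return [frame[0] for frame in spectrogram]

    def convert(self, text):
        return self.postprocess(self.decode(self.encode(text)))
```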
In a specific application scenario, in order to facilitate a receiver with hearing impairment to acquire communication content, an audio sound wave in voice information can be converted into bone conduction sound wave, and the receiver with hearing impairment can acquire the bone conduction sound wave by means of specific hardware equipment, so that corresponding communication content can be acquired. The audio sound wave is a sound wave which can be normally received by human ears.
Compared with the current mode of communicating with text, the text-to-speech method provided by the embodiment of the invention can acquire the text information to be converted, perform multidimensional emotion recognition on the text information to obtain the intention recognition result, the emotion recognition result and the mood recognition result corresponding to the text information, and call the preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information, so that the receiver can feel the emotion, mood and intention of the sender at that moment through the converted voice information.
Further, in order to better illustrate the text-to-speech process, as a refinement and extension of the foregoing embodiment, an embodiment of the present invention provides another text-to-speech method, as shown in fig. 2, where the method includes:
201. And receiving the text information input by the sender.
For the embodiment of the invention, when a sender sends information to a receiver, the client of the sender can detect the sound decibel of the current environment. When the sound decibel exceeds a preset sound decibel, the client automatically switches to a text input interface to prompt the sender to input text information; if the sound decibel is smaller than or equal to the preset sound decibel, the client outputs a voice input interface and receives voice information input by the sender based on the voice input interface. The preset sound decibel can be set according to actual service requirements.
For example, suppose the preset sound decibel is 70 decibels, and the client of the sender detects the sound decibel of the current environment by means of a sensor. If the sound decibel of the current environment of the sender is detected to be 90 decibels, which is greater than 70 decibels, the client outputs a text input interface and acquires the text information input by the sender through the text input interface. If the sound decibel of the current environment is detected to be 50 decibels, which is less than 70 decibels, voice input by the sender will not be interfered with in the current environment, so the client outputs a voice input interface and receives the voice information input by the sender based on the voice input interface.
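The interface-switching rule in the example above reduces to a single threshold comparison; the following sketch assumes the 70-decibel value from the example.

```python
PRESET_DECIBEL = 70  # threshold used in the example above

def choose_input_mode(ambient_decibel, preset=PRESET_DECIBEL):
    """Output the text input interface when the ambient sound level
    exceeds the preset decibel value; otherwise output the voice
    input interface."""
    return "text" if ambient_decibel > preset else "voice"
```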
In a specific application scenario, the client can also recommend a corresponding input mode to the sender according to the history dialogue record between the sender and a specific receiver. Based on this, the method comprises: obtaining the history dialogue record between the sender and the receiver; determining, based on the history dialogue record, the input mode selected by the sender when the sender last dialogued with the receiver; if the input mode selected by the sender when last dialoguing with the receiver is the text mode, outputting a text input interface and receiving text information input by the sender based on the text input interface; and if the input mode selected by the sender when last dialoguing with the receiver is the voice mode, outputting a voice input interface and receiving voice information input by the sender based on the voice input interface.
Further, in order to ensure the safety of information transmission, when the sender inputs text information at the client, the client collects operation data of the sender for the communication device during the communication process and matches the operation data during the communication process with historical operation data; if the operation data during the communication process does not match the historical operation data, the identity information of the sender is verified by starting a camera device. The historical operation data comprises the text input habits of the sender, such as frequently mistyped characters or high-frequency phrases, and can also comprise the position at which the sender habitually holds the mobile phone.
Specifically, when the sender inputs text information at the client, the client acquires operation data for the communication device during the communication process of the sender, including the holding position of the mobile phone, mistyped characters, used phrases and the like. If the holding position of the mobile phone acquired by the client is different from the habitual holding position of the sender, or the acquired mistyped characters are different from the characters the sender habitually mistypes, the identity of the sender needs to be verified, for example by acquiring a picture of the current mobile phone holder through a camera device and comparing it with a picture of the logged-in client user. If the pictures are consistent, the identity verification of the sender passes and the text information is converted into voice information; if the pictures are inconsistent, the identity verification of the sender fails and the text information is intercepted, that is, it is not converted into voice information.
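The habit-matching check can be sketched as a field-by-field comparison of current against historical operation data; the field names below are illustrative, not taken from the embodiment.

```python
def operation_data_matches(current, history):
    """Compare operation data collected during this communication with
    the sender's historical operation data; a mismatch in any field
    means identity verification should be triggered. The field names
    are illustrative, not taken from the embodiment."""
    fields = ("grip_position", "frequent_typos", "frequent_phrases")
    return all(current.get(f) == history.get(f) for f in fields)
```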
Further, when the identity verification of the sender fails, the text information can be intercepted; alternatively, the text information can be converted normally and a voice prompt sent to the receiver before the voice information is sent, the specific content of which may be that the information sender is not the mobile phone holder, so that the receiver receives the prompt information before receiving the voice information and can avoid being defrauded. It should be noted that the identity verification of the sender may be performed not only before the voice information is sent but also after it is sent; for example, after the voice information is sent to the receiver, identity verification is performed on the sender, and if the sender fails the verification, a voice prompt message is sent to the receiver.
202. And carrying out multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information.
For the embodiment of the present invention, the specific process of performing intent recognition, emotion recognition and mood recognition on the text information is substantially similar to that of step 102, and will not be described herein.
203. If the text information contains special characters, verifying the identity of a sender, and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender to convert the text information input by the sender into voice information.
The special characters comprise characters related to financial transactions, such as money amounts, password verification and account passwords. For the embodiment of the invention, if the text information input by the sender contains special characters related to a financial transaction, in order to avoid the situation that the receiver is defrauded and the interests of the receiver are damaged, the identity of the sender needs to be verified, and a preset voice conversion model matched with both the emotion recognition results and the identity information of the sender is called to convert the text information into voice information. The preset voice conversion model matched with the identity information of the sender is the original voice model of the sender; that is, after the receiver receives the voice information, the identity of the sender can be identified through the voice, so that the safety of the communication information can be ensured and damage to the interests of the receiver avoided.
Specifically, an initial voice conversion model matched with the identity information of the sender and its multiple sets of corresponding model parameters are determined; target model parameters matched with the intention recognition result, the emotion recognition result and the mood recognition result are then determined from the multiple sets of model parameters; the target model parameters are added into the matched initial voice conversion model to obtain a preset voice conversion model matched with both the emotion recognition results and the identity information of the sender; and finally, the text information is converted into voice information by using the preset voice conversion model. The voice information played by the client of the receiver is the original voice of the sender, so that the receiver can identify the identity of the sender.
For the embodiment of the invention, in order to reflect more truly the scene in which the sender inputs the text, corresponding background sound can be added to the converted voice information. As an optional implementation of adding background sound, the method comprises: outputting and displaying a background sound list in response to a received background sound adding instruction; acquiring a selection instruction for selecting a target background sound from the background sound list; and adding the target background sound to the voice information by superimposing the sound wave of the target background sound and the sound wave of the voice information. For example, the sender selects a "Merry Christmas" background sound from the background sound list, the client superimposes the sound wave of the "Merry Christmas" background sound with the sound wave of the voice information and sends the superimposed voice information to the receiver, and the receiver hears the "Merry Christmas" background sound while listening to the voice information.
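Superimposing the background sound wave on the voice sound wave amounts to sample-wise addition; a minimal sketch over plain sample lists, with an illustrative background gain that is not stated in the embodiment:

```python
def superimpose(voice_wave, background_wave, background_gain=0.3):
    """Superimpose the background sound's samples onto the voice
    samples; the shorter signal is zero-padded so lengths match. The
    gain value is an illustrative choice."""
    length = max(len(voice_wave), len(background_wave))
    voice = list(voice_wave) + [0.0] * (length - len(voice_wave))
    background = list(background_wave) + [0.0] * (length - len(background_wave))
    return [v + background_gain * b for v, b in zip(voice, background)]
```

A production system would operate on PCM frames at a shared sample rate and clip or normalise the mixed signal; those details are omitted here.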
Further, as another optional implementation of adding background sound, the method further comprises: obtaining the current dialogue record of the sender and the receiver; determining the scene in which the sender is currently located according to the dialogue record; determining a target background sound matched with the scene in which the sender is currently located; and adding the target background sound to the voice information by superimposing the sound wave of the target background sound and the sound wave of the voice information. For example, the client obtains the dialogue record of the sender and the receiver and determines that the sender is currently at the seaside, so the sound of ocean waves is added to the voice information as background sound; the receiver hears the sound of ocean waves while listening to the voice information, and thus knows that the sender is currently at the seaside.
In a specific application scenario, the sender can also embed encrypted voice in the voice information to prevent the voice information from being maliciously used. For example, in the process of sending identity document information to the receiver, the sender can embed in the voice information an encrypted voice stating that the information is only to be used for transacting credit card business, so that the voice can serve as evidence, facilitating verification for the sender and preventing the interests of the sender from being damaged. The frequency of the sound wave of the encrypted voice is adjusted to a specific frequency that cannot be recognized by the human ear, and the adjusted encrypted voice is added to the voice information by superimposing the sound wave of the adjusted encrypted voice and the sound wave of the voice information.
Specifically, since the encrypted voice may interfere with the receiver listening to the normal voice information, the frequency of the sound wave of the encrypted voice is adjusted to a specific frequency that cannot be recognized by the human ear, and the adjusted encrypted voice is then added to the voice information. The receiver cannot hear the adjusted encrypted voice while listening to the voice information and hears only the normal voice information, so the receiver is not affected.
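The idea of hiding a signal above the audible range can be sketched as follows. A real system would modulate the encrypted voice onto such a carrier rather than emit a plain tone, and the sample rate, carrier frequency and amplitude below are illustrative assumptions.

```python
import math

SAMPLE_RATE = 48_000     # Hz; must exceed twice the carrier frequency
INAUDIBLE_FREQ = 21_000  # Hz; above the roughly 20 kHz limit of human hearing

def inaudible_carrier(num_samples, freq=INAUDIBLE_FREQ, rate=SAMPLE_RATE,
                      amplitude=0.05):
    """Generate a low-amplitude carrier above the audible range; a
    real system would modulate the encrypted voice onto it rather
    than emit a plain tone."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate)
            for i in range(num_samples)]

def embed(voice_wave, hidden_wave):
    # Superimpose the frequency-shifted encrypted voice on the audible voice.
    return [v + h for v, h in zip(voice_wave, hidden_wave)]
```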
For the embodiment of the invention, before voice conversion is performed by using the initial voice conversion model bound with the identity information of the sender and its corresponding multiple sets of model parameters, they need to be trained. Based on this, voice information of multiple groups of scene sentences read aloud by the sender is collected, where the intention, emotion or mood corresponding to the scene sentences differs between groups; the text information respectively corresponding to the multiple groups of voice information read by the sender is determined; and the initial voice conversion model bound with the identity information of the sender and its corresponding multiple sets of model parameters are trained based on the multiple groups of voice information and their respectively corresponding text information. A group of intention, emotion and mood corresponds to one set of model parameters, so the multiple sets of model parameters of the initial voice conversion model bound with the identity information of the sender can be trained from the voice information of the multiple groups of scene sentences read by the sender.
Further, in the process of collecting the voice information, the reliability of the collected voice information needs to be verified, and if a problem is found in the collected voice information, the collection is immediately interrupted. Based on this, the method comprises: performing text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information; respectively performing intention, emotion and mood recognition on the real-time text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the real-time text information; and interrupting the collection of the real-time voice information of the sender if it is determined, according to the intention recognition result corresponding to the real-time text information, that the emotion of the sender is abnormal, or if it is determined that the emotion recognition result corresponding to the real-time text information does not match the mood recognition result.
For example, if the intention recognition result is "luxury", the emotion of the sender is determined to be abnormal, and the collection of the voice information is interrupted. For another example, if the emotion recognition result is "anger" while the mood recognition result is "statement", the emotion recognition result and the mood recognition result do not match, and the collection of the voice information is interrupted.
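The mismatch check between emotion and mood results can be sketched as membership in a whitelist of consistent pairs; the pairs below are illustrative, not an exhaustive set from the embodiment.

```python
# Illustrative pairs of emotion and mood recognition results that are
# considered consistent; any other combination is treated as a mismatch.
CONSISTENT_PAIRS = {
    ("anger", "exclamation"),
    ("calm", "statement"),
    ("thank you", "pray"),
}

def should_interrupt_collection(emotion, mood):
    """Interrupt the collection of real-time voice information when the
    emotion and mood recognition results do not match."""
    return (emotion, mood) not in CONSISTENT_PAIRS
```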
Further, after the initial voice conversion model bound with the identity information of the sender and its corresponding multiple sets of model parameters are trained, the voice intonation of the voice conversion model can be corrected, that is, the model parameters can be optimized. The method comprises: collecting real-time voice information of the sender through a preset device; and optimizing the multiple sets of model parameters of the initial voice conversion model bound with the identity information of the sender based on the collected real-time voice information.
Specifically, the real-time voice information of the sender can be collected, the collected real-time voice information is converted into text information, the real-time voice information of the sender and the corresponding text information are used as a sample training set, and multiple groups of model parameters of the initial voice conversion model are optimized, so that the voice intonation of the converted voice information is closer to the actual voice intonation of the sender.
Further, as another optional implementation mode of model parameter optimization, the method comprises the steps of obtaining specific characters input by the sender and voice information input by the sender for the specific characters, and optimizing a plurality of groups of model parameters of an initial voice conversion model bound with identity information of the sender based on the specific characters and the corresponding voice information. Specifically, the sender can input some unusual characters or characters with specific pronunciation at the client, and read the characters, and the client collects the voice information of the specific characters read by the sender and optimizes a plurality of groups of model parameters of the initial voice conversion model by using the voice information, so that the voice conversion effect of the model can be improved.
In a specific application scenario, the trained initial voice conversion model and its multiple sets of model parameters can be used to convert a test-reading sentence input by the sender into corresponding voice information and play it; the sender can revise the voice information, and the multiple sets of model parameters of the initial voice conversion model are optimized based on the revised voice. Based on this, the method comprises: obtaining a test-reading sentence input by the sender; converting the test-reading sentence into voice information by using the trained multiple sets of model parameters of the initial voice conversion model bound with the identity information of the sender; playing the voice information to the sender and outputting a selection correction interface corresponding to the test-reading sentence; obtaining the revised voice corresponding to the target character, in the test-reading sentence, selected by the sender based on the selection correction interface; and optimizing the multiple sets of model parameters of the initial voice conversion model bound with the identity information of the sender based on the revised voice. The target character can be any one, two or more characters in the test-reading sentence.
For example, after the sender hears the voice information corresponding to the test-reading sentence, if the sender is not satisfied with the pronunciation of "seaside", the sender may select the target character "seaside" in the selection correction interface corresponding to the test-reading sentence and correct the voice intonation of "seaside", so as to obtain the revised voice corresponding to the test-reading sentence; the multiple sets of model parameters corresponding to the initial voice conversion model are then optimized based on the revised voice, so that the voice conversion effect of the optimized model parameters is closer to the true voice of the sender.
In a specific application scenario, a preset dialect voice conversion model bound with the identity information of the sender can be built in advance; when the text information input by the sender is in dialect, the preset dialect voice conversion model can be called to convert the text information into dialect voice information. Based on this, when the text information is in dialect, the output voice type selected by the sender is obtained. If the output voice type is dialect voice, the preset dialect voice conversion model bound with the identity information of the sender is called to convert the dialect input by the sender into dialect voice information. If the output voice type is standard voice, Mandarin conversion is performed on the dialect by using a preset dialect word library to obtain standard text information corresponding to the dialect, multidimensional emotion recognition is performed on the standard text information to obtain the recognition results of the standard text information in multiple dimensions, and the preset voice conversion model matched with both the recognition results and the identity information of the sender is called to convert the standard text information into standard voice information. The preset dialect word library comprises dialect phrases and their corresponding standard phrases.
Specifically, when the text information input by the sender is dialect, the sender can select the output voice type, and when the output voice type is dialect voice, a preset dialect voice conversion model bound with the identity information of the sender is called to convert the dialect into dialect voice information, wherein the preset dialect voice conversion model is obtained by training the collected dialect voice information read by the sender and the corresponding dialect text information. Further, when the output voice type is standard voice, the input dialect can be converted into standard text information by using the preset dialect word stock, and then the matched preset voice conversion model is called for voice conversion to obtain standard voice information, wherein the specific conversion process of the standard voice is identical to the above process in the step 203, and is not repeated here.
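The Mandarin conversion using the preset dialect word library can be sketched as phrase substitution over the input text. The lexicon below uses English stand-in phrases purely for illustration; a real library would map dialect phrases to their Mandarin counterparts.

```python
# Hypothetical preset dialect word library mapping dialect phrases to
# their standard counterparts (English stand-ins for illustration; a
# real library would map dialect phrases to Mandarin phrases).
DIALECT_LEXICON = {
    "gonna": "going to",
    "y'all": "you all",
}

def to_standard_text(dialect_text, lexicon=DIALECT_LEXICON):
    """Replace dialect phrases with standard phrases; longer phrases
    are replaced first so shorter entries cannot split them."""
    for phrase in sorted(lexicon, key=len, reverse=True):
        dialect_text = dialect_text.replace(phrase, lexicon[phrase])
    return dialect_text
```

The standard text produced this way would then go through the multidimensional emotion recognition and voice conversion steps of step 203.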
Compared with the current mode of communicating with text, the text-to-speech method provided by the embodiment of the invention can acquire the text information to be converted, perform multidimensional emotion recognition on the text information to obtain the intention recognition result, the emotion recognition result and the mood recognition result corresponding to the text information, and call the preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information, so that the receiver can feel the emotion, mood and intention of the sender at that moment through the converted voice information.
Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a text-to-speech apparatus, as shown in fig. 3, where the apparatus includes an obtaining unit 31, a recognition unit 32, and a conversion unit 33.
The obtaining unit 31 may be configured to obtain text information to be converted.
The recognition unit 32 may be configured to perform multidimensional emotion recognition on the text information, so as to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information.
The conversion unit 33 may be configured to invoke a preset speech conversion model that matches the intent recognition result, the emotion recognition result, and the mood recognition result, and convert the text information into speech information.
In a specific application scenario, the conversion unit 33, as shown in fig. 4, includes a first acquisition module 331, a first determination module 332, an adding module 333, and a conversion module 334.
The first obtaining module 331 may be configured to obtain an initial speech conversion model and a plurality of sets of model parameters corresponding to the initial speech conversion model.
The first determining module 332 may be configured to determine, from the multiple sets of model parameters, a target model parameter that corresponds to the intent recognition result, the emotion recognition result, and the mood recognition result together, where each set of emotion recognition results corresponds to a set of model parameters, and the set of emotion recognition results includes the intent recognition result, the emotion recognition result, and the mood recognition result.
The adding module 333 may be configured to add the target model parameter to the initial speech conversion model to obtain a preset speech conversion model that matches the intent recognition result, the emotion recognition result, and the mood recognition result.
The conversion module 334 may be configured to invoke the matched preset voice conversion model to convert the text information into voice information.
In a specific application scenario, the obtaining unit 31 may be specifically configured to receive text information input by a sender.
The conversion unit 33 may be specifically configured to verify the identity of the sender if the text information includes special characters, and call a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result and the identity information of the sender, so as to convert the text information input by the sender into voice information.
In a specific application scenario, the acquiring unit 31, as shown in fig. 4, includes a detecting module 311, a receiving module 312, a second acquiring module 313, and a second determining module 314.
The detection module 311 may be configured to detect a sound decibel of an environment in which the sender is currently located.
The receiving module 312 may be configured to output a text input interface if the sound decibel is greater than a preset sound decibel, and receive text information input by the sender based on the text input interface.
The second obtaining module 313 may be configured to obtain a history dialogue record between the sender and the receiver.
The second determining module 314 may be configured to determine, based on the history of conversations, an input mode selected when the sender last conversations with the receiver.
The receiving module 312 may be further configured to output a text input interface if the input mode selected by the sender when the sender last dialogues with the receiver is a text mode, and receive text information input by the sender based on the text input interface.
In a specific application scenario, the device further comprises an acquisition unit 34 and a training unit 35.
The collection unit 34 may be configured to collect voice information of the sender reading multiple groups of scene sentences, where intention, emotion or mood corresponding to the scene sentences in different groups are different.
The training unit 35 may be configured to determine text information corresponding to each of the plurality of sets of voice information read by the sender, and train an initial voice conversion model bound to the identity information of the sender and a plurality of sets of model parameters corresponding to the initial voice conversion model based on the plurality of sets of voice information and the text information corresponding to the voice information.
In a specific application scenario, the apparatus further comprises an optimization unit 36.
The acquisition unit 34 may also be configured to acquire real-time voice information of the sender through a predetermined device.
The optimizing unit 36 may be configured to optimize, based on the collected real-time voice information, a plurality of sets of model parameters of an initial voice conversion model bound with the identity information of the sender.
The obtaining unit 31 may be further configured to obtain a specific text input by the sender, and voice information recorded by the sender for the specific text.
The optimizing unit 36 may be further configured to optimize the multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the specific text and its corresponding voice information.
In a specific application scenario, the device further comprises an interruption unit 37.
The conversion unit 33 may be further configured to perform text conversion on the real-time voice information, so as to obtain real-time text information corresponding to the real-time voice information.
The recognition unit 32 may be further configured to perform intention, emotion and mood recognition on the real-time text information, so as to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the real-time text information.
The interruption unit 37 may be configured to interrupt the collection of the sender's real-time voice information if it is determined, according to the intention recognition result corresponding to the real-time text information, that the emotion of the sender is abnormal, or if the emotion recognition result corresponding to the real-time text information does not match the mood recognition result.
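The interruption condition just described can be sketched as a simple check. In the following Python illustration, the set of abnormal intentions and the emotion-to-mood compatibility table are assumptions chosen for the example, not values from this disclosure:

```python
# Illustrative sketch: collection is interrupted when the intention result
# indicates an abnormal emotion, or when the emotion and mood results
# do not match. The sets and mappings below are hypothetical examples.

ABNORMAL_INTENTIONS = {"insult", "threat"}

def should_interrupt(intention, emotion, mood, compatible_moods):
    """compatible_moods maps each emotion to the set of moods it matches."""
    if intention in ABNORMAL_INTENTIONS:
        return True  # abnormal emotion inferred from the intention result
    return mood not in compatible_moods.get(emotion, set())  # mismatch

compatible_moods = {"happy": {"gentle", "excited"}, "angry": {"firm"}}
print(should_interrupt("greeting", "happy", "firm", compatible_moods))  # True
```

The example prints `True` because a "happy" emotion paired with a "firm" mood is treated as a mismatch under the assumed compatibility table.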
In a specific application scenario, the obtaining unit 31 may be further configured to obtain a test reading sentence input by the sender.
The conversion unit 33 may be further configured to convert the test reading sentence into voice information by using the multiple groups of model parameters of the trained initial voice conversion model bound to the identity information of the sender, play the voice information to the sender, and output a selection correction interface corresponding to the test reading sentence.
The obtaining unit 31 may be further configured to obtain the revised voice corresponding to the target character selected by the sender from the test reading sentence based on the selection correction interface.
The optimizing unit 36 may be further configured to optimize the multiple groups of model parameters of the initial voice conversion model bound to the identity information of the sender based on the revised voice.
In a specific application scenario, the obtaining unit 31 may be further configured to obtain the output voice type selected by the sender when the text information is in dialect.
The conversion unit 33 may be further configured to, if the output voice type is dialect voice, invoke a preset dialect voice conversion model bound to the identity information of the sender to convert the dialect input by the sender into dialect voice information.
The conversion unit 33 may be further configured to, if the output voice type is standard voice, convert the dialect into Mandarin by using a preset dialect word stock, so as to obtain standard text information corresponding to the dialect.
The recognition unit 32 may be further configured to perform multidimensional emotion recognition on the standard text information, so as to obtain emotion recognition results of the standard text information in multiple dimensions.
The conversion unit 33 may be further configured to invoke a preset voice conversion model that matches the multidimensional emotion recognition results and the identity information of the sender, and convert the standard text information into standard voice information.
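The dialect branch above might be sketched as follows. The word-stock entries, model names and return values in this Python illustration are assumptions made for the example only:

```python
# Illustrative sketch of the dialect branch: dialect voice output uses the
# dialect conversion model directly; standard voice output first maps dialect
# words to Mandarin through a preset word stock. All entries are hypothetical.

def to_standard_text(dialect_text, word_stock):
    """Replace each dialect word with its standard equivalent."""
    for dialect_word, standard_word in word_stock.items():
        dialect_text = dialect_text.replace(dialect_word, standard_word)
    return dialect_text

def convert_dialect(text, output_voice_type, word_stock):
    if output_voice_type == "dialect":
        return ("dialect_model", text)       # dialect voice conversion model
    standard = to_standard_text(text, word_stock)
    return ("standard_model", standard)      # then multidimensional recognition

stock = {"gonna": "going to"}
print(convert_dialect("gonna eat", "standard", stock))
# ('standard_model', 'going to eat')
```

For brevity the sketch returns a model label; in the described device the standard text would further undergo multidimensional emotion recognition before synthesis.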
In a specific application scenario, the conversion unit 33 may be further configured to convert the audio sound wave in the voice information into a bone conduction sound wave in response to a received sound wave conversion instruction.
In a specific application scenario, the device further comprises a superposition unit 38.
The superimposing unit 38 may be configured to output and display a background sound list in response to a received background sound adding instruction, obtain a selection instruction for selecting a target background sound from the background sound list, and add the target background sound to the voice information by superimposing the sound wave of the target background sound on the sound wave of the voice information. Alternatively, the superimposing unit 38 may obtain the current dialogue record of the sender and the receiver, determine the scene in which the sender is currently located according to the dialogue record, determine a target background sound matching that scene, and add the target background sound to the voice information by superimposing the sound wave of the target background sound on the sound wave of the voice information.
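The superposition of sound waves described above amounts to sample-wise mixing, which can be sketched as follows. The gain value and the looping of a shorter background are assumptions for this Python illustration:

```python
# Illustrative sketch: a background sound is added to the voice by summing
# the two sample sequences; the background loops if it is shorter and is
# attenuated by an assumed gain so that it does not mask the voice.

def superimpose(voice_samples, background_samples, bg_gain=0.3):
    """Mix attenuated background samples into the voice samples."""
    n_bg = len(background_samples)
    return [v + bg_gain * background_samples[i % n_bg]
            for i, v in enumerate(voice_samples)]

voice = [0.5, -0.2, 0.1, 0.4]
background = [0.1, 0.1]
mixed = [round(s, 2) for s in superimpose(voice, background)]
print(mixed)  # [0.53, -0.17, 0.13, 0.43]
```

The same sample-wise addition applies whether the target background sound comes from the displayed list or from scene matching.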
In a specific application scenario, the obtaining unit 31 may be further configured to obtain the encrypted voice of the sender in response to an encrypted voice adding instruction triggered by the sender.
The superimposing unit 38 may be further configured to adjust the frequency of the sound wave in the encrypted voice to a specific frequency inaudible to the human ear, and to add the adjusted encrypted voice to the voice information by superimposing the sound wave of the adjusted encrypted voice on the sound wave of the voice information.
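One way to shift the encrypted voice above the audible band is to amplitude-modulate it onto an ultrasonic carrier. The following Python sketch is only one possible realization; the carrier frequency (just above the roughly 20 kHz limit of human hearing) and the sample rate are assumptions for the example:

```python
import math

# Illustrative sketch: the encrypted samples are modulated onto a carrier
# above the audible range, then summed with the voice samples. The carrier
# frequency and sample rate are assumptions for the example.

SAMPLE_RATE = 48_000   # Hz; above twice the carrier frequency (Nyquist)
CARRIER_HZ = 21_000    # just above the ~20 kHz limit of human hearing

def to_inaudible(samples, carrier_hz=CARRIER_HZ, rate=SAMPLE_RATE):
    """Amplitude-modulate the samples onto the ultrasonic carrier."""
    return [s * math.sin(2 * math.pi * carrier_hz * i / rate)
            for i, s in enumerate(samples)]

def embed_encrypted(voice_samples, encrypted_samples):
    """Superimpose the frequency-shifted encrypted voice on the voice."""
    shifted = to_inaudible(encrypted_samples)
    return [v + (shifted[i] if i < len(shifted) else 0.0)
            for i, v in enumerate(voice_samples)]
```

A receiver with the matching demodulation step could recover the encrypted voice, while an ordinary listener hears only the original voice information.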
In a specific application scenario, the device further comprises a matching unit 39.
The collection unit 34 may be further configured to collect operation data of the sender on a communication device during the current communication process.
The matching unit 39 may be configured to match the operation data of the current communication process with historical operation data, and, if the two do not match, verify the identity information of the sender by activating the camera device.
It should be noted that, for other corresponding descriptions of each functional module related to the text-to-speech device provided by the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 1, which are not repeated herein.
Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the following steps: acquiring text information to be converted; performing multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information; and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information.
Based on the method embodiment shown in fig. 1 and the device shown in fig. 3, an embodiment of the present invention further provides a computer device, whose physical structure is shown in fig. 5. The computer device comprises a processor 41, a memory 42 and a computer program stored on the memory 42 and executable on the processor, the memory 42 and the processor 41 both being arranged on a bus 43. When the processor 41 executes the program, the following steps are implemented: acquiring text information to be converted; performing multidimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a mood recognition result corresponding to the text information; and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result to convert the text information into voice information.
According to the technical scheme of the present invention, text information to be converted can be obtained, and multidimensional emotion recognition is performed on the text information to obtain the intention recognition result, the emotion recognition result and the mood recognition result corresponding to the text information; a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the mood recognition result is then called to convert the text information into voice information. Through the converted voice information, the receiver can perceive the emotion, mood and intention of the sender at that moment; in addition, receivers with a low literacy level or with visual impairment can obtain the communication content more conveniently.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of computing devices, and may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by the computing devices. In some cases, the steps shown or described may be performed in an order different from that shown or described; alternatively, the modules or steps may be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.