
CN110718208A - Voice synthesis method and system based on multitask acoustic model - Google Patents

Voice synthesis method and system based on multitask acoustic model

Info

Publication number
CN110718208A
Authority
CN
China
Prior art keywords
multitask
data
acoustic model
synthesized
attribute
Prior art date
Legal status
Pending
Application number
CN201910977818.1A
Other languages
Chinese (zh)
Inventor
罗浩源
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201910977818.1A
Publication of CN110718208A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech synthesis, and discloses a speech synthesis method and system based on a multitask acoustic model, which are used to solve the problem that speech attribute control is difficult to perform in speech synthesis tasks. The invention comprises: a multitask synthesis condition obtaining module, configured to obtain multitask speech synthesis conditions, where the multitask speech synthesis conditions include the text to be synthesized and the speech attributes to be synthesized; a multitask synthesis condition processing module, configured to process the conditions to be synthesized into data to be synthesized; a multitask acoustic model obtaining module, configured to obtain a pre-generated multitask acoustic model, where the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology; a generating module, configured to generate acoustic parameters according to the multitask acoustic model and the data to be synthesized; and a synthesis module, configured to perform speech synthesis according to the generated acoustic parameters to obtain multitask synthesized speech. The invention is suitable for speech synthesis.

Description

Voice synthesis method and system based on multitask acoustic model
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a speech synthesis method and system based on a multitask acoustic model.
Background
Speech synthesis technology is widely applied in daily life: it converts text information in real time into audio that sounds like a speaker reading the text, providing a new mode of human-computer interaction. Mainstream speech synthesis technologies are currently divided into concatenative methods, parametric methods, hybrid methods, and end-to-end methods based on deep learning. At present, these mainstream technologies mainly address the speech synthesis task for a single speaker in a single language with no emotion or a single emotion. When a multi-speaker system is built, the same system must be trained separately for each speaker and a new system with exactly the same functional flow must be set up, so system extensibility is poor, resource occupation is high, and efficiency is low. In addition, when a speaker has training speech in only one language, a speech synthesis system built from that corpus can usually synthesize speech only for text in the languages contained in the training corpus, and speech attributes such as emotion, context and tone are uncontrollable, so the expressiveness of the system is low. As people's demands on computer systems grow, these problems become increasingly obvious and urgent to solve.
Disclosure of Invention
The technical problem to be solved by the invention is that speech attribute control is difficult to perform in speech synthesis tasks; the invention provides a method and a system to address this problem.
To solve the above problem, the invention adopts the following technical scheme:
the voice synthesis method based on the multitask acoustic model comprises the following steps:
acquiring multitask speech synthesis conditions, wherein the multitask speech synthesis conditions comprise: the text to be synthesized and the speech attributes to be synthesized, and the speech attributes to be synthesized may include: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized;
Processing the multitask voice synthesis condition into data to be synthesized;
acquiring a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
inputting the data to be synthesized into the multitask acoustic model for acoustic parameter generation;
and carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
By adopting the multitask acoustic model, the method can realize single-model multitask speech synthesis, including language migration, emotion migration, context migration and tone migration. Even if a speaker in the training corpus has only single-language data for a single attribute combination, cross-language speech synthesis, and likewise migration of the other attributes, can be realized under the speech synthesis method based on the multitask acoustic model.
Specifically, the step of processing the multitask speech synthesis condition into data to be synthesized according to the present invention may include:
processing the text to be synthesized into a pinyin sequence vector through text normalization, word segmentation, part-of-speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
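By way of illustration only, and not as part of the claimed method, the following Python sketch shows one way the text front end described above could be realised. It assumes the open-source jieba and pypinyin packages, uses a trivial placeholder in place of a real prosody prediction model, and maps tokens to ids through a toy pinyin dictionary.

import jieba.posseg as pseg
from pypinyin import lazy_pinyin, Style

def text_to_pinyin_sequence(text, pinyin_vocab):
    """Turn raw text into a pinyin sequence vector (a list of token ids)."""
    # Word segmentation and part-of-speech tagging (jieba); text normalization is omitted here.
    words = [(w.word, w.flag) for w in pseg.lcut(text)]
    tokens = []
    for word, _pos in words:
        # Text-to-pinyin conversion with tone numbers.
        tokens.extend(lazy_pinyin(word, style=Style.TONE3))
        # Placeholder "prosody prediction": a hypothetical prosodic-word boundary symbol after each word.
        tokens.append("#1")
    # Vectorization: look each pinyin/prosody token up in a pinyin dictionary.
    return [pinyin_vocab.get(t, pinyin_vocab["<unk>"]) for t in tokens]

# Toy usage:
vocab = {"<unk>": 0, "ni3": 1, "hao3": 2, "#1": 3}
print(text_to_pinyin_sequence("你好", vocab))  # -> [1, 2, 3]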
Specifically, the steps of generating the multitask acoustic model in advance in the invention comprise:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps: taking the multitask attribute data as the control condition of the model, and converting the control condition into a condition feature vector using an embedding technology and a variational self-coding technology (Variational Autoencoder, VAE); taking the multitask text labeling data as the data input of the model, and converting the data input into a data input feature vector using the embedding technology; and applying the multitask speech data to the output end and an intermediate-layer input end during multitask acoustic model training, while applying the condition feature vector and the data input feature vector to the input end of the multitask acoustic model, and controlling model training and parameter convergence with one or more loss functions.
By performing joint training on the multitask data and the multitask acoustic model, this training method reduces the amount of data that must be collected for any single speaker, reduces the collection of multi-language, multi-emotion, multi-context and multi-tone corpora for a single speaker, and establishes a speaker space, thereby greatly reducing the time and economic cost of data preparation and realizing functions such as speaker emotion migration and language migration.
Corresponding to the method, the invention provides a voice synthesis system based on a multitask acoustic model, which comprises the following modules:
a multitask synthesis condition obtaining module, configured to obtain a multitask speech synthesis condition, where the multitask speech synthesis condition includes: text to be synthesized and voice attribute to be synthesized;
the multitask synthesis condition processing module is used for processing the conditions to be synthesized into data to be synthesized;
the multitask acoustic model obtaining module is used for obtaining a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
the generating module is used for generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
and the synthesis module is used for carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
Specifically, the attributes of the speech to be synthesized include: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
Specifically, the step of processing the multitask speech synthesis condition into data to be synthesized by the multitask synthesis condition processing module includes:
processing the text to be synthesized into a pinyin sequence vector through text normalization, word segmentation, part-of-speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
Specifically, the step of generating the multitask acoustic model in advance by the multitask acoustic model obtaining module includes:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
The invention has the beneficial effects that a single speech synthesis system can be deployed online with different timbres, languages, emotions and contexts, while the problems of heavy computing resource occupation and complex deployment in the speech synthesis process are alleviated.
Drawings
FIG. 1 is a flowchart of a first embodiment.
FIG. 2 is a flow chart of the second embodiment.
Fig. 3 is a schematic structural diagram of the third embodiment.
Fig. 4 is a schematic structural diagram of a fourth embodiment.
Detailed Description
The embodiments aim to have one model complete multiple speech synthesis tasks through deep learning, embedding and variational self-coding techniques. On the one hand this solves problems such as language migration and timbre migration; on the other hand it greatly reduces hardware computing resources and system deployment complexity, lowering the operating cost of the system.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.
Example one
Fig. 1 shows a flow chart of a method of multitasking acoustic model generation comprising:
S11: acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask speech data, multitask text data and multitask attribute data.
For example, when fitting a multitask speech synthesis model that mixes Chinese and English, male and female voices, happy and sad emotions, and customer-service and broadcast contexts, the acquired multitask data may include, but is not limited to, speech data of a Chinese-speaking female speaker in a customer-service context with a happy emotion together with its corresponding multitask attribute labels and multitask text label data, and speech data of an English-speaking male speaker in a broadcast context with a sad emotion together with its corresponding multitask attribute labels and multitask text label data. Speech data of more speakers under other emotions or contexts may also be used, but when generating a multitask acoustic model under the above conditions, at least the target attribute elements male and female voice, Chinese and English, happy and sad, customer-service context and broadcast context must be covered; the multitask speech data can be any combination or arrangement of speech data with these attributes, and the corresponding multitask text data and multitask attribute data are the specific labels of the corresponding speech data. After the multitask speech data, multitask text data and multitask attribute data are acquired, they are processed into a set of multitask data in the data format required by the acoustic model. For example, feature extraction is performed on the multitask audio data, and applicable features include acoustic features such as the frequency spectrum, cepstrum, fundamental frequency and duration. For example, the processed multitask text data may include pinyin, prosody, text-to-audio alignment information and the like. For example, the processed multitask attribute features may include the duration interval of an attribute, the attribute category and the like.
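As a non-limiting illustration of the feature extraction mentioned above, the sketch below uses the librosa library with assumed frame parameters to obtain a log-mel spectrum, MFCC cepstral features, a fundamental frequency track and an utterance duration from one audio file; the invention itself is not tied to these particular tools or settings.

import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000, n_fft=1024, hop=256, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    # Spectrum: log-mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Cepstrum-style features: MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    # Fundamental frequency (F0) track.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
                            sr=sr, frame_length=n_fft, hop_length=hop)
    # Duration of the utterance in seconds.
    duration = len(y) / sr
    return {"log_mel": log_mel, "mfcc": mfcc, "f0": np.nan_to_num(f0), "duration": duration}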
S12, acquiring a multi-task acoustic model to be trained;
the multi-task acoustic model to be trained is a deep neural network based on an Embedding (Embedding) technology and a Variational self-encoding (Variational automatic encoder) technology, such as a sequence-to-sequence neural network. Accordingly, the multitask acoustic model generated in step S13 is also a deep neural network acoustic model.
S13: training according to the multitask data and the multitask acoustic model to be trained to generate a multitask acoustic model. The generation of the multitask acoustic model is a machine-learning procedure based on deep learning; the multitask acoustic model refers to a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network.
The embedding technology and the variational self-coding technology form different attribute spaces according to the input multitask data, and vectors in these spaces control the attribute feature output of the subsequent neural network.
The training process comprises the following steps: taking the multitask attribute data as the control condition of the model, and converting the control condition into a condition feature vector using the embedding technology and the variational self-coding technology (Variational Autoencoder); taking the multitask text labeling data as the data input of the model, and converting the data input into a data input feature vector using the embedding technology; and applying the multitask speech data to the output end and an intermediate-layer input end during multitask acoustic model training, while applying the condition feature vector and the data input feature vector to the input end of the multitask acoustic model, and controlling model training and parameter convergence with one or more loss functions.
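For illustration only, the following PyTorch sketch shows one way such a conditioned network could be wired up: attribute labels pass through an embedding layer, a variational (VAE-style) reference encoder produces a latent vector from the target acoustic features, and both condition vectors are concatenated with the text-input feature vectors before a decoder; a reconstruction loss and a KL term provide the "one or more loss functions". All layer sizes, names and the simplified topology are assumptions for the sketch, not the patented network; in particular, the target mel spectrogram is assumed to be pre-aligned to one frame per input token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAcousticSketch(nn.Module):
    def __init__(self, n_pinyin=500, n_attr_values=16, n_attrs=5,
                 d_text=256, d_attr=32, d_lat=16, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(n_pinyin, d_text)        # data input -> data input feature vector
        self.attr_emb = nn.Embedding(n_attr_values, d_attr)   # attribute data -> condition feature vector
        self.ref_enc = nn.GRU(n_mels, 128, batch_first=True)  # VAE-style encoder over reference speech
        self.to_mu = nn.Linear(128, d_lat)
        self.to_logvar = nn.Linear(128, d_lat)
        self.decoder = nn.GRU(d_text + n_attrs * d_attr + d_lat, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, pinyin_ids, attr_ids, ref_mel):
        text_vec = self.text_emb(pinyin_ids)                  # (B, T, d_text)
        attr_vec = self.attr_emb(attr_ids).flatten(1)          # (B, n_attrs * d_attr)
        _, h = self.ref_enc(ref_mel)                           # speech data at an intermediate-layer input
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)         # reparameterised latent
        cond = torch.cat([attr_vec, z], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, text_vec.size(1), -1)
        out, _ = self.decoder(torch.cat([text_vec, cond], dim=-1))      # couple condition with data input
        return self.to_mel(out), mu, logvar

def training_step(model, pinyin_ids, attr_ids, target_mel, optimizer):
    pred_mel, mu, logvar = model(pinyin_ids, attr_ids, target_mel)
    recon = F.l1_loss(pred_mel, target_mel)                              # speech data at the output end
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())        # VAE regularisation term
    loss = recon + 1e-3 * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()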
Unlike single-task model generation, the multitask model generation in this embodiment can take all of the multitask data as input at once and directly produce a multitask acoustic model. There is no need to train one model separately for each kind of data and then make all the models work together: multitask work is performed directly on one model, and the feature output of the acoustic model is controlled through the multitask attribute conditions. This alleviates the problems of complex model deployment and resource occupation, and also addresses problems such as language migration and emotion migration. Personalized speech synthesis can also be performed to a certain extent.
Example two
Fig. 2 shows a flow chart of a speech synthesis method based on a multitask acoustic model, comprising:
S21: acquiring multitask speech synthesis conditions;
the multitask voice synthesis condition comprises a text to be synthesized and a voice attribute to be synthesized. Wherein the voice attribute to be synthesized comprises: the speech conditions to be synthesized, the emotion conditions to be synthesized, the tone conditions of the speaker to be synthesized, the context conditions to be synthesized, the tone conditions to be synthesized and the like. For example, a well-trained multi-task acoustic model has Chinese and English languages, happy and serious emotions, male voice and female voice timbre, customer service, general context, question and statement tone. At the moment, a Chinese character, seriousness, male voice, customer service and 'hello' under the attribute of statement are required to be synthesized, and a user is about to come on a credit card repayment date and asks for repayment in time. "here, the speech synthesis conditions are: chinese, serious, male voice, customer service, states as the attribute of the voice to be synthesized, and the text to be synthesized is 'hello, you pay a credit card on the coming date, please pay in time'.
S22, processing the condition to be synthesized into data to be synthesized;
the processing of the text to be synthesized typically includes: text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion, vectorization processing and the like, wherein the preprocessing is consistent with a common voice synthesis method, a pinyin sequence with prosody and part of speech tagging is finally obtained, and a pinyin sequence vector is obtained through a pinyin dictionary. The processing of the speech attributes to be synthesized generally includes attribute normalization, attribute numeralization, attribute vectorization, and the like, for example, in a multitasking acoustic model set forth in S21, the attributes mentioned in S21 include chinese, english language, happy, serious emotion, male voice, female voice tone, customer service, general context, question, and statement, an attribute vector of [1,2,1,1,2] is obtained after combined processing, the first 1 in the vector represents that chinese is the first language in chinese and english with options, the following numerical meanings are analogized with each other, the vector is processed by an Embedding technology (Embedding) technology to obtain an attribute matrix, and the attribute matrix is coupled with a sequence vector to obtain data to be synthesized.
S23, acquiring a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model, an embedding technology and a variational self-coding technology, and the combination mode is as follows: connecting and coupling the condition characteristic vectors generated by utilizing an embedding technology and a variational self-coding technology with one or more data input characteristic vectors in the single-task acoustic model network process;
the multitask acoustic model is generated by adopting the method of the embodiment I.
S24, generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
Specifically, the data to be synthesized generated in S22 may be directly input into the multitask acoustic model.
S25, carrying out voice synthesis according to the generated acoustic parameters to obtain multi-task synthesized voice;
the principles of acoustic parameter generation and speech synthesis may be applied in the form of existing vocoders.
Different from existing methods, firstly the multitask acoustic model adopted in this embodiment differs from the ordinary acoustic model of existing methods, and secondly the data input during speech synthesis in this embodiment requires not only text data but also target attribute conditions as a supplement to control the speech attributes of the synthesized speech. Through the multitask acoustic model, this embodiment reduces the deployment complexity and resource occupation of the speech synthesis system, and also realizes language migration, emotion migration and the like: for example, if a female speaker in the training data has only Chinese corpora, after the model is trained together with English corpora of other speakers, English speech in her voice can be synthesized, and the same applies to migration of other attributes. Personalized speech synthesis can also be performed to a certain extent.
EXAMPLE III
Fig. 3 shows a schematic structural diagram of a multitask acoustic model generating device, comprising:
Z31: a multitask data obtaining module, configured to obtain multitask data, where the multitask data is obtained by correspondingly processing multitask speech data, multitask text data and multitask attribute data;
in this embodiment, the functions of S11 in the first embodiment are mainly implemented, for example, when we fit into a multitask speech synthesis model with chinese-english mixing, male-female voice mixing, sad emotion mixing, customer service context mixing and broadcast context mixing, the obtained multitask data may include, but is not limited to, speech data of the customer service context under the happy emotion of a certain chinese speaking female and its corresponding multitask attribute label and corresponding multitask text label data, and speech data of the broadcast context under the sad emotion of another english speaking male and its corresponding multitask attribute label and corresponding multitask text label data; the voice data of more speakers under different emotions or contexts can also be adopted, but when the multitask acoustic model of the conditions is generated, at least several target attribute elements of male and female voice, Chinese and English, happy and sad, customer service context and broadcast context are included, the multitask voice data can be any combination or arrangement of the voice data of the attributes, and the corresponding multitask text data and the corresponding multitask attribute data are specific labels of the corresponding voice data. After acquiring the multitask voice data, the multitask text data and the multitask attribute data, processing the data into a group of multitask data through a data format required by an acoustic model, for example, performing feature extraction on the multitask audio data, wherein applicable features include: acoustic features such as frequency spectrum, cepstrum, fundamental frequency, duration, etc. For example, the multitask text data may include pinyin, prosody, text-to-audio alignment information, and the like. For example, the multitask attribute feature is processed, which may include a duration interval of the attribute, an attribute category, and the like.
Z32: a model obtaining module, configured to obtain a multitask acoustic model to be trained;
in this embodiment, the function described in S12 in the first embodiment is mainly implemented, where the acoustic model is a deep neural network based on an Embedding (Embedding) technique and a variational self-encoding (VAE) technique, such as a sequence-to-sequence neural network. Accordingly, the generated acoustic model is a deep neural network acoustic model.
Z33: a generating module, configured to perform joint training according to the multitask data and the multitask acoustic model to be trained, generating a multitask acoustic model.
This module mainly implements the function described in S13 of Example one; the embedding technology and the variational self-coding technology form different attribute spaces according to the input multitask data, and vectors in these spaces control the attribute feature output of the subsequent neural network.
Unlike single-task model generation, multitask model generation can take all of the multitask data as input at once and directly produce a multitask acoustic model, so there is no need to train one model separately for each kind of data and then make all the models work together: multitask work is performed directly on one model, and the feature output of the acoustic model is controlled through the multitask attribute conditions. This alleviates the problems of complex model deployment and resource occupation, and also addresses problems such as language migration and emotion migration. Personalized speech synthesis can also be performed to a certain extent.
Example four
Fig. 4 shows a schematic structural diagram of a speech synthesis system based on a multitask acoustic model, comprising:
Z41: a multitask synthesis condition obtaining module, configured to obtain multitask speech synthesis conditions;
the multitask voice synthesis condition comprises a text to be synthesized and a voice attribute to be synthesized. Wherein the voice attribute to be synthesized comprises: the speech conditions to be synthesized, the emotion conditions to be synthesized, the tone conditions of the speaker to be synthesized, the context conditions to be synthesized, the tone conditions to be synthesized and the like.
Z42: a multitask synthesis condition processing module, configured to process the conditions to be synthesized into data to be synthesized;
the processing of the text to be synthesized typically includes: text normalization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and the like. The processing of the speech attributes to be synthesized generally includes attribute normalization, attribute numeralization, attribute vectorization, and the like. And combining the processing results of the text and the attributes to obtain the data to be synthesized.
Z43: a multitask acoustic model obtaining module, configured to obtain a pre-generated multitask acoustic model, where the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
the multitask acoustic model can be generated by adopting the method of one embodiment;
Z44: a generating module, configured to generate acoustic parameters according to the multitask acoustic model and the data to be synthesized;
Z45: a synthesis module, configured to perform speech synthesis according to the generated acoustic parameters to obtain multitask synthesized speech;
the principles of acoustic parameter generation and speech synthesis may adopt an existing manner, and different from the existing manner, firstly, the multitask acoustic model adopted in the present embodiment distinguishes a common acoustic model of the existing manner, and secondly, data input during speech synthesis in the present embodiment not only needs text data, but also needs a target attribute condition as a supplement to control speech attributes of synthesized speech. According to the embodiment, the deployment complexity of the voice synthesis system and the reduction of resource occupation can be realized through the multi-task acoustic model, the problems of language migration, emotion migration and the like can also be realized, and personalized voice synthesis can also be performed to a certain extent.
It should be noted that, in this embodiment, each module (or unit) is in a logical sense, and in particular, when the embodiment is implemented, a plurality of modules (or units) may be combined into one module (or unit), and one module (or unit) may also be split into a plurality of modules (or units).
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The voice synthesis method based on the multitask acoustic model is characterized by comprising the following steps:
acquiring a multitask voice synthesis condition, wherein the multitask voice synthesis condition comprises the following steps: text to be synthesized and voice attribute to be synthesized;
processing the multitask voice synthesis condition into data to be synthesized;
acquiring a pre-generated multitask acoustic model;
inputting the data to be synthesized into the multitask acoustic model for acoustic parameter generation;
and carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
2. The method of claim 1, wherein the speech attributes to be synthesized comprise: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
3. A method for multitasking acoustic model based speech synthesis as claimed in claim 1 characterised in that the step of processing the multitasking speech synthesis conditions into data to be synthesized comprises:
processing a text to be synthesized into a pinyin sequence vector through text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
4. The method for speech synthesis based on multitask acoustic model according to claim 1, characterized in that said multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technique and a variational self-coding technique, and its combination mode is: and connecting and coupling the condition characteristic vectors generated by utilizing an embedding technology and a variational self-coding technology with one or more data input characteristic vectors in the single-task acoustic model network process.
5. The method of multitask acoustic model based speech synthesis according to claim 4 wherein the step of pre-generating a multitask acoustic model comprises:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
6. The voice synthesis system based on the multitask acoustic model is characterized by comprising the following modules:
a multitask synthesis condition obtaining module, configured to obtain a multitask speech synthesis condition, where the multitask speech synthesis condition includes: text to be synthesized and voice attribute to be synthesized;
the multitask synthesis condition processing module is used for processing the to-be-synthesized conditions into to-be-synthesized data;
the multitask acoustic model acquisition module is used for acquiring a pre-generated multitask acoustic model;
the generating module is used for generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
and the synthesis module is used for carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
7. The system of claim 6, wherein the speech attributes to be synthesized comprise: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
8. The system of claim 6, wherein the step of the multitask synthesis condition processing module processing the multitask speech synthesis conditions into data to be synthesized comprises:
processing a text to be synthesized into a pinyin sequence vector through text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
9. The speech synthesis system based on the multitask acoustic model according to claim 6, characterized in that the multitask acoustic model obtained by the multitask acoustic model obtaining module is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network.
10. The system of claim 9, wherein the step of generating the multitask acoustic model in advance by the multitask acoustic model obtaining module comprises:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
CN201910977818.1A 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model Pending CN110718208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977818.1A CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977818.1A CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Publications (1)

Publication Number Publication Date
CN110718208A true CN110718208A (en) 2020-01-21

Family

ID=69211682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977818.1A Pending CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Country Status (1)

Country Link
CN (1) CN110718208A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN111739509A (en) * 2020-06-16 2020-10-02 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN112767910A (en) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 Audio information synthesis method and device, computer readable medium and electronic equipment
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409765A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113611286A (en) * 2021-10-08 2021-11-05 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN115240635A (en) * 2022-07-22 2022-10-25 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, medium, and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366733A (en) * 2012-03-30 2013-10-23 株式会社东芝 Text to speech system
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A mixed language speech synthesis method and device
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366733A (en) * 2012-03-30 2013-10-23 株式会社东芝 Text to speech system
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A mixed language speech synthesis method and device
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ya-Jie Zhang et al., "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis", ICASSP 2019 *
李德毅 (Li Deyi), 于剑 (Yu Jian) et al., "Introduction to Artificial Intelligence" (CAST New Generation Information Technology Series), China Science and Technology Press, 31 August 2018 *
黄国捷 (Huang Guojie) et al., "Enhanced Variational Autoencoder for Non-Parallel Corpus Voice Conversion", Journal of Signal Processing *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767910A (en) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 Audio information synthesis method and device, computer readable medium and electronic equipment
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN111739509A (en) * 2020-06-16 2020-10-02 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN111739509B (en) * 2020-06-16 2022-03-22 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN112397083B (en) * 2020-11-13 2024-05-24 Oppo广东移动通信有限公司 Voice processing method and related device
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409765A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113611286A (en) * 2021-10-08 2021-11-05 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN115240635A (en) * 2022-07-22 2022-10-25 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, medium, and electronic device
CN115240635B (en) * 2022-07-22 2025-03-07 北京有竹居网络技术有限公司 Speech synthesis method, device, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110718208A (en) Voice synthesis method and system based on multitask acoustic model
CN112352275B (en) Neural text-to-speech synthesis with multi-level text information
WO2021047233A1 (en) Deep learning-based emotional speech synthesis method and device
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
CN109036371B (en) Audio data generation method and system for speech synthesis
CN111477216A (en) A training method and system for a sound-meaning understanding model for dialogue robots
CN110148399A (en) A kind of control method of smart machine, device, equipment and medium
CN111949784A (en) Outbound method and device based on intention recognition
CN116049360A (en) Intervention method and system for speech skills in intelligent voice dialogue scenes based on customer portraits
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113257225B (en) A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN113628609A (en) Automatic audio content generation
CN115983282A (en) Prompt-based high-efficiency small sample dialogue semantic understanding method
CN115798456A (en) Cross-language emotion voice synthesis method and device and computer equipment
CN113299272A (en) Speech synthesis model training method, speech synthesis apparatus, and storage medium
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
KR102624790B1 (en) Natural language processing apparatus for intent analysis and processing of multi-intent speech, program and its control method
CN117133269A (en) Speech synthesis method, device, electronic equipment and storage medium
CN109524000A (en) Offline implementation method and device
CN114121010A (en) Model training, speech generation, speech interaction method, device and storage medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN116705058B (en) Processing method of multimode voice task, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200121