
CN110718208A - Voice synthesis method and system based on multitask acoustic model - Google Patents

Voice synthesis method and system based on multitask acoustic model

Info

Publication number
CN110718208A
Authority
CN
China
Prior art keywords
multitask
data
acoustic model
synthesized
attribute
Prior art date
Legal status
Pending
Application number
CN201910977818.1A
Other languages
Chinese (zh)
Inventor
罗浩源
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201910977818.1A
Publication of CN110718208A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech synthesis, and discloses a speech synthesis method and system based on a multitask acoustic model, which are used to solve the problem that speech attribute control is difficult to perform in speech synthesis tasks. The invention comprises: a multitask synthesis condition obtaining module, configured to obtain multitask speech synthesis conditions, where the multitask speech synthesis conditions include the text to be synthesized and the speech attributes to be synthesized; a multitask synthesis condition processing module, configured to process the conditions to be synthesized into data to be synthesized; a multitask acoustic model obtaining module, configured to obtain a pre-generated multitask acoustic model, where the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology; a generating module, configured to generate acoustic parameters according to the multitask acoustic model and the data to be synthesized; and a synthesis module, configured to perform speech synthesis according to the generated acoustic parameters to obtain multitask synthesized speech. The invention is suitable for speech synthesis.

Description

Voice synthesis method and system based on multitask acoustic model
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a speech synthesis method and system based on a multitask acoustic model.
Background
Speech synthesis technology is widely applied in daily life: it converts text information in real time into audio that sounds like a speaker reading the text, providing a new mode of human-computer interaction. Mainstream speech synthesis technologies are currently divided into concatenative methods, parametric methods, hybrid methods, and end-to-end methods based on deep learning. At present, these mainstream technologies mainly address the speech synthesis task for a single speaker in a single language with no emotion or a single emotion. When a multi-speaker system is built, the same system must be trained separately for each speaker and a new system with exactly the same functional flow must be set up, so system extensibility is poor, resource occupation is high, and efficiency is low. In addition, when a speaker has training speech in only one language, a speech synthesis system built from that corpus can usually synthesize speech only for text in the languages contained in the training corpus, and speech attributes such as emotion, context and tone are uncontrollable, so the expressiveness of the system is low. As people's demands on computer systems grow, these problems become increasingly obvious and urgent to solve.
Disclosure of Invention
The technical problem to be solved by the invention is that speech attribute control is difficult to perform in speech synthesis tasks; the invention provides a method and a system to address this problem.
To solve the above problem, the invention adopts the following technical scheme:
the voice synthesis method based on the multitask acoustic model comprises the following steps:
acquiring multitask speech synthesis conditions, wherein the multitask speech synthesis conditions comprise: the text to be synthesized and the speech attributes to be synthesized, and the speech attributes to be synthesized may include: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized;
Processing the multitask voice synthesis condition into data to be synthesized;
acquiring a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
inputting the data to be synthesized into the multitask acoustic model for acoustic parameter generation;
and carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
By adopting the multitask acoustic model, the method can realize single-model multitask speech synthesis, including language migration, emotion migration, context migration and tone migration. Even if a speaker in the training corpus has only single-language data for a single attribute combination, cross-language speech synthesis, and likewise migration of the other attributes, can be realized under the speech synthesis method based on the multitask acoustic model.
Specifically, the step of processing the multitask speech synthesis condition into data to be synthesized according to the present invention may include:
processing the text to be synthesized into a pinyin sequence vector through text normalization, word segmentation, part-of-speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
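By way of illustration only, and not as part of the claimed method, the following Python sketch shows one way the text front end described above could be realised. It assumes the open-source jieba and pypinyin packages, uses a trivial placeholder in place of a real prosody prediction model, and maps tokens to ids through a toy pinyin dictionary.

import jieba.posseg as pseg
from pypinyin import lazy_pinyin, Style

def text_to_pinyin_sequence(text, pinyin_vocab):
    """Turn raw text into a pinyin sequence vector (a list of token ids)."""
    # Word segmentation and part-of-speech tagging (jieba); text normalization is omitted here.
    words = [(w.word, w.flag) for w in pseg.lcut(text)]
    tokens = []
    for word, _pos in words:
        # Text-to-pinyin conversion with tone numbers.
        tokens.extend(lazy_pinyin(word, style=Style.TONE3))
        # Placeholder "prosody prediction": a hypothetical prosodic-word boundary symbol after each word.
        tokens.append("#1")
    # Vectorization: look each pinyin/prosody token up in a pinyin dictionary.
    return [pinyin_vocab.get(t, pinyin_vocab["<unk>"]) for t in tokens]

# Toy usage:
vocab = {"<unk>": 0, "ni3": 1, "hao3": 2, "#1": 3}
print(text_to_pinyin_sequence("你好", vocab))  # -> [1, 2, 3]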
Specifically, the steps of generating the multitask acoustic model in advance in the invention comprise:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps: taking the multitask attribute data as the control condition of the model, and converting the control condition into a condition feature vector using an embedding technology and a variational self-coding technology (Variational Autoencoder, VAE); taking the multitask text labeling data as the data input of the model, and converting the data input into a data input feature vector using the embedding technology; and applying the multitask speech data to the output end and an intermediate-layer input end during multitask acoustic model training, while applying the condition feature vector and the data input feature vector to the input end of the multitask acoustic model, and controlling model training and parameter convergence with one or more loss functions.
By performing joint training on the multitask data and the multitask acoustic model, this training method reduces the amount of data that must be collected for any single speaker, reduces the collection of multi-language, multi-emotion, multi-context and multi-tone corpora for a single speaker, and establishes a speaker space, thereby greatly reducing the time and economic cost of data preparation and realizing functions such as speaker emotion migration and language migration.
Corresponding to the method, the invention provides a voice synthesis system based on a multitask acoustic model, which comprises the following modules:
a multitask synthesis condition obtaining module, configured to obtain a multitask speech synthesis condition, where the multitask speech synthesis condition includes: text to be synthesized and voice attribute to be synthesized;
the multitask synthesis condition processing module is used for processing the conditions to be synthesized into data to be synthesized;
the multitask acoustic model obtaining module is used for obtaining a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
the generating module is used for generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
and the synthesis module is used for carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
Specifically, the attributes of the speech to be synthesized include: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
Specifically, the step of processing the multitask speech synthesis condition into data to be synthesized by the multitask synthesis condition processing module includes:
processing the text to be synthesized into a pinyin sequence vector through text normalization, word segmentation, part-of-speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
Specifically, the step of generating the multitask acoustic model in advance by the multitask acoustic model obtaining module includes:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
The invention has the beneficial effects that a single speech synthesis system can be deployed online with different timbres, languages, emotions and contexts, while the problems of heavy computing resource occupation and complex deployment in the speech synthesis process are alleviated.
Drawings
FIG. 1 is a flowchart of a first embodiment.
FIG. 2 is a flow chart of the second embodiment.
Fig. 3 is a schematic structural diagram of the third embodiment.
Fig. 4 is a schematic structural diagram of a fourth embodiment.
Detailed Description
The embodiments aim to have one model complete multiple speech synthesis tasks through deep learning, embedding and variational self-coding techniques. On the one hand this solves problems such as language migration and timbre migration; on the other hand it greatly reduces hardware computing resources and system deployment complexity, lowering the operating cost of the system.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.
Example one
Fig. 1 shows a flow chart of a method of multitasking acoustic model generation comprising:
S11: acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask speech data, multitask text data and multitask attribute data.
For example, when fitting a multitask speech synthesis model that mixes Chinese and English, male and female voices, happy and sad emotions, and customer-service and broadcast contexts, the acquired multitask data may include, but is not limited to, speech data of a Chinese-speaking female speaker in a customer-service context with a happy emotion together with its corresponding multitask attribute labels and multitask text label data, and speech data of an English-speaking male speaker in a broadcast context with a sad emotion together with its corresponding multitask attribute labels and multitask text label data. Speech data of more speakers under other emotions or contexts may also be used, but when generating a multitask acoustic model under the above conditions, at least the target attribute elements male and female voice, Chinese and English, happy and sad, customer-service context and broadcast context must be covered; the multitask speech data can be any combination or arrangement of speech data with these attributes, and the corresponding multitask text data and multitask attribute data are the specific labels of the corresponding speech data. After the multitask speech data, multitask text data and multitask attribute data are acquired, they are processed into a set of multitask data in the data format required by the acoustic model. For example, feature extraction is performed on the multitask audio data, and applicable features include acoustic features such as the frequency spectrum, cepstrum, fundamental frequency and duration. For example, the processed multitask text data may include pinyin, prosody, text-to-audio alignment information and the like. For example, the processed multitask attribute features may include the duration interval of an attribute, the attribute category and the like.
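As a non-limiting illustration of the feature extraction mentioned above, the sketch below uses the librosa library with assumed frame parameters to obtain a log-mel spectrum, MFCC cepstral features, a fundamental frequency track and an utterance duration from one audio file; the invention itself is not tied to these particular tools or settings.

import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000, n_fft=1024, hop=256, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    # Spectrum: log-mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Cepstrum-style features: MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    # Fundamental frequency (F0) track.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
                            sr=sr, frame_length=n_fft, hop_length=hop)
    # Duration of the utterance in seconds.
    duration = len(y) / sr
    return {"log_mel": log_mel, "mfcc": mfcc, "f0": np.nan_to_num(f0), "duration": duration}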
S12, acquiring a multi-task acoustic model to be trained;
the multi-task acoustic model to be trained is a deep neural network based on an Embedding (Embedding) technology and a Variational self-encoding (Variational automatic encoder) technology, such as a sequence-to-sequence neural network. Accordingly, the multitask acoustic model generated in step S13 is also a deep neural network acoustic model.
S13: training according to the multitask data and the multitask acoustic model to be trained to generate a multitask acoustic model. The generation of the multitask acoustic model is a machine-learning procedure based on deep learning; the multitask acoustic model refers to a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network.
The embedding technology and the variational self-coding technology form different attribute spaces according to the input multitask data, and vectors in these spaces control the attribute feature output of the subsequent neural network.
The training process comprises the following steps: taking the multitask attribute data as the control condition of the model, and converting the control condition into a condition feature vector using the embedding technology and the variational self-coding technology (Variational Autoencoder); taking the multitask text labeling data as the data input of the model, and converting the data input into a data input feature vector using the embedding technology; and applying the multitask speech data to the output end and an intermediate-layer input end during multitask acoustic model training, while applying the condition feature vector and the data input feature vector to the input end of the multitask acoustic model, and controlling model training and parameter convergence with one or more loss functions.
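For illustration only, the following PyTorch sketch shows one way such a conditioned network could be wired up: attribute labels pass through an embedding layer, a variational (VAE-style) reference encoder produces a latent vector from the target acoustic features, and both condition vectors are concatenated with the text-input feature vectors before a decoder; a reconstruction loss and a KL term provide the "one or more loss functions". All layer sizes, names and the simplified topology are assumptions for the sketch, not the patented network; in particular, the target mel spectrogram is assumed to be pre-aligned to one frame per input token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAcousticSketch(nn.Module):
    def __init__(self, n_pinyin=500, n_attr_values=16, n_attrs=5,
                 d_text=256, d_attr=32, d_lat=16, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(n_pinyin, d_text)        # data input -> data input feature vector
        self.attr_emb = nn.Embedding(n_attr_values, d_attr)   # attribute data -> condition feature vector
        self.ref_enc = nn.GRU(n_mels, 128, batch_first=True)  # VAE-style encoder over reference speech
        self.to_mu = nn.Linear(128, d_lat)
        self.to_logvar = nn.Linear(128, d_lat)
        self.decoder = nn.GRU(d_text + n_attrs * d_attr + d_lat, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, pinyin_ids, attr_ids, ref_mel):
        text_vec = self.text_emb(pinyin_ids)                  # (B, T, d_text)
        attr_vec = self.attr_emb(attr_ids).flatten(1)          # (B, n_attrs * d_attr)
        _, h = self.ref_enc(ref_mel)                           # speech data at an intermediate-layer input
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)         # reparameterised latent
        cond = torch.cat([attr_vec, z], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, text_vec.size(1), -1)
        out, _ = self.decoder(torch.cat([text_vec, cond], dim=-1))      # couple condition with data input
        return self.to_mel(out), mu, logvar

def training_step(model, pinyin_ids, attr_ids, target_mel, optimizer):
    pred_mel, mu, logvar = model(pinyin_ids, attr_ids, target_mel)
    recon = F.l1_loss(pred_mel, target_mel)                              # speech data at the output end
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())        # VAE regularisation term
    loss = recon + 1e-3 * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()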
Unlike single-task model generation, the multitask model generation in this embodiment can take all of the multitask data as input at once and directly produce a multitask acoustic model. There is no need to train one model separately for each kind of data and then make all the models work together: multitask work is performed directly on one model, and the feature output of the acoustic model is controlled through the multitask attribute conditions. This alleviates the problems of complex model deployment and resource occupation, and also addresses problems such as language migration and emotion migration. Personalized speech synthesis can also be performed to a certain extent.
Example two
Fig. 2 shows a flow chart of a speech synthesis method based on a multitask acoustic model, comprising:
S21: acquiring multitask speech synthesis conditions;
the multitask voice synthesis condition comprises a text to be synthesized and a voice attribute to be synthesized. Wherein the voice attribute to be synthesized comprises: the speech conditions to be synthesized, the emotion conditions to be synthesized, the tone conditions of the speaker to be synthesized, the context conditions to be synthesized, the tone conditions to be synthesized and the like. For example, a well-trained multi-task acoustic model has Chinese and English languages, happy and serious emotions, male voice and female voice timbre, customer service, general context, question and statement tone. At the moment, a Chinese character, seriousness, male voice, customer service and 'hello' under the attribute of statement are required to be synthesized, and a user is about to come on a credit card repayment date and asks for repayment in time. "here, the speech synthesis conditions are: chinese, serious, male voice, customer service, states as the attribute of the voice to be synthesized, and the text to be synthesized is 'hello, you pay a credit card on the coming date, please pay in time'.
S22, processing the condition to be synthesized into data to be synthesized;
the processing of the text to be synthesized typically includes: text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion, vectorization processing and the like, wherein the preprocessing is consistent with a common voice synthesis method, a pinyin sequence with prosody and part of speech tagging is finally obtained, and a pinyin sequence vector is obtained through a pinyin dictionary. The processing of the speech attributes to be synthesized generally includes attribute normalization, attribute numeralization, attribute vectorization, and the like, for example, in a multitasking acoustic model set forth in S21, the attributes mentioned in S21 include chinese, english language, happy, serious emotion, male voice, female voice tone, customer service, general context, question, and statement, an attribute vector of [1,2,1,1,2] is obtained after combined processing, the first 1 in the vector represents that chinese is the first language in chinese and english with options, the following numerical meanings are analogized with each other, the vector is processed by an Embedding technology (Embedding) technology to obtain an attribute matrix, and the attribute matrix is coupled with a sequence vector to obtain data to be synthesized.
S23, acquiring a pre-generated multitask acoustic model, wherein the multitask acoustic model is a network formed by combining a single-task acoustic model, an embedding technology and a variational self-coding technology, and the combination mode is as follows: connecting and coupling the condition characteristic vectors generated by utilizing an embedding technology and a variational self-coding technology with one or more data input characteristic vectors in the single-task acoustic model network process;
the multitask acoustic model is generated by adopting the method of the embodiment I.
S24, generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
Specifically, the data to be synthesized generated in S22 may be directly input into the multitask acoustic model.
S25, carrying out voice synthesis according to the generated acoustic parameters to obtain multi-task synthesized voice;
the principles of acoustic parameter generation and speech synthesis may be applied in the form of existing vocoders.
Different from existing methods, firstly the multitask acoustic model adopted in this embodiment differs from the ordinary acoustic model of existing methods, and secondly the data input during speech synthesis in this embodiment requires not only text data but also target attribute conditions as a supplement to control the speech attributes of the synthesized speech. Through the multitask acoustic model, this embodiment reduces the deployment complexity and resource occupation of the speech synthesis system, and also realizes language migration, emotion migration and the like: for example, if a female speaker in the training data has only Chinese corpora, after the model is trained together with English corpora of other speakers, English speech in her voice can be synthesized, and the same applies to migration of other attributes. Personalized speech synthesis can also be performed to a certain extent.
EXAMPLE III
Fig. 3 shows a schematic structural diagram of a multitask acoustic model generating device, comprising:
Z31: a multitask data obtaining module, configured to obtain multitask data, where the multitask data is obtained by correspondingly processing multitask speech data, multitask text data and multitask attribute data;
in this embodiment, the functions of S11 in the first embodiment are mainly implemented, for example, when we fit into a multitask speech synthesis model with chinese-english mixing, male-female voice mixing, sad emotion mixing, customer service context mixing and broadcast context mixing, the obtained multitask data may include, but is not limited to, speech data of the customer service context under the happy emotion of a certain chinese speaking female and its corresponding multitask attribute label and corresponding multitask text label data, and speech data of the broadcast context under the sad emotion of another english speaking male and its corresponding multitask attribute label and corresponding multitask text label data; the voice data of more speakers under different emotions or contexts can also be adopted, but when the multitask acoustic model of the conditions is generated, at least several target attribute elements of male and female voice, Chinese and English, happy and sad, customer service context and broadcast context are included, the multitask voice data can be any combination or arrangement of the voice data of the attributes, and the corresponding multitask text data and the corresponding multitask attribute data are specific labels of the corresponding voice data. After acquiring the multitask voice data, the multitask text data and the multitask attribute data, processing the data into a group of multitask data through a data format required by an acoustic model, for example, performing feature extraction on the multitask audio data, wherein applicable features include: acoustic features such as frequency spectrum, cepstrum, fundamental frequency, duration, etc. For example, the multitask text data may include pinyin, prosody, text-to-audio alignment information, and the like. For example, the multitask attribute feature is processed, which may include a duration interval of the attribute, an attribute category, and the like.
Z32: a model obtaining module, configured to obtain a multitask acoustic model to be trained;
in this embodiment, the function described in S12 in the first embodiment is mainly implemented, where the acoustic model is a deep neural network based on an Embedding (Embedding) technique and a variational self-encoding (VAE) technique, such as a sequence-to-sequence neural network. Accordingly, the generated acoustic model is a deep neural network acoustic model.
Z33: a generating module, configured to perform joint training according to the multitask data and the multitask acoustic model to be trained, generating a multitask acoustic model.
This module mainly implements the function described in S13 of Example one; the embedding technology and the variational self-coding technology form different attribute spaces according to the input multitask data, and vectors in these spaces control the attribute feature output of the subsequent neural network.
Unlike single-task model generation, multitask model generation can take all of the multitask data as input at once and directly produce a multitask acoustic model, so there is no need to train one model separately for each kind of data and then make all the models work together: multitask work is performed directly on one model, and the feature output of the acoustic model is controlled through the multitask attribute conditions. This alleviates the problems of complex model deployment and resource occupation, and also addresses problems such as language migration and emotion migration. Personalized speech synthesis can also be performed to a certain extent.
Example four
Fig. 4 shows a schematic structural diagram of a speech synthesis system based on a multitask acoustic model, comprising:
Z41: a multitask synthesis condition obtaining module, configured to obtain multitask speech synthesis conditions;
the multitask voice synthesis condition comprises a text to be synthesized and a voice attribute to be synthesized. Wherein the voice attribute to be synthesized comprises: the speech conditions to be synthesized, the emotion conditions to be synthesized, the tone conditions of the speaker to be synthesized, the context conditions to be synthesized, the tone conditions to be synthesized and the like.
Z42: a multitask synthesis condition processing module, configured to process the conditions to be synthesized into data to be synthesized;
the processing of the text to be synthesized typically includes: text normalization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and the like. The processing of the speech attributes to be synthesized generally includes attribute normalization, attribute numeralization, attribute vectorization, and the like. And combining the processing results of the text and the attributes to obtain the data to be synthesized.
Z43: a multitask acoustic model obtaining module, configured to obtain a pre-generated multitask acoustic model, where the multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network;
the multitask acoustic model can be generated by adopting the method of one embodiment;
Z44: a generating module, configured to generate acoustic parameters according to the multitask acoustic model and the data to be synthesized;
Z45: a synthesis module, configured to perform speech synthesis according to the generated acoustic parameters to obtain multitask synthesized speech;
the principles of acoustic parameter generation and speech synthesis may adopt an existing manner, and different from the existing manner, firstly, the multitask acoustic model adopted in the present embodiment distinguishes a common acoustic model of the existing manner, and secondly, data input during speech synthesis in the present embodiment not only needs text data, but also needs a target attribute condition as a supplement to control speech attributes of synthesized speech. According to the embodiment, the deployment complexity of the voice synthesis system and the reduction of resource occupation can be realized through the multi-task acoustic model, the problems of language migration, emotion migration and the like can also be realized, and personalized voice synthesis can also be performed to a certain extent.
It should be noted that, in this embodiment, each module (or unit) is in a logical sense, and in particular, when the embodiment is implemented, a plurality of modules (or units) may be combined into one module (or unit), and one module (or unit) may also be split into a plurality of modules (or units).
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The voice synthesis method based on the multitask acoustic model is characterized by comprising the following steps:
acquiring a multitask voice synthesis condition, wherein the multitask voice synthesis condition comprises the following steps: text to be synthesized and voice attribute to be synthesized;
processing the multitask voice synthesis condition into data to be synthesized;
acquiring a pre-generated multitask acoustic model;
inputting the data to be synthesized into the multitask acoustic model for acoustic parameter generation;
and carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
2. The method of claim 1, wherein the speech attributes to be synthesized comprise: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
3. A method for multitasking acoustic model based speech synthesis as claimed in claim 1 characterised in that the step of processing the multitasking speech synthesis conditions into data to be synthesized comprises:
processing a text to be synthesized into a pinyin sequence vector through text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
4. The method for speech synthesis based on multitask acoustic model according to claim 1, characterized in that said multitask acoustic model is a network formed by combining a single-task acoustic model with an embedding technique and a variational self-coding technique, and its combination mode is: and connecting and coupling the condition characteristic vectors generated by utilizing an embedding technology and a variational self-coding technology with one or more data input characteristic vectors in the single-task acoustic model network process.
5. The method of multitask acoustic model based speech synthesis according to claim 4 wherein the step of pre-generating a multitask acoustic model comprises:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
6. The voice synthesis system based on the multitask acoustic model is characterized by comprising the following modules:
a multitask synthesis condition obtaining module, configured to obtain a multitask speech synthesis condition, where the multitask speech synthesis condition includes: text to be synthesized and voice attribute to be synthesized;
the multitask synthesis condition processing module is used for processing the to-be-synthesized conditions into to-be-synthesized data;
the multitask acoustic model acquisition module is used for acquiring a pre-generated multitask acoustic model;
the generating module is used for generating acoustic parameters according to the multitask acoustic model and the data to be synthesized;
and the synthesis module is used for carrying out voice synthesis according to the generated acoustic parameters to obtain the multi-task synthesized voice.
7. The system of claim 6, wherein the speech attributes to be synthesized comprise: the language condition, emotion condition, speaker timbre condition, context condition and tone condition to be synthesized.
8. The system of claim 6, wherein the step of the multitask synthesis condition processing module processing the multitask speech synthesis conditions into data to be synthesized comprises:
processing a text to be synthesized into a pinyin sequence vector through text standardization, word segmentation, part of speech tagging, prosody prediction, text-to-pinyin conversion and vectorization;
processing the speech attributes to be synthesized into an attribute vector through attribute normalization, attribute numericalization and attribute vectorization;
and processing the attribute vector by utilizing an embedding technology to obtain an attribute matrix, and coupling the attribute matrix with the pinyin sequence vector to obtain data to be synthesized.
9. The speech synthesis system based on the multitask acoustic model according to claim 6, characterized in that the multitask acoustic model obtained by the multitask acoustic model obtaining module is a network formed by combining a single-task acoustic model with an embedding technology and a variational self-coding technology, the combination being: the condition feature vectors generated using the embedding technology and the variational self-coding technology are connected and coupled with one or more data input feature vectors in the single-task acoustic model network.
10. The system of claim 9, wherein the step of generating the multitask acoustic model in advance by the multitask acoustic model obtaining module comprises:
acquiring multitask data, wherein the multitask data is obtained by correspondingly processing multitask voice data, multitask text data and multitask attribute data;
acquiring a multi-task acoustic model to be trained;
training according to the multitask data and the multitask acoustic model to be trained to generate a trained multitask acoustic model, wherein the method for training the model comprises the following steps:
taking the multitask attribute data as a control condition of a model, and converting the control condition into a condition feature vector by utilizing an embedding technology and a variational self-coding technology;
and using the multitask text labeling data as data input of the model, and converting the data input into a data input characteristic vector by using an embedding technology;
and applying the multitask voice data to an output end and an intermediate layer input end during the multitask acoustic model training, simultaneously applying the condition characteristic vector and the data input characteristic vector to the input end of the multitask acoustic model, and controlling the model training and the parameter convergence by using one or more loss functions.
CN201910977818.1A 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model Pending CN110718208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977818.1A CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977818.1A CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Publications (1)

Publication Number Publication Date
CN110718208A true CN110718208A (en) 2020-01-21

Family

ID=69211682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977818.1A Pending CN110718208A (en) 2019-10-15 2019-10-15 Voice synthesis method and system based on multitask acoustic model

Country Status (1)

Country Link
CN (1) CN110718208A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN111739509A (en) * 2020-06-16 2020-10-02 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN112767910A (en) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 Audio information synthesis method and device, computer readable medium and electronic equipment
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409765A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113611286A (en) * 2021-10-08 2021-11-05 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN115240635A (en) * 2022-07-22 2022-10-25 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, medium, and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366733A (en) * 2012-03-30 2013-10-23 株式会社东芝 Text to speech system
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A mixed language speech synthesis method and device
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366733A (en) * 2012-03-30 2013-10-23 株式会社东芝 Text to speech system
US20170092258A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and system for text-to-speech synthesis
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A mixed language speech synthesis method and device
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ya-Jie Zhang et al., "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis", ICASSP 2019 *
李德毅 (Li Deyi), 于剑 (Yu Jian) et al., "Introduction to Artificial Intelligence" (CAST New Generation Information Technology Series), China Science and Technology Press, 31 August 2018 *
黄国捷 (Huang Guojie) et al., "Enhanced Variational Autoencoder for Non-Parallel Corpus Voice Conversion", Journal of Signal Processing *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767910A (en) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 Audio information synthesis method and device, computer readable medium and electronic equipment
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN111739509A (en) * 2020-06-16 2020-10-02 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN111739509B (en) * 2020-06-16 2022-03-22 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN112397083B (en) * 2020-11-13 2024-05-24 Oppo广东移动通信有限公司 Voice processing method and related device
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409765A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113611286A (en) * 2021-10-08 2021-11-05 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN115240635A (en) * 2022-07-22 2022-10-25 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, medium, and electronic device
CN115240635B (en) * 2022-07-22 2025-03-07 北京有竹居网络技术有限公司 Speech synthesis method, device, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110718208A (en) Voice synthesis method and system based on multitask acoustic model
CN112352275B (en) Neural text-to-speech synthesis with multi-level text information
WO2021047233A1 (en) Deep learning-based emotional speech synthesis method and device
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
CN109036371B (en) Audio data generation method and system for speech synthesis
CN111477216A (en) A training method and system for a sound-meaning understanding model for dialogue robots
CN110148399A (en) A kind of control method of smart machine, device, equipment and medium
CN111949784A (en) Outbound method and device based on intention recognition
CN116049360A (en) Intervention method and system for speech skills in intelligent voice dialogue scenes based on customer portraits
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113257225B (en) A kind of emotional speech synthesis method and system integrating vocabulary and phoneme pronunciation features
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN113628609A (en) Automatic audio content generation
CN115983282A (en) Prompt-based high-efficiency small sample dialogue semantic understanding method
CN115798456A (en) Cross-language emotion voice synthesis method and device and computer equipment
CN113299272A (en) Speech synthesis model training method, speech synthesis apparatus, and storage medium
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
KR102624790B1 (en) Natural language processing apparatus for intent analysis and processing of multi-intent speech, program and its control method
CN117133269A (en) Speech synthesis method, device, electronic equipment and storage medium
CN109524000A (en) Offline implementation method and device
CN114121010A (en) Model training, speech generation, speech interaction method, device and storage medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN116705058B (en) Processing method of multimode voice task, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200121