CN111883101B - Model training and speech synthesis method, device, equipment and medium - Google Patents
Model training and speech synthesis method, device, equipment and medium
- Publication number
- CN111883101B (application number CN202010668214.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- training
- model
- sample
- synthesized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a model training and speech synthesis method, device, equipment and medium, and relates to the technical fields of artificial intelligence, deep learning and speech. The specific implementation scheme is as follows: acquiring a sample text in a training data set; determining label information corresponding to the sample text based on an acoustic model trained in advance by an unsupervised training method, wherein the label information includes style information and/or character information; and training a text classification model based on the sample text and the label information corresponding to the sample text, the text classification model being used to output the corresponding label information for an input text. The embodiments of the application achieve the technical effect of automatically determining the label information corresponding to the sample text, improve the accuracy and efficiency of label annotation, and correspondingly speed up the training of the text classification model.
Description
Technical Field
The embodiments of the application relate to the technical fields of artificial intelligence, deep learning and speech, and in particular to a model training and speech synthesis method, device, equipment and medium.
Background
Traditional speech synthesis techniques employ supervised machine learning: text data of different styles, emotions, or roles carries corresponding labels, which help the speech synthesis system model and generate speech.
In existing methods, annotators label the collected text data according to subjective experience. Because each annotator understands the labels differently, labeling accuracy is low; and because the data must be labeled manually, labeling efficiency is also low.
Disclosure of Invention
The embodiments of the disclosure provide a model training and speech synthesis method, device, equipment and medium.
According to an aspect of the disclosure, there is provided a model training method, the method including:
acquiring a sample text in a training data set;
determining label information corresponding to the sample text based on an acoustic model trained by an unsupervised training method in advance; wherein the tag information includes style information and/or character information;
training a text classification model based on the sample text and label information corresponding to the sample text; the text classification model is used for outputting corresponding tag information according to the input text.
According to another aspect of the disclosure, there is provided a method of speech synthesis, the method comprising:
inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized output by the classification model; wherein the label information includes style information and/or character information, and the text classification model is a model trained by using the model training method disclosed by the application;
inputting text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information;
and performing speech synthesis on the text to be synthesized based on the acoustic features to obtain speech data corresponding to the text to be synthesized.
according to another aspect of the present disclosure, there is provided a model training apparatus, the apparatus comprising:
the sample text acquisition module is used for acquiring sample texts in the training data set;
the label information determining module is used for determining label information corresponding to the sample text based on an acoustic model trained by an unsupervised training method in advance; wherein the tag information includes style information and/or character information;
the text classification model training module is used for training a text classification model based on the sample text and the label information corresponding to the sample text; the text classification model is used for outputting corresponding tag information according to the input text.
According to another aspect of the present disclosure, there is provided a voice synthesis apparatus, the apparatus comprising:
the label information acquisition module is used for inputting a text to be synthesized into a text classification model trained in advance to obtain label information corresponding to the text to be synthesized, which is output by the classification model; wherein the tag information includes style information and/or character information; the text classification model is a model trained by using the model training method disclosed by the application;
the acoustic feature acquisition module is used for inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information;
and the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized based on the acoustic characteristics to obtain voice data corresponding to the text to be synthesized.
According to another aspect of the present disclosure, there is provided an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method and/or the speech synthesis method as described in any one of the embodiments of the present application.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a model training method and/or a speech synthesis method according to any one of the embodiments of the present application.
According to the method and the device, the technical effect of automatically determining the label information corresponding to the sample text is achieved, the accuracy and efficiency of label annotation are improved, and the training of the text classification model is correspondingly sped up. The method also realizes the generation of speech data with multiple styles, multiple roles and rich emotion, which is closer to a real person's reading style and greatly improves the user's experience of listening to the speech data. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a model training method according to an embodiment of the present application;
FIG. 2A is a flow chart of a model training method according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a model training according to an embodiment of the present application;
FIG. 3A is a flow chart of a method of speech synthesis according to an embodiment of the present application;
FIG. 3B is a schematic diagram of a speech synthesis according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of an electronic device for implementing the model training and speech synthesis methods according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In traditional speech synthesis technology, which adopts a supervised machine learning method, it is not easy to obtain the labels corresponding to the text data. For example, if text data with four different labels is required, one approach is to record the data label by label, which is very time-consuming; another is to collect existing historical data and have annotators label it manually according to their experience, which suffers from low accuracy and low efficiency. Therefore, for scenes requiring multi-style, multi-role voice broadcasting, the traditional supervised machine learning method is very difficult to apply.
Fig. 1 is a flowchart of a model training method disclosed in an embodiment of the present application, which is suitable for training a text classification model without manual labeling. The method of this embodiment can be implemented by a model training device, which can be implemented in software and/or hardware and can be integrated on any electronic device with computing capability, such as a server.
As shown in fig. 1, the model training method disclosed in this embodiment may include:
s101, acquiring sample texts in a training data set.
Wherein the training data set includes training data for training an acoustic model and a text classification model. The training data set in this embodiment includes sample texts, text features of the sample texts, and speech data corresponding to the sample texts. The sample texts are obtained by performing speech recognition on the collected speech data.
Optionally, the method for generating the sample text comprises two steps of A and B:
A. Acquiring a preset number of pieces of real-person voice data, and performing a background music and/or noise removal operation on each piece of real-person voice data.
Here, real-person voice data refers to voice data voiced by a real person.
In one embodiment, data is collected from the internet to obtain the preset number of pieces of real-person voice data. According to the frequency band in which the background music and/or noise of each recording lies, a filter matching that band is applied to the recording to remove the background music and/or noise.
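A minimal sketch of this filtering step, assuming the interference occupies a known band; the 50-250 Hz band, the file paths, and the filter order below are illustrative assumptions, since the patent only states that a filter matching the band of the background music/noise is applied:

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def remove_band(wav_path, out_path, low_hz=50.0, high_hz=250.0):
    audio, sr = sf.read(wav_path)
    # Band-stop filter over the band assumed to contain the background music/noise.
    sos = butter(4, [low_hz, high_hz], btype="bandstop", fs=sr, output="sos")
    cleaned = sosfiltfilt(sos, audio, axis=0)
    sf.write(out_path, cleaned, sr)

remove_band("raw_voice.wav", "cleaned_voice.wav")
```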
Optionally, the real-person voice data includes real-person voice broadcast data of literature carriers, such as real-person voice broadcasts of novels, poetry, prose, operas, and the like.
Using the real-person voice broadcast data of literature carriers as the real-person voice data has the following advantages over traditional studio recording: 1) the stylistic performance and emotional fluctuation are more natural, and role playing is more vivid and lifelike; 2) existing broadcast data can be evaluated immediately, whereas studio recording, even with a speaker selected in advance, may after long sessions yield recordings whose quality does not meet the standard or whose style is not expressed well; 3) real-person broadcast speakers and broadcast data of different types of literature carriers can quickly be replicated onto more literature carriers.
B. Segmenting each piece of real-person voice data, and acquiring the text corresponding to each segmented piece of voice data as a sample text.
In one embodiment, each piece of real-person voice data is segmented in units of at least one of characters, words, or sentences to obtain multiple pieces of voice data; for example, segmenting in units of words yields multiple pieces of word-level voice data, and segmenting in units of sentences yields multiple pieces of sentence-level voice data. Then, an existing speech recognition algorithm is used to perform speech recognition on each piece of voice data, and the text corresponding to each piece is obtained as a sample text in the training data set. Speech recognition algorithms include, but are not limited to, dynamic time warping, hidden Markov models (parametric models), vector quantization (non-parametric models), and the like.
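A brief sketch of step B under stated assumptions: silence-based splitting stands in for the pause-based sentence segmentation, and `recognize()` is a placeholder for whichever speech recognition engine (DTW, HMM, neural, ...) is actually used — both are assumptions, not the patent's prescribed implementation:

```python
import librosa

def build_sample_texts(wav_path, recognize, top_db=30):
    # Load the cleaned recording and split it on silence (a stand-in for pause-based segmentation).
    audio, sr = librosa.load(wav_path, sr=None)
    samples = []
    for start, end in librosa.effects.split(audio, top_db=top_db):
        segment = audio[start:end]
        text = recognize(segment, sr)          # assumed ASR callable returning a string
        samples.append({"speech": segment, "text": text})   # paired voice data / sample text
    return samples
```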
Optionally, segmenting the real-person voice data in units of sentences includes: segmenting the acquired real-person voice data with an existing sentence segmentation method, for example segmenting according to pause durations in the speech, or according to the decibel level of the speech. By acquiring a preset number of pieces of real-person voice data and removing background music and/or noise from each piece, the resulting voice data contains fewer interfering factors, which ensures the accuracy and reliability of the subsequent acoustic model training; by segmenting each piece of real-person voice data and acquiring the text corresponding to each segmented piece as a sample text, consistency between the sample texts and the voice data is maintained, so that the subsequent acoustic model training can proceed smoothly.
By acquiring the sample text in the training data set, a foundation is laid for determining the corresponding label information according to the sample text.
S102, determining label information corresponding to the sample text based on an acoustic model trained by an unsupervised training method in advance; wherein the tag information includes style information and/or character information.
With an unsupervised training method, model training can be completed without human participation, and the structure of the data can be mined autonomously.
In one embodiment, an unsupervised training method is used to build the acoustic model, and an existing clustering algorithm is used to cluster the speech data to obtain a clustering result; such algorithms include, but are not limited to, k-means clustering, hierarchical clustering, DBSCAN density clustering, grid clustering, and the like. Optionally, the clustering result includes label information corresponding to each piece of voice data, where the label information includes style information and/or character information; for example, style information includes, but is not limited to, bold and generous, masculine and unrestrained, emotionally exuberant, cold and sorrowful, playful and changeable, and the like, and character information includes, but is not limited to, woman, man, elderly person, child, and the like. Finally, the label information corresponding to a piece of voice data is used as the label information of the sample text corresponding to that voice data.
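The following is a minimal sketch of how a clustering result can yield the label information, assuming utterance-level acoustic embeddings (for example, the latent vectors produced by the unsupervised acoustic model) are available; the embedding source and the number of clusters are assumptions, and k-means is only one of the algorithms the text allows:

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_labels(utterance_embeddings, n_styles=4, seed=0):
    X = np.asarray(utterance_embeddings)            # shape: (num_utterances, embed_dim)
    km = KMeans(n_clusters=n_styles, random_state=seed, n_init=10)
    cluster_ids = km.fit_predict(X)                  # one style/role cluster id per utterance
    # The cluster id of each utterance becomes the label information of its sample text.
    return cluster_ids
```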
By determining the label information corresponding to the sample text based on the acoustic model trained in advance by the unsupervised training method, the effect of automatically determining the label corresponding to the sample text is achieved, and the accuracy and efficiency of label annotation are improved.
S103, training a text classification model based on the sample text and label information corresponding to the sample text; the text classification model is used for outputting corresponding tag information according to the input text.
Types of text classification models include, but are not limited to, CNN (convolutional neural network) models, Transformer models, and the like.
In one embodiment, the sample texts and the label information obtained for them are used as training data for the text classification model, and an existing model training method is used to train the model. The trained text classification model is used to label texts: a text to be labeled is input into the text classification model, and the model outputs the label information corresponding to that text.
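A minimal training sketch, assuming a small text CNN and a data loader that yields (token id, label) batches; the tokenizer, vocabulary, and layer sizes are illustrative assumptions, since the patent only specifies that a CNN or Transformer classifier maps input text to label information:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=128, num_filters=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in (2, 3, 4)])
        self.fc = nn.Linear(num_filters * 3, num_labels)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # (batch, num_labels) logits

def train_classifier(model, loader, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in loader:             # labels come from the acoustic model's clustering
            opt.zero_grad()
            loss = loss_fn(model(token_ids), labels)
            loss.backward()
            opt.step()
    return model
```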
The text classification model is trained based on the sample text and the label information corresponding to the sample text, so that the modeling of the text classification model is realized, manual participation is not needed, and the labor cost is saved.
According to the technical solution of this embodiment, the sample text in the training data set is obtained, the label information of the sample text is determined based on the acoustic model trained in advance by the unsupervised training method, and finally the text classification model is trained based on the sample text and its label information. This achieves the technical effect of automatically determining the label information corresponding to the sample text, improves the accuracy and efficiency of label annotation, and correspondingly speeds up the training of the text classification model.
Fig. 2A is a flowchart of a model training method disclosed in the embodiments of the present application, which is further optimized and expanded based on the above technical solution, and may be combined with the above various alternative embodiments. As shown in fig. 2A, the method may include:
s201, training data in a training data set is obtained, wherein the training data comprises text characteristics of a sample text and voice data corresponding to the sample text.
The text characteristics of the sample text are obtained by text analysis of the sample text, and the text characteristics include, but are not limited to, syllable sequences and the like.
S202, training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text features and acoustic features, and obtaining a clustering result of clustering the training data according to styles and/or roles.
Wherein the acoustic features include, but are not limited to, mel-frequency cepstral coefficients, acoustic energy, acoustic fundamental frequencies, and acoustic spectra, among others.
In one embodiment, the acoustic model training is performed with an unsupervised training method comprising any one of the following: VAE (Variational Auto-Encoder), VQ-VAE (Vector Quantised Variational Auto-Encoder), a mutual information method, and GAN (Generative Adversarial Network). The acoustic model can be used to identify the acoustic features corresponding to text features: the text features are input into the trained acoustic model, and the acoustic model outputs the corresponding acoustic features according to the established mapping relation between text features and acoustic features. During the unsupervised acoustic model training, the voice data are clustered according to preset styles and/or roles to determine the style information and/or character information corresponding to each piece of voice data, yielding the clustering result.
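A condensed VAE-style sketch of the unsupervised acoustic model, assuming utterance-level feature vectors: an encoder compresses the reference acoustic features into a latent style/role vector, and a decoder maps text features plus that latent to predicted acoustic features. All layer sizes are assumptions, and VQ-VAE, mutual-information, or GAN training would replace the loss accordingly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEAcousticModel(nn.Module):
    def __init__(self, text_dim, acoustic_dim, latent_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.Tanh())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(text_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, acoustic_dim))

    def forward(self, text_feats, acoustic_feats):
        h = self.encoder(acoustic_feats)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterisation trick
        pred = self.decoder(torch.cat([text_feats, z], dim=-1))    # text features -> acoustic features
        recon = F.mse_loss(pred, acoustic_feats)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return pred, mu, recon + kl

# After training, the per-utterance latent means `mu` can be clustered by style and/or role,
# which is where the label information attached to the sample texts comes from.
```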
S203, acquiring sample texts in the training data set.
S204, inputting the voice data corresponding to the sample text into the trained acoustic model, and obtaining the label information corresponding to the sample text output by the acoustic model.
In one embodiment, since the trained acoustic model has already produced the clustering result for the voice data corresponding to each sample text, inputting any piece of voice data into the trained acoustic model causes it to output the clustering result corresponding to that voice data, i.e., the style information and/or character information, which is then output as the label information of the sample text corresponding to that voice data.
S205, training a text classification model based on the sample text and label information corresponding to the sample text; the text classification model is used for outputting corresponding tag information according to the input text.
In one embodiment, the text classification model is trained based on sample text in the training data set and the obtained label information corresponding to the sample text by inputting the speech data corresponding to the sample text into the acoustic model, so as to obtain a trained text classification model.
This embodiment also provides a schematic diagram of model training. As shown in fig. 2B, which is a schematic diagram of model training disclosed in the embodiments of the present application, unsupervised acoustic model training is performed according to the sample features 21 corresponding to the sample text 20 and the voice data 22 corresponding to the sample text 20; the clustering result obtained by clustering according to style and/or role during the unsupervised acoustic model training is used as the label information 23 corresponding to the sample text 20, and the text classification model is then trained according to the sample text 20 and the label information 23.
According to the technical solution of this embodiment, training data in the training data set is acquired, and an unsupervised training method is used to train the pre-constructed acoustic model based on the training data, so as to establish the mapping relation between text features and acoustic features and obtain the clustering result of clustering the training data according to styles and/or roles. This realizes the modeling of the acoustic model and lays the foundation for automatically determining acoustic features from text features; because the training data are clustered according to styles and/or roles, the style information and/or character information of the voice data is determined automatically. By inputting the voice data corresponding to the sample text into the trained acoustic model and obtaining the label information output by the acoustic model, the technical effect of automatically determining the label information corresponding to the sample text is achieved, the accuracy and efficiency of label annotation are improved, and the training of the text classification model is correspondingly sped up.
With the explosive growth of audio novels, radio dramas, and podcasts of all kinds, it is increasingly common for people to consume information and entertainment by listening during fragments of spare time. As a tool for converting text into speech, speech synthesis technology can supply vast amounts of speech data.
However, the effect of current speech synthesis technology still falls short of real-person voice broadcasting: the synthesized speech has a single style, a single role, and no emotional fluctuation, so users easily grow tired of it, the product cannot retain users for long, and the user experience is poor.
Fig. 3A is a flowchart of a speech synthesis method according to an embodiment of the present application, where the embodiment may be applicable to a case of obtaining corresponding speech data according to a text to be synthesized. The method of the present embodiment may be performed by a speech synthesis apparatus, which may be implemented in software and/or hardware, and may be integrated on any electronic device having computing capabilities, such as a server or the like.
As shown in fig. 3A, the speech synthesis method disclosed in this embodiment may include:
s301, inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized, which is output by the classification model; wherein the tag information includes style information and/or character information; the text classification model is a model trained using the model training method described in this embodiment.
In one embodiment, since the text classification model is trained according to the sample text and the tag information corresponding to the sample text, the text to be synthesized is input into the trained text classification model, and the text classification model can output the tag information corresponding to the text to be synthesized.
Optionally, the text to be synthesized includes literature carrier text to be synthesized, such as novel text, poetry text, prose text, opera text, and the like.
By taking literature carrier text to be synthesized as the text to be synthesized, literature carrier voice broadcast content that is richer and more varied in style and roles can ultimately be obtained, meeting users' demand for entertainment content.
S302, inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information.
In one embodiment, text analysis is performed on the text to be synthesized to obtain its text features; the text features and the label information acquired in S301 are input into the trained acoustic model, and the acoustic model outputs the acoustic features corresponding to the text features and the label information according to the established mapping relation between text features and acoustic features and the clustering result of clustering the training data by style and/or role. In other words, the acoustic model is controlled through the label information, so that the predicted acoustic features not only correspond to the text content to be synthesized but also carry the style information and/or character information associated with the label information.
S303, performing voice synthesis on the text to be synthesized based on the acoustic features to obtain voice data corresponding to the text to be synthesized.
In one embodiment, the acoustic features output by the acoustic model are input into a preset vocoder, which performs speech synthesis based on the acoustic features and outputs the speech data corresponding to the text to be synthesized.
This embodiment also provides a schematic diagram of speech synthesis. As shown in fig. 3B, the text 30 to be synthesized undergoes a text analysis operation 31 to obtain text features 32; the text 30 to be synthesized is input into the trained text classification model 33 to obtain the label information 34 corresponding to the text 30; the text features 32 and the label information 34 are then input together into the trained acoustic model 35 to obtain the acoustic features 36 corresponding to the text features 32 and the label information 34; finally, the acoustic features 36 are input into a preset vocoder 37 for speech synthesis, yielding the speech data 38 corresponding to the text 30 to be synthesized.
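A condensed sketch of the fig. 3B inference pipeline; the four callables stand for the trained components, and their exact interfaces are assumptions made for illustration:

```python
def synthesize(text, analyze_text, text_classifier, acoustic_model, vocoder):
    text_features = analyze_text(text)                # text analysis 31 -> text features 32
    label_info = text_classifier(text)                # classification model 33 -> label information 34
    acoustic_features = acoustic_model(text_features, label_info)   # acoustic model 35 -> features 36
    waveform = vocoder(acoustic_features)             # vocoder 37 -> speech data 38
    return waveform
```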
According to the technical solution of this embodiment, the text to be synthesized is input into the text classification model trained in advance to obtain the label information output by the classification model; the text features of the text to be synthesized and the label information are input into the acoustic model trained in advance by the unsupervised training method to obtain the acoustic features corresponding to the text features and the label information; finally, speech synthesis is performed on the text to be synthesized based on the acoustic features to obtain the corresponding speech data. This realizes the generation of speech data with multiple styles, multiple roles and rich emotion, which is closer to a real person's reading style and greatly improves the user's experience of listening to the speech data.
Fig. 4 is a schematic structural diagram of a model training device according to an embodiment of the present application, which may be suitable for an unsupervised case of training a text classification model. The apparatus of this embodiment may be implemented in software and/or hardware, and may be integrated on any electronic device with computing capabilities, such as a server or the like.
As shown in fig. 4, the model training apparatus 40 disclosed in this embodiment may include a sample text acquisition module 41, a tag information determination module 42, and a text classification model training module 43, where:
a sample text obtaining module 41, configured to obtain a sample text in the training data set;
the tag information determining module 42 is configured to determine tag information corresponding to the sample text based on an acoustic model trained in advance by an unsupervised training method; wherein the tag information includes style information and/or character information;
a text classification model training module 43, configured to train a text classification model based on the sample text and tag information corresponding to the sample text; the text classification model is used for outputting corresponding tag information according to the input text.
Optionally, the training method of the acoustic model includes:
Acquiring training data in the training data set, wherein the training data comprises text characteristics of a sample text and voice data corresponding to the sample text;
training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text features and acoustic features, and obtaining a clustering result of clustering the training data according to styles and/or roles.
Optionally, the clustering result includes label information corresponding to each voice data; the tag information determination module 42 is specifically configured to:
and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.
Optionally, the method for generating the sample text includes:
acquiring a preset number of real voice data, and executing background music and/or noise removal operation on each real voice data;
and segmenting each piece of real voice data, and acquiring texts corresponding to each piece of segmented voice data respectively to serve as sample texts.
Optionally, the real person voice data includes: the real person of the literature carrier broadcasts the data in voice.
The model training device 40 disclosed in the embodiment of the present application may execute any model training method disclosed in the embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method. Details not described in detail in this embodiment may refer to descriptions in any of the model training method embodiments of the present application.
Fig. 5 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application, where the embodiment may be applicable to a case of obtaining corresponding speech data according to a text to be synthesized. The apparatus of this embodiment may be implemented in software and/or hardware, and may be integrated on any electronic device with computing capabilities, such as a server or the like.
As shown in fig. 5, the speech synthesis apparatus 50 disclosed in the present embodiment may include a tag information acquiring module 51, an acoustic feature acquiring module 52, and a speech synthesis module 53, wherein:
the tag information obtaining module 51 is configured to input a text to be synthesized into a text classification model trained in advance, and obtain tag information corresponding to the text to be synthesized output by the classification model; wherein the tag information includes style information and/or character information; the text classification model is a model trained by using the model training method disclosed by the application;
The acoustic feature acquisition module 52 is configured to input the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtain the acoustic features, output by the acoustic model, that correspond to the text features and the label information;
and the voice synthesis module 53 is configured to perform voice synthesis on the text to be synthesized based on the acoustic feature, so as to obtain voice data corresponding to the text to be synthesized.
Optionally, the text to be synthesized includes: literature carrier text to be synthesized.
The speech synthesis apparatus 50 disclosed in the embodiments of the present application can execute the speech synthesis method disclosed in the embodiments of the present application, and has the corresponding functional modules and beneficial effects of the execution method. Reference may be made to the description of any speech synthesis method embodiment of the present application for details not described in this embodiment.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 6, a block diagram of an electronic device for the model training method and/or the speech synthesis method according to an embodiment of the present application is given. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is illustrated in fig. 6.
Memory 602 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the model training method and/or the speech synthesis method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the model training method and/or the speech synthesis method provided by the present application.
The memory 602 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the model training method and/or the speech synthesis method in the embodiments of the present application (e.g., the sample text acquisition module 41, the tag information determination module 42, and the text classification model training module 43 shown in fig. 4, and/or the tag information acquisition module 51, the acoustic feature acquisition module 52, and/or the speech synthesis module 53 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the model training method and/or the speech synthesis method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the electronic device of the model training method and/or the speech synthesis method, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 602 may optionally include memory remotely located with respect to processor 601, which may be connected to the electronic device of the model training method and/or the speech synthesis method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the model training method and/or the speech synthesis method may further include: an input device 603 and an output device 604. The processor 601, memory 602, input device 603 and output device 604 may be connected by a bus or otherwise, for example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the model training method and/or the speech synthesis method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, and the like. The output means 604 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the application, the sample text in the training data set is obtained, the label information of the sample text is determined based on the acoustic model trained in advance by the unsupervised training method, and finally the text classification model is trained based on the sample text and its label information; this achieves the technical effect of automatically determining the label information corresponding to the sample text, improves the accuracy and efficiency of label annotation, and correspondingly speeds up the training of the text classification model.
Furthermore, the text to be synthesized is input into the text classification model trained in advance to obtain the label information output by the classification model; the text features of the text to be synthesized and that label information are input into the acoustic model trained in advance by the unsupervised training method to obtain the acoustic features corresponding to the text features and the label information; finally, speech synthesis is performed on the text to be synthesized based on the acoustic features to obtain the corresponding speech data. This realizes the generation of speech data with multiple styles, multiple roles and rich emotion, which is closer to a real person's reading style and greatly improves the user's experience of listening to the speech data.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
Claims (14)
1. A method of model training, the method comprising:
acquiring a sample text in a training data set;
determining label information corresponding to the sample text based on an acoustic model trained by an unsupervised training method in advance; wherein the tag information includes style information and/or character information;
training a text classification model based on the sample text and label information corresponding to the sample text; the text classification model is used for outputting corresponding label information according to the input text;
the generation method of the sample text comprises the following steps:
acquiring a preset number of real voice data, segmenting each real voice data by taking at least one of words, words or sentences as a unit, and acquiring texts corresponding to each segmented voice data respectively as sample texts;
the training method of the acoustic model comprises the following steps:
acquiring training data in the training data set, wherein the training data comprises text characteristics of a sample text and voice data corresponding to the sample text;
training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text features and acoustic features, and obtaining a clustering result of clustering the training data according to styles and/or roles.
2. The method of claim 1, wherein the clustering result comprises label information corresponding to each voice data; based on an acoustic model trained by an unsupervised training method in advance, determining label information corresponding to the sample text comprises:
and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.
3. The method of any of claims 1-2, wherein the method of generating the sample text comprises:
and performing an operation of removing background music and/or noise on each of the real voice data.
4. The method of claim 3, wherein the real person voice data comprises: the real person of the literature carrier broadcasts the data in voice.
5. A method of speech synthesis, the method comprising:
inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized, which is output by the classification model; wherein the tag information includes style information and/or character information; the text classification model is a model trained using the model training method of any one of claims 1-4;
Inputting the text characteristics of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained by an unsupervised training method in advance, and obtaining the text characteristics and the acoustic characteristics corresponding to the label information output by the acoustic model;
and carrying out voice synthesis on the text to be synthesized based on the acoustic characteristics to obtain voice data corresponding to the text to be synthesized.
6. The method of claim 5, wherein the text to be synthesized comprises: literature carrier text to be synthesized.
7. A model training apparatus, the apparatus comprising:
the sample text acquisition module is used for acquiring sample texts in the training data set;
the label information determining module is used for determining label information corresponding to the sample text based on an acoustic model trained by an unsupervised training method in advance; wherein the tag information includes style information and/or character information;
the text classification model training module is used for training a text classification model based on the sample text and the label information corresponding to the sample text; the text classification model is used for outputting corresponding label information according to the input text;
The generation method of the sample text comprises the following steps:
acquiring a preset number of real voice data, segmenting each real voice data by taking at least one of words, words or sentences as a unit, and acquiring texts corresponding to each segmented voice data respectively as sample texts;
the training method of the acoustic model comprises the following steps:
acquiring training data in the training data set, wherein the training data comprises text characteristics of a sample text and voice data corresponding to the sample text;
training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text features and acoustic features, and obtaining a clustering result of clustering the training data according to styles and/or roles.
8. The apparatus of claim 7, wherein the clustering result includes tag information corresponding to each voice data; the tag information determining module is specifically configured to:
and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.
9. The apparatus of any of claims 7-8, wherein the method of generating the sample text comprises:
performing an operation of removing background music and/or noise on each piece of the real-person voice data.
10. The apparatus of claim 9, wherein the real-person voice data comprises: real-person voice broadcast data of a literature carrier.
11. A speech synthesis apparatus, the apparatus comprising:
the label information acquisition module is used for inputting a text to be synthesized into a pre-trained text classification model to obtain label information, output by the text classification model, corresponding to the text to be synthesized; wherein the label information includes style information and/or character information; the text classification model is a model trained using the model training method of any one of claims 1-4;
the acoustic feature acquisition module is used for inputting text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, to obtain acoustic features, output by the acoustic model, corresponding to the text features and the label information;
and the speech synthesis module is used for performing speech synthesis on the text to be synthesized based on the acoustic features to obtain voice data corresponding to the text to be synthesized.
12. The apparatus of claim 11, wherein the text to be synthesized comprises: text of a literature carrier to be synthesized.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1-4 and/or the speech synthesis method of any one of claims 5-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the model training method of any one of claims 1-4 and/or the speech synthesis method of any one of claims 5-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010668214.1A CN111883101B (en) | 2020-07-13 | 2020-07-13 | Model training and speech synthesis method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111883101A CN111883101A (en) | 2020-11-03 |
CN111883101B true CN111883101B (en) | 2024-02-23 |
Family
ID=73150604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010668214.1A Active CN111883101B (en) | 2020-07-13 | 2020-07-13 | Model training and speech synthesis method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111883101B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112365874B (en) * | 2020-11-17 | 2021-10-26 | 北京百度网讯科技有限公司 | Attribute registration of speech synthesis model, apparatus, electronic device, and medium |
CN113555005B (en) * | 2021-09-22 | 2021-12-28 | 北京世纪好未来教育科技有限公司 | Model training, confidence determination method and device, electronic device, storage medium |
CN114093341A (en) * | 2021-09-29 | 2022-02-25 | 北京搜狗科技发展有限公司 | Data processing method, apparatus and medium |
CN113990286B (en) * | 2021-10-29 | 2024-11-19 | 北京大学深圳研究院 | Speech synthesis method, device, equipment and storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923854A (en) * | 2010-08-31 | 2010-12-22 | 中国科学院计算技术研究所 | An interactive speech recognition system and method |
CN101963972A (en) * | 2010-07-01 | 2011-02-02 | 深港产学研基地产业发展中心 | Method and system for extracting emotional keywords |
CN106598948A (en) * | 2016-12-19 | 2017-04-26 | 杭州语忆科技有限公司 | Emotion recognition method based on long-term and short-term memory neural network and by combination with autocoder |
CN107103903A (en) * | 2017-05-05 | 2017-08-29 | 百度在线网络技术(北京)有限公司 | Acoustic training model method, device and storage medium based on artificial intelligence |
CN109308892A (en) * | 2018-10-25 | 2019-02-05 | 百度在线网络技术(北京)有限公司 | Voice synthesized broadcast method, apparatus, equipment and computer-readable medium |
CN109697232A (en) * | 2018-12-28 | 2019-04-30 | 四川新网银行股份有限公司 | A kind of Chinese text sentiment analysis method based on deep learning |
CN109712646A (en) * | 2019-02-20 | 2019-05-03 | 百度在线网络技术(北京)有限公司 | Voice broadcast method, device and terminal |
CN109859736A (en) * | 2019-01-23 | 2019-06-07 | 北京光年无限科技有限公司 | Phoneme synthesizing method and system |
CN109887497A (en) * | 2019-04-12 | 2019-06-14 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110148398A (en) * | 2019-05-16 | 2019-08-20 | 平安科技(深圳)有限公司 | Training method, device, equipment and the storage medium of speech synthesis model |
CN110853616A (en) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | Speech synthesis method, system and storage medium based on neural network |
CN111161703A (en) * | 2019-12-30 | 2020-05-15 | 深圳前海达闼云端智能科技有限公司 | Voice synthesis method with tone, device, computing equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4478939B2 (en) * | 2004-09-30 | 2010-06-09 | 株式会社国際電気通信基礎技術研究所 | Audio processing apparatus and computer program therefor |
US10332505B2 (en) * | 2017-03-09 | 2019-06-25 | Capital One Services, Llc | Systems and methods for providing automated natural language dialogue with customers |
CN107240395B (en) * | 2017-06-16 | 2020-04-28 | 百度在线网络技术(北京)有限公司 | Acoustic model training method and device, computer equipment and storage medium |
2020-07-13: Application CN202010668214.1A filed in China (CN); granted as CN111883101B, current status Active.
Also Published As
Publication number | Publication date |
---|---|
CN111883101A (en) | 2020-11-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||