
CN112151009B - Voice synthesis method and device based on prosody boundary, medium and equipment - Google Patents


Info

Publication number
CN112151009B
CN112151009B
Authority
CN
China
Prior art keywords
information
synthesized
prosodic
vector
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011031529.1A
Other languages
Chinese (zh)
Other versions
CN112151009A (en)
Inventor
孙奥兰 (Sun Aolan)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011031529.1A
Priority to PCT/CN2020/124257 (WO2021174874A1)
Publication of CN112151009A
Application granted
Publication of CN112151009B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a prosody-boundary-based speech synthesis method, together with a device, medium, and equipment. The method comprises: acquiring prosodic boundary information of the text information to be synthesized, and generating graph embedding information based on the prosodic boundary information; generating a hidden state vector of the graph embedding information and a sequence code of the text information to be synthesized; generating a speech spectrogram based on the hidden state vector and the sequence code; and synthesizing the speech information of the text information to be synthesized according to the speech spectrogram. With the method provided by the invention, the semantic and grammatical structure of a sentence can be analysed from the text side, and the prosodic boundaries are represented by a graph embedding, so that the prosodic information in the text fully participates in training and inference, improving the prosodic quality of the synthesized speech. The invention also relates to blockchain technology: data such as the hidden state vector and the sequence code of the text information to be synthesized are stored on a blockchain, improving the security of data storage.

Description

Voice synthesis method and device based on prosody boundary, medium and equipment
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a prosodic-boundary-based speech synthesis method, device, medium, and equipment.
Background
In deep-learning-based text-to-speech (TTS) systems, prosody is an important factor determining the naturalness and fluency of synthesized speech. Prosody can be subdivided into three feature dimensions: fundamental frequency, loudness, and duration. In end-to-end speech synthesis, academia and industry have tried extracting prosody-embedding hidden states from the mel spectrogram of speech and then introducing a global style vector into a multi-head attention mechanism during training, so as to control the prosody of the whole synthesized sentence; variational autoencoders have been used as prosody classifiers to learn prosody-embedding hidden states across diverse prosody datasets; and, to obtain more accurate local prosody control, some researchers have adopted finer-grained features.
These methods analyze prosodic information from the speech side, i.e., they extract prosodic features from frequency-domain spectral information, because the prosody of a speech segment is fully manifested in the frequency domain. However, the frequency domain cannot fully represent the semantic and grammatical information of the input text sequence, while it is the text-side information that largely determines the local prosody of a sentence; as a result, the prosody of the synthesized speech often fails to match the prosodic cadence of the text content.
Disclosure of Invention
The present invention has been made in view of the above problems, and its object is to provide a prosodic-boundary-based speech synthesis method, device, medium, and equipment that overcome or at least partially solve those problems.
According to an aspect of the present invention, there is provided a prosody boundary-based speech synthesis method including:
acquiring prosodic boundary information of the text information to be synthesized, and generating graph embedding information based on the prosodic boundary information;
generating a hidden state vector of the graph embedding information and a sequence code of the text information to be synthesized based on a preset neural network model;
generating a speech spectrogram based on the hidden state vector and the sequence code; and
synthesizing the speech information of the text information to be synthesized according to the speech spectrogram.
Optionally, obtaining the prosodic boundary information of the text information to be synthesized and generating the graph embedding information based on it includes:
dividing the text information to be synthesized into a plurality of levels according to a preset prosodic boundary structure, wherein the levels include prosodic words and prosodic phrases;
acquiring a first vector corresponding to each prosodic word in the text information to be synthesized;
combining the first vectors belonging to the same prosodic phrase in pairs to generate second vectors corresponding to the different combinations; and
forming the graph embedding information from the combination of the first vectors and the second vectors.
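As a minimal sketch of the steps above (names, vector values, and data layout are illustrative assumptions, not the patent's implementation), combining the first vectors of one prosodic phrase pairwise into second vectors and assembling the graph embedding can look like this:

```python
from itertools import combinations

# Toy first vectors for four prosodic words in one prosodic phrase.
# The values are illustrative placeholders, not trained embeddings.
first_vectors = {
    "v1": [0.1, 0.2],
    "v2": [0.3, 0.4],
    "v3": [0.5, 0.6],
    "v4": [0.7, 0.8],
}

# Combine the first vectors pairwise: each unordered pair (v, v') yields
# one second vector e = (v, v'), i.e. an undirected edge of the graph.
second_vectors = {
    f"e{a[1]}{b[1]}": (first_vectors[a], first_vectors[b])
    for a, b in combinations(sorted(first_vectors), 2)
}

# The graph embedding information is formed from both sets together.
graph_embedding = {"nodes": first_vectors, "edges": second_vectors}
print(sorted(second_vectors))  # ['e12', 'e13', 'e14', 'e23', 'e24', 'e34']
```

Four prosodic words give C(4, 2) = 6 undirected edges, matching the six second vectors named in the detailed description.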
Optionally, generating the hidden state vector of the graph embedding information based on a preset neural network model includes:
inputting the graph embedding information as an input vector into a first preset neural network model, the first neural network model being a neural network model trained in advance to a convergence state and used for transcoding the graph embedding information; and
obtaining the hidden state vector corresponding to each prosodic phrase output by the first neural network model.
Optionally, generating the sequence code of the text information to be synthesized based on a preset neural network model includes:
converting the text information to be synthesized into character information, and forming character-graph embedding information from the character information;
inputting the character-graph embedding information into a second preset neural network model, the second preset neural network model being a neural network model trained in advance to a convergence state and used for transcoding text; and
acquiring the sequence code of the text information to be synthesized generated by the second neural network model.
Optionally, generating the speech spectrogram based on the hidden state vector and the sequence code includes:
inputting the hidden state vector and the sequence code into an attention mechanism to obtain the speech spectrogram.
Optionally, the attention mechanism comprises a decoder;
inputting the hidden state vector and the sequence code into the preset attention mechanism to obtain the speech spectrogram includes:
splicing the hidden state vector and the sequence code to obtain a spliced vector; and
inputting the spliced vector into the decoder of the attention mechanism, and obtaining the speech spectrogram of the text information to be synthesized through the decoder.
Optionally, synthesizing the speech information of the text information to be synthesized according to the speech spectrogram includes:
synthesizing the speech spectrogram into the speech information of the text information to be synthesized based on a preset Griffin-Lim algorithm.
According to another aspect of the present invention, there is also provided a prosody-boundary-based speech synthesis apparatus including:
an information acquisition module adapted to acquire prosodic boundary information of the text information to be synthesized and generate graph embedding information based on the prosodic boundary information;
a hidden vector generation module adapted to generate a hidden state vector of the graph embedding information;
a sequence code generation module adapted to generate a sequence code of the text to be synthesized;
a speech spectrogram generation module adapted to generate a speech spectrogram based on the hidden state vector and the sequence code; and
a speech information synthesis module adapted to synthesize the speech information of the text information to be synthesized according to the speech spectrogram.
According to yet another aspect of the present invention, there is also provided a computer-readable storage medium for storing a program code for performing the prosody boundary based speech synthesis method of any of the above.
According to yet another aspect of the present invention, there is also provided a computing device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to perform the prosody boundary based speech synthesis method of any of the above according to instructions in the program code.
The invention provides a prosody-boundary-based speech synthesis method, device, medium, and equipment. The semantic and grammatical structure of a sentence is analysed from the text side, and the prosodic boundaries are represented by a graph embedding. The multidimensional embedding vectors allow the prosodic information in the text to be fully encoded; the graph-structured data are spliced with the sequence code of the text information and passed through an attention mechanism to obtain the spectrogram of the text to be synthesized, so that the prosodic information in the text fully participates in training and inference. The prosodic effect of the synthesis is thereby clearly correlated with the semantic content of the text, and the prosodic quality of the synthesized speech information is improved.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be more clearly understood and implemented in accordance with the description, and to make the above and other objects, features, and advantages of the invention more readily apparent, specific embodiments of the invention are set forth below.
The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a schematic flow diagram of a prosodic boundary-based speech synthesis method according to an embodiment of the invention;
FIG. 2 shows a prosodic boundary hierarchy schematic according to an embodiment of the invention;
FIG. 3 illustrates a diagram of embedded information according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a neural network structure according to an embodiment of the present invention;
FIG. 5 shows a sequence encoder schematic according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a speech synthesis system according to an embodiment of the invention;
FIG. 7 shows a schematic flow diagram of a prosodic boundary-based speech synthesis method according to an embodiment of the invention;
FIG. 8 shows a schematic structural diagram of a prosody-boundary-based speech synthesis device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a schematic flow chart of a prosodic boundary-based speech synthesis method according to an embodiment of the invention. Referring to Fig. 1, the method provided by the embodiment of the invention may at least include the following steps S102 to S108.
Step S102, prosodic boundary information of the text information to be synthesized is obtained, and graph embedding information is generated based on the prosodic boundary information.
In the embodiment of the present invention, the text information to be synthesized may be any text information for which speech is to be synthesized. Optionally, step S102 described above, obtaining the prosodic boundary information of the text information to be synthesized and generating the graph embedding information based on it, may include:
s1, dividing text information to be synthesized into a plurality of layers according to a preset prosodic boundary structure; wherein the hierarchy includes prosodic words, prosodic phrases, intonation phrases, and sentences.
Fig. 2 illustrates a hierarchical structure of prosody boundary according to an embodiment of the present invention, and as illustrated in fig. 2, the prosody boundary structure may be divided into four levels, namely, prosodic Words (PW), prosodic Phrases (PPH), intonation Phrases (IPH), and sentences (UTTERANCE) levels. Wherein the sentence represents a complete sentence, the intonation phrases are independent of each other in the sentence and are separated by punctuation marks in the sentence.
Taking the text information to be synthesized as the specific court trial time, the court of the dan-edge county is not mentioned in the notification as an example, wherein the whole text information to be synthesized is in a sentence (UTTERANCE) level, namely the specific court trial time, and the court of the dan-edge county is not mentioned in the notification as a sentence level. The Intonation Phrase (IPH) hierarchy may include "specific open trial time" and "dan-li county court is not mentioned in the notification. Prosodic Phrase (PPH) levels include: "specific court trial time", "dence county court", "not mentioned in the notification". Then each Prosodic Phrase (PPH) contains Prosodic Words (PW) that may be "concrete", "open court", "aesthetic", "time", "dan-li-county", "court", "in" "notification", "not mentioned" as shown in fig. 2.
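The four-level division above can be sketched as a plain data structure. This is an illustrative, hand-annotated hierarchy (the Chinese characters are reconstructed from the pinyin given later in the description; a real system would predict the boundaries rather than hard-code them):

```python
# Prosodic hierarchy for the example sentence of Fig. 2, hand-annotated.
hierarchy = {
    "utterance": "具体开庭审理时间，丹棱县法院在通报中未提及",
    "iph": ["具体开庭审理时间", "丹棱县法院在通报中未提及"],      # intonation phrases
    "pph": ["具体开庭审理时间", "丹棱县法院", "在通报中未提及"],  # prosodic phrases
    "pw": ["具体", "开庭", "审理", "时间",
           "丹棱县", "法院", "在", "通报中", "未提及"],           # prosodic words
}

# Sanity check: concatenating the prosodic words reproduces the sentence
# (minus punctuation), i.e. each level is a segmentation of the one above.
assert "".join(hierarchy["pw"]) == hierarchy["utterance"].replace("，", "")
print(len(hierarchy["pw"]))  # 9 prosodic words
```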
S2, obtaining a first vector corresponding to each prosodic word in the text information to be synthesized.
In the embodiment of the present invention, taking the prosodic phrase "the specific court hearing time" as an example, it can be seen from Figs. 2 and 3 that the four prosodic words (PW) connected by the prosodic phrase (PPH) constitute four nodes of the graph embedding; the first vectors corresponding to the prosodic words may be denoted v1, v2, v3, v4 respectively, i.e., the vector of the prosodic word represented by each node.
S3, combining the first vectors belonging to the same prosodic phrase in pairs to generate second vectors corresponding to different combinations.
As described above, since the prosodic phrase (PPH) connects the nodes of the four prosodic words, there is a connection relationship between the nodes. By combining the first vectors in pairs, second vectors corresponding to the different combinations are generated, denoted e = (v, v'), where v and v' represent two different first vectors. The edges between the nodes are undirected, i.e., v→v' and v'→v are equivalent. In conjunction with Fig. 3, the second vectors of "the specific court hearing time" may be denoted e12, e23, e34, e13, e14, e24, respectively.
And S4, forming graph embedded information based on the combination of the first vector and the second vector.
Graph embedding converts an attributed graph into a vector or a set of vectors. The embedding should capture the topology of the graph, vertex-to-vertex relationships, and other relevant information about the graph, its subgraphs, and its vertices. In an embodiment of the present invention, the graph embedding information may be represented as G = (V, E), where G denotes the graph embedding information, V denotes the set of node vectors with v ∈ V, and E denotes the set of edge vectors. After the graph embedding information is acquired, the subsequent speech synthesis steps can be performed on its basis.
According to the method provided by the embodiment of the invention, the semantics and grammatical structure of a sentence can be analysed from the text side of the text information to be synthesized, and the prosodic boundaries are represented by a graph embedding. The multidimensional embedding vectors allow the prosodic information in the text to be fully encoded; the graph-structured data are spliced with the sequence code of the text information and passed through an attention mechanism to obtain the spectrogram of the text to be synthesized, so that the prosodic information fully participates in training and inference. The prosodic effect of the synthesis is thereby clearly correlated with the semantic content of the text, and the prosodic quality of the synthesized speech information is improved.
Step S104, generating hidden state vectors of the graph embedded information and sequence codes of the text information to be synthesized based on a preset neural network model.
The hidden state vector of the graph embedding information is hidden feature information that represents the graph embedding information. Optionally, when generating the hidden state vector, the graph embedding information can be input as an input vector into the first preset neural network model, and the hidden state vector H_PB corresponding to each prosodic phrase output by the first neural network model is obtained. Alternatively, the first preset neural network model may be a neural network model constructed based on a convolutional neural network (CNN). The first neural network model is a neural network model trained in advance to a convergence state and used for transcoding the graph embedding information.
The neural network model provided by the embodiment of the invention may comprise a plurality of network layers: the graph embedding information is input as a vector into the first layer of the first preset neural network model, passes sequentially through several convolution and pooling layers, and the hidden state vector is finally output through a fully connected layer. Fig. 4 schematically illustrates the structure of the first preset neural network model, in which each open circle represents a node of a network layer, and information is passed between the nodes of adjacent layers according to the connections shown.
The hidden state vector H_PB corresponds to the nodes v1…vn. The hidden state vector can be represented in the form of a three-dimensional vector matrix, i.e., a feature vector equivalent to the graph embedding information. Since the structure of convolutional neural network models and the specific flow of the related algorithms are well known to those skilled in the art, they are not described further in this embodiment.
Optionally, when generating the sequence code of the text information to be synthesized, the text information can first be converted into character information, and character-graph embedding information is formed from the character information; the character-graph embedding information is then input into a second preset neural network model, and the sequence code of the text information to be synthesized generated by that model is acquired. The second preset neural network model is a neural network model trained in advance to a convergence state and used for transcoding text; its structure follows that of the first preset neural network model, and its model parameters can be adjusted according to different requirements.
Taking "the specific court hearing time" as an example, it can be converted into character information according to its pinyin and tones, where the numbers 1, 2, 3, and 4 respectively denote the first, second, third, and fourth tones; the character information corresponding to "the specific court hearing time" may then be expressed as "ju4ti3kai1ting2shen3li3shi2jian1". The character information can further be formed, character by character, into a graph embedding vector, serving as the character-graph embedding information. In practical applications, the character information may be converted into a graph embedding vector based on the Word2vec method with the Skip-Gram model, or in other ways. The character-graph embedding information can be represented in the form of a three-dimensional vector matrix.
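A minimal sketch of this character conversion, using a hand-made pinyin table covering only the example phrase (illustrative; a real system would use a full lexicon, for instance via a library such as pypinyin):

```python
# Hypothetical character-to-pinyin table; digits 1-4 encode the four tones.
PINYIN = {
    "具": "ju4", "体": "ti3", "开": "kai1", "庭": "ting2",
    "审": "shen3", "理": "li3", "时": "shi2", "间": "jian1",
}

def to_character_info(text: str) -> str:
    """Convert each character to tone-numbered pinyin and join the result."""
    return "".join(PINYIN[ch] for ch in text)

print(to_character_info("具体开庭审理时间"))
# -> ju4ti3kai1ting2shen3li3shi2jian1
```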
Step S106, generating a speech spectrogram based on the hidden state vector and the sequence code.
In an alternative embodiment of the present invention, after the hidden state vector and the sequence code are obtained, they are input into a preset attention mechanism to obtain the speech spectrogram.
Before the speech spectrogram is generated, the hidden state vector and the sequence code can be spliced to obtain a spliced vector; the spliced vector is then input into the decoder of the attention mechanism, and the speech spectrogram of the text information to be synthesized is obtained through the decoder. As described above, both the hidden state vector and the sequence code can be represented as three-dimensional vector matrices, so the two can be spliced along their dimensions to obtain the spliced vector.
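The splicing step can be sketched with numpy (the tensor shapes here are illustrative assumptions, not the patent's dimensions):

```python
import numpy as np

# The hidden state vector and the sequence code are both 3-D tensors of
# shape (batch, time, features); splicing concatenates them along the
# feature dimension before the result is fed to the attention decoder.
hidden_state = np.zeros((1, 33, 256))    # illustrative H_PB tensor
sequence_code = np.zeros((1, 33, 512))   # illustrative sequence code
spliced = np.concatenate([hidden_state, sequence_code], axis=-1)
print(spliced.shape)  # (1, 33, 768)
```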
In this embodiment, the attention mechanism may be that of the Tacotron spectrogram prediction network. The Tacotron spectrogram prediction network is a seq2seq network with an attention mechanism, comprising an encoder, an attention-based decoder, and a post-processing network. After the spliced vector obtained by concatenating the hidden state vector and the sequence code is input to the decoder of the attention mechanism for processing, a mel spectrogram carrying prosodic rhythm is obtained.
Step S108, synthesizing the speech information of the text information to be synthesized according to the speech spectrogram. Optionally, the speech spectrogram may be synthesized into the speech information of the text information to be synthesized based on a preset Griffin-Lim algorithm.
Griffin-Lim is a vocoder commonly used in speech synthesis to convert the acoustic parameters generated by a speech synthesis system into a speech waveform. This vocoder requires no training and no prediction of the phase spectrum; instead, it estimates the phase information frame by frame and iteratively reconstructs the speech waveform. Of course, in practical applications, the speech information of the text to be synthesized may also be synthesized from the speech spectrogram in other ways.
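The iterative phase estimation can be sketched as follows (a minimal Griffin-Lim; the window parameters, iteration count, and test tone are illustrative, not the patent's settings):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=256, noverlap=192):
    """Recover a waveform from a magnitude spectrogram by alternating
    projections: keep the target magnitude fixed and re-estimate the phase
    from the STFT of the current waveform estimate."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))                   # keep phase only
    _, x = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# Round trip on a 440 Hz test tone: the reconstruction's magnitude
# spectrogram should closely match the target magnitude.
fs = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(2048) / fs)
_, _, S = stft(tone, nperseg=256, noverlap=192)
waveform = griffin_lim(np.abs(S))
```

The 75% overlap (noverlap = 192 of nperseg = 256) satisfies the COLA condition, so scipy's istft exactly inverts stft at each iteration.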
Fig. 6 shows a schematic diagram of a speech synthesis system according to an embodiment of the present invention, and Fig. 7 shows a schematic flow chart of a prosody-boundary-based speech synthesis method according to another embodiment. As can be understood from Figs. 6 and 7, the method provided by the embodiment of the invention may include:
Step S702, prosodic boundary information of the text information to be synthesized is obtained, and graph embedding information of the prosodic boundaries is generated. With reference to Figs. 2 and 3, the graph embedding information of the text, G = (V, E), can be obtained, where V denotes the node vectors. Taking the prosodic phrase "the specific court hearing time" of Fig. 2 as an example, v1, v2, v3, v4 correspond to the word vectors of its four prosodic words in the vector space; the words can be represented in vector form by word embedding, with v ∈ V. E denotes the edge vectors of the graph embedding information, with e ∈ E.
In Fig. 3, E may comprise e12, e23, e34, e13, e14, e24. In this embodiment, e12 = (v1, v2), e13 = (v1, v3), e14 = (v1, v4), e23 = (v2, v3), e24 = (v2, v4), e34 = (v3, v4). Since a plurality of node vectors and edge vectors form the graph embedding information, the graph embedding information may be a vector set composed of a plurality of vectors.
In this embodiment, the graph embedding information is a vector representation of the prosodic words and prosodic phrases of the text information to be synthesized; it carries the layout information of each prosodic word in the text and can reflect the relative position distribution among the prosodic words.
Step S704, the graph embedding information is input into a first preset GNN neural network model to obtain the hidden state vector H_PB. The first preset GNN model may be a multi-layer neural network trained in advance to a convergence state; it passes information along the node vectors and edge vectors of the graph embedding information so as to output the hidden state vector for the input graph embedding information. The hidden vector serves as a high-level representation of the input information for generating the new sequence during the decoding phase. In this embodiment, the first preset GNN model encodes the input graph embedding information to obtain the hidden state vector corresponding to each node, i.e., to v1, v2, v3, v4, respectively.
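The message passing along node and edge vectors can be sketched with one toy update step (the weights, sizes, tanh nonlinearity, and fully connected adjacency are illustrative assumptions, not the patent's GNN):

```python
import numpy as np

def message_pass(node_feats, adj, weight):
    """One message-passing round: each node aggregates its neighbours'
    features through the adjacency matrix, adds its own features, projects
    them, and applies a nonlinearity to give its hidden state."""
    aggregated = adj @ node_feats            # sum over neighbour vectors
    return np.tanh((node_feats + aggregated) @ weight)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # node vectors v1..v4, 8-dim
adj = np.ones((4, 4)) - np.eye(4)            # undirected edges e12..e34 (fully connected phrase)
w = rng.normal(size=(8, 8))                  # illustrative projection weights

h_pb = message_pass(x, adj, w)               # one hidden state per node, analogous to H_PB
print(h_pb.shape)  # (4, 8)
```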
Step S706, a sequence code of the text information to be synthesized is generated. Specifically, the text can be converted into character information according to its pinyin and tones, generating the character-graph embedding information, which is then input into a second preset GNN neural network to obtain the sequence code.
Optionally, the pinyin and tone of each character may be obtained first and combined to give the corresponding character information: "the specific court hearing time" may be expressed as "ju4ti3kai1ting2shen3li3shi2jian1", and similarly, the intonation phrase "the Danling County court did not mention it in the notification" may be expressed as "dan1ling3xian4fa3yuan4zai4tong1bao4zhong1wei4ti2ji2". Combining the character information of the whole text to be synthesized and mapping it into a vector space yields the character-graph embedding information, i.e., a character-level graph embedding vector, which can be generated by node-embedding methods, word2vec, or other graph embedding methods. Further, the character-level graph embedding vector is input into the second preset GNN network and encoded by it to obtain the sequence code. It should be noted that both the hidden state vector and the sequence code can be represented in the dimensions required by the attention mechanism, for example, both can be represented in the form of three-dimensional vector matrices.
Step S708, the hidden state vector and the sequence code are spliced and then input into the decoder of the attention mechanism to obtain the mel spectrogram.
Because both the hidden state vector and the sequence code can be expressed in the form of three-dimensional vector matrices, the elements located at the same coordinates in the two matrices can be spliced to obtain a concatenated three-dimensional vector matrix. The concatenated matrix is input into the decoder of the attention mechanism of the Tacotron spectrogram prediction network for decoding, finally generating a mel spectrogram carrying prosodic rhythm, which is supplied to the Griffin-Lim reconstruction algorithm to directly generate speech. Since the structure of the Tacotron spectrogram prediction network and the specific flow of the related algorithms are well known to those skilled in the art, they are not described further in this embodiment.
In step S710, the voice information of the text information to be synthesized is synthesized from the mel spectrogram by the Griffin-Lim algorithm.
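Griffin-Lim reconstructs audio from a magnitude spectrogram by iterating between the time and frequency domains to find a consistent phase. The following is a minimal sketch using `scipy.signal.stft`/`istft`; a production system would first map the mel spectrogram back to a linear-frequency magnitude, a step omitted here, and the STFT parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256, noverlap=192, seed=0):
    """Iteratively estimate a phase consistent with the magnitude `mag`,
    then invert to a waveform."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        # Back to time domain with the current phase estimate...
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        # ...then keep only the phase of the re-analysed signal.
        _, _, Z = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# Round-trip sanity check on a pure tone:
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 440 * t)
_, _, S = stft(signal, nperseg=256, noverlap=192)
recon = griffin_lim(np.abs(S))
```

With a 75%-overlap Hann window the constant-overlap-add condition holds, so the iteration converges quickly on simple signals.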
In this voice synthesis method, the input is text information and the output is voice information; that is, inputting text information yields the corresponding voice information. The whole prosody-embedding and speech-synthesis system thus becomes an end-to-end process, and the mapping from text information to voice information is completed without manual feature screening or modeling.
It should be emphasized that, to further ensure the privacy and security of data, the hidden state vector and the sequence code calculated in the above embodiments may also be stored in a node of a blockchain.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each block containing a batch of network transaction information used to verify the validity of the block's information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
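The "each block verifies its predecessor" property described above can be illustrated with a toy hash chain; the block layout and payloads here are invented for illustration and are far simpler than any real blockchain platform.

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's canonical JSON form with SHA-256.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def chain_is_valid(chain):
    # Each block must store the hash of its predecessor.
    return all(chain[i + 1]["prev"] == block_hash(chain[i])
               for i in range(len(chain) - 1))

# Chain two records (e.g. a stored hidden state vector and a sequence code):
genesis = {"prev": "0" * 64, "payload": "hidden state vector (serialized)"}
second = {"prev": block_hash(genesis), "payload": "sequence code (serialized)"}

print(chain_is_valid([genesis, second]))  # True
genesis["payload"] = "tampered"
print(chain_is_valid([genesis, second]))  # False — tampering breaks the link
```

This is the mechanism that makes stored data such as the hidden state vector tamper-evident once written to a chain.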
The blockchain underlying platform may include processing modules for user management, basic services, smart contracts, operation monitoring, and the like. The user management module is responsible for identity information management of all blockchain participants, including maintaining public/private key generation (account management), key management, and the correspondence between a user's real identity and blockchain address (authority management); where authorized, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus on a valid request, record it to storage; for a new service request, the basic service first performs interface adaptation, analysis, and authentication, encrypts the service information through a consensus algorithm, transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; a developer can define contract logic in a programming language, publish it to the blockchain (contract registration), and have the logic executed when triggered by a key or another event according to the contract terms, the module also providing a contract-upgrade registration function. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract setting, cloud adaptation, and visual output of real-time status during product operation, for example alarms, monitoring of network conditions, and monitoring of node device health.
Based on the same inventive concept, the embodiment of the present invention further provides a voice synthesis device based on prosody boundary, as shown in fig. 8, including:
The information acquisition module 810 is adapted to acquire prosodic boundary information of the text information to be synthesized, and generate graph embedding information based on the prosodic boundary information;
The generating module 820 is adapted to generate a hidden state vector of the graph embedded information and a sequence code of the text information to be synthesized based on a preset neural network model;
A speech spectrum generation module 830 adapted to generate a speech spectrum based on the hidden state vector and the sequence encoding;
the voice information synthesis module 830 is adapted to synthesize the voice information of the text information to be synthesized according to the speech spectrum.
In an alternative embodiment of the invention, the information acquisition module 810 is further adapted to:
dividing the text information to be synthesized into a plurality of layers according to a preset prosodic boundary structure; wherein, the hierarchy includes prosodic words and prosodic phrases;
acquiring a first vector corresponding to each prosodic word in the text information to be synthesized;
combining a plurality of first vectors belonging to the same prosodic phrase in pairs according to different sequences to generate second vectors corresponding to different combinations;
the graph embedding information is formed based on the first vector and the second vector combination.
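The pairwise combination step above can be sketched as follows. Using concatenation as the combination operator is an assumption for illustration; the patent only specifies that first vectors within the same prosodic phrase are combined in pairs in different orders.

```python
import itertools
import numpy as np

def second_vectors(first_vectors):
    # Every ordered pair (i, j), i != j, of prosodic-word vectors within one
    # prosodic phrase yields a second vector; order matters, so permutations.
    return [np.concatenate([first_vectors[i], first_vectors[j]])
            for i, j in itertools.permutations(range(len(first_vectors)), 2)]

# Three toy 2-d prosodic-word embeddings in one phrase (values assumed):
phrase_words = [np.array([1.0, 0.0]),
                np.array([0.0, 1.0]),
                np.array([1.0, 1.0])]
pairs = second_vectors(phrase_words)
print(len(pairs))  # 6 ordered pairs from 3 words
```

The first and second vectors together then form the graph embedding information fed to the first preset neural network.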
In an alternative embodiment of the invention, the generating module 820 is further adapted to:
The graph embedded information is used as an input vector to be input into a first preset neural network model; the first neural network model is a neural network model which is trained in advance and is in a convergence state and used for carrying out code conversion on the embedded information of the graph;
and obtaining hidden state vectors corresponding to each prosodic phrase output by the first neural network model.
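As a rough illustration of how such a network could turn graph embedding information into hidden state vectors, here is one mean-aggregation message-passing step. The patent does not fix the GNN variant; this untrained, assumption-laden layer merely stands in for the trained first preset neural network.

```python
import numpy as np

def gnn_layer(A, H, W):
    """One message-passing step: each node averages its neighbours' features
    (self-loops included), projects with W, and applies a ReLU.
    A: (n, n) adjacency, H: (n, d) node features, W: (d, d_out)."""
    deg = A.sum(axis=1, keepdims=True)
    A_norm = A / np.maximum(deg, 1.0)        # row-normalised aggregation
    return np.maximum(A_norm @ H @ W, 0.0)   # ReLU

rng = np.random.default_rng(1)
n, d = 4, 3
A = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)  # chain of prosodic nodes + self-loops
H = rng.standard_normal((n, d))                   # graph-embedding node features
W = rng.standard_normal((d, d))                   # projection weights (random here, learned in practice)
hidden_states = gnn_layer(A, H, W)                # one hidden state vector per node
print(hidden_states.shape)  # (4, 3)
```

In the trained model the per-node outputs corresponding to prosodic phrases would serve as the hidden state vectors passed on to the attention mechanism.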
In an alternative embodiment of the present invention, the generating module 820 is further adapted to convert the text information to be synthesized into character information, and form character map embedded information according to the character information;
Inputting the character map embedded information into a second preset neural network model; the second preset neural network model is a neural network model which is trained in advance to a convergence state and is used for performing code conversion on the text;
and acquiring the sequence code of the text information to be synthesized generated by the second neural network model.
In an alternative embodiment of the invention, the speech spectrum generation module 830 is further adapted to:
the hidden state vector and the sequence code are input into an attention mechanism to obtain a voice language spectrum.
In an alternative embodiment of the invention, the speech spectrum generation module 830 is further adapted to:
Inputting the hidden state vector and the sequence code into a preset attention mechanism to obtain a voice language spectrum, wherein the method comprises the following steps of:
Splicing the hidden state vector and the sequence code to obtain a spliced vector;
And inputting the spliced vector into a decoder in the attention mechanism, and obtaining the voice language spectrum of the text information to be synthesized through the decoder.
In an alternative embodiment of the present invention, the speech information synthesis module 830 is further adapted to:
and synthesizing the speech spectrum into the voice information of the text information to be synthesized based on a preset Griffin-Lim algorithm.
In an alternative embodiment of the present invention, there is also provided a computer readable storage medium storing a program code for performing the prosody boundary based speech synthesis method of the above embodiment.
In an alternative embodiment of the present invention, there is also provided a computing device including a processor and a memory: the memory is used for storing the program codes and transmitting the program codes to the processor; the processor is configured to perform the prosodic boundary-based speech synthesis method according to the above-described embodiment according to the instructions in the program code.
The embodiment of the invention provides a voice synthesis method, device, medium and equipment based on prosodic boundaries. The embedding vector of multidimensional features allows the prosodic information in the text to be fully encoded; organized as a data structure in graph form, it is spliced with the sequence code of the text information, and the speech spectrum of the text to be synthesized is obtained through an attention mechanism. In this way the prosodic information in the text fully participates in training and inference, the prosody of the synthesized result shows a clear correlation with the semantic content of the text, and the prosodic naturalness of the synthesized voice information is improved.
It will be clear to those skilled in the art that the specific working processes of the above-described systems, devices, modules and units may refer to the corresponding processes in the foregoing method embodiments, and for brevity, the description is omitted here.
In addition, each functional unit in the embodiments of the present invention may be physically independent, two or more functional units may be integrated together, or all functional units may be integrated in one processing unit. The integrated functional units may be implemented in hardware or in software or firmware.
Those of ordinary skill in the art will appreciate that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in essence or in whole or in part in the form of a software product stored in a storage medium, comprising instructions for causing a computing device (e.g., a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention when the instructions are executed. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk, etc.
Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by hardware (a computing device such as a personal computer, a server, or a network device) under the control of program instructions; the program instructions may be stored in a computer-readable storage medium and, when executed by a processor of the computing device, perform all or part of the steps of the methods of the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all technical features thereof can be replaced by others within the spirit and principle of the present invention; such modifications and substitutions do not depart from the scope of the invention.

Claims (9)

1. A method of prosodic boundary-based speech synthesis, comprising:
acquiring prosodic boundary information of text information to be synthesized, and generating graph embedding information based on the prosodic boundary information;
Generating hidden state vectors of the graph embedded information and sequence codes of the text information to be synthesized based on a preset neural network model;
Generating a speech spectrum based on the hidden state vector and the sequence code;
synthesizing the voice information of the text information to be synthesized according to the voice language spectrum;
The obtaining prosodic boundary information of the text information to be synthesized, generating graph embedding information based on the prosodic boundary information, includes:
dividing the text information to be synthesized into a plurality of layers according to a preset prosodic boundary structure; wherein the hierarchy includes prosodic words and prosodic phrases;
acquiring a first vector corresponding to each prosodic word in the text information to be synthesized;
combining a plurality of first vectors belonging to the same prosodic phrase in pairs to generate second vectors corresponding to different combinations;
graph embedded information is formed based on the first vector and the second vector combination.
2. The method of claim 1, wherein generating the hidden state vector of the graph embedding information based on a predetermined neural network model comprises:
inputting the graph embedded information as an input vector into a first preset neural network model; the first preset neural network model is a neural network model which is trained in advance and is in a convergence state and used for carrying out code conversion on the embedded information of the graph;
and obtaining hidden state vectors corresponding to each prosodic phrase output by the first preset neural network model.
3. The method according to claim 1, wherein generating the sequence code of the text information to be synthesized based on a preset neural network model comprises:
Converting the text information to be synthesized into character information, and forming character diagram embedded information according to the character information;
inputting the character map embedded information into a second preset neural network model, wherein the second preset neural network model is a neural network model which is trained in advance to a convergence state and is used for performing code conversion on a text;
And acquiring the sequence code of the text information to be synthesized, which is generated by the second preset neural network model.
4. The method of claim 1, wherein generating a speech spectrum based on the hidden state vector and sequence encoding comprises:
And inputting the hidden state vector and the sequence code into a preset attention mechanism to obtain a voice language spectrum.
5. The method of claim 4, wherein the attention mechanism comprises a decoder;
inputting the hidden state vector and the sequence code to a preset attention mechanism to obtain a voice language spectrum, wherein the method comprises the following steps of:
splicing the hidden state vector and the sequence code to obtain a spliced vector;
And inputting the spliced vector into a decoder in the attention mechanism, and obtaining the voice language spectrum of the text information to be synthesized through the decoder.
6. The method according to claim 1, wherein synthesizing the speech information of the text information to be synthesized from the speech spectrum comprises:
and synthesizing the voice speech spectrum into the voice information of the text information to be synthesized based on a preset Griffin-Lim algorithm.
7. A prosodic boundary-based speech synthesis device, comprising:
The information acquisition module is suitable for acquiring prosodic boundary information of the text information to be synthesized, and generating graph embedded information based on the prosodic boundary information;
the hidden vector generation module is suitable for generating a hidden state vector of the graph embedded information;
The sequence code generation module is suitable for generating a sequence code of the text to be synthesized;
The speech spectrum generation module is suitable for generating a speech spectrum based on the hidden state vector and the sequence code;
The voice information synthesis module is suitable for synthesizing the voice information of the text information to be synthesized according to the voice language spectrum;
the information acquisition module is further adapted to divide the text information to be synthesized into a plurality of layers according to a preset prosodic boundary structure; wherein the hierarchy includes prosodic words and prosodic phrases; and
Acquiring a first vector corresponding to each prosodic word in the text information to be synthesized; and
Combining a plurality of first vectors belonging to the same prosodic phrase in pairs to generate second vectors corresponding to different combinations; and
Graph embedded information is formed based on the first vector and the second vector combination.
8. A computer readable storage medium for storing program code for performing the prosodic boundary-based speech synthesis method according to any of the claims 1-6.
9. A computing device, the computing device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to perform the prosodic boundary-based speech synthesis method according to the instructions in the program code.
CN202011031529.1A 2020-09-27 2020-09-27 Voice synthesis method and device based on prosody boundary, medium and equipment Active CN112151009B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011031529.1A CN112151009B (en) 2020-09-27 2020-09-27 Voice synthesis method and device based on prosody boundary, medium and equipment
PCT/CN2020/124257 WO2021174874A1 (en) 2020-09-27 2020-10-28 Method and apparatus, medium, and device for speech synthesis based on prosodic boundary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011031529.1A CN112151009B (en) 2020-09-27 2020-09-27 Voice synthesis method and device based on prosody boundary, medium and equipment

Publications (2)

Publication Number Publication Date
CN112151009A CN112151009A (en) 2020-12-29
CN112151009B true CN112151009B (en) 2024-06-25

Family

ID=73894789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011031529.1A Active CN112151009B (en) 2020-09-27 2020-09-27 Voice synthesis method and device based on prosody boundary, medium and equipment

Country Status (2)

Country Link
CN (1) CN112151009B (en)
WO (1) WO2021174874A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345417B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113488022B (en) * 2021-07-07 2024-05-10 北京搜狗科技发展有限公司 Speech synthesis method and device
CN114242039B (en) * 2021-12-14 2024-11-08 清华大学深圳国际研究生院 Method for determining the prosodic structure of Chinese text in synthesizing speech and computer-readable storage medium
CN114420087B (en) * 2021-12-27 2022-10-21 北京百度网讯科技有限公司 Acoustic feature determination method, device, equipment, medium and product
CN114299913A (en) * 2021-12-31 2022-04-08 科大讯飞股份有限公司 Speech synthesis method, apparatus, device and storage medium based on focus information
CN114664282A (en) * 2022-02-18 2022-06-24 哈尔滨工业大学(深圳) Chinese-English cross-language speech synthesis method, device, electronic device and storage medium
CN114743539A (en) * 2022-05-12 2022-07-12 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178896B (en) * 2007-12-06 2012-03-28 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN104021784B (en) * 2014-06-19 2017-06-06 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device based on Big-corpus
US9898458B2 (en) * 2015-05-08 2018-02-20 International Business Machines Corporation Generating distributed word embeddings using structured information
CN105185374B (en) * 2015-09-11 2017-03-29 百度在线网络技术(北京)有限公司 Prosody hierarchy mask method and device
CN105719641B (en) * 2016-01-19 2019-07-30 百度在线网络技术(北京)有限公司 Sound method and apparatus are selected for waveform concatenation speech synthesis
CN107025219B (en) * 2017-04-19 2019-07-26 厦门大学 A Word Embedding Representation Based on Internal Semantic Hierarchy
CN110782870B (en) * 2019-09-06 2023-06-16 腾讯科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN111046648A (en) * 2019-10-29 2020-04-21 平安科技(深圳)有限公司 Rhythm-controlled poetry generating method, device and equipment and storage medium
CN111261140B (en) * 2020-01-16 2022-09-27 云知声智能科技股份有限公司 Rhythm model training method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Also Published As

Publication number Publication date
CN112151009A (en) 2020-12-29
WO2021174874A1 (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN112151009B (en) Voice synthesis method and device based on prosody boundary, medium and equipment
CN112687259B (en) Speech synthesis method, device and readable storage medium
US11881205B2 (en) Speech synthesis method, device and computer readable storage medium
KR101963993B1 (en) Identification system and method with self-learning function based on dynamic password voice
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN112037754B (en) Method for generating speech synthesis training data and related equipment
CN111276120A (en) Speech synthesis method, apparatus and computer-readable storage medium
JP2024012423A (en) Prediction of parametric vocoder parameters from prosodic features
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN114360492B (en) Audio synthesis method, device, computer equipment and storage medium
CN112951203A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111508470A (en) Training method and device of speech synthesis model
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115966197A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113223494B (en) Method, device, equipment and storage medium for predicting mel frequency spectrum
CN113299274B (en) Method, device, equipment and medium for mutual translation and speech synthesis of white text and text
CN112992177B (en) Training method, device, equipment and storage medium of voice style migration model
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
Pan et al. Designing a speech corpus for instance-based spoken language generation
CN113345412A (en) Speech synthesis method, apparatus, device and storage medium
CN113345417B (en) Speech synthesis method, device, equipment and storage medium
CN113421548B (en) Speech synthesis method, device, computer equipment and storage medium
CN113889129B (en) Speech conversion method, device, equipment and storage medium
CN117995165B (en) Speech synthesis method, device and equipment based on hidden variable space watermark addition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant