
CN118588056B - Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment

Info

Publication number: CN118588056B
Application number: CN202411059713.5A
Authority: CN (China)
Prior art keywords: target, text, word, speech, graph
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN118588056A
Inventors: 司马华鹏, 蒋达, 汤毅平
Current and original assignee: Nanjing Silicon Intelligence Technology Co Ltd
Application filed by: Nanjing Silicon Intelligence Technology Co Ltd
Related priority application: CN202411445853.6A (published as CN119380690A)

Classifications

    • G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G06F 40/253 - Grammatical analysis; Style critique
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation
    • G10L 2013/083 - Special characters, e.g. punctuation marks

Abstract

The application relates to the field of computer technology and discloses a text-to-speech generation method and device based on syntactic diagram construction, and an electronic device. The method comprises: obtaining a text to be processed and a target reference speech; determining, according to the text to be processed, text information and phoneme information corresponding to the text to be processed; generating a target syntax graph corresponding to the text to be processed based on the text information and a target syntax graph construction network in a target speech generation model; generating a target word-level encoding corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax graph, and a target encoding network in the target speech generation model; and generating a target synthesized speech based on the target word-level encoding, the target reference speech, and a target speech generation network in the target speech generation model. The target synthesized speech generated by the text-to-speech method provided by the embodiments of the application carries the prosodic features of the text to be processed, which improves the realism and richness of the synthesized speech.

Description

Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text-to-speech generating method and apparatus based on syntactic diagram construction, and an electronic device.
Background
In general, the text-to-speech synthesis process focuses on reproducing the speaker's voice, i.e., ensuring that the synthesized speech matches the speaker's voice characteristics (e.g., pitch, timbre, etc.) as closely as possible. However, the presentation effect of synthesized speech is determined not only by the speaker's voice characteristics but also by the text information to which the speech corresponds. For example, when the text is a suspenseful passage, the synthesized speech should exhibit faster, deeper prosody. In the related art, the prosodic features corresponding to the text information are not considered in the text-to-speech synthesis process, so the prosody of the synthesized speech cannot fully fit the text information, which reduces the richness and realism of the synthesized speech.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a text-to-speech generating method, apparatus, and electronic device based on syntactic diagram construction, which can consider prosodic features corresponding to text information in a text-to-speech synthesis process, so as to improve realism and richness of a synthesized speech. Specifically, the embodiment of the application discloses the following technical scheme:
The first aspect of the embodiments of the application provides a text-to-speech generation method based on syntactic diagram construction, which comprises: first, obtaining a text to be processed and a target reference speech; second, determining, according to the text to be processed, text information and phoneme information corresponding to the text to be processed, wherein the text information comprises text content and boundary information; then, generating a target syntax graph corresponding to the text to be processed based on the text information and a target syntax graph construction network in a target speech generation model, wherein the text content comprises a plurality of characters and the target syntax graph is used for representing the syntactic relationships among the characters; next, generating a target word-level encoding corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax graph, and a target encoding network in the target speech generation model; and finally, generating a target synthesized speech based on the target word-level encoding, the target reference speech, and a target speech generation network in the target speech generation model, wherein the target synthesized speech comprises the prosodic features of the text to be processed.
In some embodiments, generating the target syntax graph corresponding to the text to be processed based on the text information and the target syntax graph construction network in the target speech generation model comprises: generating an initial dependency graph corresponding to the text information based on the text content, the word boundaries in the boundary information, and a dependency parser in the syntax graph construction network, wherein the initial dependency graph comprises a plurality of word nodes and unidirectional connections among the word nodes; and generating a target syntax graph corresponding to the initial dependency graph based on the initial dependency graph and a syntax graph constructor in the syntax graph construction network.
In some embodiments, generating the target syntax graph corresponding to the initial dependency graph based on the initial dependency graph and the syntax graph constructor in the syntax graph construction network comprises: dividing each word node in the initial dependency graph into at least one character node based on the syntax graph constructor and the word boundaries in the boundary information, and determining the first character node in each word node; establishing first dependency connections among the plurality of first character nodes, a start node, and an end node; establishing second dependency connections among the at least one character node in each word node according to the intra-word order of the at least one character node in each word node; and obtaining the target syntax graph according to the first dependency connections and the second dependency connections, wherein the first dependency connections and/or the second dependency connections are used for representing bidirectional connections between the character nodes.
In some embodiments, generating the target word-level encoding corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax graph, and the target encoding network in the target speech generation model comprises: generating an initial word-level encoding corresponding to the text to be processed based on the phoneme information, the word boundaries in the boundary information, and a word-level average pooling layer in the target encoding network; and generating the target word-level encoding based on the initial word-level encoding, the target syntax graph, and a plurality of gated graph convolution layers in the target encoding network.
In some embodiments, generating the initial word-level encoding corresponding to the text to be processed based on the phoneme information, the word boundaries in the boundary information, and the word-level average pooling layer in the target encoding network comprises: determining at least one phoneme feature corresponding to each character in the text to be processed based on the phoneme information and the word boundaries; performing an average pooling operation on the at least one phoneme feature corresponding to each character to determine an integrated phoneme feature corresponding to each character; and determining the initial word-level encoding corresponding to the text to be processed based on the integrated phoneme feature corresponding to each character.
In some embodiments, generating the target word-level encoding based on the initial word-level encoding, the target syntax graph, and the plurality of gated graph convolution layers in the target encoding network comprises: generating a first word-level encoding based on the initial word-level encoding, the target syntax graph, and a first gated graph convolution layer of the plurality of gated graph convolution layers; generating a second word-level encoding based on the first word-level encoding, the target syntax graph, and a second gated graph convolution layer of the plurality of gated graph convolution layers; and determining the target word-level encoding based on the initial word-level encoding, the first word-level encoding, and the second word-level encoding.
In some embodiments, generating the target synthesized speech based on the target word level encoding, the target reference speech, and the target speech generation network in the target speech generation model includes generating a target style vector based on the target word level encoding, the target reference speech, and a style vector unit in the target speech generation network, wherein the target style vector is used to characterize prosodic information corresponding to the target reference speech and prosodic information corresponding to the text to be processed, and generating the target synthesized speech based on the target style vector, the text to be processed, and a generation unit in the target speech generation network.
In some embodiments, the method further comprises: obtaining sample text data and sample reference speech data; generating a predicted synthesized speech based on the sample text data, the sample reference speech data, and a speech generation model to be trained, wherein the speech generation model to be trained comprises a syntax graph construction network to be trained and an encoding network to be trained; obtaining sample synthesized speech data; and, taking the predicted synthesized speech as the initial training output of the speech generation model to be trained and the sample synthesized speech data as supervision information, iteratively training the speech generation model to be trained to obtain the target speech generation model, wherein the target speech generation model comprises the target syntax graph construction network and the target encoding network.
The second aspect of the embodiments of the application provides a text-to-speech generating device based on syntactic diagram construction, which comprises an acquisition module, a determination module, a first generation module, a second generation module, and a third generation module. The acquisition module is configured to acquire the text to be processed and the target reference speech. The determination module is configured to determine, according to the text to be processed, text information and phoneme information corresponding to the text to be processed, wherein the text information comprises text content and boundary information. The first generation module is configured to generate a target syntax graph corresponding to the text to be processed based on the text information and a target syntax graph construction network in a target speech generation model, wherein the text content comprises a plurality of characters and the target syntax graph is used for representing the syntactic relationships among the characters. The second generation module is configured to generate a target word-level encoding corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax graph, and a target encoding network in the target speech generation model. The third generation module is configured to generate a target synthesized speech based on the target word-level encoding, the target reference speech, and a target speech generation network in the target speech generation model, wherein the target synthesized speech comprises the prosodic features of the text to be processed.
A third aspect of the embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is configured to store computer executable instructions, and the processor is configured to read the instructions from the memory and execute the instructions to implement the text-to-speech generating method based on the syntactic map construction according to the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer program instructions that, when executed by a computer, cause the computer to perform the text-to-speech generating method based on syntactic diagram construction described in the foregoing first aspect.
A fifth aspect of an embodiment of the present application provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the text-to-speech generation method based on syntactic map construction as described in the foregoing first aspect.
A sixth aspect of the embodiments of the present application provides a computer program, which when executed by a processor, can implement the text-to-speech generating method based on the syntactic map construction described in the foregoing first aspect.
According to the text-to-speech generation method based on syntactic diagram construction provided by the embodiments of the application, the target synthesized speech corresponding to the text to be processed and the target reference speech is generated by a pre-trained target speech generation model. First, a target syntax graph corresponding to the text to be processed is generated based on the text information corresponding to the text to be processed and the target syntax graph construction network in the target speech generation model; then, a target word-level encoding corresponding to the text to be processed is generated based on the phoneme information and boundary information corresponding to the text to be processed, the target syntax graph, and the target encoding network in the target speech generation model; finally, the target synthesized speech is generated based on the target word-level encoding, the target reference speech, and the target speech generation network in the target speech generation model. Because the target syntax graph generated by the target syntax graph construction network can represent the syntactic relationships among the characters of the text, it provides syntactic information that better conforms to Chinese syntax and prosodic features that better match the text information. The target word-level encoding generated from the target syntax graph therefore carries the prosodic features of the text to be processed, so that the target synthesized speech generated from the target word-level encoding and the target reference speech also has these prosodic features. In this way, the text-to-speech generation method based on syntactic diagram construction can extract the prosodic features of the text information during speech synthesis, improving the realism and richness of the synthesized speech.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a target speech generation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an initial dependency graph and a target syntax graph according to an embodiment of the present application;
FIG. 5 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 6 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 7 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another target speech generation model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 10 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an X-vector feature vector extractor according to an embodiment of the present application;
FIG. 12 is a schematic diagram of yet another text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application;
fig. 13 is a schematic diagram of a target voice generating network according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a text-to-speech generating device constructed based on a syntax diagram according to an embodiment of the present application;
Fig. 15 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solution in the embodiments of the present invention and make the above objects, features and advantages of the embodiments of the present invention more comprehensible, the technical solution in the embodiments of the present invention is described in further detail below with reference to the accompanying drawings.
Text to speech (TTS) is the conversion of written text data into audible speech data. In general, the text-to-speech process focuses on reproducing the voice of the reference speaker, that is, making the synthesized speech fit the reference speaker's voice characteristics, such as pitch, timbre, and prosody, as closely as possible.
However, the quality of the presentation effect of a segment of synthesized speech depends not only on the voice characteristics of the reference speaker but also on the text information corresponding to the speech. For example, synthesized speech is expected to have fast, deep prosodic features when the text is a suspenseful passage, and slow, smooth prosodic features when the text is a calm, expository passage. Speech synthesized in the manner of the related art does not reflect the text information in this way, so the synthesized speech is not realistic and rich enough and cannot provide a good experience for users.
In order to solve the problems, the embodiment of the application provides a text-to-speech generating method constructed based on a syntactic diagram, which can extract prosodic features of text information in the process of synthesizing speech and generate synthesized speech based on the prosodic features of the text information, thereby ensuring that the synthesized speech has the prosodic features of the text information, improving the authenticity and richness of the synthesized speech and providing better use experience for users.
A text-to-speech generation method based on syntactic diagram construction according to an embodiment of the present application will be described with reference to fig. 1 to 13.
Fig. 1 is a schematic diagram of a text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 1, the text-to-speech method includes steps 110 through 150 as follows.
Step 110, obtaining the text to be processed and the target reference voice.
For example, in the text-to-speech generation process, the information of the text to be processed needs to be expressed with the voice of the target reference speech to obtain the finally generated target synthesized speech. Therefore, the text to be processed and the target reference speech need to be acquired before the target synthesized speech is generated.
In some examples, the text to be processed provides the textual content of the target synthesized speech. The text to be processed may be a passage, such as a segment of a thriller novel or of a document; or it may be a sentence or a phrase, for example, "猴子喜欢吃香蕉" ("The monkey likes to eat bananas"). The embodiments of the application do not limit the text to be processed.
In some examples, the target reference speech provides the voice information of the target synthesized speech, including prosodic information of the target speaker's voice, such as timbre and pitch. The target reference speech may be speech of the target speaker, for example, speech of the target speaker of a preset duration. The target speaker may be a real person or a virtual person, which is not limited in the embodiments of the application.
Step 120, determining text information and phoneme information corresponding to the text to be processed according to the text to be processed.
For example, the text to be processed may be parsed to obtain the text information and the phoneme information corresponding to the text to be processed, where the phoneme information corresponds to the text information. The text information is the written form of the language and reflects its literal meaning, grammatical structure, contextual associations, and other information. The phoneme information is the phoneme sequence of the actual pronunciation obtained by further converting the text information, and reflects the pronunciation characteristics and acoustic expression of the language.
In some examples, after the text information is obtained from the text to be processed, each word in the text information may be converted into its corresponding phoneme sequence based on the text information and a phoneme dictionary, so as to obtain the phoneme information corresponding to the text information. The phoneme dictionary may be a preset dictionary, for example a resource library built for the target language, or an existing phoneme dictionary in the related art, which is not limited in the embodiments of the application.
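As an illustration of this text-to-phoneme conversion, the following sketch looks up each character of a short text in a phoneme dictionary. The dictionary contents, the pinyin-style phoneme notation, and the function name are assumptions made for this example and are not taken from the embodiments of the application.

```python
# Minimal sketch (illustrative assumption): map each character to its phonemes.
PHONEME_DICT = {
    "猴": ["h", "ou2"], "子": ["z", "i5"],
    "喜": ["x", "i3"], "欢": ["h", "uan1"],
    "吃": ["ch", "i1"],
    "香": ["x", "iang1"], "蕉": ["j", "iao1"],
}

def text_to_phonemes(text):
    """Look up each character of the text in the preset phoneme dictionary."""
    phonemes = []
    for ch in text:
        phonemes.extend(PHONEME_DICT[ch])
    return phonemes

print(text_to_phonemes("猴子喜欢吃香蕉"))
# ['h', 'ou2', 'z', 'i5', 'x', 'i3', 'h', 'uan1', 'ch', 'i1', 'x', 'iang1', 'j', 'iao1']
```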
For example, the text information may include text content and boundary information. The boundary information includes word boundaries and character boundaries.
In some examples, the text content is the plain text corresponding to the text information and is used to express the meaning of the text. The text content may include a plurality of characters, such as words (e.g., Chinese characters or letters), numbers, and punctuation marks. The boundary information indicates the separation positions between words or between characters in the text content: the separation between two adjacent words is called a word boundary, and the separation between two adjacent characters is called a character boundary.
For example, when the text to be processed is "猴子喜欢吃香蕉" ("The monkey likes to eat bananas"), the word boundaries are the separations between "猴子" and "喜欢", between "喜欢" and "吃", and between "吃" and "香蕉", while the character boundaries are the separations between the adjacent characters "猴", "子", "喜", "欢", "吃", "香", and "蕉".
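The following minimal sketch shows how word boundaries and character boundaries could be derived from a given word segmentation of this example sentence; the segmentation and the representation of boundaries as character offsets are assumptions for illustration only.

```python
# Illustrative sketch: derive boundary positions from an assumed word segmentation.
words = ["猴子", "喜欢", "吃", "香蕉"]

text = "".join(words)
char_boundaries = list(range(1, len(text)))   # a boundary after every character
word_boundaries = []
pos = 0
for w in words[:-1]:
    pos += len(w)
    word_boundaries.append(pos)               # a boundary after each word

print(word_boundaries)   # [2, 4, 5]
print(char_boundaries)   # [1, 2, 3, 4, 5, 6]
```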
By converting the text to be processed into text information and phoneme information, the embodiments of the application characterize the essential features of the text to be processed at two levels: the text information focuses on the semantic structure and lexical meaning of the text to be processed, while the phoneme information focuses on its pronunciation details and acoustic characteristics, providing the necessary inputs for the subsequent text-to-speech generation process.
Fig. 2 is a schematic diagram of a target speech generation model according to an embodiment of the present application. As shown in fig. 2, the target speech generation model 1 includes a syntactic diagram construction sub-model 100 and a generation sub-model 200. It should be noted that, the text-to-speech generating method based on the syntactic diagram construction provided in the embodiment of the present application may be implemented based on the target speech generating model 1.
Illustratively, the syntax diagram construction sub-model 100 includes a target syntax graph construction network 10 and a target encoding network 20. The target syntax graph construction network 10 may include a dependency parser 11 and a syntax graph constructor 12, and the target encoding network 20 may include a word-level average pooling layer 21 and a plurality of gated graph convolution layers 22.
Illustratively, the generation sub-model 200 includes the target speech generation network 30. It should be noted that, the generating sub-model 200 may be a generator in a speech generating model in the related art, or may be a generator modified in the embodiment of the present application. The embodiment of the present application is not limited thereto.
The text-to-speech generation method based on syntactic diagram construction provided by the embodiment of the present application is further described below with reference to the target speech generation model 1 in fig. 2.
Step 130, generating a target syntax graph corresponding to the text to be processed based on the text information and the target syntax graph construction network in the target speech generation model.
Illustratively, the text content includes a plurality of characters, and the target syntax graph is used to characterize a syntactic relationship between each of the plurality of characters. When the text content is a Chinese sentence, the characters are used for representing Chinese characters in the Chinese sentence.
In some examples, the syntax graph is a graphical representation of the text information, which can characterize the syntactic relationships between the words in the text information. From the syntactic relationships among the words, the prosodic features of the words, including pronunciation features, tone features, and the like, can be obtained.
The target syntax graph generated by the target syntax graph construction network includes the syntactic relationships among the characters. Compared with syntax graphs in the related art, which can only represent the syntactic relationships among words, the target syntax graph provides syntactic information that better conforms to Chinese syntax and can provide prosodic features that better match the text information.
For example, in a Chinese sentence, the prosody of the same character may differ between sentences, i.e., the same character may be pronounced differently in different words or sentences (a polyphonic character).
Therefore, when determining the prosody of a Chinese character, it is necessary to consider the sentence in which it appears, that is, the dependency relationships between the character and the other words in the sentence. The target syntax graph provided by the embodiments of the application can characterize these dependency relationships, so richer and more accurate prosodic features of the text to be processed can be obtained through the target syntax graph.
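The following small illustration (an assumed example, not taken from the application) shows why a character's pronunciation can only be resolved with its word or sentence context: the polyphonic character "行" reads hang2 in "银行" (bank) and xing2 in "行走" (to walk).

```python
# Illustrative assumption: a polyphone whose reading depends on the word it is in.
POLYPHONE_READINGS = {
    ("行", "银行"): "hang2",
    ("行", "行走"): "xing2",
}

def pronounce(char, word):
    return POLYPHONE_READINGS.get((char, word), "unknown")

print(pronounce("行", "银行"))  # hang2
print(pronounce("行", "行走"))  # xing2
```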
Fig. 3 is a schematic diagram of another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 3, the above-mentioned step 130 includes steps 1310 to 1320 as follows.
Step 1310, generating an initial dependency graph corresponding to the text information based on the text content, the word boundaries in the boundary information, and the dependency parser in the syntax graph construction network.
With continued reference to fig. 2, the text content and the word boundaries in the boundary information are input to the dependency parser 11 in the target syntax graph construction network 10, and the dependency parser 11 generates the initial dependency graph corresponding to the text information.
In some examples, the initial dependency graph is a directed graph based on a unidirectional structure. The initial dependency graph comprises a plurality of word nodes and unidirectional connection relations among the word nodes in the plurality of word nodes.
Fig. 4 is a schematic diagram of an initial dependency graph and a target syntax graph according to an embodiment of the present application, where (a) in fig. 4 is a schematic diagram of the initial dependency graph. For example, when the text information is "猴子喜欢吃香蕉" ("The monkey likes to eat bananas"), the initial dependency graph generated by the dependency parser 11 is shown in (a) of fig. 4.
As shown in (a) of fig. 4, the initial dependency graph includes four word nodes, "猴子" (monkey), "喜欢" (like), "吃" (eat), and "香蕉" (banana), where "喜欢" is the parent (head) node and unidirectional connections are established between related word nodes. Because "喜欢" is the parent node, the unidirectional connection between "猴子" and "喜欢" points from "喜欢" to "猴子", the unidirectional connection between "喜欢" and "吃" points from "喜欢" to "吃", and the unidirectional connection between "吃" and "香蕉" points from "吃" to "香蕉".
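A possible plain-data sketch of this initial dependency graph is shown below; the dictionary representation is an assumption for illustration, since the embodiments do not prescribe a concrete data structure.

```python
# Illustrative sketch of the initial dependency graph for "猴子喜欢吃香蕉":
# word nodes with unidirectional head -> dependent edges.
initial_dependency_graph = {
    "nodes": ["猴子", "喜欢", "吃", "香蕉"],
    "edges": [                 # (head, dependent), one direction only
        ("喜欢", "猴子"),
        ("喜欢", "吃"),
        ("吃", "香蕉"),
    ],
}

for head, dependent in initial_dependency_graph["edges"]:
    print(f"{head} -> {dependent}")
```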
In some examples, the initial dependency graph can characterize prosodic features corresponding to the text information, but the prosodic information determined from it is not accurate enough. On the one hand, because the initial dependency graph only contains unidirectional connections between word nodes, each word node in the sentence cannot acquire information from non-adjacent word nodes during graph aggregation, so the prosodic features obtained for the text information are not accurate enough. On the other hand, because every node of the initial dependency graph is a word node, and the pronunciation of each character in a Chinese word depends on the word or sentence it appears in, encoding the text information with the initial dependency graph cannot capture the pronunciation characteristics of each character, which affects the quality of the subsequently synthesized speech. Therefore, for a more complex sentence, whose prosody (pronunciation, pauses, stress, and the like) is also more complex, directly using the initial dependency graph cannot adequately reflect the prosodic features of the sentence.
In order to obtain the prosodic features corresponding to the text information more accurately, the embodiments of the application introduce a syntax graph constructor 12 into the target syntax graph construction network 10 and further process the initial dependency graph through the syntax graph constructor 12, thereby obtaining richer prosodic features of the text information.
In step 1320, a target syntax graph corresponding to the initial dependency graph is generated based on the initial dependency graph and the syntax graph constructor in the syntax graph construction network.
Illustratively, the syntax graph constructor 12 is used to convert the initial dependency graph into the target syntax graph. The target syntax graph is a directed graph based on a bidirectional structure and comprises a plurality of character nodes and bidirectional connections among those character nodes.
With continued reference to fig. 2, after the initial dependency graph output by the dependency parser 11 is acquired, the initial dependency graph may be input into the syntax graph constructor 12 to obtain the target syntax graph.
In some examples, the target syntax graph adds bidirectional edges between nodes on the basis of the initial dependency graph, so that the information flow in the target syntax graph becomes bidirectional. That is, for any pair of parent and child nodes, the target syntax graph includes both a forward edge from the parent node to the child node and a reverse edge from the child node to the parent node.
Fig. 5 is a schematic diagram of yet another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 5, the above step 1320 includes steps 1321 to 1324 as follows.
Step 1321, dividing each word node in the initial dependency graph into at least one character node based on the syntax graph constructor and the word boundaries in the boundary information, and determining the first character node in each word node.
Illustratively, the syntax graph constructor 12 may further split each word node in the initial dependency graph into at least one character node based on the word boundaries and establish dependency relationships between the character nodes.
For example, in a Chinese sentence (i.e., the text content), the same character may have different prosody or pronunciation in different words or sentences, so the phonemes of that character may differ. In order for the target syntax graph to represent the pronunciation rules of Chinese more accurately, the syntax graph constructor 12 provided by the embodiments of the application extends the graph from one node per word in the initial dependency graph to one node per character, i.e., after each word node is split into character nodes, the dependency relationships between the character nodes are established, so that more accurate pronunciation and prosody information can be obtained.
With continued reference to fig. 4, fig. 4 (b) is a schematic diagram of a target syntax graph. For example, after being processed by the syntax graph constructor 12, the initial dependency graph of (a) in fig. 4 yields the target syntax graph shown in (b) of fig. 4.
As shown in (b) of fig. 4, in the target syntax graph, the word node "猴子" is split into the two character nodes "猴" and "子", the word node "喜欢" is split into the two character nodes "喜" and "欢", and the word node "香蕉" is split into the two character nodes "香" and "蕉", while the word node "吃" corresponds to the single character node "吃".
In some examples, after each word node is divided into at least one character node, the first character node of each word node may be determined according to the intra-word order of the character nodes in the word node to which they belong. For example, in the word node "猴子", the intra-word order is "猴" followed by "子", so the first character node of this word node is "猴". Likewise, the first character nodes corresponding to the word nodes are, in order, "猴", "喜", "吃", and "香".
In step 1322, first dependency connections are established among the plurality of first character nodes, the start node, and the end node.
Illustratively, the target syntax graph also includes a start node (also referred to as BOS) and an end node (also referred to as EOS). The start node is connected to the first character node of the first word node in the text content, and the end node is connected to the first character node of the last word node in the text content.
In some examples, first dependency connections may be established among the first character nodes corresponding to the respective word nodes, the start node, and the end node. The first dependency connections may be bidirectional. For example, a first dependency connection may be established between the first character nodes of adjacent word nodes according to the adjacency of the word nodes in the initial dependency graph. It should be noted that the target syntax graph is built on the basis of the initial dependency graph, so the order of its word nodes is consistent with the order of the word nodes in the initial dependency graph.
As shown in (b) of fig. 4, solid arrows in the target syntax graph represent the first dependency connections: "BOS" and "猴", "猴" and "喜", "喜" and "吃", "吃" and "香", and "香" and "EOS" are all connected bidirectionally (i.e., by first dependency connections).
Step 1323, establishing second dependency connections among the at least one character node in each word node according to the intra-word order of the character nodes in each word node.
In some examples, after each word node is divided into at least one character node, second dependency connections between the character nodes in each word node may be established according to the intra-word order of the character nodes in the word node to which they belong. The second dependency connections may be bidirectional.
As shown in (b) of fig. 4, dashed arrows represent the second dependency connections: the character node "猴" and the character node "子" in the word node "猴子" are connected bidirectionally (i.e., by a second dependency connection), and similarly "喜" and "欢" are connected bidirectionally, as are "香" and "蕉".
Step 1324, obtaining the target syntax graph according to the first dependency connections and the second dependency connections.
The first dependency connections and/or the second dependency connections are used for representing the bidirectional connections between the character nodes.
In some examples, after the first and second dependency connections between the character nodes are established, the target syntax graph shown in (b) of fig. 4 is obtained.
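The following sketch illustrates the conversion described in steps 1321 to 1324 for the example sentence, under the assumption that every connection is stored in both directions and that characters can serve directly as node identifiers; function and variable names are illustrative only.

```python
# Minimal sketch of the syntax graph constructor's conversion (steps 1321-1324).
def build_target_syntax_graph(words):
    """words: word nodes of the initial dependency graph, in sentence order."""
    nodes = ["BOS"] + [c for w in words for c in w] + ["EOS"]
    edges = set()

    def connect(a, b):            # every connection is bidirectional
        edges.add((a, b))
        edges.add((b, a))

    # First dependency connections: BOS, the first character of each word node, EOS.
    first_chars = [w[0] for w in words]
    chain = ["BOS"] + first_chars + ["EOS"]
    for a, b in zip(chain, chain[1:]):
        connect(a, b)

    # Second dependency connections: adjacent characters inside each word node.
    for w in words:
        for a, b in zip(w, w[1:]):
            connect(a, b)

    return nodes, edges

nodes, edges = build_target_syntax_graph(["猴子", "喜欢", "吃", "香蕉"])
# e.g. ("BOS","猴"), ("猴","喜"), ("喜","吃"), ("吃","香"), ("香","EOS") and
# intra-word pairs such as ("猴","子"), all stored in both directions.
```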
Through the syntax graph constructor 12, the embodiments of the application can convert the initial dependency graph into the target syntax graph. Because all character nodes in the target syntax graph are connected bidirectionally, more accurate pronunciation and prosodic features of the text information can be obtained, providing better input for the subsequent generation of the target word-level encoding. That is, the embodiments of the application improve the syntax-graph extraction process to obtain syntactic information that conforms to Chinese syntax and, in the subsequent processing, prosodic features that better match the input text.
Step 140, generating a target word-level encoding corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax graph, and the target encoding network in the target speech generation model.
Referring to fig. 2, after the target syntax diagram is acquired, phoneme information, boundary information, and the target syntax diagram corresponding to the text to be processed may be input into the target encoding network 20, and the target encoding network 20 outputs target word-level encoding.
Fig. 6 is a schematic diagram of still another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 6, the above-mentioned step 140 includes steps 1410 to 1420 as follows.
Step 1410, generating an initial word-level encoding corresponding to the text to be processed based on the phoneme information, the word boundaries in the boundary information, and the word-level average pooling layer in the target encoding network.
In some examples, the boundary information includes word boundaries. As shown in fig. 2, the phoneme information and the word boundaries may be input to the word-level average pooling layer 21 in the target encoding network 20 to generate the initial word-level encoding.
In some embodiments, step 1410 includes: determining at least one phoneme feature corresponding to each character in the text to be processed based on the phoneme information and the word boundaries; performing an average pooling operation on the at least one phoneme feature corresponding to each character to determine an integrated phoneme feature corresponding to each character; and determining the initial word-level encoding corresponding to the text to be processed based on the integrated phoneme feature corresponding to each character.
In some examples, after the phoneme information corresponding to the text to be processed is obtained, it needs to be further processed. The phoneme information corresponding to the text to be processed may be a phoneme encoding sequence. Because each character (e.g., a Chinese character in a Chinese sentence) is generally composed of a plurality of phonemes, the encoding subsequence corresponding to each character node is determined within the phoneme encoding sequence corresponding to the text to be processed.
For example, the word boundaries can be used to determine which adjacent phonemes in the phoneme encoding sequence corresponding to the text to be processed belong to the same character node. After the phonemes corresponding to each character node are determined, an average pooling operation may be performed on the at least one phoneme corresponding to the same character node, so as to obtain the integrated phoneme feature of that character node. After the integrated phoneme features of the character nodes are obtained, each character node is encoded to determine the initial word-level encoding corresponding to the text to be processed. That is, in the embodiments of the application, the encoding is performed in units of characters rather than words.
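A minimal sketch of this average pooling step is given below; the use of numpy, the feature shapes, and the per-character phoneme counts are assumptions for illustration.

```python
# Illustrative sketch: average the phoneme features belonging to each character.
import numpy as np

def char_level_average_pooling(phoneme_features, phonemes_per_char):
    """
    phoneme_features: array of shape (num_phonemes, dim), in text order.
    phonemes_per_char: number of phonemes belonging to each character,
                       derived from the boundary information.
    Returns one integrated feature per character, shape (num_chars, dim).
    """
    pooled, start = [], 0
    for count in phonemes_per_char:
        pooled.append(phoneme_features[start:start + count].mean(axis=0))
        start += count
    return np.stack(pooled)

# Example: 7 characters, 2 phonemes each (initial + final), feature dim 8.
feats = np.random.randn(14, 8)
initial_encoding = char_level_average_pooling(feats, [2] * 7)
print(initial_encoding.shape)  # (7, 8)
```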
Because the average pooling and the encoding are performed in units of characters, the embodiments of the application can obtain more accurate prosodic information from the text to be processed than the word-level encodings used in the related art.
Step 1420, generating the target word-level encoding based on the initial word-level encoding, the target syntax graph, and the plurality of gated graph convolution layers in the target encoding network.
In some examples, as shown in fig. 2, after the initial word-level encoding output by the word-level average pooling layer 21 is obtained, the initial word-level encoding and the target syntax graph may be input to the plurality of gated graph convolution layers 22 in the target encoding network 20, and the target word-level encoding is determined according to the output of each gated graph convolution layer 22. The target word-level encoding can characterize the prosodic features corresponding to the text to be processed.
Illustratively, the target encoding network 20 may include a plurality of gated graph convolution layers 22. For example, the target encoding network 20 may include two gated graph convolution layers, such as a first gated graph convolution layer and a second gated graph convolution layer, or it may include more layers; the number of gated graph convolution layers 22 is not limited in the embodiments of the application. The embodiments of the application are described by taking as an example a target encoding network 20 that includes a first gated graph convolution layer and a second gated graph convolution layer.
In some examples, the target syntax graph input to each gated graph convolution layer 22 may be obtained by processing the output of the syntax graph constructor 12. That is, the output of the syntax graph constructor 12 may be converted into a data type that the gated graph convolution layers 22 can process and then input to each gated graph convolution layer 22.
For example, the target syntax graph input to the gated graph convolution layers 22 may include node embedding feature vectors and a graph structure feature vector corresponding to the target syntax graph. The node embedding feature vectors represent the features of the character nodes in the target syntax graph, and the graph structure feature vector represents the relationships between the character nodes in the target syntax graph.
In some examples, the gated graph convolution layers 22 are used to extract the syntactic information of each character node and the dependency relationships between the character nodes in the target syntax graph. For example, each gated graph convolution layer 22 may include 5 iterations to extract the dependency relationships of the nodes in the target syntax graph.
Through the gated graph convolution layers 22, the embodiments of the application can further extract the phoneme features of each node in the text to be processed according to the initial word-level encoding and the target syntax graph, so as to obtain richer and more accurate prosodic information for each character node of the text to be processed.
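The following simplified sketch shows one possible form of such a gated graph convolution layer, in the spirit of gated graph neural networks: neighbor information is aggregated over the adjacency of the target syntax graph, and node states are updated by a gated (GRU) cell for a fixed number of iterations (5, as mentioned above). The use of PyTorch and the exact layer structure are assumptions for illustration, not the layer defined by the embodiments.

```python
# Simplified sketch (assumed structure) of one gated graph convolution layer.
import torch
import torch.nn as nn

class GatedGraphConvLayer(nn.Module):
    def __init__(self, dim, num_steps=5):
        super().__init__()
        self.message = nn.Linear(dim, dim)   # transform neighbor states into messages
        self.gru = nn.GRUCell(dim, dim)      # gated update of node states
        self.num_steps = num_steps

    def forward(self, node_states, adjacency):
        # node_states: (num_nodes, dim); adjacency: (num_nodes, num_nodes), bidirectional.
        h = node_states
        for _ in range(self.num_steps):
            messages = adjacency @ self.message(h)   # aggregate neighbor information
            h = self.gru(messages, h)                # gated state update
        return h
```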
Fig. 7 is a schematic diagram of still another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 7, the above step 1420 includes steps 1421 to 1423 as follows.
Step 1421, generating a first word-level encoding based on the initial word-level encoding, the target syntax graph, and a first gated graph convolution layer of the plurality of gated graph convolution layers.
Step 1422, generating a second word-level encoding based on the first word-level encoding, the target syntax graph, and a second gated graph convolution layer of the plurality of gated graph convolution layers.
Step 1423, determining the target word-level encoding based on the initial word-level encoding, the first word-level encoding, and the second word-level encoding.
In some examples, the initial word-level encoding output by the word-level average pooling layer 21 and the target syntax graph output by the target syntax graph construction network 10 are input to the first gated graph convolution layer, which outputs the first word-level encoding. The first word-level encoding and the target syntax graph output by the target syntax graph construction network 10 are then input to the second gated graph convolution layer, which outputs the second word-level encoding. Finally, the initial word-level encoding output by the word-level average pooling layer 21, the first word-level encoding output by the first gated graph convolution layer, and the second word-level encoding output by the second gated graph convolution layer are summed to obtain the target word-level encoding.
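Continuing the previous sketch, the following illustrates steps 1421 to 1423: two gated graph convolution layers applied in sequence, with the initial, first, and second encodings summed to form the target word-level encoding. The module structure is again an assumption for illustration.

```python
# Illustrative sketch of steps 1421-1423, reusing GatedGraphConvLayer from above.
import torch.nn as nn

class TargetEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gcn1 = GatedGraphConvLayer(dim)
        self.gcn2 = GatedGraphConvLayer(dim)

    def forward(self, initial_encoding, adjacency):
        first = self.gcn1(initial_encoding, adjacency)   # first word-level encoding
        second = self.gcn2(first, adjacency)             # second word-level encoding
        return initial_encoding + first + second         # summed target encoding
```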
It should be noted that a greater number of gated graph convolution layers 22 may also be provided in the target encoding network 20, with the outputs of the gated graph convolution layers superimposed and aggregated to produce the target word-level encoding; the processing is similar to steps 1421 to 1423 described above and is not repeated here.
According to the text-to-speech generation method based on syntactic diagram construction provided by the embodiments of the application, the target syntax graph corresponding to the text to be processed is obtained through the target syntax graph construction network in the target speech generation model, and the target word-level encoding corresponding to the text to be processed is then obtained from the target syntax graph and the target encoding network 20. Because the target syntax graph represents the bidirectional dependency relationships between the character nodes of the text information, it provides syntactic information that better conforms to Chinese syntax and prosodic features that better match the text information, and the target word-level encoding obtained from it can better reflect the prosodic features of the nodes of the text to be processed. Therefore, when the target word-level encoding is used as the input of the subsequent generation sub-model 200 for speech synthesis, the generated target synthesized speech carries the prosodic features of the text to be processed, improving the realism and richness of the target synthesized speech.
Step 150, generating target synthesized speech based on the target word level encoding, the target reference speech, and the target speech generation network in the target speech generation model.
Illustratively, the target synthesized speech includes prosodic features of the text to be processed.
In some examples, as shown in fig. 2, after the target word-level code is obtained through the syntactic diagram construction sub-model 100, the target word-level code and the target reference speech may be input to the target speech generation network 30 in the generation sub-model 200 to generate target synthesized speech corresponding to the text to be processed and the target reference speech.
It should be noted that the target speech generation network 30 may be a generator used for speech synthesis in the related art, or a generator improved in the embodiments of the application. Because the input of the generator in the embodiments of the application is the target word-level encoding, the final synthesized speech has the prosodic features of the text to be processed regardless of whether the generator of the related art or the improved generator of the embodiments of the application is used, improving the realism and richness of the final synthesized speech. This embodiment is described by taking as an example the process of generating the target synthesized speech with the improved generator (i.e., the target speech generation network 30).
In some embodiments, target speech generation network 30 includes a style vector unit and a generation unit.
The target speech generating network 30 provided by the embodiment of the present application is further described below with reference to fig. 8 to 13.
Fig. 8 is a schematic diagram of another target speech generation model according to an embodiment of the present application. As shown in fig. 8, the target speech generation network 30 includes a style vector unit 31 and a generation unit 32. Wherein the style vector unit 31 is used for further extracting prosodic information corresponding to the text to be processed and prosodic information corresponding to the target reference speech, and the generating unit 32 is used for generating the final target synthesized speech according to the prosodic information (such as the first prosodic information and the second prosodic information) extracted by the style vector unit.
Fig. 9 is a schematic diagram of still another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 9, the above step 150 includes steps 1510 to 1520 as follows.
Step 1510 generates a target style vector based on the target word level encoding, the target reference speech, and a style vector unit in the target speech generation network.
For example, the target style vector may be used to characterize first prosodic information corresponding to the text to be processed and second prosodic information corresponding to the target reference speech.
In some embodiments, the style vector unit 31 includes a style diffusion sampler and a feature extractor.
Fig. 10 is a schematic diagram of still another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. As shown in fig. 10, the above step 1510 includes steps 1511 to 1513 as follows.
Step 1511, generating a first prosodic vector corresponding to the text to be processed based on the target word-level encoding and the style diffusion sampler in the style vector unit.
Wherein the first prosody vector is used to characterize the first prosody information. That is, the style diffusion sampler in the style vector unit 31 may extract prosodic features (i.e., first prosodic features) corresponding to the text to be processed based on the target word-level coding.
It should be noted that the style diffusion sampler in the embodiment of the present application may be a pre-trained style diffusion sampler. The style diffusion sampler learns the relationship between sample data (such as target word-level codes) and noisy data by gradually converting the sample data into a Gaussian distribution, so that it can later recover sample data from the Gaussian distribution. The style diffusion sampler improves the denoising capability of the target speech generation model, making the target speech generation model more robust to noise.
In some examples, during training of the style diffusion sampler, first, a sample word-level code is obtained (wherein the sample word-level code may be obtained by the above-mentioned steps 110 to 140), and then, in an iterative manner, gaussian noise is gradually added to the sample word-level code until a sample vector corresponding to the sample word-level code becomes a pure gaussian noise vector.
For example, let the sample word-level code be y_0, the iteration number be t, and the noised sample vector be y_t; then y_t = k_t · y_{t-1}. That is, on the basis of each y_{t-1}, a small amount of Gaussian noise scaled by k_t is added until y_t becomes a pure Gaussian noise vector. Here k denotes the variance and k_t the variance value corresponding to each iteration number t; k_t can be computed in advance from a standard Gaussian distribution, i.e., for any t, k_t can be directly calculated and fixed before model training.
On this basis, t is selected at random, the corresponding k_t and y_t are determined through the above calculation, and these quantities serve as the training data of the model: y_0 and t are the training inputs and y_t is the training output. Training the style diffusion sampler (also referred to as a diffusion model) in this way allows it to learn the relationship between sample data and noisy data. Since the model loss varies strongly in the early training stage, selecting t at random effectively prevents the model from concentrating on one time range early in training while ignoring the others; accordingly, the random distribution of t should be as uniform as possible.
After training on sufficient sample data, the style diffusion sampler can run the trained parameters in reverse to recover data from the Gaussian distribution through a reverse process, thereby obtaining the prosodic vector (i.e., the first prosodic vector) corresponding to the text to be processed.
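As an illustrative, non-limiting sketch of the forward noising procedure described above (assuming PyTorch; the noise schedule, tensor shapes and function names are assumptions, and the common DDPM-style update sqrt(1 − k_t)·y + sqrt(k_t)·noise is used in place of the simplified y_t = k_t · y_{t-1} notation in the text):

```python
import torch

def forward_noising(y0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Gradually add Gaussian noise to the sample word-level code y0.

    y0:    sample word-level code, shape (batch, dim)
    t:     randomly chosen number of noising steps, 1 <= t <= len(betas)
    betas: precomputed variance schedule k_1 ... k_T (one value per step)
    Returns the noised vector y_t used as the training output.
    """
    y = y0
    for step in range(t):
        k = betas[step]                     # variance value for this step
        noise = torch.randn_like(y)         # standard Gaussian noise
        # keep most of the previous vector and add a small amount of scaled noise
        y = torch.sqrt(1.0 - k) * y + torch.sqrt(k) * noise
    return y

# one training example: pick t uniformly so that no time range dominates early training
betas = torch.linspace(1e-4, 0.02, steps=1000)
y0 = torch.randn(8, 256)                    # a batch of sample word-level codes
t = int(torch.randint(1, len(betas) + 1, (1,)))
y_t = forward_noising(y0, t, betas)         # (y0, t) -> y_t forms one training pair
```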
In the embodiment of the present application, the target word-level code output by the syntactic diagram construction sub-model 100 is input into the trained style diffusion sampler in the style vector unit 31, which outputs the prosodic vector (i.e., the first prosodic vector) corresponding to the text to be processed, thereby providing a better input for the subsequent generation of the target synthesized speech.
At step 1512, a second prosodic vector corresponding to the target reference speech is generated based on the target reference speech and the feature extractor in the style vector unit.
Wherein the second prosody vector is used to characterize the second prosody information. That is, the feature extractor in the style vector unit 31 may extract prosodic features (i.e., second prosodic features) corresponding to the target reference speech.
In some embodiments, the feature extractor may comprise an X-vector feature extractor. The X-vector feature extractor may include a plurality of delay network layers and a plurality of fully-connected network layers.
In some embodiments, the step 1512 includes generating a reference audio feature corresponding to the target reference voice based on the target reference voice, and sequentially passing the reference audio feature through a plurality of delay network layers and a plurality of fully-connected network layers to generate an X vector with a preset dimension.
Illustratively, the reference audio features may include mel-frequency cepstral coefficient (Mel Frequency Cepstral Coefficients, MFCCs) features, and the second prosodic vector includes an X-vector.
In some examples, the X-vector feature vector extractor employs a deep neural network (Deep Neural Network, DNN) to map the target reference speech to an embedding of a preset fixed dimension, referred to as an X-vector. The X-vector can well represent the prosodic information (i.e., the second prosodic information) corresponding to the target reference speech, and can also represent the timbre information, pitch information, energy information, and the like of the target reference speech.
Fig. 11 is a schematic diagram of an X-vector feature vector extractor according to an embodiment of the present application. As shown in fig. 11, the X-vector feature vector extractor includes a plurality of delay network layers and a plurality of fully connected network layers.
It should be noted that, in the embodiment of the present application, the number of the delay network layer and the fully-connected network layer in the X-vector feature vector extractor is not limited. The X-vector feature vector extractor in fig. 11 is illustrated as comprising 5 delay network layers (i.e., delay network layer 1 through delay network layer 5) and 3 fully connected network layers (i.e., fc 1 through fc 3).
As shown in fig. 11, the parameters of each delay network layer may be denoted as F(N, D, K), where each delay network layer is activated over a context of width K in the previous layer with an expansion (dilation) length of D, and N is the dimension of the output vector.
For example, when the feature of the target reference speech is a 24-dimensional mel-frequency cepstrum feature (hereinafter referred to as MFCC feature) with a frame length of 25 ms, the feature map output by delay network layer 1 results from a 1-dimensional convolution with input channels × convolution kernel size × output channels = 24 × 5 × 512; the feature maps output by delay network layers 2 and 3 result from 1-dimensional convolutions with input channels × convolution kernel size × output channels = 512 × 3 × 512; the feature map output by delay network layer 4 results from a 1-dimensional convolution with input channels × convolution kernel size × output channels = 512 × 1 × 512; and the feature map output by delay network layer 5 results from a 1-dimensional convolution with input channels × convolution kernel size × output channels = 512 × 1 × 1500. After delay network layer 5, the output feature vectors are statistically pooled and further processed through fc1, fc2 and fc3.
For example, the statistical pooling process may compute the average and the standard deviation over the sequence-length dimension T and then concatenate them. Thus, after statistical pooling, the sequence-length dimension T is lost; 1500 averages and 1500 standard deviations are obtained, which after concatenation form a vector of length 3000. Here fc1, fc2 and fc3 may be standard fully-connected network layers, and the output of fc3 may serve as the feature vector representation of the target reference speech, which contains the prosodic information, phoneme information and the like corresponding to the target reference speech.
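The delay network layers and the statistical pooling described above may be sketched as follows (assuming PyTorch; the dilation values, layer names and activation placement are illustrative assumptions rather than the exact configuration of this embodiment):

```python
import torch
import torch.nn as nn

class XVectorExtractor(nn.Module):
    def __init__(self, mfcc_dim: int = 24, x_dim: int = 512):
        super().__init__()
        # five delay (time-delay) network layers, modeled here as dilated 1-D convolutions
        self.tdnn = nn.Sequential(
            nn.Conv1d(mfcc_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # fully-connected layers after statistical pooling (3000 = 1500 means + 1500 stds)
        self.fc1 = nn.Linear(3000, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, x_dim)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_frames, 24) -> (batch, 24, n_frames) for Conv1d
        h = self.tdnn(mfcc.transpose(1, 2))          # (batch, 1500, T')
        mean = h.mean(dim=2)                         # statistics pooling over time
        std = h.std(dim=2)
        stats = torch.cat([mean, std], dim=1)        # (batch, 3000)
        return self.fc3(torch.relu(self.fc2(torch.relu(self.fc1(stats)))))

x_vector = XVectorExtractor()(torch.randn(2, 300, 24))   # -> (2, 512) second prosodic vector
```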
The embodiment of the present application provides a better input for subsequently generating the target synthesized speech by inputting the target reference speech to the feature extractor (e.g., the X-vector feature extractor) in the style vector unit 31 to extract the prosodic vector (i.e., the second prosodic vector) corresponding to the target reference speech.
Step 1513, generating a target style vector based on the first prosodic vector and the second prosodic vector.
After the first prosodic vector is obtained through step 1511 and the second prosodic vector is obtained through step 1512, a target style vector may be generated based on the first prosodic vector and the second prosodic vector.
In some embodiments, step 1513 includes summing the first prosodic vector and the second prosodic vector to obtain an initial style vector, and time-series averaging the initial style vector to determine the target style vector.
In some examples, the temporal averaging pooling process may determine an optimal one of the plurality of initial style vectors as the target style vector.
In some examples, time-series averaging pooling (Temporal Average Pooling) may operate on time-series data, primarily to reduce the complexity of the time dimension (e.g., T) while preserving important information on each characteristic channel (e.g., F). For example, the time-averaged pooling process may average all values in each time dimension (T), turning each characteristic channel into a single value.
For example, the time-series averaging pooling process may transform a feature map of dimension (batchsize, F, T) into a feature vector of dimension (batchsize, F). During time-series averaging pooling, the dimension of T disappears. Thus, time-series averaging pooling can be seen as extracting features that most represent feature map information from a series of frame features, and transforming the corresponding variable-length frame sequence of the most representative features into a fixed-length feature vector (i.e., such as a target style vector). For example, feature vectors for a plurality of frames may be averaged to determine a target style vector.
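A minimal sketch of the time-series average pooling step is shown below (assuming PyTorch; the shapes are illustrative):

```python
import torch

def temporal_average_pooling(style: torch.Tensor) -> torch.Tensor:
    """Collapse the time dimension of an initial style vector sequence.

    style: (batchsize, F, T) feature map, e.g. the summed prosodic vectors over T frames
    returns: (batchsize, F) fixed-length target style vector
    """
    return style.mean(dim=-1)   # average all values along the time dimension T

initial_style = torch.randn(4, 256, 120)                   # sum of first and second prosodic vectors
target_style = temporal_average_pooling(initial_style)     # -> (4, 256)
```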
Step 1520, generating a target synthesized speech based on the target style vector, the text to be processed, and a generation unit in the target speech generation network.
Illustratively, the target synthesized speech has prosodic features corresponding to the text to be processed (i.e., first prosodic features) and prosodic features corresponding to the reference speech (i.e., second prosodic features).
In some embodiments, the generation unit 32 includes a phoneme encoder, a variance adapter, a mel-spectrum encoder, and a vocoder.
Fig. 12 is a schematic diagram of still another text-to-speech generating method based on syntactic diagram construction according to an embodiment of the present application. The process of generating the target synthesized speech by the generating unit 32 is explained below with reference to fig. 12. As shown in fig. 12, the above step 1520 includes steps 1521 to 1524 as follows.
At step 1521, a phoneme state sequence corresponding to the text to be processed is generated based on the text to be processed, the target style vector, and the phoneme encoder.
In some embodiments, the phoneme encoder comprises at least one style-adaptive normalization layer.
Illustratively, the phoneme encoder further comprises a first convolutional layer, a first fully-connected layer, a plurality of first conversion modules, and a second fully-connected layer.
In some embodiments, step 1521 includes determining an initial phoneme vector corresponding to the text to be processed and a phoneme position encoding based on the text to be processed, generating a first intermediate sequence based on the initial phoneme vector and processing of a first convolutional layer and a first fully-connected layer in the phoneme encoder, and generating a sequence of phoneme states based on the first intermediate sequence, the phoneme position encoding, a target style vector, and a plurality of first conversion modules and second fully-connected layers in the phoneme encoder.
In some examples, the phoneme encoder may convert the input text to be processed into a sequence of phonemes and further convert the sequence of phonemes into a phoneme encoding result, which appears as a sequence of phoneme hidden states, and output it.
Illustratively, the phoneme position codes are used to characterize the position information of each phoneme corresponding to the text to be processed in the text to be processed. Wherein the phoneme position coding may be determined in different ways.
In some examples, the phoneme position encoding may be determined by absolute position encoding. For example, the position code of each phoneme may be determined by assigning an index (e.g., an integer index) to each phoneme in the text to be processed, such as an index of 1 for the first phoneme and an index of 2 for the second phoneme, and so on, by the index information of each phoneme.
In other examples, the phoneme position encoding may be determined by way of relative position encoding. For example, the encoding may be performed by the relative order between the phonemes, such as based on the distance between the phonemes: the first phoneme may be encoded as [1, 0, 0, ...], the second phoneme as [0, 1, 0, ...], and so on, so that the position code of each phoneme corresponding to the text to be processed can be determined. Alternatively, a functional approach, such as sin/cos position embedding, may be used to determine the position code of each phoneme.
In still other examples, the phoneme position coding may also be determined by a coding scheme where absolute position is combined with relative position. For example, in some cases, the absolute position of a phoneme and the position relative to other phonemes may be considered simultaneously, i.e., the position encoding of each phoneme may be determined by combining two encoding schemes.
It should be noted that, in the embodiment of the present application, the position codes of each phoneme in the text to be processed may be determined by any one of the above three manners, or the position codes of each phoneme may be determined by other manners.
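For the functional sin/cos variant mentioned above, a minimal sketch is given below (assuming PyTorch; the 256-dimensional size and the base 10000 are the conventional Transformer choices and are assumptions here, not values fixed by this embodiment):

```python
import math
import torch

def sinusoidal_position_encoding(n_phonemes: int, dim: int = 256) -> torch.Tensor:
    """Return a (n_phonemes, dim) matrix of sin/cos phoneme position codes."""
    position = torch.arange(n_phonemes, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(n_phonemes, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# added to the phoneme vectors before the first conversion modules
phoneme_vectors = torch.randn(32, 256)             # 32 phonemes, 256-dimensional each
phoneme_vectors = phoneme_vectors + sinusoidal_position_encoding(32, 256)
```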
The first conversion module may include, for example, a first attention mechanism layer, at least one style-adaptive normalization layer, and a second convolution layer. The number of the first conversion modules in the phoneme encoder may be multiple, for example, may be 4, or may be greater than 4, which is not limited by the embodiment of the present application.
Step 1522, adding preset variance information to the phoneme state sequence based on the phoneme state sequence and the variance adaptor, to obtain an adjusted phoneme state sequence.
Illustratively, the preset variance information includes at least one of duration information, pitch information, and energy information.
In some examples, once the phoneme state sequence (which may also be referred to as a sequence of phoneme hidden states) is obtained from the phoneme encoder, it is input into the variance adaptor. The variance adaptor is used for adding additional variance information to the phoneme state sequence so as to obtain the adjusted phoneme state sequence, where the variance adaptor can add information such as duration, pitch and energy to the phoneme state sequence.
For example, the variance adaptor may predict the duration of each phoneme by a duration predictor to adjust the length of the speech frames in the phoneme state sequence, and predict and adjust the corresponding pitch and energy values by a pitch predictor and an energy predictor, respectively.
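A compact sketch of such a variance adaptor is given below (assuming PyTorch; the two-convolution predictor structure, the length-regulation step and all hyper-parameters are assumptions in the style of a FastSpeech 2-like design, not the exact implementation of this embodiment):

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Small convolutional predictor reused for duration, pitch and energy."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.out = nn.Linear(dim, 1)

    def forward(self, h):                           # h: (batch, T, dim)
        x = self.net(h.transpose(1, 2)).transpose(1, 2)
        return self.out(x).squeeze(-1)              # (batch, T), one value per phoneme

class VarianceAdaptor(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.duration, self.pitch, self.energy = Predictor(dim), Predictor(dim), Predictor(dim)
        self.pitch_proj = nn.Linear(1, dim)
        self.energy_proj = nn.Linear(1, dim)

    def forward(self, h):                           # h: phoneme state sequence (batch, T, dim)
        dur = torch.clamp(torch.round(torch.exp(self.duration(h))), min=1).long()
        h = h + self.pitch_proj(self.pitch(h).unsqueeze(-1))    # add pitch information
        h = h + self.energy_proj(self.energy(h).unsqueeze(-1))  # add energy information
        # expand each phoneme state according to its predicted duration (length regulation)
        out = [hi.repeat_interleave(di, dim=0) for hi, di in zip(h, dur)]
        return nn.utils.rnn.pad_sequence(out, batch_first=True), dur

adjusted, durations = VarianceAdaptor()(torch.randn(2, 16, 256))
```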
At step 1523, the adjusted phoneme state sequence is converted into a mel-spectrum sequence based on the adjusted phoneme state sequence, the target style vector and the mel-spectrum encoder.
In some embodiments, the mel-spectrum encoder includes at least one style-adaptive normalization layer.
Illustratively, the mel-spectrum encoder further includes a third fully-connected layer, a plurality of second conversion modules, and a fourth fully-connected layer.
In some embodiments, step 1523 includes generating a second intermediate sequence based on the adjusted phoneme state sequence and a third fully-connected layer in the mel-spectrum encoder, and generating a mel-spectrum sequence based on the second intermediate sequence, the target style vector, and a plurality of second conversion modules and fourth fully-connected layers in the mel-spectrum encoder.
Illustratively, the second conversion module includes a second attention mechanism layer, at least one style-adaptive normalization layer, and a third convolution layer. The number of the second conversion modules in the mel-spectrum encoder may be multiple, for example, may be 4, or may be greater than 4, which is not limited in the embodiment of the present application.
It should be noted that the second conversion module in the mel-spectrum encoder and the first conversion module in the phoneme encoder may have the same structure, and the number of second conversion modules may also equal the number of first conversion modules.
Illustratively, at least one style-adaptive normalization layer is disposed in each of the mel-spectrum encoder and the phoneme encoder, and the style vector unit 31 may generate a target style vector and input it to each style-adaptive normalization layer for further processing.
The processing of the target style vector by a style-adaptive normalization layer is described below. It should be noted that the processing in each style-adaptive normalization layer of the mel-spectrum encoder and the phoneme encoder is similar, so the layer described below may be any style-adaptive normalization layer in the mel-spectrum encoder or any style-adaptive normalization layer in the phoneme encoder.
In some examples, a style-adaptive normalization layer may be used to inject the target style vector generated by the style vector unit 31 into the speech synthesis process. For example, after receiving the target style vector, the style-adaptive normalization layer may adjust the synthesized speech based on the target style vector. For example, let w denote the target style feature vector and h = (h_1, h_2, ..., h_H) an input feature vector of the normalization layer, where H is the dimension of the feature vector; the style-adaptive normalization layer outputs a normalized vector y.
In some examples, the normalized vector y may be determined by formulas (1) and (2) as follows:
y = (h − μ) / σ    formula (1);
μ = (1/H) · Σ_{i=1..H} h_i,  σ = sqrt( (1/H) · Σ_{i=1..H} (h_i − μ)² )    formula (2);
where μ is the mean and σ is the standard deviation of the feature vector h.
On this basis, the gain and bias derived from the target style vector can be further determined by the following formula (3):
SALN(h, w) = g(w) · y + b(w)    formula (3);
where g(w) and b(w) are the gain and bias parameters produced by the affine layer of the style-adaptive normalization (SALN) layer, and the affine layer is a single fully-connected layer.
In some examples, unlike the fixed gain and bias in a conventional normalization layer, g(w) and b(w) can adaptively scale and shift the features in the target speech generation network 30 based on the target style vector itself. In the embodiment of the present application, the original normalization layer is replaced by the style-adaptive normalization layer (SALN), and the affine layer that converts the target style vector into the corresponding bias and gain as described above is constructed.
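A minimal sketch of such a style-adaptive normalization layer is given below (assuming PyTorch; the single fully-connected affine layer producing g(w) and b(w) follows the description above, while the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Normalize a feature vector, then scale/shift it with a style-dependent gain and bias."""
    def __init__(self, feat_dim: int, style_dim: int):
        super().__init__()
        # single fully-connected affine layer: style vector w -> [gain g(w), bias b(w)]
        self.affine = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, feat_dim) hidden features, w: (batch, style_dim) target style vector
        mu = h.mean(dim=-1, keepdim=True)              # mean over the feature dimension H
        sigma = h.std(dim=-1, keepdim=True) + 1e-5     # standard deviation over H
        y = (h - mu) / sigma                           # normalized vector, formulas (1)-(2)
        gain, bias = self.affine(w).chunk(2, dim=-1)   # g(w) and b(w), formula (3)
        return gain.unsqueeze(1) * y + bias.unsqueeze(1)

saln = StyleAdaptiveLayerNorm(feat_dim=256, style_dim=256)
out = saln(torch.randn(2, 40, 256), torch.randn(2, 256))   # -> (2, 40, 256)
```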
At step 1524, the target synthesized speech is generated based on the mel-spectrum sequence and the vocoder.
For example, after the mel-spectrum sequence output by the mel-spectrum encoder is obtained, the mel-spectrum sequence may be input to the vocoder to obtain the finally generated target synthesized speech.
In the text-to-speech generation method based on syntactic diagram construction provided by the embodiment of the present application, during the generation of the target synthesized speech through the target speech generation network, on the one hand, the style vector unit extracts prosody from the target word-level code and the target reference speech and computes the target style vector; on the other hand, the target style vector is embedded into the generation unit in a style-adaptive manner (e.g., through the style-adaptive normalization layers), so that the generated target synthesized speech better fits both the target reference speech and the text to be processed.
Fig. 13 is a schematic diagram of a target speech generation network according to an embodiment of the present application. As shown in fig. 13, the target speech generation network 30 includes a style vector unit 31 (left side of fig. 13) and a generation unit 32 (right side of fig. 13). The style vector unit 31 includes a style diffusion sampler, a prosodic text encoder, and an X-vector feature vector extractor. The generation unit 32 includes a phoneme encoder, a variance adaptor, a mel-spectrum encoder, and a vocoder.
The process of generating the target synthesized speech by the target speech generating network 30 will be described below with reference to fig. 13.
After the text to be processed is obtained, the text to be processed is converted into a phoneme vector (e.g., 256-dimensional). The phoneme vector is input to the style vector unit 31. The prosodic text encoder and the style diffusion sampler in the style vector unit 31 extract, from the phoneme vector, the first prosodic vector corresponding to the text to be processed. The target reference speech is input into the style vector unit 31, the MFCC features of the target reference speech are obtained and input into the X-vector feature vector extractor in the style vector unit 31, and the second prosodic vector corresponding to the target reference speech is extracted. The first prosodic vector and the second prosodic vector are summed to obtain a style vector, which is then subjected to time-series average pooling to obtain the target style vector.
The phoneme vector is input to the phoneme encoder in the generation unit 32. Specifically, the phoneme vector may be input to 2 one-dimensional convolutional layers with residual connections and then to one fully-connected layer, where the number of filters of the one-dimensional convolutional layers in the phoneme encoder is 256 and the convolution kernel size is 3. Then, the phoneme vector processed by the fully-connected layer is added to the phoneme position code (which indicates the position of each phoneme in the sentence), the result (i.e., the first intermediate sequence) and the target style vector are input into 4 feed-forward Transformer (FFT) blocks for calculation to obtain the phoneme encoding result, and a fully-connected layer outputs the phoneme encoding result (i.e., the phoneme state sequence).
The phoneme encoding result output by the phoneme encoder is input to a variance adaptor, which extracts three kinds of information including a phoneme duration, a pitch (for expressing emotion and prosody information), and energy (for expressing volume information) in the phoneme encoding result as an output.
After the variance adaptor extracts the above information, the extracted information (i.e., the adjusted phoneme state sequence) is input to the mel-spectrum encoder. Specifically, the output of the variance adaptor may be input to 2 fully-connected layers (e.g., fully-connected layers of [128, 256]) for processing, the processed result (i.e., the second intermediate sequence) is added to the phoneme position code, the added result and the target style vector are input to 4 FFT blocks for calculation, and finally the mel spectrum (i.e., the mel-spectrum sequence) is output through one fully-connected layer that adapts the dimension of the mel spectrum.
As shown in fig. 13, in the above procedure, each FFT block is composed, in order, of one multi-head attention mechanism, one style-adaptive normalization, a one-dimensional convolution, and another style-adaptive normalization. The target style vector output by the style vector unit 31 is embedded into each style-adaptive normalization for processing. In each FFT block, the hidden size of the one-dimensional convolution may be 256, the number of attention heads may be 2, the convolution kernel size may be 9, and the number of filters may be 1024. All activation functions may be Mish activation functions, and the dropout rate of the random-inactivation (dropout) layer may be set to 0.1.
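One such FFT block may be sketched as follows under the hyper-parameters listed above (assuming PyTorch and reusing the StyleAdaptiveLayerNorm class sketched after formula (3); the residual connections and the placement of dropout are assumptions):

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Multi-head attention -> SALN -> 1-D convolutions -> SALN, as described above."""
    def __init__(self, dim=256, heads=2, conv_filters=1024, kernel=9, dropout=0.1, style_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # StyleAdaptiveLayerNorm is the class from the earlier sketch
        self.saln1 = StyleAdaptiveLayerNorm(dim, style_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, conv_filters, kernel, padding=kernel // 2), nn.Mish(),
            nn.Conv1d(conv_filters, dim, kernel, padding=kernel // 2),
        )
        self.saln2 = StyleAdaptiveLayerNorm(dim, style_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, style):
        a, _ = self.attn(x, x, x)
        x = self.saln1(x + self.drop(a), style)                 # first style-adaptive normalization
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.saln2(x + self.drop(c), style)              # second style-adaptive normalization

block = FFTBlock()
y = block(torch.randn(2, 40, 256), torch.randn(2, 256))        # -> (2, 40, 256)
```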
Finally, the mel spectrum output by the mel-spectrum encoder is converted by the vocoder into the final synthesized speech, i.e., the target synthesized speech.
According to the text-to-speech generation method based on syntactic diagram construction provided by the embodiment of the present application, the target synthesized speech corresponding to the text to be processed and the target reference speech is generated through the pre-trained target speech generation model. First, the target syntactic diagram corresponding to the text to be processed is generated based on the text information corresponding to the text to be processed and the target syntactic diagram construction network in the target speech generation model; then, the target word-level code corresponding to the text to be processed is generated based on the phoneme information and boundary information corresponding to the text to be processed, the target syntactic diagram, and the target coding network in the target speech generation model; finally, the target synthesized speech is generated based on the target word-level code, the target reference speech, and the target speech generation network in the target speech generation model. Since the target syntactic diagram generated by the target syntactic diagram construction network can represent the syntactic relationship between the characters, it provides syntactic information that better fits Chinese syntax and prosodic features that better fit the text information, so the target word-level code generated from the target syntactic diagram carries the prosodic features corresponding to the text to be processed. Therefore, when the target synthesized speech is generated based on the target word-level code and the target reference speech, the generated target synthesized speech also carries the prosodic features of the text to be processed. In this way, the method can extract the prosodic features of the text information during text-to-speech synthesis, thereby improving the realism and richness of the synthesized speech.
In some embodiments, the text-to-speech generation method based on syntactic diagram construction further comprises the following steps: obtaining sample text data and sample reference speech data; generating predicted synthesized speech based on the sample text data, the sample reference speech data, and a to-be-trained speech generation model, wherein the to-be-trained speech generation model comprises a to-be-trained syntactic diagram construction network and a to-be-trained coding network; obtaining sample synthesized speech data; and, taking the predicted synthesized speech as the initial training output information of the to-be-trained speech generation model and the sample synthesized speech data as the supervision information, iterating the to-be-trained speech generation model to obtain the target speech generation model, wherein the target speech generation model comprises the target syntactic diagram construction network and the target coding network.
Illustratively, the target speech generation model 1 referred to in the above embodiment is a pre-trained model. The syntactic diagram construction sub-model 100 and the generation sub-model 200 in the target speech generation model 1 are both trained models. Note that, the syntactic diagram construction sub-model 100 and the generation sub-model 200 may be trained as a whole or may be trained separately, which is not limited in the embodiment of the present application.
In some examples, the speech generation model to be trained may include a syntactic diagram construction network to be trained and a coding network to be trained, and training the speech generation model to be trained includes training the syntactic diagram construction network to be trained and the coding network to be trained, resulting in a target syntactic diagram construction network and a target coding network.
In some examples, the to-be-trained speech generation model further includes a to-be-trained speech generation network, and training the to-be-trained speech generation model includes training the to-be-trained speech generation network to obtain the target speech generation network.
For example, during model pre-training, the audio corresponding to at least one reference speaker may be used as the sample reference speech data for model training. For example, the number of reference speakers may be greater than a preset number, such as greater than 100; each reference speaker corresponds to at least 100 sample utterances, and each utterance has an audio length greater than a predetermined length, such as greater than 3 seconds.
In some examples, after the sample text data and the sample reference speech data are obtained, a predictive synthesized speech may be generated based on the sample text data, the sample reference speech data, and a speech generation model to be trained. And then, taking the predicted synthesized voice as initial training output information of the voice generation model to be trained, taking sample synthesized voice data as supervision information, and iterating the voice generation model to be trained to obtain a target voice generation model.
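A highly simplified training-loop sketch is given below (assuming PyTorch; the model interface, the L1 loss on mel spectra and the optimizer settings are illustrative assumptions rather than the actual training configuration of this embodiment):

```python
import torch

# model: the to-be-trained speech generation model (syntactic diagram construction
# network + coding network + speech generation network); the interface is assumed.
def train(model, loader, epochs: int = 10, lr: float = 1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for sample_text, sample_ref_speech, sample_mel in loader:
            pred_mel = model(sample_text, sample_ref_speech)            # predicted synthesized speech
            loss = torch.nn.functional.l1_loss(pred_mel, sample_mel)    # sample synthesized speech as supervision
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```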
It should be noted that the text-to-speech generation method based on syntactic diagram construction provided by the embodiment of the present application not only makes the prosodic features of the obtained target synthesized speech more ideal, but can also synthesize speech in different styles when given other reference speech.
Fig. 14 is a schematic diagram of a text-to-speech generating device constructed based on a syntax diagram according to an embodiment of the present application. As shown in fig. 14, the text-to-speech generating apparatus 1400 constructed based on the syntax diagram includes an acquisition module 1410, a determination module 1420, a first generation module 1430, a second generation module 1440, and a third generation module 1450. Wherein:
an acquisition module 1410 configured to acquire the text to be processed and the target reference speech.
The determining module 1420 is configured to determine text information and phoneme information corresponding to the text to be processed according to the text to be processed, wherein the text information includes text content and boundary information.
The first generation module 1430 is configured to construct a network based on the text information and the target syntax graph in the target speech generation model, and generate a target syntax graph corresponding to the text to be processed, wherein the text content includes a plurality of characters, and the target syntax graph is used for representing a syntax relationship between each character in the plurality of characters.
A second generation module 1440 configured to generate a target word-level code corresponding to the text to be processed based on the phoneme information, the boundary information, the target syntax diagram, and the target coding network in the target speech generation model.
A third generation module 1450 configured to generate target synthesized speech based on the target word level encoding, the target reference speech, and the target speech generation network in the target speech generation model, wherein the target synthesized speech includes prosodic features of the text to be processed.
In some embodiments, the first generation module 1430 is configured to generate an initial dependency graph corresponding to the text information based on the text content, word boundaries in the boundary information, and a dependency parser in the syntax graph construction network, wherein the initial dependency graph includes a plurality of word nodes and unidirectional dependency connection relationships between the word nodes, and to generate the target syntax graph corresponding to the initial dependency graph based on the initial dependency graph and the syntax graph construction network.
In some embodiments, the first generation module 1430 is configured to: divide each word node in the initial dependency graph into at least one character node based on the syntax graph constructor and character boundaries in the boundary information, and determine the first character node in each word node; establish a first dependency connection relationship among the plurality of first character nodes, a start node, and an end node; establish a second dependency connection relationship among the at least one character node in each word node according to the intra-word order of the at least one character node in each word node; and obtain the target syntax graph according to the first dependency connection relationship and the second dependency connection relationship, wherein the first dependency connection relationship and/or the second dependency connection relationship are used for characterizing a bidirectional connection relationship between the character nodes.
In some embodiments, the second generation module 1440 is configured to generate an initial word-level code corresponding to the text to be processed based on the phoneme information, character boundaries in the boundary information, and a word-level average pooling layer in the target coding network, and to generate the target word-level code based on the initial word-level code, the target syntax graph, and a plurality of gated graph convolution layers in the target coding network.
In some embodiments, the second generation module 1440 is configured to determine, based on the phoneme information and the character boundaries, at least one phoneme feature corresponding to each character node in the text to be processed, perform an average pooling operation on the at least one phoneme feature corresponding to each character node to determine an integrated phoneme feature corresponding to each character node, and determine the initial word-level code corresponding to the text to be processed according to the integrated phoneme feature corresponding to each character node.
In some embodiments, the second generation module 1440 is configured to generate a first word-level code based on the initial word-level code, the target syntax graph, and a first gated graph convolution layer of the plurality of gated graph convolution layers; generate a second word-level code based on the first word-level code, the target syntax graph, and a second gated graph convolution layer of the plurality of gated graph convolution layers; and determine the target word-level code based on the initial word-level code, the first word-level code, and the second word-level code.
In some embodiments, the third generation module 1450 is configured to generate a target style vector based on the target word level encoding, the target reference speech, and a style vector unit in the target speech generation network, wherein the target style vector is used to characterize prosodic information corresponding to the target reference speech and prosodic information corresponding to the text to be processed, and generate the target synthesized speech based on the target style vector, the text to be processed, and the generation unit in the target speech generation network.
In some embodiments, the text-to-speech generating device 1400 constructed based on the syntactic map further includes a training module. Wherein the acquisition module 1410 is configured to acquire sample text data and sample reference speech data and to acquire sample synthesized speech data. The training module is configured to generate predicted synthesized voice based on sample text data, sample reference voice data and a to-be-trained voice generation model, take the predicted synthesized voice as initial training output information of the to-be-trained voice generation model, take the sample synthesized voice data as supervision information, iterate the to-be-trained voice generation model to obtain a target voice generation model, wherein the to-be-trained voice generation model comprises a to-be-trained syntax graph construction network and a to-be-trained coding network, and the target voice generation model comprises a target syntax graph construction network and a target coding network.
Fig. 15 is a schematic diagram of an electronic device according to an embodiment of the present application. In some embodiments, the electronic device includes one or more processors and memory. The memory is configured to store one or more programs. Wherein the one or more processors implement the text-to-speech generation method of the above embodiments based on syntactic diagram construction, when the one or more programs are executed by the one or more processors.
As shown in fig. 15, the electronic device 1000 includes a processor 1001 and a memory 1002. The electronic device 1000 may also include a communication interface (Communications Interface) 1003 and a communication bus 1004, for example. The processor 1001, the memory 1002, and the communication interface 1003 perform communication with each other via the communication bus 1004. Communication interface 1003 is used to communicate with network elements of other devices such as clients or other servers.
In some embodiments, the processor 1001 is configured to execute the program 1005, and may specifically perform relevant steps in the text-to-speech generation method embodiment based on syntactic diagram construction described above. In particular, program 1005 may include program code comprising computer-executable instructions.
The processor 1001 may be, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device 1000 may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
In some embodiments, memory 1002 is used to store program 1005. The Memory 1002 may include a high-speed RAM Memory or may further include a Non-Volatile Memory (NVM), such as at least one magnetic disk Memory. The program 1005 may be specifically invoked by the processor 1001 to cause the electronic device 1000 to perform operations of a text-to-speech generation method constructed based on a syntactic diagram.
Embodiments of the present application provide a computer readable storage medium storing at least one executable instruction that, when executed on an electronic device 1000, cause the electronic device 1000 to perform the text-to-speech generating method based on syntactic diagram construction in the above embodiments.
The executable instructions may be particularly useful for causing the electronic device 1000 to perform operations of a text-to-speech generation method constructed based on syntactic diagrams.
For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In some embodiments, embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the text-to-speech generation method based on syntactic map construction described in any of the embodiments above.
In some embodiments, embodiments of the present application further provide a computer program, which when executed by a processor, can implement the text-to-speech generating method based on the syntactic map construction described in any of the above embodiments.
The text-to-speech generating device, the electronic device, the computer readable storage medium, the computer program product and the computer program according to the embodiments of the present application may refer to the beneficial effects of the text-to-speech generating method based on the syntax diagram provided above, and are not described herein.
It is noted that, in the present application, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM).
Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof.
In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of techniques known in the art, discrete logic circuits with logic gates for implementing logic functions on data signals, application specific integrated circuits with appropriate combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
The embodiments of the present application described above do not limit the scope of the present application.

Claims (9)

1. A text-to-speech generation method based on syntactic graph construction, characterized in that the method comprises:
acquiring a text to be processed and a target reference speech;
determining, according to the text to be processed, text information and phoneme information corresponding to the text to be processed; wherein the text information includes text content and boundary information;
generating a target syntactic graph corresponding to the text to be processed based on the text information and a target syntactic graph construction network in a target speech generation model; wherein the text content includes a plurality of characters, and the target syntactic graph is used for representing a syntactic relationship between each character of the plurality of characters;
determining, based on the phoneme information and character boundaries in the boundary information, at least one phoneme feature corresponding to each character node in the text to be processed, performing an average pooling operation on the at least one phoneme feature corresponding to each character node through a word-level average pooling layer in a target coding network to determine an integrated phoneme feature corresponding to each character node, and determining an initial word-level code corresponding to the text to be processed according to the integrated phoneme feature corresponding to each character node; wherein the target speech generation model includes the target coding network;
generating a target word-level code corresponding to the text to be processed based on the initial word-level code, the target syntactic graph, and a plurality of gated graph convolution layers in the target coding network;
generating a target synthesized speech based on the target word-level code, the target reference speech, and a target speech generation network in the target speech generation model; wherein the target synthesized speech includes prosodic features of the text to be processed.
2. The method according to claim 1, characterized in that the generating the target syntactic graph corresponding to the text to be processed based on the text information and the target syntactic graph construction network in the target speech generation model comprises:
generating an initial dependency graph corresponding to the text information based on the text content, word boundaries in the boundary information, and a dependency parser in the target syntactic graph construction network; wherein the initial dependency graph includes a plurality of word nodes and a unidirectional dependency connection relationship between the word nodes;
generating the target syntactic graph corresponding to the initial dependency graph based on the initial dependency graph and a syntactic graph builder in the target syntactic graph construction network.
3. The method according to claim 2, characterized in that the generating the target syntactic graph corresponding to the initial dependency graph based on the initial dependency graph and the syntactic graph builder in the target syntactic graph construction network comprises:
dividing each word node in the initial dependency graph into at least one character node based on the syntactic graph builder and character boundaries in the boundary information, and determining the first character node in each word node;
establishing a first dependency connection relationship among the plurality of first character nodes, a start node, and an end node;
establishing a second dependency connection relationship among the at least one character node in each word node according to the intra-word order of the at least one character node in each word node;
obtaining the target syntactic graph according to the first dependency connection relationship and the second dependency connection relationship; wherein the first dependency connection relationship and/or the second dependency connection relationship are used for characterizing a bidirectional connection relationship between the character nodes.
4. The method according to claim 1, characterized in that the generating the target word-level code corresponding to the text to be processed based on the initial word-level code, the target syntactic graph, and the plurality of gated graph convolution layers in the target coding network comprises:
generating a first word-level code based on the initial word-level code, the target syntactic graph, and a first gated graph convolution layer of the plurality of gated graph convolution layers;
generating a second word-level code based on the first word-level code, the target syntactic graph, and a second gated graph convolution layer of the plurality of gated graph convolution layers;
determining the target word-level code based on the initial word-level code, the first word-level code, and the second word-level code.
5. The method according to any one of claims 1-4, characterized in that the generating the target synthesized speech based on the target word-level code, the target reference speech, and the target speech generation network in the target speech generation model comprises:
generating a target style vector based on the target word-level code, the target reference speech, and a style vector unit in the target speech generation network; wherein the target style vector is used for characterizing prosodic information corresponding to the target reference speech and prosodic information corresponding to the text to be processed;
generating the target synthesized speech based on the target style vector, the text to be processed, and a generation unit in the target speech generation network.
6. The method according to any one of claims 1-4, characterized in that the method further comprises:
acquiring sample text data and sample reference speech data;
generating a predicted synthesized speech based on the sample text data, the sample reference speech data, and a to-be-trained speech generation model; wherein the to-be-trained speech generation model includes a to-be-trained syntactic graph construction network and a to-be-trained coding network;
acquiring sample synthesized speech data;
taking the predicted synthesized speech as initial training output information of the to-be-trained speech generation model and the sample synthesized speech data as supervision information, iterating the to-be-trained speech generation model to obtain the target speech generation model; wherein the target speech generation model includes the target syntactic graph construction network and a target coding network.
7. A text-to-speech generation device based on syntactic graph construction, characterized by comprising:
an acquisition module configured to acquire a text to be processed and a target reference speech;
a determination module configured to determine, according to the text to be processed, text information and phoneme information corresponding to the text to be processed; wherein the text information includes text content and boundary information;
a first generation module configured to generate a target syntactic graph corresponding to the text to be processed based on the text information and a target syntactic graph construction network in a target speech generation model; wherein the text content includes a plurality of characters, and the target syntactic graph is used for representing a syntactic relationship between each character of the plurality of characters;
a second generation module configured to: determine, based on the phoneme information and character boundaries in the boundary information, at least one phoneme feature corresponding to each character node in the text to be processed; perform an average pooling operation on the at least one phoneme feature corresponding to each character node through a word-level average pooling layer in a target coding network to determine an integrated phoneme feature corresponding to each character node; determine an initial word-level code corresponding to the text to be processed according to the integrated phoneme feature corresponding to each character node, wherein the target speech generation model includes the target coding network; and generate a target word-level code corresponding to the text to be processed based on the initial word-level code, the target syntactic graph, and a plurality of gated graph convolution layers in the target coding network;
a third generation module configured to generate a target synthesized speech based on the target word-level code, the target reference speech, and a target speech generation network in the target speech generation model; wherein the target synthesized speech includes prosodic features of the text to be processed.
8. An electronic device, characterized by comprising:
one or more processors; and
a memory configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the text-to-speech generation method based on syntactic graph construction according to any one of claims 1-6.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a processor, the text-to-speech generation method based on syntactic graph construction according to any one of claims 1-6 is implemented.
CN202411059713.5A 2024-08-05 2024-08-05 Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment Active CN118588056B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202411445853.6A CN119380690A (en) 2024-08-05 2024-08-05 Text processing method, device and electronic equipment
CN202411059713.5A CN118588056B (en) 2024-08-05 2024-08-05 Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411059713.5A CN118588056B (en) 2024-08-05 2024-08-05 Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202411445853.6A Division CN119380690A (en) 2024-08-05 2024-08-05 Text processing method, device and electronic equipment

Publications (2)

Publication Number Publication Date
CN118588056A CN118588056A (en) 2024-09-03
CN118588056B true CN118588056B (en) 2025-03-14

Family

ID=92524848

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202411059713.5A Active CN118588056B (en) 2024-08-05 2024-08-05 Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment
CN202411445853.6A Pending CN119380690A (en) 2024-08-05 2024-08-05 Text processing method, device and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202411445853.6A Pending CN119380690A (en) 2024-08-05 2024-08-05 Text processing method, device and electronic equipment

Country Status (1)

Country Link
CN (2) CN118588056B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101156196A (en) * 2005-03-28 2008-04-02 莱塞克技术公司 Hybrid speech synthesizer, method and use
CN102881282A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method and system for obtaining prosodic boundary information

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5893133A (en) * 1995-08-16 1999-04-06 International Business Machines Corporation Keyboard for a system and method for processing Chinese language text
CA2366952A1 (en) * 1999-03-15 2000-09-21 British Telecommunications Public Limited Company Speech synthesis
JP4636673B2 (en) * 2000-11-16 2011-02-23 パナソニック株式会社 Speech synthesis apparatus and speech synthesis method
CN100547670C (en) * 2004-03-17 2009-10-07 Lg电子株式会社 Be used to reproduce recording medium, the method and apparatus of text subtitle stream
TWI258731B (en) * 2004-11-04 2006-07-21 Univ Nat Cheng Kung Chinese speech synthesis unit selection module and method
CN113807094B (en) * 2020-06-11 2024-03-19 株式会社理光 Entity recognition method, entity recognition device and computer readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101156196A (en) * 2005-03-28 2008-04-02 莱塞克技术公司 Hybrid speech synthesizer, method and use
CN102881282A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method and system for obtaining prosodic boundary information

Also Published As

Publication number Publication date
CN119380690A (en) 2025-01-28
CN118588056A (en) 2024-09-03

Similar Documents

Publication Publication Date Title
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN112435654B (en) Data enhancement of speech data by frame insertion
CN114203147A (en) Cross-speaker style transfer for text-to-speech and systems and methods for training data generation
CN113948062B (en) Data conversion method and computer storage medium
JP2001215993A (en) Device and method for interactive processing and recording medium
CN113345454B (en) Training and application methods, devices, equipment and storage medium of voice conversion model
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
KR20220072807A (en) Method and tts system for determining the unvoice section of the mel-spectrogram
JP2025504741A (en) Audio Generation Using Autoregressive Generative Neural Networks
JP7336135B2 (en) speech synthesizer
KR102626618B1 (en) Method and system for synthesizing emotional speech based on emotion prediction
KR20240122776A (en) Adaptation and Learning in Neural Speech Synthesis
CN118588085B (en) Voice interaction method, voice interaction system and storage medium
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN118116364B (en) Speech synthesis model training method, speech synthesis method, electronic device, and storage medium
CN118298838A (en) Singing emotion conversion method, device, equipment and medium
CN118588056B (en) Text-to-speech generation method and device based on syntactic diagram construction and electronic equipment
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
KR102568145B1 (en) Method and tts system for generating speech data using unvoice mel-spectrogram
KR102463570B1 (en) Method and tts system for configuring mel-spectrogram batch using unvoice section
CN115798455A (en) Speech synthesis method, system, electronic device and storage medium
CN114333903A (en) Voice conversion method and device, electronic equipment and storage medium
CN118629395A (en) Text-to-speech generation method, device and electronic device
CN119028323B (en) Audio repair method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant