
CN111968619A - Method and device for controlling voice synthesis pronunciation - Google Patents


Info

Publication number
CN111968619A
CN111968619A
Authority
CN
China
Prior art keywords
pronunciation
pronunciation dictionary
text
key value
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010873463.4A
Other languages
Chinese (zh)
Inventor
王昆
朱海
周琳珉
展华益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202010873463.4A priority Critical patent/CN111968619A/en
Publication of CN111968619A publication Critical patent/CN111968619A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047: Architecture of speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for controlling speech synthesis pronunciation, which comprises the following steps: creating a pronunciation dictionary; regularizing the text to be synthesized, performing prosody analysis, and converting the text into pinyin labels; reading the pronunciation dictionary, replacing the pinyin labels accordingly, and converting them into phonemes; converting the phonemes into acoustic features using a speech synthesis model; and converting the acoustic features into audio using a vocoder. The invention solves the problems of polyphone mispronunciation in a speech synthesis system and of adapting to users' accents.

Description

Method and device for controlling voice synthesis pronunciation
Technical Field
The invention relates to the technical field of speech processing, and in particular to a method and a device for controlling speech synthesis pronunciation.
Background
Speech synthesis is a technique for converting text information into speech information, i.e. turning arbitrary text into audible speech. It draws on several disciplines, including acoustics, linguistics and computer science; end-to-end modeling, chiefly represented by Tacotron, is currently the mainstream approach.
When using end-to-end speech synthesis, front-end processing is crucial: the linguistic information in the text must be fully exploited to obtain high-quality synthesis results. However, mispronunciation of polyphonic characters has long been a problem in end-to-end speech synthesis, and a synthesis system should also let users control pronunciation according to their own habits. For a Chinese speech synthesis system, the invention therefore performs text replacement through a user-built pronunciation dictionary, so that the user can correct polyphone errors and adapt the system to his or her pronunciation habits by entering pinyin directly.
Disclosure of Invention
The invention aims to solve the above problems in the prior art by providing a method and a device for controlling the pronunciation of a speech synthesis system.
To that end, the invention adopts the following technical solution: a method of controlling speech synthesis pronunciation, comprising the steps of:
S1, creating a pronunciation dictionary;
S2, regularizing the text to be synthesized, performing prosody analysis, and converting it into pinyin labels;
S3, reading the pronunciation dictionary, replacing the pinyin labels accordingly, and converting them into phonemes;
S4, converting the phonemes into acoustic features using a speech synthesis model;
S5, converting the acoustic features into audio using a vocoder.
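The five steps above can be sketched end to end. The sketch below is illustrative only: the toy character-level lexicon, `text_to_pinyin`, `apply_user_dict` and the `synthesize` stub are assumptions standing in for the real word segmentation, prosody prediction, acoustic model and vocoder, none of which the method specifies at the code level.

```python
# Illustrative sketch of steps S1-S5 with a toy per-character lexicon.

def create_pronunciation_dict():
    # S1: the user dictionary starts out empty
    return {}

def text_to_pinyin(text, lexicon):
    # S2 (simplified): default per-character readings from a lexicon
    return [lexicon[ch] for ch in text if ch in lexicon]

def apply_user_dict(text, pinyin, lexicon, user_dict):
    # S3 (simplified): overwrite default readings for words the user defined
    for word, reading in user_dict.items():
        if word in text:
            old = [lexicon[ch] for ch in word]
            for i in range(len(pinyin) - len(old) + 1):
                if pinyin[i:i + len(old)] == old:
                    pinyin[i:i + len(old)] = reading.split()
                    break
    return pinyin

def synthesize(pinyin):
    # S4 + S5 placeholder: pretend the joined labels are the rendered audio
    return "audio(" + " ".join(pinyin) + ")"

lexicon = {"阿": "a1", "胶": "jiao1"}          # default (wrong) reading of 阿胶
user_dict = create_pronunciation_dict()
user_dict["阿胶"] = "e1 jiao1"                 # the user's correction
pinyin = text_to_pinyin("阿胶", lexicon)       # ['a1', 'jiao1']
pinyin = apply_user_dict("阿胶", pinyin, lexicon, user_dict)
print(synthesize(pinyin))                      # audio(e1 jiao1)
```

The dictionary lookup happens on the pinyin labels before grapheme-to-phoneme conversion, which is why the user's correction survives into the phoneme input of the acoustic model.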
As a preferred embodiment, the step S1 is specifically as follows:
the keys of the pronunciation dictionary are Chinese words and the values are their pinyin; the dictionary is initialized empty before the user enters anything. When the user enters a key and a value, both are checked to ensure that they are legal. If the key does not exist in the pronunciation dictionary, the user's key and value are added to it; if the key already exists, the value corresponding to it is updated. The pronunciation dictionary supports viewing, modification and deletion by the user.
As another preferred embodiment, the step S2 is specifically as follows:
the text to be synthesized is regularized and illegal characters are filtered out; the legal input is segmented into words and tagged with parts of speech, the extracted linguistic features are fed into a prosody prediction model to obtain pause-level labels, and the Chinese characters are converted into pinyin labels.
As another preferred embodiment, the step S3 is specifically as follows:
when the pronunciation dictionary is read, nothing is done if it is empty. If it is not empty, the text to be synthesized is scanned for the dictionary's keys; whenever the text contains a key, the pinyin labels produced for that word in step S2 are replaced by the value stored under the key, and the rest remain unchanged.
In another preferred embodiment, in step S4, the speech synthesis model is Tacotron, Tacotron2 or Transformer TTS.
In another preferred embodiment, in step S5, the vocoder model adopts a network structure such as WaveNet, WaveRNN or MelGAN.
As another preferred embodiment, in steps S4 and S5, the acoustic features are mel-spectrogram features, linear spectrogram features, or other acoustic features related to the spectral envelope.
To solve the problems of polyphone mispronunciation and user-accent adaptation in a speech synthesis system, the invention also provides a device for controlling speech synthesis pronunciation, which comprises:
a pronunciation dictionary construction module, used for storing and reading the Chinese words and pronunciations entered by the user;
a text processing module, used for regularizing the text to be synthesized, performing prosody analysis, and converting the text into pinyin labels;
a replacement processing module, used for reading the pronunciation dictionary, replacing the pinyin labels and converting them into phonemes;
a synthesis module, used for converting the processed text to be synthesized into acoustic features;
a vocoder module, used for converting the input acoustic features into audio.
The invention has the following beneficial effects:
during use, the pinyin of Chinese words is replaced in the speech synthesis process through a pronunciation dictionary defined by the user, so that the synthesized speech matches the user's pronunciation habits and polyphone errors can be corrected.
Drawings
FIG. 1 is a block flow diagram of an embodiment of the present invention;
FIG. 2 is a diagram illustrating a pronunciation dictionary creation method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an alternative pronunciation processing method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
As shown in fig. 1, a method for controlling speech synthesis pronunciation includes the following steps:
s1, creating a pronunciation dictionary:
the pronunciation dictionary is created in a way shown in fig. 2, the key value of the pronunciation dictionary is a Chinese word, the value is pinyin, and the pronunciation dictionary is initialized to be empty before the user inputs the pinyin; when the user inputs the input key value and the value, checking the input key value and the input value to ensure that the input key value and the input value are legal; if the key value in the pronunciation dictionary does not exist, adding the key value and the value input by the user into the pronunciation dictionary; if the key value exists in the pronunciation dictionary, updating the value corresponding to the key value; the pronunciation dictionary supports user viewing and modification deletion.
For example, the pronunciation dictionary is first initialized to { }. Suppose the user's input text to be synthesized is "Jiuzhitang Ejiao blood-enriching granules" and the speech synthesis system mispronounces "Ejiao" (donkey-hide gelatin) as "a1 jiao1" (the trailing digit denotes the pinyin tone), whereas the correct pronunciation is "e1 jiao1". The user then enters "Ejiao | e1 jiao1" (this example uses the "|" symbol as a separator). If the key "Ejiao" does not exist in the pronunciation dictionary, the entry { "Ejiao": "e1 jiao1" } is added; if the key already exists, its value is replaced with "e1 jiao1".
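The add-or-update rule in this example can be sketched as follows. The tone-number pinyin check (lowercase syllables each ending in a tone digit 1-5) is an assumption — the patent only requires that inputs be validated — and the "|" separator follows the example above.

```python
import re

# One pinyin syllable is letters plus a tone digit; a reading is one or more
# such syllables separated by spaces. This validity rule is an assumption.
TONE_PINYIN = re.compile(r"^([a-zü]+[1-5])( [a-zü]+[1-5])*$")

def update_pronunciation_dict(pron_dict, entry, sep="|"):
    """Parse 'word | pinyin' and add or update the dictionary entry."""
    word, _, pinyin = (part.strip() for part in entry.partition(sep))
    if not word or not TONE_PINYIN.match(pinyin):
        raise ValueError("illegal word or pinyin: " + entry)
    pron_dict[word] = pinyin          # add if absent, overwrite if present
    return pron_dict

d = {}
update_pronunciation_dict(d, "阿胶 | e1 jiao1")
print(d)   # {'阿胶': 'e1 jiao1'}
```

Re-entering the same word simply overwrites its reading, which matches the update behavior described for existing keys.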
S2, carrying out regularization processing and prosody analysis on the text to be synthesized and converting it into pinyin labels:
the text to be synthesized is regularized and illegal characters are filtered out; the legal input is segmented into words and tagged with parts of speech, the extracted linguistic features are fed into a prosody prediction model to obtain pause-level labels, and the Chinese characters are converted into pinyin labels.
For example, take the text "Jiuzhitang Ejiao blood-enriching granules are sold at 180 yuan a box." The illegal characters are first filtered out and the Arabic numeral "180" is converted into the Chinese characters of its reading. The legal text is then fed into the prosody prediction model to obtain pause-level labels, and the Chinese characters are converted into pinyin labels: ['jiu3', 'zhi1', 'tang2', '#2', 'e1', 'jiao1', '#1', 'bu3', 'xue4', '#2', 'ke1', 'li4', '#1', 'shou4', 'jia4', '#2', 'yi1', 'bai3', 'ba1', 'shi2', 'yuan2', '#1', 'yi1', 'he2', '#4'], where '#' followed by a digit marks a prosodic pause and its level.
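The numeral rewriting mentioned above ("180" into the characters of its Chinese reading, yi1 bai3 ba1 shi2) can be sketched for integers below 10,000; a real text-normalization front end covers many more patterns (dates, phone numbers, units), so this is a minimal illustration, not the patent's implementation.

```python
# Convert a small integer to its Chinese-character reading, e.g. 180 -> 一百八十.
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]

def number_to_hanzi(n):
    if n == 0:
        return "零"
    out, pending_zero = [], False
    s = str(n)
    for i, ch in enumerate(s):
        d = int(ch)
        unit = UNITS[len(s) - 1 - i]
        if d == 0:
            pending_zero = True        # interior zeros collapse to one 零
        else:
            if pending_zero and out:
                out.append("零")
            pending_zero = False
            out.append(DIGITS[d] + unit)
    return "".join(out)

print(number_to_hanzi(180))  # 一百八十
```

Trailing zeros are dropped (1000 reads 一千) while interior zeros keep a single 零 (105 reads 一百零五), matching conventional Chinese number reading.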
S3, reading the pronunciation dictionary, performing replacement on the pinyin labels, and converting them into phonemes:
the replacement processing is shown in fig. 3. If the pronunciation dictionary is empty, no replacement is performed and processing continues with the subsequent steps. If it is not empty, the text to be synthesized is scanned for the dictionary's keys; whenever the text contains a key, the pinyin labels produced for that word in step S2 are replaced by the value stored under the key, and the rest remain unchanged.
Specifically, suppose the user's input text to be synthesized is "Jiuzhitang Ejiao blood-enriching granules are sold at 180 yuan a box." If the user has never entered any words and pinyin, the pronunciation dictionary is empty and the subsequent processing and synthesis steps run directly to produce audio. If, following his own habit, the user wants the reading of "blood-enriching" to be "bu3 xie3" rather than "bu3 xue4", he enters "blood-enriching | bu3 xie3", making the pronunciation dictionary { "blood-enriching": "bu3 xie3" }. The text to be synthesized is then scanned with the dictionary's keys, the key "blood-enriching" is found, and "bu3 xue4" is replaced by the stored value "bu3 xie3", giving ['jiu3', 'zhi1', 'tang2', '#2', 'e1', 'jiao1', '#1', 'bu3', 'xie3', '#2', 'ke1', 'li4', '#1', 'shou4', 'jia4', '#2', 'yi1', 'bai3', 'ba1', 'shi2', 'yuan2', '#1', 'yi1', 'he2', '#4']. The pinyin labels are then converted into phoneme labels, yielding "j iou3 zh iii1 t ang2 #2 e1 j iao1 #1 b u3 x ie3 #2 k e1 l i4 #1 sh ou4 j ia4 #2 i1 b ai3 b a1 sh iii2 van2 #1 i1 h e2 #4" as the front-end input to the speech synthesis model.
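The replacement on the label sequence in this example can be sketched as follows. The one-label-per-character alignment with interleaved '#' pause marks follows the example above; a real implementation would instead track character offsets from word segmentation, so this is an assumption for illustration.

```python
def replace_reading(labels, old_reading, new_reading):
    """Replace one run of pinyin labels (skipping '#' pause marks) in place."""
    # indices of real pinyin labels, with pause marks filtered out
    idx = [i for i, lab in enumerate(labels) if not lab.startswith('#')]
    plain = [labels[i] for i in idx]
    for j in range(len(plain) - len(old_reading) + 1):
        if plain[j:j + len(old_reading)] == old_reading:
            for k, new in enumerate(new_reading):
                labels[idx[j + k]] = new   # write back at the original positions
            break
    return labels

labels = ['jiu3', 'zhi1', 'tang2', '#2', 'e1', 'jiao1', '#1', 'bu3', 'xue4', '#2']
replace_reading(labels, ['bu3', 'xue4'], ['bu3', 'xie3'])
print(labels)  # ['jiu3', 'zhi1', 'tang2', '#2', 'e1', 'jiao1', '#1', 'bu3', 'xie3', '#2']
```

Filtering out the pause marks before matching keeps the prosody labels untouched, so only the reading of the matched word changes.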
S4, converting the processing result into acoustic features using a speech synthesis model:
the speech synthesis models used to produce acoustic features include, but are not limited to, Tacotron, Tacotron2 and Transformer TTS. Optionally, the acoustic features include, but are not limited to, mel-spectrogram features, linear spectrogram features, or other acoustic features related to the spectral envelope.
S5, converting the acoustic features into audio using a vocoder:
the vocoder models used to convert acoustic features into audio adopt network structures including, but not limited to, WaveNet, WaveRNN and MelGAN. Optionally, the acoustic features include, but are not limited to, mel-spectrogram features, linear spectrogram features, or other acoustic features related to the spectral envelope.
With the above method for controlling synthesized pronunciation, the user-defined pronunciation dictionary is used to replace readings in the text to be synthesized, so that the synthesized speech matches the user's pronunciation habits while polyphone errors are corrected.
The embodiment further provides an apparatus for controlling pronunciation of speech synthesis, including:
the pronunciation dictionary construction module is used for storing and reading Chinese words and pronunciations input by a user;
the keys of the pronunciation dictionary are Chinese words and the values are their pinyin; before the user enters anything, the dictionary is initialized empty. When the user enters a key and a value, both are checked to ensure that they are legal. If the key does not exist in the pronunciation dictionary, the user's key and value are added to it; if the key already exists, the value corresponding to it is updated. The pronunciation dictionary supports viewing, modification and deletion by the user.
The text processing module is used for regularizing the text to be synthesized, performing prosody analysis, and converting the text into pinyin labels:
the text to be synthesized is regularized and illegal characters are filtered out; the legal input is segmented into words and tagged with parts of speech, the extracted linguistic features are fed into a prosody prediction model to obtain pause-level labels, and the Chinese characters are converted into pinyin labels.
The replacement processing module is used for reading the pronunciation dictionary, replacing the pinyin labels and converting them into phonemes:
the keys of the pronunciation dictionary are Chinese words and the values are the corresponding pinyin; before any user input, the dictionary is empty. During use, the words and pinyin entered by the user are first checked to ensure that the input is legal. If the key does not exist, the newly entered Chinese word is stored as a key with the entered pinyin as its value; if the key already exists, its value is updated with the newly entered pinyin. The pronunciation dictionary allows the user to view, modify and delete the existing keys and values.
The synthesis module is used for converting the processed text to be synthesized into acoustic features:
the speech synthesis models used to produce acoustic features include, but are not limited to, Tacotron, Tacotron2 and Transformer TTS. Optionally, the acoustic features include, but are not limited to, mel-spectrogram features, linear spectrogram features, or other acoustic features related to the spectral envelope.
A vocoder module for converting the input acoustic features into audio.
The vocoder models used to convert acoustic features into audio adopt network structures including, but not limited to, WaveNet, WaveRNN and MelGAN.
With the above device for controlling synthesized pronunciation, the user-defined pronunciation dictionary is used to replace readings in the text to be synthesized, so that the synthesized speech matches the user's pronunciation habits while polyphone errors are corrected.
The above embodiments merely express specific implementations of the present invention, and while their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.

Claims (8)

1. A method of controlling speech synthesis pronunciation, comprising the steps of:
s1, creating a pronunciation dictionary;
S2, regularizing the text to be synthesized, performing prosody analysis, and converting it into pinyin labels;
S3, reading the pronunciation dictionary, replacing the pinyin labels accordingly, and converting them into phonemes;
s4, converting the phoneme into acoustic features by using a speech synthesis model;
and S5, converting the acoustic features into audio by using a vocoder.
2. The method for controlling speech synthesis pronunciation according to claim 1, wherein the step S1 is as follows:
the keys of the pronunciation dictionary are Chinese words and the values are their pinyin; the dictionary is initialized empty before the user enters anything; when the user enters a key and a value, both are checked to ensure that they are legal; if the key does not exist in the pronunciation dictionary, the user's key and value are added to it; if the key already exists, the value corresponding to it is updated; and the pronunciation dictionary supports viewing, modification and deletion by the user.
3. The method for controlling speech synthesis pronunciation according to claim 2, wherein the step S2 is as follows:
the text to be synthesized is regularized and illegal characters are filtered out; the legal input is segmented into words and tagged with parts of speech; the extracted linguistic features are fed into a prosody prediction model to obtain pause-level labels; and the Chinese characters are converted into pinyin labels.
4. The method for controlling speech synthesis pronunciation according to claim 3, wherein the step S3 is as follows:
when the pronunciation dictionary is read, no processing is carried out if it is empty; if it is not empty, the text to be synthesized is scanned for the dictionary's keys, and whenever the text contains a key, the pinyin labels produced for that word in step S2 are replaced by the value stored under the key, the rest remaining unchanged.
5. The method for controlling speech synthesis pronunciation according to any one of claims 1-4, wherein the speech synthesis model in step S4 is Tacotron, Tacotron2 or Transformer TTS.
6. The method for controlling speech synthesis pronunciation according to claim 5, wherein the network structure adopted by the vocoder model in step S5 is WaveNet, WaveRNN or MelGAN.
7. The method for controlling speech synthesis pronunciation according to claim 1 or 6, wherein in steps S4 and S5 the acoustic features are mel-spectrogram features, linear spectrogram features, or other acoustic features related to the spectral envelope.
8. An apparatus for controlling speech synthesis pronunciation, comprising:
the pronunciation dictionary construction module is used for storing and reading Chinese words and pronunciations input by a user;
the text processing module is used for carrying out regularization processing and prosody analysis on the text to be synthesized and converting the text into pinyin labels;
the replacement processing module is used for reading the pronunciation dictionary, replacing the pinyin labels and converting them into phonemes;
the synthesis module is used for converting the input processed text to be synthesized into acoustic features;
a vocoder module for converting the input acoustic features into audio.
CN202010873463.4A 2020-08-26 2020-08-26 Method and device for controlling voice synthesis pronunciation Pending CN111968619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010873463.4A CN111968619A (en) 2020-08-26 2020-08-26 Method and device for controlling voice synthesis pronunciation

Publications (1)

Publication Number Publication Date
CN111968619A true CN111968619A (en) 2020-11-20

Family

ID=73390608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010873463.4A Pending CN111968619A (en) 2020-08-26 2020-08-26 Method and device for controlling voice synthesis pronunciation

Country Status (1)

Country Link
CN (1) CN111968619A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139712B1 (en) * 1998-03-09 2006-11-21 Canon Kabushiki Kaisha Speech synthesis apparatus, control method therefor and computer-readable memory
US7292980B1 (en) * 1999-04-30 2007-11-06 Lucent Technologies Inc. Graphical user interface and method for modifying pronunciations in text-to-speech and speech recognition systems
CN101324884A (en) * 2008-07-29 2008-12-17 无敌科技(西安)有限公司 Method of polyphone pronunciation
CN106205600A (en) * 2016-07-26 2016-12-07 浪潮电子信息产业股份有限公司 Interactive Chinese text voice synthesis system and method
US20180190269A1 (en) * 2016-12-29 2018-07-05 Soundhound, Inc. Pronunciation guided by automatic speech recognition
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694627A (en) * 2020-12-14 2022-07-01 马上消费金融股份有限公司 Speech synthesis related method, training method of speech flow-to-speech model and related device
CN114999438A (en) * 2021-05-08 2022-09-02 中移互联网有限公司 Audio playback method and device
CN114999438B (en) * 2021-05-08 2023-08-15 中移互联网有限公司 Audio playing method and device

Similar Documents

Publication Publication Date Title
CN112420016B (en) Method and device for aligning synthesized voice and text and computer storage medium
US9424833B2 (en) Method and apparatus for providing speech output for speech-enabled applications
US8015011B2 (en) Generating objectively evaluated sufficiently natural synthetic speech from text by using selective paraphrases
US8825486B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
Olaszy et al. Profivox—a Hungarian text-to-speech system for telecommunications applications
US8914291B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
EP1668628A1 (en) Method for synthesizing speech
JP2019109278A (en) Speech synthesis system, statistic model generation device, speech synthesis device, and speech synthesis method
CN111968619A (en) Method and device for controlling voice synthesis pronunciation
CN114708848A (en) Method and device for acquiring size of audio and video file
Akinwonmi Development of a prosodic read speech syllabic corpus of the Yoruba language
CN116229947A (en) Voice recognition method and voice recognition device
CN114822489A (en) Text transcription method and text transcription device
JP6197523B2 (en) Speech synthesizer, language dictionary correction method, and language dictionary correction computer program
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Khamdamov et al. Syllable-Based Reading Model for Uzbek Language Speech Synthesizers
JP3029403B2 (en) Sentence data speech conversion system
JP2580568B2 (en) Pronunciation dictionary update device
Quazza et al. The use of lexica in text-to-speech systems
Kato et al. Multilingualization of speech processing
Kaur et al. Building a Text-to-Speech System for Punjabi Language
CN119763547A (en) Speech synthesis method, speech synthesis model training method, electronic device and computer program product
CN118135995A (en) Speech synthesis method, device, equipment and storage medium
Toma et al. Automatic rule-based syllabication for Romanian
CN117711373A (en) Text phoneme label information generation method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201120)