CN114283782A - Speech synthesis method and apparatus, electronic device, and storage medium

Info

Publication number: CN114283782A (granted as CN114283782B)
Application number: CN202111665515.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘利娟, 胡亚军, 江源, 潘嘉, 刘庆峰
Current assignee: iFlytek Co Ltd; University of Science and Technology of China (USTC)
Original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Legal status: Granted, Active

Abstract

The present application discloses a speech synthesis method and apparatus, an electronic device, and a storage medium. The speech synthesis method includes: extracting pronunciation attribute features of a text to be synthesized; obtaining target attribute features of several speech attributes based on the target categories of the text to be synthesized on those speech attributes; and synthesizing speech for the text to be synthesized based on the pronunciation attribute features and the target attribute features. This scheme improves the degree of freedom of speech synthesis while reducing its cost.

Description

Speech Synthesis Method and Apparatus, Electronic Device, and Storage Medium

Technical Field

The present application relates to the technical field of speech synthesis, and in particular to a speech synthesis method and apparatus, an electronic device, and a storage medium.

Background

Speech synthesis is one of the core technologies for realizing human-computer interaction. With its continuous development and refinement, speech synthesis is now widely applied across social life, including public services (information broadcasting, intelligent customer service, etc.), smart hardware (smart speakers, intelligent robots, etc.), intelligent transportation (voice navigation, in-vehicle devices, etc.), education (smart classrooms, foreign-language learning, etc.), and entertainment (audiobooks, film and television dubbing, virtual avatars, etc.).

At present, speech synthesis can approach the quality of a real human voice, and improving the degree of freedom of synthesized speech while reducing the cost of building a synthesis system has become a research hotspot. Having a single speaker record speech data in different emotions and styles for modeling can address this, but it places high demands on the speaker's ability, and both finding suitable speakers and recording the audio are difficult. In view of this, how to reduce the cost of speech synthesis while improving its degree of freedom has become a problem that urgently needs to be solved.

Summary of the Invention

The main technical problem addressed by the present application is to provide a speech synthesis method and apparatus, an electronic device, and a storage medium that can reduce the cost of speech synthesis while improving its degree of freedom.

To solve the above technical problem, a first aspect of the present application provides a speech synthesis method, including: extracting pronunciation attribute features of a text to be synthesized; obtaining target attribute features of several speech attributes based on the target categories of the text to be synthesized on those speech attributes; and synthesizing speech for the text to be synthesized based on the pronunciation attribute features and each target attribute feature.

To solve the above technical problem, a second aspect of the present application provides a speech synthesis method, including: extracting pronunciation attribute features of a speech to be processed; obtaining target attribute features of several speech attributes based on the target categories of the speech to be processed on those speech attributes; and synthesizing speech for the speech to be processed based on the pronunciation attribute features and each target attribute feature.

To solve the above technical problem, a third aspect of the present application provides a speech synthesis apparatus, including an extraction module, an acquisition module, and a synthesis module. The extraction module is configured to extract pronunciation attribute features of a text to be synthesized; the acquisition module is configured to obtain target attribute features of several speech attributes based on the target categories of the text to be synthesized on those speech attributes; and the synthesis module is configured to synthesize speech for the text to be synthesized based on the pronunciation attribute features and each target attribute feature.

To solve the above technical problem, a fourth aspect of the present application provides a speech synthesis apparatus, including an extraction module, an acquisition module, and a synthesis module. The extraction module is configured to extract pronunciation attribute features of a speech to be processed; the acquisition module is configured to obtain target attribute features of several speech attributes based on the target categories of the speech to be processed on those speech attributes; and the synthesis module is configured to synthesize speech for the speech to be processed based on the pronunciation attribute features and each target attribute feature.

To solve the above technical problem, a fifth aspect of the present application provides an electronic device, including a memory and a processor coupled to each other. The memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech synthesis method of the first or second aspect.

To solve the above technical problem, a sixth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the speech synthesis method of the first or second aspect.

In the above scheme, the pronunciation attribute features of the text to be synthesized are extracted; target attribute features of several speech attributes are then obtained based on the target categories of the text to be synthesized on those speech attributes; finally, speech for the text to be synthesized is synthesized based on the pronunciation attribute features and each target attribute feature. On the one hand, since a large amount of manual speech recording is unnecessary, time is saved, which helps reduce the cost of speech synthesis. On the other hand, since target attribute features are obtained from the target categories of the text on several speech attributes before synthesis, speech data with any combination of attribute categories can be synthesized, improving the degree of freedom of speech synthesis. The scheme can therefore reduce the cost of speech synthesis while improving its degree of freedom.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of an embodiment of the speech synthesis method of the present application;

FIG. 2 is a schematic diagram of attribute feature extraction in an embodiment of step S12 in FIG. 1;

FIG. 3 is a schematic diagram of attribute feature synthesis in an embodiment of step S13 in FIG. 1;

FIG. 4 is a schematic flowchart of another embodiment of the speech synthesis method of the present application;

FIG. 5 is a schematic diagram of a framework of an embodiment of the speech synthesis apparatus of the present application;

FIG. 6 is a schematic diagram of a framework of another embodiment of the speech synthesis apparatus of the present application;

FIG. 7 is a schematic diagram of a framework of an embodiment of the electronic device of the present application;

FIG. 8 is a schematic diagram of a framework of an embodiment of the computer-readable storage medium of the present application.

Detailed Description

The solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.

In the following description, specific details such as particular system structures, interfaces, and techniques are set forth for the purpose of illustration rather than limitation, in order to provide a thorough understanding of the present application.

The terms "system" and "network" are used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it. Furthermore, "multiple" herein means two or more.

Please refer to FIG. 1, a schematic flowchart of an embodiment of the speech synthesis method of the present application.

Specifically, the method may include the following steps:

Step S11: Extract pronunciation attribute features of the text to be synthesized.

In one implementation scenario, the pronunciation attributes may include, but are not limited to, pronunciation content information, which is not limited here. Specifically, a pronunciation sequence of the text to be synthesized may be extracted first, the pronunciation sequence containing several pronunciation marks; feature encoding is then performed on each pronunciation mark to obtain the pronunciation attribute features of the text to be synthesized. For example, the International Phonetic Alphabet (IPA) may be used to mark the pronunciation content of the text to be synthesized to obtain the pronunciation sequence, and one-hot encoding may be used to encode each pronunciation mark. For the specific process, refer to technical documentation on IPA and one-hot encoding, which is not repeated here.
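
As a minimal illustration of the IPA-plus-one-hot scheme above, the following Python sketch encodes a pronunciation sequence as one-hot vectors. The phoneme inventory and the example sequence are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Illustrative (hypothetical) IPA phoneme inventory; a real system would
# enumerate the full inventory of the target language(s).
PHONEME_INVENTORY = ["h", "ə", "l", "oʊ", "w", "ɜː", "d"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_INVENTORY)}

def one_hot_encode(pronunciation_sequence):
    """Encode an IPA pronunciation sequence as one one-hot vector per mark."""
    encoded = np.zeros((len(pronunciation_sequence), len(PHONEME_INVENTORY)),
                       dtype=np.float32)
    for t, mark in enumerate(pronunciation_sequence):
        encoded[t, PHONEME_TO_ID[mark]] = 1.0
    return encoded

# A toy pronunciation sequence for "hello world".
pronunciation_features = one_hot_encode(["h", "ə", "l", "oʊ", "w", "ɜː", "l", "d"])
print(pronunciation_features.shape)  # (8, 7): sequence length x inventory size
```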

Step S12: Obtain target attribute features of several speech attributes based on the target categories of the text to be synthesized on those speech attributes.

In one implementation scenario, the several speech attributes include at least one of a channel attribute, a dialect attribute, a timbre attribute, an emotion attribute, and a style attribute. By distinguishing speech attributes in this way, that is, by decoupling the speech information that is originally coupled together into multiple speech attributes, the diversity of speech synthesis is improved during the synthesis process, as is its efficiency.

In one implementation scenario, the difficulty in controlling speech attributes lies in the fact that speech couples together many kinds of information. To freely combine the attribute information contained in speech and control its synthesis, the speech attributes must be distinguished and the coupled information decoupled, so that the content of synthesized speech can be controlled through free combination. Speech contains rich attribute information; to control it, the categories of speech attributes must first be determined. The specific categories can be set according to the application scenario and are not limited here. This application takes the channel, dialect, timbre, emotion, and style attributes as examples to describe the speech synthesis process.

In a specific implementation scenario, the channel attribute refers to the environment in which the speech is recorded, such as a recording studio, a meeting room, or a vehicle. Different recording environments give speech different channel attributes and a different listening experience; controlling the channel attribute during generation makes the synthesized speech better match the scene and sound more realistic. The dialect attribute refers to different dialects within a language, such as Mandarin or Northeastern dialect within Chinese, or British and American English within English; the dialect categories are determined by the language scenario in use. The timbre attribute refers to the vocal timbre information in speech that distinguishes the speaker's identity. The emotion attribute refers to the emotional category contained in speech, such as neutral, happy, sad, or angry. The style attribute refers to the speaking style, for example the styles used in scenarios such as news, interaction, customer service, and novel narration. After the speech attributes are determined, each attribute can be further subdivided into categories; for the timbre attribute, for example, each speaker may be a category, or gender may be a category, or an age range may be a category. The specific classification can be set according to the actual situation and is not limited here. For attributes with mature category divisions, such as emotion, the categories can be determined according to application requirements.
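
For illustration only, the attribute taxonomy described above could be captured as a simple configuration; the concrete category lists below are assumptions chosen to mirror the examples in the text.

```python
# Hypothetical attribute taxonomy mirroring the examples in the text; the
# concrete category lists would be set per application scenario.
SPEECH_ATTRIBUTES = {
    "channel": ["studio", "meeting_room", "in_car"],
    "dialect": ["mandarin", "northeastern", "british_english", "american_english"],
    "timbre":  ["speaker_001", "speaker_002"],  # or gender / age-range categories
    "emotion": ["neutral", "happy", "sad", "angry"],
    "style":   ["news", "interaction", "customer_service", "novel"],
}
```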

In one implementation scenario, since the speech attributes cover multiple categories, speech data must be collected across those categories to ensure the accuracy of speech synthesis. To ensure that speech attributes of different categories can be correctly decoupled, speech data from at least two speakers should be collected for each category of each attribute. The speech data may be recorded by speakers, obtained through collection, or derived by processing recordings from recording devices in public places. The specific collection method can be chosen according to the actual application scenario and is not limited here.

In one implementation scenario, obtaining the target attribute features of each speech attribute according to the target category of the text to be synthesized on that attribute includes at least: performing feature modeling based on speech data related to the attribute categories of each speech attribute to obtain an attribute-space probability distribution model for each speech attribute; and then, based on the target category of the text to be synthesized on a speech attribute, sampling from the attribute-space probability distribution model of that attribute to obtain the corresponding target attribute feature. By modeling the speech data related to the attribute categories in this way, an attribute-space probability distribution model is obtained in which different attribute categories occupy different regions, yielding the distribution space of each category under the same attribute. This facilitates data sampling during synthesis and improves its efficiency.

In one implementation scenario, each speech attribute has at least one attribute category. Performing feature modeling based on the speech data related to the attribute categories then includes: for each attribute category under a speech attribute, extracting sample attribute features of the related speech data with respect to that attribute; and then constructing the attribute-space probability distribution model of the speech attribute based on the sample attribute features of its categories. With such a model, the corresponding attribute features can be sampled directly from the distribution during synthesis, improving the efficiency of speech synthesis.

In one implementation scenario, in the process of obtaining the target attribute features of the speech attributes, the target attributes may be extracted through manual annotation, or feature modeling may be performed on the speech data related to each attribute category to obtain attribute-space probability distribution models, from which the target attribute features are obtained by sampling the distribution of the target speech attribute. The specific acquisition method can be chosen according to the actual application scenario and is not limited here.

In a specific implementation scenario, before feature modeling is performed on the speech data related to the attribute categories of the speech attributes, the features of each attribute category in the speech data must first be extracted; the attribute features are then modeled to obtain the attribute-space probability distribution models. During extraction, the target attributes may be extracted by building a feature extraction model or through manual annotation; the specific extraction method can be set according to the actual situation and is not limited here.

In a specific implementation scenario, the target attribute category features are extracted through manual annotation, in which a person labels the corresponding attribute features after listening to the speech; the annotation is then encoded in some way, and the encoded features serve as the representation of that attribute information. For example, if the speech attributes include a pronunciation attribute, the International Phonetic Alphabet (IPA) may be used to annotate the pronunciation content of the speech to obtain its pronunciation sequence, and one-hot encoding may then be applied to each pronunciation in the sequence to obtain an encoded representation sequence. Alternatively, for the emotion attribute, whose categories include neutral, happy, sad, angry, and so on, one-hot encoding may be used: neutral may be labeled 1000, happy 0100, sad 0010, and angry 0001. Emotion may also be encoded with integers: neutral as 1, happy as 2, sad as 3, and angry as 4. In the process of manually annotating the target attributes, different encoding schemes can be chosen according to the actual situation and are not limited here.
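
A small sketch of the two emotion encodings mentioned above, assuming the four categories named in the text:

```python
# The two emotion encodings named above, for the four example categories.
EMOTIONS = ["neutral", "happy", "sad", "angry"]

ONE_HOT = {e: [1 if j == i else 0 for j in range(len(EMOTIONS))]
           for i, e in enumerate(EMOTIONS)}
# e.g. ONE_HOT["neutral"] == [1, 0, 0, 0]  (the "1000" labelling in the text)

INTEGER_CODE = {e: i + 1 for i, e in enumerate(EMOTIONS)}
# e.g. INTEGER_CODE == {"neutral": 1, "happy": 2, "sad": 3, "angry": 4}
```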

In a specific implementation scenario, the target attribute category features are extracted by a model trained for the purpose. Training methods may include, but are not limited to, supervised learning, self-supervised learning, and other modeling approaches. In supervised learning, existing attribute-annotated data is first used as training data; taking the emotion attribute as an example, an emotion classification model is trained, where the training data may be the recorded speech data or additionally collected speech data with emotion annotations. The emotion classification model predicts the emotion of the speech data to obtain a predicted emotion category, and the network parameters of the model are adjusted based on the difference between the annotated sample emotion category and the predicted category. It should be noted that the emotion classification model may be a deep neural network, a convolutional neural network, or the like, which is not limited here. In addition, the model can be iteratively updated and optimized by minimizing a cross-entropy loss function until training converges. In this case, the features of the last hidden layer of the emotion classification model can be taken as the emotion attribute features of the current speech. Furthermore, so that the extracted emotion attribute features do not contain feature information of other attributes, additional information-constraint criteria can be designed during training, such as minimizing the mutual information with the timbre, so that the extracted representation is not affected by the features of other attributes. Other speech attributes can be handled analogously and are not enumerated one by one here.
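
The following PyTorch sketch illustrates the supervised route described above: train an emotion classifier with a cross-entropy loss, then reuse its last hidden layer as the emotion attribute feature. The network shape, feature dimensions, and stand-in data are assumptions; the mutual-information constraint is omitted.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Toy classifier: utterance-level acoustic feature -> emotion class.
    The last hidden layer doubles as the emotion attribute feature."""
    def __init__(self, input_dim=80, hidden_dim=256, num_emotions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, x):
        hidden = self.backbone(x)           # emotion attribute feature
        return self.head(hidden), hidden

model = EmotionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()           # the cross-entropy loss named above

# One illustrative training step on random stand-in data.
x = torch.randn(32, 80)                     # batch of acoustic features
y = torch.randint(0, 4, (32,))              # annotated sample emotion categories
optimizer.zero_grad()
logits, _ = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()

# After convergence, the hidden output serves as the attribute feature.
with torch.no_grad():
    _, emotion_feature = model(torch.randn(1, 80))
```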

Please refer to FIG. 2, a schematic diagram of attribute feature extraction in an embodiment of step S12 in FIG. 1. As shown in FIG. 2, the speech data 20 is audio information collected in various ways, and the attribute modules may include a channel attribute module 21, a dialect attribute module 22, a timbre attribute module 23, an emotion attribute module 24, and a style attribute module 25, each obtained through manual annotation or model learning. The attribute modules may also include a pronunciation attribute module (not shown), through which pronunciation attribute features can be extracted; for its extraction process, refer to the preceding description of extracting pronunciation attribute features, which is not repeated here. Corresponding to the modules, the attribute features may include channel attribute features 210, dialect attribute features 220, timbre attribute features 230, emotion attribute features 240, and style attribute features 250, each obtained from the corresponding attribute module. By extracting attribute features from a large amount of speech data, the features corresponding to the different categories under each attribute are obtained. Based on the sample attribute features of the categories under a speech attribute, the attribute-space probability distribution model of that attribute is constructed, from which the attribute features corresponding to the different categories can be obtained.

In one implementation scenario, feature modeling is performed on the speech data related to the attribute categories of the speech attributes to obtain the attribute-space probability distribution models. The modeling method may include, but is not limited to, maximum-likelihood modeling, and the model structure may include, but is not limited to, a Gaussian mixture model, a flow model, and the like. Once the attribute-space probability distribution model of a speech attribute is built, the probability distribution can be sampled for different attribute categories during synthesis to obtain the target attribute category features. Furthermore, to achieve controlled sampling of the attribute-space distribution, additional fine-grained control information can be input during modeling so that the conditional probability distribution of the attribute feature space is modeled. For the emotion space, for example, the emotion category information is input into the model during modeling; in the generation stage, features of a specified emotion category can then be sampled, enabling finer-grained sampling and prediction.
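
As a sketch of the attribute-space modeling and controlled sampling described above, the following fits one Gaussian mixture model per attribute category (here, for the emotion attribute) and samples a target attribute feature for a specified category. The feature dimension and the stand-in data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in emotion attribute features extracted from speech data,
# grouped by annotated category.
rng = np.random.default_rng(0)
features_by_category = {
    "happy": rng.normal(loc=1.0, size=(200, 16)),
    "sad":   rng.normal(loc=-1.0, size=(200, 16)),
}

# One GMM per category approximates the class-conditional distribution,
# enabling the controlled sampling described above.
gmms = {cat: GaussianMixture(n_components=4, random_state=0).fit(feats)
        for cat, feats in features_by_category.items()}

# At synthesis time, sample a target attribute feature for the target category.
target_feature, _ = gmms["happy"].sample(1)
print(target_feature.shape)  # (1, 16)
```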

In a specific implementation scenario, the attribute features used to build the attribute-space probability distribution models are obtained from speech through model learning, and when building the models, the speech attributes to be modeled can be selected as needed; the specific selection can be made according to the specific situation.

Step S13: Synthesize speech for the text to be synthesized based on the pronunciation attribute features and each target attribute feature.

In one implementation scenario, the pronunciation attribute features are those of the text to be synthesized, and each target attribute feature is sampled from the attribute-space probability distribution model of the corresponding speech attribute; all the obtained attribute features are then combined to synthesize the speech.

In one implementation scenario, where the speech attributes include the timbre attribute as described above, before synthesizing the speech based on the pronunciation attribute features and the target attribute features, a reference attribute feature of the timbre attribute may be obtained based on the target category of the text to be synthesized on the timbre attribute, the reference attribute feature being extracted from speech data related to that target category. An adjustment process is then performed on the reference attribute feature to obtain the target attribute feature of the timbre attribute, so that a timbre different from those of the speakers in the training set can be generated. On this basis, the speech is synthesized from the pronunciation attribute features and the target attribute features. In this way, by adjusting the reference attribute features during synthesis, timbres distinct from those of the speakers in the training set can be generated, which helps quickly create synthesized speech with rich timbres.

In a specific implementation scenario, during the adjustment process based on the reference attribute features of the timbre attribute, multiple reference attribute features may be weighted to obtain the target attribute feature. For example, by sampling the timbre-space probability distribution, synthesized speech with a timbre different from those of the speakers in the training set can be generated, i.e., speech with a new timbre. For two different timbre attribute features S_A and S_B, a new timbre attribute feature S_new can be generated by interpolation with a weight λ, where λ is the interpolation weight between the two features and 0 < λ < 1. The interpolated feature used to synthesize new speech data can be expressed as S_new = λ·S_A + (1 − λ)·S_B. Constructing synthesized speech in this way saves the work of finding speakers and recording audio, generates timbres different from those of the speakers in the training set, and has the advantage of avoiding copyright risk.

In a specific implementation scenario, during the adjustment process based on the reference attribute feature of the timbre attribute, the reference attribute feature may also be scaled to obtain the target attribute feature. For example, from a timbre attribute feature S_C, a new timbre attribute feature S_new can be generated by stretching with a weight λ, expressed as S_new = λ·S_C. Constructing synthesized speech in this way likewise generates timbres different from those of the speakers in the training set, producing synthesized speech with new timbres and reducing the cost of speech synthesis.
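
Both adjustment operations above reduce to simple vector arithmetic; a minimal sketch, assuming 16-dimensional timbre features:

```python
import numpy as np

def interpolate_timbre(s_a, s_b, lam):
    """S_new = lam * S_A + (1 - lam) * S_B, with 0 < lam < 1."""
    assert 0.0 < lam < 1.0
    return lam * s_a + (1.0 - lam) * s_b

def stretch_timbre(s_c, lam):
    """S_new = lam * S_C: stretch an existing timbre feature."""
    return lam * s_c

s_a = np.random.randn(16)                      # stand-in timbre feature, speaker A
s_b = np.random.randn(16)                      # stand-in timbre feature, speaker B
s_new = interpolate_timbre(s_a, s_b, lam=0.3)  # a "new" interpolated timbre
s_stretched = stretch_timbre(s_a, lam=1.2)     # a "new" stretched timbre
```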

Please refer to FIG. 3, a schematic diagram of attribute feature synthesis in an embodiment of step S13 in FIG. 1. As shown in FIG. 3, the speech attribute features include pronunciation attribute features 30, channel attribute features 31, dialect attribute features 32, timbre attribute features 33, emotion attribute features 34, and style attribute features 35. The pronunciation attribute features and the other speech attribute features are input into the speech synthesis module 36 for synthesis processing to obtain the synthesized speech. Specifically, the speech synthesis module 36 can be trained on sample speech. During training, the pronunciation attribute features and the other attribute features of the sample speech are obtained, and the speech synthesis module 36 synthesizes them into the speech corresponding to the sample; the network parameters of the module are then adjusted based on the difference between the sample speech and its corresponding synthesized speech (for example, the difference between their mel spectrograms), so that the trained module achieves attribute-controlled speech generation. It should be noted that the input of the speech synthesis module 36 is the previously extracted attribute features and its output is the synthesized speech 37; the hidden layers consist of deep neural network modules, for example one or a combination of network types such as deep neural networks, recurrent neural networks, and convolutional neural networks. The network is then trained under certain training criteria, including but not limited to the minimum mean-square-error criterion and the maximum-likelihood criterion. With the speech synthesis module 36, synthesis can be performed under control, and the characteristics of the synthesized speech 37 can be better controlled.
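
A toy sketch of how a synthesis module of this kind might be trained under the minimum mean-square-error criterion: per-frame pronunciation features and utterance-level attribute features are concatenated and mapped to mel-spectrogram frames. All dimensions, the network shape, and the stand-in data are assumptions, and the module name is hypothetical.

```python
import torch
import torch.nn as nn

class SpeechSynthesisModule(nn.Module):
    """Toy stand-in for module 36: frame-level pronunciation features plus
    utterance-level attribute features in, mel-spectrogram frames out."""
    def __init__(self, pron_dim=64, attr_dim=80, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pron_dim + attr_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, mel_dim),
        )

    def forward(self, pron_feats, attr_feats):
        # Broadcast the utterance-level attribute vector to every frame.
        attrs = attr_feats.unsqueeze(1).expand(-1, pron_feats.size(1), -1)
        return self.net(torch.cat([pron_feats, attrs], dim=-1))

model = SpeechSynthesisModule()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on stand-in data (batch of 8 utterances, 100 frames each).
pron = torch.randn(8, 100, 64)        # pronunciation attribute features
attrs = torch.randn(8, 80)            # concatenated channel/dialect/timbre/... features
target_mel = torch.randn(8, 100, 80)  # mel spectrogram of the sample speech

optimizer.zero_grad()
pred_mel = model(pron, attrs)
loss = nn.functional.mse_loss(pred_mel, target_mel)  # minimum-MSE criterion
loss.backward()
optimizer.step()
```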

In the above scheme, the pronunciation attribute features of the text to be synthesized are extracted; target attribute features of several speech attributes are then obtained based on the target categories of the text to be synthesized on those speech attributes; finally, speech for the text to be synthesized is synthesized based on the pronunciation attribute features and each target attribute feature. On the one hand, since a large amount of manual speech recording is unnecessary, time is saved, which helps reduce the cost of speech synthesis. On the other hand, since target attribute features are obtained from the target categories on several speech attributes before synthesis, speech data with any combination of attribute categories can be synthesized, improving the degree of freedom of speech synthesis. The scheme can therefore reduce the cost of speech synthesis while improving its degree of freedom.

Please refer to FIG. 4, a schematic flowchart of another embodiment of the speech synthesis method of the present application. It should be noted that, in this embodiment of the disclosure, arbitrary attribute transfer can be performed on the basis of a speech to be processed, so as to synthesize speech that shares some attribute features with the speech to be processed. In addition, this embodiment focuses only on its differences from the foregoing embodiments; for the same or similar points, refer to the foregoing embodiments, which are not repeated here. Specifically, the embodiment may include the following steps:

Step S41: Extract pronunciation attribute features of the speech to be processed.

In one implementation scenario, as described above, the pronunciation attributes may include the pronunciation content information of the speech. Similar to extracting the pronunciation attribute features of a text to be synthesized, for the speech to be processed the International Phonetic Alphabet (IPA) may also be used to annotate the pronunciation content of the speech to obtain its pronunciation sequence, and one-hot encoding may then be applied to each pronunciation in the sequence to obtain the pronunciation attribute features.

Step S42: Obtain target attribute features of several speech attributes based on the target categories of the speech to be processed on those speech attributes.

In one implementation scenario, for obtaining the target attribute features of the speech attributes based on the target categories of the speech to be processed, refer to the description in the foregoing embodiments of obtaining target attribute features based on the target categories of the text to be synthesized, which is not repeated here.

In one implementation scenario, before obtaining the target attribute features based on the target categories of the speech to be processed, at least one speech attribute may be selected as a first speech attribute, with the unselected speech attributes serving as second speech attributes. Obtaining the target attribute features then means: obtaining the target attribute feature of each first speech attribute based on the target category of the speech to be processed on that attribute; and extracting the speech attribute features of the speech to be processed with respect to each second speech attribute as the target attribute features of those second speech attributes. By transferring attribute features in this way, a variety of synthesized speech with different attribute category combinations can be produced under control.

In a specific implementation scenario, taking the selection of the emotion attribute as the first speech attribute as an example, the channel, dialect, timbre, and style attributes serve as second speech attributes. Each attribute of the speech to be processed has its own category, and suppose the target category of the first speech attribute is "happy". In this case, an attribute feature whose emotion category is "happy" can be extracted from existing speech data, and this happy emotion feature is combined with the attribute features of the remaining attributes to synthesize the final speech. In other words, for the speech to be processed, only the emotion attribute feature is changed while all other attribute features are retained to obtain the final synthesized speech.

In a specific implementation scenario, taking the selection of the style attribute as the first speech attribute as an example, the channel, dialect, timbre, and emotion attributes serve as second speech attributes. Suppose the target category of the first speech attribute is "interaction". In this case, an attribute feature whose style category is "interaction" can be extracted from existing speech data and combined with the attribute features of the remaining attributes to synthesize the final speech.

It should be noted that the above uses only the emotion and style attributes as examples to illustrate the specific process of attribute feature transfer. In practice, any one of the aforementioned speech attributes, or a combination of at least two of them, may be selected for transfer, which is not limited here.

In addition, for example, the pronunciation attribute features of the speech to be processed can be extracted, and the target attribute features of the channel, timbre, dialect, emotion, and style attributes obtained based on the target categories of the speech to be processed on those attributes, with the target category of the timbre attribute being its original timbre category. That is, the timbre attribute feature can be extracted from the speech to be processed and used as the target attribute feature of the timbre attribute. Speech synthesis can then be performed based on the pronunciation attribute features, the timbre attribute feature, and the target attribute features corresponding to the categories specified on the channel, dialect, emotion, style, and other speech attributes, so as to freely synthesize speech with different channels, dialects, emotions, and styles while retaining the original pronunciation and timbre attributes of the speech to be processed.
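
A schematic sketch of the attribute-transfer flow described in this step: the features of the unselected (second) attributes are kept from the source speech, and only the selected (first) attributes are replaced with features for their target categories. The function names and the sampler interface are hypothetical.

```python
def transfer_attributes(extracted_feats, first_attrs, sample_target_feature):
    """Keep the features of the unselected (second) attributes; replace each
    selected (first) attribute with a feature for its target category.

    extracted_feats: dict attribute -> feature extracted from the source speech
    first_attrs:     dict attribute -> target category to transfer to
    sample_target_feature: callable (attribute, category) -> feature
    """
    target_feats = dict(extracted_feats)
    for attr, category in first_attrs.items():
        target_feats[attr] = sample_target_feature(attr, category)
    return target_feats

# e.g. change only the emotion, keeping channel/dialect/timbre/style:
# feats = transfer_attributes(source_feats, {"emotion": "happy"}, sampler)
```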

Step S43: Synthesize speech for the speech to be processed based on the pronunciation attribute features and each target attribute feature.

In one implementation scenario, the synthesized speech of the speech to be processed is obtained from the pronunciation attribute features and the target attribute features, and the target attribute features can be set according to the actual application scenario. On the one hand, a variety of synthesized speech can be generated; on the other hand, the speech attribute features of the synthesized speech can be effectively controlled. For the specific synthesis process, refer to the description in the foregoing embodiments of synthesizing speech for the text to be synthesized based on the pronunciation attribute features and the target attribute features, which is not repeated here.

In the above scheme, the pronunciation attribute features of the speech to be processed are extracted; target attribute features of several speech attributes are then obtained based on the target categories of the speech to be processed on those speech attributes; finally, the synthesized speech of the speech to be processed is obtained based on the pronunciation attribute features and each target attribute feature. On the one hand, since a large amount of manual speech recording is unnecessary, time is saved, which helps reduce the cost of speech synthesis. On the other hand, since target attribute features are obtained from the target categories on several speech attributes before synthesis, speech data with any combination of attribute categories can be synthesized, improving the degree of freedom of speech synthesis. The scheme can therefore reduce the cost of speech synthesis while improving its degree of freedom.

Please refer to FIG. 5, a schematic diagram of a framework of an embodiment of the speech synthesis apparatus of the present application. The speech synthesis apparatus 50 includes an extraction module 51, an acquisition module 52, and a synthesis module 53. The extraction module 51 is configured to extract pronunciation attribute features of the text to be synthesized; the acquisition module 52 is configured to obtain target attribute features of several speech attributes based on the target categories of the text to be synthesized on those speech attributes; and the synthesis module 53 is configured to synthesize speech for the text to be synthesized based on the pronunciation attribute features and each target attribute feature.

In the above scheme, the pronunciation attribute features of the text to be synthesized are extracted; target attribute features of several speech attributes are then obtained based on the target categories of the text to be synthesized on those speech attributes; finally, speech for the text to be synthesized is synthesized based on the pronunciation attribute features and each target attribute feature. On the one hand, since a large amount of manual speech recording is unnecessary, time is saved, which helps reduce the cost of speech synthesis. On the other hand, since target attribute features are obtained from the target categories on several speech attributes before synthesis, speech data with any combination of attribute categories can be synthesized, improving the degree of freedom of speech synthesis. The scheme can therefore reduce the cost of speech synthesis while improving its degree of freedom.

In some disclosed embodiments, the several speech attributes include at least one of a channel attribute, a dialect attribute, a timbre attribute, an emotion attribute, and a style attribute.

Therefore, by distinguishing the speech attributes, that is, decoupling the coupled speech information into multiple speech attributes, the diversity of speech synthesis is improved during the synthesis process, as is its efficiency.

In some disclosed embodiments, the acquisition module 52 includes a feature modeling submodule configured to perform feature modeling based on speech data related to the attribute categories of each speech attribute to obtain the attribute-space probability distribution model of each speech attribute, and a model sampling submodule configured to sample, based on the target category of the text to be synthesized on a speech attribute, from the attribute-space probability distribution model of that attribute to obtain the corresponding target attribute feature.

Therefore, by performing feature modeling on the speech data related to the attribute categories, an attribute-space probability distribution model of each speech attribute is obtained in which different attribute categories occupy different regions, yielding the distribution space of the different categories under the same attribute. This facilitates data sampling during synthesis and improves the efficiency of speech synthesis.

In some disclosed embodiments, each speech attribute has at least one attribute category. The feature modeling submodule includes an extraction unit configured to extract, for each attribute category under a speech attribute, sample attribute features of the related speech data with respect to that attribute, and a construction unit configured to construct the attribute-space probability distribution model of the speech attribute based on the sample attribute features of its categories.

Therefore, by constructing the attribute-space probability distribution model of a speech attribute, the corresponding attribute features can be sampled directly from the distribution during synthesis, improving the efficiency of speech synthesis.

In some disclosed embodiments, the synthesis module 53 includes an acquisition sub-module, configured to acquire reference attribute features of the timbre attribute based on the target category of the text to be synthesized on the timbre attribute, the reference attribute features being extracted from speech data related to that target category; the synthesis module 53 further includes an adjustment sub-module, configured to perform adjustment processing based on the reference attribute features of the timbre attribute, so as to obtain the target attribute feature of the timbre attribute.

Therefore, during speech synthesis, the target attribute feature can also be obtained by adjusting reference attribute features, so that synthesized speech whose timbre differs from that of any speaker in the training set can be generated. This yields synthesized speech with a new timbre and has the advantage of avoiding copyright risk.

In some disclosed embodiments, the adjustment sub-module includes either a weighting unit or a scaling unit: the weighting unit is configured to weight multiple reference attribute features of the timbre attribute to obtain the target attribute feature of the timbre attribute; the scaling unit is configured to scale a reference attribute feature of the timbre attribute to obtain the target attribute feature of the timbre attribute.

Therefore, the timbre attribute can be adjusted by interpolation, stretching, and similar operations to obtain the target attribute feature of the timbre attribute, so that speech data with a new timbre, different from that of any speaker in the training set, can be produced during synthesis, again with the advantage of avoiding copyright risk.
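
If one assumes, as is common in practice, that a reference attribute feature of the timbre attribute is a fixed-length speaker embedding vector, the two adjustments above reduce to a weighted combination and a scalar stretch. That representation is an assumption; the patent does not prescribe it. A minimal sketch:

```python
# A minimal sketch of the two timbre adjustments named above (weighted
# interpolation and scale adjustment), assuming timbre reference attribute
# features are fixed-length embedding vectors.
import numpy as np

def interpolate_timbre(reference_features, weights):
    """Weighted combination of several reference timbre features.

    The result lies between the reference speakers' timbres, so it
    matches none of them exactly.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the mix stays in range
    return np.tensordot(weights, np.stack(reference_features), axes=1)

def scale_timbre(reference_feature, scale=1.2):
    """Stretch a single reference timbre feature away from its original value."""
    return scale * np.asarray(reference_feature, dtype=float)

# Usage: blend two reference speakers 70/30, or stretch one by 20%.
speaker_a = np.random.default_rng(1).normal(size=64)
speaker_b = np.random.default_rng(2).normal(size=64)
new_timbre = interpolate_timbre([speaker_a, speaker_b], weights=[0.7, 0.3])
stretched_timbre = scale_timbre(speaker_a, scale=1.2)
```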

Referring to FIG. 6, FIG. 6 is a schematic framework diagram of another embodiment of the speech synthesis apparatus of the present application. The speech synthesis apparatus 60 includes an extraction module 61, an acquisition module 62, and a synthesis module 63. The extraction module 61 is configured to extract pronunciation attribute features of a speech to be processed; the acquisition module 62 is configured to acquire target attribute features of various voice attributes based on the target categories of the speech to be processed on several voice attributes; the synthesis module 63 is configured to synthesize the synthesized speech of the speech to be processed based on the pronunciation attribute features and the target attribute features.

In the above scheme, the pronunciation attribute features of the speech to be processed are extracted; then, based on the target categories of the speech to be processed on several voice attributes, the target attribute features of the various voice attributes are acquired; finally, the synthesized speech of the speech to be processed is obtained based on the pronunciation attribute features and the target attribute features. On the one hand, because no large amount of manual voice recording is required, time is saved, which helps reduce the cost of speech synthesis. On the other hand, because the target attribute features of the various voice attributes are acquired from the target categories of the speech to be processed on those voice attributes before synthesis is performed, speech data with any category of any voice attribute can be synthesized, which increases the degree of freedom of speech synthesis. Therefore, the cost of speech synthesis can be reduced while its degree of freedom is improved.

In some disclosed embodiments, the speech synthesis apparatus 60 further includes a selection module, configured to select at least one voice attribute as a first voice attribute and to take the unselected voice attributes as second voice attributes. The acquisition module 62 includes an acquisition sub-module and an extraction sub-module: the acquisition sub-module is configured to acquire the target attribute feature of each first voice attribute based on the target category of the speech to be processed on that first voice attribute; the extraction sub-module is configured to extract the voice attribute features of the speech to be processed with respect to each second voice attribute as the target attribute feature of the corresponding second voice attribute.

Therefore, by migrating voice attribute features in this way, attribute-feature transfer between utterances is accomplished, and multiple synthesized voices whose attribute-feature categories differ can be produced under explicit control.
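
The first/second split amounts to re-sampling some attribute features to the requested categories while carrying the rest over from the source utterance. The sketch below reuses sample_target_attribute from the earlier attribute-space sketch and stubs a hypothetical extract_attribute_feature encoder; neither name comes from the patent.

```python
# A minimal sketch of first/second attribute handling; extract_attribute_feature
# is a hypothetical stub standing in for a real attribute encoder.
import numpy as np

def extract_attribute_feature(source_speech, attribute):
    # Placeholder: a real system would encode the source utterance here.
    return np.zeros(16)

def collect_target_features(source_speech, attribute_spaces,
                            target_categories, first_attributes):
    """Return one target attribute feature per voice attribute.

    attribute_spaces:  dict attribute name -> per-category models
    target_categories: dict attribute name -> requested category
    first_attributes:  names of attributes to re-sample; all other
                       attributes keep the source utterance's features.
    """
    target_features = {}
    for attribute, models in attribute_spaces.items():
        if attribute in first_attributes:
            # First voice attribute: sample the requested category's region
            # (sample_target_attribute as defined in the earlier sketch).
            target_features[attribute] = sample_target_attribute(
                models, target_categories[attribute])
        else:
            # Second voice attribute: retain the source utterance's feature.
            target_features[attribute] = extract_attribute_feature(
                source_speech, attribute)
    return target_features
```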

Referring to FIG. 7, FIG. 7 is a schematic framework diagram of an embodiment of an electronic device of the present application. The electronic device 70 includes a memory 71 and a processor 72 coupled to each other; the memory 71 stores program instructions, and the processor 72 is configured to execute the program instructions to implement the steps in any of the foregoing speech synthesis method embodiments. Specifically, the electronic device 70 may include, but is not limited to, a desktop computer, a notebook computer, a server, a mobile phone, a tablet computer, and the like, which are not limited herein.

Specifically, the processor 72 is configured to control itself and the memory 71 to implement the steps in any of the foregoing speech synthesis method embodiments. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip with signal processing capability. The processor 72 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 72 may be jointly implemented by multiple integrated circuit chips.

In the above scheme, on the one hand, because no large amount of manual voice recording is required, time is saved, which helps reduce the cost of speech synthesis; on the other hand, because the target attribute features of the various voice attributes are acquired from the target categories of the text to be synthesized on several voice attributes before synthesis is performed, speech data with any category of any voice attribute can be synthesized, which increases the degree of freedom of speech synthesis. Therefore, the cost of speech synthesis can be reduced while its degree of freedom is improved.

Referring to FIG. 8, FIG. 8 is a schematic framework diagram of an embodiment of a computer-readable storage medium of the present application. The computer-readable storage medium 80 stores program instructions 81 executable by a processor, and the program instructions 81 are used to implement the steps in any of the foregoing speech synthesis method embodiments.

In the above scheme, on the one hand, because no large amount of manual voice recording is required, time is saved, which helps reduce the cost of speech synthesis; on the other hand, because the target attribute features of the various voice attributes are acquired from the target categories of the text to be synthesized on several voice attributes before synthesis is performed, speech data with any category of any voice attribute can be synthesized, which increases the degree of freedom of speech synthesis. Therefore, the cost of speech synthesis can be reduced while its degree of freedom is improved.

In some embodiments, the functions of, or the modules included in, the apparatuses provided in the embodiments of the present disclosure may be used to execute the methods described in the foregoing method embodiments. For specific implementations, reference may be made to the descriptions of the foregoing method embodiments, which are not repeated here for brevity.

The above descriptions of the various embodiments tend to emphasize the differences between them; for their similarities, the embodiments may be referred to one another, and details are not repeated herein for brevity.

In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus implementations described above are merely illustrative; for instance, the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this implementation.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (12)

1. A speech synthesis method, comprising:
extracting pronunciation attribute features of a text to be synthesized;
acquiring target attribute features of various voice attributes based on target categories of the text to be synthesized on a plurality of voice attributes, respectively; and
synthesizing the synthesized speech of the text to be synthesized based on the pronunciation attribute features and the target attribute features.
2. The method according to claim 1, wherein the plurality of voice attributes comprises at least one of a channel attribute, a dialect attribute, a timbre attribute, an emotion attribute, and a style attribute.
3. The method according to claim 1 or 2, wherein the acquiring of the target attribute features of the various voice attributes based on the target categories of the text to be synthesized on the plurality of voice attributes at least comprises:
performing feature modeling based on speech data related to the attribute categories of the various voice attributes to obtain an attribute space probability distribution model of each voice attribute; and
sampling from the attribute space probability distribution model of a voice attribute based on the target category of the text to be synthesized on that voice attribute, to obtain the target attribute feature of that voice attribute.
4. The method according to claim 3, wherein each voice attribute has at least one attribute category, and the performing of feature modeling based on the speech data related to the attribute categories of the various voice attributes to obtain the attribute space probability distribution model of each voice attribute comprises:
for each attribute category under a voice attribute, extracting sample attribute features, with respect to the voice attribute, of the speech data related to that attribute category; and
constructing the attribute space probability distribution model of the voice attribute based on the sample attribute features of the various attribute categories under the voice attribute.
5. The method according to claim 1, wherein the voice attributes comprise a timbre attribute, and the obtaining of the target attribute feature of the timbre attribute comprises:
acquiring reference attribute features of the timbre attribute based on the target category of the text to be synthesized on the timbre attribute, wherein the reference attribute features of the timbre attribute are extracted based on speech data related to the target category of the timbre attribute; and
performing adjustment based on the reference attribute features of the timbre attribute to obtain the target attribute feature of the timbre attribute.
6. The method according to claim 5, wherein the adjustment based on the reference attribute features of the timbre attribute to obtain the target attribute feature of the timbre attribute comprises any one of:
weighting a plurality of reference attribute features of the timbre attribute to obtain the target attribute feature of the timbre attribute; and
scaling a reference attribute feature of the timbre attribute to obtain the target attribute feature of the timbre attribute.
7. A speech synthesis method, comprising:
extracting pronunciation attribute features of a speech to be processed;
acquiring target attribute features of various voice attributes based on target categories of the speech to be processed on a plurality of voice attributes, respectively; and
synthesizing the synthesized speech of the speech to be processed based on the pronunciation attribute features and the target attribute features.
8. The method according to claim 7, wherein before the acquiring of the target attribute features of the various voice attributes based on the target categories of the speech to be processed on the plurality of voice attributes, the method comprises:
selecting at least one voice attribute as a first voice attribute, and taking the unselected voice attributes as second voice attributes;
and the acquiring of the target attribute features of the various voice attributes based on the target categories of the speech to be processed on the plurality of voice attributes comprises:
acquiring the target attribute feature of each first voice attribute based on the target category of the speech to be processed on that first voice attribute; and
extracting the voice attribute features of the speech to be processed with respect to each second voice attribute as the target attribute feature of the corresponding second voice attribute.
9. A speech synthesis apparatus, comprising:
an extraction module, configured to extract pronunciation attribute features of a text to be synthesized;
an acquisition module, configured to acquire target attribute features of various voice attributes based on target categories of the text to be synthesized on a plurality of voice attributes, respectively; and
a synthesis module, configured to synthesize the synthesized speech of the text to be synthesized based on the pronunciation attribute features and the target attribute features.
10. A speech synthesis apparatus, comprising:
an extraction module, configured to extract pronunciation attribute features of a speech to be processed;
an acquisition module, configured to acquire target attribute features of various voice attributes based on target categories of the speech to be processed on a plurality of voice attributes, respectively; and
a synthesis module, configured to synthesize the synthesized speech of the speech to be processed based on the pronunciation attribute features and the target attribute features.
11. An electronic device, comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech synthesis method according to any one of claims 1 to 8.
12. A computer-readable storage medium, storing program instructions executable by a processor, the program instructions being used to implement the speech synthesis method according to any one of claims 1 to 8.
CN202111665515.XA 2021-12-31 2021-12-31 Speech synthesis method and device, electronic device and storage medium Active CN114283782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665515.XA CN114283782B (en) 2021-12-31 2021-12-31 Speech synthesis method and device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114283782A true CN114283782A (en) 2022-04-05
CN114283782B CN114283782B (en) 2025-05-02

Family

ID=80879535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111665515.XA Active CN114283782B (en) 2021-12-31 2021-12-31 Speech synthesis method and device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114283782B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262119A1 (en) * 2012-03-30 2013-10-03 Kabushiki Kaisha Toshiba Text to speech system
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content
CN109523986A (en) * 2018-12-20 2019-03-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and storage medium
CN112786002A (en) * 2020-12-28 2021-05-11 科大讯飞股份有限公司 Voice synthesis method, device, equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
颜永红 (Yan Yonghong): "语言声学进展及其应用" [Advances in Speech Acoustics and Its Applications] (in Chinese), 应用声学 (Applied Acoustics), no. 02, 15 March 2009 (2009-03-15) *

Also Published As

Publication number Publication date
CN114283782B (en) 2025-05-02

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN111079423A (en) A kind of generation method, electronic device and storage medium of dictation report reading audio
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN112185363A (en) Audio processing method and device
CN117043856A (en) End-to-end model on high-efficiency streaming non-recursive devices
CN118571229B (en) Voice labeling method and device for voice feature description
CN115273802A (en) Speech synthesis method, device, equipment and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN114283781B (en) Speech synthesis method and related device, electronic device and storage medium
Kadam et al. A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation.
CN117408244A (en) A text error correction method and related devices
CN113505612B (en) Multi-user dialogue voice real-time translation method, device, equipment and storage medium
CN114283782B (en) Speech synthesis method and device, electronic device and storage medium
CN115910021A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN114708848A (en) Method and device for acquiring size of audio and video file
Hsu et al. Speaker-dependent model interpolation for statistical emotional speech synthesis
Gref Robust Speech Recognition via Adaptation for German Oral History Interviews
Boulal et al. Exploring data augmentation for Amazigh speech recognition with convolutional neural networks
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal
CN118298837B (en) Tone color conversion method, device, electronic apparatus, storage medium, and program product
CN118298836B (en) Tone color conversion method, device, electronic apparatus, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230505

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant