CN116469372A

CN116469372A - Speech synthesis method, speech synthesis device, electronic device, and storage medium

Info

Publication number: CN116469372A
Application number: CN202310632858.9A
Authority: CN
Inventors: 张旭龙; 王健宗; 唐浩彬
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-05-31
Filing date: 2023-05-31
Publication date: 2023-07-21

Abstract

The embodiment of the application provides a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium, and belongs to the technical field of financial science and technology. The method comprises the following steps: acquiring target text data and reference voice data; vectorizing the reference voice data to obtain a reference embedded voice vector; extracting features of the target text data to obtain a target text representation vector; performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector; performing voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data; and performing frequency spectrum conversion on the synthesized frequency spectrum data to obtain synthesized voice data. The embodiment of the application can improve the accuracy of voice synthesis.

Description

Speech synthesis method, speech synthesis device, electronic device and storage medium

技术领域technical field

本申请涉及金融科技技术领域，尤其涉及一种语音合成方法、语音合成装置、电子设备及存储介质。The present application relates to the technical field of financial technology, and in particular to a speech synthesis method, a speech synthesis device, electronic equipment, and a storage medium.

背景技术Background technique

随着人工智能技术的飞速发展，智能语音交互在被广泛应用于金融、物流、客服等领域，通过智能营销、智能催收、内容导航等功能提高了企业客服的服务水平。With the rapid development of artificial intelligence technology, intelligent voice interaction is widely used in finance, logistics, customer service and other fields. Through intelligent marketing, intelligent collection, content navigation and other functions, the service level of enterprise customer service has been improved.

目前，在智能客服、购物引导等金融服务场景中常常会采用对话机器人来为各个对象提供相应的服务支持。而这些对话机器人所采用的对话语音常常是基于语音合成的方式生成的。At present, dialogue robots are often used in financial service scenarios such as intelligent customer service and shopping guidance to provide corresponding service support for each object. The dialogue voices used by these dialogue robots are often generated based on speech synthesis.

相关技术中的语音合成方法大多是基于卷积神经网络提取文本数据中的文本内容信息，并依赖于提取到的文本内容信息和固定的韵律模板来进行语音合成，这一方式往往会造成合成的语音数据的情感表达能力较差，影响语音合成的准确性，因此，如何提高语音合成的准确性，成为了亟待解决的技术问题。Most of the speech synthesis methods in the related art are based on the convolutional neural network to extract the text content information in the text data, and rely on the extracted text content information and fixed prosody templates to perform speech synthesis. This method often results in poor emotional expression ability of the synthesized speech data, which affects the accuracy of speech synthesis. Therefore, how to improve the accuracy of speech synthesis has become a technical problem to be solved urgently.

发明内容Contents of the invention

本申请实施例的主要目的在于提出一种语音合成方法、语音合成装置、电子设备及存储介质，旨在提高语音合成的准确性。The main purpose of the embodiment of the present application is to provide a speech synthesis method, a speech synthesis device, an electronic device and a storage medium, aiming at improving the accuracy of speech synthesis.

为实现上述目的，本申请实施例的第一方面提出了一种语音合成方法，所述方法包括：In order to achieve the above purpose, the first aspect of the embodiment of the present application proposes a speech synthesis method, the method comprising:

获取目标文本数据和参考语音数据；Obtain target text data and reference voice data;

对所述参考语音数据进行向量化处理，得到参考嵌入语音向量；Carrying out vectorization processing on the reference voice data to obtain a reference embedded voice vector;

对所述目标文本数据进行特征提取，得到目标文本表示向量；Carrying out feature extraction to the target text data to obtain a target text representation vector;

对所述参考嵌入语音向量进行风格标记，得到所述参考嵌入语音向量对应的目标风格嵌入向量；Performing style marking on the reference embedded speech vector to obtain a target style embedding vector corresponding to the reference embedded speech vector;

基于所述目标风格嵌入向量和所述目标文本表示向量进行语音合成，得到合成频谱数据；performing speech synthesis based on the target style embedding vector and the target text representation vector to obtain synthesized spectrum data;

对所述合成频谱数据进行频谱转换，得到合成语音数据。Spectrum conversion is performed on the synthesized spectrum data to obtain synthesized speech data.

在一些实施例，所述对所述参考嵌入语音向量进行风格标记，得到所述参考嵌入语音向量对应的目标风格嵌入向量，包括：In some embodiments, performing style marking on the reference embedded speech vector to obtain a target style embedding vector corresponding to the reference embedded speech vector includes:

获取预设的多个风格标签向量，其中，所述风格标签向量为预设的风格标签对应的向量表示；Obtaining a plurality of preset style tag vectors, wherein the style tag vector is a vector representation corresponding to a preset style tag;

对每一所述风格标签向量和所述参考嵌入语音向量进行注意力计算，得到每一所述风格标签向量和所述参考嵌入语音向量之间的风格相似度；performing attention calculation on each of the style label vectors and the reference embedded speech vector, to obtain the style similarity between each of the style label vectors and the reference embedded speech vector;

基于多个所述风格相似度，得到每一所述风格标签向量的风格标签权重；Obtaining a style label weight of each style label vector based on a plurality of style similarities;

基于所述风格标签权重对所述风格标签向量进行加权求和，得到所述目标风格嵌入向量。The style label vectors are weighted and summed based on the style label weights to obtain the target style embedding vector.

在一些实施例，所述对每一所述风格标签向量和所述参考嵌入语音向量进行注意力计算，得到每一所述风格标签向量和所述参考嵌入语音向量之间的风格相似度，包括：In some embodiments, the attention calculation is performed on each of the style label vectors and the reference embedded speech vector to obtain the style similarity between each of the style label vectors and the reference embedded speech vector, including:

对每一所述风格标签向量和所述参考嵌入语音向量进行矩阵相乘，得到每一所述风格标签向量对应的查询向量、键向量、值向量；performing matrix multiplication on each of the style tag vectors and the reference embedded speech vector to obtain a query vector, a key vector, and a value vector corresponding to each of the style tag vectors;

基于预设函数对所述查询向量、所述键向量、所述值向量进行注意力计算，得到所述风格标签向量和所述参考嵌入语音向量之间的风格相似度。Attention calculation is performed on the query vector, the key vector, and the value vector based on a preset function to obtain a style similarity between the style label vector and the reference embedded speech vector.

在一些实施例，所述基于多个所述风格相似度，得到每一所述风格标签向量的风格标签权重，包括：In some embodiments, the obtaining the style label weight of each style label vector based on the multiple style similarities includes:

对多个所述风格相似度进行求和处理，得到综合风格相似度；summing the multiple style similarities to obtain a comprehensive style similarity;

对每一所述风格标签向量的风格相似度和所述综合风格相似度进行相除，得到所述风格标签向量的风格标签权重。The style similarity of each style label vector is divided by the comprehensive style similarity to obtain the style label weight of the style label vector.

在一些实施例，所述基于所述目标风格嵌入向量和所述目标文本表示向量进行语音合成，得到合成频谱数据，包括：In some embodiments, performing speech synthesis based on the target style embedding vector and the target text representation vector to obtain synthesized spectral data includes:

对所述目标风格嵌入向量和所述目标文本表示向量进行拼接处理，得到组合嵌入向量；splicing the target style embedding vector and the target text representation vector to obtain a combined embedding vector;

基于注意力机制对所述组合嵌入向量进行特征对齐，得到目标音素向量；performing feature alignment on the combined embedding vector based on an attention mechanism to obtain a target phoneme vector;

对所述目标音素向量进行解码处理，得到所述合成频谱数据。Decoding the target phoneme vector to obtain the synthesized spectrum data.

在一些实施例，所述基于注意力机制对所述组合嵌入向量进行特征对齐，得到目标音素向量，包括：In some embodiments, the attention-based mechanism performs feature alignment on the combined embedding vector to obtain a target phoneme vector, including:

基于预设的持续时间预测模型对所述组合嵌入向量进行时间预测，得到音素持续时间；performing time prediction on the combined embedding vector based on a preset duration prediction model to obtain phoneme duration;

基于所述注意力机制和所述音素持续时间对所述组合嵌入向量进行音素长度调整，得到所述目标音素向量。The phoneme length adjustment is performed on the combined embedding vector based on the attention mechanism and the phoneme duration to obtain the target phoneme vector.

在一些实施例，所述对所述合成频谱数据进行频谱转换，得到合成语音数据，包括：In some embodiments, performing spectral conversion on the synthesized spectral data to obtain synthesized speech data includes:

将所述合成频谱数据输入至预设的声码器中，其中，所述声码器包括反卷积层和多感受野融合层；Inputting the synthesized spectral data into a preset vocoder, wherein the vocoder includes a deconvolution layer and a multi-receptive field fusion layer;

基于所述反卷积层对所述合成频谱数据进行上采样处理，得到目标频谱数据；performing upsampling processing on the synthesized spectral data based on the deconvolution layer to obtain target spectral data;

基于所述多感受野融合层对所述目标频谱数据进行多尺度特征融合，得到所述合成语音数据。performing multi-scale feature fusion on the target spectrum data based on the multi-receptive field fusion layer to obtain the synthesized speech data.

为实现上述目的，本申请实施例的第二方面提出了一种语音合成装置，所述装置包括：In order to achieve the above purpose, the second aspect of the embodiment of the present application proposes a speech synthesis device, the device includes:

数据获取模块，用于获取目标文本数据和参考语音数据；A data acquisition module, configured to acquire target text data and reference voice data;

向量化模块，用于对所述参考语音数据进行向量化处理，得到参考嵌入语音向量；A vectorization module, configured to perform vectorization processing on the reference speech data to obtain a reference embedded speech vector;

特征提取模块，用于对所述目标文本数据进行特征提取，得到目标文本表示向量；A feature extraction module, configured to perform feature extraction on the target text data to obtain a target text representation vector;

风格标记模块，用于对所述参考嵌入语音向量进行风格标记，得到所述参考嵌入语音向量对应的目标风格嵌入向量；A style labeling module, configured to perform style labeling on the reference embedded speech vector to obtain a target style embedding vector corresponding to the reference embedded speech vector;

语音合成模块，用于基于所述目标风格嵌入向量和所述目标文本表示向量进行语音合成，得到合成频谱数据；Speech synthesis module, for performing speech synthesis based on the target style embedding vector and the target text representation vector, to obtain synthesized spectrum data;

频谱转换模块，用于对所述合成频谱数据进行频谱转换，得到合成语音数据。The spectrum conversion module is configured to perform spectrum conversion on the synthesized spectrum data to obtain synthesized speech data.

为实现上述目的，本申请实施例的第三方面提出了一种电子设备，所述电子设备包括存储器、处理器，所述存储器存储有计算机程序，所述处理器执行所述计算机程序时实现上述第一方面所述的方法。To achieve the above object, the third aspect of the embodiments of the present application provides an electronic device, the electronic device includes a memory and a processor, the memory stores a computer program, and the processor implements the method described in the first aspect above when executing the computer program.

为实现上述目的，本申请实施例的第四方面提出了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序被处理器执行时实现上述第一方面所述的方法。To achieve the above object, the fourth aspect of the embodiments of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method described in the above-mentioned first aspect is implemented.

本申请提出的语音合成方法、语音合成装置、电子设备及存储介质，其通过获取目标文本数据和参考语音数据；对参考语音数据进行向量化处理，得到参考嵌入语音向量；对目标文本数据进行特征提取，得到目标文本表示向量，能够得到表征文本语义内容的目标文本表示向量，使得能够利用目标文本表示向量进行语音合成。进一步地，对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量，能够实现对语音韵律风格的控制，使得能够将该目标风格嵌入向量用于语音合成，提高语音合成情感的准确性和语音合成效果。进一步地，基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据，能够较好地提高合成频谱数据的数据质量。最后，对合成频谱数据进行频谱转换，得到合成语音数据，能够较为方便地得到波形形式的合成语音数据，该合成语音数据同时包含目标文本数据的文本内容特征和参考语音数据的风格特征，具备较好地情感表达能力，提高了语音合成的准确性，进而使得在保险产品、理财产品等智能对话的过程中，对话机器人表达的合成语音能够更贴合对话对象的对话风格偏好，通过采用对话对象更感兴趣的对话方式和对话风格进行会话交流，提高对话质量和对话有效性，能实现智能语音对话服务，提高客户的服务质量以及客户满意度。The speech synthesis method, speech synthesis device, electronic equipment, and storage medium proposed by the present application obtain target text data and reference speech data; perform vectorization processing on the reference speech data to obtain reference embedded speech vectors; perform feature extraction on the target text data to obtain target text representation vectors, and can obtain target text representation vectors representing the semantic content of the text, so that speech synthesis can be performed using the target text representation vectors. Furthermore, the reference embedded speech vector is style-marked to obtain the target style embedding vector corresponding to the reference embedded speech vector, which can realize the control of speech prosodic style, so that the target style embedding vector can be used for speech synthesis, and improve the accuracy of speech synthesis emotion and speech synthesis effect. Furthermore, speech synthesis is performed based on the target style embedding vector and the target text representation vector to obtain synthetic spectral data, which can better improve the data quality of the synthetic spectral data. Finally, spectrum conversion is performed on the synthesized spectrum data to obtain synthesized speech data, and the synthesized speech data in the form of a waveform can be obtained more conveniently. The synthesized speech data contains both the text content characteristics of the target text data and the style characteristics of the reference speech data. It has better emotional expression ability and improves the accuracy of speech synthesis. In turn, in the process of intelligent dialogues such as insurance products and wealth management products, the synthetic speech expressed by the dialogue robot can be more suitable for the dialogue style preference of the dialogue object. By using the dialogue style and dialogue style that the dialogue object is more interested in for conversation communication, the dialogue quality and effectiveness can be improved, and intelligent voice dialogue can be realized. service, improve customer service quality and customer satisfaction.

附图说明Description of drawings

图1是本申请实施例提供的语音合成方法的流程图；Fig. 1 is the flowchart of the speech synthesis method that the embodiment of the present application provides;

图2是图1中的步骤S104的流程图；Fig. 2 is the flowchart of step S104 in Fig. 1;

图3是图2中的步骤S202的流程图；Fig. 3 is the flowchart of step S202 in Fig. 2;

图4是图2中的步骤S203的流程图；Fig. 4 is the flowchart of step S203 in Fig. 2;

图5是图1中的步骤S105的流程图；Fig. 5 is the flowchart of step S105 in Fig. 1;

图6是图5中的步骤S502的流程图；Fig. 6 is the flowchart of step S502 in Fig. 5;

图7是图1中的步骤S106的流程图；Fig. 7 is the flowchart of step S106 in Fig. 1;

图8是本申请实施例提供的语音合成装置的结构示意图；FIG. 8 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present application;

图9是本申请实施例提供的电子设备的硬件结构示意图。FIG. 9 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.

具体实施方式Detailed ways

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处所描述的具体实施例仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit the present application.

需要说明的是，虽然在装置示意图中进行了功能模块划分，在流程图中示出了逻辑顺序，但是在某些情况下，可以以不同于装置中的模块划分，或流程图中的顺序执行所示出或描述的步骤。说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。It should be noted that although the functional modules are divided in the schematic diagram of the device, and the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the flow chart. The terms "first", "second" and the like in the specification and claims and the above drawings are used to distinguish similar objects, and not necessarily used to describe a specific sequence or sequence.

除非另有定义，本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的，不是旨在限制本申请。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application, and are not intended to limit the present application.

首先，对本申请中涉及的若干名词进行解析：First, analyze some nouns involved in this application:

人工智能(artificial intelligence，AI)：是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及应用系统的一门新的技术科学；人工智能是计算机科学的一个分支，人工智能企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器，该领域的研究包括机器人、语言识别、图像识别、自然语言处理和专家系统等。人工智能可以对人的意识、思维的信息过程的模拟。人工智能还是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。Artificial intelligence (AI): It is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science, artificial intelligence attempts to understand the essence of intelligence, and produce a new intelligent machine that can respond in a similar way to human intelligence. Research in this field includes robots, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

自然语言处理(natural language processing，NLP)：NLP用计算机来处理、理解以及运用人类语言(如中文、英文等)，NLP属于人工智能的一个分支，是计算机科学与语言学的交叉学科，又常被称为计算语言学。自然语言处理包括语法分析、语义分析、篇章理解等。自然语言处理常用于机器翻译、手写体和印刷体字符识别、语音识别及文语转换、信息意图识别、信息抽取与过滤、文本分类与聚类、舆情分析和观点挖掘等技术领域，它涉及与语言处理相关的数据挖掘、机器学习、知识获取、知识工程、人工智能研究和与语言计算相关的语言学研究等。Natural language processing (natural language processing, NLP): NLP uses computers to process, understand and use human languages (such as Chinese, English, etc.). NLP belongs to a branch of artificial intelligence and is an interdisciplinary subject of computer science and linguistics. Natural language processing includes syntax analysis, semantic analysis, text understanding, etc. Natural language processing is often used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis, and opinion mining. It involves data mining related to language processing, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computing.

信息抽取(Information Extraction，NER)：从自然语言文本中抽取指定类型的实体、关系、事件等事实信息，并形成结构化数据输出的文本处理技术。信息抽取是从文本数据中抽取特定信息的一种技术。文本数据是由一些具体的单位构成的，例如句子、段落、篇章，文本信息正是由一些小的具体的单位构成的，例如字、词、词组、句子、段落或是这些具体的单位的组合。抽取文本数据中的名词短语、人名、地名等都是文本信息抽取，当然，文本信息抽取技术所抽取的信息可以是各种类型的信息。Information Extraction (Information Extraction, NER): A text processing technology that extracts specified types of factual information such as entities, relationships, and events from natural language texts, and forms structured data output. Information extraction is a technique to extract specific information from text data. Text data is composed of some specific units, such as sentences, paragraphs, and chapters, and text information is composed of some small specific units, such as words, words, phrases, sentences, paragraphs, or combinations of these specific units. Extracting noun phrases, personal names, and place names in text data is all text information extraction. Of course, the information extracted by text information extraction technology can be various types of information.

梅尔倒频谱系数(Mel-Frequency Cipstal Coefficients,MFCC)：是一组用来建立梅尔倒频谱的关键系数。由音乐信号当中的片段，可以得到一组足以代表此音乐信号的倒频谱，而梅尔倒频谱系数即是从这个倒频谱中推得的倒频谱(也就是频谱的频谱)。与一般的倒频谱不同，梅尔倒频谱最大的特色在于梅尔倒频谱上的频带是均匀分布于梅尔刻度上的，也就是说，相较于一般所看到、线性的倒频谱表示方法，这样的频带会和人类非线性的听觉系统(Audio System)更为接近。例如：在音讯压缩的技术中，常常使用梅尔倒频谱来处理。Mel-Frequency Cepstral Coefficients (MFCC): It is a set of key coefficients used to establish Mel-Frequency Cepstral Coefficients. A set of cepstrums sufficient to represent the music signal can be obtained from the segments in the music signal, and the Mel cepstral coefficients are the cepstrums (that is, the frequency spectrum) derived from the cepstrums. Different from the general cepstrum, the biggest feature of the Mel cepstrum is that the frequency bands on the Mel cepstrum are evenly distributed on the Mel scale. That is to say, compared with the generally seen, linear cepstrum representation method, such frequency bands will be closer to the human nonlinear auditory system (Audio System). For example: In audio compression technology, Mel cepstrum is often used for processing.

声码器(vocoder)：即语音信号某种模型的语音分析合成系统。在传输中只利用模型参数，是在编译码时利用模型参数估计和语音合成技术的语音信号编译码器，也是一种对话音进行分析和合成的编、译码器，也称话音分析合成系统或话音频带压缩系统。Vocoder (vocoder): that is, a speech analysis and synthesis system of a certain model of the speech signal. Only model parameters are used in transmission. It is a speech signal codec that uses model parameter estimation and speech synthesis technology when encoding and decoding. It is also a coder and decoder that analyzes and synthesizes speech.

多头注意力：(multi-head attention)是利用多个查询，来平行地计算从输入信息中选取多个信息。每个注意力关注输入信息的不同部分。硬注意力，即基于注意力分布的所有输入信息的期望。Multi-head attention: (multi-head attention) is to use multiple queries to calculate and select multiple information from input information in parallel. Each attention focuses on a different part of the input information. Hard attention, i.e. the expectation of all input information based on the attention distribution.

嵌入(embedding)：embedding是一种向量表征，是指用一个低维的向量表示一个物体，该物体可以是一个词，或是一个商品，或是一个电影等等；这个embedding向量的性质是能使距离相近的向量对应的物体有相近的含义，embedding实质是一种映射，从语义空间到向量空间的映射，同时尽可能在向量空间保持原样本在语义空间的关系，如语义接近的两个词汇在向量空间中的位置也比较接近。embedding能够用低维向量对物体进行编码还能保留其含义，常应用于机器学习，在机器学习模型构建过程中，通过把物体编码为一个低维稠密向量再传给DNN，以提高效率。Embedding: embedding is a kind of vector representation, which refers to using a low-dimensional vector to represent an object, which can be a word, a commodity, or a movie, etc.; the nature of this embedding vector is to enable objects corresponding to vectors with similar distances to have similar meanings. Embedding is essentially a mapping from semantic space to vector space, while maintaining the relationship of the original sample in the semantic space as much as possible in the vector space. Embedding can encode an object with a low-dimensional vector and retain its meaning. It is often used in machine learning. In the process of building a machine learning model, the object is encoded as a low-dimensional dense vector and then passed to DNN to improve efficiency.

编码(Encoder)：将输入序列转化成一个固定长度的向量。Encoder: Converts the input sequence into a fixed-length vector.

解码(Decoder)：就是将之前生成的固定向量再转化成输出序列；其中，输入序列可以是文字、语音、图像、视频；输出序列可以是文字、图像。Decoding (Decoder): It is to convert the previously generated fixed vector into an output sequence; where the input sequence can be text, voice, image, video; the output sequence can be text, image.

BERT(Bidirectional Encoder Representation from Transformers)模型：BERT模型进一步增加词向量模型泛化能力，充分描述字符级、词级、句子级甚至句间关系特征，基于Transformer构建而成。BERT中有三种embedding，即Token Embedding，SegmentEmbedding，Position Embedding；其中Token Embeddings是词向量，第一个单词是CLS标志，可以用于之后的分类任务；Segment Embeddings用来区别两种句子，因为预训练不光做LM还要做以两个句子为输入的分类任务；Position Embeddings，这里的位置词向量不是transfor中的三角函数，而是BERT经过训练学到的。但BERT直接训练一个positionembedding来保留位置信息，每个位置随机初始化一个向量，加入模型训练，最后就得到一个包含位置信息的embedding，最后这个position embedding和word embedding的结合方式上，BERT选择直接拼接。音素(Phone)：是根据语音的自然属性划分出来的最小语音单位，依据音节里的发音动作来分析，一个动作构成一个音素。BERT (Bidirectional Encoder Representation from Transformers) model: The BERT model further increases the generalization ability of the word vector model, fully describes character-level, word-level, sentence-level and even inter-sentence relationship features, and is built based on Transformer. There are three kinds of embeddings in BERT, namely Token Embedding, SegmentEmbedding, and Position Embedding; among them, Token Embeddings is a word vector, and the first word is a CLS mark, which can be used for subsequent classification tasks; Segment Embeddings are used to distinguish two kinds of sentences, because pre-training is not only for LM but also for classification tasks with two sentences as input; learned in training. However, BERT directly trains a position embedding to retain position information. Each position randomly initializes a vector, joins the model training, and finally obtains an embedding containing position information. In the final combination of position embedding and word embedding, BERT chooses to splice directly. Phoneme (Phone): It is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in syllables. An action constitutes a phoneme.

Softmax函数：Softmax函数是归一化指数函数。Softmax function: The Softmax function is a normalized exponential function.

语音合成是指从文本中合成出可理解的、自然的语音，又称文本转语音(Text-To-Speech,TTS)。语音合成系统在生活中被广泛应用于各个场景，包括语音对话系统；智能语音助手；电话信息查询系统；车载导航，有声电子书等辅助应用；语言学习；机场，车站等实时信息广播系统；视力或语音障碍者的信息获取与交流等等。Speech synthesis refers to synthesizing understandable and natural speech from text, also known as Text-To-Speech (TTS). Speech synthesis systems are widely used in various scenarios in life, including voice dialogue systems; intelligent voice assistants; telephone information query systems; auxiliary applications such as car navigation and audio e-books; language learning; real-time information broadcasting systems at airports and stations;

以保险服务机器人为例，常常需要将保险产品的描述文本和固定对象的说话风格进行融合，生成由该固定对象对保险产品的描述语音。当保险服务机器人与一些感兴趣的对象进行对话时，会自动调用这一描述语音为对象进行保险产品介绍。Taking the insurance service robot as an example, it is often necessary to integrate the description text of the insurance product with the speaking style of the fixed object to generate a voice describing the insurance product by the fixed object. When the insurance service robot has a conversation with some interested objects, it will automatically call this description voice to introduce insurance products for the objects.

目前的语音合成方法大多是基于卷积神经网络提取文本数据中的文本内容信息，并依赖于提取到的文本内容信息和固定的韵律模板来进行语音合成，这一方式往往会造成合成的语音数据的情感表达能力较差，影响语音合成的准确性，因此，如何提高语音合成的准确性，成为了亟待解决的技术问题。Most of the current speech synthesis methods are based on the convolutional neural network to extract the text content information in the text data, and rely on the extracted text content information and fixed prosody templates for speech synthesis. This method often results in poor emotional expression ability of the synthesized speech data, which affects the accuracy of speech synthesis. Therefore, how to improve the accuracy of speech synthesis has become a technical problem to be solved urgently.

基于此，本申请实施例提供了一种语音合成方法、语音合成装置、电子设备及存储介质，旨在提高语音合成的准确性。Based on this, an embodiment of the present application provides a speech synthesis method, a speech synthesis device, an electronic device, and a storage medium, aiming at improving the accuracy of speech synthesis.

本申请实施例提供的语音合成方法、语音合成装置、电子设备及存储介质，具体通过如下实施例进行说明，首先描述本申请实施例中的语音合成方法。The speech synthesis method, speech synthesis device, electronic equipment, and storage medium provided in the embodiments of the present application are specifically described through the following embodiments. First, the speech synthesis method in the embodiments of the present application is described.

本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中，人工智能(Artificial Intelligence，AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Among them, artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

本申请实施例提供的语音合成方法，涉及人工智能技术领域。本申请实施例提供的语音合成方法可应用于终端中，也可应用于服务器端中，还可以是运行于终端或服务器端中的软件。在一些实施例中，终端可以是智能手机、平板电脑、笔记本电脑、台式计算机等；服务器端可以配置成独立的物理服务器，也可以配置成多个物理服务器构成的服务器集群或者分布式系统，还可以配置成提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN以及大数据和人工智能平台等基础云计算服务的云服务器；软件可以是实现语音合成方法的应用等，但并不局限于以上形式。The speech synthesis method provided in the embodiment of the present application relates to the technical field of artificial intelligence. The speech synthesis method provided in the embodiment of the present application can be applied to a terminal, can also be applied to a server, and can also be software running on the terminal or the server. In some embodiments, the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc.; the server end can be configured as an independent physical server, or can be configured as a server cluster or distributed system composed of multiple physical servers, and can also be configured as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms;

需要说明的是，在本申请的各个具体实施方式中，当涉及到需要根据用户信息、用户语音数据、用户行为数据，用户历史数据以及用户位置信息等与用户身份或特性相关的数据进行相关处理时，都会先获得用户的许可或者同意，而且，对这些数据的收集、使用和处理等，都会遵守相关法律法规和标准。此外，当本申请实施例需要获取用户的敏感个人信息时，会通过弹窗或者跳转到确认页面等方式获得用户的单独许可或者单独同意，在明确获得用户的单独许可或者单独同意之后，再获取用于使本申请实施例能够正常运行的必要的用户相关数据。It should be noted that, in each specific implementation of this application, when it comes to relevant processing based on user information, user voice data, user behavior data, user historical data, user location information and other data related to user identity or characteristics, the user's permission or consent will be obtained first, and the collection, use and processing of these data will comply with relevant laws, regulations and standards. In addition, when the embodiment of this application needs to obtain the user's sensitive personal information, it will obtain the user's separate permission or separate consent through a pop-up window or jump to a confirmation page, etc., and obtain the necessary user-related data for the normal operation of the application embodiment after the user's separate permission or separate consent is clearly obtained.

本申请可用于众多通用或专用的计算机系统环境或配置中。例如：个人计算机、服务器计算机、手持设备或便携式设备、平板型设备、多处理器系统、基于微处理器的系统、置顶盒、可编程的消费电子设备、网络PC、小型计算机、大型计算机、包括以上任何系统或设备的分布式计算环境等等。本申请可以在由计算机执行的计算机可执行指令的一般上下文中描述，例如程序模块。一般地，程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请，在这些分布式计算环境中，由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中，程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。The application can be used in numerous general purpose or special purpose computer system environments or configurations. Examples: personal computers, server computers, handheld or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc. This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.

图1是本申请实施例提供的语音合成方法的一个可选的流程图，图1中的方法可以包括但不限于包括步骤S101至步骤S106。Fig. 1 is an optional flow chart of the speech synthesis method provided by the embodiment of the present application. The method in Fig. 1 may include but not limited to steps S101 to S106.

步骤S101，获取目标文本数据和参考语音数据；Step S101, acquiring target text data and reference voice data;

步骤S102，对参考语音数据进行向量化处理，得到参考嵌入语音向量；Step S102, vectorizing the reference speech data to obtain a reference embedded speech vector;

步骤S103，对目标文本数据进行特征提取，得到目标文本表示向量；Step S103, performing feature extraction on the target text data to obtain a target text representation vector;

步骤S104，对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量；Step S104, performing style marking on the reference embedded speech vector to obtain a target style embedding vector corresponding to the reference embedded speech vector;

步骤S105，基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据；Step S105, performing speech synthesis based on the target style embedding vector and the target text representation vector to obtain synthesized spectrum data;

步骤S106，对合成频谱数据进行频谱转换，得到合成语音数据。Step S106, performing spectrum conversion on the synthesized spectrum data to obtain synthesized speech data.

本申请实施例所示意的步骤S101至步骤S106，通过获取目标文本数据和参考语音数据；对参考语音数据进行向量化处理，得到参考嵌入语音向量；对目标文本数据进行特征提取，得到目标文本表示向量，能够得到表征文本语义内容的目标文本表示向量，使得能够利用目标文本表示向量进行语音合成。进一步地，对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量，能够实现对语音韵律风格的控制，使得能够将该目标风格嵌入向量用于语音合成，提高语音合成情感的准确性和语音合成效果。进一步地，基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据，能够较好地提高合成频谱数据的数据质量。最后，对合成频谱数据进行频谱转换，得到合成语音数据，能够较为方便地得到波形形式的合成语音数据，该合成语音数据同时包含目标文本数据的文本内容特征和参考语音数据的风格特征，具备较好地情感表达能力，提高了语音合成的准确性。From step S101 to step S106 shown in the embodiment of the present application, by acquiring target text data and reference voice data; performing vectorization processing on the reference voice data to obtain reference embedded voice vectors; performing feature extraction on the target text data to obtain the target text representation vector, the target text representation vector representing the semantic content of the text can be obtained, so that speech synthesis can be performed using the target text representation vector. Furthermore, the reference embedded speech vector is style-marked to obtain the target style embedding vector corresponding to the reference embedded speech vector, which can realize the control of speech prosodic style, so that the target style embedding vector can be used for speech synthesis, and improve the accuracy of speech synthesis emotion and speech synthesis effect. Furthermore, speech synthesis is performed based on the target style embedding vector and the target text representation vector to obtain synthetic spectral data, which can better improve the data quality of the synthetic spectral data. Finally, spectrum conversion is performed on the synthesized spectrum data to obtain synthesized voice data, and the synthesized voice data in the form of a waveform can be obtained more conveniently. The synthesized voice data contains both the text content characteristics of the target text data and the style characteristics of the reference voice data, which has better emotional expression ability and improves the accuracy of speech synthesis.

在一些实施例的步骤S101中，可以从公开数据集中获取目标文本数据，也可以从已有的文本数据库或者网络平台等获取待处理的目标文本数据，不做限制。例如，公开数据集可以是THCHS30数据集或者LJSpeech数据集等等。In step S101 of some embodiments, the target text data may be obtained from public datasets, or the target text data to be processed may be obtained from existing text databases or network platforms, without limitation. For example, the public dataset can be THCHS30 dataset or LJSpeech dataset, etc.

需要说明的是，目标文本数据可以是含有金融领域的专有名词、金融业务模板词汇、也可以是含有保险产品的产品描述、理财产品的产品描述以及金融领域的常用对话话术等的文本数据。It should be noted that the target text data can be text data containing proper nouns in the financial field, financial business template vocabulary, product descriptions of insurance products, product descriptions of wealth management products, and common dialogue terms in the financial field.

同时，可以通过编写网络爬虫,设置好数据源之后进行有目标性地爬取数据，得到参考说话对象的参考语音数据，其中，数据源可以是各种类型的网络平台、社交媒体也可以是某些特定的音频数据库等，参考说话对象可以是网络用户、演讲人员、歌手等等，参考语音数据可以是参考说话对象的音乐素材、演讲汇报、聊天对话等。其中，该参考语音数据可以是音频信号，该音频信号可以以由音素组成，其中音素是构成音节的最小单元或者最小语音片段。通过上述方式能够较为方便地获取参考语音数据和目标文本数据，提高了数据获取效率。At the same time, by writing a web crawler, setting up the data source and crawling data in a targeted manner, the reference voice data of the reference speaker can be obtained. The data source can be various types of network platforms, social media, or certain specific audio databases. Wherein, the reference speech data may be an audio signal, and the audio signal may be composed of phonemes, where a phoneme is the smallest unit or the smallest speech segment constituting a syllable. The reference speech data and the target text data can be obtained more conveniently through the above method, and the data acquisition efficiency is improved.

在一些实施例的步骤S102中，可以将参考语音数据输入至预设的参考编码器中，通过参考编码器将参考语音数据压缩成可变长度的韵律音频信号，从而实现对参考语音数据的向量化处理，将参考语音数据转换成固定长度的语音向量，得到参考嵌入语音向量，其中，该参考嵌入语音向量包含参考语音数据的语音韵律特征信息。这一方式能够较为方便地将参考语音数据由时域信号转换为频域特征，提取参考语音数据中的韵律特征信息，使得能够在后续的语音合成过程中将参考语音数据的韵律特征信息融入合成语音，从而提高语音合成的准确性。In step S102 of some embodiments, the reference speech data can be input into a preset reference encoder, and the reference speech data is compressed into a variable-length prosodic audio signal by the reference encoder, thereby realizing vectorization processing of the reference speech data, converting the reference speech data into a fixed-length speech vector, and obtaining a reference embedded speech vector, wherein the reference embedded speech vector includes speech prosody feature information of the reference speech data. This method can more conveniently convert the reference speech data from the time-domain signal to the frequency-domain feature, and extract the prosodic feature information in the reference speech data, so that the prosodic feature information of the reference speech data can be integrated into the synthesized speech in the subsequent speech synthesis process, thereby improving the accuracy of speech synthesis.

在一些实施例的步骤S103中，可以将目标文本数据输入至预设的文本模型中，预设的文本模型可以是Bert模型等文本编码模型，不做限制。通过预设的文本模型对目标文本数据进行特征提取，获取目标文本数据中每一文本字符对应的文本嵌入表示，将目标文本数据对应的所有文本嵌入表示进行拼接，得到目标文本表示向量。例如，首先利用文本模型对目标文本数据进行分词处理，得到多个文本字符，再通过预设的字符字典查询到每一文本字符的文本嵌入表示，将所有的文本嵌入表示进行向量连接，得到该目标文本数据对应的目标文本表示向量，该目标文本表示向量包含着目标文本数据对应的音素特征，因此，该目标文本表示向量能够用于表征目标文本数据的文本语义内容。这一方式能够较为准确地实现目标文本数据从数据空间到向量空间的转换，得到表征文本语义内容的目标文本表示向量，使得能够利用目标文本表示向量进行语音合成，得到该目标文本数据对应的合成语音，有助于提高语音合成的准确性。In step S103 of some embodiments, the target text data may be input into a preset text model, and the preset text model may be a text encoding model such as a Bert model, without limitation. The feature extraction of the target text data is carried out through the preset text model, the text embedding representation corresponding to each text character in the target text data is obtained, and all the text embedding representations corresponding to the target text data are spliced to obtain the target text representation vector. For example, firstly, the text model is used to segment the target text data to obtain multiple text characters, and then the text embedding representation of each text character is queried through the preset character dictionary, and all the text embedding representations are vector-connected to obtain the target text representation vector corresponding to the target text data. This method can more accurately realize the conversion of target text data from data space to vector space, obtain the target text representation vector representing the semantic content of the text, make it possible to use the target text representation vector for speech synthesis, and obtain the synthesized speech corresponding to the target text data, which helps to improve the accuracy of speech synthesis.

请参阅图2，在一些实施例中，步骤S104可以包括但不限于包括步骤S201至步骤S204：Referring to FIG. 2, in some embodiments, step S104 may include but not limited to include steps S201 to S204:

步骤S201，获取预设的多个风格标签向量，其中，风格标签向量为预设的风格标签对应的向量表示；Step S201, obtaining a plurality of preset style label vectors, wherein the style label vector is a vector representation corresponding to the preset style label;

步骤S202，对每一风格标签向量和参考嵌入语音向量进行注意力计算，得到每一风格标签向量和参考嵌入语音向量之间的风格相似度；Step S202, perform attention calculation on each style label vector and the reference embedded speech vector, and obtain the style similarity between each style label vector and the reference embedded speech vector;

步骤S203，基于多个风格相似度，得到每一风格标签向量的风格标签权重；Step S203, based on multiple style similarities, obtain the style label weight of each style label vector;

步骤S204，基于风格标签权重对风格标签向量进行加权求和，得到目标风格嵌入向量。Step S204, weighting and summing the style label vectors based on the style label weights to obtain the target style embedding vector.

在一些实施例的步骤S201中，可以从预设的标签数据库中提取预设的多个风格标签，并对提取到的风格标签进行编码处理，得到每一风格标签的向量表示，从而得到多个风格标签向量。其中，不同的风格标签包含有不同的韵律风格，例如风格标签包含声音低沉、声音高亢、声音急促等多种韵律风格类型，不做限制，预设的标签数据库可以是根据专家经验等推理构建得来。In step S201 of some embodiments, a plurality of preset style tags can be extracted from a preset tag database, and the extracted style tags are encoded to obtain a vector representation of each style tag, thereby obtaining a plurality of style tag vectors. Among them, different style tags contain different prosody styles. For example, the style tags include various types of prosody styles such as deep voice, high voice, and rapid voice. There is no limitation. The preset label database can be constructed based on expert experience and other reasoning.

在一些实施例的步骤S202中，可以引入多头注意力机制等方式来计算每一风格标签向量和参考嵌入语音向量之间的风格相似度，利用多头注意力机制将参考嵌入语音向量投影到不同的风格标签向量上，得到参考嵌入语音向量在不同的风格标签向量上的占比情况，根据占比情况确定每一风格标签向量的风格标签权重。例如，首先可以对每一风格标签向量和参考嵌入语音向量进行矩阵相乘，再基于预设函数(例如softmax函数等)对每一风格标签向量对应的查询向量、键向量、值向量进行注意力计算，从而得到每一风格标签向量和参考嵌入语音向量之间的风格相似度。In step S202 of some embodiments, a multi-head attention mechanism can be introduced to calculate the style similarity between each style tag vector and the reference embedded speech vector, and the multi-head attention mechanism is used to project the reference embedded speech vector onto different style tag vectors to obtain the proportion of the reference embedded speech vector on different style tag vectors, and determine the style tag weight of each style tag vector according to the proportion. For example, first, matrix multiplication can be performed on each style label vector and the reference embedded speech vector, and then the attention calculation can be performed on the query vector, key vector, and value vector corresponding to each style label vector based on a preset function (such as softmax function, etc.), so as to obtain the style similarity between each style label vector and the reference embedded speech vector.

在一些实施例的步骤S203中，可以对多个风格相似度进行求和处理，得到综合风格相似度，再对每一风格标签向量的风格相似度和综合风格相似度进行相除，得到每个风格标签向量的风格标签权重。In step S203 of some embodiments, multiple style similarities can be summed to obtain a comprehensive style similarity, and then the style similarity of each style label vector and the comprehensive style similarity are divided to obtain the style label weight of each style label vector.

在一些实施例的步骤S204中，可以根据风格标签权重对风格标签向量进行加权求和，从而实现对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量。例如，风格标签向量包括风格标签向量A、风格标签向量B以及风格标签向量C,且这三个风格标签向量的风格标签权重分别为0.1、0.3和0.6，则目标风格嵌入向量M为M＝0.1*A+0.3*B+0.6*C。该目标风格嵌入向量包含有参考语音数据的韵律风格信息。In step S204 of some embodiments, the style label vectors may be weighted and summed according to the style label weights, so as to implement style labeling on the reference embedded speech vector, and obtain a target style embedding vector corresponding to the reference embedded speech vector. For example, the style label vector includes style label vector A, style label vector B and style label vector C, and the style label weights of these three style label vectors are 0.1, 0.3 and 0.6 respectively, then the target style embedding vector M is M=0.1*A+0.3*B+0.6*C. The target style embedding vector contains prosodic style information of the reference speech data.

通过上述步骤S201至步骤S204能够较为方便地确定参考嵌入语音向量在不同的风格标签向量上的占比情况，根据占比情况确定每一风格标签向量的风格标签权重，并基于不同的风格标签权重来生成参考嵌入语音向量的加权风格嵌入表示，能够采用风格标记加权表示语音韵律空间特征，实现对语音韵律的控制，使得能够将该目标风格嵌入向量用于语音合成，提高语音合成情感的准确性和语音合成效果。Through the above steps S201 to S204, it is possible to more conveniently determine the proportion of the reference embedded speech vector on different style tag vectors, determine the style tag weight of each style tag vector according to the ratio, and generate a weighted style embedding representation of the reference embedded speech vector based on the different style tag weights. The style tag weighting can be used to represent the spatial characteristics of speech prosody, and the control of speech rhythm can be realized, so that the target style embedding vector can be used for speech synthesis, improving the accuracy of speech synthesis emotion and speech synthesis effect.

请参阅图3，在一些实施例中，步骤S202可以包括但不限于包括步骤S301至步骤S302：Referring to FIG. 3, in some embodiments, step S202 may include but not limited to include steps S301 to S302:

步骤S301，对每一风格标签向量和参考嵌入语音向量进行矩阵相乘，得到每一风格标签向量对应的查询向量、键向量、值向量；Step S301, performing matrix multiplication on each style label vector and the reference embedded speech vector to obtain the query vector, key vector, and value vector corresponding to each style label vector;

步骤S302，基于预设函数对查询向量、键向量、值向量进行注意力计算，得到风格标签向量和参考嵌入语音向量之间的风格相似度。Step S302, perform attention calculation on the query vector, key vector, and value vector based on a preset function, and obtain the style similarity between the style label vector and the reference embedded speech vector.

在一些实施例的步骤S301中，可以通过多头注意力机制对每一风格标签向量和参考嵌入语音向量进行矩阵相乘，分别计算出每一风格标签向量的查询向量、键向量、值向量，键向量可以表示为K＝N*W1，值向量可以表示为V＝N*W2，查询向量可以表示为Q＝N*W3，其中，N为风格标签向量，W1、W2、W3为可训练参数。In step S301 of some embodiments, the multi-head attention mechanism can be used to perform matrix multiplication on each style label vector and the reference embedded speech vector, and calculate the query vector, key vector, and value vector of each style label vector. The key vector can be expressed as K=N*W1, the value vector can be expressed as V=N*W2, and the query vector can be expressed as Q=N*W3, where N is the style label vector, and W1, W2, and W3 are trainable parameters.

在一些实施例的步骤S302中,预设函数可以是softmax函数等激活函数，以softmax函数为例，通过softmax函数对查询向量、键向量、值向量进行注意力计算，得到风格标签向量和参考嵌入语音向量之间的风格相似度Z，风格相似度Z可以表示如公式(1)所示，其中，d是风格标签向量的特征维度，T表示对键向量K进行转置运算：In step S302 of some embodiments, the preset function can be an activation function such as a softmax function. Taking the softmax function as an example, the softmax function is used to perform attention calculation on the query vector, key vector, and value vector to obtain the style similarity Z between the style label vector and the reference embedded speech vector. The style similarity Z can be expressed as shown in formula (1), where d is the feature dimension of the style label vector, and T represents the transposition operation on the key vector K:

通过上述步骤S301至步骤S302能够较为方便地确定参考嵌入语音向量与不同的风格标签向量的接近程度，得到风格相似度，使得能够基于风格相似度来判断参考嵌入语音向量的韵律风格偏向，提高韵律风格配置的准确性。Through the above steps S301 to S302, the proximity of the reference embedded speech vector to different style label vectors can be determined more conveniently, and the style similarity can be obtained, so that the prosodic style bias of the reference embedded speech vector can be judged based on the style similarity, and the accuracy of prosodic style configuration can be improved.

请参阅图4，在一些实施例中，步骤S203可以包括但不限于包括步骤S401至步骤S402：Referring to FIG. 4, in some embodiments, step S203 may include but not limited to include steps S401 to S402:

步骤S401，对多个风格相似度进行求和处理，得到综合风格相似度；Step S401, performing a summation process on multiple style similarities to obtain a comprehensive style similarity;

步骤S402，对每一风格标签向量的风格相似度和综合风格相似度进行相除，得到风格标签向量的风格标签权重。In step S402, the style similarity of each style label vector is divided by the comprehensive style similarity to obtain the style label weight of the style label vector.

在一些实施例的步骤S401中，可以采用sum函数或者其他统计学函数或者统计工具对多个风格相似度进行求和处理，得到综合风格相似度。In step S401 of some embodiments, a sum function or other statistical functions or statistical tools may be used to sum the multiple style similarities to obtain a comprehensive style similarity.

在一些实施例的步骤S402中，可以将对每一风格标签向量的风格相似度和综合风格相似度进行相除，得到风格标签向量的风格标签权重。In step S402 of some embodiments, the style similarity of each style label vector can be divided by the comprehensive style similarity to obtain the style label weight of the style label vector.

例如，风格标签向量包括风格标签向量A、风格标签向量B以及风格标签向量C，且这三个风格标签向量与参考语音嵌入向量之间的风格相似度分别为0.37,0.66,0.8，则对多个风格相似度进行求和处理，得到综合风格相似度，即综合风格相似度为0.37+0.66+0.8＝1.83。风格标签向量A的风格标签权重为0.37/1.83＝0.2，风格标签向量B的风格标签权重为0.66/1.83＝0.36，风格标签向量C的风格标签权重为0.8/1.83＝0.44。For example, the style tag vector includes style tag vector A, style tag vector B, and style tag vector C, and the style similarities between these three style tag vectors and the reference speech embedding vector are 0.37, 0.66, and 0.8 respectively, then the multiple style similarities are summed to obtain a comprehensive style similarity, that is, the comprehensive style similarity is 0.37+0.66+0.8=1.83. The style tag weight of style tag vector A is 0.37/1.83=0.2, the style tag weight of style tag vector B is 0.66/1.83=0.36, and the style tag weight of style tag vector C is 0.8/1.83=0.44.

通过上述步骤S401至步骤S402能够较为方便地确定参考嵌入语音向量在不同的风格标签向量上的占比情况，根据占比情况确定每一风格标签向量的风格标签权重，并基于不同的风格标签权重来生成参考嵌入语音向量的加权风格嵌入表示，提高语音合成中韵律控制的准确性。Through the above steps S401 to S402, it is possible to more conveniently determine the proportion of the reference embedded speech vector on different style label vectors, determine the style label weight of each style label vector according to the proportion, and generate a weighted style embedding representation of the reference embedded speech vector based on the different style label weights, so as to improve the accuracy of prosody control in speech synthesis.

请参阅图5，在一些实施例中，步骤S105可以包括但不限于包括步骤S501至步骤S503：Referring to FIG. 5, in some embodiments, step S105 may include but not limited to include steps S501 to S503:

步骤S501，对目标风格嵌入向量和目标文本表示向量进行拼接处理，得到组合嵌入向量；Step S501, splicing the target style embedding vector and the target text representation vector to obtain a combined embedding vector;

步骤S502，基于注意力机制对组合嵌入向量进行特征对齐，得到目标音素向量；Step S502, performing feature alignment on the combined embedding vector based on the attention mechanism to obtain the target phoneme vector;

步骤S503，对目标音素向量进行解码处理，得到合成频谱数据。Step S503, decoding the target phoneme vector to obtain synthesized spectrum data.

在一些实施例的步骤S501中，在对目标风格嵌入向量和目标文本表示向量进行拼接处理时，可以是直接对目标风格嵌入向量和目标文本表示向量进行向量连接，得到组合嵌入向量。In step S501 of some embodiments, when splicing the target style embedding vector and the target text representation vector, the target style embedding vector and the target text representation vector may be directly vector-connected to obtain a combined embedding vector.

在一些实施例的步骤S502中，首先可以利用预设的持续时间预测模型对组合嵌入向量进行时间预测，得到音素持续时间。并利于注意力机制和音素持续时间来对组合嵌入向量进行音素长度调整，得到目标音素向量。需要说明的是，该过程主要是为了实现音素和梅尔倒谱帧的对齐。由于音素序列的长度通常小于其梅尔倒谱序列的长度。因此需要计算将每个音素对齐的梅尔倒谱序列的长度，该长度即为音素持续时间。基于长度调节器和音素持续时间能够较为方便地将音素序列进行平铺，使得音素序列匹配梅尔倒谱序列的长度。In step S502 of some embodiments, first, a preset duration prediction model may be used to perform time prediction on the combined embedding vector to obtain the phoneme duration. And the attention mechanism and phoneme duration are used to adjust the phoneme length of the combined embedding vector to obtain the target phoneme vector. It should be noted that this process is mainly for the alignment of phonemes and Mel cepstrum frames. Since the length of a phoneme sequence is usually smaller than the length of its Mel cepstrum sequence. Therefore, it is necessary to calculate the length of the Mel cepstrum sequence that aligns each phoneme, which is the phoneme duration. Based on the length regulator and phoneme duration, the phoneme sequence can be tiled more conveniently, so that the phoneme sequence matches the length of the Mel cepstrum sequence.

在另一些实施例中，还可以等比例地延长或者缩短音素持续时间，来实现语音合成过程中对合成语音的声音速度的控制。或者，还可以通过调整组合嵌入向量的空格字符的持续时间来控制合成语音中的单词之间的停顿时长，从而调整合成语音的韵律。In some other embodiments, the phoneme duration can also be extended or shortened proportionally to realize the control of the sound speed of the synthesized speech during the speech synthesis process. Alternatively, the prosody of the synthesized speech can also be adjusted by adjusting the duration of the space characters that combine the embedding vectors to control the length of pauses between words in the synthesized speech.

在一些实施例的步骤S503中，可以通过解码器对目标音素向量进行解码处理，将目标音素向量转换为梅尔倒谱序列的形式，得到合成频谱数据。In step S503 of some embodiments, the target phoneme vector may be decoded by a decoder to convert the target phoneme vector into a form of Mel cepstrum sequence to obtain synthesized spectrum data.

通过上述步骤S501至步骤S503能够基于目标文本数据的文本内容特征和参考语音数据的韵律风格特征进行语音合成，使得合成频谱数据中含有符合需求的文本内容信息和韵律风格特征，能够较好地提高合成频谱数据的数据质量，从而合成包含目标韵律的语音数据，提高了语音合成的准确性。Through the above steps S501 to S503, speech synthesis can be performed based on the text content features of the target text data and the prosodic style features of the reference speech data, so that the synthesized spectral data contains text content information and prosodic style features that meet the requirements, which can better improve the data quality of the synthesized spectral data, thereby synthesizing speech data containing the target prosody, and improving the accuracy of speech synthesis.

请参阅图6，在一些实施例，步骤S502包括但不限于包括步骤S601至步骤S602：Referring to FIG. 6, in some embodiments, step S502 includes but is not limited to steps S601 to S602:

步骤S601，基于预设的持续时间预测模型对组合嵌入向量进行时间预测，得到音素持续时间；Step S601, performing time prediction on the combined embedding vector based on the preset duration prediction model to obtain the phoneme duration;

步骤S602，基于注意力机制和音素持续时间对组合嵌入向量进行音素长度调整，得到目标音素向量。Step S602, based on the attention mechanism and the duration of the phoneme, the phoneme length is adjusted on the combined embedding vector to obtain the target phoneme vector.

在一些实施例的步骤S601中，预设的持续时间预测模型包括一个两层的一维卷积网络和一个线性层，一维卷积网络用于提取组合嵌入向量中的频谱时间特征信息，线性层主要用于输出标量来预测音素的持续时间。该持续时间预测模型可以使用均方误差函数(MSE)作为损失函数。利用持续时间预测模型的一维卷积网络对组合嵌入向量中的频谱时间特征信息，得到频谱时间特征，再利用线性层中的预测函数(例如softmax函数、sigmiod函数等等)对频谱时间特征进行预测，得到该频谱时间特征对应的时间长度，将该时间长度作为音素持续时间。In step S601 of some embodiments, the preset duration prediction model includes a two-layer one-dimensional convolutional network and a linear layer, the one-dimensional convolutional network is used to extract the spectrum-time feature information in the combined embedded vector, and the linear layer is mainly used to output scalars to predict the duration of phonemes. The duration prediction model may use a mean square error function (MSE) as a loss function. The one-dimensional convolutional network of the duration prediction model is used to combine the spectral time feature information embedded in the vector to obtain the spectral time feature, and then use the prediction function (such as softmax function, sigmiod function, etc.) in the linear layer to predict the spectral time feature to obtain the corresponding time length of the spectral time feature, and use the time length as the phoneme duration.

在一些实施例的步骤S602中，基于注意力机制和音素持续时间对组合嵌入向量进行音素长度调整时，将音素持续时间作为组合嵌入向量对齐梅尔倒谱序列的长度，利用注意力机制对组合嵌入向量中的音素序列进行音素平铺，使得组合嵌入向量的音素序列能够与梅尔倒谱序列的长度一致，得到目标音素向量。In step S602 of some embodiments, when adjusting the phoneme length of the combined embedding vector based on the attention mechanism and phoneme duration, the phoneme duration is used as the length of the combined embedding vector to align the Mel cepstrum sequence, and the attention mechanism is used to perform phoneme tiling on the phoneme sequence in the combined embedding vector, so that the phoneme sequence of the combined embedding vector can be consistent with the length of the Mel cepstrum sequence, and the target phoneme vector is obtained.

通过上述步骤S601至步骤S602能够较为方便地实现组合嵌入向量的音素和帧的对齐，使得组合嵌入向量的音素序列能够匹配待合成的梅尔倒谱序列的长度，提高合成语音的韵律合理性以及合成语音的语音质量。Through the above steps S601 to S602, the phoneme and frame alignment of the combined embedding vector can be realized more conveniently, so that the phoneme sequence of the combined embedding vector can match the length of the Mel cepstrum sequence to be synthesized, and the prosodic rationality of the synthesized speech and the speech quality of the synthesized speech can be improved.

请参阅图7，在一些实施例中，步骤S106可以包括但不限于包括步骤S701至步骤S702：Referring to FIG. 7, in some embodiments, step S106 may include but not limited to include steps S701 to S702:

步骤S701，将合成频谱数据输入至预设的声码器中，其中，声码器包括反卷积层和多感受野融合层；Step S701, input the synthesized spectral data into a preset vocoder, wherein the vocoder includes a deconvolution layer and a multi-receptive field fusion layer;

步骤S702，基于反卷积层对合成频谱数据进行上采样处理，得到目标频谱数据；Step S702, performing upsampling processing on the synthesized spectrum data based on the deconvolution layer to obtain the target spectrum data;

步骤S703，基于多感受野融合层对目标频谱数据进行多尺度特征融合，得到合成语音数据。Step S703, performing multi-scale feature fusion on the target spectrum data based on the multi-receptive field fusion layer to obtain synthesized speech data.

在一些实施例的步骤S701中，可以利用预设的计算机程序或者脚本程序将合成频谱数据输入至预设的声码器中，其中，该声码器可以是HiFi-Gan或者MelGan等等，不做限制，该声码器包括反卷积层和多感受野融合层，该声码器用于将梅尔倒谱序列形式的合成频谱数据转换为波形形式的合成语音数据。In step S701 of some embodiments, a preset computer program or script program can be used to input the synthesized spectral data into a preset vocoder, wherein the vocoder can be HiFi-Gan or MelGan, etc., without limitation, the vocoder includes a deconvolution layer and a multi-receptive field fusion layer, and the vocoder is used to convert the synthesized spectrum data in the form of a Mel cepstrum sequence into synthesized voice data in the form of a waveform.

在一些实施例的步骤S702中，基于反卷积层对合成频谱数据进行上采样处理，实现对合成频谱数据的卷积转置，得到频谱特征内容更为丰富的目标频谱数据。In step S702 of some embodiments, upsampling processing is performed on the synthesized spectrum data based on the deconvolution layer to implement convolution and transposition of the synthesized spectrum data, and obtain target spectrum data with richer spectrum feature content.

在一些实施例的步骤S703中，多感受野融合层包含多个残差块，基于多感受野融合层对目标频谱数据进行多尺度特征融合时，可以利用每个残差块对目标频谱数据进行特征重构，得到多个尺度的语音波形特征，将所有尺度的语音波形特征进行融合，得到合成语音数据。In step S703 of some embodiments, the multi-receptive field fusion layer includes a plurality of residual blocks. When performing multi-scale feature fusion on the target spectral data based on the multi-receptive field fusion layer, each residual block can be used to reconstruct the features of the target spectral data to obtain speech waveform features of multiple scales, and fuse the speech waveform features of all scales to obtain synthesized speech data.

在一个具体示例中，合成语音数据是包含某个动画人物的说话风格、说话情感的、关于保险产品、理财产品的描述语音。对话机器人能通过合成语音数据以动画人物的特有说话风格和说话情感来吸引潜在对象，使潜在对象更感兴趣于合成语音数据所推荐的保险产品或者理财产品。In a specific example, the synthesized voice data is a description voice about insurance products and wealth management products, including the speaking style and emotion of a certain animation character. Dialogue robots can attract potential objects with the unique speaking style and emotion of animated characters by synthesizing voice data, making potential objects more interested in the insurance products or financial products recommended by the synthetic voice data.

通过上述步骤S701至步骤S703能够使得合成的目标语音数据同时包含目标文本数据的文本内容特征和参考语音数据的韵律特征，从而有效地提高语音合成的准确性。Through the above steps S701 to S703, the synthesized target speech data can simultaneously contain the text content features of the target text data and the prosody features of the reference speech data, thereby effectively improving the accuracy of speech synthesis.

本申请实施例的语音合成方法，其通过获取目标文本数据和参考语音数据；对参考语音数据进行向量化处理，得到参考嵌入语音向量；对目标文本数据进行特征提取，得到目标文本表示向量，能够得到表征文本语义内容的目标文本表示向量，使得能够利用目标文本表示向量进行语音合成。进一步地，对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量，能够实现对语音韵律风格的控制，使得能够将该目标风格嵌入向量用于语音合成，提高语音合成情感的准确性和语音合成效果。进一步地，基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据，能够较好地提高合成频谱数据的数据质量。最后，对合成频谱数据进行频谱转换，得到合成语音数据，能够较为方便地得到波形形式的合成语音数据，该合成语音数据同时包含目标文本数据的文本内容特征和参考语音数据的风格特征，具备较好地情感表达能力，提高了语音合成的准确性，进而使得在保险产品、理财产品等智能对话的过程中，对话机器人表达的合成语音能够更贴合对话对象的对话风格偏好，通过采用对话对象更感兴趣的对话方式和对话风格进行会话交流，提高对话质量和对话有效性，能实现智能语音对话服务，提高客户的服务质量以及客户满意度，提高保险产品、理财产品的成交率。The speech synthesis method of the embodiment of the present application obtains target text data and reference speech data; vectorizes the reference speech data to obtain a reference embedded speech vector; performs feature extraction on the target text data to obtain a target text representation vector, and can obtain a target text representation vector representing the semantic content of the text, so that speech synthesis can be performed using the target text representation vector. Furthermore, the reference embedded speech vector is style-marked to obtain the target style embedding vector corresponding to the reference embedded speech vector, which can realize the control of speech prosodic style, so that the target style embedding vector can be used for speech synthesis, and improve the accuracy of speech synthesis emotion and speech synthesis effect. Furthermore, speech synthesis is performed based on the target style embedding vector and the target text representation vector to obtain synthetic spectral data, which can better improve the data quality of the synthetic spectral data. Finally, spectrum conversion is performed on the synthesized spectrum data to obtain synthesized speech data, and the synthesized speech data in the form of a waveform can be obtained more conveniently. The synthesized speech data contains both the text content characteristics of the target text data and the style characteristics of the reference speech data. It has better emotional expression ability and improves the accuracy of speech synthesis. In turn, in the process of intelligent dialogues such as insurance products and wealth management products, the synthetic speech expressed by the dialogue robot can be more suitable for the dialogue style preference of the dialogue object. By using the dialogue style and dialogue style that the dialogue object is more interested in for conversation communication, the dialogue quality and effectiveness can be improved, and intelligent voice dialogue can be realized. services, improve customer service quality and customer satisfaction, and increase the transaction rate of insurance products and wealth management products.

请参阅图8，本申请实施例还提供一种语音合成装置，可以实现上述语音合成方法，该装置包括：Please refer to FIG. 8, the embodiment of the present application also provides a speech synthesis device, which can realize the above speech synthesis method, and the device includes:

数据获取模块801，用于获取目标文本数据和参考语音数据；Data acquiring module 801, for acquiring target text data and reference voice data;

向量化模块802，用于对参考语音数据进行向量化处理，得到参考嵌入语音向量；A vectorization module 802, configured to vectorize the reference voice data to obtain a reference embedded voice vector;

特征提取模块803，用于对目标文本数据进行特征提取，得到目标文本表示向量；Feature extraction module 803, is used for carrying out feature extraction to target text data, obtains target text representation vector;

风格标记模块804，用于对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量；The style marking module 804 is used to carry out style marking on the reference embedded speech vector, and obtain the target style embedding vector corresponding to the reference embedded speech vector;

语音合成模块805，用于基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据；Speech synthesis module 805, for carrying out speech synthesis based on the target style embedding vector and the target text representation vector, to obtain synthetic spectral data;

频谱转换模块806，用于对合成频谱数据进行频谱转换，得到合成语音数据。The spectrum conversion module 806 is configured to perform spectrum conversion on the synthesized spectrum data to obtain synthesized speech data.

该语音合成装置的具体实施方式与上述语音合成方法的具体实施例基本相同，在此不再赘述。The specific implementation manner of the speech synthesis device is basically the same as the specific embodiment of the above speech synthesis method, and will not be repeated here.

本申请实施例还提供了一种电子设备，电子设备包括：存储器、处理器、存储在存储器上并可在处理器上运行的程序以及用于实现处理器和存储器之间的连接通信的数据总线，程序被处理器执行时实现上述语音合成方法。该电子设备可以为包括平板电脑、车载电脑等任意智能终端。The embodiment of the present application also provides an electronic device, the electronic device includes: a memory, a processor, a program stored in the memory and operable on the processor, and a data bus for realizing connection and communication between the processor and the memory, and the above speech synthesis method is implemented when the program is executed by the processor. The electronic device may be any intelligent terminal including a tablet computer, a vehicle-mounted computer, and the like.

请参阅图9，图9示意了另一实施例的电子设备的硬件结构，电子设备包括：Please refer to FIG. 9. FIG. 9 illustrates a hardware structure of an electronic device in another embodiment. The electronic device includes:

处理器901，可以采用通用的CPU(CentralProcessingUnit，中央处理器)、微处理器、应用专用集成电路(ApplicationSpecificIntegratedCircuit，ASIC)、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本申请实施例所提供的技术方案；The processor 901 may be implemented by a general-purpose CPU (Central Processing Unit, central processing unit), microprocessor, application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solutions provided by the embodiments of the present application;

存储器902，可以采用只读存储器(ReadOnlyMemory，ROM)、静态存储设备、动态存储设备或者随机存取存储器(RandomAccessMemory，RAM)等形式实现。存储器902可以存储操作系统和其他应用程序，在通过软件或者固件来实现本说明书实施例所提供的技术方案时，相关的程序代码保存在存储器902中，并由处理器901来调用执行本申请实施例的语音合成方法；The memory 902 may be implemented in the form of a read-only memory (ReadOnlyMemory, ROM), a static storage device, a dynamic storage device, or a random access memory (RandomAccessMemory, RAM). The memory 902 can store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, the relevant program codes are stored in the memory 902, and the processor 901 invokes and executes the speech synthesis method of the embodiments of the present application;

输入/输出接口903，用于实现信息输入及输出；The input/output interface 903 is used to realize information input and output;

通信接口904，用于实现本设备与其他设备的通信交互，可以通过有线方式(例如USB、网线等)实现通信，也可以通过无线方式(例如移动网络、WI F I、蓝牙等)实现通信；The communication interface 904 is used to realize the communication and interaction between the device and other devices, and the communication can be realized through a wired method (such as USB, network cable, etc.), or can be realized through a wireless method (such as a mobile network, WIFI, Bluetooth, etc.);

总线905，在设备的各个组件(例如处理器901、存储器902、输入/输出接口903和通信接口904)之间传输信息；bus 905, for transferring information between various components of the device (such as processor 901, memory 902, input/output interface 903 and communication interface 904);

其中处理器901、存储器902、输入/输出接口903和通信接口904通过总线905实现彼此之间在设备内部的通信连接。The processor 901 , the memory 902 , the input/output interface 903 and the communication interface 904 are connected to each other within the device through the bus 905 .

本申请实施例还提供了一种计算机可读存储介质，计算机可读存储介质存储有一个或者多个程序，一个或者多个程序可被一个或者多个处理器执行，以实现上述语音合成方法。The embodiment of the present application also provides a computer-readable storage medium, where one or more programs are stored in the computer-readable storage medium, and the one or more programs can be executed by one or more processors, so as to implement the above speech synthesis method.

存储器作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外，存储器可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中，存储器可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

本申请实施例提供的语音合成方法、语音合成装置、电子设备及计算机可读存储介质，其通过获取目标文本数据和参考语音数据；对参考语音数据进行向量化处理，得到参考嵌入语音向量；对目标文本数据进行特征提取，得到目标文本表示向量，能够得到表征文本语义内容的目标文本表示向量，使得能够利用目标文本表示向量进行语音合成。进一步地，对参考嵌入语音向量进行风格标记，得到参考嵌入语音向量对应的目标风格嵌入向量，能够实现对语音韵律风格的控制，使得能够将该目标风格嵌入向量用于语音合成，提高语音合成情感的准确性和语音合成效果。进一步地，基于目标风格嵌入向量和目标文本表示向量进行语音合成，得到合成频谱数据，能够较好地提高合成频谱数据的数据质量。最后，对合成频谱数据进行频谱转换，得到合成语音数据，能够较为方便地得到波形形式的合成语音数据，该合成语音数据同时包含目标文本数据的文本内容特征和参考语音数据的风格特征，具备较好地情感表达能力，提高了语音合成的准确性，进而使得在保险产品、理财产品等智能对话的过程中，对话机器人表达的合成语音能够更贴合对话对象的对话风格偏好，通过采用对话对象更感兴趣的对话方式和对话风格进行会话交流，提高对话质量和对话有效性，能实现智能语音对话服务，提高客户的服务质量以及客户满意度，提高保险产品、理财产品的成交率。The speech synthesis method, speech synthesis device, electronic equipment, and computer-readable storage medium provided by the embodiments of the present application obtain target text data and reference speech data; perform vectorization processing on the reference speech data to obtain reference embedded speech vectors; perform feature extraction on the target text data to obtain target text representation vectors, and can obtain target text representation vectors representing the semantic content of the text, so that speech synthesis can be performed using the target text representation vectors. Furthermore, the reference embedded speech vector is style-marked to obtain the target style embedding vector corresponding to the reference embedded speech vector, which can realize the control of speech prosodic style, so that the target style embedding vector can be used for speech synthesis, and improve the accuracy of speech synthesis emotion and speech synthesis effect. Furthermore, speech synthesis is performed based on the target style embedding vector and the target text representation vector to obtain synthetic spectral data, which can better improve the data quality of the synthetic spectral data. Finally, spectrum conversion is performed on the synthesized spectrum data to obtain synthesized speech data, and the synthesized speech data in the form of a waveform can be obtained more conveniently. The synthesized speech data contains both the text content characteristics of the target text data and the style characteristics of the reference speech data. It has better emotional expression ability and improves the accuracy of speech synthesis. In turn, in the process of intelligent dialogues such as insurance products and wealth management products, the synthetic speech expressed by the dialogue robot can be more suitable for the dialogue style preference of the dialogue object. By using the dialogue style and dialogue style that the dialogue object is more interested in for conversation communication, the dialogue quality and effectiveness can be improved, and intelligent voice dialogue can be realized. services, improve customer service quality and customer satisfaction, and increase the transaction rate of insurance products and wealth management products.

本申请实施例描述的实施例是为了更加清楚的说明本申请实施例的技术方案，并不构成对于本申请实施例提供的技术方案的限定，本领域技术人员可知，随着技术的演变和新应用场景的出现，本申请实施例提供的技术方案对于类似的技术问题，同样适用。The embodiments described in the embodiments of the present application are to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation to the technical solutions provided by the embodiments of the present application. Those skilled in the art know that with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are also applicable to similar technical problems.

本领域技术人员可以理解的是，图1-7中示出的技术方案并不构成对本申请实施例的限定，可以包括比图示更多或更少的步骤，或者组合某些步骤，或者不同的步骤。Those skilled in the art can understand that the technical solutions shown in FIGS. 1-7 do not limit the embodiments of the present application, and may include more or fewer steps than those shown in the illustrations, or combine some steps, or different steps.

以上所描述的装置实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

本领域普通技术人员可以理解，上文中所公开方法中的全部或某些步骤、系统、设备中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, the functional modules/units in the system, and the device can be implemented as software, firmware, hardware, and an appropriate combination thereof.

本申请的说明书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便这里描述的本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外，术语“包括”和“具有”以及他们的任何变形，意图在于覆盖不排他的包含，例如，包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。The terms "first", "second", "third", "fourth" and the like (if any) in the description of the present application and the above drawings are used to distinguish similar objects and not necessarily to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion, for example, a process, method, system, product or device comprising a series of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to the process, method, product or device.

应当理解，在本申请中，“至少一个(项)”是指一个或者多个，“多个”是指两个或两个以上。“和/或”，用于描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：只存在A，只存在B以及同时存在A和B三种情况，其中A，B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达，是指这些项中的任意组合，包括单项(个)或复数项(个)的任意组合。例如，a，b或c中的至少一项(个)，可以表示：a，b，c，“a和b”，“a和c”，“b和c”，或“a和b和c”，其中a，b，c可以是单个，也可以是多个。It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" is used to describe the association relationship of associated objects, indicating that there can be three types of relationships, for example, "A and/or B" can mean: there are only A, only B, and both A and B, where A and B can be singular or plural. The character "/" generally indicates that the contextual objects are an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one item (piece) of a, b or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c can be single or multiple.

在本申请所提供的几个实施例中，应该理解到，所揭露的装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，上述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the above units is only a logical function division. There may be other division methods in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

上述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括多指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例的方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(Read-Only Memory，简称ROM)、随机存取存储器(Random Access Memory，简称RAM)、磁碟或者光盘等各种可以存储程序的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, the computer software product is stored in a storage medium, and includes multiple instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method in each embodiment of the application. The above-mentioned storage medium includes: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM for short), random access memory (Random Access Memory, RAM for short), magnetic disk or optical disk, and various media that can store programs.

以上参照附图说明了本申请实施例的优选实施例，并非因此局限本申请实施例的权利范围。本领域技术人员不脱离本申请实施例的范围和实质内所作的任何修改、等同替换和改进，均应在本申请实施例的权利范围之内。The preferred embodiments of the embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims

1. A method of speech synthesis, the method comprising:

acquiring target text data and reference voice data;

vectorizing the reference voice data to obtain a reference embedded voice vector;

extracting features of the target text data to obtain a target text representation vector;

performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;

performing voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;

and performing frequency spectrum conversion on the synthesized frequency spectrum data to obtain synthesized voice data.

2. The method of claim 1, wherein the performing style marking on the reference embedded speech vector to obtain a target style embedded vector corresponding to the reference embedded speech vector comprises:

acquiring a plurality of preset style tag vectors, wherein the style tag vectors are vector representations corresponding to the preset style tags;

performing attention calculation on each style tag vector and each reference embedded voice vector to obtain style similarity between each style tag vector and each reference embedded voice vector;

Based on the style similarity, style tag weight of each style tag vector is obtained;

and carrying out weighted summation on the style tag vector based on the style tag weight to obtain the target style embedded vector.

3. The method of speech synthesis according to claim 2, wherein performing attention computation on each of the style tag vectors and the reference embedded speech vectors to obtain style similarities between each of the style tag vectors and the reference embedded speech vectors comprises:

performing matrix multiplication on each style tag vector and the reference embedded voice vector to obtain a query vector, a key vector and a value vector corresponding to each style tag vector;

and performing attention calculation on the query vector, the key vector and the value vector based on a preset function to obtain the style similarity between the style tag vector and the reference embedded voice vector.

4. The method of claim 2, wherein the deriving style tag weights for each of the style tag vectors based on the plurality of style similarities comprises:

Summing the style similarities to obtain comprehensive style similarity;

and dividing the style similarity of each style tag vector and the comprehensive style similarity to obtain the style tag weight of the style tag vector.

5. The method according to claim 1, wherein the performing speech synthesis based on the target style embedding vector and the target text representation vector to obtain synthesized spectrum data includes:

performing splicing processing on the target style embedded vector and the target text representation vector to obtain a combined embedded vector;

performing feature alignment on the combined embedded vector based on an attention mechanism to obtain a target phoneme vector;

and decoding the target phoneme vector to obtain the synthesized spectrum data.

6. The method of claim 5, wherein feature alignment of the combined embedded vectors based on an attention mechanism to obtain a target phoneme vector comprises:

performing time prediction on the combined embedded vector based on a preset duration prediction model to obtain a phoneme duration;

And carrying out phoneme length adjustment on the combined embedded vector based on the attention mechanism and the phoneme duration to obtain the target phoneme vector.

7. The method according to any one of claims 1 to 6, wherein the performing spectral conversion on the synthesized spectral data to obtain synthesized speech data includes:

inputting the synthesized spectrum data into a preset vocoder, wherein the vocoder comprises a deconvolution layer and a multi-receptive field fusion layer;

up-sampling the synthesized spectrum data based on the deconvolution layer to obtain target spectrum data;

and carrying out multi-scale feature fusion on the target frequency spectrum data based on the multi-receptive field fusion layer to obtain the synthesized voice data.

8. A speech synthesis apparatus, the apparatus comprising:

the data acquisition module is used for acquiring target text data and reference voice data;

the vectorization module is used for vectorizing the reference voice data to obtain a reference embedded voice vector;

the feature extraction module is used for extracting features of the target text data to obtain a target text expression vector;

The style marking module is used for performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;

the voice synthesis module is used for carrying out voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;

and the spectrum conversion module is used for performing spectrum conversion on the synthesized spectrum data to obtain synthesized voice data.

9. An electronic device comprising a memory storing a computer program and a processor implementing the speech synthesis method according to any of claims 1 to 7 when the computer program is executed by the processor.

10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method of any one of claims 1 to 7.