CN102982809A - Conversion method for sound of speaker - Google Patents
Conversion method for sound of speaker
- Publication number
- CN102982809A CN2012105286294A CN201210528629A
- Authority
- CN
- China
- Prior art keywords
- speaker
- features
- voice
- training
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a speaker voice conversion method comprising a training stage and a conversion stage. The training stage includes: extracting fundamental frequency features, speaker features and content features from the training speech signals of the source speaker and the target speaker; constructing a fundamental frequency conversion function from the fundamental frequency features; and constructing a speaker conversion function from the speaker features. The conversion stage includes: extracting fundamental frequency features and spectral features from the source speaker's speech signal to be converted; converting the fundamental frequency features and speaker features extracted from that signal with the fundamental frequency conversion function and the speaker conversion function obtained in the training stage; and synthesizing the target speaker's speech from the converted fundamental frequency features, the converted speaker features and the content features of the speech signal to be converted. The method is easy to implement, and the converted speech has high quality and similarity.
Description
Technical Field
The invention belongs to the technical field of signal processing. It specifically relates to converting one speaker's speech signal, without changing the content information in the signal, into a speech signal that can be perceived as another speaker's, and in particular to a speaker voice conversion method that separates the speaker information and the content information in a speech signal.
Background Art
In today's information age, human-computer interaction has long been a research hotspot in computing, and an efficient, intelligent human-computer interaction environment has become an urgent need for the application and development of information technology. Speech is one of the most important and convenient means of human communication, and voice interaction is the most "friendly" form of interaction. Human-machine speech dialogue technology based on speech recognition, speech synthesis and natural language understanding is widely recognized as a difficult and challenging high-tech field, but its application prospects are very bright.
As one of the core technologies of human-computer interaction, speech synthesis has made great progress in both technology and application in recent years. Speech synthesized by systems based on large corpora now achieves good quality and naturalness, which has raised demand for more diverse speech synthesis: multiple speakers, multiple speaking styles, multiple emotions and multiple languages. Most existing speech synthesis systems, however, are monolithic: a system generally includes only one or two speakers, adopts a read-aloud or news-broadcast style, and targets a specific language. Such uniform synthetic speech greatly limits the practical application of speech synthesis systems in areas such as education, entertainment and toys. For this reason, research on diverse speech synthesis has gradually become one of the mainstream directions in recent speech synthesis research.
The most direct way to realize a speech synthesis system with multiple speakers, styles and emotions is to record a voice corpus for each speaker and each style and to build a separate personalized synthesis system for each of them. Because producing a dedicated speech corpus for every speaker, style and emotion requires far too much work, this approach is not feasible in practice. Against this background, speaker voice conversion technology was proposed. Speaker voice conversion attempts to transform the speech of one person (the source speaker), by adjusting the fundamental frequency, duration, spectral parameters and other parameters that carry speaker-specific information, so that it sounds as if it were spoken by another person (the target speaker), while keeping the meaning expressed by the source speaker unchanged. By recording only a small amount of speech for training, speaker voice conversion adjusts the source speaker's speech to obtain synthetic speech of the target speaker and thus quickly realizes a personalized speech synthesis system.
In implementing a speaker voice conversion system, the main challenges are the similarity and the quality of the converted speech. The currently mainstream approach, speaker voice conversion based on a joint-space Gaussian mixture model, is relatively robust and generalizable because it uses a statistical modeling framework. However, it is essentially a generic feature-mapping method from machine learning that does not exploit the particular property of speech signals that speaker information and content information coexist, and statistical modeling brings many problems, such as dependence on the amount of data, insufficient modeling accuracy, and the damage that statistical models do to the original information in the acoustic parameters, all of which sharply degrade the converted speech. Another mainstream technique, formant-based spectral warping, exploits the speaker's formant structure, which mainly reflects speaker information, and preserves as much detail of the speech signal as possible during conversion, which protects the quality of the converted speech; but because formants are difficult to extract and model, methods of this type require a great deal of manual intervention and are not very robust.
In general, traditional speaker voice conversion methods lack an effective representation and effective modeling of the speaker-specific information in the speech signal and place high demands on the modeling data, and the conversion functions they construct often also transform the content of the speech signal; as a result, the quality and similarity of the converted speech are not yet satisfactory.
Summary of the Invention
(1) Technical Problem to Be Solved
The technical problem to be solved by the invention is that existing speaker voice conversion methods produce speech of poor quality and low similarity.
(2) Technical Solution
The present invention proposes a speaker voice conversion method for converting the speech signal of an utterance spoken by a source speaker so that the converted speech sounds as if it were spoken by a target speaker different from the source speaker. The method comprises a training stage and a conversion stage.
The training stage includes:
Step A1: extracting fundamental frequency features and spectral features from the training speech signals of the source speaker and the target speaker respectively, the spectral features comprising speaker features and content features;
Step A2: constructing, from the fundamental frequency features of the training speech signals of the source speaker and the target speaker, a fundamental frequency conversion function from the source speaker's speech to the target speaker's speech;
Step A3: constructing a speaker conversion function from the speaker features of the source speaker and the target speaker extracted in step A1.
The conversion stage includes:
Step B1: extracting fundamental frequency features and spectral features from the source speaker's speech signal to be converted, the spectral features comprising speaker features and content features;
Step B2: converting the fundamental frequency features and the speaker features extracted from the speech signal to be converted in step B1, using the fundamental frequency conversion function and the speaker conversion function obtained in the training stage, to obtain converted fundamental frequency features and converted speaker features;
Step B3: synthesizing the speech of the target speaker from the converted fundamental frequency features and speaker features obtained in step B2 and the content features of the speech signal to be converted extracted in step B1.
According to a specific embodiment of the present invention, the method for extracting the fundamental frequency features and spectral features of a speech signal in steps A1 and B1 includes:
Step a1: based on the source-filter structure of the speech signal, segmenting the speech signal into segments of 20-30 ms, treating each segment as a frame, and extracting fundamental frequency and spectral parameters from each frame of the speech signal;
Step a2: using a neural network to separate the speaker features and the content features in the spectral parameters. The neural network adopts a vertically symmetric multi-layer structure with 2K-1 layers in total (K being a natural number), comprising: a lowest layer serving as the input layer, into which the acoustic features to be separated are fed; a top layer serving as the output layer, which outputs the reconstructed acoustic features; and 2K-3 hidden layers in between, each with several nodes that simulate the processing of neural units. The layers from the input layer up to the K-th hidden layer (counted from the bottom) form the encoding network, which extracts high-level information from the input acoustic features; the K-th hidden layer from the bottom is the coding layer, whose nodes are divided into two parts, one related to the speaker and the other related to the content, with outputs corresponding to the speaker features and the content features respectively; the hidden layers above the K-th hidden layer form the decoding network, which reconstructs the acoustic spectral parameters from the high-level speaker features and content features.
According to a specific embodiment of the present invention, step a2 includes training the neural network on a speech signal database so that it acquires the ability to extract and separate speaker features and content features from acoustic features. Training the neural network includes:
Step b1: initializing the network weights of the neural network through pre-training;
Step b2: for the output feature of each node of the coding layer of the neural network, using a discriminative criterion to measure its discriminability between different speakers and between different contents, taking nodes with large inter-speaker discriminability but small inter-content discriminability as speaker-related nodes and the remaining nodes as content-related nodes;
Step b3: designing a specific discriminative objective function to fine-tune the weights of the neural network so that it becomes able to separate speaker information and content information from acoustic features.
According to a specific embodiment of the present invention, the speech signal database is constructed through the following steps:
Step c1: building a corpus containing multiple sentences;
Step c2: recording the speech signals of multiple speakers reading the sentences in the corpus, constructing a speech signal database, and preprocessing the speech signals in the database to remove abnormal portions of the signals;
Step c3: using a hidden Markov model to segment the speech signals in the preprocessed speech signal database, treating each segment as a frame, and thereby obtaining frame-level speaker annotation information and content annotation information for each speech signal;
Step c4: randomly combining the speech signals in the speech database to construct the training data of the neural network.
(3) Beneficial Effects
The speaker voice conversion method of the present invention has the following advantages:
1. The invention is the first to propose using a deep neural network to separate speaker information and content information in speech signals, in order to meet the needs of different speech signal processing tasks such as speech recognition, speaker recognition and speaker conversion.
2. When converting a speaker's voice, the invention considers only the speaker factor and excludes the interference of the content factor, which makes speaker voice conversion easier to realize and greatly improves the quality and similarity of the converted speech.
3. The separator used in the invention needs to be trained only once; once trained, it can extract speaker features and content features from the speech of any speaker, so one training serves many uses and the model does not need to be retrained.
Brief Description of the Drawings
Fig. 1 is a flowchart of the speaker voice conversion method of the present invention;
Fig. 2 is a block diagram of the feature extraction step of the present invention;
Fig. 3 is a schematic diagram of the neural network structure used for feature separation in the present invention;
Fig. 4 is a flowchart of the neural network training of the present invention;
Fig. 5 is a flowchart of the database construction in the present invention;
Fig. 6 is a schematic diagram of the discriminability of cepstral features between different speakers and between different contents in the present invention;
Fig. 7 is a schematic diagram of the discriminability of the extracted speaker features and content features between different speakers and between different contents in the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
From a physiological point of view, previous work has confirmed that when the human brain perceives a speech signal, the perception of speaker information and the perception of speech content take place in different regions of the cerebral cortex. This shows that the brain decomposes speaker and content information at a high level and that the information in a speech signal is separable. Separating speaker information from content information is of great significance for speech signal processing, since the separated information can be used for speaker recognition, speech recognition and other targeted applications.
The present invention starts from the essence of speaker voice conversion: keeping the content of what the speaker says unchanged while changing only the information about who is speaking. Based on this consideration, the information in the speech signal is separated into speaker features and content features so that the speaker component can be manipulated. In the present invention, "speaker features" refer to features that reflect the characteristics of a speaker and distinguish different speakers, and "content features" refer to features that reflect the meaning the speech signal is intended to express.
To this end, the invention uses a technique based on a deep neural network to decompose the acoustic features of the speech signal at a high level into speaker features and content features, so that speaker voice conversion can be realized more simply and effectively and the converted speech achieves greatly improved quality and similarity.
Fig. 1 is a flowchart of the speaker voice conversion method of the present invention. As shown in the figure, the method comprises two stages overall, a training stage and a conversion stage, which are introduced in turn below.
(1) Training Stage
The training stage mainly includes three steps:
Step A1: feature extraction.
This step extracts features from the training speech signals of the source speaker and the target speaker respectively. The features include fundamental frequency features and spectral features; in the present invention the spectral features are divided into speaker features and content features.
Step A2: fundamental frequency conversion function training.
This step constructs a fundamental frequency conversion function from the source speaker's speech to the target speaker's speech according to the fundamental frequency features of their training speech signals.
According to a specific embodiment, this step computes the mean and variance of the log-domain distribution of the fundamental frequency features of the source and target speakers' training speech signals, and constructs the fundamental frequency conversion function from the source speaker's speech to the target speaker's speech from these statistics.
Because each speaker's fundamental frequency parameters follow a Gaussian distribution in the logarithmic domain, fundamental frequency conversion in the present invention is preferably performed with a simple linear transformation in the logarithmic domain only.
Step A3: spectral conversion function training.
This step constructs a speaker conversion function from the speaker features contained in the spectral features extracted from the training speech signals of the source speaker and the target speaker.
As noted above, speaker conversion requires keeping the spoken content unchanged and changing only the speaker information. The present invention therefore only needs to train a conversion function for the speaker features (the speaker conversion function).
When recording the speech of the source and target speakers, different speakers cannot be expected to produce the same sentence with exactly the same duration, so some normalization is needed to warp sentences of different lengths onto a common time axis before supervised feature conversion learning (feature alignment). The present invention uses the dynamic time warping algorithm for this temporal alignment, as sketched below; the speaker feature conversion itself can be modeled with a linear regression model, a joint-space Gaussian mixture model or similar methods.
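As a concrete illustration of the alignment step, the sketch below implements a plain dynamic time warping pass over two speaker-feature sequences in Python; the frame-wise Euclidean local cost and the back-tracking rule are assumptions made for illustration, not the exact alignment settings of the embodiment.

```python
import numpy as np

def dtw_align(X, Y):
    """Align two feature sequences X (Tx, D) and Y (Ty, D) with dynamic time warping.

    Returns index pairs (i, j) so that X[i] and Y[j] form time-aligned training pairs.
    """
    Tx, Ty = len(X), len(Y)
    # Frame-wise Euclidean distance matrix (assumed local cost).
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    acc = np.full((Tx, Ty), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # insertion
                acc[i, j - 1] if j > 0 else np.inf,                 # deletion
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # match
            )
            acc[i, j] = cost[i, j] + prev
    # Back-track from the end to recover the warping path.
    path, i, j = [(Tx - 1, Ty - 1)], Tx - 1, Ty - 1
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]
```

The aligned index pairs then provide the time-aligned source/target feature pairs used to train the linear regression or joint-space GMM conversion functions described in the conversion stage.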
(2) Conversion Stage
The conversion stage includes three steps:
Step B1: feature extraction.
As in the training stage, this step extracts features from the source speaker's speech signal to be converted. The features include fundamental frequency features and spectral features, and the spectral features are divided into speaker features and content features.
Step B2: feature conversion.
Using the fundamental frequency conversion function and the speaker conversion function obtained in the training stage, the fundamental frequency features and the speaker features extracted from the speech signal to be converted in step B1 are converted to obtain the converted fundamental frequency features and speaker features.
For the fundamental frequency conversion, the training stage computes on the training set the means $\mu_x$, $\mu_y$ and the variances $\sigma_x^2$, $\sigma_y^2$ of the fundamental frequencies of the source and target speakers' speech signals in the logarithmic domain. The conversion function applied to the fundamental frequency then takes the form:

$\log f_0^{conv} = \mu_y + \frac{\sigma_y}{\sigma_x}\left(\log f_0^{src} - \mu_x\right)$
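A minimal sketch of this log-domain linear fundamental frequency conversion, assuming unvoiced frames are marked with zero F0 and left unchanged:

```python
import numpy as np

def train_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Map the source F0 contour onto the target speaker's log-F0 distribution."""
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    f0_conv[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0_src[voiced]) - mu_x))
    return f0_conv
```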
For the conversion of the speaker features, assume that time-aligned speaker features of the source and target speakers, X = {x_1, x_2, ..., x_T} and Y = {y_1, y_2, ..., y_T}, are available as training data. The present invention adopts two schemes. One scheme uses the linear regression model F(x_t) = A x_t + b as the spectral conversion function, whose parameters can be computed as:
$[A, b] = Y X^{\top} (X X^{\top})^{-1}$
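A sketch of this closed-form estimate, under the assumption that X is augmented with a constant row of ones so that the bias b is estimated jointly with A:

```python
import numpy as np

def train_linear_regression(X, Y):
    """Estimate A and b of F(x) = A x + b from time-aligned speaker features.

    X, Y: arrays of shape (D, T) holding aligned source and target features.
    """
    T = X.shape[1]
    X_aug = np.vstack([X, np.ones((1, T))])            # constant row so b is learned too
    W = Y @ X_aug.T @ np.linalg.pinv(X_aug @ X_aug.T)  # [A, b] = Y X^T (X X^T)^-1
    return W[:, :-1], W[:, -1]                          # A, b

def convert_linear(x_t, A, b):
    """Apply the trained spectral conversion to one speaker-feature frame."""
    return A @ x_t + b
```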
The other scheme, based on a joint-space Gaussian mixture model, uses the joint features $z_t = [x_t^{\top}, y_t^{\top}]^{\top}$ to train a Gaussian mixture model that describes the distribution of the joint feature space in the form:

$P(z) = \sum_{m} w_m \, \mathcal{N}(z; \mu_m, \Sigma_m),$
where the mean and covariance of each mixture component decompose into source and target blocks, $\mu_m = [\mu_m^{x\top}, \mu_m^{y\top}]^{\top}$ and $\Sigma_m$ partitioning into $\Sigma_m^{xx}$, $\Sigma_m^{xy}$, $\Sigma_m^{yx}$, $\Sigma_m^{yy}$.
From this model the conversion function is derived:

$F(x_t) = \sum_{m} p_m(x_t)\left[\mu_m^{y} + \Sigma_m^{yx}\left(\Sigma_m^{xx}\right)^{-1}\left(x_t - \mu_m^{x}\right)\right]$

where $p_m(x_t)$ is the posterior probability of the m-th mixture component given $x_t$.
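A sketch of the joint-space GMM scheme built on scikit-learn's GaussianMixture; the component count, the full-covariance choice and the per-frame conversion loop are illustrative assumptions, not the exact configuration of the embodiment.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X, Y, n_components=8):
    """Fit a GMM on joint features z_t = [x_t; y_t]; X, Y are (T, D) aligned features."""
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(np.hstack([X, Y]))

def convert_gmm(x, gmm, D):
    """Convert one source speaker-feature frame x of dimension D with the joint GMM."""
    w = gmm.weights_
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx, S_yx = gmm.covariances_[:, :D, :D], gmm.covariances_[:, D:, :D]
    # Posterior p_m(x) under the marginal source-feature distribution of each component.
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], S_xx[m]) for m in range(len(w))])
    post = w * lik
    post /= post.sum()
    # Posterior-weighted conditional mean of y given x.
    y = np.zeros(D)
    for m in range(len(w)):
        y += post[m] * (mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m]))
    return y
```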
Step B3: speech synthesis.
This step synthesizes the speech of the target speaker from the converted fundamental frequency features and speaker features obtained in step B2 and the content features of the speech signal to be converted extracted in step B1.
The present invention uses a synthesizer based on the source-filter structure, which requires the excitation (i.e., the fundamental frequency) and the vocal tract response (spectral parameters) as input to generate the converted speech. Therefore the converted spectral parameters must first be reconstructed from the converted speaker features and the content features of the speech signal to be converted (the spectral parameter reconstruction process is described below), and the converted speech is then generated by the synthesizer. The present invention uses the STRAIGHT analysis-synthesis vocoder for speech generation.
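Because the STRAIGHT toolkit is not freely redistributable, the sketch below uses the WORLD vocoder (the `pyworld` package) as a stand-in source-filter synthesizer; the function and parameter names reflect that package and are assumptions, not the embodiment's actual STRAIGHT implementation.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def synthesize(f0_conv, spectral_env, aperiodicity, fs=16000, frame_period=5.0, out="converted.wav"):
    """Generate the converted waveform from the excitation (F0) and the vocal-tract response."""
    y = pw.synthesize(np.ascontiguousarray(f0_conv, dtype=np.float64),
                      np.ascontiguousarray(spectral_env, dtype=np.float64),
                      np.ascontiguousarray(aperiodicity, dtype=np.float64),
                      fs, frame_period)
    sf.write(out, y, fs)
    return y
```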
(3) Feature Extraction
The method of the invention has been introduced as a whole above; the feature extraction step used in the method is described in detail below.
As mentioned above, feature extraction in the present invention comprises the extraction of fundamental frequency features, speaker features and content features. Fundamental frequency extraction uses a conventional method; the method for extracting speaker features and content features is the core of the invention.
3.1 Basic Steps
Fig. 2 is a block diagram of the feature extraction step of the present invention. As shown in Fig. 2, feature extraction is divided into two steps:
Step a1: acoustic feature extraction.
Based on the source-filter structure of the speech signal, and taking into account its short-term stationarity and long-term non-stationarity, the speech signal is segmented into 20-30 ms segments, each of which is called a frame in the present invention. For each frame, an existing speech analysis algorithm (such as STRAIGHT) is used to extract the fundamental frequency and spectral parameters (such as line spectral pairs or Mel cepstra) from the speech signal.
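As a concrete illustration of this analysis step, the sketch below frames the waveform and extracts per-frame F0 and compact spectral parameters with the WORLD vocoder (`pyworld`), used here as an accessible substitute for the STRAIGHT analyzer named in the text; the 5 ms frame period, the 24-dimensional coding and the `code_spectral_envelope` call (available in recent pyworld versions) are assumptions for illustration.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_acoustic_features(wav_path, frame_period=5.0, n_dims=24):
    """Per-frame F0 and a low-dimensional spectral parameterisation of one utterance."""
    x, fs = sf.read(wav_path)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.dio(x, fs, frame_period=frame_period)     # raw F0 track
    f0 = pw.stonemask(x, f0, t, fs)                       # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)                      # spectral envelope per frame
    mcep = pw.code_spectral_envelope(sp, fs, n_dims)      # compact spectral parameters
    return f0, mcep
```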
Step a2: speaker feature and content feature extraction.
Since differences between speakers are mainly reflected in the structure of the vocal tract, in terms of acoustic features they appear mainly in the spectral parameters. The present invention therefore focuses on separating speaker-related features and content-related features from the spectral features. In addition, because speaker characteristics are supra-segmental, long-term properties, the invention concatenates the features of several consecutive frames into a so-called supra-segmental feature that is fed into the feature separator, so that the speaker-related features can be extracted effectively and separated more cleanly from the content-related features. The specific feature separation method is as follows.
3.2 Feature Separation Algorithm
The present invention uses a deep neural network to separate the speaker features and the content features in the acoustic spectral parameters. Fig. 3 is a schematic diagram of the neural network structure used for feature separation. As shown in Fig. 3, the network adopts a vertically symmetric structure with 2K-1 layers in total (K being a natural number): the lowest layer is the input layer, into which the acoustic features to be separated are fed; the top layer is the output layer, which outputs the reconstructed acoustic features; and the 2K-3 hidden layers in between each contain several nodes that simulate the processing of neural units.
The layers from the input layer up to the K-th hidden layer (counted from the bottom) form the encoding network (or encoder), which extracts high-level information from the input acoustic features; the K-th hidden layer from the bottom is the coding layer, whose nodes are divided into two parts, one related to the speaker and the other related to the content, with outputs corresponding to the speaker features and the content features respectively. The hidden layers above the K-th hidden layer form the decoding network (or decoder), whose function is the opposite of the encoder's: it reconstructs the acoustic spectral parameters from the high-level speaker features and content features.
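A minimal numpy sketch of the symmetric encoder/decoder with the split coding layer; the sigmoid nonlinearity, the linear output layer and the ordering of speaker versus content units are assumptions, and the weights themselves would come from the pre-training and fine-tuning procedure described below.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class SeparationNetwork:
    """2K-1 layer autoencoder whose middle (coding) layer splits into speaker and content units."""

    def __init__(self, weights, biases, n_speaker_units):
        # weights/biases: 2K-2 layer transforms listed bottom-to-top; the first K-1 form the encoder.
        self.weights, self.biases = weights, biases
        self.n_speaker_units = n_speaker_units
        self.code_index = len(weights) // 2

    def encode(self, x):
        """Acoustic features -> (speaker features, content features) at the coding layer."""
        h = x
        for W, b in zip(self.weights[:self.code_index], self.biases[:self.code_index]):
            h = sigmoid(W @ h + b)
        return h[:self.n_speaker_units], h[self.n_speaker_units:]

    def decode(self, speaker_feat, content_feat):
        """Speaker and content features -> reconstructed acoustic spectral parameters."""
        h = np.concatenate([speaker_feat, content_feat])
        n_dec = len(self.weights) - self.code_index
        for i, (W, b) in enumerate(zip(self.weights[self.code_index:], self.biases[self.code_index:])):
            h = W @ h + b
            if i < n_dec - 1:            # keep the output layer linear
                h = sigmoid(h)
        return h
```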
The deep neural network of Fig. 3 used by the present invention is a simulation of how the human nervous system processes speech signals. It must be trained so that it acquires the specific ability required here, namely extracting and separating speaker features and content features from acoustic features. The network is trained on a speech signal database built with the database construction method proposed by the invention, which is described in the database construction section below.
Fig. 4 is a flowchart of the neural network training in the present invention. The training process consists of three steps:
Step b1: pre-training.
Because deep neural networks are difficult to optimize, the network weights are initialized by pre-training before the actual training. The invention adopts an unsupervised learning mode and uses a greedy algorithm to train the network layer by layer, quickly obtaining initial model parameters. In the training of each layer, a de-noising auto-encoder can be used to initialize the weights: a certain amount of noise is added to mask the input features, which makes the training more robust and prevents over-training. Specifically, at the input layer the features follow a Gaussian distribution, so an appropriate amount of Gaussian noise is added to each input dimension and the minimum mean square error criterion is used for training. In the layers above the first layer the features follow a binary distribution, so some dimensions of the input are set to zero with a certain probability and the minimum cross-entropy criterion is used for training. After pre-training yields a stack of K auto-encoder layers, the stack is flipped upward to obtain the vertically symmetric auto-encoder structure.
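A sketch of one layer of this greedy de-noising pre-training with tied weights; the corruption levels, learning rate and plain per-sample SGD update are assumptions, while the Gaussian-corrupted/MSE bottom layer versus masked/cross-entropy upper layers follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_denoising_layer(H, n_hidden, gaussian_input, epochs=10, lr=0.01, noise=0.1):
    """Greedy pre-training of one tied-weight de-noising autoencoder layer on data H of shape (N, D)."""
    N, D = H.shape
    W = rng.normal(0.0, 0.01, (n_hidden, D))
    b, c = np.zeros(n_hidden), np.zeros(D)
    for _ in range(epochs):
        for x in H:
            # Corrupt the input: additive Gaussian noise for the real-valued bottom layer,
            # random zero-masking for the binary upper layers.
            x_tilde = x + rng.normal(0.0, noise, D) if gaussian_input else x * (rng.random(D) > noise)
            h = sigmoid(W @ x_tilde + b)
            x_hat = W.T @ h + c if gaussian_input else sigmoid(W.T @ h + c)
            # MSE (linear output) and cross-entropy (sigmoid output) share the same output error term.
            err = x_hat - x
            grad_h = (W @ err) * h * (1.0 - h)
            W -= lr * (np.outer(h, err) + np.outer(grad_h, x_tilde))
            b -= lr * grad_h
            c -= lr * err
    return W, b, c
```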
Step b2: coding layer adjustment.
After pre-training, the network already has a certain ability to extract high-level information: in the coding layer, some nodes reflect a strong ability to discriminate between speakers while others reflect a strong ability to discriminate between contents. In this step, objective criteria are used to pick out these nodes, and their outputs are taken as the corresponding features. A discriminative criterion such as Fisher's ratio can be used for the selection. Specifically, on the training set of the speech signal database, this criterion is used to measure, for the output feature of each node of the coding layer, its discriminability between different speakers and between different contents; nodes with large inter-speaker discriminability but small inter-content discriminability are taken as speaker-related nodes, and the remaining nodes as content-related nodes.
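A sketch of this node selection using a per-dimension Fisher's ratio (between-class over within-class variance); the ratio-of-ratios ranking and the fixed number of speaker nodes are assumptions made for illustration.

```python
import numpy as np

def fishers_ratio(features, labels):
    """Per-dimension ratio of between-class to within-class variance.

    features: (N, D) coding-layer outputs; labels: (N,) class labels
    (speaker identities or content classes).
    """
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in np.unique(labels):
        group = features[labels == c]
        between += len(group) * (group.mean(axis=0) - overall_mean) ** 2
        within += ((group - group.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

def split_code_nodes(code_outputs, speaker_labels, content_labels, n_speaker_nodes):
    """Rank coding-layer nodes by speaker versus content discriminability and split them."""
    score = fishers_ratio(code_outputs, speaker_labels) / \
            np.maximum(fishers_ratio(code_outputs, content_labels), 1e-12)
    order = np.argsort(-score)                       # most speaker-discriminative first
    return order[:n_speaker_nodes], order[n_speaker_nodes:]
```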
Step b3: fine-tuning.
The invention needs to separate speaker-related and content-related features from the input acoustic spectral parameters and to apply them to speaker voice conversion. A specific discriminative objective function must therefore be designed to train the network so that it has this ability, which requires introducing contrastive competition into the input training samples. In the network structure of Fig. 3, two samples $x_1$ and $x_2$ are fed into the input layer in parallel at each step; they produce speaker features $c_{s1}$, $c_{s2}$ and content features $c_{c1}$, $c_{c2}$ at the coding layer, and the decoding network then reconstructs the input acoustic features $\hat{x}_1$ and $\hat{x}_2$. The objective function for training the network therefore contains the following three parts:
Reconstruction error: on the one hand, because the speaker voice conversion application must recover acoustic spectral parameters from the high-level features, the decoding network needs a good reconstruction capability, which directly affects the quality of the synthesized speech, so the training objective has to constrain the reconstruction error. On the other hand, constraining the reconstruction error also guarantees that the information in the encoded speaker features and content features is complete. The present invention uses an error term of the following form:

$L_r = |x_1 - \hat{x}_1|^2 + |x_2 - \hat{x}_2|^2$
Speaker feature cost: to make the speaker features highly discriminative with respect to the speaker but not with respect to the content, a criterion can be designed that makes the speaker-feature error between samples of the same speaker as small as possible and the error between different speakers as large as possible. This criterion can be expressed as:

$L_{sc} = \delta_s E_s + (1 - \delta_s)\exp(-\lambda_s E_s)$

where $E_s = |c_{s1} - c_{s2}|^2$, and $\delta_s$ is the speaker label of the two input samples: $\delta_s = 1$ indicates that the two inputs come from the same speaker, and $\delta_s = 0$ indicates that they come from two different speakers.
Content feature cost: analogously to the speaker feature cost, a discriminative cost function for the content features can be constructed:

$L_{cc} = \delta_c E_c + (1 - \delta_c)\exp(-\lambda_c E_c)$

where $E_c = |c_{c1} - c_{c2}|^2$ and $\delta_c$ indicates whether the two input samples carry the same content.
Combining the above three costs gives the objective function finally used for fine-tuning:

$L = \alpha L_r + \beta L_{sc} + \zeta L_{cc}$
Here α, β and ζ are weights that adjust the relative importance of the three costs. The goal of training is to adjust the network weights so that this objective function is as small as possible; the present invention uses the error back-propagation algorithm and updates the network weights with gradient descent with momentum.
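The three cost terms of the fine-tuning objective can be sketched as follows; the squared-error reconstruction term and the default hyper-parameter values are assumptions.

```python
import numpy as np

def finetune_loss(x1, x2, x1_hat, x2_hat, cs1, cs2, cc1, cc2,
                  same_speaker, same_content,
                  alpha=1.0, beta=1.0, zeta=1.0, lam_s=1.0, lam_c=1.0):
    """Combined objective L = alpha*L_r + beta*L_sc + zeta*L_cc for one contrastive input pair."""
    # Reconstruction error of both parallel inputs.
    L_r = np.sum((x1 - x1_hat) ** 2) + np.sum((x2 - x2_hat) ** 2)
    # Speaker-feature cost: pull same-speaker codes together, push different-speaker codes apart.
    E_s = np.sum((cs1 - cs2) ** 2)
    L_sc = E_s if same_speaker else np.exp(-lam_s * E_s)
    # Content-feature cost, built the same way on the content codes.
    E_c = np.sum((cc1 - cc2) ** 2)
    L_cc = E_c if same_content else np.exp(-lam_c * E_c)
    return alpha * L_r + beta * L_sc + zeta * L_cc
```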
(4) Construction of the Speaker Speech Signal Database
The neural network used in the present invention requires a large amount of training data: many speakers are needed, and each speaker must record a corpus with sufficient content.
It should be emphasized that this large amount of training data for the neural network is not the source speaker or target speaker data of the training process shown in Fig. 1. In practice it is unrealistic or too demanding to obtain large amounts of data from the source or target speaker of Fig. 1, but collecting the large amount of training data required by the neural network described here is feasible and meets practical requirements.
Fig. 5 is a flowchart of the database construction in the present invention. It is divided into four steps:
Step c1: build a corpus containing multiple sentences.
Since a robust separation network must handle all speakers and all contents, the invention designs a phoneme-balanced corpus; the number of sentences must not be too large, usually within 100 sentences, so that data from a large number of speakers can be collected. Phoneme balance means that the corpus contains all phonemes and that the numbers of the different phonemes are relatively balanced.
Step c2: record the speech signals of multiple speakers reading the sentences in the corpus, construct a speech signal database, and preprocess the speech signals in the database to remove abnormal portions of the signals.
To give the network the ability to distinguish speakers, data from a large number of speakers must be recorded to train it. During recording, cost and other constraints make it impossible to engage that many professional announcers, so recordings from amateurs have to be collected, which makes the quality of the recorded speech uneven. After recording, the speech therefore needs some preprocessing, such as energy normalization, channel equalization and the handling of microphone popping, to guarantee the quality of the training corpus.
Step c3: use a hidden Markov model to segment the speech signals in the preprocessed speech signal database, treat each segment as a frame, and thereby obtain frame-level speaker annotation information and content annotation information for each speech signal.
As described above, the fine-tuning stage of the neural network training is a supervised learning process that requires the speaker label and content label of every input training frame. The speech signals in the database therefore need frame-level annotation, i.e. segmentation into speech segments. Specifically, an existing context-dependent hidden Markov model used for speech synthesis can be employed for the segmentation. Before segmentation, the model is adapted to each speaker's acoustic space from that speaker's recordings with the maximum likelihood linear regression algorithm, and the adapted model is then used to decode the speaker's recordings with the Viterbi algorithm to obtain the boundary information of each state of the model.
Step c4: randomly combine the speech signals in the speech database to construct the training data of the neural network.
As described above, the training data of the neural network fall into four types: same speaker with same content, same speaker with different content, different speakers with same content, and different speakers with different content. Because there are many combinations of speaker and content attributes, in the training stage the present invention randomly selects combinations from the training data and feeds them into the network for training, for example as in the sketch below.
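A sketch of how the contrastive training pairs and their labels delta_s and delta_c could be drawn at random from the frame-level annotations; the uniform sampling policy is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_pair(features, speaker_ids, content_ids):
    """Draw two frames at random and return them with same-speaker / same-content flags.

    features: (N, D) supra-segmental features; speaker_ids, content_ids: (N,) frame-level labels.
    """
    i, j = rng.integers(len(features), size=2)
    delta_s = int(speaker_ids[i] == speaker_ids[j])   # 1: same speaker, 0: different speakers
    delta_c = int(content_ids[i] == content_ids[j])   # 1: same content, 0: different content
    return features[i], features[j], delta_s, delta_c
```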
(5) Specific Embodiment
Following the method described above, a speaker voice conversion system was built as an example embodiment of the invention. First, a phoneme-balanced corpus of 100 sentences was designed and 81 speakers (40 male and 41 female) were recruited for recording; after processing, the recordings form the final training corpus. The recorded speech files are mono with a 16 kHz sampling rate. From the 81 speakers, the data of 60 (30 male, 30 female) were randomly chosen as the training set for the neural network, the data of another 10 (5 male and 5 female) as the validation set for the network training, and the data of the remaining 11 as the test set used to evaluate speaker voice conversion. For acoustic feature extraction, a 25 ms Hamming window splits the waveform into frames with a frame shift of 5 ms, and for each frame one fundamental frequency value and a set of 24-dimensional Mel cepstral parameters are extracted as acoustic features.
In the stage of training the neural network for feature separation, the input vector of the network is the supra-segmental feature formed by concatenating the current frame with its 5 preceding and 5 following frames, 11 frames in total and 264 dimensions (see the sketch below); since the output only has to reconstruct the current input frame, the output layer has 24 dimensions. The network contains 7 hidden layers with 500, 400, 300, 200, 300, 400 and 500 nodes respectively; in the middle layer, the outputs of the first 100 nodes are taken as the speaker features and the outputs of the remaining 100 nodes as the content features. In the pre-training stage, 4 stacked auto-encoders with sizes 264-500, 500-400, 400-300 and 300-200 are used to initialize the network weights from bottom to top, the output of each auto-encoder serving as the input of the next; the weights are initialized by unsupervised learning and then flipped to obtain the initial weights of the whole network. Note that when the first layer is flipped to the topmost layer of the network, the output has only 24 dimensions, so only the weights corresponding to the current frame of the input layer need to be flipped up. In addition, before the middle layer is flipped, the discriminability of each node's output between different speakers and between different contents (the Fisher's ratio mentioned above) has to be computed and used to reorder the nodes and the network weights. After pre-training, fine-tuning is carried out as described above; in this process the weights of the objective function are tuned on the validation set to obtain their optimal values.
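The 264-dimensional supra-segmental input described above (the current frame plus five frames of left and right context, 11 x 24 dimensions) can be formed as in the sketch below; the edge-padding behaviour at utterance boundaries is an assumption.

```python
import numpy as np

def stack_context(mcep, context=5):
    """Stack each 24-dim frame with its +/- `context` neighbours into one supra-segment vector.

    mcep: (T, 24) Mel-cepstral frames  ->  (T, 24 * (2*context + 1)) = (T, 264) for context=5.
    """
    T, _ = mcep.shape
    padded = np.pad(mcep, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])
```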
After the feature separator has been trained, the speaker voice conversion system can be built. Two speakers are picked arbitrarily from the test set; 50 of their sentences are used as training data from which the required features are extracted and the conversion functions for the fundamental frequency and the speaker features are trained (a direct linear regression model is used in this embodiment), and the remaining 50 sentences are used as test data to verify the effect of the speaker voice conversion.
Fisher's ratio is used to measure how well the different extracted features discriminate between different speakers and between different contents. Fisher's ratio is the ratio of the between-class distance to the within-class distance of a feature; the larger the ratio, the more discriminative the feature is under the given classification. Fig. 6 and Fig. 7 show the discriminability of the Mel cepstral coefficients and of the separated features, respectively, between different speakers (solid line) and different contents (dashed line). Among the input acoustic features, apart from the low-order dimensions, which show fairly strong discrimination of content, the remaining dimensions are not very discriminative, whereas the extracted features (the first 100 dimensions being speaker features and the remaining 100 dimensions content features) show the desired discriminability for the two classifications after training. In the speaker conversion experiment, speech synthesized directly from the target speaker's speaker features and the source speaker's content features has a cepstral error of 4.39 dB, and speech synthesized from the linearly transformed speaker features of the source speaker and its content features has a cepstral error of 5.64 dB, which subjectively already approaches the target speaker's speech.
The specific embodiments described above further illustrate the objects, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528629.4A CN102982809B (en) | 2012-12-11 | 2012-12-11 | Conversion method for sound of speaker |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528629.4A CN102982809B (en) | 2012-12-11 | 2012-12-11 | Conversion method for sound of speaker |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102982809A true CN102982809A (en) | 2013-03-20 |
CN102982809B CN102982809B (en) | 2014-12-10 |
Family
ID=47856718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210528629.4A Active CN102982809B (en) | 2012-12-11 | 2012-12-11 | Conversion method for sound of speaker |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102982809B (en) |
Cited By (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514883A (en) * | 2013-09-26 | 2014-01-15 | 华南理工大学 | Method for achieving self-adaptive switching of male voice and female voice |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN104143327A (en) * | 2013-07-10 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Acoustic model training method and device |
CN104392717A (en) * | 2014-12-08 | 2015-03-04 | 常州工学院 | Sound track spectrum Gaussian mixture model based rapid voice conversion system and method |
CN104464725A (en) * | 2014-12-30 | 2015-03-25 | 福建星网视易信息系统有限公司 | Method and device for singing imitation |
CN105023570A (en) * | 2014-04-30 | 2015-11-04 | 安徽科大讯飞信息科技股份有限公司 | method and system of transforming speech |
CN105206257A (en) * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device |
CN105321526A (en) * | 2015-09-23 | 2016-02-10 | 联想(北京)有限公司 | Audio processing method and electronic device |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
US9508347B2 (en) | 2013-07-10 | 2016-11-29 | Tencent Technology (Shenzhen) Company Limited | Method and device for parallel processing in model training |
CN106228976A (en) * | 2016-07-22 | 2016-12-14 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN106384587A (en) * | 2015-07-24 | 2017-02-08 | 科大讯飞股份有限公司 | Voice recognition method and system thereof |
CN107068157A (en) * | 2017-02-21 | 2017-08-18 | 中国科学院信息工程研究所 | A kind of information concealing method and system based on audio carrier |
CN107452369A (en) * | 2017-09-28 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN107464569A (en) * | 2017-07-04 | 2017-12-12 | 清华大学 | Vocoder |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN107481735A (en) * | 2017-08-28 | 2017-12-15 | 中国移动通信集团公司 | Method for converting audio sound production, server and computer readable storage medium |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN107545903A (en) * | 2017-07-19 | 2018-01-05 | 南京邮电大学 | A kind of phonetics transfer method based on deep learning |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | 京东方科技集团股份有限公司 | Accent conversion method and device, electronic equipment |
WO2018098892A1 (en) * | 2016-11-29 | 2018-06-07 | 科大讯飞股份有限公司 | End-to-end modelling method and system |
CN105023574B (en) * | 2014-04-30 | 2018-06-15 | 科大讯飞股份有限公司 | A kind of method and system for realizing synthesis speech enhan-cement |
CN108461079A (en) * | 2018-02-02 | 2018-08-28 | 福州大学 | A kind of song synthetic method towards tone color conversion |
CN108550372A (en) * | 2018-03-24 | 2018-09-18 | 上海诚唐展览展示有限公司 | A kind of system that astronomical electric signal is converted into audio |
CN108847249A (en) * | 2018-05-30 | 2018-11-20 | 苏州思必驰信息科技有限公司 | Sound converts optimization method and system |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Many-to-many speech conversion method based on text encoder under the condition of non-parallel text |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | A multi-to-many speaker conversion method based on i-vector under the condition of non-parallel text |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN109616131A (en) * | 2018-11-12 | 2019-04-12 | 南京南大电子智慧型服务机器人研究院有限公司 | A digital real-time voice change method |
CN109637551A (en) * | 2018-12-26 | 2019-04-16 | 出门问问信息科技有限公司 | Phonetics transfer method, device, equipment and storage medium |
CN109817198A (en) * | 2019-03-06 | 2019-05-28 | 广州多益网络股份有限公司 | Multiple sound training method, phoneme synthesizing method and device for speech synthesis |
CN109935225A (en) * | 2017-12-15 | 2019-06-25 | 富泰华工业(深圳)有限公司 | Character information processor and method, computer storage medium and mobile terminal |
CN110010144A (en) * | 2019-04-24 | 2019-07-12 | 厦门亿联网络技术股份有限公司 | Voice signals enhancement method and device |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110060701A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on VAWGAN-AC |
CN110459232A (en) * | 2019-07-24 | 2019-11-15 | 浙江工业大学 | A Speech Conversion Method Based on Recurrent Generative Adversarial Networks |
CN110570873A (en) * | 2019-09-12 | 2019-12-13 | Oppo广东移动通信有限公司 | Voiceprint wake-up method, device, computer equipment and storage medium |
CN110600013A (en) * | 2019-09-12 | 2019-12-20 | 苏州思必驰信息科技有限公司 | Training method and device for non-parallel corpus voice conversion data enhancement model |
CN110600012A (en) * | 2019-08-02 | 2019-12-20 | 特斯联(北京)科技有限公司 | Fuzzy speech semantic recognition method and system for artificial intelligence learning |
CN111201565A (en) * | 2017-05-24 | 2020-05-26 | 调节股份有限公司 | System and method for sound-to-sound conversion |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
CN111462769A (en) * | 2020-03-30 | 2020-07-28 | 深圳市声希科技有限公司 | End-to-end accent conversion method |
CN111883149A (en) * | 2020-07-30 | 2020-11-03 | 四川长虹电器股份有限公司 | Voice conversion method and device with emotion and rhythm |
CN111951810A (en) * | 2019-05-14 | 2020-11-17 | 国际商业机器公司 | High quality non-parallel many-to-many voice conversion |
CN112309365A (en) * | 2020-10-21 | 2021-02-02 | 北京大米科技有限公司 | Training method, device, storage medium and electronic device for speech synthesis model |
CN112382308A (en) * | 2020-11-02 | 2021-02-19 | 天津大学 | Zero-order voice conversion system and method based on deep learning and simple acoustic features |
CN112735434A (en) * | 2020-12-09 | 2021-04-30 | 中国人民解放军陆军工程大学 | Voice communication method and system with voiceprint cloning function |
CN113345452A (en) * | 2021-04-27 | 2021-09-03 | 北京搜狗科技发展有限公司 | Voice conversion method, training method, device and medium of voice conversion model |
CN113889130A (en) * | 2021-09-27 | 2022-01-04 | 平安科技(深圳)有限公司 | A voice conversion method, device, equipment and medium |
CN114203154A (en) * | 2021-12-09 | 2022-03-18 | 北京百度网讯科技有限公司 | Training method and device of voice style migration model and voice style migration method and device |
CN115410597A (en) * | 2022-05-24 | 2022-11-29 | 北方工业大学 | A Genetic Algorithm-Based Method for Optimizing Speech Conversion Parameters |
US11996117B2 (en) | 2020-10-08 | 2024-05-28 | Modulate, Inc. | Multi-stage adaptive system for content moderation |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1835074A (en) * | 2006-04-07 | 2006-09-20 | 安徽中科大讯飞信息科技有限公司 | Speaking person conversion method combined high layer discription information and model self adaption |
US20090089063A1 (en) * | 2007-09-29 | 2009-04-02 | Fan Ping Meng | Voice conversion method and system |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
Non-Patent Citations (3)
Title |
---|
Ling-Hui Chen, et al.: "Non-Parallel Training for Voice Conversion Based on FT-GMM", 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 27 May 2011 (2011-05-27), pages 5116-5119 * |
Ling-Hui Chen, et al.: "GMM-based Voice Conversion with Explicit Modelling on Feature Transform", 2010 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), 3 December 2010 (2010-12-03), pages 364-368 * |
Xu Xiaofeng, et al.: "Research on Voice Conversion System Based on Speaker-Independent Modeling" (基于说话人独立建模的语音转换系统研究), Signal Processing (《信号处理》), vol. 25, no. 8, 31 August 2009 (2009-08-31), pages 171-174 * |
Cited By (87)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143327A (en) * | 2013-07-10 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Acoustic model training method and device |
WO2015003436A1 (en) * | 2013-07-10 | 2015-01-15 | Tencent Technology (Shenzhen) Company Limited | Method and device for parallel processing in model training |
US9508347B2 (en) | 2013-07-10 | 2016-11-29 | Tencent Technology (Shenzhen) Company Limited | Method and device for parallel processing in model training |
CN104143327B (en) * | 2013-07-10 | 2015-12-09 | 腾讯科技(深圳)有限公司 | A kind of acoustic training model method and apparatus |
CN103514883A (en) * | 2013-09-26 | 2014-01-15 | 华南理工大学 | Method for achieving self-adaptive switching of male voice and female voice |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103531205B (en) * | 2013-10-09 | 2016-08-31 | 常州工学院 | The asymmetrical voice conversion method mapped based on deep neural network feature |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN103886859B (en) * | 2014-02-14 | 2016-08-17 | 河海大学常州校区 | Phonetics transfer method based on one-to-many codebook mapping |
CN105023574B (en) * | 2014-04-30 | 2018-06-15 | 科大讯飞股份有限公司 | A kind of method and system for realizing synthesis speech enhan-cement |
CN105023570B (en) * | 2014-04-30 | 2018-11-27 | 科大讯飞股份有限公司 | A kind of method and system for realizing sound conversion |
CN105023570A (en) * | 2014-04-30 | 2015-11-04 | 安徽科大讯飞信息科技股份有限公司 | method and system of transforming speech |
CN104392717A (en) * | 2014-12-08 | 2015-03-04 | 常州工学院 | Sound track spectrum Gaussian mixture model based rapid voice conversion system and method |
CN104464725B (en) * | 2014-12-30 | 2017-09-05 | 福建凯米网络科技有限公司 | A kind of method and apparatus imitated of singing |
CN104464725A (en) * | 2014-12-30 | 2015-03-25 | 福建星网视易信息系统有限公司 | Method and device for singing imitation |
CN106384587A (en) * | 2015-07-24 | 2017-02-08 | 科大讯飞股份有限公司 | Voice recognition method and system thereof |
CN106384587B (en) * | 2015-07-24 | 2019-11-15 | 科大讯飞股份有限公司 | A kind of audio recognition method and system |
CN105321526A (en) * | 2015-09-23 | 2016-02-10 | 联想(北京)有限公司 | Audio processing method and electronic device |
CN105321526B (en) * | 2015-09-23 | 2020-07-24 | 联想(北京)有限公司 | Audio processing method and electronic equipment |
CN105206257A (en) * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device |
CN105390141B (en) * | 2015-10-14 | 2019-10-18 | 科大讯飞股份有限公司 | Sound converting method and device |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
CN105206257B (en) * | 2015-10-14 | 2019-01-18 | 科大讯飞股份有限公司 | A kind of sound converting method and device |
CN106228976B (en) * | 2016-07-22 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN106228976A (en) * | 2016-07-22 | 2016-12-14 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
WO2018098892A1 (en) * | 2016-11-29 | 2018-06-07 | 科大讯飞股份有限公司 | End-to-end modelling method and system |
US11651578B2 (en) | 2016-11-29 | 2023-05-16 | Iflytek Co., Ltd. | End-to-end modelling method and system |
CN107068157B (en) * | 2017-02-21 | 2020-04-10 | 中国科学院信息工程研究所 | Information hiding method and system based on audio carrier |
CN107068157A (en) * | 2017-02-21 | 2017-08-18 | 中国科学院信息工程研究所 | A kind of information concealing method and system based on audio carrier |
CN111201565A (en) * | 2017-05-24 | 2020-05-26 | 调节股份有限公司 | System and method for sound-to-sound conversion |
US11854563B2 (en) | 2017-05-24 | 2023-12-26 | Modulate, Inc. | System and method for creating timbres |
CN107464569A (en) * | 2017-07-04 | 2017-12-12 | 清华大学 | Vocoder |
CN107545903B (en) * | 2017-07-19 | 2020-11-24 | 南京邮电大学 | A voice conversion method based on deep learning |
CN107545903A (en) * | 2017-07-19 | 2018-01-05 | 南京邮电大学 | A kind of phonetics transfer method based on deep learning |
CN107481735A (en) * | 2017-08-28 | 2017-12-15 | 中国移动通信集团公司 | Method for converting audio sound production, server and computer readable storage medium |
CN107507619B (en) * | 2017-09-11 | 2021-08-20 | 厦门美图之家科技有限公司 | Voice conversion method and device, electronic equipment and readable storage medium |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN107452369B (en) * | 2017-09-28 | 2021-03-19 | 百度在线网络技术(北京)有限公司 | Method and device for generating speech synthesis model |
US10978042B2 (en) | 2017-09-28 | 2021-04-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for generating speech synthesis model |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN107452369A (en) * | 2017-09-28 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN109935225A (en) * | 2017-12-15 | 2019-06-25 | 富泰华工业(深圳)有限公司 | Character information processor and method, computer storage medium and mobile terminal |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | 京东方科技集团股份有限公司 | Accent conversion method and device, electronic equipment |
CN108461079A (en) * | 2018-02-02 | 2018-08-28 | 福州大学 | A kind of song synthetic method towards tone color conversion |
CN108550372A (en) * | 2018-03-24 | 2018-09-18 | 上海诚唐展览展示有限公司 | A kind of system that astronomical electric signal is converted into audio |
CN108847249A (en) * | 2018-05-30 | 2018-11-20 | 苏州思必驰信息科技有限公司 | Sound converts optimization method and system |
CN108847249B (en) * | 2018-05-30 | 2020-06-05 | 苏州思必驰信息科技有限公司 | Sound conversion optimization method and system |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
CN109616131A (en) * | 2018-11-12 | 2019-04-12 | 南京南大电子智慧型服务机器人研究院有限公司 | A digital real-time voice change method |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | A multi-to-many speaker conversion method based on i-vector under the condition of non-parallel text |
CN109616131B (en) * | 2018-11-12 | 2023-07-07 | 南京南大电子智慧型服务机器人研究院有限公司 | Digital real-time voice sound changing method |
CN109377978B (en) * | 2018-11-12 | 2021-01-26 | 南京邮电大学 | Many-to-many speaker conversion method based on i vector under non-parallel text condition |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Many-to-many speech conversion method based on text encoder under the condition of non-parallel text |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | 南京邮电大学 | Many-to-many speech conversion method based on text encoder under the condition of non-parallel text |
CN109637551A (en) * | 2018-12-26 | 2019-04-16 | 出门问问信息科技有限公司 | Phonetics transfer method, device, equipment and storage medium |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN109599091B (en) * | 2019-01-14 | 2021-01-26 | 南京邮电大学 | Many-to-many speaker conversion method based on STARWGAN-GP and x-vector |
CN109817198B (en) * | 2019-03-06 | 2021-03-02 | 广州多益网络股份有限公司 | Speech synthesis method, apparatus and storage medium |
CN109817198A (en) * | 2019-03-06 | 2019-05-28 | 广州多益网络股份有限公司 | Multiple sound training method, phoneme synthesizing method and device for speech synthesis |
CN110060701A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on VAWGAN-AC |
CN110060701B (en) * | 2019-04-04 | 2023-01-31 | 南京邮电大学 | Many-to-many speech conversion method based on VAWGAN-AC |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110010144A (en) * | 2019-04-24 | 2019-07-12 | 厦门亿联网络技术股份有限公司 | Voice signals enhancement method and device |
CN111951810A (en) * | 2019-05-14 | 2020-11-17 | 国际商业机器公司 | High quality non-parallel many-to-many voice conversion |
CN110459232A (en) * | 2019-07-24 | 2019-11-15 | 浙江工业大学 | A Speech Conversion Method Based on Recurrent Generative Adversarial Networks |
CN110600012A (en) * | 2019-08-02 | 2019-12-20 | 特斯联(北京)科技有限公司 | Fuzzy speech semantic recognition method and system for artificial intelligence learning |
CN110570873B (en) * | 2019-09-12 | 2022-08-05 | Oppo广东移动通信有限公司 | Voiceprint wake-up method and device, computer equipment and storage medium |
CN110570873A (en) * | 2019-09-12 | 2019-12-13 | Oppo广东移动通信有限公司 | Voiceprint wake-up method, device, computer equipment and storage medium |
CN110600013A (en) * | 2019-09-12 | 2019-12-20 | 苏州思必驰信息科技有限公司 | Training method and device for non-parallel corpus voice conversion data enhancement model |
WO2021134520A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Voice conversion method, voice conversion training method, intelligent device and storage medium |
CN111433847B (en) * | 2019-12-31 | 2023-06-09 | 深圳市优必选科技股份有限公司 | Voice conversion method, training method, intelligent device and storage medium |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
CN111462769A (en) * | 2020-03-30 | 2020-07-28 | 深圳市声希科技有限公司 | End-to-end accent conversion method |
CN111462769B (en) * | 2020-03-30 | 2023-10-27 | 深圳市达旦数生科技有限公司 | End-to-end accent conversion method |
CN111883149A (en) * | 2020-07-30 | 2020-11-03 | 四川长虹电器股份有限公司 | Voice conversion method and device with emotion and rhythm |
CN111883149B (en) * | 2020-07-30 | 2022-02-01 | 四川长虹电器股份有限公司 | Voice conversion method and device with emotion and rhythm |
US11996117B2 (en) | 2020-10-08 | 2024-05-28 | Modulate, Inc. | Multi-stage adaptive system for content moderation |
CN112309365A (en) * | 2020-10-21 | 2021-02-02 | 北京大米科技有限公司 | Training method, device, storage medium and electronic device for speech synthesis model |
CN112309365B (en) * | 2020-10-21 | 2024-05-10 | 北京大米科技有限公司 | Training method and device of speech synthesis model, storage medium and electronic equipment |
CN112382308A (en) * | 2020-11-02 | 2021-02-19 | 天津大学 | Zero-order voice conversion system and method based on deep learning and simple acoustic features |
CN112735434A (en) * | 2020-12-09 | 2021-04-30 | 中国人民解放军陆军工程大学 | Voice communication method and system with voiceprint cloning function |
CN113345452A (en) * | 2021-04-27 | 2021-09-03 | 北京搜狗科技发展有限公司 | Voice conversion method, training method, device and medium of voice conversion model |
CN113345452B (en) * | 2021-04-27 | 2024-04-26 | 北京搜狗科技发展有限公司 | Voice conversion method, training method, device and medium of voice conversion model |
CN113889130A (en) * | 2021-09-27 | 2022-01-04 | 平安科技(深圳)有限公司 | A voice conversion method, device, equipment and medium |
CN114203154A (en) * | 2021-12-09 | 2022-03-18 | 北京百度网讯科技有限公司 | Training method and device of voice style migration model and voice style migration method and device |
CN115410597A (en) * | 2022-05-24 | 2022-11-29 | 北方工业大学 | A Genetic Algorithm-Based Method for Optimizing Speech Conversion Parameters |
Also Published As
Publication number | Publication date |
---|---|
CN102982809B (en) | 2014-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102982809B (en) | Conversion method for sound of speaker | |
Cooper et al. | Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings | |
Sun et al. | Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis | |
CN112767958B (en) | Zero-order learning-based cross-language tone conversion system and method | |
JP7395792B2 (en) | 2-level phonetic prosody transcription | |
CN103021406B (en) | Robust speech emotion recognition method based on compressive sensing | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN104867489B (en) | A method and system for simulating human reading and pronunciation | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN102063899A (en) | Method for voice conversion under unparallel text condition | |
CN111798874A (en) | Voice emotion recognition method and system | |
Cotescu et al. | Voice conversion for whispered speech synthesis | |
TWI780738B (en) | Abnormal articulation corpus amplification method and system, speech recognition platform, and abnormal articulation auxiliary device | |
Lee et al. | A whispered Mandarin corpus for speech technology applications. | |
Tits et al. | Laughter synthesis: Combining seq2seq modeling with transfer learning | |
Shah et al. | Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing | |
CN113539236B (en) | Speech synthesis method and device | |
CN104376850B (en) | A kind of fundamental frequency estimation method of Chinese ear voice | |
Nazir et al. | Deep learning end to end speech synthesis: A review | |
Zhao et al. | Research on voice cloning with a few samples | |
Zhang et al. | AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents | |
Padmini et al. | Age-Based Automatic Voice Conversion Using Blood Relation for Voice Impaired. | |
CN114550701B (en) | Deep neural network-based Chinese electronic throat voice conversion device and method | |
Othmane et al. | Enhancement of esophageal speech using voice conversion techniques | |
Zu et al. | Research on Tibetan Speech Synthesis Based on Fastspeech2 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |