
CN107103914B - High-quality voice conversion method - Google Patents

High-quality voice conversion method

Info

Publication number
CN107103914B
CN107103914B (application CN201710166971.7A)
Authority
CN
China
Prior art keywords
speech
training
conversion
gmm
parameter
Prior art date
Legal status
Active
Application number
CN201710166971.7A
Other languages
Chinese (zh)
Other versions
CN107103914A (en)
Inventor
李燕萍
崔立梅
吕中良
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201710166971.7A
Publication of CN107103914A
Application granted
Publication of CN107103914B

Classifications

    • G — Physics
    • G10 — Musical instruments; Acoustics
    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/24 — the extracted parameters being the cepstrum
    • G10L13/00 — Speech synthesis; text-to-speech systems
    • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L25/18 — the extracted parameters being spectral information of each sub-band
    • G10L25/27 — characterised by the analysis technique
    • G10L25/48 — specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a high-quality voice conversion method. A self-organizing clustering algorithm (ISODATA) first replaces the K-Means algorithm used to initialize a traditional GMM (Gaussian mixture model), and training and classification of the speaker personality characteristic parameters (MFCCs) is carried out in an iterative loop with the EM (expectation-maximization) algorithm. Bilinear frequency warping and amplitude scaling (BLFW+AS) training then yields the conversion function required for voice conversion, which is used to perform high-quality voice conversion. Exploiting the correlation between the spatial distribution of the speech feature parameters and the Gaussian mixture model, the invention uses the iterative self-organizing clustering algorithm to determine the mixture order, resolving the inaccuracy of a fixed-order Gaussian mixture model when classifying speech feature parameters. Combining the improved Gaussian mixture model with bilinear frequency warping and amplitude scaling, a high-quality voice conversion system is constructed, which has practical value in the voice conversion field.

Description

A high-quality voice conversion method

Technical Field

The invention relates to the field of voice conversion, and in particular to a high-quality voice conversion system and its implementation method.

Background

Voice conversion means changing the voice personality characteristics of a source speaker so that they match those of a target speaker; that is, speech uttered by one person sounds, after conversion, as if it were spoken by another person, while the semantic content is preserved. Two indicators are commonly used to measure the effect of voice conversion: similarity (how close the converted speech is to the target speaker's voice personality characteristics) and quality (the sound quality of the converted speech).

Typical voice conversion methods include statistical mapping methods, represented by the Gaussian mixture model (GMM), which use the minimum mean squared error (MMSE) criterion to minimize the error between the converted speech and the target speech. This achieves good similarity conversion, but the converted sound quality is not ideal. Formant-mapping spectral conversion methods, represented by frequency warping (FW), exploit the correlation between the physiological characteristics of the human vocal tract and the formant parameters; they achieve good sound-quality conversion, but the similarity of the converted speech is unsatisfactory.

To improve on these problems, speech researchers have done much work. Jian Zhihua used the transition probability matrix of the target speech frames to describe their temporal information and a Viterbi search to find the best GMM component for each frame, avoiding the inter-frame spectral discontinuity that traditional GMM-based conversion causes by discarding frame timing information, reducing the over-smoothing of the speech spectrum introduced by weighted averaging, enhancing the formants, and improving the performance of the traditional GMM-based algorithm. Zhang Bing of Soochow University included the global variance in the objective (error) function, mitigating the over-smoothing phenomenon. Jian Zhihua et al. also introduced compressed sensing into the conversion system: in that model, line-spectrum-pair parameter vectors of consecutive frames are compressed into short vectors, which are then used for training and conversion. Bi Xing improved the traditional frequency-warping method with a formant binary-mapping warping scheme, raising the similarity of the converted speech. Daniel Erro combined the GMM and FW techniques, obtaining converted speech that strikes a good balance between similarity and sound quality. However, Erro's approach trains the GMM as a soft classifier of the speech feature parameters with a fixed number of mixture components, which limits the achievable conversion quality: it ignores the fact that the statistical distribution of the speech feature parameters differs between speakers, while the GMM mixture order is closely tied to that distribution.

Summary of the Invention

The purpose of the present invention is to provide a high-quality voice conversion method that takes into account the differences in the statistical distribution of speech feature parameters between speakers. Depending on the target speaker, the iterative self-organizing algorithm ISODATA performs an initial clustering of the feature parameters; the classification is then refined in a self-organizing manner together with the EM algorithm, and the subsequent BLFW+AS training produces the conversion function, realizing high-quality voice conversion. The invention has good practical value and can be used in fields such as film dubbing, speech translation, and secure communication.

The technical scheme adopted by the present invention is divided into a training part and a conversion part, as follows:

1) Training steps:

1-1) Obtain parallel corpora of the source speaker and the target speaker;

1-2) Use the AHOcoder speech analysis model to extract the speech feature parameters and the logarithmic fundamental frequency;

1-3) Apply DTW to the speech feature parameters from step 1-2);

1-4) Use the iterative self-organizing algorithm (ISODATA) in place of the K-Means algorithm of the traditional GMM and, together with the EM algorithm, perform GMM training on the feature parameters from step 1-3), obtaining the GMM parameters λ and P(X|λ);

1-5) Use the posterior conditional probability matrix P(X|λ) from step 1-4) to perform bilinear frequency warping plus amplitude scaling (BLFW+AS) training, obtaining the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ), where x denotes the spectral feature parameters of the speech and λ the GMM model parameters, thereby constructing the BLFW+AS conversion function; use the mean and variance of the logarithmic fundamental frequency to establish the conversion function between the source and target pitch frequencies;

2) Conversion steps:

2-1) Input the source speaker's speech to be converted;

2-2) Use the AHOcoder speech analysis model to extract the feature parameters and the logarithmic fundamental frequency;

2-3) Use the improved GMM and the parameter λ obtained during training to compute the posterior conditional probability matrix P(X|λ);

2-4) Substitute the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ) into the BLFW+AS conversion function to obtain the converted feature parameters;

2-5) Substitute the logarithmic fundamental frequency into the fundamental-frequency conversion function obtained during training to obtain the converted logarithmic fundamental frequency;

2-6) Use the AHOdecoder speech synthesis model to synthesize the converted speech from the converted feature parameters and logarithmic fundamental frequency.

In step 1-4) of the training part, the number of Gaussian components is determined by the ISODATA clustering of the specific distribution of the speaker's speech feature parameters.

In step 1-5) of the training part, the frequency warping factor and the amplitude adjustment factor are obtained by BLFW+AS training from the posterior conditional probability matrix produced by the ISODATA+GMM training.

Beneficial Effects

1. The present invention combines the improved GMM with BLFW+AS to realize a voice conversion system that can self-organize the number of GMM components according to the distribution of each speaker's speech feature parameters, achieving high-quality voice conversion.

2. The present invention realizes a complete high-quality voice conversion system and therefore has practical effect in voice conversion application scenarios.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the training and conversion of the present invention.

FIG. 2 is a flow chart of the speech feature parameter clustering algorithm of the present invention.

FIG. 3 compares the MCD values after voice conversion between the improved GMM combined with BLFW+AS of the present invention and the traditional GMM and GMM+BLFW+AS systems.

FIG. 4 compares the MOS values after voice conversion between the improved GMM combined with BLFW+AS of the present invention and the traditional GMM and GMM+BLFW+AS systems.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

The high-quality voice conversion method of the present invention consists of two parts: the training part obtains the parameters and conversion functions required for voice conversion, and the conversion part converts the source speaker's speech into the target speaker's speech.

1) Training part (FIG. 1), implementation steps:

1-1) Obtain a parallel corpus of the source and target speakers; the open-source CMU ARCTIC corpus of Carnegie Mellon University can be used;

1-2) Use the AHOcoder speech analysis model to extract the Mel-frequency cepstral coefficients (MFCCs) and the logarithmic pitch frequency parameter log f0 of the source and target speakers respectively;

1-3) Apply vocal tract length normalization (VTLN) and dynamic time warping (DTW) to the MFCC parameters of the source and target speech from step (2);
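To make the alignment in step 1-3) concrete, the following is a minimal numpy sketch of DTW over per-frame Euclidean distances; the function name and array conventions are illustrative assumptions, not part of the patent, and VTLN is omitted here.

```python
import numpy as np

def dtw_align(X, Y):
    """Minimal DTW: X is (Tx, D) source MFCC frames, Y is (Ty, D) target
    frames; returns the optimal warping path as (source, target) index pairs."""
    Tx, Ty = len(X), len(Y)
    # Per-frame Euclidean distance matrix, shape (Tx, Ty).
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal path from the end of both sequences.
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```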

1-4) Establish the ISODATA+GMM model, train it with the expectation-maximization (EM) algorithm, and use the ISODATA iterative method to obtain the initial values for the EM training. The traditional Gaussian mixture model is expressed as:

$$P(X|\lambda)=\sum_{q=1}^{Q}\omega_q\,N(X;\mu_q,\Sigma_q) \qquad (1)$$

where ω_q, q=1,…,Q are the mixture weights, with

$$\sum_{q=1}^{Q}\omega_q=1,$$

X denotes a D-dimensional random vector, and N(X;μ_q,Σ_q), q=1,…,Q are the component distributions. Each component is a D-dimensional joint Gaussian probability distribution:

$$N(X;\mu_q,\Sigma_q)=\frac{1}{(2\pi)^{D/2}\,|\Sigma_q|^{1/2}}\exp\left\{-\frac{1}{2}(X-\mu_q)^{T}\Sigma_q^{-1}(X-\mu_q)\right\} \qquad (2)$$

where μ_q is the mean vector and Σ_q the covariance matrix; λ={ω_q,μ_q,Σ_q} are the model parameters of the GMM. λ can be estimated by maximum likelihood (ML), whose aim is to maximize the conditional probability P(X|λ). For the set of speech feature parameter vectors X={x_n, n=1,2,…,N}:

$$P(X|\lambda)=\prod_{n=1}^{N}P(x_n|\lambda) \qquad (3)$$

so that:

$$\lambda^{*}=\arg\max_{\lambda}\prod_{n=1}^{N}P(x_n|\lambda) \qquad (4)$$

Formula (4) is solved with the EM algorithm; during the iterations the condition P(X|λ_k) ≥ P(X|λ_{k−1}) holds, where k is the iteration index, and the procedure runs until the model parameters λ converge. In each iteration, the component posterior and the updates of the Gaussian component weights ω_q, mean vectors μ_q, and covariance matrices Σ_q are:

$$P(q|x_n,\lambda)=\frac{\omega_q\,N(x_n;\mu_q,\Sigma_q)}{\sum_{p=1}^{Q}\omega_p\,N(x_n;\mu_p,\Sigma_p)} \qquad (5)$$

$$\omega_q=\frac{1}{N}\sum_{n=1}^{N}P(q|x_n,\lambda) \qquad (6)$$

$$\mu_q=\frac{\sum_{n=1}^{N}P(q|x_n,\lambda)\,x_n}{\sum_{n=1}^{N}P(q|x_n,\lambda)} \qquad (7)$$

$$\Sigma_q=\frac{\sum_{n=1}^{N}P(q|x_n,\lambda)\,(x_n-\mu_q)(x_n-\mu_q)^{T}}{\sum_{n=1}^{N}P(q|x_n,\lambda)} \qquad (8)$$
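As an illustration, one EM iteration implementing formulas (5)–(8) can be sketched in numpy/scipy; the function and variable names below are chosen for clarity and are not taken from the patent.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM update for a GMM, following formulas (5)-(8).

    X: (N, D) feature vectors; weights: (Q,); means: (Q, D); covs: (Q, D, D).
    """
    N, Q = len(X), len(weights)
    # E-step: responsibilities P(q | x_n, lambda), formula (5).
    resp = np.stack([w * multivariate_normal.pdf(X, m, c)
                     for w, m, c in zip(weights, means, covs)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weight, mean, and covariance updates, formulas (6)-(8).
    Nq = resp.sum(axis=0)
    weights = Nq / N
    means = (resp.T @ X) / Nq[:, None]
    covs = np.stack([
        ((X - means[q]).T * resp[:, q]) @ (X - means[q]) / Nq[q]
        for q in range(Q)])
    return weights, means, covs
```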

In the improved GMM training, the following quantities are first analyzed jointly: the expected number of classes of the feature parameter data, the initial number of cluster centers, the minimum number of samples allowed per class, the upper limit on the relative standard deviation of each feature component within a class, the lower limit on the minimum distance between two class centers, the maximum number of "merge" operations per iteration, and the maximum number of iterations; the Gaussian mixture order is then adjusted dynamically and in a self-organizing way. The training process is as follows:

1. Set the ISODATA parameters: the expected number of classes c, the initial number of cluster centers N_c, the minimum number of samples allowed per class θ_n, the upper limit of the relative standard deviation of each feature component within a class θ_s, the lower limit of the minimum distance between two class centers θ_D, the maximum number of "merge" operations per iteration L, and the maximum number of iterations I.

2. Initialize the GMM with the ISODATA algorithm, as shown in FIG. 2.

(1) Assign each sample of the sample set {x_i} to a class according to the nearest-neighbor rule.

(2) Compute the post-classification parameters: the class centroids, the within-class mean distances, and the overall mean distance.

(a) Compute the centroid of each class:

$$z_j=\frac{1}{N_j}\sum_{x\in\omega_j}x,\quad j=1,2,\ldots,N_c \qquad (9)$$

(b) Compute the mean distance from the samples of each class to its centroid:

$$\bar{D}_j=\frac{1}{N_j}\sum_{x\in\omega_j}\|x-z_j\|,\quad j=1,2,\ldots,N_c \qquad (10)$$

(c) Compute the overall mean distance of all samples to their class centers:

$$\bar{D}=\frac{1}{N}\sum_{j=1}^{N_c}N_j\,\bar{D}_j \qquad (11)$$

3. Perform iterative training with the EM algorithm.

4. If the trained model satisfies the following logical condition:

(number of samples in a class < θ_n) ∪ (number of classes ≥ 2c) ∩ (distance between two class centers < θ_D),

the two components concerned are considered to carry little information and to be similar in composition, and they are merged:

Compute the distances D_ij between the class centers:

$$D_{ij}=\|z_i-z_j\|,\quad i=1,2,\ldots,N_c-1;\; j=i+1,\ldots,N_c \qquad (12)$$

Decide merges by θ_D: compare each D_ij with θ_D, sort the D_ij smaller than θ_D in increasing order, and take the first L. Starting from the smallest D_ij, merge the corresponding pairs of classes and compute the merged cluster centers. Within one iteration, any given class may be merged at most once.

After the merge operation the number of classes is N_c = N_c − (number of classes merged away).

If the trained model satisfies the following logical condition:

$$\left(\sigma_{j\max}>\theta_s\right)\cap\left[\left(\bar{D}_j>\bar{D}\;\cap\;N_j>2(\theta_n+1)\right)\cup\left(N_c\le c/2\right)\right],$$

the Gaussian component is considered to contain excessive information and is split:

(1) Compute the standard-deviation vector of the samples of each class with respect to the class center:

$$\sigma_j=(\sigma_{1j},\sigma_{2j},\ldots,\sigma_{nj}),\quad j=1,2,\ldots,N_c \qquad (13)$$

(2) Compute its components:

$$\sigma_{kj}=\sqrt{\frac{1}{N_j}\sum_{i=1}^{N_j}\left(x_{ki}-z_{kj}\right)^{2}} \qquad (14)$$

where j is the class index, k the component index, and n the vector dimension; z_kj is the k-th component of z_j and x_ki the k-th component of x_i;

(3) Find the largest component σ_jmax of the standard-deviation vector σ_j of each class:

$$\sigma_{j\max}=\max_{k}\,\sigma_{kj} \qquad (15)$$

Split the class ω_j into two classes, discard the original z_j, and let N_c = N_c + 1. The centers z_j⁺ and z_j⁻ of the two new classes are obtained by adding and subtracting kσ_jmax to the corresponding component of the original z_j, leaving the other components unchanged, where 0 ≤ k ≤ 1. After the split, I_p = I_p + 1; go to step 2-(2).

If the iteration count I_p = I or the process has converged, the algorithm ends. Otherwise, I_p = I_p + 1; if the parameters need adjusting, go to step 2-(1); if not, go to step 2-(2) for the next iteration.
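The merge/split decisions described in step 4 can be outlined as follows; this is an illustrative sketch with assumed names and simplified bookkeeping (it only selects which classes to merge or split, without recomputing centers or re-running EM).

```python
import numpy as np

def isodata_merge_split(centers, counts, sigma_max, D_bar_j, D_bar,
                        theta_n, theta_s, theta_D, c, L):
    """Sketch of one ISODATA merge/split decision pass.

    centers: list of class center vectors; counts: samples per class;
    sigma_max: largest within-class std-dev per class (formula (15));
    D_bar_j / D_bar: per-class and overall mean distances ((10), (11)).
    """
    Nc = len(centers)
    # Merge: pairwise center distances below theta_D, at most L merges,
    # taken in increasing order of distance (formula (12)).
    pairs = sorted(
        (np.linalg.norm(centers[i] - centers[j]), i, j)
        for i in range(Nc - 1) for j in range(i + 1, Nc))
    merges = [(i, j) for d, i, j in pairs if d < theta_D][:L]
    # Split: the logical test given in the text above.
    splits = [j for j in range(Nc)
              if sigma_max[j] > theta_s
              and ((D_bar_j[j] > D_bar and counts[j] > 2 * (theta_n + 1))
                   or Nc <= c / 2)]
    return merges, splits
```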

5. When the ISODATA+GMM training finishes, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.

1-5) Train with the source speech feature parameters X and the target speech feature parameters Y obtained in step (3), together with the posterior conditional probability matrix P(X|λ) obtained in step (4), to obtain the frequency warping factor and the amplitude adjustment factor, thereby constructing the bilinear frequency warping (BLFW) and amplitude scaling (AS) voice conversion function:

$$F(x)=W_{\alpha(x,\lambda)}\,x+s(x,\lambda) \qquad (16)$$

$$\alpha(x,\lambda)=\sum_{q=1}^{Q}P(q|x,\lambda)\,\alpha_q \qquad (17)$$

$$s(x,\lambda)=\sum_{q=1}^{Q}P(q|x,\lambda)\,s_q \qquad (18)$$

$$P(q|x,\lambda)=\frac{\omega_q\,N(x;\mu_q,\Sigma_q)}{\sum_{p=1}^{Q}\omega_p\,N(x;\mu_p,\Sigma_p)} \qquad (19)$$
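A hedged sketch of applying formulas (16)–(19) to a single frame follows; `blfw_matrix` stands in for the construction of the bilinear frequency-warping matrix, which the patent text does not spell out, and all names here are illustrative assumptions.

```python
import numpy as np

def convert_frame(x, post, alphas, shifts, blfw_matrix):
    """Apply F(x) = W_{alpha(x,lambda)} x + s(x,lambda), formulas (16)-(19).

    x: (D,) source cepstral frame; post: (Q,) posteriors P(q|x, lambda);
    alphas: (Q,) per-component warping factors alpha_q;
    shifts: (Q, D) per-component cepstral offsets s_q;
    blfw_matrix(alpha): returns the (D, D) warping matrix (assumed helper).
    """
    alpha = float(post @ alphas)          # formula (17)
    s = post @ shifts                     # formula (18), a (D,) offset
    return blfw_matrix(alpha) @ x + s     # formula (16)
```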

1-6) Establish the conversion relation between the source and target pitch frequencies:

$$\log f_{0Y}=\mu_Y+\frac{\sigma_Y}{\sigma_X}\left(\log f_{0X}-\mu_X\right) \qquad (20)$$

where μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0 of the corresponding speaker.
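Formula (20) is a per-frame linear transform in the log-F0 domain; a minimal numpy sketch with assumed names:

```python
import numpy as np

def convert_logf0(logf0_x, mu_x, sigma_x, mu_y, sigma_y):
    """Mean/variance log-F0 transform, formula (20); logf0_x is an array
    of source log-F0 values, the statistics come from the training data."""
    return mu_y + (sigma_y / sigma_x) * (np.asarray(logf0_x) - mu_x)
```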

1-7) Through the above steps we have established the conversion relation between the source and target speech feature parameters, formula (16), and the conversion relation between the source and target logarithmic pitch frequencies, formula (20).

2) Conversion part (FIG. 1), implementation steps:

2-1) Input the source speaker's speech to be converted;

2-2) Use the AHOcoder speech analysis model to extract the 20th-order MFCC feature parameters X of the source speaker and the source logarithmic pitch frequency parameter log f0X;

2-3) Substitute λ={ω_q,μ_q,Σ_q} obtained from the ISODATA+GMM training and the feature parameters X extracted in step (2) into formula (1) to obtain the posterior conditional probability matrix P(X|λ);

2-4) Substitute the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ) obtained from the BLFW+AS training, together with the posterior conditional probability matrix P(X|λ) from step (3), into formulas (16), (17), (18), and (19) to obtain the MFCC feature parameters Y of the converted speech;

2-5) Substitute the source logarithmic pitch frequency parameter log f0X obtained in step (2) into formula (20) to obtain the logarithmic pitch frequency parameter log f0Y of the converted speech;

2-6) Use the AHOdecoder speech synthesis model, taking Y from step (4) and log f0Y from step (5) as input, to obtain the converted speech.

3) Performance evaluation.

3-1) Mel-cepstral distortion (MCD) is an objective performance indicator that mainly reflects speaker similarity after conversion:

$$\mathrm{MCD}=\frac{1}{T}\sum_{t=1}^{T}\frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{19}\left(V_d^{X}(t)-V_d^{Y}(t)\right)^{2}} \qquad (21)$$

where the MFCC parameters are 20-dimensional Mel-cepstral parameters, denoted V_d(t) with 0 ≤ d ≤ 19; the first dimension is dropped when computing the MCD. T is the total number of frames after DTW alignment of the MFCCs. The smaller the MCD value, the higher the speaker similarity after conversion. FIG. 3 compares the MCD values after voice conversion between the improved GMM combined with BLFW+AS and the traditional GMM and GMM+BLFW+AS systems.
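Formula (21) can be computed directly from DTW-aligned MFCC matrices; a minimal numpy sketch under the conventions stated above (20 dimensions, first dimension dropped), with names assumed:

```python
import numpy as np

def mcd(V_conv, V_tgt):
    """Mel-cepstral distortion, formula (21).

    V_conv, V_tgt: (T, 20) DTW-aligned MFCC sequences; column 0 is dropped.
    """
    diff = V_conv[:, 1:] - V_tgt[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```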

3-2) The mean opinion score (MOS) is a subjective performance indicator that mainly reflects the quality of the synthesized speech.

The MOS standard divides the quality of the test speech into five grades; the detailed scoring system is shown in Table 1.

Table 1. MOS evaluation criteria

Score  Speech level  Evaluation criterion
5      Excellent     High sound quality, no noise
4      Good          Fairly high sound quality, noise essentially inaudible
3      Fair          Moderate sound quality, slight noise
2      Poor          Medium-to-low sound quality, considerable noise
1      Bad           Low sound quality, unbearable

The higher the MOS test score, the higher the intelligibility and naturalness of the converted speech. The mathematical expression of the MOS score is:

$$\mathrm{MOS}=\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\mathrm{score}_{n,m} \qquad (22)$$

where M is the total number of listeners, N the number of test sentences, and score_{n,m} the score given by the m-th listener to the n-th test sentence. The larger the MOS value, the higher the quality of the synthesized speech. FIG. 4 compares the MOS values after voice conversion between the improved GMM combined with BLFW+AS and the traditional GMM and GMM+BLFW+AS systems.
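Formula (22) is a plain average over listeners and sentences; a trivial numpy sketch, with the score-matrix layout assumed:

```python
import numpy as np

def mos(scores):
    """Mean opinion score, formula (22); scores: (N, M) matrix where
    scores[n, m] is listener m's rating of test sentence n."""
    return float(np.mean(scores))
```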

In summary, a high-quality voice conversion system should satisfy all three criteria above: the spectrogram of the converted speech should be closer to that of the target speech, with a lower degree of spectral distortion; relative to the other models, the converted speech should have the smallest MCD value, indicating lower spectral distortion and better spectral similarity to the target; and it should have the largest MOS value, indicating better sound quality of the converted speech.

The above is merely a preferred example presented for a detailed and exemplary description of the present invention; technical solutions obtained by those skilled in the art through various equivalent substitutions based on the above specific example shall all fall within the scope of the claims of the present invention and their equivalents.

Claims (3)

1. A high-quality voice conversion method, comprising a training part and a conversion part:
1) Training steps:
1-1) obtain parallel corpora of the source speaker and the target speaker;
1-2) use the AHOcoder speech analysis model to extract speech feature parameters and the logarithmic fundamental frequency;
1-3) apply DTW to the speech feature parameters from step 1-2);
2) Conversion steps:
2-1) input the source speaker's speech to be converted;
2-2) use the AHOcoder speech analysis model to extract the feature parameters and the logarithmic fundamental frequency;
2-3) use ISODATA+GMM and the parameter λ obtained during training to compute the posterior conditional probability matrix;
2-4) substitute the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ) into the bilinear frequency warping and amplitude scaling conversion function to obtain the converted feature parameters, where x denotes the spectral feature parameters of the speech and λ the GMM model parameters;
2-5) substitute the logarithmic fundamental frequency into the fundamental-frequency conversion function obtained during training to obtain the converted logarithmic fundamental frequency;
2-6) use the AHOdecoder speech synthesis model to synthesize the converted speech from the converted feature parameters and logarithmic fundamental frequency;
characterized in that the training part further comprises:
1-4) using the iterative self-organizing algorithm ISODATA to set the initial values for GMM training on the feature parameters from step 1-3), and performing the GMM training with the EM algorithm to obtain the GMM parameters λ and the posterior conditional probability matrix P(X|λ), where X denotes the set of spectral feature parameters of the speech;
1-5) using the posterior conditional probability matrix P(X|λ) from step 1-4) to perform BLFW+AS training, obtaining the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ), thereby constructing the BLFW+AS conversion function; and using the mean and variance of the logarithmic fundamental frequency to establish the conversion function between the source and target pitch frequencies.

2. The high-quality voice conversion method according to claim 1, characterized in that in step 1-4) of the training part the parameter λ is determined according to the specific distribution of the speaker's speech feature parameters.

3. The high-quality voice conversion method according to claim 1, characterized in that in step 1-5) of the training part the frequency warping factor and the amplitude adjustment factor are trained from the posterior conditional probability matrix obtained by the ISODATA+GMM training.
CN201710166971.7A 2017-03-20 2017-03-20 High-quality voice conversion method Active CN107103914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710166971.7A CN107103914B (en) 2017-03-20 2017-03-20 High-quality voice conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710166971.7A CN107103914B (en) 2017-03-20 2017-03-20 High-quality voice conversion method

Publications (2)

Publication Number Publication Date
CN107103914A CN107103914A (en) 2017-08-29
CN107103914B (en) 2020-06-16

Family

ID=59675474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710166971.7A Active CN107103914B (en) 2017-03-20 2017-03-20 High-quality voice conversion method

Country Status (1)

Country Link
CN (1) CN107103914B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107785030B (en) * 2017-10-18 2021-04-30 杭州电子科技大学 Voice conversion method
CN107919115B (en) * 2017-11-13 2021-07-27 河海大学 A Feature Compensation Method Based on Nonlinear Spectral Transform
CN108198566B (en) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 Information processing method and device, electronic device and storage medium
CN109671423B (en) * 2018-05-03 2023-06-02 南京邮电大学 Non-parallel text-to-speech conversion method under limited training data
CN114648974B (en) * 2020-12-17 2025-02-18 南京理工大学 Speech synthesis method and system based on speech radar and deep learning
CN115186005A (en) * 2022-06-16 2022-10-14 上海船舶运输科学研究所有限公司 A method and system for classifying working conditions of a ship's main engine

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2352948A (en) * 1999-07-13 2001-02-07 Racal Recorders Ltd Voice activity monitoring
JP2003022088A (en) * 2001-07-10 2003-01-24 Sharp Corp Device and method for speaker's features extraction, voice recognition device, and program recording medium
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN103035236A (en) * 2012-11-27 2013-04-10 河海大学常州校区 High-quality voice conversion method based on modeling of signal timing characteristics

Also Published As

Publication number Publication date
CN107103914A (en) 2017-08-29

Similar Documents

Publication Publication Date Title
CN107103914B (en) High-quality voice conversion method
CN107301859B (en) Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering
Kleijn et al. Generative speech coding with predictive variance regularization
Sun et al. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training
Mandasari et al. Quality measure functions for calibration of speaker recognition systems in various duration conditions
WO2020073694A1 (en) Voiceprint identification method, model training method and server
CN105261367B (en) A method of speaker recognition
CN110634502A (en) Single-channel Speech Separation Algorithm Based on Deep Neural Network
CN104217721B (en) Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN102063899A (en) Method for voice conversion under unparallel text condition
Paul et al. Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion
CN110648684A (en) A WaveNet-based Bone Conduction Speech Enhancement Waveform Generation Method
Jalil et al. Speaker identification using convolutional neural network for clean and noisy speech samples
Song et al. Non-parallel training for voice conversion based on adaptation method
CN102930863B (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
Zheng et al. Text-independent speaker identification using gmm-ubm and frame level likelihood normalization
Wu et al. Mixture of factor analyzers using priors from non-parallel speech for voice conversion
Gallardo et al. I-vector speaker verification for speech degraded by narrowband and wideband channels
McLaren et al. Softsad: Integrated frame-based speech confidence for speaker recognition
CN107068165B (en) A method of voice conversion
Ijima et al. Objective Evaluation Using Association Between Dimensions Within Spectral Features for Statistical Parametric Speech Synthesis.
Xie et al. Voice conversion with SI-DNN and KL divergence based mapping without parallel training data
Moinuddin et al. Speaker Identification based on GFCC using GMM
Gonzalez-Rodriguez Speaker Recognition Using Temporal Contours in Linguistic Units: The Case of Formant and Formant-Bandwidth Trajectories.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant