CN107301859B - Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering - Google Patents
Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering
- Publication number
- CN107301859B (application number CN201710474281.8A)
- Authority
- CN
- China
- Prior art keywords
- speech
- gaussian
- feature parameter
- training
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a speech conversion method for non-parallel text conditions based on adaptive Gaussian clustering, belonging to the technical field of speech signal processing. First, the speech feature parameters of the non-parallel corpus are aligned by a method combining unit selection with vocal tract length normalization; then an adaptive Gaussian mixture model and a bilinear frequency warping plus amplitude scaling model are trained to obtain the conversion function required for speech conversion; finally, this conversion function is used to achieve high-quality speech conversion. The invention not only removes the requirement for a parallel corpus in the training stage and realizes speech conversion under non-parallel text conditions, giving it stronger applicability and generality, but also replaces the traditional Gaussian mixture model with an adaptive Gaussian mixture model, solving the problem of inaccurate classification of speech feature parameters by the Gaussian mixture model. Combining the adaptive Gaussian mixture model with bilinear frequency warping plus amplitude scaling yields better speaker similarity and speech quality in the converted speech.
Description
Technical Field
The invention relates to speech conversion technology, and in particular to a speech conversion method under non-parallel text conditions, belonging to the technical field of speech signal processing.
Background Art
Speech conversion is an emerging research branch in the field of speech signal processing in recent years; it builds on, and has developed from, research on speech analysis, recognition and synthesis.
The goal of speech conversion is to change the personality characteristics of the source speaker's voice so that it takes on the personality characteristics of the target speaker's voice, that is, to make speech uttered by one person sound, after conversion, as if it were uttered by another person, while preserving the semantic content.
Most speech conversion methods, especially GMM-based methods, require the training corpus to be parallel text: the source and target speakers must utter sentences with the same linguistic content and duration, keeping their speaking rhythm, emotion and so on as consistent as possible. In practical applications of speech conversion, however, obtaining a large amount of parallel corpus is difficult or even impossible, and the accuracy of feature-parameter vector alignment during training also constrains the performance of the speech conversion system. Whether considered from the standpoint of generality or of practicality, research on speech conversion methods under non-parallel text conditions therefore has great practical significance and application value.
At present there are two main kinds of speech conversion methods for non-parallel text: methods based on speech clustering and methods based on parameter adaptation. Speech-clustering methods select corresponding speech units for conversion by measuring the distance between speech frames or under the guidance of phoneme information; in essence they turn non-parallel text into parallel text under certain conditions. Their principle is simple, but the speech and text content must be pre-extracted, and the result of this pre-extraction directly affects the conversion quality. Parameter-adaptation methods process the parameters of the conversion model with the speaker normalization or adaptation techniques used in speech recognition; in essence they transform a pre-built model toward a model of the target speaker. Such methods can make reasonable use of pre-stored speaker information, but the adaptation process usually smooths the spectrum, so the speaker individuality in the converted speech is weak.
Summary of the Invention
The technical problem to be solved by the invention is to provide a speech conversion method that, under non-parallel text conditions, adaptively determines the GMM mixture number according to the target speaker, thereby enhancing the speaker individuality of the converted speech while improving its quality.
To solve the above technical problem, the invention adopts the following technical solution:
The invention proposes a speech conversion method for non-parallel text conditions based on adaptive Gaussian clustering, comprising a training phase and a conversion phase, wherein the training phase comprises the following steps:
Step 1: input the non-parallel training corpora of the source speaker and the target speaker;
Step 2: use the AHOcoder speech analysis model to extract the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
Step 3: apply speech feature parameter alignment combining unit selection with vocal tract length normalization, together with dynamic time warping, to the MFCC feature parameters X and Y of step 2, thereby turning the non-parallel corpus into a parallel one;
Step 4: train the adaptive Gaussian mixture model (AGMM) with the expectation-maximization (EM) algorithm; when AGMM training finishes, obtain the posterior conditional probability matrix P(X|λ) and save the AGMM parameters λ;
Step 5: using the source speech feature parameters X and the target speech feature parameters Y obtained in step 3, perform bilinear frequency warping (BLFW) plus amplitude scaling (AS) training with the posterior conditional probability matrix P(X|λ) from step 4 to obtain the frequency warping factor α(x,λ) and the amplitude scaling factor s(x,λ), thereby constructing the BLFW+AS conversion function; use the means and variances of the logarithmic fundamental frequencies to build the fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
The conversion phase comprises the following steps:
Step 6: input the source speaker speech to be converted;
Step 7: use the AHOcoder speech analysis model to extract the MFCC feature parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker's speech;
Step 8: using the parameters λ obtained from the AGMM training in step 4, compute the posterior conditional probability matrix P′(X|λ);
Step 9: using the BLFW+AS conversion function obtained in step 5, compute the converted MFCC feature parameters Y′;
Step 10: using the fundamental frequency conversion function obtained in step 5, obtain the converted logarithmic fundamental frequency log f0Y′ from the logarithmic fundamental frequency log f0X′;
Step 11: use the AHOdecoder speech synthesis model to synthesize the converted speech from the converted MFCC feature parameters Y′ and the logarithmic fundamental frequency log f0Y′.
Further, in the speech conversion method proposed by the invention, the specific procedure of step 3 is as follows:
3-1) apply vocal tract length normalization to the source speech MFCC feature parameters using the bilinear frequency warping method;
3-2) for the given N source speech MFCC feature parameter vectors {Xk}, dynamically search through formula (1) for the N target speech feature parameter vectors {Yk} that minimize the distance cost function C({Yk});
C({Yk}) = C1({Yk}) + C2({Yk})   (1)
where C1({Yk}) and C2({Yk}) are given by formulas (2) and (3), respectively: C1({Yk}) denotes the spectral-distance cost function between the source speech feature parameter vectors and the target speech feature parameter vectors, and C2({Yk}) denotes the spectral-distance cost function between consecutively selected target speech feature parameter vectors; the function D(Xk,Yk) denotes the spectral distance between the source and target feature parameter vectors, and the parameter γ, with 0 ≤ γ ≤ 1, is the balance coefficient between the accuracy of feature-parameter frame alignment and inter-frame continuity;
3-3) by performing multiple linear regression analysis on formula (1), the set of target speech feature parameter sequences aligned with the source speech feature parameter vectors is obtained as the minimizer of C({Yk}), given by formula (4).
Through the above steps, the non-parallel MFCC feature parameters X and Y are turned into a parallel corpus.
Further, in the speech conversion method proposed by the invention, formula (4) is solved with the Viterbi search method to improve the execution efficiency of the algorithm.
Further, in the speech conversion method proposed by the invention, the training process of step 4 is as follows:
4-1) set the AGMM initial mixture number M, the Gaussian component weight thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;
4-2) use the K-means iterative algorithm to obtain the initial values for EM training;
4-3) use the EM algorithm for iterative training; the Gaussian mixture model (GMM) is expressed as follows:
P(X|λ) = Σ_{i=1}^{M} P(wi) N(X; μi, Σi)   (5)
where X is a P-dimensional speech feature parameter vector with P = 39; P(wi) is the weight coefficient of each Gaussian component, with Σ_{i=1}^{M} P(wi) = 1; M is the number of Gaussian components; and N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a component, expressed as
N(X; μi, Σi) = 1 / ((2π)^{P/2} |Σi|^{1/2}) · exp(−(1/2)(X − μi)^T Σi^{−1}(X − μi))   (6)
where μi is the mean vector and Σi the covariance matrix; λ = {P(wi), μi, Σi} are the model parameters of the GMM. λ is estimated by maximum likelihood; for the speech feature parameter vector set X = {xn, n = 1, 2, ..., N},
P(X|λ) = Π_{n=1}^{N} P(xn|λ)   (7)
At this point
λ = argmax_λ P(X|λ)   (8)
Formula (8) is solved with the EM algorithm; during the EM computation the iterations satisfy P(X|λk) ≥ P(X|λk−1), where k denotes the iteration number, until the model parameters λ converge. In each iteration the Gaussian component weight coefficients P(wi), mean vectors μi and covariance matrices Σi are updated as follows:
P(wi) = (1/N) Σ_{n=1}^{N} P(wi|xn, λ)   (9)
μi = Σ_{n=1}^{N} P(wi|xn, λ) xn / Σ_{n=1}^{N} P(wi|xn, λ)   (10)
Σi = Σ_{n=1}^{N} P(wi|xn, λ)(xn − μi)(xn − μi)^T / Σ_{n=1}^{N} P(wi|xn, λ)   (11)
where P(wi|xn, λ) = P(wi) N(xn; μi, Σi) / Σ_{j=1}^{M} P(wj) N(xn; μj, Σj) is the posterior probability of component wi given vector xn.
4-4) if a Gaussian component N(P(wi), μi, Σi) of the trained model has a weight coefficient smaller than t1, and the Euclidean distance to its nearest-neighbour component N(P(wj), μj, Σj) is smaller than the threshold D, the two components are merged into a single Gaussian component;
after merging, the number of Gaussian components becomes M−1 and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is merged;
4-5) if a Gaussian component N(P(wi), μi, Σi) of the trained model has a weight coefficient larger than t2, and the variance of at least one dimension of its covariance matrix exceeds σ, the component is considered to contain excess information and is split into two components; in the splitting formula, E is a column vector of all ones and n is used to adjust the Gaussian distribution. After splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the procedure returns to step 4-3) for the next round of training;
4-6) when AGMM training finishes, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.
Further, in the speech conversion method proposed by the invention, the BLFW+AS conversion function constructed in step 5 is expressed as follows:
F(x) = Wα(x,λ) x + s(x,λ)   (15)
where M is the number of Gaussian components of the Gaussian mixture model in step 4, α(x,λ) is the frequency warping factor, and s(x,λ) is the amplitude scaling factor.
Further, in the speech conversion method proposed by the invention, step 5 establishes the conversion relationship between the source speech pitch frequency and the target speech pitch frequency as
log f0Y = μY + (σY / σX)(log f0X − μX)
where μ and σ2 denote the mean and variance of the logarithmic pitch frequency log f0, computed separately for the source speech (subscript X) and the target speech (subscript Y).
Compared with the prior art, the invention, by adopting the above technical solution, achieves the following technical effects:
1. The invention realizes speech conversion under non-parallel text conditions, solves the problem that parallel corpora are difficult to obtain, and improves the generality and practicality of the speech conversion system.
2. The invention combines AGMM with BLFW+AS to realize the speech conversion system; the system can adaptively adjust the number of GMM classes according to the distribution of the speech feature parameters of different speakers, improving speech quality while enhancing the similarity of speaker individuality, and thus achieves high-quality speech conversion.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the non-parallel text speech conversion of the invention.
FIG. 2 is a flowchart of the adaptive Gaussian mixture model training.
FIG. 3 is a comparison of spectrograms of the converted speech.
Detailed Description of the Embodiments
The technical solution of the invention is described in further detail below with reference to the accompanying drawings:
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.
The high-quality speech conversion method of the invention is divided into two parts: the training part is used to obtain the parameters and conversion functions required for speech conversion, and the conversion part is used to convert the source speaker's speech into the target speaker's speech.
As shown in FIG. 1, the training part is implemented in the following steps:
Step 1: input the non-parallel speech corpora of the source speaker and the target speaker. The non-parallel corpus is taken from the CMU_US_ARCTIC corpus, built by the Language Technologies Institute of Carnegie Mellon University; the speech in the corpus was recorded by 5 male and 2 female speakers, each of whom recorded 1132 utterances ranging from 1 to 6 s.
Step 2: the invention uses the AHOcoder speech analysis model to extract the Mel-frequency cepstral coefficients (MFCC) X and Y and the logarithmic pitch frequency parameters log f0X and log f0Y of the source speaker and the target speaker, respectively. AHOcoder is a high-performance speech analysis and synthesis tool built by the team of Daniel Erro at the Aholab Signal Processing Laboratory in Bilbao, Spain.
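The patent specifies AHOcoder for analysis (and AHOdecoder for synthesis). As an illustrative stand-in only, the sketch below uses librosa, which is not the tool named by the patent, to produce 39-dimensional MFCC vectors (13 statics plus deltas and delta-deltas, one common way to reach P = 39) and per-frame log f0; the sampling rate, pitch search range and feature layout are assumptions rather than values taken from the patent.

```python
# Hedged stand-in for step 2: AHOcoder is the analysis tool named by the patent;
# librosa is used here only to illustrate the kind of features the later steps consume.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(wav_path, sr=sr)
    # 13 static MFCCs plus delta and delta-delta -> 39-dimensional frames (P = 39)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    # Per-frame fundamental frequency; unvoiced frames come back as NaN and are dropped
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    log_f0 = np.log(f0[voiced])
    return feats.T, log_f0        # feats.T: one 39-dimensional row vector per frame

# Usage (hypothetical file names):
# X, log_f0X = extract_features("source_utterance.wav")
# Y, log_f0Y = extract_features("target_utterance.wav")
```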
Step 3: apply speech feature parameter alignment combining unit selection with vocal tract length normalization (VTLN), together with dynamic time warping (DTW), to the MFCC parameters X and Y of the source and target speech from step 2. The specific procedure of the feature parameter alignment is as follows:
3-1) The bilinear frequency warping method is used to apply vocal tract length normalization to the source speech feature parameters, so that the formants of the source speech move closer to those of the target speech, thereby increasing the accuracy with which unit selection picks the target speech feature parameters.
3-2) For the given N source speech feature parameter vectors {Xk}, the N target speech feature parameter vectors {Yk} that minimize the distance cost function C({Yk}) can be found dynamically through formula (1). Two factors are considered in the unit selection process: on the one hand, the spectral distance between each aligned pair of source and target feature parameter vectors should be as small as possible, to strengthen the match of phoneme information; on the other hand, the selected target feature parameter vectors should have frame continuity, so that the phoneme information is more complete.
C({Yk}) = C1({Yk}) + C2({Yk})   (1)
where C1({Yk}) and C2({Yk}) are given by formulas (2) and (3), respectively. The function D(Xk,Yk) denotes the spectral distance between source and target feature parameter vectors; the invention uses the Euclidean distance as the distance measure. The parameter γ, with 0 ≤ γ ≤ 1, is the balance coefficient between the accuracy of feature-parameter frame alignment and inter-frame continuity. C1({Yk}) denotes the spectral-distance cost function between the source speech feature parameter vectors and the feature parameter vectors of the target speech, and C2({Yk}) denotes the spectral-distance cost function between the feature parameter vectors of the target speech selected by unit selection.
3-3) By performing multiple linear regression analysis on formula (1), the set of feature parameter sequences aligned with the source speech feature parameter vectors is obtained as the minimizer of C({Yk}), given by formula (4).
Formula (4) can be solved with the Viterbi search method to improve the execution efficiency of the algorithm.
Through the above steps, the non-parallel MFCC parameters X and Y are turned into parallel ones.
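Formulas (2) to (4) are not reproduced in the text above, so the sketch below fills them in with assumptions: the alignment cost combines the Euclidean distance of each aligned source/target pair with a γ-weighted continuity cost between consecutively selected target frames, and the minimisation of formula (4) is carried out by a Viterbi-style dynamic programme over a pruned candidate set, as the description suggests. The candidate count and the value of γ are illustrative.

```python
# Sketch of the unit-selection alignment of steps 3-2)/3-3); the exact cost
# terms are assumptions, since formulas (2)-(4) are not reproduced in the text.
import numpy as np

def align_by_unit_selection(X, Y, gamma=0.3, n_candidates=20):
    """X: (N, 39) VTLN-normalized source frames; Y: (T, 39) target frames.
    Returns one selected target frame index per source frame."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)     # (N, T)
    cand = np.argsort(dists, axis=1)[:, :n_candidates]                # prune to K candidates
    N, K = cand.shape

    local = dists[np.arange(N)[:, None], cand]       # C1-like term: source/target distance
    back = np.zeros((N, K), dtype=int)
    acc = local[0].copy()
    for k in range(1, N):
        # C2-like term: continuity between consecutively selected target frames
        trans = gamma * np.linalg.norm(
            Y[cand[k]][:, None, :] - Y[cand[k - 1]][None, :, :], axis=2)   # (K, K)
        total = acc[None, :] + trans                  # previous candidate j -> current i
        back[k] = np.argmin(total, axis=1)
        acc = local[k] + total[np.arange(K), back[k]]

    # Backtrack the minimum-cost path (Viterbi-style solution of formula (4))
    path = np.empty(N, dtype=int)
    path[-1] = int(np.argmin(acc))
    for k in range(N - 1, 0, -1):
        path[k - 1] = back[k, path[k]]
    return cand[np.arange(N), path]
```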
Step 4: build the adaptive Gaussian mixture model (AGMM), train it with the expectation-maximization (EM) algorithm, and use K-means iteration to obtain the initial values for EM training. The AGMM parameters λ and P(X|λ) are obtained through training.
As shown in FIG. 2, when the adaptive clustering algorithm is used to train the AGMM parameters, the weight coefficient, mean vector and covariance matrix of each Gaussian component and the Euclidean distances between feature parameter vectors are first analysed jointly, and the Gaussian mixture number is adjusted dynamically. The training process is as follows:
4-1) Set the AGMM initial mixture number M, the Gaussian component weight thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ.
4-2) Use the K-means iterative algorithm to obtain the initial values for EM training.
4-3) Use the EM algorithm for iterative training.
The traditional Gaussian mixture model is expressed as follows:
P(X|λ) = Σ_{i=1}^{M} P(wi) N(X; μi, Σi)   (5)
where X is a P-dimensional speech feature parameter vector (the invention uses P = 39), P(wi) is the weight coefficient of each Gaussian component with Σ_{i=1}^{M} P(wi) = 1, M is the number of Gaussian components, and N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a component, expressed as
N(X; μi, Σi) = 1 / ((2π)^{P/2} |Σi|^{1/2}) · exp(−(1/2)(X − μi)^T Σi^{−1}(X − μi))   (6)
where μi is the mean vector and Σi the covariance matrix; λ = {P(wi), μi, Σi} are the model parameters of the GMM. λ can be estimated by maximum likelihood (ML) estimation, whose purpose is to maximize the conditional probability P(X|λ); for the speech feature parameter vector set X = {xn, n = 1, 2, ..., N},
P(X|λ) = Π_{n=1}^{N} P(xn|λ)   (7)
At this point
λ = argmax_λ P(X|λ)   (8)
Formula (8) can be solved with the EM algorithm; during the EM computation the iterations satisfy P(X|λk) ≥ P(X|λk−1), where k denotes the iteration number, until the model parameters λ converge. In each iteration the Gaussian component weight coefficients P(wi), mean vectors μi and covariance matrices Σi are updated as follows:
P(wi) = (1/N) Σ_{n=1}^{N} P(wi|xn, λ)   (9)
μi = Σ_{n=1}^{N} P(wi|xn, λ) xn / Σ_{n=1}^{N} P(wi|xn, λ)   (10)
Σi = Σ_{n=1}^{N} P(wi|xn, λ)(xn − μi)(xn − μi)^T / Σ_{n=1}^{N} P(wi|xn, λ)   (11)
where P(wi|xn, λ) = P(wi) N(xn; μi, Σi) / Σ_{j=1}^{M} P(wj) N(xn; μj, Σj) is the posterior probability of component wi given vector xn.
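A minimal sketch of one EM iteration implementing updates (9) to (11) follows. The diagonal-covariance restriction and the small variance floor are simplifications assumed here for brevity and numerical stability; the patent does not restrict the covariance form.

```python
# One EM iteration for the GMM of formulas (5)-(11), diagonal covariances assumed.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """X: (N, P) feature frames; weights: (M,); means: (M, P); covs: (M, P) diagonals."""
    N, _ = X.shape
    M = len(weights)
    resp = np.empty((N, M))
    for i in range(M):
        resp[:, i] = weights[i] * multivariate_normal.pdf(X, means[i], np.diag(covs[i]))
    resp /= resp.sum(axis=1, keepdims=True)          # P(w_i | x_n, lambda)

    Nk = resp.sum(axis=0)                            # effective count per component
    weights = Nk / N                                 # update (9)
    means = (resp.T @ X) / Nk[:, None]               # update (10)
    covs = np.stack([                                # update (11), diagonal part only
        (resp[:, i:i + 1] * (X - means[i]) ** 2).sum(axis=0) / Nk[i] + 1e-6
        for i in range(M)])
    return weights, means, covs
```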
4-4) If a Gaussian component N(P(wi), μi, Σi) of the trained model has a weight coefficient smaller than t1, and the Euclidean distance to its nearest-neighbour component N(P(wj), μj, Σj) is smaller than the threshold D, the two components are considered to contain little information and to be similar in composition, and they can be merged into a single Gaussian component.
After merging, the number of Gaussian components becomes M−1 and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is merged.
4-5) If a Gaussian component N(P(wi), μi, Σi) of the trained model has a weight coefficient larger than t2, and the variance of at least one dimension of the covariance matrix (the diagonal elements of the covariance matrix are the variances) exceeds σ, the component is considered to contain excess information and is split into two components; in the splitting formula, E is a column vector of all ones and n is used to adjust the Gaussian distribution. After splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the procedure returns to step 4-3) for the next round of training.
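The patent's merging and splitting formulas (12) to (14) are not reproduced in the text above. The sketch below therefore uses a standard moment-matching combination for the merge and, for the split, shifts the mean by ±n standard deviations along E (a column vector of ones); both specific forms are assumptions consistent with, but not dictated by, the description of steps 4-4) and 4-5).

```python
# Hedged sketch of the adaptive merge/split of steps 4-4) and 4-5); the exact
# formulas are assumptions, since (12)-(14) are not reproduced in the text.
import numpy as np

def merge_components(w, mu, cov, i, j):
    """Merge components i and j. w: (M,); mu: (M, P); cov: (M, P) diagonal covariances."""
    wm = w[i] + w[j]
    mum = (w[i] * mu[i] + w[j] * mu[j]) / wm
    covm = (w[i] * (cov[i] + (mu[i] - mum) ** 2) +
            w[j] * (cov[j] + (mu[j] - mum) ** 2)) / wm     # moment-matching combination
    keep = [k for k in range(len(w)) if k not in (i, j)]
    return (np.append(w[keep], wm),
            np.vstack([mu[keep], mum]),
            np.vstack([cov[keep], covm]))                  # M components -> M - 1

def split_component(w, mu, cov, i, n=0.2):
    """Split component i in two; the means move +/- n std along E = ones(P)."""
    E = np.ones(mu.shape[1])
    shift = n * np.sqrt(cov[i]) * E
    w_new = np.append(np.delete(w, i), [w[i] / 2, w[i] / 2])
    mu_new = np.vstack([np.delete(mu, i, axis=0), mu[i] + shift, mu[i] - shift])
    cov_new = np.vstack([np.delete(cov, i, axis=0), cov[i], cov[i]])
    return w_new, mu_new, cov_new                          # M components -> M + 1
```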
4-6) When AGMM training finishes, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.
Step 5: using the source speech feature parameters X and the target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) obtained in step 4, train to obtain the frequency warping factor and the amplitude scaling factor, thereby constructing the bilinear frequency warping (BLFW) plus amplitude scaling (AS) speech conversion function, expressed as follows:
F(x) = Wα(x,λ) x + s(x,λ)   (15)
The conversion relationship between the source speech pitch frequency and the target speech pitch frequency is established as
log f0Y = μY + (σY / σX)(log f0X − μX)   (19)
where μ and σ2 denote the mean and variance of the logarithmic pitch frequency log f0, computed separately for the source speech (subscript X) and the target speech (subscript Y).
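Formulas (16) to (18), which define α(x,λ) and s(x,λ), are not reproduced in the text above. In BLFW+AS systems these per-frame factors are commonly formed as posterior-weighted sums of per-component factors learned in training, and the sketch below adopts that form; the mixture weighting, the numerical construction of the warping matrix Wα from the bilinear all-pass frequency map, and the names alpha_i and s_i are therefore assumptions. The pitch conversion follows formula (19).

```python
# Hedged sketch of applying formula (15) and formula (19) at conversion time.
import numpy as np

def warping_matrix(alpha, P=39, n_grid=512):
    """Numerical cepstral warping matrix for the bilinear map
    beta(w) = w + 2*arctan(alpha*sin(w) / (1 - alpha*cos(w))); identity when alpha = 0."""
    w = np.linspace(0.0, np.pi, n_grid)
    beta = w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))
    m = np.arange(P)[:, None, None]                  # output cepstral index
    n = np.arange(P)[None, :, None]                  # input cepstral index
    W = 2.0 / np.pi * np.trapz(np.cos(m * w) * np.cos(n * beta), w, axis=2)
    W[0] *= 0.5                                      # zeroth cepstral term counted once
    return W

def convert_frame(x, posteriors, alpha_i, s_i):
    """x: 39-dim MFCC frame; posteriors: P(w_i|x); alpha_i: (M,); s_i: (M, 39)."""
    alpha = float(posteriors @ alpha_i)              # alpha(x, lambda), assumed mixture form
    s = posteriors @ s_i                             # s(x, lambda), assumed mixture form
    # W_alpha is rebuilt per frame here for clarity; it would be cached in practice.
    return warping_matrix(alpha) @ x + s             # F(x) = W_alpha(x,lambda) x + s(x,lambda)

def convert_log_f0(log_f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Gaussian-normalised log-f0 conversion, formula (19)."""
    return mu_y + (sigma_y / sigma_x) * (log_f0_src - mu_x)
```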
As shown in FIG. 1, the conversion part is implemented in the following steps:
Step 6: input the source speaker speech to be converted;
Step 7: use the AHOcoder speech analysis model to extract the 39-dimensional MFCC feature parameters X′ of the source speaker's speech and the source speech logarithmic pitch frequency parameter log f0X′;
Step 8: substitute λ = {P(wi), μi, Σi}, obtained from the AGMM training in step 4, and the feature parameters X′ extracted in step 7 into formula (5) to obtain the posterior conditional probability matrix P′(X|λ);
Step 9: substitute the frequency warping factor α(x,λ) and the amplitude scaling factor s(x,λ) obtained from the BLFW+AS training in step 5, together with the posterior conditional probability matrix P′(X|λ) obtained in step 8, into formulas (15), (16), (17) and (18) to obtain the MFCC feature parameters Y′ of the converted speech;
Step 10: substitute the source speech logarithmic pitch frequency parameter log f0X′ obtained in step 7 into formula (19) to obtain the logarithmic pitch frequency parameter log f0Y′ of the converted speech;
Step 11: use the AHOdecoder speech synthesis model, taking Y′ from step 9 and log f0Y′ from step 10 as input, to obtain the converted speech.
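The sketch below ties the conversion phase (steps 6 to 11) together, reusing the illustrative helpers convert_frame and convert_log_f0 from the sketch after step 5. The layout of the trained parameters is assumed, and waveform synthesis itself is left to the vocoder (AHOdecoder in the patent), which is outside the scope of this sketch.

```python
# Hedged end-to-end sketch of steps 6-11; convert_frame and convert_log_f0 are
# the illustrative helpers defined in the sketch after step 5.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posteriors(x, weights, means, covs):
    """P(w_i | x, lambda) for one 39-dim frame (component densities of formula (5))."""
    p = np.array([weights[i] * multivariate_normal.pdf(x, means[i], np.diag(covs[i]))
                  for i in range(len(weights))])
    return p / p.sum()

def convert_utterance(X_src, log_f0_src, gmm, blfw, f0_stats):
    weights, means, covs = gmm                    # saved AGMM parameters lambda (step 4)
    alpha_i, s_i = blfw                           # per-component BLFW+AS factors (assumed form)
    mu_x, sigma_x, mu_y, sigma_y = f0_stats       # log-f0 statistics from training
    Y_conv = np.array([convert_frame(x, gmm_posteriors(x, weights, means, covs),
                                     alpha_i, s_i)
                       for x in X_src])           # steps 8-9
    log_f0_conv = convert_log_f0(log_f0_src, mu_x, sigma_x, mu_y, sigma_y)   # step 10
    return Y_conv, log_f0_conv                    # step 11: hand these to the vocoder
```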
Further, as shown in FIG. 3, the spectrogram of speech converted by the method of the invention is compared with that obtained by the INCA method, with conversion direction F1-M2 (female speaker 1 to male speaker 2), which further verifies that the method adopted by the invention achieves higher spectral similarity than the INCA method. The INCA method is the one proposed in Erro D, Moreno A, Bonafonte A. INCA algorithm for training voice conversion systems from nonparallel corpora [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(5): 944-953.
The above are only some embodiments of the invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the invention, and such improvements and refinements should also be regarded as falling within the protection scope of the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710474281.8A CN107301859B (en) | 2017-06-21 | 2017-06-21 | Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301859A CN107301859A (en) | 2017-10-27 |
CN107301859B true CN107301859B (en) | 2020-02-21 |
Family
ID=60136451
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710474281.8A Active CN107301859B (en) | 2017-06-21 | 2017-06-21 | Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107301859B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107945791B (en) * | 2017-12-05 | 2021-07-20 | 华南理工大学 | A speech recognition method based on deep learning target detection |
CN108198566B (en) * | 2018-01-24 | 2021-07-20 | 咪咕文化科技有限公司 | Information processing method and device, electronic device and storage medium |
CN108777140B (en) * | 2018-04-27 | 2020-07-28 | 南京邮电大学 | A VAE-based voice conversion method under non-parallel corpus training |
CN109671423B (en) * | 2018-05-03 | 2023-06-02 | 南京邮电大学 | Non-parallel text-to-speech conversion method under limited training data |
CN110556092A (en) * | 2018-05-15 | 2019-12-10 | 中兴通讯股份有限公司 | Speech synthesis method and device, storage medium and electronic device |
WO2019245916A1 (en) * | 2018-06-19 | 2019-12-26 | Georgetown University | Method and system for parametric speech synthesis |
CN109377978B (en) * | 2018-11-12 | 2021-01-26 | 南京邮电大学 | Many-to-many speaker conversion method based on i vector under non-parallel text condition |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | 南京邮电大学 | Many-to-many speech conversion method based on text encoder under the condition of non-parallel text |
CN109671442B (en) * | 2019-01-14 | 2023-02-28 | 南京邮电大学 | Many-to-many speaker conversion method based on STARGAN and x vectors |
CN110164463B (en) * | 2019-05-23 | 2021-09-10 | 北京达佳互联信息技术有限公司 | Voice conversion method and device, electronic equipment and storage medium |
CN110782908B (en) * | 2019-11-05 | 2020-06-16 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
CN111640453B (en) * | 2020-05-13 | 2023-06-16 | 广州国音智能科技有限公司 | Spectrogram matching method, device, equipment and computer-readable storage medium |
US12073819B2 (en) | 2020-06-05 | 2024-08-27 | Google Llc | Training speech synthesis neural networks using energy scores |
CN113112999B (en) * | 2021-05-28 | 2022-07-12 | 宁夏理工学院 | Short word and sentence voice recognition method and system based on DTW and GMM |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003022088A (en) * | 2001-07-10 | 2003-01-24 | Sharp Corp | Device and method for speaker's features extraction, voice recognition device, and program recording medium |
CN101751921A (en) * | 2009-12-16 | 2010-06-23 | 南京邮电大学 | Real-time voice conversion method under conditions of minimal amount of training data |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | 东南大学 | Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN107301859A (en) | 2017-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||