
CN107301859B - Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering

Info

Publication number: CN107301859B
Application number: CN201710474281.8A
Authority: CN (China)
Other versions: CN107301859A (application publication, 2017-10-27); CN107301859B (grant publication, 2020-02-21)
Prior art keywords: speech, Gaussian, feature parameter, training, model
Inventors: 李燕萍, 左宇涛
Assignee (current and original): Nanjing University of Posts and Telecommunications
Priority date / Filing date: 2017-06-21
Legal status: Active (granted)


Classifications

    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/07 Adaptation to the speaker
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 19/032 Quantisation or dequantisation of spectral components
    • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for comparison or discrimination

(All classifications fall under G PHYSICS, G10 MUSICAL INSTRUMENTS; ACOUSTICS, G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING.)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech conversion method under non-parallel text conditions based on adaptive Gaussian clustering, belonging to the technical field of speech signal processing. First, the speech feature parameters of the non-parallel corpus are aligned using a method that combines unit selection with vocal tract length normalization; then an adaptive Gaussian mixture model and a bilinear frequency warping plus amplitude scaling model are trained to obtain the conversion function required for speech conversion; finally, this conversion function is used to achieve high-quality speech conversion. The invention not only overcomes the training-stage requirement for a parallel corpus and realizes speech conversion under non-parallel text conditions, giving it stronger applicability and generality, but also replaces the traditional Gaussian mixture model with an adaptive Gaussian mixture model, solving the inaccuracy of the Gaussian mixture model when classifying speech feature parameters; combining the adaptive Gaussian mixture model with bilinear frequency warping plus amplitude scaling yields better speaker similarity and speech quality in the converted speech.

Description

Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering

Technical Field

The invention relates to speech conversion technology, and in particular to a speech conversion method under non-parallel text conditions; it belongs to the technical field of speech signal processing.

Background Art

Speech conversion is an emerging research branch of speech signal processing that builds on, and has developed from, research in speech analysis, recognition, and synthesis.

The goal of speech conversion is to change the personality characteristics of the source speaker's voice so that it takes on the personality characteristics of the target speaker; that is, to make speech uttered by one person sound, after conversion, as if it were spoken by another person, while preserving the semantic content.

Most speech conversion methods, and GMM-based methods in particular, require the training corpus to be parallel text: the source and target speakers must utter sentences with identical content and comparable duration, keeping speaking rhythm, emotion, and so on as consistent as possible. In practical applications of speech conversion, however, obtaining a large parallel corpus is difficult and sometimes impossible; moreover, the accuracy of aligning the speech feature parameter vectors during training also constrains the performance of a speech conversion system. Whether judged by generality or by practicality, research on speech conversion methods under non-parallel text conditions therefore has great practical significance and application value.

At present there are two main classes of speech conversion methods for non-parallel text: methods based on speech clustering and methods based on parameter adaptation. Speech-clustering methods select corresponding speech units for conversion by measuring the distance between speech frames or under the guidance of phoneme information; in essence, they turn non-parallel text into parallel text under certain conditions. The principle is simple, but the phonetic content of the speech must be pre-extracted, and the quality of this pre-extraction directly affects conversion quality. Parameter-adaptation methods apply speaker normalization or adaptation techniques from speech recognition to the parameters of the conversion model; in essence, they transform a pre-built model toward a model of the target speaker. Such methods make reasonable use of pre-stored speaker information, but the adaptation process usually smooths the spectrum, leaving weak speaker personality information in the converted speech.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a speech conversion method that, under non-parallel text conditions, adaptively determines the GMM mixture order according to the target speaker, so as to enhance the speaker personality characteristics of the converted speech while also improving its quality.

To solve the above technical problem, the present invention adopts the following technical solution:

The present invention proposes a speech conversion method under non-parallel text conditions based on adaptive Gaussian clustering, comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:

Step 1: input non-parallel training corpora of the source speaker and the target speaker;

Step 2: use the AHOcoder speech analysis model to extract the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech fundamental frequency log f0X, and the target speech fundamental frequency log f0Y;

Step 3: apply speech feature parameter alignment combining unit selection with vocal tract length normalization, together with dynamic time warping, to the MFCC feature parameters X and Y from Step 2, thereby turning the non-parallel corpus into a parallel one;

Step 4: train the adaptive Gaussian mixture model (AGMM) with the expectation-maximization (EM) algorithm; when AGMM training ends, obtain the posterior conditional probability matrix P(X|λ) and save the AGMM parameters λ;

Step 5: using the source speech feature parameters X and target speech feature parameters Y obtained in Step 3, and the posterior conditional probability matrix P(X|λ) from Step 4, perform bilinear frequency warping (BLFW) plus amplitude scaling (AS) training to obtain the frequency warping factor α(x,λ) and the amplitude scaling factor s(x,λ), thereby constructing the BLFW+AS conversion function; use the mean and variance of the logarithmic fundamental frequency to establish the fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;

The conversion stage comprises the following steps:

Step 6: input the source speaker speech to be converted;

Step 7: use the AHOcoder speech analysis model to extract the MFCC feature parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker's speech;

Step 8: using the parameters λ obtained during AGMM training in Step 4, compute the posterior conditional probability matrix P′(X|λ);

Step 9: using the BLFW+AS conversion function obtained in Step 5, compute the converted MFCC feature parameters Y′;

Step 10: using the fundamental frequency conversion function obtained in Step 5, obtain the converted logarithmic fundamental frequency log f0Y′ from the logarithmic fundamental frequency log f0X′;

Step 11: use the AHOdecoder speech synthesis model to synthesize the converted MFCC feature parameters Y′ and the converted logarithmic fundamental frequency log f0Y′ into the converted speech.

Further, in the speech conversion method proposed by the present invention, the specific process of Step 3 is as follows:

3-1) apply vocal tract length normalization to the source speech MFCC feature parameters using the bilinear frequency warping method;

3-2) for the given N source speech MFCC feature parameter vectors {Xk}, dynamically search through formula (1) for the N target speech feature parameter vectors {Yk} that minimize the distance cost function value C({Yk});

C({Yk}) = C1({Yk}) + C2({Yk})   (1)

where C1({Yk}) and C2({Yk}) are given by:

C1({Yk}) = (1 − γ)·Σ_{k=1}^{N} D(Xk, Yk)   (2)

C2({Yk}) = γ·Σ_{k=2}^{N} D(Yk, Yk−1)   (3)

where the function D(Xk, Yk) denotes the spectral distance between the source and target speech feature parameter vectors, and the parameter γ denotes the balance coefficient between the accuracy of feature parameter frame alignment and inter-frame continuity, with 0 ≤ γ ≤ 1; C1({Yk}) is the spectral distance cost function between the source and target speech feature parameter vectors, and C2({Yk}) is the spectral distance cost function between the unit-selected target speech feature parameter vectors;

3-3) by performing a multivariate linear regression analysis on formula (1), the set of target speech feature parameter sequences {Ŷk} aligned with the source speech feature parameter vectors is obtained, namely:

{Ŷk} = argmin_{{Yk}} C({Yk})   (4)

Through the above steps, the non-parallel MFCC feature parameters X and Y are turned into a parallel corpus.

Further, in the speech conversion method proposed by the present invention, the Viterbi search method is used to solve formula (4), optimizing the execution efficiency of the algorithm.

Further, in the speech conversion method proposed by the present invention, the training process of Step 4 is as follows:

4-1) set the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;

4-2) use the K-means iterative algorithm to obtain the initial values for EM training;

4-3) perform iterative training with the EM algorithm; the Gaussian mixture model (GMM) is expressed as follows:

P(X|λ) = Σ_{i=1}^{M} P(wi)·N(X, μi, Σi)   (5)

where X is a P-dimensional speech feature parameter vector with P = 39; P(wi) denotes the weight coefficient of each Gaussian component, with Σ_{i=1}^{M} P(wi) = 1; M is the number of Gaussian components; and N(X, μi, Σi) denotes the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X, μi, Σi) = (2π)^(−P/2)·|Σi|^(−1/2)·exp[−(1/2)(X − μi)^T Σi^(−1) (X − μi)]   (6)

where μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} are the model parameters of the GMM, and λ is estimated by maximum likelihood; for the speech feature parameter vector set X = {xn, n = 1, 2, ..., N}:

P(X|λ) = Π_{n=1}^{N} P(xn|λ)   (7)

at which point:

λ = argmax_λ P(X|λ)   (8)

Formula (8) is solved with the EM algorithm; during the EM computation the iterations satisfy P(X|λk) ≥ P(X|λk−1), where k is the iteration index, until the model parameters λ are obtained. Within each iteration, the Gaussian component weight coefficients P(wi), mean vectors μi, and covariance matrices Σi are updated as follows:

P(wi|xn, λ) = P(wi)·N(xn, μi, Σi) / Σ_{j=1}^{M} P(wj)·N(xn, μj, Σj)   (9)

P(wi) = (1/N)·Σ_{n=1}^{N} P(wi|xn, λ)   (10)

μi = Σ_{n=1}^{N} P(wi|xn, λ)·xn / Σ_{n=1}^{N} P(wi|xn, λ)   (11)

Σi = Σ_{n=1}^{N} P(wi|xn, λ)·(xn − μi)(xn − μi)^T / Σ_{n=1}^{N} P(wi|xn, λ)   (12)

4-4) if a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient smaller than t1, and the Euclidean distance to its nearest neighbouring component N(P(wj), μj, Σj) is smaller than the threshold D, the two components are merged:

P(w′) = P(wi) + P(wj),  μ′ = [P(wi)·μi + P(wj)·μj] / P(w′),  Σ′ = [P(wi)·Σi + P(wj)·Σj] / P(w′)   (13)

The number of Gaussian components then becomes M − 1, and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the Gaussian components with the smallest distance are merged;

4-5) if a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and at least one dimension of its covariance matrix has variance greater than σ, the component is considered to contain excessive information and is split:

P(wi1) = P(wi2) = P(wi)/2,  μi1 = μi + n·√Σi·E,  μi2 = μi − n·√Σi·E,  Σi1 = Σi2 = Σi   (14)

where E is a column vector of all ones, n adjusts the Gaussian distribution, and √Σi is taken elementwise on the diagonal of Σi; after splitting, the number of Gaussian components becomes M + 1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the procedure returns to step 4-3) for the next round of training;

4-6) when AGMM training ends, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.

Further, in the speech conversion method proposed by the present invention, the BLFW+AS conversion function constructed in Step 5 is expressed as follows:

F(x) = Wα(x,λ)·x + s(x,λ)   (15)

α(x,λ) = Σ_{m=1}^{M} pm(x,λ)·αm   (16)

s(x,λ) = Σ_{m=1}^{M} pm(x,λ)·sm   (17)

pm(x,λ) = P(wm)·N(x, μm, Σm) / Σ_{j=1}^{M} P(wj)·N(x, μj, Σj)   (18)

where M is the number of Gaussian components of the Gaussian mixture model in Step 4, α(x,λ) denotes the frequency warping factor, and s(x,λ) denotes the amplitude scaling factor.

Further, in the speech conversion method proposed by the present invention, Step 5 establishes the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)·(log f0X′ − μX)   (19)

where μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, with subscripts X and Y indicating the source and target speech, respectively.

Compared with the prior art, the present invention, by adopting the above technical solution, achieves the following technical effects:

1. The invention realizes speech conversion under non-parallel text conditions, solves the problem that parallel corpora are difficult to obtain, and improves the generality and practicality of the speech conversion system.

2. The invention combines AGMM with BLFW+AS to implement the speech conversion system; the system can adaptively adjust the number of GMM classes according to the distribution of the speech feature parameters of different speakers, improving speech quality while enhancing speaker similarity and thus achieving high-quality speech conversion.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the non-parallel text speech conversion of the present invention.

FIG. 2 is a flowchart of adaptive Gaussian mixture model training.

FIG. 3 is a spectrogram comparison of the converted speech.

Detailed Description of the Embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be interpreted as having meanings consistent with their meaning in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.

The high-quality speech conversion method of the present invention is divided into two parts: the training part obtains the parameters and conversion functions required for speech conversion, and the conversion part converts the source speaker's speech into the target speaker's speech.

As shown in FIG. 1, the training part comprises the following implementation steps:

Step 1: input the non-parallel speech corpora of the source and target speakers. The non-parallel corpora are taken from the CMU_US_ARCTIC corpus, built by the Language Technologies Institute of Carnegie Mellon University; the corpus was recorded by 5 male and 2 female speakers, each of whom recorded 1132 utterances of 1 to 6 seconds.

Step 2: the invention uses the AHOcoder speech analysis model to extract the Mel-frequency cepstral coefficients (MFCC) X and Y and the logarithmic pitch frequency parameters log f0X and log f0Y of the source and target speakers, respectively. AHOcoder is a high-performance speech analysis and synthesis tool built by the team of Daniel Erro at the Aholab Signal Processing Laboratory in Bilbao, Spain.

Step 3: apply speech feature parameter alignment combining unit selection with vocal tract length normalization (VTLN), together with dynamic time warping (DTW), to the MFCC parameters X and Y of the source and target speech from Step 2. The specific process of speech feature parameter alignment is as follows:

3-1) Apply vocal tract length normalization to the source speech feature parameters using the bilinear frequency warping method, so that the formants of the source speech move closer to those of the target speech, thereby increasing the accuracy of unit selection of target speech feature parameters.

3-2) For the given N source speech feature parameter vectors {Xk}, the N target speech feature parameter vectors {Yk} that minimize the distance cost function value C({Yk}) can be found dynamically through formula (1). Two factors are considered during unit selection: on the one hand, the spectral distance between each aligned source speech feature parameter vector and the corresponding target feature parameter vector should be minimal, to strengthen the matching of phoneme information; on the other hand, the selected target speech feature parameter vectors should exhibit frame continuity, so that the phoneme information is more complete.

C({Yk}) = C1({Yk}) + C2({Yk})   (1)

where C1({Yk}) and C2({Yk}) can be expressed as:

C1({Yk}) = (1 − γ)·Σ_{k=1}^{N} D(Xk, Yk)   (2)

C2({Yk}) = γ·Σ_{k=2}^{N} D(Yk, Yk−1)   (3)

where the function D(Xk, Yk) denotes the spectral distance between the source and target feature parameter vectors; the invention adopts the Euclidean distance as the distance measure. The parameter γ denotes the balance coefficient between the accuracy of feature parameter frame alignment and inter-frame continuity, with 0 ≤ γ ≤ 1. C1({Yk}) is the spectral distance cost function between the source speech feature parameter vectors and the target speech feature parameter vectors, and C2({Yk}) is the spectral distance cost function between the unit-selected target speech feature parameter vectors.

3-3) By performing a multivariate linear regression analysis on formula (1), the set of feature parameter sequences {Ŷk} aligned with the source speech feature parameter vectors can be obtained, namely:

{Ŷk} = argmin_{{Yk}} C({Yk})   (4)

For the solution of formula (4), the Viterbi search method can be used to optimize the execution efficiency of the algorithm.

Through the above steps, the non-parallel MFCC parameters X and Y are made parallel.
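As a concrete reading of steps 3-2) and 3-3), the following sketch minimizes the cost of formulas (1)-(3) with a Viterbi-style dynamic program over candidate target frames. It is a minimal sketch rather than the patented implementation: the Euclidean distance is used for D(·,·) as stated above, the (1 − γ)/γ weighting follows formulas (2) and (3), and all function and variable names are illustrative.

```python
import numpy as np

def align_units(X, Y_pool, gamma=0.5):
    """Unit-selection alignment of formulas (1)-(4), solved by Viterbi search.

    X: (N, P) source MFCC frames; Y_pool: (T, P) candidate target frames.
    Returns the indices of the selected target frames, one per source frame.
    """
    N, T = len(X), len(Y_pool)
    # local cost (formula (2)): (1 - gamma) * D(X_k, Y_j), Euclidean distance
    local = (1.0 - gamma) * np.linalg.norm(X[:, None, :] - Y_pool[None, :, :], axis=-1)
    # transition cost (formula (3)): gamma * D(Y_j, Y_i) between consecutive picks
    trans = gamma * np.linalg.norm(Y_pool[:, None, :] - Y_pool[None, :, :], axis=-1)

    acc = local[0].copy()                # accumulated cost of the best path so far
    back = np.zeros((N, T), dtype=int)   # Viterbi backpointers
    for k in range(1, N):
        total = acc[:, None] + trans     # cost of stepping from frame j to frame i
        back[k] = np.argmin(total, axis=0)
        acc = total[back[k], np.arange(T)] + local[k]

    idx = np.empty(N, dtype=int)         # backtrack the minimum-cost path (formula (4))
    idx[-1] = int(np.argmin(acc))
    for k in range(N - 1, 0, -1):
        idx[k - 1] = back[k, idx[k]]
    return idx
```

With γ = 0 the search reduces to nearest-neighbour frame matching; larger γ trades per-frame spectral accuracy for smoother, more continuous target trajectories, matching the two factors discussed above.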

Step 4: build the adaptive Gaussian mixture model (Adaptive GMM, AGMM) and train it with the expectation-maximization (EM) algorithm, using the K-means iterative method to obtain the initial values for EM training. Training yields the AGMM parameters λ and P(X|λ).

As shown in FIG. 2, training the AGMM parameters with the adaptive clustering algorithm first requires a comprehensive analysis of the weight coefficients, mean vectors, and covariance matrices of the Gaussian components and of the Euclidean distances between feature parameter vectors, so that the Gaussian mixture order can be adjusted dynamically. The training process is as follows:

4-1) Set the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ.

4-2) Use the K-means iterative algorithm to obtain the initial values for EM training.
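A minimal sketch of step 4-2), assuming NumPy; the random seeding, iteration count, and the small regularization added to the covariances are assumptions, since the patent does not specify them:

```python
import numpy as np

def kmeans_init(X, M, n_iter=20, seed=0):
    """Step 4-2): K-means initial values (weights, means, covariances) for EM."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=M, replace=False)]    # random distinct seeds
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        mu = np.stack([X[assign == i].mean(0) if np.any(assign == i) else mu[i]
                       for i in range(M)])
    w = np.bincount(assign, minlength=M) / len(X)
    cov = np.stack([np.cov(X[assign == i].T) + 1e-6 * np.eye(X.shape[1])
                    if np.sum(assign == i) > 1 else np.eye(X.shape[1])
                    for i in range(M)])
    return w, mu, cov
```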

4-3) Perform iterative training with the EM algorithm.

The traditional Gaussian mixture model is expressed as follows:

P(X|λ) = Σ_{i=1}^{M} P(wi)·N(X, μi, Σi)   (5)

where X is a P-dimensional speech feature parameter vector (the invention uses P = 39); P(wi) denotes the weight coefficient of each Gaussian component, with Σ_{i=1}^{M} P(wi) = 1; M is the number of Gaussian components; and N(X, μi, Σi) denotes the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X, μi, Σi) = (2π)^(−P/2)·|Σi|^(−1/2)·exp[−(1/2)(X − μi)^T Σi^(−1) (X − μi)]   (6)

where μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} are the model parameters of the GMM. λ can be estimated by maximum likelihood (ML) estimation, whose aim is to maximize the conditional probability P(X|λ); for the speech feature parameter vector set X = {xn, n = 1, 2, ..., N}:

P(X|λ) = Π_{n=1}^{N} P(xn|λ)   (7)

at which point:

λ = argmax_λ P(X|λ)   (8)

Formula (8) can be solved with the EM algorithm; during the EM computation the iterations satisfy P(X|λk) ≥ P(X|λk−1), where k is the iteration index, until the model parameters λ are obtained. The iteration formulas for the Gaussian component weight coefficients P(wi), mean vectors μi, and covariance matrices Σi are as follows:

P(wi|xn, λ) = P(wi)·N(xn, μi, Σi) / Σ_{j=1}^{M} P(wj)·N(xn, μj, Σj)   (9)

P(wi) = (1/N)·Σ_{n=1}^{N} P(wi|xn, λ)   (10)

μi = Σ_{n=1}^{N} P(wi|xn, λ)·xn / Σ_{n=1}^{N} P(wi|xn, λ)   (11)

Σi = Σ_{n=1}^{N} P(wi|xn, λ)·(xn − μi)(xn − μi)^T / Σ_{n=1}^{N} P(wi|xn, λ)   (12)
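Formulas (5)-(12) can be exercised with the short EM routine below. It is a sketch under stated assumptions (NumPy/SciPy, full covariances, the `kmeans_init` helper above for initialization, no numerical safeguards), not the patented implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_train(X, w, mu, cov, n_iter=50):
    """EM iterations for the GMM of formula (5) using updates (9)-(12).

    X: (N, P) feature vectors; w: (M,) weights; mu: (M, P); cov: (M, P, P).
    Each pass cannot decrease P(X|lambda), matching the iteration condition
    P(X|lambda_k) >= P(X|lambda_{k-1}).
    """
    N, M = len(X), len(w)
    for _ in range(n_iter):
        # E-step, formula (9): component posteriors for every frame
        lik = np.stack([w[i] * multivariate_normal.pdf(X, mu[i], cov[i])
                        for i in range(M)], axis=1)          # (N, M)
        post = lik / lik.sum(axis=1, keepdims=True)
        # M-step, formulas (10)-(12)
        Nk = post.sum(axis=0)                                # effective counts
        w = Nk / N
        mu = (post.T @ X) / Nk[:, None]
        for i in range(M):
            d = X - mu[i]
            cov[i] = (post[:, i, None] * d).T @ d / Nk[i]
    return w, mu, cov, post
```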

4-4) If a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient smaller than t1, and the Euclidean distance to its nearest neighbouring component N(P(wj), μj, Σj) is smaller than the threshold D, the two components are considered to carry little information and to be similar in composition, and they can be merged:

P(w′) = P(wi) + P(wj),  μ′ = [P(wi)·μi + P(wj)·μj] / P(w′),  Σ′ = [P(wi)·Σi + P(wj)·Σj] / P(w′)   (13)

The number of Gaussian components then becomes M − 1, and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the Gaussian components with the smallest distance are merged.

4-5) If a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and at least one dimension of its covariance matrix has variance greater than σ (the diagonal elements of the covariance matrix are the variances), the component is considered to contain excessive information and should be split:

P(wi1) = P(wi2) = P(wi)/2,  μi1 = μi + n·√Σi·E,  μi2 = μi − n·√Σi·E,  Σi1 = Σi2 = Σi   (14)

where E is a column vector of all ones, n adjusts the Gaussian distribution, and √Σi is taken elementwise on the diagonal of Σi. After splitting, the number of Gaussian components becomes M + 1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the procedure returns to step 4-3) for the next round of training.

4-6) When AGMM training ends, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.
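Steps 4-3) to 4-6) together form the adaptive loop of FIG. 2, which can be sketched as an outer loop around `em_train`: after each EM pass, one weak-and-nearby component is merged (formula (13)) or one heavy, high-variance component is split (formula (14)). This is a minimal sketch; the stopping rule, the split perturbation n, and the merge formula without cross-covariance terms are assumptions:

```python
def agmm_train(X, M0, t1, t2, D, sigma, n=0.2, max_rounds=10):
    """Adaptive GMM training (steps 4-1 to 4-6): EM plus merge/split."""
    w, mu, cov = kmeans_init(X, M0)
    for _ in range(max_rounds):
        w, mu, cov, post = em_train(X, w, mu, cov)

        # merge rule (formula (13)): weakest component, if close to a neighbour
        i = int(np.argmin(w))
        dist = np.linalg.norm(mu - mu[i], axis=1)
        dist[i] = np.inf
        j = int(np.argmin(dist))
        if w[i] < t1 and dist[j] < D:
            wm = w[i] + w[j]
            mu[j] = (w[i] * mu[i] + w[j] * mu[j]) / wm
            cov[j] = (w[i] * cov[i] + w[j] * cov[j]) / wm
            w[j] = wm
            w = np.delete(w, i)
            mu = np.delete(mu, i, axis=0)
            cov = np.delete(cov, i, axis=0)
            continue                     # back to step 4-3) with M - 1 components

        # split rule (formula (14)): heaviest component, if any variance > sigma
        i = int(np.argmax(w))
        if w[i] > t2 and np.diag(cov[i]).max() > sigma:
            dev = n * np.sqrt(np.diag(cov[i]))   # n * sqrt(Sigma_i) * E, elementwise
            w[i] /= 2.0
            w = np.append(w, w[i])
            mu = np.vstack([mu, mu[i] + dev])
            mu[i] = mu[i] - dev
            cov = np.concatenate([cov, cov[i][None]])
            continue                     # back to step 4-3) with M + 1 components

        break                            # neither rule fired: mixture order settled
    return w, mu, cov
```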

Step 5: train using the source speech feature parameters X and target speech feature parameters Y obtained in Step 3, together with the posterior conditional probability matrix P(X|λ) obtained in Step 4, to obtain the frequency warping factor and the amplitude scaling factor, thereby constructing the bilinear frequency warping (BLFW) plus amplitude scaling (AS) speech conversion function, expressed as follows:

F(x) = Wα(x,λ)·x + s(x,λ)   (15)

α(x,λ) = Σ_{m=1}^{M} pm(x,λ)·αm   (16)

s(x,λ) = Σ_{m=1}^{M} pm(x,λ)·sm   (17)

pm(x,λ) = P(wm)·N(x, μm, Σm) / Σ_{j=1}^{M} P(wj)·N(x, μj, Σj)   (18)

where pm(x,λ) denotes the posterior probability of the speech feature vector x on the m-th Gaussian component of the AGMM model λ, and αm and sm denote the frequency warping factor and amplitude scaling factor of the m-th Gaussian component; α(x,λ) and s(x,λ) are thus the posterior-weighted combinations of the per-component factors.
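The sketch below applies formulas (15)-(18) to a single frame, reusing the AGMM parameters from the training sketches above. The mixture-weighted factors follow the formulas directly; the construction of the warping matrix Wα is an assumption-level numerical stand-in (built by least-squares projection between plain and warped cosine bases), since the closed form from the BLFW literature is not reproduced in this text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posterior(x, w, mu, cov):
    """Formula (18): posterior p_m(x, lambda) of frame x on each component."""
    lik = np.array([w[m] * multivariate_normal.pdf(x, mu[m], cov[m])
                    for m in range(len(w))])
    return lik / lik.sum()

def bilinear_warp_matrix(alpha, P, n_grid=512):
    """Numerical stand-in for the cepstral warping matrix W_alpha in formula (15)."""
    omega = np.linspace(0.0, np.pi, n_grid)
    # phase response of the bilinear all-pass warping with factor alpha
    warped = omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
    A = np.cos(np.outer(omega, np.arange(P)))    # cosine basis on the plain axis
    B = np.cos(np.outer(warped, np.arange(P)))   # cosine basis on the warped axis
    return np.linalg.lstsq(A, B, rcond=None)[0]  # W with c_warped ~= W @ c

def blfw_as_convert(x, w, mu, cov, alphas, scales):
    """Formulas (15)-(17): posterior-weighted warping factor and amplitude shift."""
    p = gmm_posterior(x, w, mu, cov)
    alpha = float(p @ alphas)                    # formula (16)
    s = p @ scales                               # formula (17); scales: (M, P)
    return bilinear_warp_matrix(alpha, len(x)) @ x + s   # formula (15)
```

The training of the per-component factors αm and sm themselves (fitting against the aligned target frames) is not shown here.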

The conversion relationship between the source speech pitch frequency and the target speech pitch frequency is established as:

log f0Y′ = μY + (σY/σX)·(log f0X′ − μX)   (19)

where μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, with subscripts X and Y indicating the source and target speech, respectively.
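Formula (19) amounts to Gaussian normalization of the log-pitch statistics, as in the brief sketch below; treating unvoiced frames as non-finite values is an assumption of this sketch:

```python
import numpy as np

def train_f0_stats(logf0_frames):
    """Mean and standard deviation of the voiced log-F0 frames."""
    voiced = logf0_frames[np.isfinite(logf0_frames)]
    return voiced.mean(), voiced.std()

def convert_logf0(logf0_src, stats_x, stats_y):
    """Formula (19): log f0Y' = muY + (sigmaY / sigmaX) * (log f0X' - muX)."""
    mu_x, sigma_x = stats_x
    mu_y, sigma_y = stats_y
    return mu_y + (sigma_y / sigma_x) * (logf0_src - mu_x)
```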

As shown in FIG. 1, the conversion part comprises the following implementation steps:

Step 6: input the source speaker speech to be converted;

Step 7: use the AHOcoder speech analysis model to extract the 39th-order MFCC feature parameters X′ of the source speaker's speech and the source speech logarithmic pitch frequency parameter log f0X′;

Step 8: substitute λ = {P(wi), μi, Σi}, obtained during AGMM training in Step 4, and the feature parameters X′ extracted in Step 7 into formula (5) to obtain the posterior conditional probability matrix P′(X|λ);

Step 9: substitute the frequency warping factor α(x,λ) and amplitude scaling factor s(x,λ) obtained from BLFW+AS training in Step 5, together with the posterior conditional probability matrix P′(X|λ) obtained in Step 8, into formulas (15), (16), (17), and (18) to obtain the MFCC feature parameters Y′ of the converted speech;

Step 10: substitute the source speech logarithmic pitch frequency parameter log f0X′ obtained in Step 7 into formula (19) to obtain the logarithmic pitch frequency parameter log f0Y′ of the converted speech;

Step 11: use the AHOdecoder speech synthesis model, taking Y′ from Step 9 and log f0Y′ from Step 10 as input, to obtain the converted speech.
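Chaining the sketches above, the core of the conversion stage (Steps 8-10) reduces to a few lines; feature extraction and waveform synthesis remain external calls to AHOcoder/AHOdecoder and are not reproduced here:

```python
import numpy as np

def convert_utterance(mfcc_src, logf0_src, w, mu, cov, alphas, scales,
                      f0_stats_x, f0_stats_y):
    """Steps 8-10: convert MFCC frames via BLFW+AS and pitch via formula (19).

    mfcc_src: (T, 39) source MFCC frames; logf0_src: (T,) source log-F0 track.
    Returns converted MFCC frames and log-F0, ready for AHOdecoder synthesis.
    """
    mfcc_conv = np.stack([blfw_as_convert(x, w, mu, cov, alphas, scales)
                          for x in mfcc_src])
    logf0_conv = convert_logf0(logf0_src, f0_stats_x, f0_stats_y)
    return mfcc_conv, logf0_conv
```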

Further, as shown in FIG. 3, the spectrograms of speech converted by the method of the present invention and by the INCA method were compared, with conversion direction F1-M2 (female voice 1 to male voice 2), further verifying that the method adopted by the present invention achieves higher spectral similarity than the INCA method. The INCA method was proposed in: Erro D, Moreno A, Bonafonte A. INCA algorithm for training voice conversion systems from nonparallel corpora [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(5): 944-953.

The above are only some embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A speech conversion method under non-parallel text conditions based on adaptive Gaussian clustering, characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:

Step 1: input non-parallel training corpora of the source speaker and the target speaker;

Step 2: use the AHOcoder speech analysis model to extract the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech fundamental frequency log f0X, and the target speech fundamental frequency log f0Y;

Step 3: apply speech feature parameter alignment combining unit selection with vocal tract length normalization, together with dynamic time warping, to the MFCC feature parameters X and Y from Step 2, thereby turning the non-parallel corpus into a parallel one;

Step 4: train the adaptive Gaussian mixture model AGMM with the expectation-maximization EM algorithm; when AGMM training ends, obtain the posterior conditional probability matrix P(X|λ) and save the AGMM parameters λ;

Step 5: using the source speech feature parameters X and target speech feature parameters Y obtained in Step 3, and the posterior conditional probability matrix P(X|λ) from Step 4, perform bilinear frequency warping BLFW plus amplitude scaling AS training to obtain the frequency warping factor α(x,λ) and the amplitude scaling factor s(x,λ), thereby constructing the BLFW+AS conversion function; use the mean and variance of the logarithmic fundamental frequency to establish the fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;

the conversion stage comprises the following steps:

Step 6: input the source speaker speech to be converted;

Step 7: use the AHOcoder speech analysis model to extract the MFCC feature parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker's speech;

Step 8: using the parameters λ obtained during AGMM training in Step 4, compute the posterior conditional probability matrix P′(X|λ);

Step 9: using the BLFW+AS conversion function obtained in Step 5, compute the converted MFCC feature parameters Y′;

Step 10: using the fundamental frequency conversion function obtained in Step 5, obtain the converted logarithmic fundamental frequency log f0Y′ from the logarithmic fundamental frequency log f0X′;

Step 11: use the AHOdecoder speech synthesis model to synthesize the converted MFCC feature parameters Y′ and the logarithmic fundamental frequency log f0Y′ into the converted speech.

2. The speech conversion method according to claim 1, characterized in that the specific process of Step 3 is as follows:

3-1) apply vocal tract length normalization to the source speech MFCC feature parameters using the bilinear frequency warping method;

3-2) for the given N source speech MFCC feature parameter vectors {Xk}, dynamically search through formula (1) for the N target speech feature parameter vectors {Yk} that minimize the distance cost function value C({Yk});

C({Yk}) = C1({Yk}) + C2({Yk})   (1)

where C1({Yk}) and C2({Yk}) are given by:

C1({Yk}) = (1 − γ)·Σ_{k=1}^{N} D(Xk, Yk)   (2)

C2({Yk}) = γ·Σ_{k=2}^{N} D(Yk, Yk−1)   (3)

where the function D(Xk, Yk) denotes the spectral distance between the source and target speech feature parameter vectors, the function D(Yk, Yk−1) denotes the spectral distance between the unit-selected target speech feature parameter vectors, and the parameter γ denotes the balance coefficient between the accuracy of feature parameter frame alignment and inter-frame continuity, with 0 ≤ γ ≤ 1; C1({Yk}) is the spectral distance cost function between the source and target speech feature parameter vectors, and C2({Yk}) is the spectral distance cost function between the unit-selected target speech feature parameter vectors;

3-3) by performing a multivariate linear regression analysis on formula (1), the set of target speech feature parameter sequences {Ŷk} aligned with the source speech feature parameter vectors is obtained, namely:

{Ŷk} = argmin_{{Yk}} C({Yk})   (4)

Through the above steps, the MFCC feature parameters X and Y of the non-parallel corpus are turned into an aligned feature parameter set resembling a parallel corpus.

3. The speech conversion method according to claim 2, characterized in that, for the solution of formula (4), the Viterbi search method is used to optimize the execution efficiency of the algorithm.

4. The speech conversion method according to claim 1, characterized in that the training process of Step 4 is as follows:

4-1) set the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;

4-2) use the K-means iterative algorithm to obtain the initial values for EM training;

4-3) perform iterative training with the EM algorithm; the Gaussian mixture model GMM is expressed as follows:

P(X|λ) = Σ_{i=1}^{M} P(wi)·N(X, μi, Σi)   (5)

where X is a P-dimensional speech feature parameter vector; P(wi) denotes the weight coefficient of each Gaussian component, with Σ_{i=1}^{M} P(wi) = 1; M is the number of Gaussian components; and N(X, μi, Σi) denotes the P-dimensional joint Gaussian probability distribution of a component, expressed as follows:

N(X, μi, Σi) = (2π)^(−P/2)·|Σi|^(−1/2)·exp[−(1/2)(X − μi)^T Σi^(−1) (X − μi)]   (6)

where μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} are the model parameters of the GMM, estimated by maximum likelihood; for the speech feature parameter vector set X = {xn, n = 1, 2, ..., N}:

P(X|λ) = Π_{n=1}^{N} P(xn|λ)   (7)

at which point:

λ = argmax_λ P(X|λ)   (8)

Formula (8) is solved with the EM algorithm; during the EM computation the iterations satisfy P(X|λk) ≥ P(X|λk−1), where k is the iteration index, until the model parameters λ are obtained; within each iteration the Gaussian component weight coefficients P(wi), mean vectors μi, and covariance matrices Σi are updated as follows:

P(wi|xn, λ) = P(wi)·N(xn, μi, Σi) / Σ_{j=1}^{M} P(wj)·N(xn, μj, Σj)   (9)

P(wi) = (1/N)·Σ_{n=1}^{N} P(wi|xn, λ)   (10)

μi = Σ_{n=1}^{N} P(wi|xn, λ)·xn / Σ_{n=1}^{N} P(wi|xn, λ)   (11)

Σi = Σ_{n=1}^{N} P(wi|xn, λ)·(xn − μi)(xn − μi)^T / Σ_{n=1}^{N} P(wi|xn, λ)   (12)

4-4) if a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient smaller than t1, and the Euclidean distance to its nearest neighbouring component N(P(wj), μj, Σj) is smaller than the threshold D, the two components are merged:

P(w′) = P(wi) + P(wj),  μ′ = [P(wi)·μi + P(wj)·μj] / P(w′),  Σ′ = [P(wi)·Σi + P(wj)·Σj] / P(w′)   (13)

the number of Gaussian components then becomes M − 1, and the procedure returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the Gaussian components with the smallest distance are merged;

4-5) if a Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and at least one dimension of its covariance matrix has variance greater than σ, the component is considered to contain excessive information and is split:

P(wi1) = P(wi2) = P(wi)/2,  μi1 = μi + n·√Σi·E,  μi2 = μi − n·√Σi·E,  Σi1 = Σi2 = Σi   (14)

where E is a column vector of all ones and n adjusts the Gaussian distribution; after splitting, the number of Gaussian components becomes M + 1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the procedure returns to step 4-3) for the next round of training;

4-6) when AGMM training ends, the posterior conditional probability matrix P(X|λ) is obtained and λ is saved.

5. The speech conversion method according to claim 4, characterized in that P = 39.

6. The speech conversion method according to claim 1, characterized in that the BLFW+AS conversion function constructed in Step 5 is expressed as follows:

F(x) = Wα(x,λ)·x + s(x,λ)   (15)

α(x,λ) = Σ_{m=1}^{M} pm(x,λ)·αm   (16)

s(x,λ) = Σ_{m=1}^{M} pm(x,λ)·sm   (17)

pm(x,λ) = P(wm)·N(x, μm, Σm) / Σ_{j=1}^{M} P(wj)·N(x, μj, Σj)   (18)

where M is the number of Gaussian components of the Gaussian mixture model in Step 4; pm(x,λ) denotes the posterior probability of the speech feature vector x on the m-th Gaussian component of the AGMM model λ; αm and sm denote the frequency warping factor and amplitude scaling factor of the m-th Gaussian component, respectively; α(x,λ) denotes the weighted combination of the frequency warping factors of all Gaussian components of the AGMM model λ; and s(x,λ) denotes the weighted combination of the amplitude scaling factors of all Gaussian components of the AGMM model λ.

7. The speech conversion method according to claim 1, characterized in that Step 5 establishes the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)·(log f0X′ − μX)   (19)

where μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, with subscripts X and Y indicating the source and target speech, respectively.

Priority Applications (1)

Application Number: CN201710474281.8A
Priority Date / Filing Date: 2017-06-21
Title: Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering

Publications (2)

Publication Number Publication Date
CN107301859A CN107301859A (en) 2017-10-27
CN107301859B true CN107301859B (en) 2020-02-21

Family

ID: 60136451

Family Applications (1)

Application Number: CN201710474281.8A (granted as CN107301859B)
Title: Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering
Priority Date / Filing Date: 2017-06-21
Status: Active (granted)

Country Status (1)

CN: CN107301859B (granted)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945791B (en) * 2017-12-05 2021-07-20 华南理工大学 A speech recognition method based on deep learning target detection
CN108198566B (en) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 Information processing method and device, electronic device and storage medium
CN108777140B (en) * 2018-04-27 2020-07-28 南京邮电大学 A VAE-based voice conversion method under non-parallel corpus training
CN109671423B (en) * 2018-05-03 2023-06-02 南京邮电大学 Non-parallel text-to-speech conversion method under limited training data
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
WO2019245916A1 (en) * 2018-06-19 2019-12-26 Georgetown University Method and system for parametric speech synthesis
CN109377978B (en) * 2018-11-12 2021-01-26 南京邮电大学 Many-to-many speaker conversion method based on i vector under non-parallel text condition
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many speech conversion method based on text encoder under the condition of non-parallel text
CN109671442B (en) * 2019-01-14 2023-02-28 南京邮电大学 Many-to-many speaker conversion method based on STARGAN and x vectors
CN110164463B (en) * 2019-05-23 2021-09-10 北京达佳互联信息技术有限公司 Voice conversion method and device, electronic equipment and storage medium
CN110782908B (en) * 2019-11-05 2020-06-16 广州欢聊网络科技有限公司 Audio signal processing method and device
CN111640453B (en) * 2020-05-13 2023-06-16 广州国音智能科技有限公司 Spectrogram matching method, device, equipment and computer-readable storage medium
US12073819B2 (en) 2020-06-05 2024-08-27 Google Llc Training speech synthesis neural networks using energy scores
CN113112999B (en) * 2021-05-28 2022-07-12 宁夏理工学院 Short word and sentence voice recognition method and system based on DTW and GMM

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number Priority date Publication date Assignee Title
JP2003022088A (en) * 2001-07-10 2003-01-24 Sharp Corp Device and method for speaker's features extraction, voice recognition device, and program recording medium
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Also Published As

Publication number Publication date
CN107301859A (en) 2017-10-27

Similar Documents

Publication Publication Date Title
CN107301859B (en) Speech conversion method under non-parallel text condition based on adaptive Gaussian clustering
Saito et al. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors
EP3438973B1 (en) Method and apparatus for constructing speech decoding network in digital speech recognition, and storage medium
Liu et al. Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance.
US7957959B2 (en) Method and apparatus for processing speech data with classification models
Zhan et al. Vocal tract length normalization for large vocabulary continuous speech recognition
CN101030369B (en) Embedded Speech Recognition Method Based on Subword Hidden Markov Model
US9355642B2 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
Dissen et al. Formant estimation and tracking: A deep learning approach
Sivaraman et al. Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion
CN102063899A (en) Method for voice conversion under unparallel text condition
CN104217721B (en) Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
Thai et al. Synthetic data augmentation for improving low-resource ASR
CN107103914B (en) High-quality voice conversion method
CN102982799A (en) Speech recognition optimization decoding method integrating guide probability
CN107068165B (en) A method of voice conversion
Nguyen et al. Development of a Vietnamese speech recognition system for Viettel call center
Graciarena et al. Voicing feature integration in SRI's decipher LVCSR system
Giuliani et al. Speaker normalization through constrained MLLR based transforms
Wang et al. Multi-level prosody and spectrum conversion for emotional speech synthesis
Dessalegn Syllable based speaker independent Continous speech recognition for Afan Oromo
Miguel et al. Capturing local variability for speaker normalization in speech recognition
Schnell et al. Neural VTLN for speaker adaptation in TTS
Das et al. Aging speech recognition with speaker adaptation techniques: Study on medium vocabulary continuous Bengali speech
Müller et al. Enhancing Vocal Tract Length Normalization with Elastic Registration for Automatic Speech Recognition.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant