CN1787072B

CN1787072B - Speech Synthesis Method Based on Prosodic Model and Parameter Selection

Info

Publication number: CN1787072B
Application number: CN2004100969685A
Authority: CN
Inventors: 陈明; 吕士楠; 张连毅; 武卫东; 肖娜
Original assignee: JIETONG HUASHENG SPEECH TECHNOLOGY Co Ltd
Current assignee: Beijing Sinovoice Technology Co Ltd
Priority date: 2004-12-07
Filing date: 2004-12-07
Publication date: 2010-06-16
Anticipated expiration: 2024-12-07
Also published as: CN1787072A

Abstract

The present invention provides a speech synthesis method based on a prosodic model and parameter-based unit selection. The method plans the acoustic parameters from the prosodic model to obtain a target value of each acoustic parameter for every syllable. It then performs maximum matching and selects, as the sample actually used, the matching string that differs least from the plan. After maximum matching, single-syllable matching is applied to the unmatched segments: a comprehensive cost is computed for each path through the candidate samples of all syllables in a segment, determined jointly by the gap between each candidate's acoustic parameters and their planned values and by the gap between the acoustic parameters of the candidates of adjacent syllables on the path. The path with the lowest comprehensive cost is found by dynamic programming. Once a sample has been selected for every syllable, the data are read from the speech database and the waveforms are concatenated to give the final synthesis result.

Description

Speech synthesis method based on a prosodic model and parameter-based unit selection

Technical field

The present invention relates to the field of speech synthesis technology, and specifically to a speech synthesis method.

Background technology

At present, Chinese speech synthesis is developing toward waveform concatenation based on a large-scale database of real recordings. Such a database contains a large amount of recorded natural speech and covers most pronunciations in most contexts; for each context, the system selects the best-matching original speech segments and concatenates them. Because the database is so large, a suitable natural speech unit can be found in nearly every case without any further signal modification, which keeps the synthesized speech consistent with the original recordings. In addition, the selected segments are not limited to single syllables: they can be multi-character words or even phrases, which further improves the naturalness of the synthesized speech.

The shortcoming of this approach is that, during concatenation, syllables are generally selected by prosodic-level matching: samples are chosen from the database so that their position in the sentence, in the prosodic phrase, and in the word matches that of the syllable to be synthesized as closely as possible. The actual acoustic parameters of a syllable (pitch, duration, intensity) do depend to some extent on its position. For example, pitch is generally higher at the start of a prosodic phrase and lower at its end, syllables at the end of a prosodic phrase are longer, and the middle syllable of a three-syllable word is the shortest of the three. This relationship is not absolute, however, and, more importantly, there is no guarantee that the many natural sentences recorded in the database share a consistent pitch or duration. Discontinuities therefore arise at the joins. For example, if two consecutive syllables of a sentence are taken from two different recorded sentences, they may fail to follow the variation pattern of real speech even though both were selected by position, because their actual acoustic parameters were not considered. The result is an audible pitch jump or a duration mismatch, which reduces the naturalness of the speech.

The object of the invention is to address these defects of existing waveform-concatenation speech synthesis based on a large-scale recorded speech database by performing dynamic Chinese speech synthesis with unit selection based on a prosodic model and acoustic parameters. The concatenated syllable samples then satisfy a prosodic model in their actual acoustic parameters, so that the variation of those parameters can be controlled and the loss of naturalness caused by mismatched syllables at the joins is eliminated.

Summary of the invention

In view of this, the present invention plans the acoustic parameters from a prosodic model to obtain the target value of each acoustic parameter for every syllable. It then performs maximum matching and selects, as the sample actually used, the matching string that differs least from the plan. After maximum matching, single-syllable matching is applied to the unmatched segments. For each path through the candidate samples of all syllables in a segment, a comprehensive cost is computed, determined jointly by the gap between each candidate's acoustic parameters and their planned values and by the gap between the acoustic parameters of the candidates of adjacent syllables on the path. The path with the minimum comprehensive cost is found by a dynamic programming algorithm. Once a sample has been selected for every syllable, the data are read from the speech database and the waveforms are concatenated to give the final synthesis result.

The speech synthesis method based on a prosodic model and parameter-based unit selection provided by the invention comprises the following steps:

(a) building a prosodic model library, a large-scale recorded speech database, and an index database;

(b) preprocessing the text to be synthesized, the preprocessing comprising sentence segmentation, text normalization, word segmentation, part-of-speech tagging, syntactic analysis, prosodic hierarchy analysis, and conversion to pinyin;

(c) according to the attributes of each syllable, namely its position in the word, in the prosodic phrase, and in the sentence, together with its phonetic-connection attribute and tone-connection attribute, looking up in the prosodic model library the acoustic parameter values that the syllable should have, thereby completing the planning of the acoustic parameters of each syllable, wherein the acoustic parameters comprise pitch, duration, and intensity;

(d) for each syllable, obtaining from the index database all candidate samples of that syllable in the large-scale recorded speech database;

(e) computing the cost C_j between the acoustic and position parameters of each matching string and the planned acoustic and position parameters, and finding the matching string whose cost C_min, the minimum of the costs C_j, is less than a threshold C_th, thereby obtaining the maximum matching length among all candidate samples of the current syllable;

(f) performing single-syllable matching on the segments of the text that remain unmatched:

computing the node cost between the acoustic and position parameters of every candidate sample of each syllable and the planned acoustic and position parameters;

computing the connection cost between all candidate samples of each pair of adjacent syllables;

using a dynamic programming algorithm to find, among all paths, the path with the minimum overall cost, the overall cost of a path being the sum of all node costs on the path and the connection costs between adjacent nodes on it;

setting the selected sample of each syllable to the candidate node through which the optimal path passes;

(g) according to the selected samples, obtaining the waveform data from the large-scale recorded speech database and concatenating them.

The method provided by the invention solves the discontinuity at the joins that affects existing waveform-concatenation speech synthesis based on a large-scale recorded speech database, and improves the naturalness of the synthesized speech.

Description of drawings

Fig. 1 is a flowchart of the speech synthesis process;

Fig. 2 is a flowchart of the maximum matching step;

Fig. 3 is an example of the single-syllable selection step.

Embodiment

Before synthesis itself, the following resources are built:

Large-scale recorded speech database: the speech waveform data, the start position of each syllable in the waveform, and its acoustic parameters (pitch, duration, intensity).

Index database: for every syllable, the serial numbers of all its samples in the large-scale recorded speech database. Looking up the recorded speech database by serial number quickly retrieves the data for that syllable.

Prosodic model library: a prosodic model obtained by statistical training, specifying what pitch, duration, and intensity each syllable in a sentence should have. The values of these acoustic parameters are closely related to factors such as sentence type, part-of-speech sequence, and the lengths of the sentence and of its prosodic phrases.

The speech synthesis flow is shown in Fig. 1.

It is described in detail below.

1. Preprocessing

The text to be synthesized first passes through a text preprocessing step. This step comprises sentence segmentation, text normalization, word segmentation, part-of-speech tagging, syntactic analysis, prosodic hierarchy analysis, conversion to pinyin, and so on. It yields the following results:

the pinyin of each syllable in the sentence;

the position of each syllable in the word, in the prosodic phrase, and in the sentence;

the part of speech of each word (for example noun, verb, adjective) and its syntactic constituent (subject, predicate, object, and so on).

2. Parameter planning

Using a number of attributes, the acoustic parameters that each syllable should have, that is, its pitch, duration, and intensity, are looked up in the prosodic model library, which completes the planning of the acoustic parameters of each syllable. These attributes include: whether the syllable is word-initial, word-medial, word-final, or a monosyllabic word; whether the word containing the syllable is sentence-initial, sentence-medial, or sentence-final; the tones of the preceding and following syllables, i.e. the tone-connection attribute; the final of the preceding syllable and the initial of the following syllable, i.e. the phonetic-connection attribute; the forward-sticking and backward-sticking attributes of the syllable; the position of the prosodic phrase containing the syllable and the intonation pattern of the sentence containing it; and the part of speech and syntactic constituent of the word containing the syllable.

Suppose a sentence has K syllables, numbered 1 to K. After planning, the acoustic parameters of syllable k are X_k = {H_k, L_k, T_k, A_k} (k = 1, ..., K): the planned pitch high point, pitch low point, duration, and intensity of the k-th syllable. Its position parameters are Y_k = {S_k, P_k, W_k}, the position of the syllable in the sentence, in the prosodic phrase, and in the word respectively. A position at the start of the sentence, prosodic phrase, or word is coded 0; a position in the middle is coded 1; a position at the end is coded 2.
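
As an illustration only, the planned parameters X_k and Y_k could be held in two small record types. The class names, the helper `position_code`, and the choice to code a one-element unit as 0 are assumptions made for this sketch; the text does not say how a monosyllabic word is coded.

```python
from dataclasses import dataclass

# Position codes as defined in the text: 0 = start, 1 = middle, 2 = end.
INITIAL, MEDIAL, FINAL = 0, 1, 2


@dataclass(frozen=True)
class Acoustic:
    """X_k = {H_k, L_k, T_k, A_k} for one syllable."""
    H: float  # pitch high point
    L: float  # pitch low point
    T: float  # duration
    A: float  # intensity


@dataclass(frozen=True)
class Position:
    """Y_k = {S_k, P_k, W_k} for one syllable."""
    S: int  # position in the sentence
    P: int  # position in the prosodic phrase
    W: int  # position in the word


def position_code(index: int, length: int) -> int:
    """Map a 0-based index inside a unit of `length` items to 0, 1 or 2.

    ASSUMPTION: a unit of length 1 is coded as INITIAL (0).
    """
    if index == 0:
        return INITIAL
    if index == length - 1:
        return FINAL
    return MEDIAL


# Second syllable of a three-syllable word, inside a longer phrase and sentence.
y = Position(S=position_code(4, 10), P=position_code(1, 5), W=position_code(1, 3))
```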

3. Obtaining all candidate samples

For each syllable, all samples of that syllable in the large-scale recorded speech database are obtained from the index database; these are called candidate samples.

The index database lists all samples of all syllables, arranged in syllable order. For each syllable it records the total number of samples, followed by the serial number of each sample in the large-scale recorded speech database. A sample is identified by its serial number in that database. Given a syllable, all of its samples in the large-scale recorded speech database can therefore be obtained quickly.
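
A minimal in-memory stand-in for such an index might look as follows. It assumes, purely for illustration, that the recorded database can be viewed as one flat list of pinyin syllables and that a sample's serial number is its position in that list; `build_index` and `candidates` are hypothetical names.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each syllable to the serial numbers of all its samples.

    corpus: list of pinyin syllables in recording order (ASSUMPTION:
    serial number == position in this list).
    """
    index = defaultdict(list)
    for serial, syllable in enumerate(corpus):
        index[syllable].append(serial)
    return dict(index)


def candidates(index, syllable):
    """Return all candidate samples of a syllable (empty if it was never recorded)."""
    return index.get(syllable, [])


corpus = ["yu3", "yin1", "he2", "cheng2", "yu3", "yin1"]
index = build_index(corpus)
yu3_samples = candidates(index, "yu3")  # two recorded samples of "yu3"
```

The number of samples of a syllable, which the text says is stored explicitly, is simply `len(candidates(index, syllable))` in this sketch.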

4. Maximum matching

As shown in Fig. 2, processing starts from the first syllable; set n = 1. (S4.1)

For every candidate sample of the current syllable (the n-th syllable), check whether the syllables that follow the candidate in its original recorded sentence match the syllables that follow the current syllable in the sentence to be synthesized, and record the matching length. If no following syllable matches, the matching length is 1 (only the syllable itself matches). (S4.2)

Compute the maximum matching length over all candidate samples of the current syllable, and let L be this maximum matching length. (S4.3)
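
Steps S4.2 and S4.3 can be sketched as below. The flat-list view of the recorded database and the function names are assumptions of this sketch; in particular, sentence boundaries in the recordings are ignored here, whereas the text compares only syllables within the candidate's original sentence (a boundary marker entry in the list would be one way to enforce that).

```python
def match_length(corpus, serial, target, n):
    """Count how many consecutive syllables agree, starting at corpus[serial]
    and target[n] (0-based). The candidate itself always matches, so the
    result is at least 1 when corpus[serial] == target[n]."""
    length = 0
    while (serial + length < len(corpus)
           and n + length < len(target)
           and corpus[serial + length] == target[n + length]):
        length += 1
    return length


def max_match_length(corpus, candidate_serials, target, n):
    """L in step S4.3: the longest match over all candidates (1 if there are none)."""
    return max((match_length(corpus, s, target, n) for s in candidate_serials),
               default=1)


corpus = ["a", "b", "c", "a", "b", "d"]
target = ["a", "b", "d"]
L = max_match_length(corpus, [0, 3], target, 0)
```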

If the matching length L is 1, there is no multi-syllable match; go to S4.10. (S4.4)

For the current syllable, take as matching strings every candidate sample whose matching length is >= L together with the L-1 syllables that follow it. One or more matching strings may be found. Suppose J matching strings are found, and let the acoustic parameters of the k-th sample of the j-th string and its position parameters in the original sentence be X'_{j,k} = {H'_{j,k}, L'_{j,k}, T'_{j,k}, A'_{j,k}} and Y'_{j,k} = {S'_{j,k}, P'_{j,k}, W'_{j,k}} (j = 1, ..., J; k = 0, ..., L-1). (S4.5)

Compute the cost C_j between the acoustic and position parameters of each matching string and the planned acoustic and position parameters:

C_j = \frac{\sum_{k=0}^{L-1} f(X_{n+k}, X'_{j,k}, Y_{n+k}, Y'_{j,k})}{L}    (S4.6)

where

f(X_i, X'_j, Y_i, Y'_j) = g(X_i, X'_j) + h(Y_i, Y'_j)

g(X_i, X'_j) = \sqrt{\omega_H (H_i - H'_j)^2 + \omega_L (L_i - L'_j)^2 + \omega_T (T_i - T'_j)^2 + \omega_A (A_i - A'_j)^2}

and \omega_H, \omega_L, \omega_T, \omega_A are the weights of the respective parameters.
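
The cost functions can be sketched as follows. The function g follows the weighted distance given above. The text does not define h, so the version here (a fixed penalty per mismatching position code) is purely an assumption for illustration, as are the default weights and the tuple layout of X and Y.

```python
import math


def g(x, xp, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted distance between two acoustic vectors (H, L, T, A)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, xp)))


def h(y, yp, penalty=1.0):
    """ASSUMPTION: the source does not define h. Here it charges a fixed
    penalty for each position code (S, P, W) that differs."""
    return penalty * sum(1 for a, b in zip(y, yp) if a != b)


def f(x, xp, y, yp):
    """f = g + h, as in the text."""
    return g(x, xp) + h(y, yp)


def string_cost(planned, matched):
    """C_j: the mean of f over the L syllables of one matching string.

    planned, matched: equal-length lists of (X, Y) tuples.
    """
    total = sum(f(x, xp, y, yp) for (x, y), (xp, yp) in zip(planned, matched))
    return total / len(planned)


planned = [((3, 0, 0, 0), (0, 1, 2)), ((1, 1, 1, 1), (0, 0, 0))]
matched = [((0, 4, 0, 0), (0, 1, 1)), ((1, 1, 1, 1), (0, 0, 0))]
c_j = string_cost(planned, matched)
```

Dividing by L keeps costs comparable across matching strings of different lengths, which matters because the same threshold is applied at every length.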

Find the matching string with the minimum cost, and let its cost be C_min:

C_{min} = \min_{j = 1, \ldots, J} C_j    (S4.7)

If the minimum cost C_min is greater than the threshold C_th, the acoustic parameters of this matching string differ too much from the planned ones, and no matching string of this length can give a result that conforms to the target values (S4.8a). In that case shorten the matching length, L = L - 1, and go to S4.4. (S4.8b)

Otherwise, mark the samples of the matching string as the selected samples of the syllables to be synthesized; L consecutive syllables are marked in total. (S4.9)

This step is reached either from S4.4, meaning that no maximum match was made and L = 1, so processing simply moves to the next syllable, or from S4.9, meaning that the L syllables just matched are skipped. Set n = n + L. (S4.10)

Determine whether the last syllable has been processed. If not, go to S4.2; otherwise, exit maximum matching. (S4.11)
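
The loop S4.1 to S4.11 can be sketched as below. This is a simplified reading of the flow, not the patented implementation: the recorded database is treated as a flat syllable list, indices are 0-based, and the cost of a matching string is supplied by a caller-provided function `string_cost(n, serial, length)`, a hypothetical signature introduced for this sketch.

```python
def _match_len(corpus, serial, target, n):
    """Number of consecutive agreeing syllables from corpus[serial] / target[n]."""
    k = 0
    while (serial + k < len(corpus) and n + k < len(target)
           and corpus[serial + k] == target[n + k]):
        k += 1
    return k


def maximum_matching(target, corpus, index, string_cost, threshold):
    """Return {target position: serial number} for syllables fixed by maximum matching.

    index: dict mapping a syllable to the serial numbers of its samples.
    string_cost(n, serial, length): C_j of the string starting at `serial`.
    """
    chosen = {}
    n = 0                                                      # S4.1
    while n < len(target):                                     # S4.11
        cands = index.get(target[n], [])
        lengths = {s: _match_len(corpus, s, target, n) for s in cands}   # S4.2
        L = max(lengths.values(), default=1)                   # S4.3
        while L > 1:                                           # S4.4
            strings = [s for s in cands if lengths[s] >= L]    # S4.5
            best = min(strings, key=lambda s: string_cost(n, s, L))  # S4.6, S4.7
            # S4.8a rejects only when C_min is greater than the threshold.
            if string_cost(n, best, L) <= threshold:
                for k in range(L):                             # S4.9
                    chosen[n + k] = best + k
                break
            L -= 1                                             # S4.8b
        n += L                                                 # S4.10
    return chosen


corpus = ["a", "b", "c", "x", "a", "b"]
index = {}
for serial, syl in enumerate(corpus):
    index.setdefault(syl, []).append(serial)
target = ["a", "b", "c", "d"]
accepted = maximum_matching(target, corpus, index, lambda n, s, L: 0.0, 1.0)
rejected = maximum_matching(target, corpus, index, lambda n, s, L: 5.0, 1.0)
```

Syllables absent from the returned mapping are the ones left for the single-syllable selection step.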

5. Single-syllable selection

After the maximum matching step, some syllables of the sentence have been assigned selected samples, while the others have not. Take the sentence "Jietong Huasheng Speech Technology Co., Ltd. has newly released a speech synthesis product" as an example: "Technology Co., Ltd.", "speech", and "product" have been given selected samples by maximum matching, and the remaining syllables form three parts, "Jietong Huasheng Speech", "has newly released", and "synthesis", none of whose syllables has a selected sample yet. Single-syllable selection means selecting samples for the syllables in these parts. Each such part is called a "processing region".

The following operations are carried out for each processing region.

Suppose the processing region consists of N syllables, numbered C to C+N-1. Each syllable has several candidate samples; let the number of candidates of the n-th syllable be M_n (n = C, ..., C+N-1). Each candidate sample is denoted W_{i,j} (i = C, ..., C+N-1; j = 1, ..., M_i). These form a lattice as shown in Fig. 3, in which each candidate sample is a node, and any path through the lattice is one possible selection result.

Compute the cost between the acoustic and position parameters of every candidate sample of each syllable and the planned acoustic and position parameters; this is called the node cost. Let the acoustic and position parameters of the j-th candidate node of the n-th syllable be X'_{n,j} = {H'_{n,j}, L'_{n,j}, T'_{n,j}, A'_{n,j}} and Y'_{n,j} = {S'_{n,j}, P'_{n,j}, W'_{n,j}} (n = C, ..., C+N-1; j = 1, ..., M_n). Its node cost is then D_{n,j} = f(X_n, X'_{n,j}, Y_n, Y'_{n,j}), where f is defined as above.

Compute the connection cost between all candidate samples of each pair of adjacent syllables. For example, the connection cost between the j-th candidate of the n-th syllable and the k-th candidate of the (n+1)-th syllable is E_{n,j,k} = g(X'_{n,j}, X'_{n+1,k}), where g is defined as above.

The overall cost of a path is defined as the sum of all node costs on the path and the connection costs between adjacent nodes on it.

For the node lattice of this processing region, consider any path from the first syllable to the last, and let p(n) denote the node of the n-th syllable through which the path passes. The overall cost of the path is then:

C_{path} = \sum_{n=C}^{C+N-1} D_{n,p(n)} + \sum_{n=C}^{C+N-2} E_{n,p(n),p(n+1)}
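
For a single given path, this overall cost can be evaluated directly, as in the sketch below. The nested-list layout of the cost tables and the 0-based indexing are assumptions made for illustration.

```python
def path_cost(node_cost, conn_cost, path):
    """Overall cost of one path through the lattice.

    node_cost[n][j]    : D for candidate j of syllable n
    conn_cost[n][j][k] : E between candidate j of syllable n and
                         candidate k of syllable n + 1
    path[n]            : p(n), the candidate chosen for syllable n
    """
    nodes = sum(node_cost[n][path[n]] for n in range(len(path)))
    links = sum(conn_cost[n][path[n]][path[n + 1]] for n in range(len(path) - 1))
    return nodes + links


node_cost = [[1, 2], [3, 4]]
conn_cost = [[[10, 20], [30, 40]]]
cost_a = path_cost(node_cost, conn_cost, [0, 1])  # 1 + 4 + 20
cost_b = path_cost(node_cost, conn_cost, [1, 0])  # 2 + 3 + 30
```

Evaluating every path this way grows exponentially with the number of syllables, which is why the optimal path is searched with dynamic programming instead.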

A dynamic programming algorithm is used to find the optimal path among all possible paths, that is, the path with the minimum overall cost. For example, the path drawn with the bold line in Fig. 3 is the one selected. The dynamic programming steps are as follows.

First compute the local optimal paths from the 1st syllable to the 2nd syllable (that is, from syllable C to syllable C+1): for each node W_{i,j} of the 2nd syllable (i = C+1; j = 1, ..., M_{C+1}), compute the cost of reaching it from every node of the previous syllable. This cost consists of the node cost of the node of the previous syllable plus the connection cost from that node to the node of the 2nd syllable. As shown in Fig. 3, for the 2nd node of the 2nd syllable, the cost from each node of the 1st syllable is:

Cost(W_{C,1}, W_{C+1,2}) = 21 + 6 = 27

Cost(W_{C,2}, W_{C+1,2}) = 32 + 10 = 42

Cost(W_{C,3}, W_{C+1,2}) = 24 + 12 = 36

Cost(W_{C,4}, W_{C+1,2}) = 18 + 8 = 26

The path with the minimum Cost is the local optimal path, here the path from W_{C,4} to W_{C+1,2}, with a local optimal path cost of 26. Likewise, the 1st and 3rd nodes of the 2nd syllable each have a local optimal path from some node of the first syllable; suppose their local optimal path costs are 16 and 20 respectively, as shown in Fig. 3.

Next compute the local optimal paths for the 3rd syllable (syllable C+2). For each node W_{i,j} of this syllable (i = C+2; j = 1, ..., M_{C+2}), compute the best local path to it from all nodes of the first syllable. Because the best local paths from the first syllable to each node of the 2nd syllable have already been computed, it suffices to add the local path cost from the 2nd syllable to the 3rd syllable to those results. For example, for the 2nd node of the 3rd syllable the costs are:

Cost(W_{C+1,1}, W_{C+2,2}) = 16 + 18 + 27 = 61

Cost(W_{C+1,2}, W_{C+2,2}) = 26 + 22 + 10 = 58

Cost(W_{C+1,3}, W_{C+2,2}) = 20 + 34 + 11 = 65

Thus the local optimal path from the second syllable to the 2nd node of the third syllable is the path from W_{C+1,2} to W_{C+2,2}. Tracing back the local optimal path that ends at W_{C+1,2} then shows that the optimal path from the first syllable to the 2nd node of the 3rd syllable runs from W_{C,4} to W_{C+1,2} and then to W_{C+2,2}.
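
The arithmetic of this worked example can be checked with a few lines. The numbers are those given in the text for Fig. 3; the variable names are introduced here only for illustration.

```python
# Costs of reaching the 2nd node of the 2nd syllable,
# keyed by the node number of the 1st syllable.
into_syl2_node2 = {1: 21 + 6, 2: 32 + 10, 3: 24 + 12, 4: 18 + 8}
prev2 = min(into_syl2_node2, key=into_syl2_node2.get)

# Costs of reaching the 2nd node of the 3rd syllable, keyed by the node number
# of the 2nd syllable. The first term of each sum is that node's local optimal
# path cost (16, 26 and 20).
into_syl3_node2 = {1: 16 + 18 + 27, 2: 26 + 22 + 10, 3: 20 + 34 + 11}
prev3 = min(into_syl3_node2, key=into_syl3_node2.get)

# Backtracking: the best predecessor of W_{C+2,2} is node 2 of the 2nd syllable,
# whose own best predecessor was found above.
path = [prev2, prev3, 2]
print(into_syl2_node2[prev2], into_syl3_node2[prev3], path)  # prints: 26 58 [4, 2, 2]
```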

Continue in this way until the optimal path to every node of the last syllable has been computed. Then compare the Cost values of the local optimal paths of all these nodes, take the node with the minimum Cost as the final node of the overall optimal path, and trace back through the local optimal paths to recover the complete optimal path.
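
The whole procedure is a standard shortest-path search over a lattice and can be sketched as below. The table layout and names are assumptions of this sketch. It also adds each node's own cost when the node is reached, which is a slightly different bookkeeping from the worked example but minimizes the same overall path cost defined earlier.

```python
def best_path(node_cost, conn_cost):
    """Return (minimum overall cost, path) for one processing region.

    node_cost[n][j]    : D for candidate j of syllable n (0-based)
    conn_cost[n][j][k] : E between candidate j of syllable n and
                         candidate k of syllable n + 1
    """
    best = list(node_cost[0])   # best[j]: cheapest path ending at node j
    back = []                   # back[n - 1][k]: best predecessor of node k of syllable n
    for n in range(1, len(node_cost)):
        new_best, pointers = [], []
        for k, d in enumerate(node_cost[n]):
            # Local optimal path into node k of syllable n.
            j = min(range(len(best)), key=lambda j: best[j] + conn_cost[n - 1][j][k])
            new_best.append(best[j] + conn_cost[n - 1][j][k] + d)
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Pick the cheapest final node, then trace the local optimal paths backwards.
    last = min(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


node_cost = [[1, 5], [2, 1], [3, 1]]
conn_cost = [[[4, 1], [1, 1]], [[1, 9], [2, 2]]]
cost, path = best_path(node_cost, conn_cost)
```

Each syllable's nodes are visited once against the nodes of the previous syllable, so the work grows with the sum of M_n * M_{n+1} rather than with the number of paths.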

Set the selected sample of each syllable in this processing region to the candidate node through which the optimal path passes.

Process the next processing region, and continue until no processing region remains.

6. Waveform concatenation

After the steps above, a sample has been selected for every syllable. Once all samples are selected, their serial numbers in the large-scale recorded speech database are known. Looking up the database by serial number gives the start position of the speech waveform data of each sample and its length, which is obtained from the duration in its acoustic parameters. With these values the corresponding waveform data can be read from the large-scale recorded speech database. Joining the waveform data of all selected samples in order completes the waveform concatenation and yields the final speech synthesis result.
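
A minimal sketch of this last step follows. It assumes, for illustration only, that the recorded waveform is available as one sequence of sample points and that each serial number maps to a (start, length) pair expressed in sample points; the names are hypothetical.

```python
def concatenate(waveform, segments, chosen_serials):
    """Join the waveform segments of the selected samples in order.

    waveform       : sequence of sample points of the recorded database
    segments       : dict serial number -> (start, length), both in sample points
    chosen_serials : serial numbers of the selected samples, in sentence order
    """
    out = []
    for serial in chosen_serials:
        start, length = segments[serial]
        out.extend(waveform[start:start + length])
    return out


waveform = list(range(10))
segments = {0: (0, 2), 1: (5, 3)}
result = concatenate(waveform, segments, [1, 0])
```

The segments are joined as they are, without smoothing at the joins, in line with the plain concatenation described in the text.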

Claims

1. A speech synthesis method based on a prosodic model and parameter-based unit selection, comprising the following steps:

(e) computing the cost C_j between the acoustic and position parameters of each matching string that can be formed from the candidate samples of adjacent syllables and the planned acoustic and position parameters, finding the matching string whose cost C_min, the minimum of the costs C_j, is less than a threshold C_th, and setting the selected samples of those adjacent syllables to the candidate samples of that matching string;

using a dynamic programming algorithm to find, among all paths, the path with the minimum overall cost, the overall cost of a path being the sum of all node costs on the path and the connection costs between adjacent nodes on it;