Summary of the invention
In view of this, the present invention first plans the acoustic parameters on the basis of a prosody model, obtaining the target value of the ideal acoustic parameters for each syllable. Maximum matching is then performed, and the sample with the smallest deviation from the target is selected as the sample actually used. After maximum matching is finished, the segments that were not matched undergo single-character matching. An overall cost is calculated for each path through the segment that runs across the candidate samples of all its syllables; the overall cost combines the deviation between the acoustic parameters of each candidate sample and their planned values with the deviation between the acoustic parameters of the candidate samples of each pair of adjacent syllables on the path. The path with the minimum overall cost is obtained by a dynamic programming algorithm. Once samples have been selected for all syllables, the data are retrieved from the speech corpus and the waveforms are concatenated to obtain the final synthesis result.
The speech synthesis method with sound selection based on a prosody model and parameters, as provided by the present invention, comprises the following steps:
(a) establishing a prosody model library, a large-scale recorded speech corpus, and an index library;
(b) preprocessing the text to be synthesized, the preprocessing comprising sentence segmentation, text normalization, word segmentation, part-of-speech tagging, syntactic analysis, prosodic hierarchy analysis, and conversion to pinyin;
(c) according to the attributes of each syllable, namely its position within the word, within the prosodic phrase, and within the sentence, together with its segmental-junction attribute and tone-junction attribute, looking up in the prosody model library the acoustic parameter values that each syllable should have, thereby completing the planning of the acoustic parameters of each syllable; wherein the acoustic parameters comprise pitch, duration, and intensity;
(d) for each syllable, obtaining from the index library all candidate samples of that syllable present in the large-scale recorded speech corpus;
(e) calculating the cost $C_j$ between the acoustic and position parameters of each matching string and the planned acoustic and position parameters, and finding the matching string whose cost is the minimum cost $C_{min}$ among the costs $C_j$ and is less than a threshold $C_{th}$, thereby obtaining the maximum matching length among all candidate samples of the current syllable;
(f) performing single-character matching on the segments of the text that were not matched:
calculating, for every candidate sample of each syllable, the node cost between its acoustic and position parameters and the planned acoustic and position parameters;
calculating the connection cost between all candidate samples of each pair of adjacent syllables;
using a dynamic programming algorithm to find, among all paths, the path with the minimum overall cost, the overall cost of a path being the sum of all node costs on the path and the connection costs between adjacent nodes;
setting the selected sample of each syllable to the candidate node through which the optimal path passes;
(g) according to the selected samples, retrieving the waveform data from the large-scale recorded speech corpus and concatenating them.
The method provided by the present invention solves the discontinuity at concatenation points found in existing waveform-concatenation speech synthesis methods based on a large-scale recorded speech corpus, and improves the naturalness of the synthesized speech.
Embodiment
Before speech synthesis proper, the following resource libraries are established:
Large-scale recorded speech corpus: the speech waveform data, the start position of each syllable in the speech waveform, and its acoustic parameter data (pitch, duration, intensity).
Index library: for every syllable, the sequence numbers of all its samples in the large-scale recorded speech corpus are recorded; looking up the corpus by sequence number quickly yields the data for that syllable.
Prosody model library: a prosody model obtained by statistical training, which specifies what pitch, duration, and intensity each syllable in a sentence should have. The values of these acoustic parameters are closely related to factors such as the sentence pattern, the part-of-speech sequence, and the lengths of the sentence and of the prosodic phrases.
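As an illustration only, the corpus records and the index library described above could be laid out as follows. The class name `CorpusSample`, its field names, the helper `build_index`, and the toy values are all hypothetical; the text does not prescribe a storage format.

```python
from dataclasses import dataclass


@dataclass
class CorpusSample:
    """One syllable sample in the recorded corpus (hypothetical layout)."""
    pinyin: str
    start: int          # start position in the waveform, in PCM samples
    pitch_high: float   # acoustic parameters of the sample
    pitch_low: float
    duration: int       # length in PCM samples
    intensity: float


def build_index(corpus):
    """Map each syllable to the sequence numbers of its samples in the corpus."""
    index = {}
    for seq_no, sample in enumerate(corpus):
        index.setdefault(sample.pinyin, []).append(seq_no)
    return index


# Toy corpus with made-up values.
corpus = [
    CorpusSample("yu3", 0, 220.0, 180.0, 1600, 0.7),
    CorpusSample("yin1", 1600, 240.0, 230.0, 1800, 0.8),
    CorpusSample("yu3", 3400, 210.0, 170.0, 1500, 0.6),
]
index = build_index(corpus)
```

Here a sample's sequence number is simply its position in the `corpus` list, so `corpus[index["yu3"][1]]` returns the second recorded sample of that syllable.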
The speech synthesis flow is shown in Figure 1.
It is described in detail as follows:
1. Preprocessing
The text to be synthesized first goes through a text preprocessing step. This step comprises sentence segmentation, text normalization, word segmentation, part-of-speech tagging, syntactic analysis, prosodic hierarchy analysis, conversion to pinyin, and so on. It yields the following results:
The pinyin of each syllable in the sentence;
The position of each syllable within its word, within its prosodic phrase, and within the sentence;
The part of speech (for example noun, verb, adjective) and the syntactic constituent (subject, predicate, object, and so on) of each word.
2. Parameter planning
Using a number of attributes, the acoustic parameters that each syllable should have, that is, what its pitch, duration, and intensity should be, are looked up in the prosody model library, completing the planning of the acoustic parameters of each syllable. These attributes include: whether the syllable is word-initial, word-medial, word-final, or a monosyllabic word; whether the word containing the syllable is sentence-initial, sentence-medial, or sentence-final; the tones of the preceding and following syllables, that is, the tone-junction attribute; the final of the preceding syllable and the initial of the following syllable, that is, the segmental-junction attribute; the forward-adhesion and backward-adhesion attributes of the syllable; the position of the prosodic phrase containing the syllable and the intonation pattern of the sentence containing it; and the part of speech and syntactic constituent of the word containing the syllable.
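A minimal sketch of this lookup, assuming the prosody model library can be treated as a table keyed by a syllable's attributes. Only four of the attributes listed above are modelled; the names `SyllableAttributes`, `AcousticTarget`, `PROSODY_MODEL`, and `plan_parameters`, and all numeric values, are hypothetical.

```python
from typing import NamedTuple


class SyllableAttributes(NamedTuple):
    word_position: str      # "initial", "medial", "final", or "mono"
    sentence_position: str  # "initial", "medial", or "final"
    prev_tone: int          # tone-junction attribute; 0 means no neighbour
    next_tone: int


class AcousticTarget(NamedTuple):
    pitch_high: float
    pitch_low: float
    duration: float
    intensity: float


# Toy prosody model with made-up values; a trained model would cover
# every attribute combination that can occur.
PROSODY_MODEL = {
    SyllableAttributes("initial", "initial", 0, 1): AcousticTarget(230.0, 190.0, 0.20, 0.8),
    SyllableAttributes("final", "final", 3, 0): AcousticTarget(180.0, 140.0, 0.28, 0.6),
}


def plan_parameters(attribute_sequence, model=PROSODY_MODEL):
    """Return the planned acoustic target for each syllable, in order."""
    return [model[attrs] for attrs in attribute_sequence]


plan = plan_parameters([
    SyllableAttributes("initial", "initial", 0, 1),
    SyllableAttributes("final", "final", 3, 0),
])
```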
Suppose a sentence has K syllables in total (numbered 1 to K). After planning, the acoustic parameters of each syllable are $X_k = \{H_k, L_k, T_k, A_k\}$ ($k = 1, \ldots, K$), which are respectively the pitch high point, pitch low point, duration, and intensity planned for the k-th syllable. Its position parameters are $Y_k = \{S_k, P_k, W_k\}$, representing the position of the syllable in the sentence, in the prosodic phrase, and in the word, respectively. The beginning of a sentence, prosodic phrase, or word is defined as 0; the middle of a sentence, prosodic phrase, or word is defined as 1; and the end of a sentence, prosodic phrase, or word is defined as 2.
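The 0/1/2 position coding can be sketched as below. The function names are hypothetical, and the handling of a unit that contains only one syllable (coded here as 0) is an assumption, since the text does not address that case.

```python
def position_code(index, length):
    """Return 0 for the beginning, 1 for the middle, 2 for the end of a unit.

    `index` is the 0-based position of the syllable inside a unit of `length`
    syllables. A single-syllable unit is coded as 0 (an assumption).
    """
    if index == 0:
        return 0
    if index == length - 1:
        return 2
    return 1


def position_parameters(s_index, s_len, p_index, p_len, w_index, w_len):
    """Build Y_k = (S_k, P_k, W_k) for one syllable."""
    return (
        position_code(s_index, s_len),  # position in the sentence
        position_code(p_index, p_len),  # position in the prosodic phrase
        position_code(w_index, w_len),  # position in the word
    )
```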
3. Obtaining all candidate samples
For each syllable, all samples of that syllable present in the large-scale recorded speech corpus are obtained from the index library; these are called candidate samples.
The index library lists all samples of all syllables, arranged in syllable order. For each syllable it records the total number of samples, followed in order by the sequence number of each sample in the large-scale recorded speech corpus. A sample is identified by its sequence number in the corpus. Therefore, given a syllable, all of its samples in the corpus can be obtained quickly.
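The layout just described (a count followed by that many sequence numbers, repeated in syllable order) could be read as in the following sketch. The flat list of integers, the separate ordered syllable list, and the name `parse_index` are assumptions made for illustration.

```python
def parse_index(syllables, flat):
    """Split a flat index into per-syllable lists of sequence numbers.

    `syllables` gives the syllable order; `flat` holds, for each syllable in
    that order, the sample count followed by the sequence numbers.
    """
    index = {}
    pos = 0
    for syllable in syllables:
        count = flat[pos]
        index[syllable] = flat[pos + 1 : pos + 1 + count]
        pos += 1 + count
    return index


# Toy data: "ba1" has 2 samples, "ba2" has none, "bo1" has 3.
syllables = ["ba1", "ba2", "bo1"]
flat = [2, 17, 93, 0, 3, 5, 41, 208]
index = parse_index(syllables, flat)
```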
4. Maximum matching
As shown in Figure 2, processing starts from the first syllable, with n=1; (S4.1)
For every candidate sample of the current syllable (the n-th syllable), check whether the syllables that follow the candidate sample in its original sentence match the syllables that follow in the sentence to be synthesized, and record the matching length. If no following syllable matches, the matching length is 1 (meaning that only the syllable itself is matched); (S4.2)
Compute the maximum matching length over all candidate samples of the current syllable, and let L be this maximum matching length; (S4.3)
If the matching length L is 1, there is no multi-syllable match; go to S4.10; (S4.4)
For the current syllable, select as matching strings every candidate sample whose matching length is >= L, together with the L-1 syllables that follow it. One or more matching strings may be found. Suppose J matching strings are found, and that the acoustic parameters of each sample in a given string and its position parameters in the original sentence are $X'_{j,k} = \{H'_{j,k}, L'_{j,k}, T'_{j,k}, A'_{j,k}\}$ and $Y'_{j,k} = \{S'_{j,k}, P'_{j,k}, W'_{j,k}\}$ ($j = 1, \ldots, J$; $k = 0, \ldots, L-1$); (S4.5)
Calculate the cost $C_j$ between the acoustic and position parameters of each matching string and the planned acoustic and position parameters,
where:
$f(X_i, X'_j, Y_i, Y'_j) = g(X_i, X'_j) + h(Y_i, Y'_j)$
$h(Y_i, Y'_j) = \omega_S |S_i - S'_j| + \omega_P |P_i - P'_j| + \omega_W |W_i - W'_j|$
where each $\omega$ is the weight of the corresponding parameter.
Find the matching string with the minimum cost, and let its cost be $C_{min}$:
$C_{min} = \min(C_j)$ ($j = 1, \ldots, J$) (S4.7)
If the minimum cost $C_{min}$ is greater than the threshold $C_{th}$, the acoustic parameters of this matching string differ too much from the planned ones, and a matching string of this length cannot give a result consistent with the ideal values (S4.8a). The matching length is then shortened, L=L-1, and processing returns to S4.4; (S4.8b)
Otherwise, the selected samples of the syllables to be synthesized are set to the samples represented by the matching string, L consecutive syllables in all; (S4.9)
Set n=n+L. When this step is reached from step S4.4, no maximum match was made and L=1, so processing moves to the next syllable. When it is reached from step S4.9, the L syllables covered by the maximum match are skipped; (S4.10)
Check whether the last syllable has been processed. If not, jump to S4.2 and continue; otherwise, exit the maximum matching process. (S4.11)
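Steps S4.1 to S4.11 can be sketched as the loop below. This is an illustration under several assumptions: the corpus is a list of sentences, each a list of syllable strings; a sample is identified by a hypothetical `(sentence_id, position)` pair rather than a sequence number; syllables are numbered from 0; and the cost of a matching string is supplied by the caller as `string_cost(n, sample, length)`, because the cost formula is only partly given in the text.

```python
def match_length(target, n, sentence, pos):
    """Count consecutive syllables that agree, starting at target[n] and sentence[pos]."""
    length = 0
    while (n + length < len(target) and pos + length < len(sentence)
           and target[n + length] == sentence[pos + length]):
        length += 1
    return length


def maximum_matching(target, corpus, string_cost, threshold):
    """Return one (sentence_id, position) sample per syllable, or None where unmatched."""
    chosen = [None] * len(target)
    n = 0                                                     # S4.1
    while n < len(target):                                    # S4.11
        candidates = [(s, p) for s, sent in enumerate(corpus)
                      for p, syl in enumerate(sent) if syl == target[n]]
        lengths = {c: match_length(target, n, corpus[c[0]], c[1])
                   for c in candidates}                       # S4.2
        L = max(lengths.values(), default=1)                  # S4.3
        while L > 1:                                          # S4.4
            strings = [c for c in candidates if lengths[c] >= L]      # S4.5
            best = min(strings, key=lambda c: string_cost(n, c, L))   # S4.7
            if string_cost(n, best, L) <= threshold:          # S4.8a
                for k in range(L):
                    chosen[n + k] = (best[0], best[1] + k)    # S4.9
                break
            L -= 1                                            # S4.8b
        n += L                                                # S4.10
    return chosen


target = ["yu3", "yin1", "he2", "cheng2"]
corpus = [["yu3", "yin1", "ji4", "shu4"], ["he2", "cheng2"], ["yu3", "yin1", "he2"]]

# With a cost that always passes, the longest match (3 syllables) is taken.
accept_all = maximum_matching(target, corpus, lambda n, c, L: 0.0, 1.0)

# With a cost that rejects 3-syllable strings, the length is shortened to 2.
reject_long = maximum_matching(
    target, corpus, lambda n, c, L: 10.0 if L == 3 else 0.0, 1.0)
```

Syllables left as `None` are the ones that remain for single-character selection.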
5. Single-character selection
After the maximum matching step, some syllables in the sentence have been assigned selected samples, while the others have not. For example, in the sentence "Jietong Huasheng Speech Technology Co., Ltd. has newly released a speech synthesis product", "Technology Co., Ltd.", "speech", and "product" already have selected samples from maximum matching. The remaining syllables form three parts, "Jietong Huasheng", "has newly released", and "synthesis", and none of the syllables in these parts has a selected sample yet. Single-character selection here means selecting samples for the syllables in these parts. These parts are called "processing regions".
Each processing region is handled as follows.
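Finding these regions amounts to collecting the maximal runs of syllables that still have no selected sample. A minimal sketch, assuming the selections are held in a list with `None` for unassigned syllables (the function name and the representation are hypothetical):

```python
def processing_regions(chosen):
    """Return (start, length) for each maximal run of syllables without a sample."""
    regions = []
    start = None
    for i, sample in enumerate(chosen):
        if sample is None and start is None:
            start = i                              # a new region begins
        elif sample is not None and start is not None:
            regions.append((start, i - start))     # the region ended at i - 1
            start = None
    if start is not None:                          # region runs to the end
        regions.append((start, len(chosen) - start))
    return regions


regions = processing_regions([None, None, "a", "b", None, "c", None, None, None])
```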
Suppose the processing region consists of N syllables, numbered from C to C+N-1. Each syllable has several candidate samples; let the number of candidate samples of the n-th syllable be $M_n$ ($n = C, \ldots, C+N-1$). Denote each candidate sample by $W_{ij}$ ($i = C, \ldots, C+N-1$; $j = 1, \ldots, M_i$). These form a lattice as shown in Figure 3, in which each candidate sample is a node. Any path running through the lattice is a possible sound-selection result.
Calculate, for every candidate sample of each syllable, the cost between its acoustic and position parameters and the planned acoustic and position parameters; this is called the node cost. Suppose the acoustic parameters and position parameters of the j-th candidate node of the n-th syllable are $X'_{n,j} = \{H'_{n,j}, L'_{n,j}, T'_{n,j}, A'_{n,j}\}$ and $Y'_{n,j} = \{S'_{n,j}, P'_{n,j}, W'_{n,j}\}$ ($n = 1, \ldots, N$; $j = 1, \ldots, J$). Its node cost is then $D_{n,j} = f(X_n, X'_{n,j}, Y_n, Y'_{n,j})$, where the function $f$ is defined as above.
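The node cost can be sketched as follows. The position term `h` follows the weighted absolute differences given earlier; the acoustic term `g` is not spelled out in the text, so the weighted absolute difference used here is purely an assumption, as are the default weights of 1.0.

```python
def g(x, x_prime, weights=(1.0, 1.0, 1.0, 1.0)):
    """Acoustic distance between two (H, L, T, A) tuples (assumed form)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, x_prime))


def h(y, y_prime, weights=(1.0, 1.0, 1.0)):
    """Position distance between two (S, P, W) tuples."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, y, y_prime))


def f(x, x_prime, y, y_prime):
    """Node cost: acoustic term plus position term."""
    return g(x, x_prime) + h(y, y_prime)


# Planned parameters versus one candidate, with made-up values.
node_cost = f((200, 150, 10, 5), (210, 150, 12, 5), (0, 1, 2), (0, 1, 0))
```

Under the same assumption, the connection cost between two adjacent candidates is `g` applied to their two acoustic tuples.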
Calculate the connection cost between all candidate samples of each pair of adjacent syllables. For example, the connection cost between the j-th candidate of the n-th syllable and the k-th candidate of the (n+1)-th syllable is $E_{n,j,k} = g(X'_{n,j}, X'_{n+1,k})$, where the function $g$ is defined as above.
The overall cost of a path is defined as the sum of all node costs on the path and the connection costs between adjacent nodes.
For the node lattice of this processing region, consider any path from a node of the first syllable to a node of the last syllable, and let the node it passes through at the n-th syllable be the p(n)-th node. The overall cost of this path is then:
$\mathrm{Cost} = \sum_{n=1}^{N} D_{n,p(n)} + \sum_{n=1}^{N-1} E_{n,p(n),p(n+1)}$
A dynamic programming algorithm is used to compute the optimal path among all possible paths, that is, to select the path with the minimum overall cost. For example, the path drawn with the bold line in Figure 3 is the one selected. The specific steps of the dynamic programming are as follows:
First compute the locally optimal paths from the 1st syllable to the 2nd syllable (the syllable with sequence number C+1). That is, for each node $W_{ij}$ ($i = C+1$; $j = 1, \ldots, M_{C+1}$) of the 2nd syllable, compute the cost from every node of the previous syllable to this node; this cost consists of the node cost of the node of the previous syllable plus the connection cost from that node to this node of the 2nd syllable. As shown in Figure 3, for the 2nd node of the 2nd syllable, the cost from each node of the 1st syllable to this node is computed as follows:
$\mathrm{Cost}(W_{C,1}, W_{C+1,2}) = 21 + 6 = 27$
$\mathrm{Cost}(W_{C,2}, W_{C+1,2}) = 32 + 10 = 42$
$\mathrm{Cost}(W_{C,3}, W_{C+1,2}) = 24 + 12 = 36$
$\mathrm{Cost}(W_{C,4}, W_{C+1,2}) = 18 + 8 = 26$
The path with the minimum Cost is the locally optimal path, namely the path from $W_{C,4}$ to $W_{C+1,2}$, whose cost is 26. Likewise, the 1st and 3rd nodes of the 2nd syllable each have a locally optimal path from some node of the first syllable; suppose their locally optimal path costs are 16 and 20 respectively, as shown in Figure 3.
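The selection of the locally optimal predecessor for the 2nd node of the 2nd syllable can be reproduced with the figures quoted above. The variable names are hypothetical, and reading the first addend of each sum as the node cost and the second as the connection cost follows the description of how the cost is composed.

```python
# Node costs of the four nodes of the 1st syllable (first addend of each sum).
node_costs = [21, 32, 24, 18]
# Connection costs from each of those nodes to the 2nd node of the
# 2nd syllable (second addend of each sum).
connection_costs = [6, 10, 12, 8]

totals = [d + e for d, e in zip(node_costs, connection_costs)]
best_cost = min(totals)
best_prev = totals.index(best_cost) + 1   # 1-based node number, as in the text
```

This gives `totals == [27, 42, 36, 26]`, so the 4th node of the 1st syllable is the best predecessor, with cost 26.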
Next compute the locally optimal paths for the 3rd syllable (the syllable with sequence number C+2). For each node $W_{ij}$ ($i = C+2$; $j = 1, \ldots, M_{C+2}$) of this syllable, compute the best local path from all nodes of the first syllable to this node. Because the best local paths from the first syllable to each node of the 2nd syllable have already been computed, it suffices to add to the cost of that best local path the cost of the local path from the 2nd syllable to the 3rd syllable. For example, for the 2nd node of the 3rd syllable, the costs are:
$\mathrm{Cost}(W_{C+1,1}, W_{C+2,2}) = 16 + 18 + 27 = 61$
$\mathrm{Cost}(W_{C+1,2}, W_{C+2,2}) = 26 + 22 + 10 = 58$
$\mathrm{Cost}(W_{C+1,3}, W_{C+2,2}) = 20 + 34 + 11 = 65$
Therefore, the locally optimal path from the second syllable to the 2nd node of the third syllable is the path from $W_{C+1,2}$ to $W_{C+2,2}$. Tracing back from $W_{C+1,2}$ along its own locally optimal path from the first syllable shows that the optimal path from the first syllable to the 2nd node of the 3rd syllable is the path from $W_{C,4}$ to $W_{C+1,2}$ and then to $W_{C+2,2}$.
Continue in this way until the optimal path to every node of the last syllable has been computed. Then compare the Cost values of the locally optimal paths of all these nodes, and take the node with the minimum Cost as the last node of the overall optimal path. Tracing back through the locally optimal paths then gives the overall optimal path.
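The whole search, including the backtracking, can be sketched as a standard dynamic programming pass over the lattice. This is an illustration, not the patent's exact bookkeeping: the sketch adds each node's own cost when the node is entered, so that the final figure equals the overall cost as defined (all node costs plus all connection costs). The function name, the 0-based indices, and the toy costs are hypothetical.

```python
def best_path(node_cost, conn_cost):
    """Find the minimum-cost path through the lattice.

    node_cost[n][j] is the node cost of candidate j of syllable n.
    conn_cost[n][j][k] is the connection cost between candidate j of
    syllable n and candidate k of syllable n + 1.
    Returns (total_cost, path), where path[n] is the chosen candidate index.
    """
    n_syll = len(node_cost)
    best = [list(node_cost[0])]               # best cost of reaching each node
    back = [[None] * len(node_cost[0])]       # best predecessor of each node
    for n in range(1, n_syll):
        row_cost, row_back = [], []
        for k, d in enumerate(node_cost[n]):
            prev = [best[n - 1][j] + conn_cost[n - 1][j][k]
                    for j in range(len(node_cost[n - 1]))]
            j_best = min(range(len(prev)), key=prev.__getitem__)
            row_cost.append(prev[j_best] + d)
            row_back.append(j_best)
        best.append(row_cost)
        back.append(row_back)
    # Pick the cheapest node of the last syllable, then trace back.
    last = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [last]
    for n in range(n_syll - 1, 0, -1):
        path.append(back[n][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy lattice: 3 syllables with 2 candidates each (made-up costs).
node_cost = [[1, 4], [2, 1], [3, 1]]
conn_cost = [[[5, 1], [1, 5]], [[1, 4], [2, 1]]]
total, path = best_path(node_cost, conn_cost)
```

For this toy lattice the result is a total cost of 5 along the path `[0, 1, 1]`: node costs 1 + 1 + 1 plus connection costs 1 + 1.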
Set the selected sample of each syllable in this processing region to the candidate node through which the optimal path passes.
Process the next processing region, until no processing region remains.
6. Waveform concatenation
After the above steps, a sample has been selected for every syllable. Once a sample is selected, its sequence number in the large-scale recorded speech corpus is known; looking up the corpus by this sequence number gives the start position of the speech waveform data for that sample and the length, obtained from the duration in its acoustic parameters. With these values, the corresponding waveform data can be read from the corpus. Joining the waveform data of all selected samples completes the waveform concatenation and yields the final speech synthesis result.
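A minimal sketch of this last step, assuming the corpus waveform is available as one in-memory sequence of PCM samples and each selected sample has already been resolved to a `(start, length)` pair. The function name and the toy data are hypothetical; the text describes plain joining, so no smoothing at the joins is applied here.

```python
def concatenate(waveform, selected):
    """Join the waveform slices of the selected samples, in order.

    `waveform` is the whole corpus as a sequence of PCM samples;
    `selected` is a list of (start, length) pairs, one per syllable.
    """
    out = []
    for start, length in selected:
        out.extend(waveform[start:start + length])
    return out


# Toy waveform whose sample values equal their positions.
waveform = list(range(100))
result = concatenate(waveform, [(10, 3), (50, 2), (0, 1)])
```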