CN105551481A - Rhythm marking method of voice data and apparatus thereof - Google Patents
Rhythm marking method of voice data and apparatus thereof
- Publication number
- CN105551481A (application CN201510967511.5A)
- Authority
- CN
- China
- Prior art keywords
- candidate
- feature information
- information
- prosodic features
- text
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
The invention provides a prosody labeling method and apparatus for voice data. The method comprises the following steps: acquiring the text of the voice data to be labeled and extracting first text feature information and second text feature information from it; extracting acoustic feature information from the voice data; generating N candidate prosodic feature information items according to the first text feature information and a prosody prediction model; generating N candidate acoustic feature information items based on the N candidate prosodic feature information items, the second text feature information and an acoustic prediction model; calculating a correlation value between each candidate acoustic feature information item and the extracted acoustic feature information; taking the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as the target prosodic feature information of the voice data to be labeled; and labeling the prosodic features of the voice data to be labeled according to the target prosodic feature information. With the method and apparatus of the embodiments, the prosodic pauses of the labeled voice data are annotated accurately, so that the synthesized speech sounds smooth and natural.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a prosody labeling method and apparatus for voice data.
Background art
Speech synthesis is a technology that produces artificial speech by mechanical or electronic means; it converts text generated by a computer or entered from outside into intelligible, fluent, audible speech output. The purpose of speech synthesis is to convert text into speech and play it to the user, with the goal of matching the effect of a human reading the text aloud.
Usually, to achieve this effect, a speech synthesis system needs a speech synthesis corpus in which prosodic feature information (for example, prosodic pause levels) is labeled accurately. There are two main schemes in the related art. In the first scheme, a speaker of broadcaster quality is selected and a large amount of speech data (generally about 10 hours of recordings) is recorded in a professional studio; the speech data is then labeled manually with prosodic feature information according to the text and the speech read aloud, to generate the speech synthesis corpus required by the speech synthesis system. That is, the prosodic feature information in the corpus is labeled manually according to the prosody of the speaker's reading. A synthesis corpus built from such read speech can offer only a limited range of timbres, so the voice synthesized by the system is rather uniform and flat. The second scheme is based on the idea of big-data synthesis: a large amount of accurately pronounced speech data is collected and a speech synthesis corpus is built from it. A corpus built from big data gives the speech synthesis system a variety of timbres and can satisfy diverse personalized demands. When building such a corpus from massive speech data, how to automatically label the prosodic feature information of the large amount of speech data quickly and with little manual effort is one of the keys to producing the corpus.
In the related art, two approaches are mainly used to label the prosodic feature information of speech data based on big data. In the first approach, speech feature information related to prosody, such as the length of silent segments in the speech signal and the trend of the fundamental frequency, is extracted from the speech data; the prosodic feature information of the speech data is then determined from this speech feature information, and the speech data is labeled automatically accordingly. Because the extracted speech feature information is not robust, this way of automatically labeling prosody easily produces inaccurate labels; moreover, prosodic feature information obtained from speech features alone ignores the pause constraints imposed by the text, so the prosodic pauses of speech synthesized by the speech synthesis system may sound untrue and unnatural. In the second approach, a general prosody prediction model performs prosody prediction on the recording text, and the predicted prosodic pauses are used directly as the prosodic pauses of the corpus speech. This approach takes the distribution of prosody over the text into account, but the timbre of speech synthesized from such a corpus is rather uniform; for texts with obvious rhythm variation, such as storytelling, there is a large gap between the synthesized speech and a human reading, the speech the user hears is not fluent enough, and the user experience is unsatisfactory.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art to some extent. To this end, a first object of the present invention is to propose a prosody labeling method for voice data. The method labels the prosodic pauses of the voice data to be labeled accurately, so that the labeled prosody is more reasonable and accurate and the synthesized speech is smoother and more fluent.
A second object of the present invention is to propose a prosody labeling apparatus for voice data.
To achieve the above objects, the voice data labeling method of the embodiment of the first aspect of the present invention comprises: acquiring the text of the voice data to be labeled, and extracting first text feature information and second text feature information of the text; extracting acoustic feature information of the voice data to be labeled; generating a candidate prosodic feature information set of the text according to the first text feature information and a prosody prediction model, wherein the candidate prosodic feature information set comprises N candidate prosodic feature information items, N being a positive integer greater than 1; generating N candidate acoustic feature information items of the text based on the N candidate prosodic feature information items, the second text feature information and an acoustic prediction model, wherein the N candidate acoustic feature information items correspond to the N candidate prosodic feature information items; calculating a correlation value between each candidate acoustic feature information item and the extracted acoustic feature information respectively; determining the maximum correlation value from the calculation results, and taking the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as the target prosodic feature information of the voice data to be labeled; and labeling the prosodic features of the voice data to be labeled according to the target prosodic feature information.
With the prosody labeling method of the embodiment of the present invention, the first and second text feature information of the text of the voice data to be labeled are extracted first, together with the acoustic feature information of the voice data. A candidate prosodic feature information set containing N candidate prosodic feature information items is then generated for the text according to the first text feature information and a prosody prediction model, and N candidate acoustic feature information items are generated based on the N candidate prosodic feature information items, the second text feature information and an acoustic prediction model. The correlation between each candidate acoustic feature information item and the extracted acoustic feature information is calculated, the maximum correlation value is determined, the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value is taken as the target prosodic feature information of the voice data to be labeled, and the prosodic features of the voice data are labeled accordingly. The prosodic pauses of the labeled voice data are thus annotated accurately, the labeled prosody is more reasonable and accurate, and the synthesized speech is smoother and more fluent.
To achieve the above objects, the prosody labeling apparatus for voice data of the embodiment of the second aspect of the present invention comprises: an acquisition module for acquiring the text of the voice data to be labeled; an extraction module for extracting the first text feature information and the second text feature information of the text and extracting the acoustic feature information of the voice data to be labeled; a first generation module for generating a candidate prosodic feature information set of the text according to the first text feature information and a prosody prediction model, wherein the candidate prosodic feature information set comprises N candidate prosodic feature information items, N being a positive integer greater than 1; a second generation module for generating N candidate acoustic feature information items of the text based on the N candidate prosodic feature information items, the second text feature information and an acoustic prediction model, wherein the N candidate acoustic feature information items correspond to the N candidate prosodic feature information items; a computing module for calculating the correlation value between each candidate acoustic feature information item and the acoustic feature information respectively; a determination module for determining the maximum correlation value from the calculation results and taking the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as the target prosodic feature information of the voice data to be labeled; and a labeling module for labeling the prosodic features of the voice data to be labeled according to the target prosodic feature information.
With the prosody labeling apparatus of the embodiment of the present invention, the acquisition module acquires the text of the voice data to be labeled; the extraction module extracts the first and second text feature information of the text and the acoustic feature information of the voice data; the first generation module generates a candidate prosodic feature information set containing N candidate prosodic feature information items according to the first text feature information and the prosody prediction model; the second generation module generates N candidate acoustic feature information items based on the N candidate prosodic feature information items, the second text feature information and the acoustic prediction model; the computing module calculates the correlation between each candidate acoustic feature information item and the extracted acoustic feature information; the determination module determines the maximum correlation value and takes the corresponding candidate prosodic feature information as the target prosodic feature information of the voice data to be labeled; and the labeling module labels the prosodic features of the voice data accordingly. The prosodic pauses of the labeled voice data are thus annotated accurately, the labeled prosody is more reasonable and accurate, and the synthesized speech is smoother and more fluent.
Brief description of the drawings
Fig. 1 is a flow chart of a prosody labeling method for voice data according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of establishing the prosody prediction model.
Fig. 3 is a schematic diagram of establishing the acoustic prediction model.
Fig. 4 is a schematic diagram of the prosody labeling process for voice data according to an embodiment of the present invention.
Fig. 5 is a structural schematic diagram of a prosody labeling apparatus for voice data according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numbers denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The prosody labeling method and apparatus for voice data of the embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a flow chart of the prosody labeling method for voice data according to an embodiment of the present invention.
As shown in Fig. 1, the prosody labeling method for voice data comprises the following steps:
S101: acquire the text of the voice data to be labeled, and extract first text feature information and second text feature information of the text.
The first text feature information may include content such as word length, part of speech and the word itself (i.e. the lexical entry), and the second text feature information may include, but is not limited to, the initial and final of each syllable and its tone.
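As a concrete illustration (not part of the patent text), the two groups of text features for a short Chinese sentence might be organized as follows; the field names and values below are hypothetical and only show the kind of information each group carries.

```python
# Hypothetical example of the two text feature groups for one sentence.
# Field names are illustrative, not defined by the patent.
sentence = "今天天气很好"

# First text feature information: word-level features (word length,
# part of speech, the word itself), used by the prosody prediction model.
first_text_features = [
    {"word": "今天", "length": 2, "pos": "t"},   # time noun
    {"word": "天气", "length": 2, "pos": "n"},   # noun
    {"word": "很",   "length": 1, "pos": "d"},   # adverb
    {"word": "好",   "length": 1, "pos": "a"},   # adjective
]

# Second text feature information: syllable-level features (initial,
# final, tone), used by the acoustic prediction model.
second_text_features = [
    {"syllable": "jin1",  "initial": "j", "final": "in",  "tone": 1},
    {"syllable": "tian1", "initial": "t", "final": "ian", "tone": 1},
    {"syllable": "tian1", "initial": "t", "final": "ian", "tone": 1},
    {"syllable": "qi4",   "initial": "q", "final": "i",   "tone": 4},
    {"syllable": "hen3",  "initial": "h", "final": "en",  "tone": 3},
    {"syllable": "hao3",  "initial": "h", "final": "ao",  "tone": 3},
]
```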
S102: extract the acoustic feature information of the voice data to be labeled.
The acoustic feature information may include, but is not limited to, acoustic features such as duration and fundamental frequency.
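A minimal sketch of how such duration and fundamental-frequency features could be extracted is shown below; the patent does not prescribe any particular toolkit, so librosa and the file name are purely illustrative assumptions.

```python
# Illustrative extraction of duration and F0 features with librosa;
# the patent itself does not mandate a specific tool.
import librosa
import numpy as np

y, sr = librosa.load("utterance_to_label.wav", sr=None)   # hypothetical audio file

duration = librosa.get_duration(y=y, sr=sr)                # total duration in seconds

# Frame-level F0 contour; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

acoustic_features = {
    "duration": duration,
    "f0_mean": float(np.nanmean(f0)),
    "f0_contour": f0,
}
```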
S103: generate a candidate prosodic feature information set of the text according to the first text feature information and a prosody prediction model.
The candidate prosodic feature information set comprises N candidate prosodic feature information items, where N is a positive integer greater than 1; for example, N is 5.
Specifically, the first text feature information is input into the prosody prediction model, and the prosody prediction model performs prosody prediction on the text to generate the candidate prosodic feature information set of the text.
The candidate prosodic feature information may include a prosodic pause level. Specifically, pauses can be divided into four levels: level-one, level-two, level-three and level-four pauses, where a higher pause level indicates a longer required pause. A level-one pause may be denoted #0 and indicates no pause; a level-two pause may be denoted #1 and indicates a small pause (corresponding to a prosodic word); a level-three pause may be denoted #2 and indicates a large pause (corresponding to a prosodic phrase); a level-four pause may be denoted #3 and indicates an extra-large pause (corresponding to an intonation phrase).
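For illustration only, the four pause levels described above could be represented as follows; the symbols #0-#3 follow the description, while the enum itself and the example candidate sequence are not part of the patent.

```python
from enum import IntEnum

class PauseLevel(IntEnum):
    """Prosodic pause levels as described above; a higher value means a longer pause."""
    NO_PAUSE = 0           # #0: no pause
    PROSODIC_WORD = 1      # #1: small pause (prosodic word boundary)
    PROSODIC_PHRASE = 2    # #2: large pause (prosodic phrase boundary)
    INTONATION_PHRASE = 3  # #3: extra-large pause (intonation phrase boundary)

# One candidate prosodic feature item could then be a pause level after each word:
candidate = [PauseLevel.NO_PAUSE, PauseLevel.PROSODIC_WORD,
             PauseLevel.NO_PAUSE, PauseLevel.INTONATION_PHRASE]
```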
It can be understood that the N candidate prosodic feature information items generated for the text are different from one another.
It should be noted that the prosody prediction model is trained in advance. Specifically, as shown in Fig. 2, the prosody prediction model can be established by using a CRF (conditional random field) algorithm to train on the text feature information of a large number of texts and their corresponding prosody labeling data; that is, the prosody prediction model is established based on the mapping relationship between text feature information and prosody labeling data. In other words, after text feature information is input into the prosody prediction model, the model can output the prosodic feature information corresponding to that text feature information.
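A minimal sketch of training such a CRF-based prosody prediction model with the sklearn-crfsuite package is given below. The toy corpus, the feature functions, and the way N candidate sequences are drawn (sampling from per-token marginals rather than true N-best decoding) are simplifying assumptions for illustration, not the patent's prescribed implementation.

```python
# Sketch: CRF prosody prediction model mapping word-level text features to
# pause labels ("#0".."#3"). Training data and feature design are illustrative.
import random
import sklearn_crfsuite

# Tiny toy corpus: each sentence is a list of word dicts, with a pause label
# after every word. A real system would train on a large labeled corpus.
train_sentences = [
    [{"word": "今天", "length": 2, "pos": "t"},
     {"word": "天气", "length": 2, "pos": "n"},
     {"word": "很",   "length": 1, "pos": "d"},
     {"word": "好",   "length": 1, "pos": "a"}],
    [{"word": "我们", "length": 2, "pos": "r"},
     {"word": "去",   "length": 1, "pos": "v"},
     {"word": "公园", "length": 2, "pos": "n"}],
]
y_train = [["#0", "#1", "#0", "#3"],
           ["#1", "#0", "#3"]]

def word_to_features(sent, i):
    w = sent[i]
    return {
        "word": w["word"],
        "length": str(w["length"]),
        "pos": w["pos"],
        "prev_pos": sent[i - 1]["pos"] if i > 0 else "BOS",
        "next_pos": sent[i + 1]["pos"] if i < len(sent) - 1 else "EOS",
    }

X_train = [[word_to_features(s, i) for i in range(len(s))] for s in train_sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

def n_candidate_sequences(sentence_features, n=5, seed=0):
    """Draw up to n distinct pause-label sequences from the per-token marginals.
    A production system would use true N-best decoding instead of sampling."""
    rng = random.Random(seed)
    marginals = crf.predict_marginals([sentence_features])[0]   # per-token {label: prob}
    candidates = {tuple(crf.predict([sentence_features])[0])}   # 1-best first
    attempts = 0
    while len(candidates) < n and attempts < 100 * n:
        attempts += 1
        seq = tuple(rng.choices(list(m.keys()), weights=list(m.values()))[0]
                    for m in marginals)
        candidates.add(seq)
    return [list(c) for c in candidates]
```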
It can be understood that this prosody prediction model can predict N prosodic pause results for a text to be predicted, where N is greater than 1.
S104: generate N candidate acoustic feature information items of the text based on the N candidate prosodic feature information items, the second text feature information and an acoustic prediction model.
The N candidate acoustic feature information items correspond to the N candidate prosodic feature information items.
Specifically, for each candidate prosodic feature information item, the current candidate prosodic feature information and the second text feature information are input into the acoustic prediction model, and the acoustic prediction model performs acoustic prediction on the text to generate the current candidate acoustic feature information of the text.
It should be noted that the acoustic prediction model is also trained in advance. Specifically, as shown in Fig. 3, the acoustic prediction model can be established by using an HMM (Hidden Markov Model) or a deep neural network model on a large amount of accurately labeled training speech data (i.e. text feature information, prosodic feature information and acoustic feature information) to learn the mapping relationship between the text feature information, prosodic feature information and acoustic feature information of the training speech data. The input of this mapping is the second text feature information and the prosodic feature information, and its output is the acoustic feature information.
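Below is a minimal sketch of a deep-neural-network acoustic prediction model of the kind mentioned above, written with PyTorch. The input encoding (one numeric vector per syllable combining the second text features with the candidate pause level), the output targets (duration and mean F0 per syllable) and all dimensions are simplifying assumptions, not the patent's exact feature set.

```python
# Sketch: DNN acoustic prediction model. Input per syllable: encoded second
# text features (initial, final, tone) plus the candidate pause level; output:
# [duration, mean F0]. Feature encoding and sizes are illustrative.
import torch
import torch.nn as nn

IN_DIM, HID_DIM, OUT_DIM = 8, 64, 2   # assumed sizes

model = nn.Sequential(
    nn.Linear(IN_DIM, HID_DIM),
    nn.ReLU(),
    nn.Linear(HID_DIM, HID_DIM),
    nn.ReLU(),
    nn.Linear(HID_DIM, OUT_DIM),      # predicts [duration, mean_f0]
)

def train(model, features, targets, epochs=100, lr=1e-3):
    """features: (num_syllables, IN_DIM) tensor built from the second text
    features and the prosodic pause level; targets: (num_syllables, 2) tensor
    of [duration, mean_f0] taken from accurately labeled training speech."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        opt.step()
    return model

# Toy data with the assumed dimensions, just to show the call:
features = torch.randn(32, IN_DIM)
targets = torch.randn(32, OUT_DIM)
train(model, features, targets)
```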
S105: calculate the correlation value between each candidate acoustic feature information item and the extracted acoustic feature information respectively.
Specifically, different prosodic pause levels cause changes in acoustic features such as duration and fundamental frequency, and the prosodic pause levels corresponding to the candidate acoustic feature information with the smallest distance to the acoustic features extracted from the voice to be labeled are closest to the true prosodic pauses. The correlation between each of the N candidate acoustic feature information items and the acoustic feature information of the voice data to be labeled can therefore be determined by measuring the distance between acoustic feature information; it can be understood that the smaller the distance, the larger the correlation.
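The selection in steps S105 and S106 can be illustrated with a short numpy sketch. Here the correlation is taken as the negative Euclidean distance between feature vectors, which is one possible realization of "smaller distance means larger correlation", not necessarily the exact measure used in the patent; the flat per-utterance feature vectors are also an assumption.

```python
# Sketch: pick the candidate whose predicted acoustic features are closest to
# the features extracted from the speech to be labeled.
import numpy as np

def select_target_prosody(candidate_acoustic, extracted_acoustic, candidate_prosody):
    """candidate_acoustic: N vectors predicted by the acoustic model;
    extracted_acoustic: vector extracted from the voice data to be labeled;
    candidate_prosody: the N candidate prosodic feature sequences."""
    distances = [np.linalg.norm(np.asarray(c) - np.asarray(extracted_acoustic))
                 for c in candidate_acoustic]
    correlations = [-d for d in distances]      # smaller distance -> larger correlation
    best = int(np.argmax(correlations))         # index of the maximum correlation value
    return candidate_prosody[best], correlations[best]

# Toy example with N = 3 candidates (values are made up):
cand_acoustic = [[0.20, 230.0], [0.35, 210.0], [0.28, 225.0]]
extracted = [0.27, 224.0]
cand_prosody = [["#0", "#1", "#0", "#3"],
                ["#1", "#1", "#0", "#3"],
                ["#0", "#2", "#0", "#3"]]
target, score = select_target_prosody(cand_acoustic, extracted, cand_prosody)
```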
S106: determine the maximum correlation value from the calculation results, and take the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as the target prosodic feature information of the voice data to be labeled.
S107: label the prosodic features of the voice data to be labeled according to the target prosodic feature information.
Specifically, after the target prosodic pause levels are determined from the above calculation results, the prosodic pause levels of the voice data to be labeled can be labeled according to the target prosodic pause levels.
To facilitate understanding of the embodiments of the present invention, the prosody labeling method for voice data of the embodiment of the present invention is described below with reference to Fig. 4.
As shown in Fig. 4, after the text of the voice data to be labeled is acquired, text analysis is first performed on the text for the word itself, part of speech, initials and finals, and tones, and the acoustic feature information of the voice data to be labeled is extracted. The text analysis results for the word itself and part of speech (i.e. the first text feature information) are input into the pre-trained prosody prediction model, which generates N candidate prosodic feature information items. The N candidate prosodic feature information items, together with the text analysis results for initials, finals and tones (i.e. the second text feature information), are then input into the pre-trained acoustic prediction model, which generates the N candidate acoustic feature information items corresponding to the N candidate prosodic feature information items. The distance between each candidate acoustic feature information item and the extracted acoustic feature information is calculated to obtain the correlation between them, and the maximum correlation value is determined from these; the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value is taken as the target prosodic feature information of the voice data to be labeled, and the prosodic features of the voice data to be labeled are then labeled accordingly.
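Putting the Fig. 4 flow together, a hypothetical top-level labeling routine could look like the sketch below; every helper it calls (text analysis, the two pre-trained models, feature extraction, distance and annotation) is a placeholder standing in for the components described above, not an API defined by the patent.

```python
# Sketch of the overall Fig. 4 flow. All helper functions are hypothetical
# placeholders for the components described in this embodiment.
def label_prosody(wav_path, text, prosody_model, acoustic_model, n=5):
    first_feats, second_feats = analyze_text(text)          # word face/POS vs. initials/finals/tones
    extracted = extract_acoustic_features(wav_path)          # duration, F0, ... of the speech

    candidates = prosody_model.predict_n(first_feats, n)     # N candidate pause sequences
    predicted = [acoustic_model.predict(second_feats, c)     # N candidate acoustic feature sets
                 for c in candidates]

    correlations = [-distance(p, extracted) for p in predicted]
    best = max(range(len(candidates)), key=lambda i: correlations[i])
    target = candidates[best]                                 # target prosodic feature information

    return annotate(text, target)                             # write pause labels (#0-#3) onto the text
```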
In summary, in the process of labeling the prosodic feature information (for example, prosodic pause levels) of the voice data to be labeled, this embodiment not only analyzes the text feature information of the text of the voice data to be labeled, but also compares the predicted acoustic feature information with the acoustic feature information of the voice data to be labeled, so as to determine the prosodic pauses closest to the true ones and label the voice data to be labeled with them. The prosodic pauses of the labeled voice data are thus annotated accurately, and the synthesized speech is smoother and more fluent.
With the prosody labeling method of the embodiment of the present invention, as described above, candidate prosodic feature information is generated from the first text feature information and the prosody prediction model, candidate acoustic feature information is generated from the candidates, the second text feature information and the acoustic prediction model, and the candidate whose acoustic features correlate best with the extracted acoustic features is selected as the target prosodic feature information used to label the voice data. The prosodic pauses of the labeled voice data are thus annotated accurately, the labeled prosody is more reasonable and accurate, and the synthesized speech is smoother and more fluent.
In order to implement the above embodiments, the present invention also proposes a prosody labeling apparatus for voice data.
Fig. 5 is a structural schematic diagram of the prosody labeling apparatus for voice data according to an embodiment of the present invention.
As shown in Fig. 5, the prosody labeling apparatus for voice data comprises an acquisition module 100, a first extraction module 200, a second extraction module 300, a first generation module 400, a second generation module 500, a computing module 600, a determination module 700 and a labeling module 800, wherein:
The acquisition module 100 is configured to acquire the text of the voice data to be labeled.
The first extraction module 200 is configured to extract the first text feature information and the second text feature information of the text.
The first text feature information may include content such as word length, part of speech and the word itself (i.e. the lexical entry), and the second text feature information may include, but is not limited to, initials, finals and tones.
The second extraction module 300 is configured to extract the acoustic feature information of the voice data to be labeled.
The acoustic feature information may include, but is not limited to, acoustic features such as duration and fundamental frequency.
The first generation module 400 is configured to generate the candidate prosodic feature information set of the text according to the first text feature information and the prosody prediction model.
The candidate prosodic feature information set comprises N candidate prosodic feature information items, where N is a positive integer greater than 1; for example, N is 5.
Specifically, the first generation module 400 may input the first text feature information into the prosody prediction model, and the prosody prediction model performs prosody prediction on the text to generate the candidate prosodic feature information set of the text.
The prosodic feature information may include a prosodic pause level. Specifically, pauses can be divided into four levels: level-one, level-two, level-three and level-four pauses, where a higher pause level indicates a longer required pause. A level-one pause may be denoted #0 and indicates no pause; a level-two pause may be denoted #1 and indicates a small pause (corresponding to a prosodic word); a level-three pause may be denoted #2 and indicates a large pause (corresponding to a prosodic phrase); a level-four pause may be denoted #3 and indicates an extra-large pause (corresponding to an intonation phrase).
It can be understood that the N candidate prosodic feature information items generated for the text are different from one another.
The second generation module 500 is configured to generate the N candidate acoustic feature information items of the text based on the N candidate prosodic feature information items, the second text feature information and the acoustic prediction model.
The N candidate acoustic feature information items correspond to the N candidate prosodic feature information items.
Specifically, for each candidate prosodic feature information item, the second generation module 500 may input the current candidate prosodic feature information and the second text feature information into the acoustic prediction model, and the acoustic prediction model performs acoustic prediction on the text to generate the current candidate acoustic feature information of the text.
The computing module 600 is configured to calculate the correlation value between each candidate acoustic feature information item and the acoustic feature information respectively.
Usually, different prosodic pause levels cause changes in acoustic features such as duration and fundamental frequency, and the prosodic pause levels corresponding to the candidate acoustic feature information with the smallest distance to the acoustic features extracted from the voice to be labeled are closest to the true prosodic pauses. The computing module 600 can therefore determine the correlation between each of the N candidate acoustic feature information items and the acoustic feature information of the voice data to be labeled by measuring the distance between acoustic feature information; it can be understood that the smaller the distance, the larger the correlation.
The determination module 700 is configured to determine the maximum correlation value from the calculation results, and to take the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as the target prosodic feature information of the voice data to be labeled.
It should be understood that both the candidate prosodic feature information and the target prosodic feature information may include, but are not limited to, a prosodic pause level.
The labeling module 800 is configured to label the prosodic features of the voice data to be labeled according to the target prosodic feature information.
Specifically, after the determination module 700 determines the target prosodic pause levels from the above calculation results, the labeling module 800 can label the prosodic pause levels of the voice data to be labeled according to the target prosodic pause levels.
It should be noted that the foregoing explanation of the embodiment of the prosody labeling method for voice data also applies to the prosody labeling apparatus for voice data of this embodiment and is not repeated here.
With the prosody labeling apparatus of the embodiment of the present invention, the acquisition module acquires the text of the voice data to be labeled; the first extraction module extracts the first and second text feature information of the text; the second extraction module extracts the acoustic feature information of the voice data to be labeled; the first generation module generates a candidate prosodic feature information set containing N candidate prosodic feature information items according to the first text feature information and the prosody prediction model; the second generation module generates N candidate acoustic feature information items based on the N candidate prosodic feature information items, the second text feature information and the acoustic prediction model; the computing module calculates the correlation between each candidate acoustic feature information item and the acoustic feature information; the determination module determines the maximum correlation value and takes the corresponding candidate prosodic feature information as the target prosodic feature information of the voice data to be labeled; and the labeling module labels the prosodic features of the voice data accordingly. The prosodic pauses of the labeled voice data are thus annotated accurately, the labeled prosody is more reasonable and accurate, and the synthesized speech is smoother and more fluent.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in conjunction with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, without contradiction, those skilled in the art may combine and group the features of the different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, such as two or three, unless otherwise specifically defined.
Any process or method description in the flow charts or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flow charts or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport the program for use by or in connection with the instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it if necessary, and then stored in a computer memory.
It should be understood that parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented with software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following technologies known in the art: a discrete logic circuit with logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those skilled in the art can understand that all or part of the steps carried by the method of the above embodiments can be implemented by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and when executed, the program performs one or a combination of the steps of the method embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims (8)
1. A prosody labeling method for voice data, characterized by comprising the following steps:
acquiring the text of the voice data to be labeled, and extracting first text feature information and second text feature information of said text;
extracting acoustic feature information of said voice data to be labeled;
generating a candidate prosodic feature information set of said text according to said first text feature information and a prosody prediction model, wherein said candidate prosodic feature information set comprises N candidate prosodic feature information items, N being a positive integer greater than 1;
generating N candidate acoustic feature information items of said text based on said N candidate prosodic feature information items, said second text feature information and an acoustic prediction model, wherein said N candidate acoustic feature information items correspond to said N candidate prosodic feature information items;
calculating a correlation value between each candidate acoustic feature information item and said acoustic feature information respectively;
determining a maximum correlation value from the calculation results, and taking the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as target prosodic feature information of said voice data to be labeled; and
labeling the prosodic features of said voice data to be labeled according to said target prosodic feature information.
2. The prosody labeling method for voice data according to claim 1, characterized in that generating the N candidate acoustic feature information items of said text based on said N candidate prosodic feature information items, said second text feature information and the acoustic prediction model comprises:
for each candidate prosodic feature information item, inputting the current candidate prosodic feature information and said second text feature information into said acoustic prediction model, and performing acoustic prediction on said text by said acoustic prediction model to generate the current candidate acoustic feature information of said text.
3. The prosody labeling method for voice data according to claim 1, characterized in that generating the candidate prosodic feature information set of said text according to said first text feature information and the prosody prediction model comprises:
inputting said first text feature information into said prosody prediction model, and performing prosody prediction on said text by said prosody prediction model to generate the candidate prosodic feature information set of said text.
4. The prosody labeling method for voice data according to any one of claims 1-3, characterized in that said first text feature information comprises part of speech and the word itself, said second text feature information comprises initials, finals and tones, and said target prosodic feature information comprises a prosodic pause level.
5. A prosody labeling apparatus for voice data, characterized by comprising:
an acquisition module for acquiring the text of the voice data to be labeled;
a first extraction module for extracting first text feature information and second text feature information of said text;
a second extraction module for extracting acoustic feature information of said voice data to be labeled;
a first generation module for generating a candidate prosodic feature information set of said text according to said first text feature information and a prosody prediction model, wherein said candidate prosodic feature information set comprises N candidate prosodic feature information items, N being a positive integer greater than 1;
a second generation module for generating N candidate acoustic feature information items of said text based on said N candidate prosodic feature information items, said second text feature information and an acoustic prediction model, wherein said N candidate acoustic feature information items correspond to said N candidate prosodic feature information items;
a computing module for calculating a correlation value between each candidate acoustic feature information item and said acoustic feature information respectively;
a determination module for determining a maximum correlation value from the calculation results, and taking the candidate prosodic feature information corresponding to the candidate acoustic feature with the maximum correlation value as target prosodic feature information of said voice data to be labeled; and
a labeling module for labeling the prosodic features of said voice data to be labeled according to said target prosodic feature information.
6. The prosody labeling apparatus for voice data according to claim 5, characterized in that said second generation module is specifically configured to:
for each candidate prosodic feature information item, input the current candidate prosodic feature information and said second text feature information into said acoustic prediction model, and perform acoustic prediction on said text by said acoustic prediction model to generate the current candidate acoustic feature information of said text.
7. The prosody labeling apparatus for voice data according to claim 5, characterized in that said first generation module is specifically configured to:
input said first text feature information into said prosody prediction model, and perform prosody prediction on said text by said prosody prediction model to generate the candidate prosodic feature information set of said text.
8. The prosody labeling apparatus for voice data according to any one of claims 5-7, characterized in that said first text feature information comprises part of speech and the word itself, said second text feature information comprises initials, finals and tones, and said target prosodic feature information comprises a prosodic pause level.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510967511.5A (CN105551481B) | 2015-12-21 | 2015-12-21 | The prosodic labeling method and device of voice data |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN105551481A | 2016-05-04 |
| CN105551481B | 2019-05-31 |
Family

- ID: 55830631

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510967511.5A (CN105551481B, active) | The prosodic labeling method and device of voice data | 2015-12-21 | 2015-12-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN105551481B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005031259A (en) * | 2003-07-09 | 2005-02-03 | Canon Inc | Natural language processing method |
CN1924994A (en) * | 2005-08-31 | 2007-03-07 | 中国科学院自动化研究所 | Embedded language synthetic method and system |
US20070094030A1 (en) * | 2005-10-20 | 2007-04-26 | Kabushiki Kaisha Toshiba | Prosodic control rule generation method and apparatus, and speech synthesis method and apparatus |
CN1929655A (en) * | 2006-09-28 | 2007-03-14 | 中山大学 | Mobile phone capable of realizing text and voice conversion |
CN103680491A (en) * | 2012-09-10 | 2014-03-26 | 财团法人交大思源基金会 | Speed dependent prosodic message generating device and speed dependent hierarchical prosodic module |
CN104916284A (en) * | 2015-06-10 | 2015-09-16 | 百度在线网络技术(北京)有限公司 | Prosody and acoustics joint modeling method and device for voice synthesis system |
CN104867491A (en) * | 2015-06-17 | 2015-08-26 | 百度在线网络技术(北京)有限公司 | Training method and device for prosody model used for speech synthesis |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN106652995A (en) * | 2016-12-31 | 2017-05-10 | 深圳市优必选科技有限公司 | Text voice broadcast method and system |
WO2018121757A1 (en) * | 2016-12-31 | 2018-07-05 | 深圳市优必选科技有限公司 | Method and system for speech broadcast of text |
CN106873798A (en) * | 2017-02-16 | 2017-06-20 | 北京百度网讯科技有限公司 | For the method and apparatus of output information |
CN106920547A (en) * | 2017-02-21 | 2017-07-04 | 腾讯科技(上海)有限公司 | Phonetics transfer method and device |
CN108172211A (en) * | 2017-12-28 | 2018-06-15 | 云知声(上海)智能科技有限公司 | Adjustable waveform concatenation system and method |
CN109002433A (en) * | 2018-05-30 | 2018-12-14 | 出门问问信息科技有限公司 | A kind of document creation method and device |
CN109002433B (en) * | 2018-05-30 | 2022-04-01 | 出门问问信息科技有限公司 | Text generation method and device |
CN109817205B (en) * | 2018-12-10 | 2024-03-22 | 平安科技(深圳)有限公司 | Text confirmation method and device based on semantic analysis and terminal equipment |
CN109817205A (en) * | 2018-12-10 | 2019-05-28 | 平安科技(深圳)有限公司 | Text confirmation method, device and terminal device based on semanteme parsing |
CN110444191A (en) * | 2019-01-22 | 2019-11-12 | 清华大学深圳研究生院 | A kind of method, the method and device of model training of prosody hierarchy mark |
CN110444191B (en) * | 2019-01-22 | 2021-11-26 | 清华大学深圳研究生院 | Rhythm level labeling method, model training method and device |
US11393447B2 (en) * | 2019-06-18 | 2022-07-19 | Lg Electronics Inc. | Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium |
US11398219B2 (en) * | 2019-09-16 | 2022-07-26 | Lg Electronics Inc. | Speech synthesizer using artificial intelligence and method of operating the same |
CN110556093A (en) * | 2019-09-17 | 2019-12-10 | 浙江核新同花顺网络信息股份有限公司 | Voice marking method and system |
CN110767213A (en) * | 2019-11-08 | 2020-02-07 | 四川长虹电器股份有限公司 | Rhythm prediction method and device |
CN110853613B (en) * | 2019-11-15 | 2022-04-26 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for correcting prosody pause level prediction |
CN110853613A (en) * | 2019-11-15 | 2020-02-28 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for correcting prosody pause level prediction |
CN113823256A (en) * | 2020-06-19 | 2021-12-21 | 微软技术许可有限责任公司 | Self-generated text-to-speech (TTS) synthesis |
CN112382270A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN113808579B (en) * | 2021-11-22 | 2022-03-08 | 中国科学院自动化研究所 | Detection method, device, electronic device and storage medium for generating speech |
CN113808579A (en) * | 2021-11-22 | 2021-12-17 | 中国科学院自动化研究所 | Detection method, device, electronic device and storage medium for generating speech |
CN115116427A (en) * | 2022-06-22 | 2022-09-27 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and device |
CN115116427B (en) * | 2022-06-22 | 2023-11-14 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and training device |
Also Published As

| Publication number | Publication date |
|---|---|
| CN105551481B | 2019-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105551481A (en) | Rhythm marking method of voice data and apparatus thereof | |
CN106601228B (en) | Sample labeling method and device based on artificial intelligence rhythm prediction | |
CN104916284A (en) | Prosody and acoustics joint modeling method and device for voice synthesis system | |
US8484035B2 (en) | Modification of voice waveforms to change social signaling | |
CN1758330B (en) | Method and apparatus for preventing speech comprehension by interactive voice response systems | |
CN104934028A (en) | Depth neural network model training method and device used for speech synthesis | |
CN105336322A (en) | Polyphone model training method, and speech synthesis method and device | |
CN104464751B (en) | The detection method and device for rhythm problem of pronouncing | |
Shaw et al. | Stochastic time models of syllable structure | |
CN105185372A (en) | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
CN102231278A (en) | Method and system for realizing automatic addition of punctuation marks in speech recognition | |
CN101369423A (en) | Voice synthesizing method and device | |
US9508338B1 (en) | Inserting breath sounds into text-to-speech output | |
CN106057192A (en) | Real-time voice conversion method and apparatus | |
CN103165126A (en) | Method for voice playing of mobile phone text short messages | |
CN112289300B (en) | Audio processing method and device, electronic equipment and computer readable storage medium | |
US20230252971A1 (en) | System and method for speech processing | |
CN108922505B (en) | Information processing method and device | |
CN112750422B (en) | Singing voice synthesis method, device and equipment | |
Alwaisi et al. | Advancements in Expressive Speech Synthesis: a Review. | |
CN112164387B (en) | Audio synthesis method, device, electronic device and computer-readable storage medium | |
Monzo et al. | Voice quality modelling for expressive speech synthesis | |
JP2011141470A (en) | Phoneme information-creating device, voice synthesis system, voice synthesis method and program | |
JPWO2012032748A1 (en) | Speech synthesis apparatus, speech synthesis method, and speech synthesis program |
Legal Events

| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
| GR01 | Patent grant |