CN104464751B - Method and device for detecting pronunciation prosody problems

Info

Publication number: CN104464751B
Application number: CN201410674294.6A
Authority: CN
Inventors: 张儒瑞; 赵乾; 潘颂声; 宋碧霄; 吴玲
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2018-01-16
Anticipated expiration: 2034-11-21
Also published as: CN104464751A

Abstract

The present invention proposes a method and device for detecting pronunciation prosody problems, including: receiving speech data under test; obtaining word boundary information of the speech data under test, and extracting prosodic information of the speech data under test; generating prosodic annotation information of the speech data under test according to its word boundary information and prosodic information; and comparing the prosodic annotation information of the speech under test with the prosodic annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem. The detection method of the present invention obtains the prosodic annotation information of speech automatically for comparison, requires no manual annotation, and can be applied more flexibly and widely. In language-learning software in particular, automatically detecting the prosody of speech allows prosody problems in a user's pronunciation to be assessed more effectively. In addition, no large-capacity database is needed during detection, so the amount of computation is small and detection efficiency is improved.

Description

Method and device for detecting pronunciation prosody problems

Technical field

The present invention relates to the field of speech processing technology, and in particular to a method and device for detecting pronunciation prosody problems.

Background art

With the continuous development of speech recognition technology, speech evaluation technology plays an increasingly important role in speech recognition and its applications. Speech evaluation technology is mainly used to assess the quality of speech data. This covers not only assessing the pronunciation quality of the words in the speech data, but also detecting and assessing whether the prosody of the speech data is accurate. For example, in language learning, a user can learn a language by listening to a standard pronunciation and reading along. The user can compare whether the pronunciation and prosody of the read-along speech are consistent with those of the standard speech, and correct them according to the comparison result to keep improving. How to accurately assess and feed back the prosody problems in the user's read-along pronunciation is key to mastering a language quickly. A speech prosody problem means that incorrect prosody occurs in the speech, for example, words are not linked where they should be linked, no pause is made where there should be a pause, or a word is not stressed where it should be stressed. In addition, in some other scenarios, such as speech recognition, pronunciation prosody problems in speech also need to be detected.

The technologies currently used for prosody problem detection are mainly the manual annotation method and the prosodic constraint method.

In the manual annotation method, the correct prosody of the speech must be annotated manually in the text corresponding to the speech. Then, according to the position information corresponding to the manually annotated prosody, the prosody-related acoustic features at the corresponding positions in the speech are extracted, and whether the speech has a prosody problem is detected. For example, for a word annotated as stressed, acoustic features such as the energy and fundamental frequency of the speech of that word are extracted, and whether the word is stressed is determined by, for example, judging whether these acoustic features exceed a certain threshold.

The prosodic constraint method performs prosody assessment on input speech data according to prosodic constraints. The prosodic constraints are obtained as follows: the linguistic or syntactic structure of the input speech data is matched against the standard structure of the standard speech in a standard corpus, and the prosodic boundary positions that the input speech should have are derived from the prosodic boundary positions of standard speech with a similar structure. Where the standard corpus contains many standard utterances whose structure is similar to that of the input speech, which prosodic boundary the input speech data should use can be determined from the statistical probability of the prosodic boundaries.

Both existing prosody assessment techniques require the word boundaries and prosodic boundaries of speech to be annotated manually; without manual annotation, the prosody of a user's pronunciation cannot be assessed. In addition, the prosodic constraint method needs a large-capacity standard corpus. On the one hand, this occupies a large amount of storage space. On the other hand, the standard speech in the standard corpus must be given correct prosodic annotations manually, and determining the prosodic constraints requires querying the entire standard corpus and computing the statistical probabilities of the prosodic boundaries before the constraints can be determined, which involves a very large amount of computation.

Summary of the invention

The present invention aims to solve the above technical problems at least to some extent.

To this end, a first object of the present invention is to propose a method for detecting pronunciation prosody problems that requires no manual annotation, can be applied more flexibly and widely, can assess prosody problems in a user's pronunciation more effectively, and improves detection efficiency.

A second object of the present invention is to propose a device for detecting pronunciation prosody problems.

To achieve the above objects, an embodiment according to a first aspect of the present invention proposes a method for detecting pronunciation prosody problems, including: receiving speech data under test; obtaining word boundary information of the speech data under test, and extracting prosodic information of the speech data under test; generating prosodic annotation information of the speech data under test according to the word boundary information and prosodic information of the speech data under test; and comparing the prosodic annotation information of the speech under test with the prosodic annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.

The method for detecting pronunciation prosody problems according to the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosodic annotation information of the speech data under test from them, and compares it with the prosodic annotation information of pre-annotated reference speech data to detect pronunciation prosody problems. The prosodic annotation information of speech is obtained automatically for comparison, so no manual annotation is needed and the method can be applied more flexibly and widely. In language-learning software in particular, automatically detecting the prosody of speech allows prosody problems in a user's pronunciation to be assessed more effectively. In addition, no large-capacity database is needed during detection, so the amount of computation is small and detection efficiency is improved.

An embodiment according to a second aspect of the present invention provides a device for detecting pronunciation prosody problems, including: a receiving module, configured to receive speech data under test; an obtaining module, configured to obtain word boundary information of the speech data under test and to extract prosodic information of the speech data under test; a generating module, configured to generate prosodic annotation information of the speech data under test according to the word boundary information and prosodic information of the speech data under test; and a detecting module, configured to compare the prosodic annotation information of the speech under test with the prosodic annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.

The device for detecting pronunciation prosody problems according to the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosodic annotation information of the speech data under test from them, and compares it with the prosodic annotation information of pre-annotated reference speech data to detect pronunciation prosody problems. The prosodic annotation information of speech is obtained automatically for comparison, so no manual annotation is needed and the device can be applied more flexibly and widely. In language-learning software in particular, automatically detecting the prosody of speech allows prosody problems in a user's pronunciation to be assessed more effectively. In addition, no large-capacity database is needed during detection, so the amount of computation is small and detection efficiency is improved.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the drawings, in which:

Fig. 1 is a flowchart of a method for detecting pronunciation prosody problems according to one embodiment of the present invention;

Fig. 2 is a flowchart of a method for annotating reference speech data according to one embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to one embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to a specific embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to another embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the present invention and are not to be construed as limiting it.

In the description of the present invention, it should be understood that the term "multiple" means two or more, and that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

The method and device for detecting pronunciation prosody problems according to embodiments of the present invention are described below with reference to the drawings.

Fig. 1 is a flowchart of a method for detecting pronunciation prosody problems according to one embodiment of the present invention. As shown in Fig. 1, the method for detecting pronunciation prosody problems according to an embodiment of the present invention includes:

S101: receive speech data under test.

For example, the speech data under test may be read-along speech recorded by a user following standard reference speech.

S102: obtain word boundary information of the speech data under test, and extract prosodic information of the speech data under test.

Specifically, in one embodiment of the present invention, the text content corresponding to the speech data under test (for example, the text that the read-along speech follows) may first be obtained, a decoding network is built from that text content, and the decoding network and an acoustic model are then passed to a decoder. The acoustic model is the underlying mathematical model of speech recognition. Its modeling unit may be a phoneme, a syllable, or a word, and the mainstream approach at present is hidden Markov modeling. The decoder is one of the core components of a speech recognition system. Its task is to find, for the input acoustic features and according to the acoustic model and the decoding network, the linguistic unit sequence with the highest probability for those features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (such as the initials and finals of Chinese characters), syllables, or words in the text content and whose arcs are the connections between phonemes. The decoding network limits the range of linguistic unit sequences that the decoder can output.
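As a non-limiting illustration, a decoding network for a read-along text might be sketched in Python as a small directed graph. The function name `build_decoding_network`, the `<s>` and `</s>` boundary nodes, and the purely linear word-level topology are assumptions made for this sketch; a practical system would typically expand each word into phonemes and attach acoustic model states, which is not shown here.

```python
from collections import defaultdict


def build_decoding_network(text):
    """Build a word-level directed graph that admits only the given text.

    Nodes are "<s>", one "index:word" node per word, and "</s>".
    Arcs connect each node to the next, so the only path through the
    graph is the word sequence of the text.
    """
    words = text.split()
    nodes = ["<s>"] + [f"{i}:{w}" for i, w in enumerate(words)] + ["</s>"]
    arcs = defaultdict(list)
    for src, dst in zip(nodes, nodes[1:]):
        arcs[src].append(dst)
    return nodes, dict(arcs)


nodes, arcs = build_decoding_network("how are you")
```

Because the graph has a single path, a decoder constrained by it can only output the words of the read-along text, which is what makes alignment of speech to text possible.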

Then, the acoustic features of the speech data under test are extracted and passed to the decoder for decoding, so that the speech data under test is aligned with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. Here, an acoustic feature is a set of values describing the essential characteristics of short-time speech, typically a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the span from the time frame (or time instant) at which the pronunciation of a word begins in the speech under test to the time frame (or time instant) at which its pronunciation ends. From the word boundary information, the time span used to read each word in the speech data under test, and the time span between words, can thus be obtained.
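As a non-limiting illustration, the word boundary information and the two time spans derived from it might be represented as follows. The `WordBoundary` class, its field names, and the sample frame numbers are hypothetical; boundaries are stored as integer frame indices, and the alignment itself is assumed to have been produced already by a decoder.

```python
from dataclasses import dataclass


@dataclass
class WordBoundary:
    word: str
    start_frame: int  # frame at which pronunciation of the word begins
    end_frame: int    # frame at which pronunciation of the word ends


def word_durations(boundaries):
    """Number of frames spent reading each word."""
    return [(b.word, b.end_frame - b.start_frame) for b in boundaries]


def inter_word_gaps(boundaries):
    """Number of frames between the end of each word and the start of the next."""
    return [
        (a.word, b.word, max(0, b.start_frame - a.end_frame))
        for a, b in zip(boundaries, boundaries[1:])
    ]


# Made-up alignment result for the text "how are you".
alignment = [
    WordBoundary("how", 10, 35),
    WordBoundary("are", 35, 52),
    WordBoundary("you", 60, 90),
]
durations = word_durations(alignment)
gaps = inter_word_gaps(alignment)
```

In this sample, "how" and "are" are adjacent with no gap, while 8 frames separate "are" and "you".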

Finally, the prosodic information of the speech data under test can be extracted according to its word boundary information. The prosody of speech mainly includes linking, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted for detecting different kinds of prosody. For example, when judging linking, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether there is an energy dip; when judging pauses, features such as the duration of silence between words are extracted; when judging stress, features such as the energy amplitude and fundamental frequency of a word are extracted; when judging rising and falling intonation, features such as the slope of the fundamental frequency of a word are extracted. The above prosodic features of each word and between words can then be computed in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word, and the linking and pauses between words, in the speech under test can be determined according to the corresponding decision strategy.
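As a non-limiting illustration of one such feature, the fundamental frequency slope of a word might be estimated with an ordinary least-squares line fit over the voiced frames inside the word boundary. The function name `f0_slope`, the 10 ms frame shift, and the convention that unvoiced frames are marked with 0 or `None` are assumptions of this sketch; the text does not specify how the slope is computed.

```python
def f0_slope(f0_values, frame_shift=0.01):
    """Least-squares slope (Hz per second) of the voiced F0 values of a word.

    Frames whose F0 is 0 or None are treated as unvoiced and skipped.
    Returns 0.0 when fewer than two voiced frames are available.
    """
    points = [(i * frame_shift, f) for i, f in enumerate(f0_values) if f]
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    numerator = sum((t - mean_t) * (f - mean_f) for t, f in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    return numerator / denominator


rising = f0_slope([100.0, 110.0, 120.0, 130.0])
falling = f0_slope([200.0, 0, 180.0, 160.0])
```

A positive slope suggests rising intonation and a negative slope suggests falling intonation; any threshold separating the two from a level tone would have to be chosen for the data at hand.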

For example, if there is no silence between two words and the fundamental frequency is continuous, the two words can be judged to be linked. If the silence between two words exceeds a certain time threshold, such as 0.05 seconds, it can be judged that there is a pause between the two words. If the energy amplitude of one or more words exceeds a certain energy threshold, the word or words are stressed. Similarly, the rising or falling intonation of a word can be judged from the slope of its fundamental frequency.
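As a non-limiting illustration, the decision rules above might be written as simple threshold checks. The 0.05 second pause threshold is the example value given in the text; the energy threshold of 0.6 (on an assumed normalized scale), the function names, and the label strings are hypothetical.

```python
PAUSE_THRESHOLD_S = 0.05  # example value from the text
ENERGY_THRESHOLD = 0.6    # hypothetical; depends on how energy is normalized


def classify_gap(silence_s, f0_continuous):
    """Label the transition between two adjacent words.

    Returns "pause" if the silence exceeds the pause threshold, "linking"
    if there is no silence and the F0 contour is continuous, else None.
    """
    if silence_s > PAUSE_THRESHOLD_S:
        return "pause"
    if silence_s <= 0.0 and f0_continuous:
        return "linking"
    return None


def is_stressed(energy):
    """A word counts as stressed when its energy exceeds the threshold."""
    return energy > ENERGY_THRESHOLD


labels = [
    classify_gap(0.0, True),
    classify_gap(0.12, False),
    classify_gap(0.02, False),
]
```

A short silence below the pause threshold yields neither label, so such a transition simply carries no prosodic mark.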

S103: generate prosodic annotation information of the speech data under test according to its word boundary information and prosodic information.

The prosodic annotation information includes at least one piece of prosodic information and position information corresponding to each piece of prosodic information, where each piece of position information is determined according to the corresponding prosodic boundary information. Prosodic annotation information marks the position information of the correct prosody in the text corresponding to the speech, that is, it marks between which two words in the text there is linking or a pause, or which words are stressed. Prosodic annotation is an important basis for prosody assessment.

In one embodiment of the present invention, generating the prosodic annotation information of the speech data under test according to its word boundary information and prosodic information may specifically include: determining prosodic boundary information of the speech data under test according to its word boundary information and prosodic information; and annotating the prosodic information of the speech data under test according to its prosodic boundary information, to generate the prosodic annotation information of the speech data under test.

The prosodic boundary information can be determined from the word boundary information and the prosodic information corresponding to the words, and the position information of each piece of prosodic information can be further determined; annotation is then performed according to the position information of the prosodic information. For example, if words A and B are linked, the start and end time frames of this linking prosody are the time frame (or time instant) at which the pronunciation of word A begins and the time frame (or time instant) at which the pronunciation of word B ends, and the position information corresponding to this linking prosody can be determined to be the positions of word A and word B in the text. The corresponding prosodic information can then be marked at the corresponding position according to the position information of each prosody.
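As a non-limiting illustration, prosodic annotation information might be stored as a set of (prosody, position) pairs, where the position is a tuple of word indices in the text. The function name `build_annotation`, the label strings, and the index-based position encoding are assumptions of this sketch; the text only requires that each piece of prosodic information carry its position.

```python
def build_annotation(words, gap_labels, stressed_indices):
    """Return prosodic annotation as a set of (prosody, position) pairs.

    gap_labels[i] labels the transition between words[i] and words[i + 1]
    ("linking", "pause", or None); stressed_indices lists stressed words.
    """
    if len(gap_labels) != max(len(words) - 1, 0):
        raise ValueError("expected one gap label per adjacent word pair")
    annotation = set()
    for i, label in enumerate(gap_labels):
        if label is not None:
            # Prosody between two words is positioned at both words.
            annotation.add((label, (i, i + 1)))
    for i in stressed_indices:
        if not 0 <= i < len(words):
            raise ValueError(f"stressed index {i} is out of range")
        annotation.add(("stress", (i,)))
    return annotation


words = ["not", "at", "all"]
annotation = build_annotation(words, ["linking", "linking"], [2])
```

Storing the annotation as a set makes the later comparison against the reference annotation a matter of plain set operations.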

S104: compare and analyze the prosodic annotation information of the speech under test against the prosodic annotation information of the pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.

Here, the reference speech is the standard speech that the speech under test follows.

In an embodiment of the present invention, specifically, it can be judged whether the prosodic annotation information of the speech under test and the prosodic annotation information of the pre-annotated reference speech data satisfy the following conditions:

The prosodic annotation information of the speech data under test marks all of the prosodic information marked in the prosodic annotation information of the reference speech data, and the position information corresponding to the marked prosodic information is consistent; and the prosodic information marked in the prosodic annotation information of the speech data under test does not include any prosodic information that is not marked in the prosodic annotation information of the reference speech data.

If the conditions are not satisfied, the speech data under test is judged to have a pronunciation prosody problem.

That is, the speech data under test is judged to have no pronunciation prosody problem only when it includes all of the prosody of the reference speech data (with the same corresponding prosodic boundary information) and does not include any prosody that the reference speech data lacks. Otherwise, the speech data under test has a pronunciation prosody problem.
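As a non-limiting illustration, the two conditions can be checked with set inclusion when each annotation is held as a set of (prosody, position) pairs. That representation and the function name `has_prosody_problem` are assumptions of this sketch.

```python
def has_prosody_problem(test_annotation, reference_annotation):
    """Return True if the speech under test has a prosody problem.

    Condition 1: every reference mark appears in the test annotation at the
    same position. Condition 2: the test annotation has no mark that the
    reference lacks. A problem exists when either condition fails.
    """
    covers_reference = reference_annotation <= test_annotation
    nothing_extra = test_annotation <= reference_annotation
    return not (covers_reference and nothing_extra)


reference = {("linking", (0, 1)), ("stress", (2,))}
same = has_prosody_problem(set(reference), reference)
missing = has_prosody_problem({("linking", (0, 1))}, reference)
extra = has_prosody_problem(reference | {("pause", (1, 2))}, reference)
moved = has_prosody_problem({("linking", (1, 2)), ("stress", (2,))}, reference)
```

Taken together, the two conditions amount to requiring that the two annotations be equal; they are kept separate here to mirror the wording of the text.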

Further, in one embodiment of the present invention, when the speech data under test is judged to have a pronunciation prosody problem, pronunciation prosody problem prompt information is generated according to the comparison result, and the user is prompted. Specifically, the prosody in which the speech data under test differs from the reference speech data (which may include missing prosody or extra prosody) can be determined according to the comparison result, and the user is prompted about the differing prosody. In this way, the user can be promptly prompted about pronunciation prosody problems and given feedback, which makes it easier for the user to improve and enhances the user experience.
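As a non-limiting illustration, the missing and extra prosody might be derived with set differences and turned into prompt messages. The function name `prosody_feedback`, the message wording, and the (prosody, position) set representation are hypothetical.

```python
def prosody_feedback(test_annotation, reference_annotation, words):
    """List prompt messages for prosody that is missing or extra.

    Annotations are sets of (prosody, position) pairs, where position is a
    tuple of word indices into `words`.
    """
    def describe(position):
        return " + ".join(words[i] for i in position)

    messages = []
    # Prosody present in the reference but absent from the speech under test.
    for prosody, position in sorted(reference_annotation - test_annotation):
        messages.append(f"missing {prosody}: {describe(position)}")
    # Prosody produced by the user that the reference does not contain.
    for prosody, position in sorted(test_annotation - reference_annotation):
        messages.append(f"unexpected {prosody}: {describe(position)}")
    return messages


words = ["not", "at", "all"]
reference = {("linking", (0, 1)), ("linking", (1, 2)), ("stress", (2,))}
test = {("linking", (0, 1)), ("pause", (1, 2))}
messages = prosody_feedback(test, reference, words)
```

In this made-up example the user paused between "at" and "all" instead of linking them and did not stress "all", so three messages are produced.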

An embodiment of the present invention may further include a step of annotating the reference speech data to obtain the prosodic annotation information of the reference speech data. Specifically, as shown in Fig. 2, the method for annotating the reference speech data may include the following steps:

S201: decode the reference speech data, and obtain the word boundary information of the reference speech data according to the decoding result.

In one embodiment of the present invention, a decoding network can be built from the text content corresponding to the reference speech data, and the decoding network and an acoustic model are passed to a decoder. The acoustic features of the reference speech data are then extracted and passed to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.

S202: extract the prosodic information of the reference speech data.

Specifically, it can be judged whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, stress determination can be performed on the reference speech data, and the silence duration, energy amplitude, fundamental frequency slope, and so on can be obtained, to extract the prosodic features of the reference speech data. Then, based on these prosodic features, prosodic information such as linking, pauses, stress, and rising and falling intonation in the reference speech data can be determined according to the corresponding decision strategy.

S203: determine the prosodic boundary information of the reference speech data according to the prosodic information and the word boundary information.

For example, if words A and B are linked, the start and end time frames of this linking prosody are the time frame (or time instant) at which the pronunciation of word A begins and the time frame (or time instant) at which the pronunciation of word B ends. The corresponding prosodic information can then be marked at the corresponding position according to the boundary information of each prosody.

S204: annotate the reference speech data according to the prosodic boundary information.

In this way, the prosodic information of the reference speech data can be detected and annotated automatically, avoiding the tedium and errors of manual annotation. Once annotated, the result can be reused in later detection, which is more convenient and accurate.
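As a non-limiting illustration of reusing a reference annotation once it has been produced, the annotation might be serialized to JSON and loaded again before each detection. The function names and the (prosody, position) set representation are assumptions of this sketch; the text does not specify any storage format.

```python
import json


def annotation_to_json(annotation):
    """Serialize a set of (prosody, position) pairs to a JSON string."""
    items = sorted((prosody, list(position)) for prosody, position in annotation)
    return json.dumps(items)


def annotation_from_json(text):
    """Restore the set of (prosody, position) pairs from a JSON string."""
    return {(prosody, tuple(position)) for prosody, position in json.loads(text)}


reference = {("linking", (0, 1)), ("stress", (2,))}
stored = annotation_to_json(reference)
restored = annotation_from_json(stored)
```

Sorting before serialization keeps the stored form stable across runs, which makes stored annotations easy to compare or version.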

To implement the above embodiments, the present invention further proposes a device for detecting pronunciation prosody problems.

Fig. 3 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to one embodiment of the present invention.

As shown in Fig. 3, the device for detecting pronunciation prosody problems according to an embodiment of the present invention includes: a receiving module 10, an obtaining module 20, a generating module 30, and a detecting module 40.

Specifically, the receiving module 10 is configured to receive speech data under test. For example, the speech data under test may be read-along speech recorded by a user following standard reference speech.

The obtaining module 20 is configured to obtain word boundary information of the speech data under test and to extract prosodic information of the speech data under test.

More specifically, in one embodiment of the present invention, the obtaining module 20 may first obtain the text content corresponding to the speech data under test (for example, the text that the read-along speech follows), build a decoding network from that text content, and then pass the decoding network and an acoustic model to a decoder. The acoustic model is the underlying mathematical model of speech recognition. Its modeling unit may be a phoneme, a syllable, or a word, and the mainstream approach at present is hidden Markov modeling. The decoder is one of the core components of a speech recognition system. Its task is to find, for the input acoustic features and according to the acoustic model and the decoding network, the linguistic unit sequence with the highest probability for those features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (such as the initials and finals of Chinese characters), syllables, or words in the text content and whose arcs are the connections between phonemes. The decoding network limits the range of linguistic unit sequences that the decoder can output.

Then, the obtaining module 20 extracts the acoustic features of the speech data under test and passes them to the decoder for decoding, so that the speech data under test is aligned with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. Here, an acoustic feature is a set of values describing the essential characteristics of short-time speech, typically a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the span from the time frame (or time instant) at which the pronunciation of a word begins in the speech under test to the time frame (or time instant) at which its pronunciation ends. From the word boundary information, the time span used to read each word in the speech data under test, and the time span between words, can thus be obtained.

Finally, the obtaining module 20 can extract the prosodic information of the speech data under test according to its word boundary information. The prosody of speech mainly includes linking, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted for detecting different kinds of prosody. For example, when the obtaining module 20 judges linking, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether there is an energy dip; when judging pauses, features such as the duration of silence between words are extracted; when judging stress, features such as the energy amplitude and fundamental frequency of a word are extracted; when judging rising and falling intonation, features such as the slope of the fundamental frequency of a word are extracted. The above prosodic features of each word and between words can then be computed in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word, and the linking and pauses between words, in the speech under test can be determined according to the corresponding decision strategy.

The generating module 30 is configured to generate the prosodic annotation information of the speech data under test according to its word boundary information and prosodic information. The prosodic annotation information includes at least one piece of prosodic information and position information corresponding to each piece of prosodic information, where each piece of position information is determined according to the corresponding prosodic boundary information. Prosodic annotation information marks the position information of the correct prosody in the text corresponding to the speech, that is, it marks between which two words in the text there is linking or a pause, or which words are stressed. Prosodic annotation is an important basis for prosody assessment.

In one embodiment of the present invention, the generating module 30 is specifically configured to: determine the prosodic boundary information of the speech data under test according to its word boundary information and prosodic information; and annotate the prosodic information of the speech data under test according to its prosodic boundary information, to generate the prosodic annotation information of the speech data under test.

The detecting module 40 is configured to compare the prosodic annotation information of the speech under test with the prosodic annotation information of the pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem. Here, the reference speech is the standard speech that the speech under test follows.

In an embodiment of the present invention, the detecting module 40 is specifically configured to judge whether the prosodic annotation information of the speech under test and the prosodic annotation information of the pre-annotated reference speech data satisfy the following conditions: the prosodic annotation information of the speech data under test marks all of the prosodic information marked in the prosodic annotation information of the reference speech data, and the position information corresponding to the marked prosodic information is consistent; and the prosodic information marked in the prosodic annotation information of the speech data under test does not include any prosodic information that is not marked in the prosodic annotation information of the reference speech data. If the conditions are not satisfied, the speech data under test is judged to have a pronunciation prosody problem.

Fig. 4 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to a specific embodiment of the present invention.

As shown in Fig. 4, the device for detecting pronunciation prosody problems according to an embodiment of the present invention includes: a receiving module 10, an obtaining module 20, a generating module 30, a detecting module 40, and an annotating module 50.

Specifically, the annotating module 50 is configured to annotate the reference speech data to obtain the prosodic annotation information of the reference speech data.

In one embodiment of the present invention, the annotating module 50 may be specifically configured to: decode the reference speech data, and obtain the word boundary information of the reference speech data according to the decoding result; extract the prosodic information of the reference speech data; determine the prosodic boundary information of the reference speech data according to the prosodic information and the word boundary information; and annotate the reference speech data according to the prosodic boundary information.

More specifically, the annotating module 50 can build a decoding network from the text content corresponding to the reference speech data and pass the decoding network and an acoustic model to a decoder, then extract the acoustic features of the reference speech data and pass them to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.

Then, the annotating module 50 can judge whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, perform stress determination on the reference speech data, and obtain the silence duration, energy amplitude, fundamental frequency slope, and so on, to extract the prosodic features of the reference speech data. Then, based on these prosodic features, prosodic information such as linking, pauses, stress, and rising and falling intonation in the reference speech data can be determined according to the corresponding decision strategy.

As shown in Fig. 5, the device for detecting a pronunciation rhythm problem according to this embodiment of the present invention includes a receiving module 10, an acquisition module 20, a generation module 30, a detection module 40, a labeling module 50 and a reminding module 60.

Specifically, the reminding module 60 is configured to, when it is judged that the speech data to be measured has a pronunciation rhythm problem, generate pronunciation rhythm problem prompt information according to the comparison result and prompt the user. More specifically, the reminding module 60 may determine, according to the comparison result, the prosody in the speech data to be measured that differs from that in the reference voice data (which may include missing prosody or extra prosody), and prompt the user with respect to the differing prosody.

Thus, the device for detecting a pronunciation rhythm problem according to this embodiment of the present invention can promptly prompt the user about pronunciation rhythm problems and give feedback, which makes it easier for the user to improve and enhances the user experience.
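The comparison and prompting described above can be sketched as a set difference over labels. The sketch assumes that each item of prosodic labeling information is a pair of a prosody type and a position (for example, the index of a word or word junction); the class `ProsodyLabel`, the three function names and the wording of the prompts are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True, order=True)
class ProsodyLabel:
    position: int   # assumed: index of the word or word junction
    kind: str       # e.g. "liaison", "pause", "stress"


def compare_labels(reference: Iterable[ProsodyLabel],
                   measured: Iterable[ProsodyLabel]
                   ) -> Tuple[List[ProsodyLabel], List[ProsodyLabel]]:
    """Return (missing, extra) labels of the measured speech relative to the reference."""
    ref, got = set(reference), set(measured)
    return sorted(ref - got), sorted(got - ref)


def has_rhythm_problem(reference, measured) -> bool:
    """A problem exists unless both label sets match in type and position."""
    missing, extra = compare_labels(reference, measured)
    return bool(missing or extra)


def build_prompts(reference, measured) -> List[str]:
    """Generate one prompt per differing label."""
    missing, extra = compare_labels(reference, measured)
    prompts = [f"Missing {label.kind} at position {label.position}" for label in missing]
    prompts += [f"Unexpected {label.kind} at position {label.position}" for label in extra]
    return prompts
```

Because a label is matched on both type and position, prosody of the right type at the wrong position is reported twice in this sketch, once as missing and once as unexpected.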

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, such an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise processing it in a suitable manner, and then stored in a computer memory.

It should be understood that the parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and purpose of the present invention, and that the scope of the present invention is defined by the claims and their equivalents.

Claims

1. A method for detecting a pronunciation rhythm problem, characterized by comprising:

receiving speech data to be measured;

obtaining word boundary information of the speech data to be measured, and extracting prosodic information of the speech data to be measured;

determining rhythm boundary information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured;

labeling the prosodic information of the speech data to be measured according to the rhythm boundary information of the speech data to be measured, so as to generate prosodic labeling information of the speech data to be measured; and

comparing and analyzing the prosodic labeling information of the speech data to be measured against the prosodic labeling information of pre-labeled reference voice data, so as to detect whether the speech data to be measured has a pronunciation rhythm problem.
2. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising:

labeling the reference voice data so as to obtain the prosodic labeling information of the reference voice data.
3. The method for detecting a pronunciation rhythm problem according to claim 2, characterized in that labeling the reference voice data specifically comprises:

decoding the reference voice data, and obtaining word boundary information of the reference voice data according to the decoding result;

extracting prosodic information of the reference voice data;

determining rhythm boundary information of the reference voice data according to the prosodic information and the word boundary information; and

labeling the reference voice data according to the rhythm boundary information.
4. The method for detecting a pronunciation rhythm problem according to any one of claims 1 to 3, characterized in that the prosodic labeling information includes at least one item of prosodic information and position information corresponding to each item of the at least one item of prosodic information, wherein each item of position information is determined according to the corresponding rhythm boundary information.
5. The method for detecting a pronunciation rhythm problem according to claim 4, characterized in that comparing the prosodic labeling information of the speech data to be measured with the prosodic labeling information of the pre-labeled reference voice data specifically comprises:

judging whether the prosodic labeling information of the speech data to be measured and the prosodic labeling information of the pre-labeled reference voice data meet the following condition:

the prosodic labeling information of the speech data to be measured labels all of the prosodic information labeled in the prosodic labeling information of the reference voice data, and the position information corresponding to the labeled prosodic information is consistent;

and the prosodic information labeled in the prosodic labeling information of the speech data to be measured does not include prosodic information that is not labeled in the prosodic labeling information of the reference voice data; and

if the condition is not met, judging that the speech data to be measured has a pronunciation rhythm problem.
6. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising:

when it is judged that the speech data to be measured has a pronunciation rhythm problem, generating pronunciation rhythm problem prompt information according to the comparison result, and prompting the user.
7. A device for detecting a pronunciation rhythm problem, characterized by comprising:

a receiving module, configured to receive speech data to be measured;

an acquisition module, configured to obtain word boundary information of the speech data to be measured and to extract prosodic information of the speech data to be measured;

a generation module, configured to determine rhythm boundary information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured, and to label the prosodic information of the speech data to be measured according to the rhythm boundary information of the speech data to be measured, so as to generate prosodic labeling information of the speech data to be measured; and

a detection module, configured to compare the prosodic labeling information of the speech data to be measured with the prosodic labeling information of pre-labeled reference voice data, so as to detect whether the speech data to be measured has a pronunciation rhythm problem.
8. The device for detecting a pronunciation rhythm problem according to claim 7, characterized by further comprising:

a labeling module, configured to label the reference voice data so as to obtain the prosodic labeling information of the reference voice data.
9. The device for detecting a pronunciation rhythm problem according to claim 8, characterized in that the labeling module is specifically configured to:

decode the reference voice data, and obtain word boundary information of the reference voice data according to the decoding result;

extract prosodic information of the reference voice data;

determine rhythm boundary information of the reference voice data according to the prosodic information and the word boundary information; and

label the reference voice data according to the rhythm boundary information.
10. The device for detecting a pronunciation rhythm problem according to any one of claims 7 to 9, characterized in that the prosodic labeling information includes at least one item of prosodic information and position information corresponding to each item of the at least one item of prosodic information, wherein each item of position information is determined according to the corresponding rhythm boundary information.
11. The device for detecting a pronunciation rhythm problem according to claim 10, characterized in that the detection module is specifically configured to:

judge whether the prosodic labeling information of the speech data to be measured and the prosodic labeling information of the pre-labeled reference voice data meet the following condition:

the prosodic labeling information of the speech data to be measured labels all of the prosodic information labeled in the prosodic labeling information of the reference voice data, and the position information corresponding to the labeled prosodic information is consistent;

and the prosodic information labeled in the prosodic labeling information of the speech data to be measured does not include prosodic information that is not labeled in the prosodic labeling information of the reference voice data.
12. The device for detecting a pronunciation rhythm problem according to claim 7, characterized by further comprising:

a reminding module, configured to, when it is judged that the speech data to be measured has a pronunciation rhythm problem, generate pronunciation rhythm problem prompt information according to the comparison result, and prompt the user.