CN104464751B - Method and device for detecting pronunciation prosody problems - Google Patents
Method and device for detecting pronunciation prosody problems
- Publication number
- CN104464751B (application number CN201410674294.6A)
- Authority
- CN
- China
- Prior art keywords
- information
- prosodic
- measured
- rhythm
- speech data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Electrically Operated Instructional Devices (AREA)
Abstract
The present invention proposes a method and device for detecting pronunciation prosody problems. The method includes: receiving speech data under test; obtaining word boundary information of the speech data under test, and extracting prosodic information of the speech data under test; generating prosody annotation information for the speech data under test according to its word boundary information and prosodic information; and comparing the prosody annotation information of the speech under test with the prosody annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem. Because the method obtains the prosody annotation information of the speech automatically before comparing it, no manual annotation is needed, and the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows the prosody problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not need a large-capacity database and involves little computation, which improves detection efficiency.
Description
Technical field
The present invention relates to the field of speech processing technology, and in particular to a method and device for detecting pronunciation prosody problems.
Background art
With the continuous development of speech recognition technology, speech evaluation technology plays an increasingly important role in speech recognition and its applications. Speech evaluation technology is mainly used to assess the quality of speech data. This includes not only assessing the pronunciation quality of the words in the speech data, but also detecting and assessing whether the prosody of the speech data is accurate. For example, in language learning, a user can learn a language by listening to a standard pronunciation and reading along. The user can compare whether the pronunciation and prosody of the read-along speech are consistent with those of the standard speech, and correct them according to the comparison result to keep improving. How to accurately assess the prosody problems in the read-along pronunciation and feed them back to the user is key to mastering a language quickly. A speech prosody problem means that incorrect prosody occurs in the speech, for example, no liaison where liaison is required, no pause where a pause is required, or no stress where stress is required. In some other scenarios, such as speech recognition, the pronunciation prosody problems of speech also need to be detected.
At present, the techniques used for prosody problem detection are mainly the manual annotation method and the prosodic constraint method.
In the manual annotation method, the correct prosody of the speech must be manually annotated in the text corresponding to the speech. Then, according to the position information of the manually annotated prosody, the prosody-related acoustic features at the corresponding positions in the speech are extracted, and the speech is checked for prosody problems. For example, for a word annotated as stressed, acoustic features such as the energy and fundamental frequency of the word's speech are extracted, and whether the word is stressed is determined by, for example, judging whether these acoustic features exceed a certain threshold.
The prosodic constraint method assesses the prosody of input speech data according to prosodic constraints. The prosodic constraints are obtained as follows: the linguistic or syntactic structure of the input speech data is matched against the standard structures of the standard speech in a standard corpus, and the prosodic boundary positions that the input speech should have are derived from the prosodic boundary positions of standard speech with a similar structure. When the standard corpus contains many standard utterances whose structure is similar to that of the input speech, the prosodic boundary that the input speech data should use can be determined according to the statistical probability of the prosodic boundaries.
Both existing prosody assessment techniques require the word boundaries and prosodic boundaries of the speech to be annotated manually. Without manual annotation, the prosody of a user's pronunciation cannot be assessed. In addition, the prosodic constraint method needs a large-capacity standard corpus. On the one hand, the corpus occupies a great deal of storage space. On the other hand, the standard speech in the standard corpus must be given correct prosody annotations manually, and when the prosodic constraints are determined, the whole standard corpus must be queried to calculate the statistical probability of the prosodic boundaries before the constraints can be determined, so the amount of computation is very large.
Summary of the invention
The present invention aims to solve the above technical problems at least to some extent.
To this end, a first object of the present invention is to propose a method for detecting pronunciation prosody problems that needs no manual annotation, can be applied more flexibly and widely, can assess the prosody problems in a user's pronunciation more effectively, and improves detection efficiency.
A second object of the present invention is to propose a device for detecting pronunciation prosody problems.
To achieve the above objects, an embodiment of a first aspect of the present invention proposes a method for detecting pronunciation prosody problems, including: receiving speech data under test; obtaining word boundary information of the speech data under test, and extracting prosodic information of the speech data under test; generating prosody annotation information for the speech data under test according to the word boundary information and prosodic information of the speech data under test; and comparing the prosody annotation information of the speech under test with the prosody annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.
The method for detecting pronunciation prosody problems of the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosody annotation information of the speech data under test accordingly, and compares it with the prosody annotation information of pre-annotated reference speech data to detect pronunciation prosody problems. Because the prosody annotation information of the speech is obtained automatically for comparison, no manual annotation is needed, and the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows the prosody problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not need a large-capacity database and involves little computation, which improves detection efficiency.
An embodiment of a second aspect of the present invention provides a device for detecting pronunciation prosody problems, including: a receiving module, configured to receive speech data under test; an acquisition module, configured to obtain word boundary information of the speech data under test and extract prosodic information of the speech data under test; a generation module, configured to generate prosody annotation information for the speech data under test according to the word boundary information and prosodic information of the speech data under test; and a detection module, configured to compare the prosody annotation information of the speech under test with the prosody annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.
The device for detecting pronunciation prosody problems of the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosody annotation information of the speech data under test accordingly, and compares it with the prosody annotation information of pre-annotated reference speech data to detect pronunciation prosody problems. Because the prosody annotation information of the speech is obtained automatically for comparison, no manual annotation is needed, and the device can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows the prosody problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not need a large-capacity database and involves little computation, which improves detection efficiency.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for detecting pronunciation prosody problems according to one embodiment of the present invention;
Fig. 2 is a flowchart of a method for annotating reference speech data according to one embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to a specific embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and should not be construed as limiting it.
In the description of the present invention, it should be understood that the term "multiple" means two or more, and the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
The method and device for detecting pronunciation prosody problems according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for detecting pronunciation prosody problems according to one embodiment of the present invention. As shown in Fig. 1, the method for detecting pronunciation prosody problems according to an embodiment of the present invention includes:
S101, receiving speech data under test.
For example, the speech data under test may be read-along speech recorded by a user following a standard reference speech.
S102, obtaining the word boundary information of the speech data under test, and extracting the prosodic information of the speech data under test.
Specifically, in one embodiment of the present invention, the text content corresponding to the speech data under test (for example, the text that the read-along speech follows) may first be obtained, a decoding network is built from this text content, and the decoding network and an acoustic model are then passed to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the mainstream modeling approach at present is hidden Markov modeling. The decoder is one of the core components of a speech recognition system. Its task is to find, for the input acoustic features and according to the acoustic model and the decoding network, the linguistic unit sequence with the highest probability for those acoustic features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (for example, the initials and finals of Chinese characters), syllables or words of the above text content, and whose arcs are the connections between phonemes. The decoding network defines the range of linguistic unit sequences that the decoder can output.
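As a minimal sketch of the idea of a decoding network built from a known text, the following assumes word-level nodes and a strictly linear graph (the description also allows phoneme or syllable nodes). The function name `build_linear_decoding_network` and the `<s>` and `</s>` markers are illustrative assumptions and do not come from the patent.

```python
from typing import Dict, List


def build_linear_decoding_network(text: str) -> Dict[str, List[str]]:
    """Build a linear directed graph: one node per word, one arc to the next word.

    Nodes are keyed as "<index>:<word>" so that repeated words stay distinct.
    "<s>" and "</s>" are hypothetical start and end markers.
    """
    words = text.lower().split()
    nodes = ["<s>"] + [f"{i}:{w}" for i, w in enumerate(words)] + ["</s>"]
    # Each node points to its successor; the end node has no outgoing arcs.
    graph = {node: [nxt] for node, nxt in zip(nodes, nodes[1:])}
    graph["</s>"] = []
    return graph
```

Because the text being read is known in advance, such a graph restricts the decoder to a single word sequence, which is what makes the later alignment step possible.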
Then, the acoustic features of the speech data under test are extracted and passed to the decoder for decoding, so that the speech data under test is aligned with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. Here, acoustic features are a set of values describing the essential characteristics of short-time speech, typically a feature vector of fixed dimension (for example, a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the time frame (or moment) at which the pronunciation of each word in the speech under test starts and the time frame (or moment) at which it ends. From the word boundary information, the time period spent reading each word in the speech data under test, and the time period between words, can therefore be obtained.
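The relationship between word boundaries, word durations and inter-word gaps can be sketched as follows. This is only an illustration that assumes the alignment result is already available as start and end times in seconds; the `WordBoundary` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBoundary:
    word: str
    start: float  # moment at which the word's pronunciation starts, in seconds
    end: float    # moment at which the word's pronunciation ends, in seconds


def durations_and_gaps(boundaries: List[WordBoundary]) -> Tuple[List[float], List[float]]:
    """Return the time spent on each word and the gap between consecutive words."""
    durations = [b.end - b.start for b in boundaries]
    gaps = [nxt.start - cur.end for cur, nxt in zip(boundaries, boundaries[1:])]
    return durations, gaps
```

For three aligned words the function returns three durations and two gaps; the gaps are the intervals in which silence between words would later be measured.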
Finally, the prosodic information of the speech data under test can be extracted according to its word boundary information. The prosody of speech mainly includes liaison, sense-group pauses, stress, rising and falling intonation, and similar information. Different prosodic features are extracted to detect different kinds of prosody. For example, to judge liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether there is an energy dip; to judge pauses, prosodic features such as the duration of silence between words are extracted; to judge stress, prosodic features such as the energy amplitude and fundamental frequency of a word are extracted; to judge rising and falling intonation, prosodic features such as the fundamental frequency slope of a word are extracted. The above prosodic features of each word and between words can then be calculated in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word in the speech under test, and the liaison and pauses between words, is determined according to the corresponding decision strategy.
For example, if there is no silence between two words and the fundamental frequency is continuous, the two words can be judged to be linked by liaison; if the silence between two words lasts longer than a certain time threshold, such as 0.05 seconds, a pause can be judged to exist between the two words; if the energy amplitude of one or more words exceeds a certain energy threshold, the word or words are stressed. Similarly, the rising or falling intonation of a word can be judged from the fundamental frequency slope.
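The decision rules in the example above can be written as small predicates. This is a hedged sketch: the 0.05 second pause threshold comes from the text, while the energy threshold is left as a required parameter because the text gives no value, and the `intonation` rule with its `tolerance` parameter and "level" outcome is an assumption added for illustration.

```python
def is_liaison(silence_duration: float, f0_continuous: bool) -> bool:
    """Liaison: no silence between the two words and a continuous F0 contour."""
    return silence_duration <= 0.0 and f0_continuous


def is_pause(silence_duration: float, time_threshold: float = 0.05) -> bool:
    """Pause: the silence between two words lasts longer than the time threshold."""
    return silence_duration > time_threshold


def is_stressed(energy: float, energy_threshold: float) -> bool:
    """Stress: the word's energy amplitude exceeds the energy threshold."""
    return energy > energy_threshold


def intonation(f0_slope: float, tolerance: float = 0.0) -> str:
    """Classify a word's contour from its F0 slope (hypothetical rule)."""
    if f0_slope > tolerance:
        return "rising"
    if f0_slope < -tolerance:
        return "falling"
    return "level"
```

In practice the thresholds would need to be tuned to the recording conditions and the feature scaling in use.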
S103, generating the prosody annotation information of the speech data under test according to the word boundary information and prosodic information of the speech data under test.
The prosody annotation information includes at least one item of prosodic information and the position information corresponding to each item, where each item of position information is determined according to the corresponding prosodic boundary information. Prosody annotation information marks the positions of the prosody in the text corresponding to the speech, that is, it marks between which two words in the text there is liaison or a pause, or which words are stressed. Prosody annotation is an important basis for prosody assessment.
In one embodiment of the present invention, generating the prosody annotation information of the speech data under test according to its word boundary information and prosodic information may specifically include: determining the prosodic boundary information of the speech data under test according to its word boundary information and prosodic information; and annotating the prosodic information of the speech data under test according to its prosodic boundary information, to generate the prosody annotation information of the speech data under test.
The prosodic boundary information can be determined from the word boundary information and the prosodic information of the words, the position information of each item of prosodic information is then determined, and the annotation is made according to that position information. For example, if words A and B are linked by liaison, the start and end time frames of this liaison are the time frame (or moment) at which the pronunciation of word A starts and the time frame (or moment) at which the pronunciation of word B ends, and the position information of this liaison can be determined to be the positions of word A and word B in the text. The corresponding prosodic information can then be annotated at the corresponding position according to the position information of each prosodic item.
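A prosody annotation can be represented as a prosody type paired with word positions in the text. The sketch below assumes that per-word stress decisions and between-word gap and F0-continuity measurements are already available; the `Annotation` type, the `annotate` function and the label strings are illustrative names, not taken from the patent.

```python
from typing import List, Set, Tuple

# (prosody type, positions of the affected words in the text)
Annotation = Tuple[str, Tuple[int, ...]]


def annotate(stressed: List[bool], gaps: List[float], f0_continuous: List[bool],
             pause_threshold: float = 0.05) -> Set[Annotation]:
    """Produce prosody annotations from per-word and between-word decisions.

    stressed[i] refers to word i; gaps[i] and f0_continuous[i] refer to the
    boundary between word i and word i + 1.
    """
    labels: Set[Annotation] = set()
    for i, flag in enumerate(stressed):
        if flag:
            labels.add(("stress", (i,)))
    for i, (gap, continuous) in enumerate(zip(gaps, f0_continuous)):
        if gap <= 0.0 and continuous:
            labels.add(("liaison", (i, i + 1)))
        elif gap > pause_threshold:
            labels.add(("pause", (i, i + 1)))
    return labels
```

Storing annotations as a set of hashable tuples keeps the later comparison with the reference annotations simple.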
S104, comparing the prosody annotation information of the speech under test with the prosody annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem.
The reference speech is the standard speech that the speech under test reads along with.
Specifically, in an embodiment of the present invention, it can be judged whether the prosody annotation information of the speech under test and the prosody annotation information of the pre-annotated reference speech data satisfy the following conditions:
The prosody annotation information of the speech data under test contains all of the prosodic information annotated in the prosody annotation information of the reference speech data, with consistent position information for each annotated item; and the prosodic information annotated in the prosody annotation information of the speech data under test does not include prosodic information that is not annotated in the prosody annotation information of the reference speech data.
If the conditions are not satisfied, the speech data under test is judged to have a pronunciation prosody problem.
That is, the speech data under test is judged to have no pronunciation prosody problem only when it contains all the prosody of the reference speech data (with the same corresponding prosodic boundary information) and contains no prosody that the reference speech data lacks. Otherwise, the speech data under test has a pronunciation prosody problem.
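If annotations are modelled as sets of (prosody type, positions) pairs, which is an assumption made here for illustration, the two conditions above amount to two subset checks, and together they are equivalent to set equality. The function name is hypothetical.

```python
from typing import Set, Tuple

Annotation = Tuple[str, Tuple[int, ...]]


def has_prosody_problem(test: Set[Annotation], reference: Set[Annotation]) -> bool:
    """Return True when the speech under test has a pronunciation prosody problem."""
    covers_reference = reference <= test  # every reference annotation appears in the test speech
    nothing_extra = test <= reference     # the test speech has no annotation the reference lacks
    return not (covers_reference and nothing_extra)
```

Writing the two checks separately mirrors the wording of the conditions and makes it easy to report later which one failed.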
Further, in one embodiment of the present invention, when the speech data under test is judged to have a pronunciation prosody problem, pronunciation prosody problem prompt information is generated according to the comparison result and presented to the user. Specifically, the prosody in which the speech data under test differs from the reference speech data (which may include missing prosody or extra prosody) can be determined from the comparison result, and the user is prompted about the differing prosody. In this way, pronunciation prosody problems can be prompted and fed back to the user in time, making it easier for the user to improve and enhancing the user experience.
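The missing and extra prosody mentioned above can be obtained with set differences. The following sketch again assumes annotations are sets of (prosody type, positions) pairs; the function names and the message wording are hypothetical and only illustrate how prompt information might be assembled.

```python
from typing import Dict, List, Set, Tuple

Annotation = Tuple[str, Tuple[int, ...]]


def prosody_differences(test: Set[Annotation],
                        reference: Set[Annotation]) -> Dict[str, List[Annotation]]:
    """Split the differences into prosody the user missed and prosody the user added."""
    return {
        "missing": sorted(reference - test),
        "extra": sorted(test - reference),
    }


def prompt_messages(diff: Dict[str, List[Annotation]], words: List[str]) -> List[str]:
    """Turn the differences into short prompts that name the affected words."""
    messages = []
    for kind, positions in diff["missing"]:
        target = " ".join(words[i] for i in positions)
        messages.append(f"Missing {kind} at: {target}")
    for kind, positions in diff["extra"]:
        target = " ".join(words[i] for i in positions)
        messages.append(f"Unexpected {kind} at: {target}")
    return messages
```

For example, a user who pauses between two words that the reference links by liaison would receive one "missing" prompt and one "unexpected" prompt for the same word pair.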
The method for detecting pronunciation prosody problems of the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosody annotation information of the speech data under test accordingly, and compares it with the prosody annotation information of pre-annotated reference speech data to detect pronunciation prosody problems. Because the prosody annotation information of the speech is obtained automatically for comparison, no manual annotation is needed, and the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows the prosody problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not need a large-capacity database and involves little computation, which improves detection efficiency.
In an embodiment of the present invention, the method may further include a step of annotating the reference speech data to obtain the prosody annotation information of the reference speech data. Specifically, as shown in Fig. 2, the method for annotating the reference speech data may include the following steps:
S201, decoding the reference speech data, and obtaining the word boundary information of the reference speech data according to the decoding result.
In one embodiment of the present invention, a decoding network can be built from the text content corresponding to the reference speech data, and the decoding network and an acoustic model are passed to a decoder. The acoustic features of the reference speech data are then extracted and passed to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.
S202, extracting the prosodic information of the reference speech data.
Specifically, to extract the prosodic features of the reference speech data, it can be judged whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, a stress judgement can be performed on the reference speech data, and the silence duration, energy amplitude, fundamental frequency slope and so on can be obtained. Further, based on these prosodic features, prosodic information such as liaison, pauses, stress, and rising or falling intonation in the reference speech data can be determined according to the corresponding decision strategy.
S203, determining the prosodic boundary information of the reference speech data according to the prosodic information and the word boundary information.
For example, if words A and B are linked by liaison, the start and end time frames of this liaison are the time frame (or moment) at which the pronunciation of word A starts and the time frame (or moment) at which the pronunciation of word B ends. The corresponding prosodic information can then be annotated at the corresponding position according to the boundary information of each prosodic item.
S204, annotating the reference speech data according to the prosodic boundary information.
In this way, the prosodic information of the reference speech data can be detected and annotated automatically, which avoids the tedium and errors of manual annotation. Once the annotation has been made, it can be reused in later detection, which is more convenient and accurate.
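Since the reference annotations are produced once and reused, they can be persisted between detection runs. The sketch below shows one possible way to do this with JSON from the Python standard library; the storage format and the function names are assumptions for illustration and are not specified in the patent.

```python
import json
from typing import Set, Tuple

Annotation = Tuple[str, Tuple[int, ...]]


def dump_annotations(labels: Set[Annotation]) -> str:
    """Serialize annotations to a JSON string in a stable (sorted) order."""
    return json.dumps(sorted([kind, list(positions)] for kind, positions in labels))


def load_annotations(text: str) -> Set[Annotation]:
    """Restore annotations from a JSON string produced by dump_annotations."""
    return {(kind, tuple(positions)) for kind, positions in json.loads(text)}
```

The string returned by `dump_annotations` can be written to a file next to the reference recording and loaded again whenever a user's read-along speech needs to be checked.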
To implement the above embodiments, the present invention also proposes a device for detecting pronunciation prosody problems.
Fig. 3 is a schematic structural diagram of a device for detecting pronunciation prosody problems according to one embodiment of the present invention.
As shown in Fig. 3, the device for detecting pronunciation prosody problems according to an embodiment of the present invention includes: a receiving module 10, an acquisition module 20, a generation module 30 and a detection module 40.
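The module structure of Fig. 3 can be sketched as four pluggable stages chained in order. This is only an architectural illustration: the class name `ProsodyProblemDetector`, the field names and the callable signatures are assumptions, and each stage is left abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set, Tuple


@dataclass
class ProsodyProblemDetector:
    receive: Callable[[Any], Any]              # receiving module 10
    acquire: Callable[[Any], Tuple[Any, Any]]  # acquisition module 20
    generate: Callable[[Any, Any], Set]        # generation module 30
    detect: Callable[[Set, Set], bool]         # detection module 40

    def run(self, raw_input: Any, reference_labels: Set) -> bool:
        """Return True when the speech under test has a prosody problem."""
        speech = self.receive(raw_input)
        boundaries, prosody = self.acquire(speech)
        labels = self.generate(boundaries, prosody)
        return self.detect(labels, reference_labels)
```

Keeping the stages as separate callables mirrors the division into modules and lets each one be replaced or tested independently.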
Specifically, the receiving module 10 is configured to receive speech data under test. For example, the speech data under test may be read-along speech recorded by a user following a standard reference speech.
The acquisition module 20 is configured to obtain the word boundary information of the speech data under test and extract the prosodic information of the speech data under test.
More specifically, in one embodiment of the present invention, the acquisition module 20 may first obtain the text content corresponding to the speech data under test (for example, the text that the read-along speech follows), build a decoding network from this text content, and then pass the decoding network and an acoustic model to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the mainstream modeling approach at present is hidden Markov modeling. The decoder is one of the core components of a speech recognition system. Its task is to find, for the input acoustic features and according to the acoustic model and the decoding network, the linguistic unit sequence with the highest probability for those acoustic features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (for example, the initials and finals of Chinese characters), syllables or words of the above text content, and whose arcs are the connections between phonemes. The decoding network defines the range of linguistic unit sequences that the decoder can output.
Then, the acquisition module 20 extracts the acoustic features of the speech data under test and passes them to the decoder for decoding, so that the speech data under test is aligned with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. Here, acoustic features are a set of values describing the essential characteristics of short-time speech, typically a feature vector of fixed dimension (for example, a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the time frame (or moment) at which the pronunciation of each word in the speech under test starts and the time frame (or moment) at which it ends. From the word boundary information, the time period spent reading each word in the speech data under test, and the time period between words, can therefore be obtained.
Finally, the acquisition module 20 can extract the prosodic information of the speech data under test according to its word boundary information. The prosody of speech mainly includes liaison, sense-group pauses, stress, rising and falling intonation, and similar information. Different prosodic features are extracted to detect different kinds of prosody. For example, when the acquisition module 20 judges liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether there is an energy dip; to judge pauses, prosodic features such as the duration of silence between words are extracted; to judge stress, prosodic features such as the energy amplitude and fundamental frequency of a word are extracted; to judge rising and falling intonation, prosodic features such as the fundamental frequency slope of a word are extracted. The above prosodic features of each word and between words can then be calculated in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word in the speech under test, and the liaison and pauses between words, is determined according to the corresponding decision strategy.
For example, if there is no silence between two words and the fundamental frequency is continuous, the two words can be judged to be linked by liaison; if the silence between two words lasts longer than a certain time threshold, such as 0.05 seconds, a pause can be judged to exist between the two words; if the energy amplitude of one or more words exceeds a certain energy threshold, the word or words are stressed. Similarly, the rising or falling intonation of a word can be judged from the fundamental frequency slope.
The generation module 30 is configured to generate the prosody annotation information of the speech data under test according to the word boundary information and prosodic information of the speech data under test. The prosody annotation information includes at least one item of prosodic information and the position information corresponding to each item, where each item of position information is determined according to the corresponding prosodic boundary information. Prosody annotation information marks the positions of the prosody in the text corresponding to the speech, that is, it marks between which two words in the text there is liaison or a pause, or which words are stressed. Prosody annotation is an important basis for prosody assessment.
In one embodiment of the present invention, the generation module 30 is specifically configured to: determine the prosodic boundary information of the speech data under test according to its word boundary information and prosodic information; and annotate the prosodic information of the speech data under test according to its prosodic boundary information, to generate the prosody annotation information of the speech data under test.
The prosodic boundary information can be determined from the word boundary information and the prosodic information of the words, the position information of each item of prosodic information is then determined, and the annotation is made according to that position information. For example, if words A and B are linked by liaison, the start and end time frames of this liaison are the time frame (or moment) at which the pronunciation of word A starts and the time frame (or moment) at which the pronunciation of word B ends, and the position information of this liaison can be determined to be the positions of word A and word B in the text. The corresponding prosodic information can then be annotated at the corresponding position according to the position information of each prosodic item.
The detection module 40 is configured to compare the prosody annotation information of the speech under test with the prosody annotation information of pre-annotated reference speech data, to detect whether the speech data under test has a pronunciation prosody problem. The reference speech is the standard speech that the speech under test reads along with.
In an embodiment of the present invention, the detection module 40 is specifically configured to judge whether the prosodic labeling information of the speech to be tested and the prosodic labeling information of the pre-labeled reference speech data satisfy the following conditions: all of the prosodic information labeled in the prosodic labeling information of the reference speech data is also labeled in the prosodic labeling information of the speech data to be tested, with consistent positional information for each labeled item; and the prosodic information labeled in the prosodic labeling information of the speech data to be tested does not include any prosodic information that is not labeled in the prosodic labeling information of the reference speech data. If these conditions are not satisfied, it is judged that the speech data to be tested has a pronunciation rhythm problem.
That is, the speech data to be tested is judged to have no pronunciation rhythm problem only when it includes all of the prosodic items of the reference speech data (with identical corresponding rhythm boundary information) and does not include any prosodic item that the reference speech data lacks. Otherwise, the speech data to be tested has a pronunciation rhythm problem.
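The two conditions amount to requiring that the labeled items of the speech to be tested and of the reference speech match exactly. A minimal sketch, assuming each label is represented as a hypothetical `(kind, word_indices)` pair where `word_indices` is the position in the text:

```python
def compare_labels(test_labels, ref_labels):
    """Return (missing, extra) prosodic labels.

    missing: labeled in the reference but absent from the test speech
             (or present at a different position).
    extra:   labeled in the test speech but not in the reference.
    """
    test_set, ref_set = set(test_labels), set(ref_labels)
    return ref_set - test_set, test_set - ref_set


def has_rhythm_problem(test_labels, ref_labels):
    """True if either condition from the description is violated."""
    missing, extra = compare_labels(test_labels, ref_labels)
    return bool(missing or extra)


# Hypothetical labels for the text "pick it up".
ref = [("liaison", (0, 1)), ("stress", (2,))]
ok = [("stress", (2,)), ("liaison", (0, 1))]      # same items, any order
shifted = [("liaison", (1, 2)), ("stress", (2,))]  # liaison at wrong position
```

A label at the wrong position shows up as one missing item and one extra item, so both conditions fail, which matches the requirement that positional information be consistent.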
The device for detecting a pronunciation rhythm problem according to the embodiment of the present invention obtains the word boundary information of the speech data to be tested and extracts its prosodic information, generates the prosodic labeling information of the speech data to be tested accordingly, and compares it with the prosodic labeling information of the pre-labeled reference speech data to detect a pronunciation rhythm problem. The prosodic labeling information of the speech is obtained and compared automatically, without manual labeling, so the device can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows the rhythm problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.
Fig. 4 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to a specific embodiment of the present invention.
As shown in Fig. 4, the device for detecting a pronunciation rhythm problem according to the embodiment of the present invention includes: a receiving module 10, an acquisition module 20, a generation module 30, a detection module 40, and a labeling module 50.
Specifically, the labeling module 50 is configured to label the reference speech data, so as to obtain the prosodic labeling information of the reference speech data.
In one embodiment of the invention, the labeling module 50 may be specifically configured to: decode the reference speech data and obtain the word boundary information of the reference speech data according to the decoding result; extract the prosodic information of the reference speech data; determine the rhythm boundary information of the reference speech data according to the prosodic information and the word boundary information; and label the reference speech data according to the rhythm boundary information.
More specifically, the labeling module 50 may build a decoding network according to the text content corresponding to the reference speech data and pass the decoding network and an acoustic model to a decoder. It then extracts the acoustic features of the reference speech data and passes them to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.
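How word boundaries fall out of an alignment result can be illustrated with a small sketch. It assumes, purely for illustration, that the decoder yields one entry per frame giving the index of the word being spoken, or `None` for silence; the patent does not specify the alignment format, so this representation is hypothetical.

```python
from itertools import groupby


def word_boundaries(frame_word_idx):
    """Turn a per-frame alignment into (word_index, start, end) triples.

    frame_word_idx[t] is the index of the word aligned to frame t,
    or None when frame t is silence. Frames are inclusive on both ends.
    """
    boundaries = []
    frame = 0
    for word_idx, run in groupby(frame_word_idx):
        length = sum(1 for _ in run)
        if word_idx is not None:
            boundaries.append((word_idx, frame, frame + length - 1))
        frame += length
    return boundaries


# Hypothetical alignment: silence, word 0, word 1, silence, word 2.
alignment = [None, 0, 0, 0, 1, 1, None, 2, 2]
bounds = word_boundaries(alignment)
```

Using word indices rather than word strings keeps two adjacent occurrences of the same word from being merged into a single run.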
The labeling module 50 may then determine whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, perform multi-pronunciation judgment on the reference speech data, and obtain the silence duration, the energy level, the slope of the fundamental frequency, and so on, so as to extract the prosodic features of the reference speech data. Based on these prosodic features, prosodic information such as liaison, pause, stress, and rising or falling intonation in the reference speech data can then be determined according to the corresponding judgment strategies.
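The patent does not spell out the judgment strategies, so the following is only a rule-based sketch of what such a strategy could look like. The function names, the feature choices, and every threshold value are assumptions made for illustration.

```python
def classify_boundary(silence_frames, f0_continuous, pause_min_frames=20):
    """Classify the junction between two adjacent words.

    silence_frames:   number of silent frames between the two words
    f0_continuous:    whether the fundamental frequency is unbroken
    pause_min_frames: hypothetical threshold for calling a gap a pause
    """
    if silence_frames >= pause_min_frames:
        return "pause"
    if silence_frames == 0 and f0_continuous:
        return "liaison"
    return None  # ordinary word junction, nothing to label


def classify_tone(f0_slope, min_slope=0.5):
    """Classify intonation from the F0 slope (hypothetical threshold)."""
    if f0_slope > min_slope:
        return "rising"
    if f0_slope < -min_slope:
        return "falling"
    return None
```

In practice the thresholds would need tuning on real recordings, and a stress rule based on energy level could be added in the same style.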
In this way, the prosodic information of the reference speech data can be detected and labeled automatically, which avoids the tedium and errors of manual labeling. Once labeled, the reference data can be reused in later detection, which is more convenient and accurate.
Fig. 5 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to another embodiment of the invention.
As shown in Fig. 5, the device for detecting a pronunciation rhythm problem according to the embodiment of the present invention includes: a receiving module 10, an acquisition module 20, a generation module 30, a detection module 40, a labeling module 50, and a prompting module 60.
Specifically, the prompting module 60 is configured to generate pronunciation rhythm problem prompt information according to the comparison result when it is judged that the speech data to be tested has a pronunciation rhythm problem, and to prompt the user. More specifically, the prompting module 60 can determine from the comparison result which prosodic items of the speech data to be tested differ from those of the reference speech data (these may include missing items and extra items), and prompt the user about the differing items.
In this way, the device for detecting a pronunciation rhythm problem according to the embodiment of the present invention can give the user timely prompts and feedback on pronunciation rhythm problems, which helps the user improve and enhances the user experience.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those skilled in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the invention is defined by the claims and their equivalents.
Claims (12)
- 1. A method for detecting a pronunciation rhythm problem, characterized by comprising: receiving speech data to be tested; obtaining word boundary information of the speech data to be tested, and extracting prosodic information of the speech data to be tested; determining rhythm boundary information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested; labeling the prosodic information of the speech data to be tested according to the rhythm boundary information of the speech data to be tested, so as to generate prosodic labeling information of the speech data to be tested; and comparing the prosodic labeling information of the speech to be tested with prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.
- 2. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising: labeling the reference speech data, so as to obtain the prosodic labeling information of the reference speech data.
- 3. The method for detecting a pronunciation rhythm problem according to claim 2, characterized in that labeling the reference speech data specifically comprises: decoding the reference speech data, and obtaining word boundary information of the reference speech data according to the decoding result; extracting prosodic information of the reference speech data; determining rhythm boundary information of the reference speech data according to the prosodic information and the word boundary information; and labeling the reference speech data according to the rhythm boundary information.
- 4. The method for detecting a pronunciation rhythm problem according to any one of claims 1-3, characterized in that the prosodic labeling information comprises at least one piece of prosodic information and positional information corresponding to each piece of prosodic information, wherein each piece of positional information is determined according to the corresponding rhythm boundary information.
- 5. The method for detecting a pronunciation rhythm problem according to claim 4, characterized in that comparing the prosodic labeling information of the speech to be tested with the prosodic labeling information of the pre-labeled reference speech data specifically comprises: judging whether the prosodic labeling information of the speech to be tested and the prosodic labeling information of the pre-labeled reference speech data satisfy the following conditions: all of the prosodic information labeled in the prosodic labeling information of the reference speech data is labeled in the prosodic labeling information of the speech data to be tested, and the positional information corresponding to the labeled prosodic information is consistent; and the prosodic information labeled in the prosodic labeling information of the speech data to be tested does not include prosodic information that is not labeled in the prosodic labeling information of the reference speech data; and if the conditions are not satisfied, judging that the speech data to be tested has a pronunciation rhythm problem.
- 6. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising: when it is judged that the speech data to be tested has a pronunciation rhythm problem, generating pronunciation rhythm problem prompt information according to the comparison result, and prompting the user.
- 7. A device for detecting a pronunciation rhythm problem, characterized by comprising: a receiving module, configured to receive speech data to be tested; an acquisition module, configured to obtain word boundary information of the speech data to be tested and to extract prosodic information of the speech data to be tested; a generation module, configured to determine rhythm boundary information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested, and to label the prosodic information of the speech data to be tested according to the rhythm boundary information of the speech data to be tested, so as to generate prosodic labeling information of the speech data to be tested; and a detection module, configured to compare the prosodic labeling information of the speech to be tested with prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.
- 8. The device for detecting a pronunciation rhythm problem according to claim 7, characterized by further comprising: a labeling module, configured to label the reference speech data, so as to obtain the prosodic labeling information of the reference speech data.
- 9. The device for detecting a pronunciation rhythm problem according to claim 8, characterized in that the labeling module is specifically configured to: decode the reference speech data, and obtain word boundary information of the reference speech data according to the decoding result; extract prosodic information of the reference speech data; determine rhythm boundary information of the reference speech data according to the prosodic information and the word boundary information; and label the reference speech data according to the rhythm boundary information.
- 10. The device for detecting a pronunciation rhythm problem according to any one of claims 7-9, characterized in that the prosodic labeling information comprises at least one piece of prosodic information and positional information corresponding to each piece of prosodic information, wherein each piece of positional information is determined according to the corresponding rhythm boundary information.
- 11. The device for detecting a pronunciation rhythm problem according to claim 10, characterized in that the detection module is specifically configured to: judge whether the prosodic labeling information of the speech to be tested and the prosodic labeling information of the pre-labeled reference speech data satisfy the following conditions: all of the prosodic information labeled in the prosodic labeling information of the reference speech data is labeled in the prosodic labeling information of the speech data to be tested, and the positional information corresponding to the labeled prosodic information is consistent; and the prosodic information labeled in the prosodic labeling information of the speech data to be tested does not include prosodic information that is not labeled in the prosodic labeling information of the reference speech data.
- 12. The device for detecting a pronunciation rhythm problem according to claim 7, characterized by further comprising: a prompting module, configured to generate pronunciation rhythm problem prompt information according to the comparison result when it is judged that the speech data to be tested has a pronunciation rhythm problem, and to prompt the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410674294.6A CN104464751B (en) | 2014-11-21 | 2014-11-21 | The detection method and device for rhythm problem of pronouncing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104464751A CN104464751A (en) | 2015-03-25 |
CN104464751B true CN104464751B (en) | 2018-01-16 |
Family
ID=52910695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410674294.6A Active CN104464751B (en) | 2014-11-21 | 2014-11-21 | The detection method and device for rhythm problem of pronouncing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104464751B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||