
CN107305541A - Speech recognition text segmentation method and device - Google Patents

Speech recognition text segmentation method and device

Info

Publication number
CN107305541A
CN107305541A (application CN201610256898.8A; granted as CN107305541B)
Authority
CN
China
Prior art keywords
segmentation; voice segments; identification text; feature
Prior art date
Legal status
Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN201610256898.8A
Other languages
Chinese (zh)
Other versions
CN107305541B (en)
Inventor
胡尹
潘清华
王金钖
胡国平
胡郁
Current Assignee (the listed assignee may be inaccurate; Google has not performed a legal analysis)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201610256898.8A (CN107305541B)
Publication of CN107305541A
Application granted; publication of CN107305541B
Legal status: Active

Classifications

    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection


Abstract

The invention discloses a speech recognition text segmentation method and device. The method includes: performing endpoint detection on speech data to obtain each speech segment and the start and end frame numbers of each segment; performing speech recognition on each speech segment to obtain the recognized text corresponding to each segment; extracting segmentation features from the recognized text corresponding to each segment; using the extracted segmentation features and a pre-built segmentation model to perform segmentation detection on the recognized text of the speech data, so as to determine the positions where paragraph breaks are needed; and segmenting the recognized text of the speech data according to the segmentation detection result. The invention can segment recognized text automatically, making its discourse structure clearer.

Description

Speech recognition text segmentation method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a speech recognition text segmentation method and device.
Background art
With the development of voice technology, automatic speech recognition has been widely applied in every field of life, and converting speech into text brings great convenience, for example transcribing a meeting recording into text to distribute as minutes to the participants, or transcribing an interview recording into text as the basis for a news article. However, the text produced by speech recognition lacks the clear discourse structure of human-edited text, such as division into paragraphs. As a result, users reading the recognized text often find it hard to locate its key points or topics; especially when the text is long and covers several topics, it is even harder to grasp its structure and accurately find the content of each topic. How to present recognized text clearly and help users understand its content is therefore particularly important for the display of speech recognition text.
In the prior art, the recognized text of speech data is usually shown to the user directly, without any processing of the recognition result; alternatively, the structure of the recognized text is adjusted manually before display, for example by dividing the text into paragraphs according to its content and showing the adjusted text to the user. When the recognized text is long, such manual adjustment is labor-intensive, inefficient, and time-consuming, so a recognition system can hardly achieve practical effect.
Summary of the invention
The present invention provides a speech recognition text segmentation method and device, to solve the prior-art problem that manually adjusting the structure of recognized text involves heavy workload and low efficiency.
To this end, the present invention provides the following technical solution:
A speech recognition text segmentation method, including:
performing endpoint detection on speech data to obtain each speech segment and the start and end frame numbers of each speech segment;
performing speech recognition on each speech segment to obtain the recognized text corresponding to each segment;
extracting segmentation features from the recognized text corresponding to each segment;
using the extracted segmentation features and a pre-built segmentation model to perform segmentation detection on the recognized text of the speech data, so as to determine the positions where segmentation is needed;
segmenting the recognized text of the speech data according to the segmentation detection result.
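The claimed steps can be sketched as a small pipeline. This is a hypothetical illustration only: the patent names no APIs, so every callable here (`detect_endpoints`, `recognize`, `extract_features`, `seg_model`) is an assumed placeholder injected by the caller.

```python
# Hypothetical sketch of the claimed method; all callables are placeholders
# injected by the caller, since the patent specifies no concrete APIs.

def segment_pipeline(speech_data, detect_endpoints, recognize,
                     extract_features, seg_model):
    # Step 1: endpoint detection -> (start_frame, end_frame) per speech segment
    segments = detect_endpoints(speech_data)
    # Step 2: speech recognition -> recognized text per segment
    texts = [recognize(speech_data, s, e) for s, e in segments]
    # Step 3: segmentation features per segment's recognized text
    feats = [extract_features(segments, texts, i) for i in range(len(segments))]
    # Step 4: segmentation detection with the pre-built model
    needs_break = [seg_model(f) for f in feats]
    # Step 5: a True flag marks an end position that needs segmentation
    return texts, needs_break

# Toy run with stub components: three segments, break where the pause >= 5 frames
texts, needs_break = segment_pipeline(
    "dummy audio",
    lambda sd: [(0, 10), (12, 20), (25, 40)],
    lambda sd, s, e: "t%d" % s,
    lambda segs, txts, i: {"gap": segs[i][0] - segs[i - 1][1] if i else 0},
    lambda f: f["gap"] >= 5)
```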
Preferably, the method also includes building the segmentation model as follows:
collecting speech data;
performing endpoint detection on the collected speech data to obtain each speech segment;
performing speech recognition on each speech segment to obtain its recognized text;
annotating the segmentation information of each segment's recognized text, where the segmentation information indicates whether the end position of the current segment's recognized text needs segmentation;
extracting the segmentation features of each segment's recognized text;
building the segmentation model with the segmentation features and the segmentation information as training data.
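The patent leaves the model family open, so the following stand-in (an assumption, not the patent's method) fits a tiny logistic-regression classifier purely to illustrate how the annotated segmentation labels and extracted features become supervised training data.

```python
# Minimal logistic-regression stand-in for the segmentation model; the patent
# does not specify the model type, so this is an illustrative assumption.
import math

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def build_seg_model(samples, labels, lr=0.1, epochs=500):
    """Train on segmentation features (lists of floats) and 0/1 labels;
    returns a function mapping a feature vector to a break probability."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data: single feature = pause before the next segment, in frames;
# long pauses tend to coincide with annotated paragraph boundaries.
X = [[2.0], [40.0], [3.0], [55.0], [1.0], [60.0]]
y = [0, 1, 0, 1, 0, 1]
model = build_seg_model(X, y)
```

In a real system the feature vectors would come from the acoustic and semantic extraction steps described below, and the labels from the annotation step.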
Preferably, extracting the segmentation features of each segment's recognized text includes:
extracting segmentation features of each speech segment from the acoustics of the speech data, and taking these as the first segmentation features of the segment's recognized text; and/or
extracting segmentation features from the semantics of the recognized text, and taking these as the second segmentation features of the recognized text.
Preferably, the first segmentation features include the duration of the current speech segment, and additionally the distance between the current segment and the previous segment, and/or the distance between the current segment and the next segment;
extracting the segmentation features of each segment from the acoustics of the speech data includes:
computing the difference between the end frame number and the start frame number of the current segment, and taking the difference as the duration of the current segment;
and also:
computing the difference between the start frame number of the current segment and the end frame number of the previous segment, and taking the difference as the distance between the current and previous segments; and/or
computing the difference between the start frame number of the next segment and the end frame number of the current segment, and taking the difference as the distance between the current and next segments.
Preferably, the first segmentation features also include whether the speaker of the current segment is the same as the speaker of the previous segment, and/or whether the speaker of the current segment is the same as the speaker of the next segment;
extracting the segmentation features of each segment from the acoustics of the speech data also includes:
performing speaker change point detection on the speech data using speaker separation techniques;
determining from the speaker change point detection result whether the speaker of the current segment is the same as that of the previous segment, and/or whether the speaker of the current segment is the same as that of the next segment.
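Once change point detection has assigned a speaker label to each segment, the two speaker features reduce to neighbour comparisons. A minimal realization (an assumption about the data layout, not specified by the patent):

```python
# Assumed layout: one speaker ID per speech segment, produced upstream by
# speaker change point detection. The feature is just a neighbour comparison.

def speaker_features(speakers, i):
    """speakers: per-segment speaker IDs; returns (same_as_prev, same_as_next)
    for segment i. The first segment has no previous neighbour and the last
    has no next one, so those comparisons default to False."""
    same_prev = i > 0 and speakers[i] == speakers[i - 1]
    same_next = i < len(speakers) - 1 and speakers[i] == speakers[i + 1]
    return same_prev, same_next

speakers = ["A", "A", "B", "B"]
```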
Preferably, the second segmentation features include any one or more of the following:
the forward unsegmented sentence count, i.e. the total number of sentences in all recognized text between the start of the current segment's recognized text and the previous segmentation marker;
the backward unsegmented sentence count, i.e. the total number of sentences in all recognized text after the current segment's recognized text;
the number of sentences in the current segment's recognized text;
the similarity between the current segment's recognized text and the previous segment's recognized text;
the similarity between the current segment's recognized text and the next segment's recognized text.
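The patent does not fix the similarity measure for the last two features; cosine similarity over word counts is one common choice, sketched here under that assumption with pre-tokenized word lists as input.

```python
# Assumed similarity measure (the patent leaves it unspecified): cosine
# similarity over word-count vectors of two recognized texts.
from collections import Counter
import math

def text_similarity(a_words, b_words):
    """Inputs are pre-tokenized word lists; returns cosine similarity in [0, 1]."""
    ca, cb = Counter(a_words), Counter(b_words)
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Adjacent segments that continue one topic tend to share vocabulary and score high; a drop in similarity is evidence for a paragraph break.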
Preferably, extracting segmentation features from the semantics of the recognized text includes:
correcting the recognized text of the speech data, where the correction includes adding punctuation to the recognized text of the speech data;
extracting segmentation features from the semantics of the corrected recognized text.
Preferably, the correction also includes any one or more of the following:
filtering abnormal words from the recognized text of the speech data;
smoothing the recognized text of the speech data;
normalizing digits in the recognized text of the speech data;
performing text replacement on the recognized text of the speech data, where the replacement includes converting lowercase English letters in the recognized text to uppercase or vice versa, and/or replacing sensitive words in the recognized text with special characters.
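The text replacement step described above (case conversion plus sensitive-word masking) is simple to realize; the sketch below is an assumed implementation, with `*` chosen arbitrarily as the special character.

```python
# Sketch of the described text replacement: English case conversion and
# masking of sensitive words. The '*' mask character is an assumption.
import re

def replace_text(text, sensitive_words, to_upper=True):
    # Case conversion of English letters
    text = text.upper() if to_upper else text.lower()
    # Replace each sensitive word with '*' of the same length, ignoring case
    for w in sensitive_words:
        pattern = re.compile(re.escape(w), re.IGNORECASE)
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text
```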
Preferably, using the extracted segmentation features and the pre-built segmentation model to perform segmentation detection on the recognized text of the speech data, so as to determine the positions needing segmentation, includes:
taking speech segments as the unit, feeding the segmentation features of each segment's recognized text into the segmentation model in turn for segmentation detection, and determining whether the end position of each segment's recognized text needs segmentation.
Preferably, the method also includes:
showing the segmented recognized text to the user; or
extracting the topic of each paragraph of the segmented recognized text and showing the topics to the user;
and, when the user is detected to be interested in a topic, showing the user the recognized text of the paragraph corresponding to that topic.
A speech recognition text segmentation device, including:
an endpoint detection module, for performing endpoint detection on speech data to obtain each speech segment and the start and end frame numbers of each segment;
a speech recognition module, for performing speech recognition on each speech segment to obtain the recognized text corresponding to each segment;
a feature extraction module, for extracting the segmentation features of each segment's recognized text;
a segmentation detection module, for using the extracted segmentation features and a pre-built segmentation model to perform segmentation detection on the recognized text of the speech data, so as to determine the positions needing segmentation;
a segmentation module, for segmenting the recognized text of the speech data according to the segmentation detection result.
Preferably, the device also includes a segmentation model building module for building the segmentation model; the segmentation model building module includes:
a data collection unit, for collecting speech data;
an endpoint detection unit, for performing endpoint detection on the speech data collected by the data collection unit to obtain each speech segment;
a speech recognition unit, for performing speech recognition on each segment to obtain its recognized text;
an annotation unit, for annotating the segmentation information of each segment's recognized text, the segmentation information indicating whether the end position of the current segment's recognized text needs segmentation;
a feature extraction unit, for extracting the segmentation features of each segment's recognized text;
a training unit, for building the segmentation model with the segmentation features and the segmentation information as training data.
Preferably, the feature extraction module includes:
a first feature extraction module, for extracting segmentation features of each segment from the acoustics of the speech data and taking them as the first segmentation features of the segment's recognized text; and/or
a second feature extraction module, for extracting segmentation features from the semantics of the recognized text and taking them as the second segmentation features of the recognized text.
Preferably, the first feature extraction module includes:
a duration calculation unit, for computing the difference between the end frame number and the start frame number of the current segment, and taking the difference as the duration of the current segment;
a distance calculation unit, for computing the difference between the start frame number of the current segment and the end frame number of the previous segment and taking it as the distance between the current and previous segments, and/or computing the difference between the start frame number of the next segment and the end frame number of the current segment and taking it as the distance between the current and next segments.
Preferably, the first feature extraction module also includes:
a speaker change point detection unit, for performing speaker change point detection on the speech data using speaker separation techniques;
a speaker determination unit, for determining from the speaker change point detection result whether the speaker of the current segment is the same as that of the previous segment, and/or whether the speaker of the current segment is the same as that of the next segment.
Preferably, the second segmentation features include any one or more of the following:
the forward unsegmented sentence count, i.e. the total number of sentences in all recognized text between the start of the current segment's recognized text and the previous segmentation marker;
the backward unsegmented sentence count, i.e. the total number of sentences in all recognized text after the current segment's recognized text;
the number of sentences in the current segment's recognized text;
the similarity between the current segment's recognized text and the previous segment's recognized text;
the similarity between the current segment's recognized text and the next segment's recognized text.
Preferably, the second feature extraction module includes:
a correction unit for correcting the recognized text of the speech data, the correction unit including a punctuation subunit for adding punctuation to the recognized text of the speech data;
a feature extraction unit, for extracting segmentation features from the semantics of the corrected recognized text.
Preferably, the correction unit also includes any one or more of the following subunits:
a filtering subunit, for filtering abnormal words from the recognized text of the speech data;
a smoothing subunit, for smoothing the recognized text of the speech data;
a normalization subunit, for normalizing digits in the recognized text of the speech data;
a text replacement subunit, for performing text replacement on the recognized text of the speech data, the replacement including converting lowercase English letters in the recognized text to uppercase or vice versa, and/or replacing sensitive words in the recognized text with special characters.
Preferably, the segmentation detection module is specifically used, taking speech segments as the unit, to feed the segmentation features of each segment's recognized text into the segmentation model in turn for segmentation detection, determining whether the end position of each segment's recognized text needs segmentation.
Preferably, the device also includes:
a first display module, for showing the segmented recognized text to the user; or
a topic extraction module, for extracting the topic of each paragraph of the segmented recognized text;
a second display module, for showing the topics to the user;
a sensing module, for detecting the topic the user is interested in and, when such a topic is detected, triggering the second display module to show the user the recognized text of the paragraph corresponding to that topic.
The present invention provides a speech recognition text segmentation method and device. Endpoint detection is performed on speech data to obtain each speech segment; speech recognition is performed on each segment to obtain its recognized text; segmentation features are then extracted from each segment's recognized text; and, using the extracted segmentation features and a pre-built segmentation model, segmentation detection is performed on the recognized text of the speech data to determine the positions needing segmentation, with the recognized text segmented according to the detection result. The discourse structure of the recognized text is thereby adjusted automatically and made clearer, helping users quickly understand the content and improving reading efficiency.
Further, the segmentation features can be extracted from the acoustics of the speech data or from the semantics of the recognized text; of course, the features extracted at both levels can also be combined, with a corresponding segmentation model used to perform segmentation detection on the recognized text and determine the positions needing segmentation, which can further improve segmentation accuracy.
Further, all of the segmented recognized text can be shown to the user; alternatively, the topic of each paragraph can be extracted and shown to the user first, and the content of a paragraph shown only when the user wants to view it, helping the user quickly find the content of interest.
Brief description of the drawings
To explain the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can also obtain other drawings from them.
Fig. 1 is a flowchart of the speech recognition text segmentation method of an embodiment of the present invention;
Fig. 2 is a flowchart of building the segmentation model in an embodiment of the present invention;
Fig. 3 is a structural diagram of a speech recognition text segmentation device of an embodiment of the present invention;
Fig. 4 is a structural diagram of the segmentation model building module in an embodiment of the present invention;
Fig. 5 is another structural diagram of the speech recognition text segmentation device of an embodiment of the present invention;
Fig. 6 is yet another structural diagram of the speech recognition text segmentation device of an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
As shown in Fig. 1, the speech recognition text segmentation method of an embodiment of the present invention includes the following steps:
Step 101: perform endpoint detection on the speech data to obtain each speech segment and the start and end frame numbers of each segment.
The speech data can be recordings from practical applications, such as meeting recordings or interview recordings.
Endpoint detection finds the start point and end point of each speech segment in a given speech signal. Any existing endpoint detection method can be used; the embodiments of the present invention place no limit on this.
Step 102: perform speech recognition on each speech segment to obtain the recognized text corresponding to each segment.
Specifically, features can be extracted from each speech segment, such as MFCC (Mel Frequency Cepstrum Coefficient) features; the extracted feature data is then decoded using a pre-trained acoustic model and language model, and the recognized text of each segment is obtained from the decoding result. The detailed speech recognition process is the same as in the prior art and is not described here.
Step 103: extract the segmentation features of the recognized text corresponding to each speech segment.
In practice, the segmentation features can be extracted from the acoustics of the speech data or from the semantics of the recognized text; of course, features extracted at both levels can also be combined, with a corresponding segmentation model used to perform segmentation detection on the recognized text of the speech data and determine the positions needing segmentation, which can further improve segmentation accuracy.
Step 104: use the extracted segmentation features and a pre-built segmentation model to perform segmentation detection on the recognized text of the speech data, so as to determine the positions needing segmentation.
Specifically, taking speech segments as the unit, the segmentation features of each segment's recognized text are fed into the segmentation model for segmentation detection, determining whether the end position of each segment's recognized text needs segmentation.
Note that in practice the output of the segmentation model can be either whether the end position of the current segment's recognized text needs segmentation, or the probability that it needs segmentation. The choice of output parameter does not affect the training process of the model; only different input/output parameters need to be set during training. The specific training process of the segmentation model is described in detail later.
If the segmentation model outputs a probability, a corresponding threshold can be preset: when the probability exceeds the threshold, the end position of the current segment's recognized text is considered to need segmentation.
Step 105: segment the recognized text of the speech data according to the segmentation detection result.
Specifically, a segmentation marker can be added at each end position of the recognized text that needs segmentation, so that at display time the recognized text of the speech data can conveniently be shown in paragraphs according to the markers.
Note that in practice the segmentation features can be extracted one segment at a time, with segmentation detection performed on that segment's recognized text immediately; or the features of all segments can be extracted first and then, taking speech segments as the unit, fed into the segmentation model segment by segment for detection. The embodiments of the present invention place no limit on this.
In another embodiment of the method, a step of showing the segmented recognized text to the user can be further included. At display time, text belonging to the same section is placed in one paragraph according to the segmentation markers in the recognized text, and different sections are displayed as separate paragraphs.
For example, suppose the recognized text is A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, where Ai denotes the recognized text of one speech segment, and segmentation detection determines that segmentation is needed at A2 and A5. The displayed form is then:
A1,A2
A3,A4,A5
A6,A7,A8,A9,A10
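The grouping shown above can be reproduced with a few lines; this is an illustrative sketch, with the break positions taken from the example (segmentation after A2 and A5, i.e. zero-based indices 1 and 4).

```python
# Group per-segment recognized texts into display paragraphs; `break_after`
# holds the indices whose end positions were detected as needing segmentation.

def group_paragraphs(texts, break_after):
    paragraphs, current = [], []
    for i, t in enumerate(texts):
        current.append(t)
        if i in break_after:
            paragraphs.append(current)
            current = []
    if current:  # trailing texts after the last marker form the final paragraph
        paragraphs.append(current)
    return paragraphs

texts = ["A%d" % i for i in range(1, 11)]
paragraphs = group_paragraphs(texts, {1, 4})  # breaks after A2 and A5
```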
In another embodiment of the method, the following steps can be further included:
extracting the topic of each paragraph of the segmented recognized text and showing the topics to the user;
when the user is detected to be interested in a topic, showing the user the recognized text of the paragraph corresponding to that topic.
The user can select a topic of interest in many ways, for example by clicking or circling the corresponding topic, or by assigning each topic a sequence number and having the user enter the number via the keyboard.
As mentioned above, in the embodiments of the present invention the segmentation features can be extracted from the acoustics of the speech data or from the semantics of the recognized text, and the two can of course be combined: segmentation features of each speech segment are extracted acoustically and taken as the first segmentation features of the segment's recognized text, and segmentation features are extracted from the semantics of the recognized text and taken as the second segmentation features of the recognized text. Correspondingly, the segmentation model can be trained solely on the acoustic segmentation features, solely on the semantic segmentation features, or on both.
The segmentation features at these two levels are described in detail below.
1. Segmentation features extracted from the acoustics of the speech data, i.e. the aforementioned first segmentation features.
In practice, the first segmentation features can include the duration of the current speech segment, and additionally the distance between the current segment and the previous segment, and/or the distance between the current segment and the next segment.
Further, the first segmentation features may also include whether the speaker of the current segment is the same as that of the previous segment, and/or whether the speaker of the current segment is the same as that of the next segment.
These features are described in detail below.
A) Duration of the current speech segment
The duration of a speech segment can be represented by the number of frames the segment contains. Therefore, the difference between the end frame number and the start frame number of the current speech segment is computed, and this difference is taken as the duration of the current speech segment.
B) Distance between the current speech segment and the previous speech segment
The distance between the current speech segment and the previous speech segment can be represented by the difference between the start frame number of the current speech segment and the end frame number of the previous speech segment. Therefore, this difference is computed and taken as the distance between the current speech segment and the previous speech segment.
It should be noted that when the current speech segment is the first speech segment, the distance between the current speech segment and the previous speech segment is 0.
C) Distance between the current speech segment and the next speech segment
Similarly, the distance between the current speech segment and the next speech segment can be represented by the difference between the start frame number of the next speech segment and the end frame number of the current speech segment. This difference is computed and taken as the distance between the current speech segment and the next speech segment.
It should be noted that when the current speech segment is the last speech segment, the distance between the current speech segment and the next speech segment is 0.
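The three acoustic features above can be sketched as follows; this is a minimal illustration assuming each speech segment from endpoint detection is given as a (start_frame, end_frame) pair, with the function name chosen for illustration only:

```python
def acoustic_features(segments, i):
    """Acoustic segmentation features of segment i.

    `segments` is a list of (start_frame, end_frame) pairs, one per
    speech segment, as produced by endpoint detection."""
    start, end = segments[i]
    duration = end - start  # duration in frames
    # Distance to the previous segment (0 for the first segment).
    dist_prev = start - segments[i - 1][1] if i > 0 else 0
    # Distance to the next segment (0 for the last segment).
    dist_next = segments[i + 1][0] - end if i < len(segments) - 1 else 0
    return duration, dist_prev, dist_next
```

For segments [(0, 100), (120, 200), (230, 300)], the middle segment yields a duration of 80 frames, a distance of 20 frames to the previous segment and 30 frames to the next.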
D) Whether the speaker of the current speech segment is the same as the speaker of the previous speech segment
E) Whether the speaker of the current speech segment is the same as the speaker of the next speech segment
To detect whether adjacent speech segments share the same speaker, speaker separation techniques can be applied to the speech data to perform speaker change point detection; from the change point detection result it is determined whether the speaker of the current speech segment is the same as the speaker of the previous speech segment, and whether the speaker of the current speech segment is the same as the speaker of the next speech segment.
A speaker change point is the place where one speaker stops speaking and another starts. The specific detection method is the same as in the prior art and is not described in detail here.
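Once the change points are detected, mapping them onto the per-segment speaker flags is straightforward. A hedged sketch, under the simplifying assumption that change points are frame indices and that adjacent segments share a speaker unless a change point falls in the gap between them (the function and parameter names are illustrative, not from the patent):

```python
def same_speaker_flags(segments, change_points, i):
    """Whether segment i shares a speaker with its neighbours.

    `segments`: list of (start_frame, end_frame) pairs;
    `change_points`: detected speaker change points as frame indices."""
    def no_change_between(a, b):
        # a, b: frame ranges of two adjacent segments; a change point
        # inside the gap [a.end, b.start] separates the speakers.
        return not any(a[1] <= cp <= b[0] for cp in change_points)
    same_prev = i > 0 and no_change_between(segments[i - 1], segments[i])
    same_next = (i < len(segments) - 1
                 and no_change_between(segments[i], segments[i + 1]))
    return same_prev, same_next
```

With a change point at frame 110, a segment ending at frame 100 and the next one starting at frame 120 would be attributed to different speakers.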
2. Segmentation features extracted from the semantics of the recognized text, i.e. the aforementioned second segmentation features.
In practical applications, the second segmentation features may include any one or more of the following:
A) Forward unsegmented sentence number: the total number of sentences contained in all recognized text between the start position of the recognized text corresponding to the current speech segment and the previous segmentation marker.
The previous segmentation marker can be obtained from the segmentation markers of the recognized text preceding the recognized text corresponding to the current speech segment, and those markers can in turn be obtained from the segmentation detection results of the recognized text corresponding to the preceding speech segments.
It should be noted that, in practical applications, if the second segmentation features include the forward unsegmented sentence number, segmentation detection must be performed in the manner described above, i.e. extracting the segmentation features of the recognized text corresponding to one speech segment at a time and performing segmentation detection on that recognized text.
In addition, it should be noted that if the recognized text corresponding to the current speech segment is the beginning of the recognized text, the forward unsegmented sentence number is 0.
B) Backward unsegmented sentence number: the total number of sentences contained in all recognized text after the recognized text corresponding to the current speech segment.
The backward unsegmented sentence number can be obtained by counting the sentences in the recognized text after the recognized text corresponding to the current speech segment.
It should be noted that if the recognized text corresponding to the current speech segment is the ending of all the recognized text, the backward unsegmented sentence number is 0.
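The two counts can be sketched as follows, assuming the per-segment sentence counts are already known and the position of the previous segmentation marker is tracked as a segment index (names are illustrative, not from the patent):

```python
def unsegmented_sentence_counts(sent_counts, last_marker, i):
    """Forward/backward unsegmented sentence numbers for segment i.

    `sent_counts[k]`: number of sentences in the recognized text of
    segment k; `last_marker`: index of the first segment after the
    previous segmentation marker (0 if no break detected yet)."""
    forward = sum(sent_counts[last_marker:i])   # 0 at the very beginning
    backward = sum(sent_counts[i + 1:])         # 0 at the very ending
    return forward, backward
```

For sentence counts [2, 3, 1, 4] with the previous marker after segment 0, segment 2 has a forward count of 3 and a backward count of 4.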
C) The number of sentences contained in the recognized text corresponding to the current speech segment.
Specifically, the number of sentences can be obtained directly by analyzing the punctuation in the recognized text corresponding to the current speech segment.
D) The similarity between the recognized text corresponding to the current speech segment and the recognized text corresponding to the previous speech segment.
The similarity is typically measured by the distance or angle between vectors, for example the cosine of the angle between the two vectors: the smaller the angle, the more similar the two recognized-text vectors. The word vectorization process is prior art and is not described in detail here.
To exclude the interference of stop words with the text similarity computation, the stop words contained in the recognized texts corresponding to the current and previous speech segments are deleted first. Stop words are words, symbols or garbled characters that occur frequently in recognized text but carry no practical meaning, such as "this", "and", "will", "is". Specifically, the deletion can be realized by looking up the recognized text against a pre-built stop-word list. Then, the word vectors remaining in each recognized text after stop-word deletion are combined per speech segment, yielding the recognized-text vector of the current speech segment and that of the previous speech segment, and the similarity of the two recognized-text vectors is computed.
It should be noted that when the recognized text corresponding to the current speech segment is the beginning of all the recognized text, this similarity is 0.
E) The similarity between the recognized text corresponding to the current speech segment and the recognized text corresponding to the next speech segment.
The computation is the same as the similarity computation above: after deleting the stop words in the recognized texts, the texts are vectorized and the similarity is computed.
It should be noted that when the recognized text corresponding to the current speech segment is the ending of all the recognized text, this similarity is 0.
It should be noted that before the second segmentation features are extracted, the recognized text corresponding to the speech data must also be corrected; the second segmentation features are then extracted from the semantics of the corrected recognized text.
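The similarity computation described above can be sketched with a simple bag-of-words stand-in for the word-vector combination; the stop-word list here is illustrative only:

```python
import math
from collections import Counter

STOP_WORDS = {"this", "and", "will", "is"}  # illustrative stop-word list

def text_vector(words):
    """Bag-of-words vector of a recognized text after stop-word deletion
    (a simple stand-in for combining word vectors per speech segment)."""
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two recognized-text vectors."""
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two texts sharing two of three content words give a cosine of 2/3; a text made only of stop words vectorizes to the empty vector, for which the similarity is defined as 0.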
Correction of the recognized text mainly includes adding punctuation to it, i.e. adding the corresponding punctuation marks to the recognized text, for example based on a conditional random field model. To make the added punctuation more accurate, different thresholds can be set for adding punctuation between speech segments and within a speech segment: the threshold for adding punctuation between segments is set lower and the threshold within a segment is set higher, which increases the likelihood of adding punctuation between segments and reduces the likelihood of adding it within a segment. In the punctuated text, each piece of text separated by a punctuation mark (comma "，", question mark "？", exclamation mark "！" or full stop "。") is counted as one sentence.
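The sentence counting implied above — one sentence per punctuation-delimited piece of text — can be sketched as follows; the delimiter set (full- and half-width marks) and the function name are illustrative assumptions:

```python
import re

SENTENCE_DELIMS = "，？！。,?!."  # full- and half-width punctuation marks

def split_sentences(punctuated_text):
    """Split punctuated recognized text into sentences: each piece of
    text separated by a punctuation mark counts as one sentence."""
    parts = re.split("[" + re.escape(SENTENCE_DELIMS) + "]", punctuated_text)
    return [p for p in parts if p.strip()]
```

The length of the returned list directly gives the per-segment sentence count used by the second segmentation features.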
Secondly, the correction may further include any one or more of the following:
(1) Performing abnormal-word filtering on the recognized text corresponding to the speech data.
Text filtering mainly filters out erroneous abnormal words in the recognized text; specifically, the filtering can be based on word confidence and the results of syntactic analysis.
(2) Performing smoothing on the recognized text corresponding to the speech data.
Text smoothing mainly straightens out incoherent sentences. Of repeated words with no practical meaning, only one is kept; for example, for "very very good" only one "very" is kept. Modal particles with no practical meaning can be ignored and not transcribed; for example, the "uh" in "uh, this problem" needs to be smoothed away.
(3) Performing digit normalization on the recognized text corresponding to the speech data.
All the numbers in the recognized text obtained by speech recognition are represented with Chinese numerals, but some numbers better match the user's reading habits when represented with Arabic numerals; for example, "二十一点五元" should be expressed as "21.5元". Digit normalization converts such Chinese numerals into Arabic numerals, for example using a method based on the ABNF grammar.
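A minimal sketch of such a conversion for small numbers (below one hundred, with an optional decimal part) is given below; it is a toy illustration, not the ABNF-grammar method mentioned above, and handles none of the larger units (百, 千, 万):

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

def chinese_to_arabic(s):
    """Convert a simple Chinese numeral such as "二十一点五" to a
    string of Arabic digits ("21.5")."""
    if "点" in s:  # integer part 点 decimal digits
        intpart, frac = s.split("点", 1)
        return (chinese_to_arabic(intpart) + "."
                + "".join(str(DIGITS[c]) for c in frac))
    if "十" in s:  # tens: [digit] 十 [digit]
        tens, _, units = s.partition("十")
        value = (DIGITS[tens] if tens else 1) * 10 \
                + (DIGITS[units] if units else 0)
        return str(value)
    return "".join(str(DIGITS[c]) for c in s)
```

So "二十一点五" becomes "21.5", matching the example above.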
(4) Performing text replacement on the recognized text corresponding to the speech data, where the text replacement covers two cases:
One case is replacement between English upper and lower case, i.e. converting English lower-case letters in the recognized text corresponding to the speech data to upper case or vice versa; for example, "nba" is replaced with "NBA", "罗c" with "罗C", etc.
The other case is replacing sensitive words in the recognized text corresponding to the speech data with special symbols, achieving a masking effect. Specifically, a sensitive-word list can be built; the recognized text is then searched against this list, and any sensitive word found is replaced with special symbols. For example, violence-related words such as "robbery" are sensitive words, so every occurrence of "robbery" in the text is replaced with "****".
It should be noted that either or both of the above text replacement cases can be selected according to the practical application; this is not limited by the embodiments of the present invention.
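Both replacement cases can be sketched in a few lines; the word lists passed in are illustrative assumptions, as is the choice of '*' as the special symbol:

```python
def replace_text(text, sensitive_words, uppercase_terms):
    """Text replacement sketch: upper-casing listed terms and masking
    sensitive words with '*'. The word lists are illustrative."""
    for term in uppercase_terms:          # e.g. "nba" -> "NBA"
        text = text.replace(term, term.upper())
    for word in sensitive_words:          # e.g. "robbery" -> "*******"
        text = text.replace(word, "*" * len(word))
    return text
```

A real system would typically look terms up via a pre-built list rather than hard-coding them, as described for the sensitive-word list above.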
As shown in Fig. 2, the flow of building the segmentation model in an embodiment of the present invention includes the following steps:
Step 201: collect speech data.
Step 202: perform endpoint detection on the speech data to obtain the speech segments.
Step 203: perform speech recognition on each speech segment to obtain the recognized text corresponding to each speech segment.
Step 204: label the segmentation information of the recognized text corresponding to each speech segment, where the segmentation information indicates whether the end position of the recognized text corresponding to the current speech segment needs segmentation.
For example, a segmentation is labeled 1 and no segmentation is labeled 0. Other symbols can of course be used, which the embodiments of the present invention do not limit.
Step 205: extract the segmentation features of the recognized text corresponding to each speech segment.
Step 206: build the segmentation model with the segmentation features and the segmentation information as training data.
The segmentation model can be a model commonly used in pattern recognition, such as a Bayesian model or a support vector machine model. During training, the segmentation features of the recognized text serve as the model input and the labeled segmentation information serves as the model output; model training then yields the segmentation model. The specific training process of the segmentation model is the same as in the prior art and is not described in detail here. It should be noted that the classification model can be trained offline.
It should be noted that when training the segmentation model, the model can be trained solely on the acoustic segmentation features, solely on the semantic segmentation features, or on both the acoustic and the semantic segmentation features. Correspondingly, when extracting the segmentation features of the recognized text corresponding to each speech segment in step 205 above, only the acoustic features or only the semantic features may be extracted, or both may be extracted at the same time; this is not limited by the embodiments of the present invention.
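The train/predict interface of steps 201-206 can be illustrated with a deliberately tiny stand-in classifier; a real system would use e.g. the support vector machine mentioned above, and the class and method names here are assumptions made for illustration:

```python
class CentroidSegmenter:
    """Minimal stand-in for the segmentation model: one centroid per
    label (1 = segmentation, 0 = no segmentation), prediction by
    nearest centroid. Illustrates the interface only, not the actual
    Bayesian/SVM models named above."""

    def train(self, features, labels):
        # features: list of feature vectors; labels: list of 0/1
        # segmentation marks, as produced by steps 204-205.
        self.centroids = {}
        for lab in set(labels):
            rows = [f for f, l in zip(features, labels) if l == lab]
            self.centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(self, feat):
        def dist(c):  # squared Euclidean distance to a centroid
            return sum((a - b) ** 2 for a, b in zip(feat, c))
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab]))
```

Training consumes (segmentation features, segmentation information) pairs exactly as in step 206; prediction then answers, per speech segment, whether the end position of its recognized text needs segmentation.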
In addition, it should be noted that for a segmentation model trained on a given type of segmentation features, when the model is used to perform segmentation detection on the recognized text to be presented, the segmentation features of that same type must be extracted from the text to be presented; the extracted features are then input into the segmentation model to determine the positions in the recognized text to be presented that need segmentation.
With the speech recognition text segmentation method provided by the present invention, endpoint detection is performed on the speech data to obtain the speech segments, and speech recognition is performed on each speech segment to obtain the corresponding recognized text; then the segmentation features of the recognized text corresponding to each speech segment are extracted, segmentation detection is performed on the recognized text corresponding to the speech data using the extracted segmentation features and a pre-built segmentation model, and the recognized text is segmented according to the segmentation detection result. The article structure of the recognized text is thus adjusted automatically and made clearer, which helps the user quickly understand the content of the recognized text and improves the user's reading efficiency.
In practical applications, the segmentation features can be extracted from the acoustics of the speech data or from the semantics of the recognized text; of course, both kinds of features extracted from the two aspects can also be combined, with segmentation detection performed on the recognized text corresponding to the speech data using the corresponding segmentation model to determine the positions that need segmentation, which can further improve segmentation accuracy.
Further, the speech recognition text segmentation method provided by the present invention can also present all the segmented recognized text to the user, or extract the topic of each paragraph of recognized text, present the topics to the user first, and display the paragraph content when the user wants to view a paragraph of interest, thereby helping the user quickly find the content he or she is interested in.
Correspondingly, an embodiment of the present invention also provides a speech recognition text segmentation device, whose structure is schematically shown in Fig. 3.
In this embodiment, the device includes:
an endpoint detection module 301, configured to perform endpoint detection on the speech data to obtain the speech segments and the start frame number and end frame number of each speech segment;
a speech recognition module 302, configured to perform speech recognition on each speech segment to obtain the recognized text corresponding to each speech segment;
a feature extraction module 303, configured to extract the segmentation features of the recognized text corresponding to each speech segment;
a segmentation detection module 304, configured to perform segmentation detection on the recognized text corresponding to the speech data using the extracted segmentation features and a pre-built segmentation model, to determine the positions that need segmentation; specifically, taking one speech segment at a time, the segmentation features of the recognized text corresponding to each speech segment are input into the segmentation model for segmentation detection, determining whether the end position of the recognized text corresponding to each speech segment needs segmentation;
a segmentation module 305, configured to segment the recognized text corresponding to the speech data according to the segmentation detection result.
It should be noted that in practical applications the feature extraction module 303 can extract segmentation features from the acoustics of the speech data or from the semantics of the recognized text; of course, both kinds of features extracted from the two aspects can also be combined. Correspondingly, the feature extraction module 303 may include a first feature extraction module and/or a second feature extraction module, wherein:
the first feature extraction module is configured to extract the segmentation features of each speech segment from the acoustics of the speech data and use them as the first segmentation features of the recognized text corresponding to the speech segment;
the second feature extraction module is configured to extract segmentation features from the semantics of the recognized text and use them as the second segmentation features of the recognized text.
One embodiment of the first feature extraction module includes a duration calculation unit and a distance calculation unit; another embodiment may further include a speaker change point detection unit and a speaker determination unit. These units are described below.
The duration calculation unit is configured to calculate the difference between the end frame number and the start frame number of the current speech segment, and take the difference as the duration of the current speech segment.
The distance calculation unit is configured to calculate the difference between the start frame number of the current speech segment and the end frame number of the previous speech segment and take it as the distance between the current speech segment and the previous speech segment, and/or to calculate the difference between the start frame number of the next speech segment and the end frame number of the current speech segment and take it as the distance between the current speech segment and the next speech segment.
The speaker change point detection unit is configured to perform speaker change point detection on the speech data using speaker separation techniques.
The speaker determination unit is configured to determine, according to the speaker change point detection result, whether the speaker of the current speech segment is the same as the speaker of the previous speech segment, and/or whether the speaker of the current speech segment is the same as the speaker of the next speech segment.
One embodiment of the second feature extraction module includes:
a correction unit, configured to correct the recognized text corresponding to the speech data, the correction unit including a punctuation addition subunit configured to add punctuation to the recognized text corresponding to the speech data, for example based on a conditional random field model;
a feature extraction unit, configured to extract segmentation features from the semantics of the corrected recognized text.
The second segmentation features extracted by the feature extraction unit may include any one or more of the following: the forward unsegmented sentence number, the backward unsegmented sentence number, the number of sentences contained in the recognized text corresponding to the current speech segment, the similarity between the recognized text corresponding to the current speech segment and that corresponding to the previous speech segment, and the similarity between the recognized text corresponding to the current speech segment and that corresponding to the next speech segment.
In practical applications, the correction unit may also include any one or more of the following subunits:
a filtering subunit, configured to perform abnormal-word filtering on the recognized text corresponding to the speech data;
a smoothing subunit, configured to perform smoothing on the recognized text corresponding to the speech data;
a normalization subunit, configured to perform digit normalization on the recognized text corresponding to the speech data;
a text replacement subunit, configured to perform text replacement on the recognized text corresponding to the speech data, the text replacement including: converting English lower-case letters in the recognized text corresponding to the speech data to upper case or vice versa; and/or replacing sensitive words in the recognized text corresponding to the speech data with special symbols.
In embodiments of the present invention, the segmentation model can be built offline by a corresponding segmentation model building module, which may be independent of the speech recognition text segmentation device of the present invention or integrated with it; this is not limited by the embodiments of the present invention.
As shown in Fig. 4, a schematic structural diagram of the segmentation model building module in an embodiment of the present invention includes:
a data collection unit 401, configured to collect speech data;
an endpoint detection unit 402, configured to perform endpoint detection on the speech data collected by the data collection unit to obtain the speech segments;
a speech recognition unit 403, configured to perform speech recognition on each speech segment to obtain the recognized text corresponding to each speech segment;
a labeling unit 404, configured to label the segmentation information of the recognized text corresponding to each speech segment, the segmentation information indicating whether the end position of the recognized text corresponding to the current speech segment needs segmentation;
a feature extraction unit 405, configured to extract the segmentation features of the recognized text corresponding to each speech segment;
a training unit 406, configured to build the segmentation model with the segmentation features and the segmentation information as training data.
It should be noted that when training the segmentation model, the model can be trained solely on the acoustic segmentation features (i.e. the first segmentation features above), solely on the semantic segmentation features (i.e. the second segmentation features above), or on both the acoustic and the semantic segmentation features. Correspondingly, when the feature extraction unit 405 extracts the segmentation features of the recognized text corresponding to each speech segment, it may extract only the acoustic features or only the semantic features, or both at the same time; this is not limited by the embodiments of the present invention.
In addition, the output of the segmentation model can be either whether the end position of the recognized text corresponding to the current speech segment needs segmentation, or the probability that it needs segmentation. Of course, the different output parameter types do not affect the training process of the segmentation model; only different input/output parameters need to be set during model training.
The present invention provides a speech recognition text segmentation device: endpoint detection is performed on the speech data to obtain the speech segments, and speech recognition is performed on each speech segment to obtain the corresponding recognized text; then the segmentation features of the recognized text corresponding to each speech segment are extracted, segmentation detection is performed on the recognized text corresponding to the speech data using the extracted segmentation features and a pre-built segmentation model, and the recognized text is segmented according to the segmentation detection result. The article structure of the recognized text is thus adjusted automatically and made clearer, which helps the user quickly understand the content of the recognized text and improves the user's reading efficiency.
Further, the segmentation features can be extracted from the acoustics of the speech data or from the semantics of the recognized text; of course, both kinds of features extracted from the two aspects can also be combined, with segmentation detection performed on the recognized text corresponding to the speech data using the corresponding segmentation model to determine the positions that need segmentation, which can further improve segmentation accuracy.
Fig. 5 is another schematic structural diagram of the speech recognition text segmentation device of an embodiment of the present invention.
Unlike in Fig. 3, in this embodiment the device also includes:
a first display module 501, configured to present the segmented recognized text to the user.
Fig. 6 is another schematic structural diagram of the speech recognition text segmentation device of an embodiment of the present invention.
Unlike in Fig. 3, in this embodiment the device also includes:
a topic extraction module 601, configured to extract the topic of the recognized text of each paragraph after segmentation;
a second display module 602, configured to present each topic to the user;
a sensing module 603, configured to sense the topic the user is interested in and, upon sensing it, trigger the second display module 602 to present the recognized text of the paragraph corresponding to that topic to the user.
The speech recognition text segmentation device provided by the present invention can thus present the segmented recognized text to the user in several ways, not only displaying recognized text with a clear article structure to the user but also helping the user quickly find the content he or she is interested in, further improving reading efficiency.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments can be referred to each other, and each embodiment focuses on its differences from the others. The device embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The device embodiments described above are merely schematic: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The embodiments of the present invention are described in detail above; specific examples are used herein to set forth the present invention, and the description of the above embodiments is only intended to help understand the method and device of the present invention. Meanwhile, for those of ordinary skill in the art, the specific implementation and the application scope may change according to the idea of the present invention. In summary, the content of this specification should not be understood as a limitation of the present invention.

Claims (20)

1. A speech recognition text segmentation method, characterized by including:
performing endpoint detection on speech data to obtain speech segments and the start frame number and end frame number of each speech segment;
performing speech recognition on each speech segment to obtain the recognized text corresponding to each speech segment;
extracting the segmentation features of the recognized text corresponding to each speech segment;
performing segmentation detection on the recognized text corresponding to the speech data using the extracted segmentation features and a pre-built segmentation model, to determine the positions that need segmentation;
segmenting the recognized text corresponding to the speech data according to the segmentation detection result.
2. The method according to claim 1, characterized in that the method also includes building the segmentation model in the following manner:
collecting speech data;
performing endpoint detection on the collected speech data to obtain speech segments;
performing speech recognition on each speech segment to obtain the recognized text corresponding to each speech segment;
labeling the segmentation information of the recognized text corresponding to each speech segment, the segmentation information indicating whether the end position of the recognized text corresponding to the current speech segment needs segmentation;
extracting the segmentation features of the recognized text corresponding to each speech segment;
building the segmentation model with the segmentation features and the segmentation information as training data.
3. The method according to claim 1, characterized in that extracting the segmentation features of the recognized text corresponding to each speech segment includes:
extracting the segmentation features of each speech segment from the acoustics of the speech data and using them as the first segmentation features of the recognized text corresponding to the speech segment; and/or
extracting segmentation features from the semantics of the recognized text and using them as the second segmentation features of the recognized text.
4. method according to claim 3, it is characterised in that first segmentation feature includes: The duration of current speech segment, in addition to:The distance between current speech segment and previous voice segments, and/or The distance between current speech segment and latter voice segments;
It is described to include from the segmentation feature for acoustically extracting each voice segments of the speech data:
The difference for terminating frame number and the beginning frame number of current speech segment of current speech segment is calculated, and Using the difference as current speech segment duration;
Also include:
The difference for starting frame number and the end frame number of previous voice segments of current speech segment is calculated, and It regard the difference as the distance between current speech segment and previous voice segments;And/or
The difference for starting frame number and the end frame number of current speech segment of latter voice segments is calculated, and It regard the difference as the distance between current speech segment and latter voice segments.
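The frame arithmetic of claim 4 reduces to subtractions over (start frame, end frame) pairs. A minimal sketch, assuming each voice segment from endpoint detection is such a pair (the dictionary keys are illustrative):

```python
def acoustic_distance_features(segments, i):
    """Duration of segment i and its frame gaps to neighbouring segments."""
    start, end = segments[i]
    feats = {"duration": end - start}          # end frame - start frame
    if i > 0:                                  # distance to previous segment
        feats["gap_to_previous"] = start - segments[i - 1][1]
    if i + 1 < len(segments):                  # distance to next segment
        feats["gap_to_next"] = segments[i + 1][0] - end
    return feats
```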
5. The method according to claim 4, wherein the first segmentation feature further comprises: whether the speaker of the current voice segment is the same as the speaker of the previous voice segment, and/or whether the speaker of the current voice segment is the same as the speaker of the next voice segment;
wherein extracting the segmentation feature of each voice segment acoustically from the speech data further comprises:
performing speaker change point detection on the speech data by using a speaker separation technique;
determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment, and/or determining, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
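Claim 5 leaves the speaker separation technique open. The sketch below assumes each segment already carries a speaker embedding and uses a cosine-similarity threshold as a stand-in for change-point detection; the embedding representation and the threshold value are assumptions, not part of the claim:

```python
def same_speaker(emb_a, emb_b, threshold=0.5):
    """Crude stand-in for change-point detection: cosine similarity test."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    na = sum(x * x for x in emb_a) ** 0.5
    nb = sum(x * x for x in emb_b) ** 0.5
    return (dot / (na * nb)) >= threshold if na and nb else False

def speaker_features(embeddings, i):
    """Whether segment i shares its speaker with its neighbours."""
    feats = {}
    if i > 0:
        feats["same_as_previous"] = same_speaker(embeddings[i - 1], embeddings[i])
    if i + 1 < len(embeddings):
        feats["same_as_next"] = same_speaker(embeddings[i], embeddings[i + 1])
    return feats
```

A production system would derive these embeddings from an i-vector or x-vector speaker separation front end; only the same/different decisions feed the first segmentation feature.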
6. The method according to claim 3, wherein the second segmentation feature comprises any one or more of the following:
a forward unsegmented sentence number, being the total number of sentences contained in all identification texts between the previous segmentation marker and the start position of the identification text corresponding to the current voice segment;
a backward unsegmented sentence number, being the total number of sentences contained in all identification texts after the identification text corresponding to the current voice segment;
the number of sentences contained in the identification text corresponding to the current voice segment;
the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the previous voice segment;
the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the next voice segment.
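The semantic features of claim 6 can be illustrated on punctuated identification texts. Splitting sentences on terminal punctuation and using a bag-of-words cosine similarity are illustrative choices only; `break_flags` is a hypothetical list marking where previous segmentation markers fall:

```python
import re

def sentence_count(text):
    """Number of sentences, splitting on terminal punctuation."""
    return len([s for s in re.split(r"[.!?]", text) if s.strip()])

def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two identification texts."""
    wa, wb = a.lower().split(), b.lower().split()
    vocab = set(wa) | set(wb)
    va = [wa.count(t) for t in vocab]
    vb = [wb.count(t) for t in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    na = sum(x * x for x in va) ** 0.5
    nb = sum(x * x for x in vb) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def forward_unsegmented_sentences(texts, i, break_flags):
    """Sentences between the previous segmentation marker and segment i."""
    total, j = 0, i - 1
    while j >= 0 and not break_flags[j]:
        total += sentence_count(texts[j])
        j -= 1
    return total
```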
7. The method according to claim 3, wherein extracting the segmentation feature semantically from the identification text comprises:
modifying the identification text corresponding to the speech data, the modification comprising: adding punctuation to the identification text corresponding to the speech data;
extracting the segmentation feature semantically from the modified identification text.
8. The method according to claim 7, wherein the modification further comprises any one or more of the following:
performing abnormal word filtering on the identification text corresponding to the speech data;
performing smoothing on the identification text corresponding to the speech data;
performing digit normalization on the identification text corresponding to the speech data;
performing text replacement on the identification text corresponding to the speech data, the text replacement comprising: converting English lowercase letters in the identification text corresponding to the speech data to uppercase or vice versa; and/or replacing sensitive words in the identification text corresponding to the speech data with special characters.
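Of the modifications in claims 7 and 8, the text-replacement step is the easiest to sketch. The sensitive-word list, mask string, and the choice of uppercasing are assumptions for illustration only; punctuation addition and smoothing would normally run before this step:

```python
import re

def correct_text(text, sensitive_words=(), mask="***", uppercase=True):
    """Text replacement per claim 8: mask sensitive words, convert case."""
    for w in sensitive_words:                  # sensitive word -> mask string
        text = text.replace(w, mask)
    convert = str.upper if uppercase else str.lower
    # Convert only English letters, leaving other characters untouched.
    return re.sub(r"[A-Za-z]+", lambda m: convert(m.group()), text)
```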
9. The method according to any one of claims 1 to 8, wherein performing segmentation detection on the identification text corresponding to the speech data by using the extracted segmentation feature and a pre-built segmented model, to determine positions that need segmentation, comprises:
inputting, in units of voice segments, the segmentation feature of the identification text corresponding to each voice segment into the segmented model in turn for segmentation detection, to determine whether the end position of the identification text corresponding to each voice segment needs segmentation.
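The per-segment detection loop of claim 9 can be sketched as follows. Here `model` is any callable returning True when the end position of a segment's identification text needs segmentation; joining texts with spaces is an illustrative choice:

```python
def segment_texts(texts, feature_vectors, model):
    """Group per-segment identification texts into paragraphs."""
    paragraphs, current = [], []
    for text, feats in zip(texts, feature_vectors):
        current.append(text)
        if model(feats):            # end position needs segmentation
            paragraphs.append(" ".join(current))
            current = []
    if current:                     # flush any trailing unsegmented texts
        paragraphs.append(" ".join(current))
    return paragraphs
```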
10. The method according to any one of claims 1 to 8, wherein the method further comprises:
displaying the segmented identification text to a user; or
extracting a theme of the identification text of each paragraph after segmentation, and presenting each theme to the user;
when it is perceived that the user is interested in a theme, displaying the identification text of the paragraph corresponding to the theme to the user.
11. A speech recognition text segmentation device, comprising:
an endpoint detection module, configured to perform endpoint detection on speech data to obtain each voice segment and the start frame number and end frame number of each voice segment;
a speech recognition module, configured to perform speech recognition on each voice segment to obtain an identification text corresponding to each voice segment;
a feature extraction module, configured to extract a segmentation feature of the identification text corresponding to each voice segment;
a segmentation detection module, configured to perform segmentation detection on the identification text corresponding to the speech data by using the extracted segmentation feature and a pre-built segmented model, to determine positions that need segmentation;
a segmentation module, configured to segment the identification text corresponding to the speech data according to the segmentation detection result.
12. The device according to claim 11, wherein the device further comprises a segmented model building module configured to build the segmented model, the segmented model building module comprising:
a data collection module, configured to collect speech data;
an endpoint detection unit, configured to perform endpoint detection on the speech data collected by the data collection module to obtain each voice segment;
a speech recognition unit, configured to perform speech recognition on each voice segment to obtain the identification text corresponding to each voice segment;
a labeling unit, configured to label segment information of the identification text corresponding to each voice segment, the segment information indicating whether the end position of the identification text corresponding to the current voice segment needs segmentation;
a feature extraction unit, configured to extract the segmentation feature of the identification text corresponding to each voice segment;
a training unit, configured to build the segmented model by using the segmentation feature and the segment information as training data.
13. The device according to claim 11, wherein the feature extraction module comprises:
a first feature extraction module, configured to extract a segmentation feature of each voice segment acoustically from the speech data, and use the segmentation feature as a first segmentation feature of the identification text corresponding to the voice segment; and/or
a second feature extraction module, configured to extract a segmentation feature semantically from the identification text, and use the segmentation feature as a second segmentation feature of the identification text.
14. The device according to claim 13, wherein the first feature extraction module comprises:
a duration calculation unit, configured to calculate the difference between the end frame number and the start frame number of the current voice segment, and use the difference as the duration of the current voice segment;
a distance calculation unit, configured to calculate the difference between the start frame number of the current voice segment and the end frame number of the previous voice segment, and use the difference as the distance between the current voice segment and the previous voice segment; and/or calculate the difference between the start frame number of the next voice segment and the end frame number of the current voice segment, and use the difference as the distance between the current voice segment and the next voice segment.
15. The device according to claim 14, wherein the first feature extraction module further comprises:
a speaker change point detection unit, configured to perform speaker change point detection on the speech data by using a speaker separation technique;
a speaker determining unit, configured to determine, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the previous voice segment, and/or determine, according to the speaker change point detection result, whether the speaker of the current voice segment is the same as the speaker of the next voice segment.
16. The device according to claim 13, wherein the second segmentation feature comprises any one or more of the following:
a forward unsegmented sentence number, being the total number of sentences contained in all identification texts between the previous segmentation marker and the start position of the identification text corresponding to the current voice segment;
a backward unsegmented sentence number, being the total number of sentences contained in all identification texts after the identification text corresponding to the current voice segment;
the number of sentences contained in the identification text corresponding to the current voice segment;
the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the previous voice segment;
the similarity between the identification text corresponding to the current voice segment and the identification text corresponding to the next voice segment.
17. The device according to claim 13, wherein the second feature extraction module comprises:
a modification unit, configured to modify the identification text corresponding to the speech data, the modification unit comprising: a punctuation adding subunit, configured to add punctuation to the identification text corresponding to the speech data;
a feature extraction unit, configured to extract the segmentation feature semantically from the modified identification text.
18. The device according to claim 17, wherein the modification unit further comprises any one or more of the following subunits:
a filtering subunit, configured to perform abnormal word filtering on the identification text corresponding to the speech data;
a smoothing subunit, configured to perform smoothing on the identification text corresponding to the speech data;
a normalization subunit, configured to perform digit normalization on the identification text corresponding to the speech data;
a text replacement subunit, configured to perform text replacement on the identification text corresponding to the speech data, the text replacement comprising: converting English lowercase letters in the identification text corresponding to the speech data to uppercase or vice versa; and/or replacing sensitive words in the identification text corresponding to the speech data with special characters.
19. The device according to any one of claims 11 to 18, wherein
the segmentation detection module is specifically configured to input, in units of voice segments, the segmentation feature of the identification text corresponding to each voice segment into the segmented model in turn for segmentation detection, to determine whether the end position of the identification text corresponding to each voice segment needs segmentation.
20. The device according to any one of claims 11 to 18, wherein the device further comprises:
a first display module, configured to display the segmented identification text to a user; or
a theme extraction module, configured to extract a theme of the identification text of each paragraph after segmentation;
a second display module, configured to present each theme to the user;
a sensing module, configured to perceive a theme the user is interested in, and, when perceiving the theme the user is interested in, trigger the second display module to display the identification text of the paragraph corresponding to the theme to the user.
CN201610256898.8A 2016-04-20 2016-04-20 Method and device for segmenting speech recognition text Active CN107305541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610256898.8A CN107305541B (en) 2016-04-20 2016-04-20 Method and device for segmenting speech recognition text

Publications (2)

Publication Number Publication Date
CN107305541A true CN107305541A (en) 2017-10-31
CN107305541B CN107305541B (en) 2021-05-04

Family

ID=60150228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610256898.8A Active CN107305541B (en) 2016-04-20 2016-04-20 Method and device for segmenting speech recognition text

Country Status (1)

Country Link
CN (1) CN107305541B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1296587A (en) * 1998-02-02 2001-05-23 Randall C. Walker Text processor
US20040024585A1 (en) * 2002-07-03 2004-02-05 Amit Srivastava Linguistic segmentation of speech
US20160026618A1 (en) * 2002-12-24 2016-01-28 At&T Intellectual Property Ii, L.P. System and method of extracting clauses for spoken language understanding
CN1771494A (en) * 2003-05-28 2006-05-10 Loquendo S.p.A. Automatic segmentation of texts comprising chunks without separators
CN1894686A (en) * 2003-11-21 2007-01-10 Koninklijke Philips Electronics N.V. Text segmentation and topic annotation for document structuring
US20150302849A1 (en) * 2005-07-13 2015-10-22 Intellisist, Inc. System And Method For Identifying Special Information
US20100169318A1 (en) * 2008-12-30 2010-07-01 Microsoft Corporation Contextual representations from data streams
CN103150294A (en) * 2011-12-06 2013-06-12 Shengle Information Technology (Shanghai) Co., Ltd. Method and system for correction based on speech recognition results
CN103164399A (en) * 2013-02-26 2013-06-19 Beijing Jietong Huasheng Speech Technology Co., Ltd. Punctuation addition method and device in speech recognition
CN103345922A (en) * 2013-07-05 2013-10-09 Zhang Wei Fully automatic segmentation method for long speech
CN103488723A (en) * 2013-09-13 2014-01-01 Fudan University Automatic navigation method and system for semantic ranges of interest in electronic reading
CN105244029A (en) * 2015-08-28 2016-01-13 iFlytek Co., Ltd. Voice recognition post-processing method and system
CN105427858A (en) * 2015-11-06 2016-03-23 iFlytek Co., Ltd. Method and system for achieving automatic voice classification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DOUG BEEFERMAN: "Statistical Models for Text Segmentation", Machine Learning *
ELIZABETH SHRIBERG: "Prosody-based automatic segmentation of speech into sentences and topics", Speech Communication *
REN Xinshe et al.: "Research on a speech segmentation algorithm based on improved feature values", Journal of Nanjing Normal University (Engineering and Technology Edition) *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN108363765A (en) * 2018-02-06 2018-08-03 深圳市鹰硕技术有限公司 The recognition methods of audio paragraph and device
CN108363765B (en) * 2018-02-06 2020-12-08 深圳市鹰硕技术有限公司 Audio paragraph identification method and device
CN108446389A (en) * 2018-03-22 2018-08-24 平安科技(深圳)有限公司 Speech message searching and displaying method, device, computer equipment and storage medium
CN108446389B (en) * 2018-03-22 2021-12-24 平安科技(深圳)有限公司 Voice message search display method and device, computer equipment and storage medium
CN108364650A (en) * 2018-04-18 2018-08-03 北京声智科技有限公司 The adjusting apparatus and method of voice recognition result
CN108364650B (en) * 2018-04-18 2024-01-19 北京声智科技有限公司 Device and method for adjusting voice recognition result
CN110503943A (en) * 2018-05-17 2019-11-26 蔚来汽车有限公司 A voice interaction method and voice interaction system
CN108830639A (en) * 2018-05-17 2018-11-16 科大讯飞股份有限公司 Content data processing method and device, computer readable storage medium
CN110503943B (en) * 2018-05-17 2023-09-19 蔚来(安徽)控股有限公司 Voice interaction method and voice interaction system
CN108830639B (en) * 2018-05-17 2022-04-26 科大讯飞股份有限公司 Content data processing method and device, and computer readable storage medium
CN109344411A (en) * 2018-09-19 2019-02-15 深圳市合言信息科技有限公司 A kind of interpretation method for listening to formula simultaneous interpretation automatically
CN109361823A (en) * 2018-11-01 2019-02-19 深圳市号互联科技有限公司 A kind of intelligent interaction mode that voice is mutually converted with text
CN109743589A (en) * 2018-12-26 2019-05-10 百度在线网络技术(北京)有限公司 Article generation method and device
CN109743589B (en) * 2018-12-26 2021-12-14 百度在线网络技术(北京)有限公司 Article generation method and device
US11580463B2 (en) 2019-05-06 2023-02-14 Hithink Royalflush Information Network Co., Ltd. Systems and methods for report generation
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110379413A (en) * 2019-06-28 2019-10-25 联想(北京)有限公司 A kind of method of speech processing, device, equipment and storage medium
CN110379413B (en) * 2019-06-28 2022-04-19 联想(北京)有限公司 Voice processing method, device, equipment and storage medium
CN110399489B (en) * 2019-07-08 2022-06-17 厦门市美亚柏科信息股份有限公司 Chat data segmentation method, device and storage medium
CN110399489A (en) * 2019-07-08 2019-11-01 厦门市美亚柏科信息股份有限公司 A kind of chat data segmentation method, device and storage medium
CN110502631B (en) * 2019-07-17 2022-11-04 招联消费金融有限公司 Input information response method and device, computer equipment and storage medium
CN110502631A (en) * 2019-07-17 2019-11-26 招联消费金融有限公司 A kind of input information response method, apparatus, computer equipment and storage medium
CN110588524A (en) * 2019-08-02 2019-12-20 精电有限公司 A method for displaying information and a vehicle-mounted auxiliary display system
CN110619897A (en) * 2019-08-02 2019-12-27 精电有限公司 Conference summary generation method and vehicle-mounted recording system
CN110827825A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Punctuation prediction method, system, terminal and storage medium for speech recognition text
CN111079384A (en) * 2019-11-18 2020-04-28 佰聆数据股份有限公司 Identification method and system for intelligent quality inspection service forbidden words
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113041623A (en) * 2019-12-26 2021-06-29 波克科技股份有限公司 Game parameter configuration method and device and computer readable storage medium
CN113041623B (en) * 2019-12-26 2023-04-07 波克科技股份有限公司 Game parameter configuration method and device and computer readable storage medium
CN111862980A (en) * 2020-08-07 2020-10-30 斑马网络技术有限公司 Incremental semantic processing method
CN112036128A (en) * 2020-08-21 2020-12-04 百度在线网络技术(北京)有限公司 A text content processing method, apparatus, device and storage medium
CN111931482B (en) * 2020-09-22 2021-09-24 思必驰科技股份有限公司 Text segmentation method and device
CN111931482A (en) * 2020-09-22 2020-11-13 苏州思必驰信息科技有限公司 Text segmentation method and device
CN112185424A (en) * 2020-09-29 2021-01-05 国家计算机网络与信息安全管理中心 Voice file cutting and restoring method, device, equipment and storage medium
CN112214965A (en) * 2020-10-21 2021-01-12 科大讯飞股份有限公司 Case regularization method, apparatus, electronic device and storage medium
CN112712794A (en) * 2020-12-25 2021-04-27 苏州思必驰信息科技有限公司 Speech recognition marking training combined system and device
CN112733660B (en) * 2020-12-31 2022-05-27 蚂蚁胜信(上海)信息技术有限公司 Method and device for splitting video strip
CN112818077A (en) * 2020-12-31 2021-05-18 科大讯飞股份有限公司 Text processing method, device, equipment and storage medium
CN112818077B (en) * 2020-12-31 2023-05-30 科大讯飞股份有限公司 Text processing method, device, equipment and storage medium
CN112733660A (en) * 2020-12-31 2021-04-30 支付宝(杭州)信息技术有限公司 Method and device for splitting video strip
CN112699687A (en) * 2021-01-07 2021-04-23 北京声智科技有限公司 Content cataloging method and device and electronic equipment
CN113076720A (en) * 2021-04-29 2021-07-06 新声科技(深圳)有限公司 Long text segmentation method and device, storage medium and electronic device
CN115394295A (en) * 2021-05-25 2022-11-25 阿里巴巴新加坡控股有限公司 Segmentation processing method, device, equipment and storage medium
CN114254587A (en) * 2021-12-15 2022-03-29 科大讯飞股份有限公司 Topic paragraph dividing method and device, electronic equipment and storage medium
CN114841171A (en) * 2022-04-29 2022-08-02 北京思源智通科技有限责任公司 Text segmentation subject extraction method, system, readable medium and device
CN117113974A (en) * 2023-04-26 2023-11-24 荣耀终端有限公司 Text segmentation method, device, chip, electronic equipment and medium
CN117113974B (en) * 2023-04-26 2024-05-24 荣耀终端有限公司 Text segmentation method, device, chip, electronic equipment and medium

Also Published As

Publication number Publication date
CN107305541B (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN107305541A (en) Speech recognition text segmentation method and device
US10950242B2 (en) System and method of diarization and labeling of audio data
US11037553B2 (en) Learning-type interactive device
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN109686383B (en) Voice analysis method, device and storage medium
CN108536654B (en) Method and device for displaying identification text
US10432789B2 (en) Classification of transcripts by sentiment
CN111341305B (en) Audio data labeling method, device and system
WO2018108080A1 (en) Voiceprint search-based information recommendation method and device
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN104598644B (en) Favorite tag mining method and device
CN105427858A (en) Method and system for achieving automatic voice classification
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
CN105654943A (en) Voice wakeup method, apparatus and system thereof
CN107077843A (en) Session control and dialog control method
JP2006190006A5 (en)
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
WO2014187096A1 (en) Method and system for adding punctuation to voice files
JP5824829B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN114120425A (en) Emotion recognition method and device, electronic equipment and storage medium
US9805740B2 (en) Language analysis based on word-selection, and language analysis apparatus
JP2019124952A (en) Information processing device, information processing method, and program
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant