
CN102592592A - Voice data extraction method and device - Google Patents

Voice data extraction method and device

Info

Publication number
CN102592592A
CN102592592A (application numbers CN2011104543338A, CN201110454333A)
Authority
CN
China
Prior art keywords
difference
speech data
signal
user
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104543338A
Other languages
Chinese (zh)
Inventor
程辉
王力劭
邵颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN VCYBER TECHNOLOGY Co Ltd
Original Assignee
SHENZHEN VCYBER TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN VCYBER TECHNOLOGY Co Ltd filed Critical SHENZHEN VCYBER TECHNOLOGY Co Ltd
Priority to CN2011104543338A priority Critical patent/CN102592592A/en
Publication of CN102592592A publication Critical patent/CN102592592A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice data extraction method and device, relates to the field of speech recognition, and aims to solve the problem of inaccurate speech recognition in the prior art. The technical scheme disclosed by the embodiments of the invention comprises the following steps: acquiring the average noise value of the environment in which a voice device is located; after a user starts the voice device, segmenting the signal input by the user according to a preset time to obtain at least one signal segment; and acquiring voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value. The technical scheme provided by the embodiments of the invention can be used in a speech recognition system.

Description

Voice data extraction method and device
Technical field
The present invention relates to the field of speech recognition, and in particular to a voice data extraction method and device.
Background
With the development of intelligent technology, people are no longer satisfied with interacting with devices through means such as a mouse or buttons; they hope to interact with devices by voice and thereby realize voice control of devices. Speech recognition, one of the core technologies of voice interaction, is becoming increasingly mature and has gradually been applied in fields such as information processing, education, business applications and consumer electronics.
Voice data extraction is an important input element of speech recognition. After a user starts a voice device, the prior-art voice data extraction process comprises: searching for energy in the signal input by the user; and obtaining voice data from the signal input by the user according to the position of that energy.
However, the energy in the signal input by the user may come from the sound made by the user, or from noise in the environment such as industrial production or traffic. If noise is present during voice data extraction, the noise may be extracted as voice data, making speech recognition inaccurate.
Summary of the invention
Embodiments of the invention provide a voice data extraction method and device that can improve the accuracy of speech recognition.
In one aspect, a voice data extraction method is provided, comprising: acquiring the average noise value of the environment in which a voice device is located; after a user starts the voice device, segmenting the signal input by the user according to a preset time to obtain at least one signal segment; and acquiring voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
In another aspect, a voice data extraction device is provided, comprising:
a noise value acquiring unit, configured to acquire the average noise value of the environment in which the voice device is located;
a segmenting unit, configured to segment the signal input by the user according to a preset time after the user starts the voice device, to obtain at least one signal segment; and
a data extracting unit, configured to acquire voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
With the voice data extraction method and device provided by the embodiments of the invention, the average noise value of the environment in which the voice device is located and at least one signal segment are obtained, and voice data is acquired from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value, thereby realizing voice data extraction. Because this relation is taken into account during extraction, the technical scheme reduces the influence of noise on voice data extraction and thus improves the accuracy of speech recognition, solving the prior-art problem that noise present during extraction is taken as voice data and makes speech recognition inaccurate.
Description of drawings
To illustrate the technical schemes of the embodiments of the invention or of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the voice data extraction method provided by embodiment one of the invention;
Fig. 2 is a flowchart of the voice data extraction method provided by embodiment two of the invention;
Fig. 3 is a flowchart of the voice data extraction method provided by embodiment three of the invention;
Fig. 4 is a first structural diagram of the voice data extraction device provided by embodiment four of the invention;
Fig. 5 is a second structural diagram of the voice data extraction device provided by embodiment four of the invention;
Fig. 6 is a structural diagram of the data extracting unit in the voice data extraction device shown in Fig. 4;
Fig. 7 is a first structural diagram of the extraction sub-unit in the voice data extraction device shown in Fig. 6;
Fig. 8 is a second structural diagram of the extraction sub-unit in the voice data extraction device shown in Fig. 6;
Fig. 9 is a third structural diagram of the extraction sub-unit in the voice data extraction device shown in Fig. 6.
Detailed description of the embodiments
The technical schemes in the embodiments of the invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the protection scope of the invention.
The embodiments of the invention provide a voice data extraction method and device that can solve the prior-art problem of inaccurate speech recognition.
As shown in Fig. 1, the voice data extraction method provided by embodiment one of the invention comprises:
Step 101: acquire the average noise value of the environment in which the voice device is located.
In this embodiment, before the voice device is started, step 101 may detect the noise value of the environment at each time point by means of decibel detection, and obtain the average noise value from the noise values at these time points; step 101 may also obtain the average noise value in other ways, which are not described one by one here. The voice device may be a device with a voice recording and transmission function, a device with a speech recognition function, or another device; this is not limited here.
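The decibel-based averaging described for step 101 might be sketched as follows; the frame size, the RMS-to-dB conversion, and all function and parameter names are illustrative assumptions, not details fixed by the patent:

```python
import math

def average_noise_db(samples, frame_size=160):
    """Estimate the ambient average noise value before the voice device
    starts: measure the level (in dB) of each frame -- one "time point" --
    and average the per-frame levels."""
    levels = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        levels.append(20 * math.log10(max(rms, 1e-9)))  # clamp to avoid log(0)
    return sum(levels) / len(levels)
```

Any other level measure (peak amplitude, weighted dB, and so on) would serve equally well, since the embodiment explicitly leaves the measurement method open.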
Step 102: after the user starts the voice device, segment the signal input by the user according to a preset time to obtain at least one signal segment.
In this embodiment, the preset time can be set arbitrarily; preferably, to avoid missing voice data, it can be set to a short value, such as 0.1 second.
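Segmentation by a preset time can be sketched as below; the sample rate and names are our assumptions, while the 0.1-second default follows the value suggested above:

```python
def segment_signal(samples, sample_rate=16000, preset_time=0.1):
    """Split the user's input signal into segments of preset_time seconds.
    A short preset time (0.1 s) keeps the risk of missing voice data low;
    a trailing partial segment is kept rather than dropped."""
    seg_len = int(sample_rate * preset_time)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```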
Step 103: acquire voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
In this embodiment, after the at least one signal segment is obtained in step 102, the audio value of each frame of signal in each signal segment may be obtained, and the average audio value of the segment obtained from these per-frame audio values; the average audio value may also be obtained in other ways, which is not limited here.
In this embodiment, the process in step 103 may comprise: first, subtracting the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence; then, acquiring voice data from the signal input by the user according to the relation between at least one difference in the first difference sequence and a preset strength threshold. Step 103 may also acquire the voice data in other ways, which are not described one by one here.
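The subtraction step might look like this; using the mean absolute amplitude as the per-frame audio value is one plausible choice (the embodiment leaves it open), and all names are ours:

```python
def first_difference_sequence(segments, avg_noise, frame_size=160):
    """For each signal segment, average the per-frame audio values and
    subtract the ambient average noise value, producing the first
    difference sequence that is later compared with the preset
    strength threshold."""
    diffs = []
    for seg in segments:
        frames = [seg[i:i + frame_size] for i in range(0, len(seg), frame_size)]
        frame_vals = [sum(abs(s) for s in f) / len(f) for f in frames]
        diffs.append(sum(frame_vals) / len(frame_vals) - avg_noise)
    return diffs
```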
In this embodiment, acquiring voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value may be: determining the starting point of the voice data according to that relation and acquiring the voice data according to the starting point; or determining the end point of the voice data according to that relation and acquiring the voice data according to the end point; or determining both the starting point and the end point and acquiring the voice data according to both. When determining the starting point, it may be taken directly as the initial time of the voice data, or the interference of transient noise may be filtered out first; this is not described one by one here.
With the voice data extraction method provided by this embodiment of the invention, the average noise value of the environment in which the voice device is located and at least one signal segment are obtained, and voice data is acquired from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value, thereby realizing voice data extraction. Because this relation is taken into account during extraction, the technical scheme reduces the influence of noise on voice data extraction and thus improves the accuracy of speech recognition, solving the prior-art problem that noise present during extraction is taken as voice data and makes speech recognition inaccurate.
As shown in Fig. 2, the voice data extraction method provided by embodiment two of the invention is similar to the method shown in Fig. 1; the difference is that, before the signal input by the user is segmented according to the preset time, the method further comprises:
Step 100: after the user starts the voice device, obtain and store the signal input by the user.
In this embodiment, after the user starts the voice device, the recording function of the voice device may be started automatically, so that the signal input by the user is obtained and stored; step 100 may also obtain and store the signal in other ways, which are not described one by one here.
As in embodiment one, because the relation between the average audio value of each signal segment and the average noise value is taken into account during extraction, the method provided by this embodiment reduces the influence of noise on voice data extraction and thus improves the accuracy of speech recognition.
As shown in Fig. 3, the voice data extraction method provided by embodiment three of the invention comprises:
Steps 301 to 302: obtain the average noise value, and segment the signal input by the user to obtain at least one signal segment; the detailed process is similar to steps 101 to 102 shown in Fig. 1 and is not repeated here.
Step 303: subtract the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence.
In this embodiment, after the at least one signal segment is obtained in step 302, the audio value of each frame of signal in each signal segment may be obtained, and the average audio value obtained from these per-frame audio values; the average audio value may also be obtained in other ways, which is not limited here.
Step 304: acquire voice data from the signal input by the user according to the relation between at least one difference in the first difference sequence and a preset strength threshold.
In this embodiment, step 304 may determine the starting point of the voice data according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value and acquire the voice data according to that starting point; or determine the end point and acquire the voice data according to the end point; or determine both the starting point and the end point and acquire the voice data according to both. When determining the starting point, it may be taken directly as the initial time of the voice data, or the interference of transient noise may be filtered out first.
In this embodiment, the process in step 304 of acquiring voice data from the signal input by the user according to the relation between at least one difference in the first difference sequence and the preset strength threshold may comprise: obtaining from the first difference sequence a first difference greater than the preset strength threshold, the first difference being the first value in the sequence that exceeds the threshold; and acquiring voice data from the signal input by the user according to the first difference, for example by taking the time point corresponding to the first difference as the starting point. The voice data may also be acquired according to the first difference in other ways, which is not limited here.
In this embodiment, step 304 may also comprise: obtaining from the first difference sequence a second difference less than the preset strength threshold, the second difference being the first value after the first difference that falls below the threshold; and acquiring voice data from the signal input by the user according to the second difference, for example by taking the time point corresponding to the second difference as the end point. The voice data may also be acquired according to the second difference in other ways, which is not limited here.
In this embodiment, step 304 may further comprise: obtaining from the first difference sequence both a first difference greater than the preset threshold and a second difference less than the preset threshold, and acquiring voice data from the signal input by the user according to both, for example with the time point corresponding to the first difference as the starting point and the time point corresponding to the second difference as the end point. The voice data may also be acquired according to the two differences in other ways, which is not limited here.
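Taken together, the three variants above amount to scanning the first difference sequence for a start and an end point. A sketch under assumed names, with the start and end time points mapped back to sample indices:

```python
def extract_speech(samples, diffs, threshold, sample_rate=16000, preset_time=0.1):
    """Start point: the first difference above the preset strength
    threshold. End point: the first difference after it that falls
    below the threshold. The samples between the two corresponding
    time points are returned as the voice data."""
    seg_len = int(sample_rate * preset_time)
    start = next((i for i, d in enumerate(diffs) if d > threshold), None)
    if start is None:
        return []  # nothing exceeded the threshold: no speech found
    end = next((i for i in range(start + 1, len(diffs)) if diffs[i] < threshold),
               len(diffs))
    return samples[start * seg_len:end * seg_len]
```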
In this embodiment, when step 304 comprises obtaining from the first difference sequence a first difference greater than the preset threshold, before acquiring voice data according to the first difference, the method may further comprise: judging whether the average audio value corresponding to the first difference is transient noise. If it is transient noise, a new first difference is obtained and judged again; if it is not, voice data is acquired from the signal input by the user directly according to the first difference. Whether the average audio value corresponding to the first difference is transient noise may be judged by checking whether the time period corresponding to a first audio signal is greater than a preset threshold, where the first audio signal corresponds to a second difference sequence that starts at the first difference and consists of consecutive differences greater than the preset strength threshold; it may also be judged by checking whether the audio signal corresponding to the first difference contains a preset speech feature, which may be a voiced or an unvoiced speech segment; or whether the energy of the audio signal corresponding to the first difference is caused by transient noise may be judged in other ways, which are not described one by one here.
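The duration-based transient-noise check described above — measure how long the signal stays above the threshold starting from the candidate first difference — can be sketched as follows; `min_segments` is an assumed tunable, not a value fixed by the patent:

```python
def is_transient_noise(diffs, start, threshold, min_segments=3):
    """Count the run of consecutive differences above the threshold
    beginning at `start` (the "second difference sequence" of the text);
    a run shorter than min_segments is judged to be transient noise."""
    run = 0
    for d in diffs[start:]:
        if d <= threshold:
            break
        run += 1
    return run < min_segments

def find_speech_start(diffs, threshold, min_segments=3):
    """Return the first above-threshold index that is not judged
    transient, re-acquiring and re-judging as the embodiment describes,
    or None if every candidate is transient."""
    for i, d in enumerate(diffs):
        if d > threshold and not is_transient_noise(diffs, i, threshold, min_segments):
            return i
    return None
```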
As in the preceding embodiments, because the relation between the average audio value of each signal segment and the average noise value is taken into account during extraction, the method provided by this embodiment reduces the influence of noise on voice data extraction and thus improves the accuracy of speech recognition.
As shown in Fig. 4, the voice data extraction device provided by embodiment four of the invention comprises:
a noise value acquiring unit 401, configured to acquire the average noise value of the environment in which the voice device is located.
In this embodiment, before the voice device is started, the noise value acquiring unit 401 may detect the noise value of the environment at each time point by means of decibel detection and obtain the average noise value from the noise values at these time points; it may also obtain the average noise value in other ways, which are not described one by one here. The voice device may be a device with a voice recording and transmission function, a device with a speech recognition function, or another device; this is not limited here.
a segmenting unit 402, configured to segment the signal input by the user according to a preset time after the user starts the voice device, to obtain at least one signal segment.
In this embodiment, the preset time can be set arbitrarily; preferably, to avoid missing voice data, it can be set to a short value, such as 0.1 second.
a data extracting unit 403, configured to acquire voice data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
In this embodiment, after the segmenting unit 402 obtains the at least one signal segment, the audio value of each frame of signal in each signal segment may be obtained, and the average audio value obtained from these per-frame audio values; the average audio value may also be obtained in other ways, which is not limited here.
In this embodiment, the data extracting unit 403 may first subtract the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence, and then acquire voice data from the signal input by the user according to the relation between at least one difference in the first difference sequence and a preset strength threshold; it may also acquire the voice data in other ways, which are not described one by one here.
In this embodiment, acquiring voice data according to this relation may be: determining the starting point of the voice data and acquiring the voice data according to the starting point; or determining the end point and acquiring the voice data according to the end point; or determining both and acquiring the voice data according to both. When determining the starting point, it may be taken directly as the initial time of the voice data, or the interference of transient noise may be filtered out first.
As shown in Fig. 5, the voice data extraction device in this embodiment may further comprise:
a storage unit 400, configured to obtain and store the signal input by the user.
In this embodiment, after the user starts the voice device, the recording function of the voice device may be started automatically, so that the signal input by the user is obtained and stored; the storage unit 400 may also obtain and store the signal in other ways, which are not described one by one here.
Further, as shown in Fig. 6, the data extracting unit 403 in the voice data extraction device provided by this embodiment may comprise:
a subtraction sub-unit 4031, configured to subtract the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence; and
an extraction sub-unit 4032, configured to acquire voice data from the signal input by the user according to the relation between at least one difference in the first difference sequence and the preset strength threshold.
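The unit structure of Figs. 4 and 6 can be mirrored in code as a single class whose methods play the roles of the units; every name, the level measure, and the parameter values are our assumptions:

```python
class VoiceDataExtractor:
    """Sketch of the embodiment-four device: noise value acquiring unit,
    segmenting unit, and data extracting unit (subtraction sub-unit 4031
    plus extraction sub-unit 4032) composed into one object."""

    def __init__(self, threshold, sample_rate=16000, preset_time=0.1):
        self.threshold = threshold              # preset strength threshold
        self.seg_len = int(sample_rate * preset_time)
        self.avg_noise = 0.0

    def acquire_noise(self, ambient):           # noise value acquiring unit 401
        self.avg_noise = sum(abs(s) for s in ambient) / len(ambient)

    def segment(self, samples):                 # segmenting unit 402
        return [samples[i:i + self.seg_len]
                for i in range(0, len(samples), self.seg_len)]

    def extract(self, samples):                 # data extracting unit 403
        diffs = [sum(abs(s) for s in seg) / len(seg) - self.avg_noise
                 for seg in self.segment(samples)]           # sub-unit 4031
        start = next((i for i, d in enumerate(diffs) if d > self.threshold), None)
        if start is None:
            return []
        end = next((i for i in range(start + 1, len(diffs))
                    if diffs[i] < self.threshold), len(diffs))
        return samples[start * self.seg_len:end * self.seg_len]  # sub-unit 4032
```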
In the present embodiment, obtain at least one signal segment through segmenting unit 402 after, can be to obtaining the audio value of every frame signal in each signal segment respectively, and obtain the average audio value according to the audio value of this every frame signal; Also can obtain the average audio value, not limit at this through other modes.
In the present embodiment; As shown in Figure 7; Extracting subelement 4032 can comprise: first acquisition module 40321, be used for obtaining first difference greater than the preset strength threshold value from first sequence of differences, and first difference is greater than first difference of preset strength threshold value in first sequence of differences; First extraction module 40322 is used for obtaining speech data according to first difference from the signal that the user imports.Wherein, from the signal of user input, obtain speech data according to this first difference, can be for being that starting point is obtained speech data with the first difference time corresponding point from the signal of user's input; Can not limit at this for from the signal of user's input, obtaining speech data according to this first difference through other modes yet.
In the present embodiment; As shown in Figure 8; Extracting subelement 4032 also can comprise: second acquisition module 40323; Be used for obtaining second difference less than the preset strength threshold value from first sequence of differences, second difference is a first difference less than the preset strength threshold value after first difference in first sequence of differences; Second extraction module 40324 is used for obtaining speech data according to second difference from the signal that the user imports.Wherein, from the signal of user input, obtain speech data according to this second difference, can be for being that terminal point obtains speech data with the second difference time corresponding point from the signal of user's input; Can not limit at this for from the signal of user's input, obtaining speech data according to second difference through other modes yet.
In the present embodiment, as shown in Figure 9, the extraction subunit 4032 may comprise both the first acquisition module 40321 and the first extraction module 40322, as well as the second acquisition module 40323 and the second extraction module 40324. In this case, the speech data is obtained from the signal input by the user according to the first difference and the second difference. Obtaining the speech data from the signal input by the user according to the first difference and the second difference may comprise taking the time point corresponding to the first difference as a starting point and the time point corresponding to the second difference as an end point, and obtaining the speech data from the signal input by the user accordingly; the speech data may also be obtained from the signal input by the user according to the first difference and the second difference in other manners, which is not limited here.
In the present embodiment, when the extraction subunit 4032 comprises the first acquisition module 40321, as shown in Figure 7, the extraction subunit 4032 further comprises a judgment module 40325, configured to judge whether the average audio value corresponding to the first difference is a transient noise value. The judgment module may implement this function through a judgment submodule configured to judge whether the time period corresponding to a first audio signal is greater than a preset threshold, thereby judging whether the average audio value corresponding to the first difference is a transient noise value; the first audio signal corresponds to a second difference sequence that takes the first difference as its starting point and consists of consecutive differences greater than the preset strength threshold. The judgment module 40325 may also judge whether the average audio value corresponding to the first difference is a transient noise value in other manners, for example by judging whether the audio signal corresponding to the first difference contains a preset speech feature; the preset speech feature may comprise a voiced speech segment or an unvoiced speech segment, which is not limited here.
In the present embodiment, when the average audio value corresponding to the first difference is a transient noise value, a first difference is obtained and judged again; when the average audio value corresponding to the first difference is not a transient noise value, the speech data is obtained directly from the signal input by the user according to the first difference.
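The difference-sequence scheme described above can be sketched as follows. This is a minimal illustration only, not an implementation from the patent: the parameter names (`seg_ms`, `strength_thresh`, `min_speech_ms`) and the use of mean absolute amplitude as the "average audio value" are assumptions chosen for the sketch.

```python
import numpy as np

def extract_speech(signal, sample_rate, noise_avg,
                   seg_ms=20, strength_thresh=0.5, min_speech_ms=60):
    """Segment the input, subtract the ambient noise average from each
    segment's average level to form the first difference sequence, take the
    first above-threshold difference as the speech start (the "first
    difference") and the next below-threshold difference as the speech end
    (the "second difference"), rejecting short transient-noise bursts."""
    seg_len = int(sample_rate * seg_ms / 1000)
    n_segs = len(signal) // seg_len
    # Average audio value per segment (mean absolute amplitude here).
    avg_audio = np.array([np.abs(signal[i * seg_len:(i + 1) * seg_len]).mean()
                          for i in range(n_segs)])
    # "First difference sequence": segment averages minus the noise average.
    diffs = avg_audio - noise_avg
    min_segs = max(1, min_speech_ms // seg_ms)  # transient-noise duration check
    start = None
    for i, d in enumerate(diffs):
        if start is None:
            if d > strength_thresh:
                # Transient-noise judgment: the run of consecutive
                # above-threshold differences must last longer than a preset
                # duration; otherwise discard this candidate start.
                j = i
                while j < n_segs and diffs[j] > strength_thresh:
                    j += 1
                if j - i >= min_segs:
                    start = i          # "first difference" -> speech start
        elif d < strength_thresh:
            # "second difference" -> speech end
            return signal[start * seg_len:i * seg_len]
    if start is not None:              # speech runs to the end of the input
        return signal[start * seg_len:]
    return signal[:0]                  # no speech found
```

A short spike shorter than `min_speech_ms` is treated as transient noise and skipped, which mirrors the re-acquisition step described above.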
In the extraction device for speech data provided by the embodiment of the invention, the average noise value of the environment in which the voice device is located and at least one signal segment are obtained, and the speech data is obtained from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value, thereby realizing the extraction of speech data. Because the relation between the average audio value corresponding to the at least one signal segment and the average noise value is considered when the speech data is extracted, the technical scheme provided by the embodiment of the invention can reduce the influence of noise on speech data extraction and thus improve the accuracy of speech recognition. This solves the problem in the prior art that, when noise is present during speech data extraction, the noise is taken as speech data, resulting in inaccurate speech recognition.
The extraction method and device for speech data provided by the embodiments of the invention can be applied in a speech recognition system.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.

Claims (13)

1. A method for extracting speech data, characterized by comprising:
obtaining an average noise value of the environment in which a voice device is located;
after a user starts the voice device, segmenting the signal input by the user according to a preset time to obtain at least one signal segment;
obtaining speech data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
2. The method for extracting speech data according to claim 1, characterized in that, before segmenting the signal input by the user according to the preset time, the method further comprises:
obtaining and storing the signal input by the user.
3. The method for extracting speech data according to claim 1, characterized in that obtaining the speech data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value comprises:
subtracting the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence;
obtaining the speech data from the signal input by the user according to the relation between at least one difference in the first difference sequence and a preset strength threshold.
4. The method for extracting speech data according to claim 3, characterized in that obtaining the speech data from the signal input by the user according to the relation between the at least one difference in the first difference sequence and the preset strength threshold comprises:
obtaining from the first difference sequence a first difference greater than the preset strength threshold, the first difference being the earliest difference in the first difference sequence that is greater than the preset strength threshold, and obtaining the speech data from the signal input by the user according to the first difference; and/or
obtaining from the first difference sequence a second difference less than the preset strength threshold, the second difference being the earliest difference in the first difference sequence after the first difference that is less than the preset strength threshold, and obtaining the speech data from the signal input by the user according to the second difference.
5. The method for extracting speech data according to claim 4, characterized in that, before obtaining the speech data from the signal input by the user according to the first difference, the method further comprises:
judging whether the average audio value corresponding to the first difference is a transient noise value;
when the average audio value corresponding to the first difference is a transient noise value, obtaining a first difference and judging again.
6. The method for extracting speech data according to claim 5, characterized in that judging whether the average audio value corresponding to the first difference is a transient noise value comprises:
judging whether the time period corresponding to a first audio signal is greater than a preset threshold, the first audio signal corresponding to a second difference sequence that takes the first difference as its starting point and consists of consecutive differences greater than the preset strength threshold.
7. An extraction device for speech data, characterized by comprising:
a noise value acquisition unit, configured to obtain an average noise value of the environment in which a voice device is located;
a segmentation unit, configured to segment, after a user starts the voice device, the signal input by the user according to a preset time to obtain at least one signal segment;
a data extraction unit, configured to obtain speech data from the signal input by the user according to the relation between the average audio value corresponding to the at least one signal segment and the average noise value.
8. The extraction device for speech data according to claim 7, characterized by further comprising:
a storage unit, configured to obtain and store the signal input by the user.
9. The extraction device for speech data according to claim 8, characterized in that the data extraction unit comprises:
a subtraction subunit, configured to subtract the average noise value from the average audio value corresponding to each of the at least one signal segment to obtain a first difference sequence;
an extraction subunit, configured to obtain the speech data from the signal input by the user according to the relation between at least one difference in the first difference sequence and a preset strength threshold.
10. The extraction device for speech data according to claim 9, characterized in that the extraction subunit comprises:
a first acquisition module, configured to obtain from the first difference sequence a first difference greater than the preset strength threshold, the first difference being the earliest difference in the first difference sequence that is greater than the preset strength threshold;
a first extraction module, configured to obtain the speech data from the signal input by the user according to the first difference.
11. The extraction device for speech data according to claim 9 or 10, characterized in that the extraction subunit comprises:
a second acquisition module, configured to obtain from the first difference sequence a second difference less than the preset strength threshold, the second difference being the earliest difference in the first difference sequence after the first difference that is less than the preset strength threshold;
a second extraction module, configured to obtain the speech data from the signal input by the user according to the second difference.
12. The extraction device for speech data according to claim 10, characterized by further comprising:
a judgment module, configured to judge whether the average audio value corresponding to the first difference is a transient noise value.
13. The extraction device for speech data according to claim 12, characterized in that the judgment module comprises:
a judgment submodule, configured to judge whether the time period corresponding to a first audio signal is greater than a preset threshold, the first audio signal corresponding to a second difference sequence that takes the first difference as its starting point and consists of consecutive differences greater than the preset strength threshold.
CN2011104543338A 2011-12-30 2011-12-30 Voice data extraction method and device Pending CN102592592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104543338A CN102592592A (en) 2011-12-30 2011-12-30 Voice data extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104543338A CN102592592A (en) 2011-12-30 2011-12-30 Voice data extraction method and device

Publications (1)

Publication Number Publication Date
CN102592592A true CN102592592A (en) 2012-07-18

Family

ID=46481133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104543338A Pending CN102592592A (en) 2011-12-30 2011-12-30 Voice data extraction method and device

Country Status (1)

Country Link
CN (1) CN102592592A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134439A (en) * 2014-07-31 2014-11-05 深圳市金立通信设备有限公司 Method, device and system for obtaining idioms
CN104157286A (en) * 2014-07-31 2014-11-19 深圳市金立通信设备有限公司 Idiomatic phrase acquisition method and device
CN110686354A (en) * 2019-10-12 2020-01-14 宁波奥克斯电气股份有限公司 Voice air conditioner control method, voice air conditioner control device and air conditioner

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912993A (en) * 2005-08-08 2007-02-14 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US20090070106A1 (en) * 2006-03-20 2009-03-12 Mindspeed Technologies, Inc. Method and system for reducing effects of noise producing artifacts in a speech signal
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
CN102187388A (en) * 2008-10-15 2011-09-14 高通股份有限公司 Methods and apparatus for noise estimation in audio signals

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912993A (en) * 2005-08-08 2007-02-14 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US20090070106A1 (en) * 2006-03-20 2009-03-12 Mindspeed Technologies, Inc. Method and system for reducing effects of noise producing artifacts in a speech signal
CN102187388A (en) * 2008-10-15 2011-09-14 高通股份有限公司 Methods and apparatus for noise estimation in audio signals
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134439A (en) * 2014-07-31 2014-11-05 深圳市金立通信设备有限公司 Method, device and system for obtaining idioms
CN104157286A (en) * 2014-07-31 2014-11-19 深圳市金立通信设备有限公司 Idiomatic phrase acquisition method and device
CN104157286B (en) * 2014-07-31 2017-12-29 深圳市金立通信设备有限公司 A kind of phrasal acquisition methods and device
CN110686354A (en) * 2019-10-12 2020-01-14 宁波奥克斯电气股份有限公司 Voice air conditioner control method, voice air conditioner control device and air conditioner

Similar Documents

Publication Publication Date Title
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN104078044B (en) The method and apparatus of mobile terminal and recording search thereof
US9564127B2 (en) Speech recognition method and system based on user personalized information
CN110349564A (en) Across the language voice recognition methods of one kind and device
CN105529028A (en) Voice analytical method and apparatus
CN105096941A (en) Voice recognition method and device
CN102915731A (en) Method and device for recognizing personalized speeches
CN105095186A (en) Semantic analysis method and device
CN105975569A (en) Voice processing method and terminal
CN104538034A (en) Voice recognition method and system
CN110970018A (en) Speech recognition method and device
CN114399992B (en) Voice instruction response method, device and storage medium
CN105635782A (en) Subtitle output method and device
CN110600008A (en) Voice wake-up optimization method and system
CN105825848A (en) Method, device and terminal for voice recognition
CN111681646A (en) Universal scene Chinese Putonghua speech recognition method of end-to-end architecture
CN110019848A (en) Conversation interaction method and device and robot
CN101825953A (en) Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN107886940B (en) Speech translation processing method and device
CN102592592A (en) Voice data extraction method and device
CN102568473A (en) Method and device for recording voice signals
CN114694637A (en) Hybrid speech recognition method, device, electronic device and storage medium
CN113782026A (en) An information processing method, apparatus, medium and equipment
CN113099043A (en) Customer service control method, apparatus and computer-readable storage medium
CN109376224A (en) Corpus filter method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120718