CN101154383A

CN101154383A - Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model

Info

Publication number: CN101154383A
Application number: CNA2006101412409A
Authority: CN
Inventors: 丁沛; 何磊; 鄢翔; 赵蕤; 郝杰
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-09-29
Filing date: 2006-09-29
Publication date: 2008-04-02
Anticipated expiration: 2026-09-29
Also published as: CN101154383B

Abstract

The invention provides a noise suppression method, a method for extracting speech features, a speech recognition method, and a method for training a speech model, as well as a noise suppression apparatus, an apparatus for extracting speech features, a speech recognition apparatus, and an apparatus for training a speech model. According to one aspect of the invention, the noise suppression method for a noisy speech spectrum comprises: performing log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum, wherein the estimation is performed by calculating a gain function through the following steps: calculating the gain function by Taylor series accumulation; calculating the gain function by numerical integration; and combining the result of the Taylor series accumulation with the result of the numerical integration.

Description

Method and apparatus for noise suppression, speech feature extraction, speech recognition, and speech model training

Technical field

The present invention relates generally to speech recognition technology and, in particular, to noise reduction techniques for speech spectra.

Background technology

Current mainstream speech recognition systems achieve very high recognition accuracy on clean speech. However, because noise introduces a mismatch between the acoustic model and the acoustic features, the performance of existing speech recognition systems degrades sharply in noisy environments.

Work on noise robustness has concentrated mainly on front-end design, with the aim of reducing the mismatch that noise introduces in the speech feature space. Minimum mean-square error (MMSE) estimation is a speech enhancement algorithm that suppresses background noise effectively and thereby raises the signal-to-noise ratio (SNR) of the input signal. MMSE estimation is described in detail in Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-32, pp. 1109-1121, 1984, the entire contents of which are incorporated herein by reference (hereinafter, document 1). That document uses MMSE estimation to estimate the short-time spectral amplitude (STSA), proposes a system based on the MMSE STSA estimator, and compares it with widely used systems based on Wiener filtering and on the spectral subtraction algorithm.

Although the squared-error distortion measure on the spectrum used by Y. Ephraim and D. Malah is mathematically tractable and gives good results, it is not the optimal choice. It is well known that a distortion measure based on the squared error of the log-spectrum is better suited to speech processing, as described in detail in R. M. Gray, A. Buzo, A. H. Gray, Jr., and Y. Matsuyama, "Distortion measures for speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 367-376, Aug. 1980, the entire contents of which are incorporated herein by reference. This distortion measure is therefore widely used in speech analysis and recognition.

Log-spectral minimum mean-square error (LogMMSE) estimation is described in detail in Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-33, pp. 443-445, 1985, the entire contents of which are incorporated herein by reference (hereinafter, document 2). LogMMSE is superior to MMSE because it achieves a lower residual noise level without affecting the quality of the speech itself. In the LogMMSE enhancement algorithm, the gain function is calculated by either Taylor series accumulation or numerical integration.

This framework, however, has the following two problems:

1. Taylor series accumulation is accurate only when the input value is small, whereas numerical integration is accurate only when the input value is large.

2. Calculating the gain function by Taylor series accumulation or by numerical integration is computationally very expensive.

Summary of the invention

To solve the above problems in the prior art, the present invention provides a noise suppression method, a method for extracting speech features, a speech recognition method, and a method for training a speech model, as well as a noise suppression apparatus, an apparatus for extracting speech features, a speech recognition apparatus, and an apparatus for training a speech model.

According to one aspect of the present invention, there is provided a noise suppression method for a noisy speech spectrum, comprising: performing log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum; wherein the log-spectral minimum mean-square error estimation is performed with a piecewise linear function in place of the gain function.

According to another aspect of the present invention, there is provided a noise suppression method for a noisy speech spectrum, comprising: performing log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum; wherein the log-spectral minimum mean-square error estimation is performed by calculating the gain function through the following steps: calculating the gain function by Taylor series accumulation; calculating the gain function by numerical integration; and merging the result of the Taylor series accumulation with the result of the numerical integration.

According to another aspect of the present invention, there is provided a method for extracting speech features, comprising: transforming noisy speech into a noisy speech spectrum; reducing the noise in the noisy speech spectrum using the above noise suppression method; and extracting speech features from the noise-reduced speech spectrum.

According to another aspect of the present invention, there is provided a speech recognition method, comprising: extracting speech features using the above method for extracting speech features; and recognizing speech according to the extracted speech features.

According to another aspect of the present invention, there is provided a method for training a speech model, comprising: extracting speech features using the above method for extracting speech features; and training the speech model according to the extracted speech features.

According to another aspect of the present invention, there is provided a noise suppression apparatus for a noisy speech spectrum, comprising: an estimation unit that performs log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum; wherein the estimation unit performs the log-spectral minimum mean-square error estimation with a piecewise linear function in place of the gain function.

According to another aspect of the present invention, there is provided a noise suppression apparatus for a noisy speech spectrum, comprising: an estimation unit that performs log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum; wherein the estimation unit comprises: a Taylor series accumulation calculation unit that calculates the gain function by Taylor series accumulation; a numerical integration calculation unit that calculates the gain function by numerical integration; and a combination unit that merges the result calculated by the Taylor series accumulation calculation unit with the result calculated by the numerical integration calculation unit.

According to another aspect of the present invention, there is provided an apparatus for extracting speech features, comprising: a transforming unit that transforms noisy speech into a noisy speech spectrum; the above noise suppression apparatus, which reduces the noise in the noisy speech spectrum; and an extracting unit that extracts the speech features from the noise-reduced speech spectrum.

According to another aspect of the present invention, there is provided a speech recognition apparatus, comprising: the above apparatus for extracting speech features, which extracts speech features; and a speech recognition unit that recognizes speech according to the extracted speech features.

According to another aspect of the present invention, there is provided an apparatus for training a speech model, comprising: the above apparatus for extracting speech features, which extracts speech features; and a model training unit that trains the speech model according to the extracted speech features.

Description of drawings

It is believed that the above features, advantages, and objects of the present invention will be better understood from the following description of specific embodiments taken in conjunction with the accompanying drawings.

Fig. 1 is a flowchart of a noise suppression method according to an embodiment of the present invention;

Figs. 2A-2D show an example of the process of setting the cut-points of the piecewise linear function, in which Fig. 2A shows the curve of a gain function, Fig. 2B shows the curve of the derivative of the gain function, Fig. 2C shows the curve of the difference between the gain function and the piecewise linear function, and Fig. 2D shows the curve of the piecewise linear function after segmentation;

Fig. 3 is a flowchart of a noise suppression method according to another embodiment of the present invention;

Figs. 4A-4C show an example of merging Taylor series accumulation and numerical integration, in which Fig. 4A shows the gain function obtained by Taylor series accumulation, Fig. 4B shows the gain function obtained by numerical integration, and Fig. 4C shows the gain function obtained by merging the two calculation methods;

Fig. 5 shows an example of calculating the merging threshold;

Fig. 6 is a flowchart of a method for extracting speech features according to another embodiment of the present invention;

Fig. 7 is a flowchart of a speech recognition method according to another embodiment of the present invention;

Fig. 8 is a flowchart of a method for training a speech model according to another embodiment of the present invention;

Fig. 9 is a block diagram of a noise suppression apparatus according to another embodiment of the present invention;

Fig. 10 is a block diagram of a noise suppression apparatus according to another embodiment of the present invention;

Fig. 11 is a block diagram of an apparatus for extracting speech features according to another embodiment of the present invention;

Fig. 12 is a block diagram of a speech recognition apparatus according to another embodiment of the present invention; and

Fig. 13 is a block diagram of an apparatus for training a speech model according to another embodiment of the present invention.

Embodiment

To facilitate understanding of the embodiments below, the principles of minimum mean-square error (MMSE) estimation and log-spectral minimum mean-square error (LogMMSE) estimation are first briefly introduced.

MMSE estimation is a speech enhancement algorithm that uses an estimated spectrum of the background noise to suppress the noise in a noisy speech spectrum, yielding a noise-suppressed speech spectrum. Specifically, MMSE estimation is carried out according to the following formulas:

$$y(t) = x(t) + d(t), \quad 0 \le t \le T \tag{1}$$

$$\hat{A}_k = E\{A_k \mid y(t),\ 0 \le t \le T\} \tag{2}$$

where $y(t)$ denotes the signal comprising the speech signal $x(t)$ and the noise signal $d(t)$, $A_k$ denotes the amplitude of the $k$-th spectral component of the speech signal $x(t)$, and $\hat{A}_k$ denotes the speech spectrum obtained by MMSE estimation of $A_k$. Derivation yields:

$$\hat{A}_k = C\,\frac{\sqrt{\upsilon_k}}{\gamma_k}\,M(\upsilon_k)\,R_k \tag{3}$$

where

$$\upsilon_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k \tag{4}$$

and where $\hat{A}_k$ denotes the noise-suppressed speech spectrum, $R_k$ denotes the noisy speech spectrum, $C$ is a constant, $\xi_k$ is the a priori signal-to-noise ratio obtained from the noise estimate spectrum, $\gamma_k$ is the a posteriori signal-to-noise ratio obtained from the noise estimate spectrum and the noisy speech spectrum, $M(\upsilon_k)$ is a confluent hypergeometric function, and $k$ denotes the $k$-th spectral component. For details, see document 1 by Y. Ephraim and D. Malah cited above.

LogMMSE estimation is also a speech enhancement algorithm; it achieves a lower residual noise level without affecting the quality of the speech itself. Specifically, LogMMSE estimation is carried out according to the following formula:

$$\hat{A}_k = \exp\{E[\ln A_k \mid y(t),\ 0 \le t \le T]\} \tag{5}$$

Unlike formula (2) used for MMSE estimation, the logarithm of the amplitude $A_k$ of the $k$-th spectral component of the speech signal $x(t)$ is taken here. Derivation yields:

$$\hat{A}_k = \frac{\xi_k}{1+\xi_k}\exp\left\{\frac{1}{2}\int_{\upsilon_k}^{\infty}\frac{e^{-t}}{t}\,dt\right\}R_k \tag{6}$$

The gain function $G(\upsilon_k)$ is defined as follows:

$$G(\upsilon_k) \equiv \frac{\hat{A}_k}{R_k} \tag{7}$$

where

$$\upsilon_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k .$$

The noise-suppressed speech spectrum $\hat{A}_k$ is thus obtained as:

$$\hat{A}_k = G(\upsilon_k)\,R_k \tag{8}$$

For details, see document 2 by Y. Ephraim and D. Malah cited above.
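As a worked restatement of formulas (6) to (8), the integral in formula (6) can be recognized as the exponential integral $E_1$. The identities below are standard mathematical facts rather than text from the source; $\gamma_E$ (the Euler-Mascheroni constant) is a symbol introduced only for this illustration. The power series in the third line is the kind of expansion a Taylor series accumulation would sum, and the last line shows that for large $\upsilon_k$ the gain tends to $\xi_k/(1+\xi_k)$.

```latex
\begin{align}
E_1(\upsilon_k) &= \int_{\upsilon_k}^{\infty} \frac{e^{-t}}{t}\,dt, \\
G(\upsilon_k) &= \frac{\xi_k}{1+\xi_k}\,\exp\left\{\tfrac{1}{2}\,E_1(\upsilon_k)\right\}, \\
E_1(\upsilon_k) &= -\gamma_E - \ln \upsilon_k - \sum_{n=1}^{\infty} \frac{(-1)^n\,\upsilon_k^{\,n}}{n \cdot n!}, \\
\lim_{\upsilon_k \to \infty} E_1(\upsilon_k) &= 0 \quad\Longrightarrow\quad G(\upsilon_k) \to \frac{\xi_k}{1+\xi_k}.
\end{align}
```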

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a noise suppression method according to an embodiment of the present invention. As shown in Fig. 1, first, in step 101, a noisy speech spectrum is input. The noisy speech spectrum is obtained from audio data containing both background noise and speech, for example by a fast Fourier transform, and is therefore a spectrum in which background noise and speech are superimposed.

Next, in step 105, log-spectral minimum mean-square error estimation is performed on the noisy speech spectrum according to a previously estimated noise estimate spectrum. The noise estimate spectrum is obtained by estimating, in advance, background noise that contains no speech. There are many ways to obtain it, for example by averaging background noise spectra collected several times; the present invention places no particular restriction on this. Specifically, the estimation is performed according to formula (8) above, with a piecewise linear function replacing the gain function $G(\upsilon_k)$ in formula (8), which gives:

$$\hat{A}_k = L(\upsilon_k)\,R_k \tag{9}$$

where

$$\upsilon_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k,$$

and where $\hat{A}_k$ denotes the noise-suppressed speech spectrum, $R_k$ denotes the noisy speech spectrum, $\xi_k$ is the a priori signal-to-noise ratio obtained from the noise estimate spectrum, $\gamma_k$ is the a posteriori signal-to-noise ratio obtained from the noise estimate spectrum and the noisy speech spectrum, $L(\upsilon_k)$ is the piecewise linear function, and $k$ denotes the $k$-th spectral component.
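The quantities entering formula (9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the source: the noise estimate is taken as the bin-wise average of power spectra from speech-free frames, the a posteriori SNR is taken as the noisy power divided by the noise power, and the a priori SNR is approximated as `max(gamma - 1, xi_floor)`. The source only says that both SNRs are obtained from the noise estimate spectrum, so the function names, the `xi_floor` parameter, and the estimator for the a priori SNR are all hypothetical.

```python
def average_noise_power(noise_frames):
    """Average the power spectra of several speech-free frames, bin by bin.

    noise_frames: list of magnitude spectra (lists of floats of equal length).
    """
    n_bins = len(noise_frames[0])
    return [
        sum(frame[k] ** 2 for frame in noise_frames) / len(noise_frames)
        for k in range(n_bins)
    ]


def snr_terms(noisy_magnitude, noise_power, xi_floor=1e-3):
    """Return a list of (xi_k, gamma_k, upsilon_k) tuples, one per spectral bin.

    ASSUMPTION: xi_k = max(gamma_k - 1, xi_floor). The source does not specify
    how the a priori SNR is estimated.
    """
    terms = []
    for r_k, lambda_k in zip(noisy_magnitude, noise_power):
        gamma_k = r_k ** 2 / lambda_k            # a posteriori SNR (assumed form)
        xi_k = max(gamma_k - 1.0, xi_floor)      # a priori SNR (assumed estimator)
        upsilon_k = xi_k / (1.0 + xi_k) * gamma_k
        terms.append((xi_k, gamma_k, upsilon_k))
    return terms


# Two identical speech-free frames, then one noisy frame with two bins.
noise_power = average_noise_power([[1.0, 2.0], [1.0, 2.0]])
terms = snr_terms([2.0, 2.0], noise_power)
```

Each `upsilon_k` value would then be passed to the piecewise linear function, and the result multiplied by the noisy magnitude of the same bin, as in formula (9).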

In this embodiment, the gain function $G(\upsilon_k)$ can be approximated by a piecewise linear function $L(\upsilon_k)$ whose cut-points are set in advance. For example, the approximation can be carried out by the following steps.

Specifically, Figs. 2A-2D show an example of the process of setting the cut-points of the piecewise linear function, in which Fig. 2A shows the curve of a gain function $G(\upsilon)$, Fig. 2B shows the curve of the derivative of the gain function, Fig. 2C shows the curve of the difference between the gain function and the piecewise linear function, and Fig. 2D shows the curve of the piecewise linear function $L(\upsilon)$ after segmentation. The segmentation proceeds as follows.

First, the derivative of the gain function $G(\upsilon)$ is calculated, as shown in Fig. 2B. For convenience, this example uses the part of the curve whose derivative values lie in the range 0.05 to 0.50.

Next, the initial cut-points of the piecewise linear function $L(\upsilon)$ are set, as shown in Fig. 2B. In this example, the initial cut-points are placed where the derivative takes the values 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, and 0.45.

Next, the difference between the piecewise linear function $L(\upsilon)$ and the gain function $G(\upsilon)$ is calculated between every two consecutive initial cut-points, as shown in Fig. 2C.

Next, the difference calculated between every two consecutive cut-points is compared with a preset threshold; in this example the threshold is set to 0.037. If the difference exceeds 0.037, a new cut-point is inserted between the two consecutive cut-points, for example at their midpoint, as between the cut-points 0.10 and 0.15.

The difference calculation and the subsequent steps are repeated until no difference exceeds the threshold. The piecewise linear function shown in Fig. 2D is thereby obtained.
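The refinement loop described above can be sketched as follows. This is a minimal illustration under several assumptions: the initial cut-points are given directly as values of the input variable (the source places them by derivative value), the difference on a segment is measured at a fixed number of sample points, and the function being approximated is only the factor exp(E1(υ)/2) of the gain, with E1 summed from its power series. All function names and the initial cut-points are hypothetical; only the threshold 0.037 and the midpoint insertion come from the example in the text.

```python
import bisect
import math

EULER_GAMMA = 0.5772156649015329


def gain_shape(v, terms=60):
    """exp(0.5 * E1(v)), with E1 summed from its power series (adequate for v <= 3)."""
    e1 = -EULER_GAMMA - math.log(v)
    power_over_factorial = 1.0
    for n in range(1, terms + 1):
        power_over_factorial *= v / n              # equals v**n / n!
        e1 -= (-1) ** n * power_over_factorial / n
    return math.exp(0.5 * e1)


def max_deviation(f, a, b, samples=16):
    """Largest sampled gap between f and the chord from (a, f(a)) to (b, f(b))."""
    fa, fb = f(a), f(b)
    worst = 0.0
    for i in range(1, samples):
        x = a + (b - a) * i / samples
        chord = fa + (fb - fa) * (x - a) / (b - a)
        worst = max(worst, abs(f(x) - chord))
    return worst


def refine_cut_points(f, initial_points, threshold, max_rounds=50):
    """Insert midpoints until no segment deviates from f by more than threshold."""
    points = sorted(initial_points)
    for _ in range(max_rounds):
        refined = [points[0]]
        inserted = False
        for a, b in zip(points, points[1:]):
            if max_deviation(f, a, b) > threshold:
                refined.append(0.5 * (a + b))      # new cut-point at the midpoint
                inserted = True
            refined.append(b)
        points = refined
        if not inserted:
            break
    return points


def piecewise_linear(points, values, v):
    """Evaluate L(v): find the segment with bisect, then interpolate linearly."""
    v = min(max(v, points[0]), points[-1])         # clamp to the tabulated range
    i = min(bisect.bisect_right(points, v) - 1, len(points) - 2)
    a, b = points[i], points[i + 1]
    return values[i] + (values[i + 1] - values[i]) * (v - a) / (b - a)


cut_points = refine_cut_points(gain_shape, [0.05, 0.5, 1.0, 2.0, 3.0], threshold=0.037)
cut_values = [gain_shape(p) for p in cut_points]
```

Once `cut_points` and `cut_values` are tabulated, each evaluation of `piecewise_linear` costs one binary search and one interpolation, which is where the saving over summing a series or integrating per spectral bin would come from.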

Returning to Fig. 1, after the log-spectral minimum mean-square error estimation has been performed with the piecewise linear function $L(\upsilon_k)$ in place of the gain function $G(\upsilon_k)$, the speech spectrum whose noise has been reduced by the estimation is output in step 110.

With the noise suppression method of this embodiment, replacing the gain function with a piecewise linear function greatly reduces the computational cost of log-spectral minimum mean-square error estimation while preserving the noise suppression performance.

Under the same inventive concept, Fig. 3 is a flowchart of a noise suppression method according to another embodiment of the present invention. This embodiment is described below with reference to that figure. Description of parts identical to the preceding embodiment is omitted where appropriate.

As shown in Fig. 3, first, in step 301, a noisy speech spectrum is input. The noisy speech spectrum is obtained from audio data containing both background noise and speech, for example by a fast Fourier transform, and is therefore a spectrum in which background noise and speech are superimposed.

Next, in step 305, log-spectral minimum mean-square error estimation is performed on the noisy speech spectrum. Specifically, in this step the estimation is performed using formula (8), with the gain function calculated by Taylor series accumulation, giving the curve shown in Fig. 4A. The Taylor series accumulation method used in this embodiment may be any method known to those skilled in the art; the present invention places no restriction on it, and it is not described further here.

As can be seen from Fig. 4A, the gain function values obtained by Taylor series accumulation are very accurate when the input variable is small and inaccurate when the input variable is large.

Next, in step 310, log-spectral minimum mean-square error estimation is performed according to the noise estimate spectrum using formula (8), with the gain function calculated by numerical integration, giving the curve shown in Fig. 4B. The numerical integration method used in this embodiment may be any method known to those skilled in the art; the present invention places no restriction on it, and it is not described further here.

As can be seen from Fig. 4B, in contrast to the result of the Taylor series accumulation method, the gain function values obtained by numerical integration are very accurate when the input variable is large and inaccurate when the input variable is small.
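The two calculations can be sketched as follows for the integral in formula (6). This is only an illustration under assumptions: the source leaves both methods open, so a power series truncated to 10 terms stands in for Taylor series accumulation, and a fixed-step trapezoid rule over a finite span stands in for numerical integration. The function names, the number of terms, the step, and the span are hypothetical choices made so that the complementary accuracy ranges described in the text become visible.

```python
import math

EULER_GAMMA = 0.5772156649015329


def e1_taylor(v, terms=10):
    """Truncated power series for the integral of exp(-t)/t from v to infinity."""
    total = -EULER_GAMMA - math.log(v)
    power_over_factorial = 1.0
    for n in range(1, terms + 1):
        power_over_factorial *= v / n              # equals v**n / n!
        total -= (-1) ** n * power_over_factorial / n
    return total


def e1_trapezoid(v, step=0.05, span=30.0):
    """Fixed-step trapezoid rule for the same integral on [v, v + span]."""
    n = int(round(span / step))
    end = v + n * step
    total = 0.5 * (math.exp(-v) / v + math.exp(-end) / end)
    for i in range(1, n):
        t = v + i * step
        total += math.exp(-t) / t
    return total * step


def log_mmse_gain(xi, integral_value):
    """Gain of formulas (6) and (7) for a given a priori SNR and integral value."""
    return xi / (1.0 + xi) * math.exp(0.5 * integral_value)


# Small input: the truncated series is accurate, the coarse trapezoid rule is not.
small_taylor, small_numeric = e1_taylor(0.01), e1_trapezoid(0.01)
# Large input: the trapezoid rule is accurate, the truncated series is not.
large_taylor, large_numeric = e1_taylor(5.0), e1_trapezoid(5.0)
```

With these settings the series loses accuracy as the input grows because the omitted terms become large, while the trapezoid rule loses accuracy near zero because the integrand changes steeply within a single step.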

Next, in step 315, the result calculated by the Taylor series accumulation method and the result calculated by the numerical integration method are merged.

Specifically, the inaccurate part of the gain function values obtained by Taylor series accumulation in Fig. 4A can be replaced with the gain function values obtained by numerical integration, or the inaccurate part of the gain function values obtained by numerical integration in Fig. 4B can be replaced with those obtained by Taylor series accumulation. Alternatively, any point within the range where both methods are accurate (for example, the point where the two curves of Figs. 4A and 4B are closest) can be taken as a merging threshold; the gain function values calculated by Taylor series accumulation below the merging threshold are then merged with the gain function values calculated by numerical integration above the merging threshold.

Preferably, the merging threshold can be determined by the following method.

First, the gain function values calculated by the Taylor series accumulation method and those calculated by the numerical integration method are subtracted from each other; optionally, the absolute value of the difference is then taken and, optionally, a logarithmic transformation is applied, giving the curve shown in Fig. 5. The input variable corresponding to the minimum of the curve in Fig. 5 is then selected as the merging threshold.

Once the merging threshold has been determined, the gain function values calculated by the Taylor series accumulation method below the merging threshold are merged with the gain function values calculated by the numerical integration method above the merging threshold, as shown in Figs. 4A-4C, thereby obtaining accurate gain function values.
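The threshold selection and merge can be sketched as follows. This is a minimal illustration, not the source's implementation: the two calculations are stood in for by a 10-term power series and a fixed-step trapezoid rule, only the factor exp(E1(υ)/2) of the gain is compared, and the search runs over a hypothetical grid from 0.05 to 5.0. All names and parameter values are assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329


def e1_taylor(v, terms=10):
    """Truncated power series for the integral of exp(-t)/t from v to infinity."""
    total = -EULER_GAMMA - math.log(v)
    power_over_factorial = 1.0
    for n in range(1, terms + 1):
        power_over_factorial *= v / n              # equals v**n / n!
        total -= (-1) ** n * power_over_factorial / n
    return total


def e1_trapezoid(v, step=0.05, span=30.0):
    """Fixed-step trapezoid rule for the same integral on [v, v + span]."""
    n = int(round(span / step))
    end = v + n * step
    total = 0.5 * (math.exp(-v) / v + math.exp(-end) / end)
    for i in range(1, n):
        t = v + i * step
        total += math.exp(-t) / t
    return total * step


def gain_taylor(v):
    return math.exp(0.5 * e1_taylor(v))


def gain_numeric(v):
    return math.exp(0.5 * e1_trapezoid(v))


def find_merge_threshold(grid):
    """Return the grid point where log|gain_taylor - gain_numeric| is smallest."""
    best_v, best_score = grid[0], math.inf
    for v in grid:
        diff = abs(gain_taylor(v) - gain_numeric(v))
        score = math.log(diff) if diff > 0.0 else -math.inf
        if score < best_score:
            best_v, best_score = v, score
    return best_v


def merged_gain(v, threshold):
    """Series result below the threshold, numerical integration at or above it."""
    return gain_taylor(v) if v < threshold else gain_numeric(v)


grid = [0.05 * i for i in range(1, 101)]
threshold = find_merge_threshold(grid)
```

Because the logarithm is monotonic, taking it does not move the minimum; it only makes the dip in the difference curve easier to see when plotted, which is consistent with the text treating the transformation as optional.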

Returning to Fig. 3, after the log-spectral minimum mean-square error estimation has been performed by merging the Taylor series accumulation method and the numerical integration method, the speech spectrum whose noise has been reduced by the estimation is output in step 320.

With the noise suppression method of this embodiment, performing log-spectral minimum mean-square error estimation by merging the Taylor series accumulation method and the numerical integration method achieves the theoretically expected noise removal performance, and thus remedies the inaccuracy that arises when either method is used alone.

Under the same inventive concept, Fig. 6 is a flowchart of a method for extracting speech features according to another embodiment of the present invention. This embodiment is described below with reference to that figure. Description of parts identical to the preceding embodiments is omitted where appropriate.

As shown in Fig. 6, first, in step 601, noisy speech is input; the noisy speech comprises the speech uttered by a speaker and background noise.

Next, in step 605, the noisy speech is transformed into a noisy speech spectrum, for example by using a fast Fourier transform (FFT) to convert the time-domain speech into a frequency-domain speech spectrum.
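The transform in step 605 can be sketched as follows. This is a minimal illustration only: a direct discrete Fourier transform is used in place of an FFT so that the example needs nothing beyond the standard library, and no framing or windowing is applied. The function name is hypothetical, and a practical front end would typically use an FFT routine on windowed frames.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2 for one real-valued time-domain frame.

    A direct O(N^2) DFT is used as a stand-in for an FFT; the resulting
    magnitudes are the same.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        spectrum.append(abs(acc))
    return spectrum


# A 16-sample cosine with exactly two cycles per frame concentrates in bin 2.
frame = [math.cos(2.0 * math.pi * 2 * i / 16) for i in range(16)]
spectrum = magnitude_spectrum(frame)
```

The magnitudes returned here play the role of the noisy speech spectrum $R_k$ that the noise suppression step operates on.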

Next, in step 610, the noise in the noisy speech spectrum is reduced using the noise suppression method described above for the embodiment of Figs. 1 and 2. That method performs log-spectral minimum mean-square error estimation according to formula (9) above, with a piecewise linear function replacing the gain function. The noise reduction process is the same as in the embodiment above and is not repeated here.

Alternatively, the noise in the noisy speech spectrum can be reduced using the noise suppression method described above for the embodiment of Figs. 3 to 5. That method performs log-spectral minimum mean-square error estimation according to formula (8) above, merging the Taylor series accumulation method and the numerical integration method. The noise reduction process is the same as in the embodiment above and is not repeated here.

Finally, in step 615, speech features are extracted from the noise-reduced speech spectrum. Specifically, the speech features can be extracted by conventional methods such as Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC); the present invention places no particular restriction on this.

As the above description shows, the method for extracting speech features of this embodiment can reduce noise, before the speech features are extracted from the noisy speech spectrum, by performing log-spectral minimum mean-square error estimation according to formula (9) above. Replacing the gain function with a piecewise linear function greatly reduces the computational cost of the estimation while preserving the noise suppression performance. The quality of the speech features can therefore be improved.

Alternatively, the method for extracting speech features of this embodiment can reduce noise, before the speech features are extracted from the noisy speech spectrum, by performing log-spectral minimum mean-square error estimation according to formula (8) above. Merging the Taylor series accumulation method and the numerical integration method achieves the theoretically expected noise removal performance and remedies the inaccuracy that arises when either method is used alone. The quality of the speech features can therefore be improved.

Under the same inventive concept, Fig. 7 is a flowchart of a speech recognition method according to another embodiment of the present invention. This embodiment is described below with reference to that figure. Description of parts identical to the preceding embodiments is omitted where appropriate.

As shown in Fig. 7, first, in step 701, speech features are extracted using the method for extracting speech features described above for the embodiment of Fig. 6. The extraction process is the same as in the embodiment above and is not repeated here.

Next, in step 705, speech recognition is performed according to the extracted speech features. Specifically, for example, the extracted speech features are compared with templates trained in advance, thereby identifying the content of the speech; the present invention places no particular restriction on this.

As the above description shows, the speech recognition method of this embodiment can reduce noise, before the speech features are extracted from the noisy speech spectrum, by performing log-spectral minimum mean-square error estimation according to formula (9) above. Replacing the gain function with a piecewise linear function greatly reduces the computational cost of the estimation while preserving the noise suppression performance, so the quality of the speech features can be improved. The performance of speech recognition can therefore be improved.

Alternatively, the speech recognition method of this embodiment can reduce noise, before the speech features are extracted from the noisy speech spectrum, by performing log-spectral minimum mean-square error estimation according to formula (8) above. Merging the Taylor series accumulation method and the numerical integration method achieves the theoretically expected noise removal performance and remedies the inaccuracy that arises when either method is used alone. The performance of speech recognition can therefore be improved.

Under same inventive concept, Fig. 8 is the process flow diagram of the method for training utterance model according to another embodiment of the invention.Below just in conjunction with this figure, present embodiment is described.For those parts identical, suitably omit its explanation with front embodiment.

As shown in Figure 8, at first,, with reference to the method for the described extraction phonetic feature of the embodiment of figure 6, extract phonetic feature above utilizing in step 801.Identical in concrete leaching process and the foregoing description do not repeat them here.

Then, in step 805,, train described speech model according to the described phonetic feature that extracts.

By above explanation as can be known, in the method for the training utterance model of present embodiment, can be before from contain the noise speech manual, extracting phonetic feature, carry out the logarithmic spectrum least mean-square error by above-mentioned formula (9) and estimate to reduce noise, wherein utilize piecewise linear function to replace gain function, greatly reduced the calculated amount that the logarithmic spectrum least mean-square error is estimated, kept the squelch performance simultaneously, thereby can improve the quality of phonetic feature.Therefore, can improve the quality of the model that trains.

In addition, alternatively, the method for training a speech model of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (8) above. Because the estimation combines the Taylor series accumulation method with the numerical integration method, the theoretically expected noise-removal performance can be achieved, which remedies the inaccuracy of using either method alone. The quality of the trained model can therefore be improved.

Under the same inventive concept, Fig. 9 is a block diagram of a noise suppression device according to an embodiment of the invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to those of the preceding embodiments are omitted where appropriate.

As shown in Fig. 9, the noise suppression device 900 of this embodiment, which operates on a noisy speech spectrum, comprises a log-spectral minimum mean-square error estimation unit 901. According to a noise estimate spectrum, this unit performs log-spectral minimum mean-square error estimation on the noisy speech spectrum so as to reduce the noise in the noisy speech spectrum. The log-spectral minimum mean-square error estimation unit 901 replaces the gain function with a piecewise linear function and performs the estimation according to formula (9) above. The details are the same as in the description of the noise suppression method in the embodiment described above with reference to Figs. 1 and 2, and are not repeated here.

The noise suppression device 900 of this embodiment may further comprise a breakpoint storage unit 905 for storing the breakpoints (segmentation points) of the piecewise linear function, and a noise estimate storage unit 910 for storing a noise estimate obtained by estimating the background noise in advance. Alternatively, the noise estimate may be supplied to the log-spectral minimum mean-square error estimation unit 901 from outside.

As the above description shows, because the noise suppression device 900 of this embodiment replaces the gain function with a piecewise linear function, the computational cost of log-spectral minimum mean-square error estimation is greatly reduced while noise suppression performance is maintained.
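
The following minimal Python sketch illustrates how an estimation unit of this kind could apply a stored piecewise linear function, following the relation \hat{A}_{k} = L(\upsilon_{k}) R_{k} with \upsilon_{k} = \frac{\xi_{k}}{1 + \xi_{k}} \gamma_{k} recited in claim 4. The function names, the clamping outside the stored range, and the breakpoint table are illustrative assumptions and are not taken from the patent. The a priori and a posteriori signal-to-noise ratios are assumed to be supplied by the caller.

```python
from bisect import bisect_right

# Hypothetical breakpoint table: positions on the v axis and the values of
# L(v) at those positions. The numbers are placeholders for illustration only.
BREAKPOINTS = [0.0, 1.0, 4.0]
VALUES = [0.1, 0.5, 1.0]


def piecewise_linear(breakpoints, values, v):
    """Evaluate L(v) by linear interpolation between stored breakpoints.

    Outside the stored range the end values are held constant (an assumption).
    """
    if v <= breakpoints[0]:
        return values[0]
    if v >= breakpoints[-1]:
        return values[-1]
    i = bisect_right(breakpoints, v) - 1      # segment index containing v
    x0, x1 = breakpoints[i], breakpoints[i + 1]
    y0, y1 = values[i], values[i + 1]
    return y0 + (y1 - y0) * (v - x0) / (x1 - x0)


def suppress_noise(noisy_spectrum, prior_snr, post_snr, breakpoints, values):
    """Apply A_hat_k = L(v_k) * R_k with v_k = xi_k / (1 + xi_k) * gamma_k."""
    cleaned = []
    for r, xi, gamma in zip(noisy_spectrum, prior_snr, post_snr):
        v = xi / (1.0 + xi) * gamma
        cleaned.append(piecewise_linear(breakpoints, values, v) * r)
    return cleaned


# Two spectral components with caller-supplied SNR values.
cleaned = suppress_noise([2.0, 3.0], [1.0, 3.0], [2.0, 4.0], BREAKPOINTS, VALUES)
```

Each spectral component costs one table search and one linear interpolation, which is where the reduction in computation relative to evaluating the gain function directly would come from.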

Under the same inventive concept, Fig. 10 is a block diagram of a noise suppression device according to another embodiment of the invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to those of the preceding embodiments are omitted where appropriate.

As shown in Fig. 10, the noise suppression device 1000 of this embodiment, which operates on a noisy speech spectrum, comprises a log-spectral minimum mean-square error estimation unit 1001. According to a noise estimate spectrum, this unit performs log-spectral minimum mean-square error estimation on the noisy speech spectrum so as to reduce the noise in the noisy speech spectrum. The details are the same as in the description of the noise suppression method in the embodiment described above with reference to Figs. 3 to 5, and are not repeated here.

Specifically, the log-spectral minimum mean-square error estimation unit 1001 comprises a Taylor series accumulation calculation unit 10011, which uses formula (8) and computes the gain function by Taylor series accumulation to perform the estimation, yielding the curve shown in Fig. 4A. The Taylor series accumulation calculation unit 10011 used in this embodiment may be any device known to those skilled in the art that can perform Taylor series accumulation. The invention places no limitation on it, and it is not described further here.

As Fig. 4A shows, the gain function values computed by the Taylor series accumulation calculation unit 10011 are very accurate when the input variable is small, but inaccurate when the input variable is large.

In addition, the log-spectral minimum mean-square error estimation unit 1001 comprises a numerical integration calculation unit 10012, which uses formula (8) and computes the gain function by numerical integration to perform the estimation, yielding the curve shown in Fig. 4B. The numerical integration calculation unit 10012 used in this embodiment may be any device known to those skilled in the art that can perform numerical integration. The invention places no limitation on it, and it is not described further here.

In Fig. 4 B as can be seen, opposite with the result who is calculated by Taylor series accumulation calculating unit 10011, when input variable was big, the gain function value that is calculated by numerical integration computing unit 10012 was very accurate, and in input variable hour, the gain function value out of true that calculates.

In addition, the log-spectral minimum mean-square error estimation unit 1001 comprises a merging unit (combination unit) 10013 for merging the results computed by the Taylor series accumulation calculation unit 10011 with the results computed by the numerical integration calculation unit 10012.

Specifically, the inaccurate portion of the gain function values computed by the Taylor series accumulation calculation unit 10011 in Fig. 4A may be replaced with gain function values computed by the numerical integration calculation unit 10012, or the inaccurate portion of the gain function values computed by the numerical integration calculation unit 10012 in Fig. 4B may be replaced with gain function values computed by the Taylor series accumulation calculation unit 10011. Alternatively, any point within the range in which both units are accurate (for example, the point at which the two curves of Figs. 4A and 4B are closest) may be taken as a merge threshold. The gain function values computed by the Taylor series accumulation calculation unit 10011 below the merge threshold are then merged with the gain function values computed by the numerical integration calculation unit 10012 above the merge threshold.

Preferably, the merging unit 10013 comprises: a subtraction unit, which takes the difference between the gain function values computed by the Taylor series accumulation calculation unit 10011 and those computed by the numerical integration calculation unit 10012; an optional absolute-value operation unit (absolute operation unit), which takes the absolute value of the result obtained by the subtraction unit; an optional logarithmic operation unit, which applies a logarithmic transform to the result obtained by the absolute-value operation unit, yielding the curve shown in Fig. 3; and a selection unit, which selects the input variable corresponding to the minimum of the curve of Fig. 3 as the merge threshold described above.

Once the merge threshold has been determined, the merging unit 10013 merges the gain function values computed by the Taylor series accumulation calculation unit 10011 below the merge threshold with the gain function values computed by the numerical integration calculation unit 10012 above the merge threshold, as shown in Figs. 4A-4C, thereby obtaining accurate gain function values.

In the noise suppression device 1000 of this embodiment, the Taylor series accumulation calculation unit 10011, the numerical integration calculation unit 10012 and the merging unit 10013 combine the Taylor series accumulation method with the numerical integration method to perform log-spectral minimum mean-square error estimation. The theoretically expected noise-removal performance can thus be achieved, which remedies the inaccuracy of using either method alone.
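
A minimal Python sketch of the merging procedure described above follows. It assumes that the gain function depends on the exponential integral E1(v) = \int_{v}^{\infty} e^{-t}/t \, dt, as in the standard log-spectral minimum mean-square error estimator, and it applies the two computation methods to that term only. This choice of target function, the series length, the integration step, the upper integration limit and the search grid are all assumptions of the example, not values from the patent.

```python
import math

EULER_GAMMA = 0.5772156649015329


def e1_taylor(x, terms=30):
    """Truncated series E1(x) = -gamma - ln(x) - sum_{n>=1} (-x)^n / (n * n!).

    Accurate for small x; the truncation error grows quickly for large x.
    """
    total = -EULER_GAMMA - math.log(x)
    term = 1.0                       # holds (-x)^n / n!
    for n in range(1, terms + 1):
        term *= -x / n
        total -= term / n
    return total


def e1_numeric(x, upper=60.0, step=0.05):
    """Trapezoidal integration of exp(-t)/t from x to a finite upper limit.

    Accurate for large x; the fixed step is too coarse near t = 0.
    """
    n = max(1, math.ceil((upper - x) / step))
    h = (upper - x) / n

    def f(t):
        return math.exp(-t) / t

    total = 0.5 * (f(x) + f(upper))
    for i in range(1, n):
        total += f(x + i * h)
    return total * h


def find_merge_threshold(grid):
    """Subtract, take the absolute value, take the log, and pick the minimum."""
    log_gap = [math.log(abs(e1_taylor(x) - e1_numeric(x)) + 1e-300) for x in grid]
    return grid[log_gap.index(min(log_gap))]


def e1_merged(x, threshold):
    """Taylor result below the threshold, numerical integration at or above it."""
    return e1_taylor(x) if x < threshold else e1_numeric(x)


grid = [0.05 * i for i in range(1, 401)]     # 0.05 .. 20.0
threshold = find_merge_threshold(grid)
```

With these settings the two results disagree strongly at both ends of the grid and agree most closely somewhere in between, which is the point the selection step picks as the merge threshold.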

Under the same inventive concept, Fig. 11 is a block diagram of a device for extracting speech features according to another embodiment of the invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to those of the preceding embodiments are omitted where appropriate.

As shown in Fig. 11, the device 1100 for extracting speech features of this embodiment comprises: an input unit 1501, which inputs noisy speech; a transform unit 1105, which transforms the noisy speech into a noisy speech spectrum; the noise suppression device 900 or the noise suppression device 1000 described above, which reduces the noise in the noisy speech spectrum; and an extraction unit 1110, which extracts the speech features from the noise-reduced speech spectrum. The details are the same as in the description of the method for extracting speech features in the embodiment described above with reference to Fig. 6, and are not repeated here.

As the above description shows, the device 1100 for extracting speech features of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (9) above. Because a piecewise linear function replaces the gain function, the computational cost of the estimation is greatly reduced while noise suppression performance is maintained. The quality of the speech features can therefore be improved.

In addition, the device 1100 for extracting speech features of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (8) above. Because the estimation combines the Taylor series accumulation method with the numerical integration method, the theoretically expected noise-removal performance can be achieved, which remedies the inaccuracy of using either method alone. The quality of the speech features can therefore be improved.

Under the same inventive concept, Fig. 12 is a block diagram of a speech recognition device according to another embodiment of the invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to those of the preceding embodiments are omitted where appropriate.

As shown in Fig. 12, the speech recognition device 1200 of this embodiment comprises: the device 1100 for extracting speech features described above, which extracts speech features; and a speech recognition unit 1201, which performs speech recognition on the extracted speech features. The details are the same as in the description of the speech recognition method in the embodiment described above with reference to Fig. 7, and are not repeated here.

As the above description shows, the speech recognition device 1200 of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (9) above. Because a piecewise linear function replaces the gain function, the computational cost of the estimation is greatly reduced while noise suppression performance is maintained. The performance of speech recognition can therefore be improved.

In addition, the speech recognition device 1200 of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (8) above. Because the estimation combines the Taylor series accumulation method with the numerical integration method, the theoretically expected noise-removal performance can be achieved, which remedies the inaccuracy of using either method alone. The performance of speech recognition can therefore be improved.

Under the same inventive concept, Fig. 13 is a block diagram of a device for training a speech model according to another embodiment of the invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to those of the preceding embodiments are omitted where appropriate.

As shown in Fig. 13, the device 1300 for training a speech model of this embodiment comprises: the device 1100 for extracting speech features described above, which extracts speech features; and a model training unit 1301, which trains the speech model on the extracted speech features. The details are the same as in the description of the method for training a speech model in the embodiment described above with reference to Fig. 8, and are not repeated here.

As the above description shows, the device 1300 for training a speech model of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (9) above. Because a piecewise linear function replaces the gain function, the computational cost of the estimation is greatly reduced while noise suppression performance is maintained, which improves the quality of the speech features. The quality of the trained model can therefore be improved.

In addition, alternatively, the device 1300 for training a speech model of this embodiment may, before extracting speech features from the noisy speech spectrum, reduce noise by performing log-spectral minimum mean-square error estimation according to formula (8) above. Because the estimation combines the Taylor series accumulation method with the numerical integration method, the theoretically expected noise-removal performance can be achieved, which remedies the inaccuracy of using either method alone. The quality of the trained model can therefore be improved.

Although the noise suppression method, the method for extracting speech features, the speech recognition method and the method for training a speech model of the invention, as well as the noise suppression device, the device for extracting speech features, the speech recognition device and the device for training a speech model, have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various changes and modifications within the spirit and scope of the invention. The invention is therefore not limited to these embodiments, and its scope is defined solely by the claims.

Claims

1. A noise suppression method for a noisy speech spectrum, comprising:

performing log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum;

wherein the log-spectral minimum mean-square error estimation is performed with a piecewise linear function in place of a gain function.

2. The noise suppression method according to claim 1, wherein the gain function is converted into the piecewise linear function using predefined breakpoints, and the log-spectral minimum mean-square error estimation is performed.

3. The noise suppression method according to claim 2, wherein the predefined breakpoints of the piecewise linear function are obtained by the following steps:

calculating the derivative of the gain function;

setting initial breakpoints of the piecewise linear function;

calculating, between every two consecutive breakpoints of the initial breakpoints, the difference between the piecewise linear function and the gain function;

if the difference is greater than a threshold, inserting a new breakpoint between the two consecutive breakpoints; and

repeating the step of calculating the difference and the subsequent steps until no difference is greater than the threshold.
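
The breakpoint refinement recited in claim 3 can be sketched in Python as follows. This is a hedged illustration only: the claim does not state how the derivative is used or where within a segment the new breakpoint is placed, so the sketch measures the difference by sampling each segment and inserts the new breakpoint at the sampled position of largest difference. The function `demo_gain` is a hypothetical stand-in and is not the gain function of the patent.

```python
import math


def demo_gain(v):
    """Hypothetical smooth stand-in for the gain function (illustration only)."""
    return 1.0 - math.exp(-v)


def refine_breakpoints(gain, initial_points, threshold, samples=16, max_rounds=50):
    """Insert breakpoints until the sampled chord-to-gain difference on every
    segment is no greater than the threshold."""
    points = sorted(initial_points)
    for _ in range(max_rounds):
        new_points = [points[0]]
        inserted = False
        for a, b in zip(points, points[1:]):
            ga, gb = gain(a), gain(b)
            worst_x, worst_gap = None, 0.0
            for j in range(1, samples):
                x = a + (b - a) * j / samples
                chord = ga + (gb - ga) * (x - a) / (b - a)   # piecewise linear value
                gap = abs(chord - gain(x))
                if gap > worst_gap:
                    worst_x, worst_gap = x, gap
            if worst_gap > threshold:
                new_points.append(worst_x)   # new breakpoint between a and b
                inserted = True
            new_points.append(b)
        points = new_points
        if not inserted:                     # no difference exceeds the threshold
            break
    return points


breakpoints = refine_breakpoints(demo_gain, [0.0, 10.0], threshold=0.01)
```

Breakpoints produced this way cluster where the stand-in function curves most and stay sparse where it is nearly straight, so the table stays small for a given error bound.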

4. The noise suppression method according to any one of claims 1-3, wherein the log-spectral minimum mean-square error estimation is performed by the following formula:

\hat{A}_{k} = L(\upsilon_{k}) R_{k},

where

\upsilon_{k} = \frac{\xi_{k}}{1 + \xi_{k}} \gamma_{k},

where \hat{A}_{k} denotes the noise-suppressed speech spectrum, R_{k} denotes the noisy speech spectrum, \xi_{k} is the a priori signal-to-noise ratio obtained from the noise estimate spectrum, \gamma_{k} is the a posteriori signal-to-noise ratio obtained from the noise estimate spectrum and the noisy speech spectrum, L(\upsilon_{k}) is the piecewise linear function, and k denotes the k-th spectral component.

5. A noise suppression method for a noisy speech spectrum, comprising:

performing log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum;

wherein the log-spectral minimum mean-square error estimation is performed by calculating a gain function through the following steps:

calculating the gain function by Taylor series accumulation;

calculating the gain function by numerical integration; and

merging the result of the Taylor series accumulation and the result of the numerical integration.

6. The noise suppression method according to claim 5, wherein the merging step comprises: merging the result of the Taylor series accumulation and the result of the numerical integration at the point where they are closest to each other.

7. The noise suppression method according to claim 6, wherein the merging step comprises:

subtracting the result of the Taylor series accumulation and the result of the numerical integration from each other;

selecting, as a threshold, the value at which the absolute value of the subtraction result is smallest; and

merging the result of the Taylor series accumulation and the result of the numerical integration according to the threshold.

8. The noise suppression method according to claim 7, wherein the merging step comprises merging the result of the Taylor series accumulation below the threshold with the result of the numerical integration above the threshold.

9. A method for extracting speech features, comprising:

transforming noisy speech into a noisy speech spectrum;

reducing the noise in the noisy speech spectrum using the noise suppression method according to any one of claims 1-8; and

extracting speech features from the noise-reduced speech spectrum.

10. The method for extracting speech features according to claim 9, wherein the transforming step comprises a fast Fourier transform.

11. A speech recognition method, comprising:

extracting speech features using the method for extracting speech features according to claim 9 or 10; and

recognizing speech on the basis of the extracted speech features.

12. A method for training a speech model, comprising:

training the speech model on the extracted speech features.

13. A noise suppression device for a noisy speech spectrum, comprising:

an estimation unit, which performs log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum;

wherein the estimation unit performs the log-spectral minimum mean-square error estimation with a piecewise linear function in place of a gain function.

14. The noise suppression device according to claim 13, wherein the gain function is converted into the piecewise linear function using predefined breakpoints, and the log-spectral minimum mean-square error estimation is performed.

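15. The noise suppression device according to claim 13 or 14, wherein the estimation unit performs the log-spectral minimum mean-square error estimation by the following formula:

\hat{A}_{k} = L(\upsilon_{k}) R_{k},

where

\upsilon_{k} = \frac{\xi_{k}}{1 + \xi_{k}} \gamma_{k},

where \hat{A}_{k} denotes the noise-suppressed speech spectrum, R_{k} denotes the noisy speech spectrum, \xi_{k} is the a priori signal-to-noise ratio obtained from the noise estimate spectrum, \gamma_{k} is the a posteriori signal-to-noise ratio obtained from the noise estimate spectrum and the noisy speech spectrum, L(\upsilon_{k}) is the piecewise linear function, and k denotes the k-th spectral component.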

16. A noise suppression device for a noisy speech spectrum, comprising:

an estimation unit, which performs log-spectral minimum mean-square error estimation on the noisy speech spectrum according to a noise estimate spectrum, so as to reduce the noise in the noisy speech spectrum;

wherein the estimation unit comprises:

a Taylor series accumulation calculation unit, which calculates the gain function by Taylor series accumulation;

a numerical integration calculation unit, which calculates the gain function by numerical integration; and

a merging unit for merging the result calculated by the Taylor series accumulation calculation unit and the result calculated by the numerical integration calculation unit.

17. The noise suppression device according to claim 16, wherein the merging unit merges the result calculated by the Taylor series accumulation calculation unit and the result calculated by the numerical integration calculation unit at the point where they are closest to each other.

18. The noise suppression device according to claim 17, wherein the merging unit comprises:

a subtraction unit, which subtracts the result calculated by the Taylor series accumulation calculation unit and the result calculated by the numerical integration calculation unit from each other; and

a selection unit for selecting, as a threshold, the value at which the absolute value of the result obtained by the subtraction unit is smallest;

wherein the merging unit merges the result calculated by the Taylor series accumulation calculation unit and the result calculated by the numerical integration calculation unit according to the threshold.

19. The noise suppression device according to claim 18, wherein the merging unit merges the result calculated by the Taylor series accumulation calculation unit below the threshold with the result calculated by the numerical integration calculation unit above the threshold.

20. A device for extracting speech features, comprising:

a transform unit, which transforms noisy speech into a noisy speech spectrum;

the noise suppression device according to any one of claims 13-19, for reducing the noise in the noisy speech spectrum; and

an extraction unit, which extracts the speech features from the noise-reduced speech spectrum.

21. The device for extracting speech features according to claim 20, wherein the transform unit is configured to perform the transform by a fast Fourier transform.

22. A speech recognition device, comprising:

the device for extracting speech features according to claim 20 or 21, for extracting speech features; and

a speech recognition unit, which recognizes speech on the basis of the extracted speech features.

23. A device for training a speech model, comprising:

a model training unit, which trains the speech model on the extracted speech features.