
EP1944754B1 - Speech fundamental frequency estimator and method for estimating a speech fundamental frequency - Google Patents

Info

Publication number
EP1944754B1
Authority
EP
European Patent Office
Prior art keywords
values
fundamental frequency
correlation function
speech fundamental
power density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Not-in-force
Application number
EP07000568.1A
Other languages
German (de)
French (fr)
Other versions
EP1944754A1 (en
Inventor
Mohamed Krini
Gerhard Schmidt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Nuance Communications Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuance Communications Inc
Priority to EP07000568.1A
Publication of EP1944754A1
Application granted
Publication of EP1944754B1
Legal status: Not-in-force
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90: Pitch determination of speech signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02168: Noise filtering characterised by the method used for estimating noise, the estimation exclusively taking place during speech pauses

Definitions

  • This invention relates to speech analysis systems and especially to a speech fundamental frequency estimator and a method for estimating a speech fundamental frequency.
  • DFT: discrete Fourier transform
  • the corresponding spectrum shows distinct amplitude peaks which are located equidistantly in frequency (see for example Fig. 1 ).
  • the distance between two amplitude peaks represents herein the speech fundamental frequency, which depends on the speaker.
  • For men, this frequency varies between 80 Hz and 150 Hz.
  • women and children, in contrast, have a higher speech fundamental frequency, which varies between 150 Hz and 300 Hz for women and between 200 Hz and 600 Hz for children.
  • a reliable estimate of the speech fundamental frequency is often not easy to obtain.
  • Difficulties arise mainly in detecting low speech fundamental frequencies, which in most cases occur with male speakers.
  • In Fig. 2, a block diagram of a multi-rate system for speech reconstruction, with an analysis and a synthesis filter bank for the signal processing, is shown.
  • the speech fundamental frequency estimation is shown as a separate functional block.
  • the aim of such an application is to extract parameters from a distorted speech signal $y(n)$, for example the spectral envelope, the type of excitation (voiced/unvoiced) and the speech fundamental frequency $f_p(n)$. Subsequently, an undistorted speech signal $x(n)$ is resynthesized from these parameters. For this purpose, a very precise and reliable estimation of the speech fundamental frequency is necessary.
  • the output signal $x(n)$ after the synthesis filter bank should be nearly error-free; the following condition is therefore very desirable: $x(n) \approx s(n)$, where $s(n)$ denotes the undisturbed speech signal.
  • Figure 3 shows a block diagram of a signal analysis system with subsequent feature extraction and speech fundamental frequency estimation, in order to perform speech recognition.
  • An adequate estimation of the speech fundamental frequency can, for example, contribute significantly to improving the recognition rates of the speech recognizer.
  • the speech fundamental frequency estimator is configured for receiving a first set of values and a second set of values, the first set of values being a frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the speech fundamental frequency estimator comprising:
  • the analyzer is further configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
  • a method for estimating a speech fundamental frequency using a first set of values and a second set of values, the first set of values being a received frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a received frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the method for estimating the speech fundamental frequency comprising the steps of:
  • the step of determining the speech fundamental frequency estimate comprises performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
  • This first aspect of the invention is based on the finding that utilizing the first and second sets of values, which originate from sets of time domain signal values in time intervals that are offset from each other, results in a total analyzed signal portion which is larger than a single signal portion, for example the first or the second time interval alone.
  • a temporally longer signal portion by means of existing (short) time-frequency-transformed signals, without the need to provide a new time-frequency-transform just for the estimation of the speech fundamental frequency.
  • the first spectrum represents the spectrum over the longer time interval whereas the second spectrum serves the purpose to determine the characteristics of the second set of values in order to compensate errors in the first spectrum. Therefore it is necessary not only to calculate the first spectrum but also to calculate the second spectrum.
  • the approach according to the first aspect of the invention provides the advantage that a signal given in a time-frequency-transformed version (provided for applications other than speech fundamental frequency estimation) can also be used for speech fundamental frequency estimation (even in the case where the time-frequency-transformed version of the signal would normally not be appropriate for providing a precise speech fundamental frequency estimation).
  • a speech fundamental frequency estimator which is configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the speech fundamental frequency estimator comprising:
  • a method for estimating a speech fundamental frequency is provided, the method being configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the method comprising the steps of:
  • the second aspect is based on the finding that a significant improvement in the precision of speech fundamental frequency estimation can be realized when background noise is adequately compensated. This is especially the case in a scenario where erroneous detections of speech occur in speech pauses, which then falsify the detected result and, in consequence, decrease the reliability of the detected speech fundamental frequency.
  • the second aspect thus provides the advantage that by simple means, for example a pause detector or just a further analysis of the already existing signal frames, a significant improvement in precision and reliability of the estimated speech fundamental frequency can be obtained.
  • the speech fundamental frequency estimator is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent sets of time domain signal values having overlapping time intervals.
  • the speech fundamental frequency estimator is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent time domain signal values having time intervals overlapping by at least 25 percent. This allows the speech fundamental frequency estimate to be determined reliably, as the first and second sets of values belong to time domain signal values which have a sufficiently overlapping interval structure. Therefore, due to the sufficient overlap of both time intervals, such an estimation can be considered to be an estimation over the "longer" time interval.
  • the speech fundamental frequency estimator is characterized in that the second power density spectrum calculator is configured for providing a conjugate complex version of the second set of values to the first power density spectrum calculator and wherein the first power density spectrum calculator is configured for using the provided conjugate complex version of the second set of values as the version with which the stored version of the first set of values is to be multiplied.
  • the speech fundamental frequency estimator is characterized in that the analyzer is configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values and for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, wherein the analyzer is furthermore configured for determining a set of normalization values and a set of weighting values from the second power density spectrum and for using the set of normalization values and the set of weighting values in the first and second frequency-time-transform and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
  • the speech fundamental frequency estimator according to a further embodiment can be characterized in that the analyzer further comprises a compensator being configured for adaptively compensating the values of the first set of correlation function values by a correction factor being based on a value of the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the compensated first set of correlation function values and the second set of correlation function values.
  • the speech fundamental frequency estimator can be characterized in that the compensator is configured for multiplying the second set of correlation function values by a lower bounded quotient between a value of the first set of correlation function values and a value of the second set of correlation function values in order to obtain said compensated first set of correlation function values.
  • the speech fundamental frequency estimator is characterized in that the analyzer is configured for combining the compensated first set of correlation function values and the second set of correlation function values in order to obtain an extended set of correlation function values, wherein the values of the extended set of correlation function values assume corresponding values from the compensated first set of correlation function values, the second set of correlation function values or values between the compensated first set of correlation function values and the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of said extended set of correlation function values.
  • the extended set of correlation function values comprises now information from the first as well as the second set of correlation function values such that an estimation of the speech fundamental frequency can be based on the information comprised in the first and second time interval as well as a correction of possible errors is also possible by the information of the second time interval. Furthermore, it is also possible to perform a weighting of the values of the first set of correlation function values in contrast to the values of the second set of correlation function values in order to take into account the influence of an offset between the first set of correlation function values (respectively the compensated set of correlation function values) and the second set of correlation function values.
  • the speech fundamental frequency estimator is characterized in that the analyzer is configured for determining the speech fundamental frequency estimate by searching the index of a maximum value from the extended set of correlation function values within a predetermined number of indices of the values of the extended set of correlation values, from the first or second set of correlation function values within a predetermined number of indices of values of the first respectively second set of correlation function values or from the compensated first set of correlation function values within the predetermined number of indices of values of the compensated first set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate as the product of a sampling frequency and a reciprocal value of said searched index.
  • the speech fundamental frequency estimator is characterized in that the analyzer is furthermore configured for determining a reliability factor for the determined speech fundamental frequency estimate and for blocking an output of the determined speech fundamental frequency estimate in the case the determined reliability factor for the determined speech fundamental frequency estimate is below a predetermined reliability factor.
  • the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for determining said reliability factor by dividing the maximum value at said searched index by the first value of the extended set of correlation function values or, respectively the first, the compensated first or second set of correlation function values.
  • the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining an estimate of the power density spectrum of background noise and for determining a noise suppression factor on the basis of said power density spectrum of background noise, and wherein the analyzer is configured for multiplying the first and second power density spectrum with said noise suppression factor prior to the frequency-time-transform of the first respectively second power density spectrum.
  • the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum. This ensures that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
  • the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This provides an efficient and numerically simple way of determining the estimate of the power density spectrum of background noise.
  • the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for reestimating the speech fundamental frequency estimate in the case the determined speech fundamental frequency estimate is below a predefined frequency value, wherein the analyzer is configured for performing the reestimation by searching a further index of a further maximum value of the extended set of correlation function values, the first or second set of correlation function values or the compensated first set of correlation function values within a further number of values of said sets of correlation function values and for outputting a product of a sampling frequency and a reciprocal value of said further index as the determined speech fundamental frequency estimate.
  • This provides a further improvement of the speech fundamental frequency estimate, especially in the case when the determined estimate is below said predefined frequency (which means that the estimate may not be as reliable as desired).
  • Such a use of the doubled speech fundamental frequency estimate from a previous estimation broadens the region to be searched and thus strengthens the reliability and preciseness of the outputted estimate.
  • the speech fundamental frequency estimator can be characterized in that the analyzer is configured for outputting said product as the determined speech fundamental frequency estimate only in the case the value of the autocorrelation function at the further index is larger than 60 percent of the value of the autocorrelation function at the previously searched maximal index and a value of the extended set of correlation function values at said further index is larger than a previously defined amplitude value. This further strengthens the validity of the outputted speech fundamental frequency estimate, as two separate conditions have to be fulfilled before the result is output.
  • the speech fundamental frequency estimator in a further embodiment can be characterized in that the analyzer is configured for modifying a speech fundamental period corresponding to said determined speech fundamental frequency estimate by an interpolation correction term prior to outputting a modified speech fundamental frequency estimate, wherein said interpolation correction term is dependent on values of said first or second set of correlation function values, of said extended set of correlation function values or said compensated first set of correlation function values, respectively.
  • an interpolation approach provides the advantage that the error terms resulting from the use of a discrete time-frequency-transform respectively a frequency-time-transform can be reduced by a processing of the signals after the inverse transform has been performed.
  • the speech fundamental frequency estimator can be characterized by a frequency domain filtering unit being configured for receiving the frequency domain versions of the first and second set of time domain signal values, for frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values, respectively, and for providing said first and second sets of values to the first and second power density spectrum calculator respectively.
  • the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for filtering only frequencies below a predefined limiting frequency. This reduces the computational burden, as only those parts of the spectrum are filtered which are most important for a reliable estimation of very low speech fundamental frequencies.
  • the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for delaying values of said frequency domain versions being above said predefined limiting frequency. This compensates a delay which might be introduced in a signal flow path for filtering signals having a frequency below said limiting frequency.
  • the invention can also be implemented as a computer program having a program code for performing the inventive method, when the computer program runs on a computer.
  • the speech fundamental frequency estimator can be characterized in that the power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum.
  • the second aspect may comprise a speech fundamental frequency estimator being characterized in that the power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This ensures that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
  • the present invention relies mainly on estimation methods based on the autocorrelation function, which are described herein in advance for a better understanding. However, some aspects of the present invention are also implemented in the conventional autocorrelation methods, such that the description in this section is not to be considered as state of the art.
  • the speech signal s(n) will be recorded by a microphone.
  • the weighting function $W(e^{j\Omega_\mu}, n)$ has been chosen such that the attenuation rises with rising frequency. This choice results from the fact that speech has a fundamental frequency structure mainly at low frequencies, which in turn results in an improved estimation of the speech fundamental frequency.
  • In Fig. 4, the functional principle of a method for speech fundamental frequency estimation is shown.
  • the autocorrelation function $\hat{r}_{yy}(m, n)$ is used in order to estimate the speech fundamental frequency $f_p(n)$.
  • the index m describes herein the autocorrelation offset and the index n describes the present frame (under analysis).
  • the preliminary speech fundamental frequency $f'_p(n)$ can be determined by a search of the maximum in a selected range of indices, for example $30 \le m \le 100$.
  • a threshold value of $p_0 \in [0.2, 0.3]$ has turned out to be favourable.
  • the value of the normalized autocorrelation at the location $\hat{\tau}_p(n)$ can be of large significance as reliability information, for example for a speech signal reconstruction.
  • the desired value of the speech fundamental frequency can be traced either slowly or quickly, depending on how reliably a speech fundamental frequency can be estimated.
  • the time-frequency analysis was considered only in the interesting frequency range up to 1000 Hz.
  • the spectral refinement can be used without using the post-processing or the interpolation or the approach having the additional delay correction structure can be used without using the spectral refinement approach.
  • all the individual aspects commonly contribute to a much improved estimation of the speech fundamental frequency and shall be described herein as an embodiment.
  • the newly proposed method uses an additional spectral refinement of the input spectrum $Y(e^{j\Omega_\mu}, n)$.
  • the functional principle of this approach is disclosed in Fig. 6 .
  • FIR: finite impulse response
  • the parameter $\mu$ denotes herein the $\mu$-th frequency sampling point of a short-time spectrum $\tilde{Y}(e^{j\Omega_\mu}, n)$ having a higher resolution, and the parameter M denotes the order of the FIR filters used.
  • a memory length M of the short FIR-filter is chosen between 3 and 5.
  • a spectral refinement in the whole frequency range is not necessary for speech signals.
  • the speech fundamental frequency structure is only present in the lower frequency range; it is therefore sufficient to perform the refinement up to, for example, 1000 Hz. Above this threshold, it is possible to only introduce a delay of (M-1)/2 samples (down-sampled). The numerical effort necessary for such a refinement can thus be kept low.
  • In Fig. 7, the analysis-synthesis system with additional calculation of the spectral refinement in a low frequency range is shown.
  • In Fig. 8, the analysis of the autocorrelation as well as the time-frequency analysis with spectral refinement is shown.
  • As a test signal, the same combination of sinusoidal signals has been used, with a frequency spacing varying from 300 Hz to 60 Hz.
  • the black graph in the upper diagram of Fig. 8 and the white graph in the lower diagram of Fig. 8 show the estimated pitch period duration and the estimate of the speech fundamental frequency, respectively, when using the spectral refinement approach.
  • Fig. 9A shows a block diagram of an embodiment of a speech fundamental frequency estimator 900.
  • the speech fundamental frequency estimator 900 comprises a power density spectrum calculator 902 and an analyzer 904.
  • the power density spectrum calculator 902 has 2 inputs, one for receiving a set of values and one for receiving background noise information.
  • the set of values $\tilde{Y}_1$ is a frequency-domain representation of a set of time domain signal values $y_1$ in a time interval $t_1$.
  • the background noise information can for example be determined in speech pauses in which only a noise signal and no speech signal is provided to the power density spectrum calculator 902.
  • the power density spectrum calculator 902 has 2 outputs, one for outputting a noise suppression factor $V(e^{j\Omega_\mu}, n)$ and one for outputting values of a power density spectrum.
  • the analyzer 904 has 2 inputs for receiving both of the outputs of the power density spectrum calculator 902.
  • the analyzer 904 furthermore has one output for outputting the determined speech fundamental frequency $f_p$.
  • the function of the speech fundamental frequency estimator 900 shall be described in more detail with reference to Fig. 9B .
  • In Fig. 9B, a flow diagram of a method for estimating the speech fundamental frequency is disclosed.
  • the method 940 comprises a first step 950 in which a power density spectrum is provided by multiplying a version of the set of values $\tilde{Y}_2$ with a complex conjugate version of the second set of values.
  • In a second step 952, an estimate of a power density spectrum of background noise is determined.
  • the background noise information is used which may originate for example from a speech pause detector or other means which provide only information about the background noise in the absence of speech.
  • a noise suppression factor is determined which is explained in more detail below.
  • a multiplication of the power density spectrum with the noise suppression factor $V(e^{j\Omega_\mu}, n)$ is performed before a frequency-time-transform is accomplished in a fifth step 958.
  • In a sixth step 960, the speech fundamental frequency is determined from the frequency-time-transformed signal resulting from step 958.
  • $\hat{S}_{nn}(\Omega_\mu, n)$ denotes an estimate of the auto power density spectrum of a disturbance (background noise), $V_0$ describes a maximal attenuation and the parameter $\beta$ is used for overestimating the power density spectrum of the disturbance. Because the disturbance can be considered to be non-stationary, a short-time estimate has to be used for this disturbance value. However, signal and disturbance are available only as a sum in the microphone signal y(n).
  • the estimate of the power density spectrum of the background noise can be obtained in two different ways: firstly, the power of the microphone signal can be estimated in speech pauses, which requires a speech pause detector; secondly, an estimated value for the power of the disturbance can be determined from the segment-wise estimated minima of the power of the microphone signal.
  • since noise estimation is not the main focus of this patent application, further details are not explained here; however, reference is made to P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006.
  • noise reduction is used as a pre-processing stage for the speech fundamental frequency estimation; that is, instead of the input subband signals $Y(e^{j\Omega_\mu}, n)$, the noise-reduced signals $Y(e^{j\Omega_\mu}, n) \cdot V(e^{j\Omega_\mu}, n)$ are processed.
  • FIG. 10 shows results of the speech fundamental frequency estimation with spectral refinement in terms of time-frequency-analysis with and without noise reduction. All parameters of the methods have been identical to the previously described parameters. As can be seen very clearly, erroneous detections (denoted by black ellipses in the upper diagram of Fig. 10) can be suppressed when the above-mentioned active noise reduction is used. In speech activity passages, nearly nothing changes.
  • Speech fundamental frequency estimation on the basis of a plurality of subband vectors
  • Fig. 11A shows a block diagram of an embodiment of the inventive speech fundamental frequency estimator 1100.
  • the speech fundamental frequency estimator 1100 comprises a first power density spectrum calculator 1102, a second power density spectrum calculator 1104 and an analyzer 1106.
  • the first power density spectrum calculator 1102 and second power density spectrum calculator 1104 are both fed by a common input of width N, on which subsequently a first set of values ⁇ 1 and a second set of values ⁇ 2 is provided.
  • the first set of values ⁇ 1 is a frequency domain representation of a first set of time domain signal values y 1 within a first time interval t 1 .
  • the second set of values ⁇ 2 is a frequency domain representation of a second set of time domain signal values y 2 within a second time interval t 2 .
  • the first power density spectrum calculator 1102 is configured for storing a version of the first set of values and for providing values of a first power density spectrum ⁇ ⁇ ( ⁇ ⁇ ,n ) by multiplying the stored version of the first set of values ⁇ 1 with a complex conjugate version of the second set of values ⁇ 2 .
  • the second power density spectrum calculator 1104 is configured for providing values of a second power density spectrum ⁇ ⁇ ( ⁇ ⁇ ,n ) by multiplying a version of the second set of values with a complex conjugate version of the second set of values.
  • the analyzer 1106 is configured for receiving the first and second power density spectrums of the first respectively second power density spectrum calculator 1102, 1104 and for determining the speech fundamental frequency estimate f p (n) on the basis of the values of the first power density spectrum ⁇ ⁇ d ( ⁇ ⁇ ,n ) and the values of the second power density spectrum ⁇ ⁇ ( ⁇ ⁇ ,n ).
  • Fig. 11B shows the functionality of the speech fundamental frequency estimator as shown in Fig. 11A in more detail.
  • Fig. 11B discloses a method 1140 for estimating the speech fundamental frequency f p (n).
  • first and second sets of values ⁇ 1 and ⁇ 2 are provided, each of which have the number of N individual values (that is a width of N).
  • a first step 1150 a version of the first set of values ⁇ 1 is stored.
  • the stored version of the first set of values ⁇ 1 is multiplied with a version of the second set of values ⁇ 2 , which is fed directly to the multiplication step without a storing step.
  • the result from the multiplication step 1152 is said first power density spectrum ⁇ ⁇ d ( ⁇ ⁇ ,n ).
  • a further step of multiplying 1154 is performed in which versions of the second set of values ⁇ 2 are multiplied with each other, which results in the second power density spectrum.
  • the speech fundamental frequency estimate f p (n) is determined.
  • the inventive approach as shown in Fig. 11A and 11B has the advantage that it is now possible to estimate lower speech fundamental frequencies than would be possible according to the state of the art. This is mainly due to the fact that (conventional, existing) short sets of frequency domain values can be used for a precise speech fundamental frequency estimation, as the multiplication in step 1152 with a stored, that is delayed, version of a previous set of frequency domain values results in a kind of elongated analysis time interval for estimating the low speech fundamental frequency.
  • a further inventive idea it can be seen in the fact that not only the present signal frame y(n) is used for the estimation of the speech fundamental frequency but also a signal frame y(n-d) which is a signal frame delayed by d clock cycles.
  • the present short-time spectrum ⁇ ( e j ⁇ ⁇ ,n ) and the delayed short-time spectrum ⁇ *(e j ⁇ ⁇ ,n-d ) is used.
  • the cross-correlation function r ⁇ ⁇ ,g ( m,n ) is determined according to equation 13.
  • the aim will be to determine an extended autocorrelation function r ⁇ ⁇ ,erw ( k,n ) of order N/2 + r from the autocorrelation function r ⁇ ⁇ d ,g ( m,n ) and the cross-correlation function r ⁇ ⁇ d ,g ( m,n ), each of which having the order N/2.
  • the index k of the term r ⁇ ⁇ ,erw ( k,n ) describes herein the offset of the autocorrelation, wherein the following equation is valid: k ⁇ 0 , ... , N 2 + r ⁇ 1
  • the linear function a(m) was chosen such that with an increasing offset m the weight of the coefficients reduces.
  • the thus obtained extended autocorrelation function r ⁇ ⁇ ,erw ( k,n ) is finally used for the estimation of the speech fundamental frequency.
  • the speech fundamental frequency is determined by a search of the maximum for each single frame in an elongated area - for example in the range 30 ⁇ k ⁇ 180.
  • Fig. 13 two examples for the analysis of the speech fundamental frequency are shown.
  • the left section of Fig. 13 discloses the analysis of the speech fundamental frequency at about 270 Hz whereas in the right section of Fig. 13 the analysis of a speech fundamental frequency at about 60 Hz is shown.
  • the correlation of the present signal frame with itself (left) and with a preceding signal frame (right) is shown in each of the left and right sections of Fig. 13 .
  • the lower graph in each of both sections of Fig. 13 shows the extended autocorrelation function r ⁇ ⁇ ,erw ( k,n ) across an elongated autocorrelation offset which is generated by the composition of both correlation functions r ⁇ ⁇ ,g ( m,n ) and r ⁇ ⁇ ,g, mod ( m,n ) respectively by the usage of the equation 30.
  • the corresponding speech fundamental period can be determined and detected quite well using the autocorrelation function r ⁇ ⁇ ,g ( m,n ) (left section of Fig. 13 ).
  • Fig. 13 shows in the lower part that by a combination of the correlation of the signal frame with itself and the correlation with a preceding signal frame the speech fundamental period can still be determined and detected.
  • Fig. 14 the analysis of the extended autocorrelation function r ⁇ ⁇ ,erw ( k,n ) is shown when a previous spectral refinement in the low frequent region as well as a time-frequency-analysis of the input signal is used.
  • a comparison with the analyses from the Fig. 5 and 14 indicates that by using the previously described approach significant improvements can be achieved.
  • no erroneous detections with low speech fundamental frequencies occur.
  • f p (n) After estimation of the speech fundamental frequency f p (n) a test can be made whether this estimate is below a threshold f k .
  • f p (n) For the determination of this area the previously determined speech fundamental frequency f p (n) is firstly doubled.
  • the parameter f p,max in equation 33 is herein a predefined value of a maximal possible speech fundamental frequency.
  • Fig. 15 shows a time-frequency-analysis of an input signal, respectively, the detection results of the speech fundamental frequency estimation.
  • the post-processing was deactivated and at two locations (at 0.7 and at 0.75 seconds) erroneous detections (bisections of frequency) can be observed.
  • Such erroneous detections can be corrected by the post-processing which can be concluded from the lower part of Fig. 15 .
  • the autocorrelation coefficient at which the extended autocorrelation function r ⁇ ⁇ ,erw ( k,n ) has its maximum is used for the interpolation, and the adjacent autocorrelation coefficients are also considered, that is, the autocorrelation offsets left and right of the maximum.
  • Fig. 16 the time-frequency-analysis of a portion of several sinusoidal signals of equal amplitude is shown. Contrary hereto a portion of a speech signal of a female voice is shown in the lower part of Fig. 16 .
  • the white graph denotes the estimated quantized speech fundamental frequency in the upper as well as also in the lower part of Fig. 16 .
  • the grey graph in the upper part respectively the black graph in the lower part demonstrates the estimated speech fundamental frequency after the interpolation. It can be seen from the upper part of Fig. 16 that due to the interpolation nearly the desired straight graph of the estimated speech fundamental frequency can be obtained. In the lower part it can be seen that the estimated speech fundamental frequency of the speech fundamental frequency structure follows the speech signal closely when the interpolation is used.
  • this invention describes a method for estimating the fundamental frequency (pitch frequency) of speech signals. This is achieved in the DFT domain by analyzing the current input spectrum as well as past input spectra. To achieve an estimation performance that is improved compared to standard methods, a four-stage algorithm is proposed, the steps of which can also be used independently: First, pre-processing (called spectral refinement) is applied to the input spectrum at low frequencies. Second, a noise reduction is applied when computing normalization values. Third, estimations for the autocorrelation of the current frame and the cross correlation of the current frame with the previous frame are adaptively combined in order to obtain an extended range. Fourth, post-processing is applied to reduce estimation errors and to achieve an improved pitch accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Description

  • This invention relates to speech analysis systems and especially to a speech fundamental frequency estimator and a method for estimating a speech fundamental frequency.
  • Background of the invention
  • An estimation of the speech fundamental frequency is necessary in various applications:
    • For example, partial speech recognition (model-based noise suppression) can be accomplished by the estimated speech fundamental frequency of distorted speech signals in order to obtain an improvement of speech quality.
    • As a further example, a rough preselection of model parameters can be performed by a speech recognizer in speech recognition systems using the temporal average of this frequency. Thus, the recognition rate of a speech recognizer can be increased significantly.
  • For further fields of application reference is made to the following literature:
    • K. Fellbaum: Sprachverarbeitung und Sprachübertragung, Springer, Berlin, Deutschland, 1984
    • D.K. Freeman, G. Cosier, C.B. Southcott and I. Boyd: The Voice Activity Detector for the PAN-European Digital Cellular Mobile Telephone Service, Proceed. of the Intern. Conf. on Acoust., Speech, and Signal Process., Vol. 1, pages 369-372, 1989
    • W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Deutschland, 1983
    • P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006
    • P. Vary, U. Heute, W. Hess: Digitale Sprachsignalverarbeitung, Teubner, Stuttgart, Deutschland, 1998
  • Numerous methods for estimation of the speech fundamental frequency exist. A group of methods which is based on a DFT (discrete Fourier transform) of the input signal is of special importance. Such methods can be integrated at low cost into hands-free speech assistance systems with multi-rate signal processing, as the DFT is already calculated for other algorithms, for example noise reduction or echo compensation.
  • In order to describe the relevant state of the art in more detail, a typical multi-rate system is described which can be used, for example, to perform a speech signal improvement (noise reduction, speech reconstruction). Subsequently, several further fields of application are presented in which a reliable estimation of the speech fundamental frequency is of importance.
  • In voiced speech portions the corresponding spectrum shows distinct amplitude peaks which are located equidistantly in frequency (see for example Fig. 1). The distance between two adjacent amplitude peaks represents the speech fundamental frequency, which depends on the speaker. For men this frequency varies between 80 Hz and 150 Hz; women and children have a higher speech fundamental frequency, which varies between 150 Hz and 300 Hz for women and between 200 Hz and 600 Hz for children. A reliable estimation of the speech fundamental frequency is often not easy to obtain. Difficulties arise mainly in detecting low speech fundamental frequencies, which in most cases occur with male speakers.
  • In Figure 2 a block diagram of a multi-rate system for speech reconstruction with an analysis and a synthesis filter bank for the signal processing is shown. The speech fundamental frequency estimation is shown as a separate functional block. The aim of such an application is to extract parameters from a distorted speech signal y(n), for example the spectral envelope, the type of stimulation (voiced/unvoiced) and the speech fundamental frequency fp(n). Subsequently an undistorted speech signal x(n) is resynthesized from these parameters. For this purpose a very precise and reliable estimation of the speech fundamental frequency is necessary. The output signal x(n) after the synthesis filter bank should be nearly without error; the following condition is therefore very desirable:
    $$x(n) \approx s(n),$$
    where s(n) denotes the undisturbed speech signal.
  • A reliable estimation of the speech fundamental frequency is also of great importance in speech recognition systems. Figure 3 shows a block diagram of a signal analysis system with subsequent feature extraction and speech fundamental frequency estimation, in order to perform a speech recognition. An adequate estimation of the speech fundamental frequency can, for example, contribute to significantly improving the recognition rates of the speech recognizer.
  • Basically there is a broad variety of application fields in which a reliable estimation of the speech fundamental frequency is necessary. However a detailed description of such applications would go beyond the scope of this description. Thus, reference is made to the following literature:
    • E. Hänsler, G. Schmidt: Acoustic Echo and Noise Control - A Practical Approach, John Wiley & Sons, Hoboken, New Jersey, USA, 2004
    • E. Hänsler, G. Schmidt: Topics in Acoustic Echo and Noise Control - Selected Methods for the Cancellation of Acoustic Echoes, the Reduction of Background Noise, and Speech Processing, Springer, Berlin, Deutschland, 2006
    • P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006
    • P. Vary, U. Heute, W. Hess: Digitale Sprachsignalverarbeitung, Teubner, Stuttgart, Deutschland, 1998
  • In the literature a broad variety of algorithms for estimating the speech fundamental frequency exists, for example:
    • Analysis in the cepstral domain - If speech generation is modelled as a source-filter model (see also J. Deller, J. Hansen, J. Proakis: Discrete-Time Processing of Speech Signals, IEEE-Press, New York, USA, 1993), voiced sounds can be described as a convolution of a periodic stimulation signal with the impulse response of the vocal tract. In the spectral domain the convolution becomes the product of the Fourier transforms of both portions. If the logarithm is taken, the product becomes a sum of the separate components. After a further transform (inverse Fourier transform) the cepstral domain is reached. In this domain it is possible to distinguish the spectrally comparatively slowly varying frequency response of the vocal tract from the fundamental frequency of the stimulation signal. Further details can be found for example in W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Deutschland, 1983.
    • Harmonic Product-Spectrum - Another method to estimate the speech fundamental frequency is the so-called Harmonic Product-Spectrum. Herein the product over several equidistant sampling points of the absolute value of the spectrum is calculated. The product becomes maximal in the case the increment (via frequency) corresponds to just the speech fundamental frequency (respectively a multiple thereof). Further details can be found for example in M. R. Schroeder: Period Histogram and Product Spectrum: New Methods for Fundamental Frequency Measurements, J. Acoust. Soc. Am., Vol. 43, Nr. 4, pages 829-834, 1968.
    • Analysis of the short-time autocorrelation - In voiced speech passages the short-time autocorrelation shows its first side lobe at an offset which corresponds to the speech fundamental period.
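  • The chain of transforms described for the cepstral analysis in the list above can be written out as a short derivation. This is only an illustrative sketch; the symbols e(n) for the stimulation signal, h(n) for the vocal tract impulse response and q for the cepstral index are assumed names that do not appear in the text.

```latex
\begin{align*}
y(n) &= e(n) * h(n)
  && \text{(source-filter model, time domain)} \\
Y(e^{j\Omega}) &= E(e^{j\Omega}) \, H(e^{j\Omega})
  && \text{(convolution becomes a product)} \\
\log\bigl|Y(e^{j\Omega})\bigr| &= \log\bigl|E(e^{j\Omega})\bigr| + \log\bigl|H(e^{j\Omega})\bigr|
  && \text{(product becomes a sum)} \\
c_y(q) &= c_e(q) + c_h(q)
  && \text{(inverse Fourier transform of the log spectrum)}
\end{align*}
```

    Under this model the slowly varying vocal tract term c_h(q) is concentrated at small q, whereas the periodic stimulation produces a peak in c_e(q) near the speech fundamental period, which is why the two contributions can be separated.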
  • Methods other than the ones mentioned above also exist. A description of each single algorithm would, however, go far beyond the scope of the present description. Therefore reference is made to further literature, for example K. Fellbaum: Sprachverarbeitung und Sprachübertragung, Springer, Berlin, Deutschland, 1984, or D.K. Freeman, G. Cosier, C.B. Southcott and I. Boyd: The Voice Activity Detector for the PAN-European Digital Cellular Mobile Telephone Service, Proceed. of the Intern. Conf. on Acoust., Speech, and Signal Process., Vol. 1, pages 369-372, 1989, or W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Deutschland, 1983. The approach mentioned last in the above listing has become very popular as it provides the advantage that already determined short-time DFT portions of an input signal, which are calculated for other applications, can be reused, so that the numerical effort can be reduced.
  • However, the above-mentioned approach of the state of the art also has significant disadvantages. In particular, the orders of the DFT (i.e. the DFT block lengths) used for other purposes are often too small to provide a reliable estimation of the speech fundamental frequency for low voices.
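  • A short worked example may illustrate why a small DFT block length limits the detection of low voices. The numerical values for the sampling frequency f_s and the block length N are assumptions chosen for illustration only; the relation itself follows from an autocorrelation of order N/2 covering the offsets 0 to N/2 - 1.

```latex
\begin{align*}
k_{\max} &= \frac{N}{2} - 1, \qquad
f_{\min} = \frac{f_s}{k_{\max}} \\[4pt]
\text{assumed: } f_s &= 11025\ \text{Hz},\quad N = 256
\;\Rightarrow\; k_{\max} = 127,\quad
f_{\min} = \frac{11025\ \text{Hz}}{127} \approx 86.8\ \text{Hz}
\end{align*}
```

    With these assumed values the lower part of the range given above for male voices (80 Hz to 150 Hz) could not be covered by a single frame.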
  • Accordingly, a need exists to provide a speech fundamental frequency estimator and a method for estimating a speech fundamental frequency which allow a more precise estimation of the speech fundamental frequency.
  • This need is met by the features of the independent claims. In the dependent claims further embodiments of the invention are described.
  • According to a first aspect of the invention the speech fundamental frequency estimator is configured for receiving a first set of values and a second set of values, the first set of values being a frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the speech fundamental frequency estimator comprising:
    • a first power density spectrum calculator being configured for storing a version of the first set of values and being configured for providing values of a first power density spectrum by multiplying the stored version of the first set of values with a complex conjugate version of the second set of values;
    • a second power density spectrum calculator being configured for providing values of a second power density spectrum by multiplying a version of the second set of values with a complex conjugate version of the second set of values;
    • an analyzer being configured for determining the speech fundamental frequency estimate on the basis of the values of the first power density spectrum and the values of the second power density spectrum.
  • The analyzer is further configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
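  • The two power density spectrum calculators and the two frequency-time-transforms can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: the function name correlation_functions, the unwindowed frames and the synthetic periodic test signal are hypothetical, and NumPy's real-input FFT is used as the frequency domain representation.

```python
import numpy as np

def correlation_functions(frame_prev, frame_curr):
    """Return (r_cross, r_auto) obtained from the DFTs of two frames.

    frame_prev: time domain values of the first (earlier) interval
    frame_curr: time domain values of the second (later) interval
    """
    n = len(frame_curr)
    Y1 = np.fft.rfft(frame_prev)   # first set of values (stored version)
    Y2 = np.fft.rfft(frame_curr)   # second set of values
    S_cross = Y1 * np.conj(Y2)     # first power density spectrum
    S_auto = Y2 * np.conj(Y2)      # second power density spectrum (real valued)
    # Frequency-time-transform: the inverse DFT yields circular correlations.
    r_cross = np.fft.irfft(S_cross, n=n)
    r_auto = np.fft.irfft(S_auto, n=n)
    return r_cross, r_auto

# Toy signal (assumption): four harmonics with a period of P samples.
N, P, d = 256, 64, 40              # frame length, period, frame offset
n = np.arange(N + d)
y = sum(np.cos(2 * np.pi * h * n / P) for h in range(1, 5))
r_cross, r_auto = correlation_functions(y[:N], y[d:d + N])
print(np.argmax(r_auto) % P, (np.argmax(r_cross) - d) % P)
```

    In this toy signal the peaks of the autocorrelation lie at multiples of the period, while the peaks of the cross-correlation are displaced by the frame offset d. This displacement is what allows the second correlation function to carry information about offsets that a single frame does not cover.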
  • Analogously, according to said first aspect of the invention a method for estimating a speech fundamental frequency is provided, the method using a first set of values and a second set of values, the first set of values being a received frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a received frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the method for estimating the speech fundamental frequency comprising the steps of:
    • storing a version of the first set of values and providing values of a first power density spectrum by multiplying the stored version of the first set of values with a complex conjugate version of the second set of values;
    • providing values of a second power density spectrum by multiplying a version of the second set of values with a complex conjugate version of the second set of values;
    • determining the speech fundamental frequency estimate on the basis of the values of the first power density spectrum and the values of the second power density spectrum.
  • The step of determining the speech fundamental frequency estimate comprises performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
  • This first aspect of the invention is based on the finding that utilizing the first and second sets of values, which originate from sets of time domain signal values in time intervals which are offset from each other, results in a total analyzed signal portion which is larger than just one single signal portion, for example the first or the second time interval. In other words, it is now possible to analyze a temporally longer signal portion by means of existing (short) time-frequency-transformed signals without the need to provide a new time-frequency-transform just for the estimation of the speech fundamental frequency. However, it is exactly the combination of a given first and second set of values which enables such a temporally longer analysis interval, that is, the calculation of the first spectrum from the first and second sets of values and of the second spectrum from only the second set of values. Thus, the first spectrum represents the spectrum over the longer time interval, whereas the second spectrum serves to determine the characteristics of the second set of values in order to compensate errors in the first spectrum. Therefore it is necessary to calculate not only the first spectrum but also the second spectrum.
  • The approach according to the first aspect of the invention provides the advantage that a signal given in a time-frequency-transformed version (provided for other applications than speech fundamental frequency estimation) can still be used also for speech fundamental frequency estimation (even in the case the time-frequency-transformed version of the signal would normally be not appropriate for providing a precise speech fundamental frequency estimation).
  • According to a second aspect which does not form part of the invention, a speech fundamental frequency estimator is provided which is configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the speech fundamental frequency estimator comprising:
    • a power density spectrum calculator being configured for providing values of a power density spectrum by multiplying a version of the set of values with a complex conjugate version of the set of values, wherein the power density spectrum calculator is configured for determining an estimate of the power density spectrum of background noise and for determining a noise suppression factor on the basis of said power density spectrum of background noise;
    • an analyzer being configured for multiplying the power density spectrum with said noise suppression factor and for performing a frequency-time-transform of the multiplied values of the power density spectrum in order to obtain a set of correlation function values, wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the set of correlation function values.
  • Analogously, according to the second aspect which does not form part of the present invention a method for estimating a speech fundamental frequency is provided, the method being configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the method comprising the steps of:
    • providing values of a power density spectrum by multiplying a version of the set of values with a complex conjugate version of the set of values;
    • determining an estimate of the power density spectrum of background noise and determining a noise suppression factor on the basis of said power density spectrum of background noise and the power density spectrum of the input signal;
    • multiplying the power density spectrum with said noise suppression factor;
    • performing a frequency-time-transform of the multiplied values of the power density spectrum in order to obtain a set of correlation function values; and
    • determining the speech fundamental frequency estimate on the basis of the set of correlation function values.
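  • The method steps above can be sketched as follows. The exact characteristic of the noise suppression factor is not given in this passage; the sketch therefore assumes a Wiener-type rule with an overestimation parameter and a lower bound (maximal attenuation). The names noise_reduced_correlation, beta and v_min, as well as the default values, are hypothetical.

```python
import numpy as np

def noise_reduced_correlation(frame, noise_psd, beta=2.0, v_min=0.1):
    """Return (correlation values, noise suppression factor) for one frame.

    noise_psd: estimate of the background noise power density spectrum,
               one value per rfft bin (assumed to be supplied externally).
    beta:      overestimation of the noise estimate (assumed parameter).
    v_min:     lower bound of the factor, i.e. maximal attenuation (assumed).
    """
    Y = np.fft.rfft(frame)
    S_yy = (Y * np.conj(Y)).real                  # power density spectrum
    # Assumed Wiener-type characteristic, bounded below by v_min.
    ratio = noise_psd / np.maximum(S_yy, 1e-12)   # guard against division by zero
    gain = np.maximum(v_min, 1.0 - beta * ratio)
    # Multiply the spectrum with the factor, then frequency-time-transform.
    r = np.fft.irfft(gain * S_yy, n=len(frame))
    return r, gain
```

    With such a rule, bins that contain essentially only background noise are attenuated down to v_min, whereas bins dominated by speech harmonics pass almost unchanged, so that speech pauses contribute less to the correlation values.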
  • The second aspect is based on the finding that a significant improvement in the preciseness of speech fundamental frequency estimation can be realized when background noise is adequately compensated. This is especially the case in a scenario where in speech pauses erroneous detections of speech occur which then falsify the detected result and, in consequence, decrease the reliability of the detected speech fundamental frequency.
  • The second aspect thus provides the advantage that by simple means, for example a pause detector or just a further analysis of the already existing signal frames a significant improvement in preciseness and reliability of the estimated speech fundamental frequency can be obtained.
  • According to a further aspect of the present invention the speech fundamental frequency estimator is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent sets of time domain signal values having overlapping time intervals. This provides the advantage that by multiplying said sets of values, which represent portions of overlapping and therefore consecutive time intervals, a signal in a total time interval can be analyzed, in which a low fundamental frequency can be reliably estimated in given short time signal portions.
  • Furthermore, the speech fundamental frequency estimator according to another aspect of the present invention is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent time domain signal values having time intervals overlapping by at least 25 percent. This makes it possible to determine the speech fundamental frequency estimate reliably, as the first and second sets of values belong to time domain signal values which have a sufficiently overlapping interval structure. Therefore, due to the sufficient overlap of both time intervals, such an estimation can be considered to be an estimation over the "longer" time interval.
  • According to a further aspect of the present invention the speech fundamental frequency estimator is characterized in that the second power density spectrum calculator is configured for providing a conjugate complex version of the second set of values to the first power density spectrum calculator and wherein the first power density spectrum calculator is configured for using the provided conjugate complex version of the second set of values as the version with which the stored version of the first set of values is to be multiplied. This provides the advantage that a complex conjugate version of one of the sets of values has to be calculated only once such that the numerical or computational effort can be reduced.
  • In another embodiment of the present invention the speech fundamental frequency estimator is characterized in that the analyzer is configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values and for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, wherein the analyzer is furthermore configured for determining a set of normalization values and a set of weighting values from the second power density spectrum and for using the set of normalization values and the set of weighting values in the first and second frequency-time-transform and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values. This provides the advantage that, on the one hand, the short-time envelope can be eliminated and, on the other hand, it is possible to increase the attenuation with rising frequency. In this way typical characteristics of speech, especially the speech fundamental frequency structure in the low frequency range, can be adequately dealt with.
  • Also, the speech fundamental frequency estimator according to a further embodiment can be characterized in that the analyzer further comprises a compensator being configured for adaptively compensating the values of the first set of correlation function values by a correction factor being based on a value of the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the compensated first set of correlation function values and the second set of correlation function values. Such an adaptive compensation provides the advantage that it is now possible to correct error terms in the cross correlation function, for example to compensate undesired amplitudes which occur at distinct offsets.
  • According to another embodiment the speech fundamental frequency estimator can be characterized in that the compensator is configured for multiplying the second set of correlation function values by a lower bounded quotient between a value of the first set of correlation function values and a value of the second set of correlation function values in order to obtain said compensated first set of correlation function values. Such a configuration of the speech fundamental frequency estimator makes sure that a relation between the cross correlation function and the autocorrelation function does not decrease below a minimal value which, in turn, improves the robustness of speech fundamental frequency estimation.
  • Furthermore, it is also possible according to another embodiment of the present invention that the speech fundamental frequency estimator is characterized in that the analyzer is configured for combining the compensated first set of correlation function values and the second set of correlation function values in order to obtain an extended set of correlation function values, wherein the values of the extended set of correlation function values assume corresponding values from the compensated first set of correlation function values, the second set of correlation function values or values between the compensated first set of correlation function values and the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of said extended set of correlation function values. This provides the advantage that the extended set of correlation function values comprises now information from the first as well as the second set of correlation function values such that an estimation of the speech fundamental frequency can be based on the information comprised in the first and second time interval as well as a correction of possible errors is also possible by the information of the second time interval. Furthermore, it is also possible to perform a weighting of the values of the first set of correlation function values in contrast to the values of the second set of correlation function values in order to take into account the influence of an offset between the first set of correlation function values (respectively the compensated set of correlation function values) and the second set of correlation function values.
  • In a further embodiment the speech fundamental frequency estimator is characterized in that the analyzer is configured for determining the speech fundamental frequency estimate by searching the index of a maximum value from the extended set of correlation function values within a predetermined number of indices of the values of the extended set of correlation values, from the first or second set of correlation function values within a predetermined number of indices of values of the first respectively second set of correlation function values or from the compensated first set of correlation function values within the predetermined number of indices of values of the compensated first set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate as the product of a sampling frequency and a reciprocal value of said searched index.
  • According to a further embodiment, the speech fundamental frequency estimator is characterized in that the analyzer is furthermore configured for determining a reliability factor for the determined speech fundamental frequency estimate and for blocking an output of the determined speech fundamental frequency estimate in the case that the determined reliability factor is below a predetermined reliability factor. Such a configuration improves the reliability of the estimated speech fundamental frequency.
  • Additionally, in a special embodiment the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for determining said reliability factor by dividing the maximum value at said searched index by the first value of the extended set of correlation function values or, respectively the first, the compensated first or second set of correlation function values. This provides the advantage that the reliability factor is only dependent on the scenario in which the speech fundamental frequency estimator is used and not on just a predefined factor which might be too rough in some situations.
  • Furthermore, the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining an estimate of the power density spectrum of background noise and for determining a noise suppression factor on the basis of said power density spectrum of background noise, and wherein the analyzer is configured for multiplying the first and second power density spectrum with said noise suppression factor prior to the frequency-time-transform of the first respectively second power density spectrum. This provides the advantage that an additional improvement can be realized as then erroneous detections in speech pauses can be avoided, which, in turn, improve the reliability of the estimated speech fundamental frequency estimate.
  • Especially the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum. This makes sure, that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
  • In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This provides an efficient and numerically simple way of determining the estimate of the power density spectrum of background noise.
  • In particular, the speech fundamental frequency estimator can be characterized in that the noise suppression factor is defined by
    $$V(e^{j\Omega_\mu},n) = \max\left\{ V_0,\; 1 - \beta\,\frac{\hat{S}_{nn}(\Omega_\mu,n)}{\hat{S}_{yy}(\Omega_\mu,n)} \right\}$$
    wherein $\hat{S}_{nn}(\Omega_\mu,n)$ denotes the estimate of the power density spectrum of the background noise, $\hat{S}_{yy}(\Omega_\mu,n)$ denotes the second power density spectrum of the input signal, $V_0$ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise.
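  • The noise suppression factor defined above can be illustrated by a short sketch. The following Python fragment is only an illustration under stated assumptions: the function name `noise_suppression_factor`, the example values `v0 = 0.1` and `beta = 1.5`, and the small floor `eps` guarding against a division by zero are hypothetical and are not taken from the description.

```python
def noise_suppression_factor(s_nn, s_yy, v0=0.1, beta=1.5, eps=1e-12):
    """Per-bin factor V = max(V0, 1 - beta * S_nn / S_yy).

    s_nn : estimated noise power density spectrum, one value per bin
    s_yy : power density spectrum of the input signal, one value per bin
    v0, beta, eps : illustrative values (assumptions, not from the source)
    """
    return [max(v0, 1.0 - beta * nn / max(yy, eps))
            for nn, yy in zip(s_nn, s_yy)]


# A bin dominated by speech is left almost untouched, a bin dominated by
# noise is attenuated down to the lower bound V0.
factors = noise_suppression_factor([1.0, 1.0], [10.0, 1.0])
```

    The lower bound `v0` limits the attenuation, so that a bin is never suppressed more strongly than by the predefined maximum attenuation factor.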
  • In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for reestimating the speech fundamental frequency estimate in the case that the determined speech fundamental frequency estimate is below a predefined frequency value, wherein the analyzer is configured for performing the reestimation by searching a further index of a further maximum value of the extended set of correlation function values, the first or second set of correlation function values or the compensated first set of correlation function values within a further number of values of said sets of correlation function values and for outputting a product of a sampling frequency and a reciprocal value of said further index as the determined speech fundamental frequency estimate. This provides a further improvement of the speech fundamental frequency estimate, especially in the case that the determined estimate is below said predefined frequency (which means that the estimate is probably not as reliable as desired).
  • Especially the speech fundamental frequency estimator can be characterized in that the analyzer is configured for searching said index of said further maximum value using a number of values k of said sets of correlation function values which is defined by
    $$\frac{f_s}{f_{p,\max}} \le k < \frac{f_s}{2 f_p(n)} + k_0$$
    wherein k denotes the number of values of said sets of correlation function values, $f_p(n)$ denotes the previously determined speech fundamental frequency estimate, $f_{p,\max}$ denotes a predefined value of a maximal possible speech fundamental frequency, $f_s$ denotes a sampling frequency and $k_0$ denotes a constant which enables the search of a maximum slightly above
    $$k = \frac{f_s}{2 f_p(n)}.$$
    Such a use of the doubled speech fundamental frequency estimate from a previous estimation broadens the region to be searched and thus strengthens the reliability and preciseness of the outputted estimate.
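  • The search range for the reestimation can be sketched as follows. This is a minimal Python illustration, assuming that the correlation function values are available as a list `r` indexed by the offset; the function names and the example values `fp_max = 400` Hz and `k0 = 3` are hypothetical and are not specified in the description.

```python
import math


def reestimation_lags(fs, fp_prev, fp_max, k0=3):
    """Integer lags k with fs/fp_max <= k < fs/(2*fp_prev) + k0."""
    k_low = math.ceil(fs / fp_max)
    k_high = fs / (2.0 * fp_prev) + k0
    # range() excludes its upper end, which keeps the inequality strict
    return range(k_low, math.ceil(k_high))


def reestimate_pitch(r, fs, fp_prev, fp_max, k0=3):
    """Return fs / k for the lag k with the largest value r[k] in the range."""
    lags = [k for k in reestimation_lags(fs, fp_prev, fp_max, k0) if k < len(r)]
    if not lags:
        return None
    k_best = max(lags, key=lambda k: r[k])
    return fs / k_best
```

    For a previous estimate of 60 Hz at a sampling frequency of 11025 Hz, the upper end of the range lies slightly above the lag that belongs to the doubled frequency of 120 Hz.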
  • Also, in another embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the analyzer is configured for outputting said product as the predetermined speech fundamental frequency estimate only in the case the value of the autocorrelation function at the further index is larger than 60 percent of the value of the autocorrelation function at the previously searched maximal index as well as a value of the extended set of correlation function values at said further index is larger than a previously defined amplitude value. This further strengthens the validity of the outputted speech fundamental frequency estimate as before outputting the result two separate conditions have to be fulfilled.
  • Additionally the speech fundamental frequency estimator in a further embodiment can be characterized in that the analyzer is configured for modifying a speech fundamental period corresponding to said determined speech fundamental frequency estimate by an interpolation correction term prior to outputting a modified speech fundamental frequency estimate, wherein said interpolation correction term is dependent on values of said first or second set of correlation function values, of said extended set of correlation function values or said compensated first set of correlation function values, respectively. Such an interpolation approach provides the advantage that the error terms resulting from the use of a discrete time-frequency-transform or a discrete frequency-time-transform can be reduced by a processing of the signals after the inverse transform has been performed.
  • In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized by a frequency domain filtering unit being configured for receiving the frequency domain versions of the first and second set of time domain signal values, for frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values, respectively, and for providing said first and second sets of values to the first and second power density spectrum calculator respectively. Such a pre-processing of the received signals significantly increases the reliability and preciseness of the estimation in contrast to an embodiment of the invention in which no pre-processing is performed. The computational burden for this is nevertheless relatively low, especially if the filter has a small number of coefficients.
  • In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for filtering only frequencies below a predefined limiting frequency. This relaxes a computational burden as only the parts of the spectrum are filtered which are of the most importance for a reliable estimation of very low speech fundamental frequencies.
  • Furthermore, in another embodiment the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for delaying values of said frequency domain versions being above said predefined limiting frequency. This compensates a delay which might be introduced in a signal flow path for filtering signals having a frequency below said limiting frequency.
  • The above mentioned aspects and modifications according to the first aspect of the present invention can also be implemented in corresponding methods where the advantages mentioned above come into effect in an analogous manner.
  • Furthermore, the invention can also be implemented as a computer program having a program code for performing the inventive method, when the computer program runs on a computer.
  • In an embodiment of the second aspect the speech fundamental frequency estimator can be characterized in that the power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum. This provides the advantage that an additional improvement can be realized as then erroneous detections in speech pauses can be avoided which, in turn, improves the reliability of the estimated speech fundamental frequency. Also, it can be made sure that the noise suppression factor is always above a predefined value.
  • Further, the second aspect may comprise a speech fundamental frequency estimator being characterized in that the power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This makes sure, that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
  • Furthermore, the speech fundamental frequency estimator according to a further embodiment may be characterized in that the noise suppression factor is defined by
    $$V(e^{j\Omega_\mu},n) = \max\left\{ V_0,\; 1 - \beta\,\frac{\hat{S}_{nn}(\Omega_\mu,n)}{\hat{S}_{yy}(\Omega_\mu,n)} \right\}$$
    wherein $\hat{S}_{nn}(\Omega_\mu,n)$ denotes the estimate of the power density spectrum of the background noise, $\hat{S}_{yy}(\Omega_\mu,n)$ denotes the second power density spectrum of the input signal, $V_0$ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise.
  • The above mentioned aspects and modifications according to the second aspect which in isolation does not form part of the present invention can also be implemented in corresponding methods where the advantages mentioned above come into effect in an analogous manner.
  • Additional features and advantages of the present invention will become more readily appreciated from the following detailed description of preferred or advantageous embodiments with reference to the accompanying drawings, in which
  • Figure 1
    shows a time-frequency-analysis of a speech signal;
    Figure 2
    shows a block diagram of a multi-rate system for speech recognition having a speech fundamental frequency estimation;
    Figure 3
    shows a block diagram of an analysis system for speech recognition having a speech fundamental frequency estimation;
    Figure 4
    shows a block diagram of a method and a system for speech fundamental frequency estimation;
    Figure 5
    shows an autocorrelation- and time-frequency-analysis of sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz;
    Figure 6
    shows a block diagram of a system respectively method for a speech fundamental frequency estimation with spectral refinement;
    Figure 7
    shows a block diagram of a system for speech fundamental frequency estimation with spectral refinement in the lower frequency band from 0 Hz to 1000 Hz;
    Figure 8
    shows in the upper section the analysis of the autocorrelation and in the lower section the time-frequency analysis of sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz. The analyses have been performed with a previous spectral refinement;
    Figure 9A
    shows a block diagram of an embodiment of the inventive speech fundamental frequency estimator;
    Figure 9B
    shows a flow diagram of an embodiment of the inventive method for estimating the speech fundamental frequency estimate;
    Figure 10
    shows diagrams with results of the speech fundamental frequency estimation with a spectral refinement as a time-frequency-analysis with and without noise reduction;
    Figure 11A
    shows a block diagram of another embodiment of the inventive speech fundamental frequency estimator;
    Figure 11B
    shows a flow diagram of another embodiment of the inventive method for estimating the speech fundamental frequency estimate;
    Figure 12
    shows a block diagram of a method respectively system for speech fundamental frequency estimation with additional consideration of a passed subband input vector and spectral refinement;
    Figure 13
    shows in the left section the analysis of the autocorrelation function of a speech fundamental frequency at about 270 Hz. In the right section the analysis of a low speech fundamental frequency of about 60 Hz is shown;
    Figure 14
    shows in the upper section the analysis of an extended autocorrelation and in the lower section the time-frequency-analysis of several sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz. Additionally a spectral refinement had been performed in a lower frequency range;
    Figure 15
    shows in the upper section the time-frequency-analysis of a speech signal with additional post-processing and in the lower section the time-frequency-analysis of a speech signal without additional post-processing;
    Figure 16
    shows in the upper section the time-frequency analysis of several sinusoidal signals of equal amplitude with varying frequency distance (partial section of the signal). In the lower section the time-frequency-analysis of a speech signal (partial section of the signal) is shown.
    Equal or similar elements may have the same reference numbers in the following description of embodiments of the present invention.
  • Description of preferred embodiments
  • The present invention relies mainly on estimation methods based on autocorrelation function which are described herein in advance for a better understanding. However, some aspects of the present invention are also implemented in the conventional autocorrelation methods such that the description in this section is not to be considered as state of the art.
  • In the following it is assumed that the speech signal s(n) is recorded by a microphone. Background noise n(n) is often superimposed on this signal. Consequently, the microphone signal y(n) is composed of local speech s(n) and disturbances n(n):
    $$y(n) = s(n) + n(n) \tag{2}$$
  • From this signal the short-time autocorrelation function in the time domain can be estimated in a block-based way according to
    $$\hat{r}_{yy}(m,n) = \frac{1}{L} \sum_{k=0}^{L-1} y(n-k)\, y(n-k+m) \tag{3}$$
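  • The block-based estimate above can be written down directly. The following Python sketch is an illustration only; the function name and the bounds check are assumptions, and the offset m is taken to be non-negative.

```python
def short_time_autocorr(y, n, m, block_len):
    """(1/L) * sum_{k=0}^{L-1} y(n-k) * y(n-k+m) for one offset m >= 0.

    y is a sequence of samples, n the index of the current sample and
    block_len the block length L.
    """
    if n - (block_len - 1) < 0 or n + m >= len(y):
        raise ValueError("block or offset exceeds the available samples")
    return sum(y[n - k] * y[n - k + m] for k in range(block_len)) / block_len
```

    Each offset costs L multiplications, and the search has to cover a large range of offsets in every frame, which illustrates why the direct estimation is considered too costly for many applications.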
  • As this short-time autocorrelation function has to be evaluated over a quite large range of the autocorrelation offset m, the direct estimation requires too much effort for many applications. As hands-free and speech recognition systems with a multi-rate structure compute a subband transform (for example a DFT) anyway, an approach which requires less effort can be used here. The analysis filter bank of a multi-rate system can be described as follows:
    • First the input signal y(n) is partitioned into windowed, overlapping frame blocks [see also J. Benesty, S. Makino, J. Chen: Speech Enhancement, Springer, Berlin, Germany, 2005; E. Hänsler, G. Schmidt: Acoustic Echo and Noise Control - A Practical Approach, John Wiley & Sons, Hoboken, New Jersey, USA, 2004; E. Hänsler, G. Schmidt: Topics in Acoustic Echo and Noise Control - Selected Methods for the Cancellation of Acoustic Echoes, the Reduction of Background Noise, and Speech Processing, Springer, Berlin, Germany, 2006 or P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006]. Depending on the DFT order N (which is the block length of said DFT), one frame block, i.e. one signal input vector $\mathbf{y}(n)$, is composed as follows:
      $$\mathbf{y}(n) = [\,y(n),\, y(n-1),\, \dots,\, y(n-N+1)\,]^T \tag{4}$$
    • each signal input vector $\mathbf{y}(n)$ is subsequently weighted by a window function
      $$\mathbf{h} = [\,h_0,\, h_1,\, \dots,\, h_{N-1}\,]^T \tag{5}$$
      and
    • transformed to the frequency domain by a DFT:
      $$Y(e^{j\Omega_\mu},n) = \sum_{k=0}^{N-1} y(n-k)\, h_k\, e^{-j\Omega_\mu k} \tag{6}$$
  • The sampling points $\Omega_\mu$ are located equidistantly in the normalized frequency domain:
    $$\Omega_\mu = \frac{2\pi}{N}\,\mu \quad \text{with } \mu \in \{0, \dots, N-1\} \tag{7}$$
  • From the short-time spectrum $Y(e^{j\Omega_\mu},n)$ the short-time power density spectrum can be estimated by calculating the square of the absolute value according to the following equation:
    $$\hat{S}_{yy}(\Omega_\mu,n) = \left| Y(e^{j\Omega_\mu},n) \right|^2 = Y(e^{j\Omega_\mu},n)\, Y^*(e^{j\Omega_\mu},n) \tag{8}$$
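  • The windowing, the DFT and the squared magnitude can be sketched in plain Python as follows. This is an illustration under assumptions: the description does not name a particular window function, so the Hann window used here is only an example, and the function names are hypothetical. The DFT is evaluated directly for clarity; a practical system would use an FFT.

```python
import cmath
import math


def hann_window(n_bins):
    """Example window (assumption); the source does not prescribe one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n_bins) for k in range(n_bins)]


def short_time_spectrum(frame, window):
    """Y(mu) = sum_k y(n-k) * h_k * exp(-j * 2*pi/N * mu * k).

    frame[k] holds y(n-k), i.e. the newest sample comes first.
    """
    n_bins = len(frame)
    return [sum(frame[k] * window[k] * cmath.exp(-2j * math.pi * mu * k / n_bins)
                for k in range(n_bins))
            for mu in range(n_bins)]


def power_density(spectrum):
    """Squared magnitude |Y|^2 = Y * conj(Y) per bin."""
    return [(v * v.conjugate()).real for v in spectrum]
```

    With a rectangular window and a cosine that completes exactly two periods in a block of eight samples, the power is concentrated in the bins µ = 2 and µ = 6.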
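  • The power density spectrum $\hat{S}_{yy}(\Omega_\mu,n)$ determined according to equation 8 is then smoothed in frequency direction and divided by the envelope $\bar{S}_{yy}(\Omega_\mu,n)$ thus obtained. In this way the short-time envelope is removed. The smoothing in frequency direction can be described by
    $$\tilde{S}_{yy}(\Omega_\mu,n) = \begin{cases} \hat{S}_{yy}(\Omega_\mu,n), & \text{for } \mu = 0, \\ \lambda\,\tilde{S}_{yy}(\Omega_{\mu-1},n) + (1-\lambda)\,\hat{S}_{yy}(\Omega_\mu,n), & \text{for } \mu \in \{1,\dots,N-1\}, \end{cases} \tag{9}$$
    and
    $$\bar{S}_{yy}(\Omega_\mu,n) = \begin{cases} \tilde{S}_{yy}(\Omega_\mu,n), & \text{for } \mu = N-1, \\ \lambda\,\bar{S}_{yy}(\Omega_{\mu+1},n) + (1-\lambda)\,\tilde{S}_{yy}(\Omega_\mu,n), & \text{for } \mu \in \{0,\dots,N-2\}. \end{cases} \tag{10}$$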
  • Values for the smoothing constant λ are chosen from the range
    $$0.3 < \lambda < 0.7 \tag{11}$$
  • Subsequently, a linear weighting of the estimated and normalized power density spectrum is performed:
    $$\hat{S}_{yy,\mathrm{norm}}(\Omega_\mu,n) = \frac{\hat{S}_{yy}(\Omega_\mu,n)}{\bar{S}_{yy}(\Omega_\mu,n)}\; W(e^{j\Omega_\mu},n) \tag{12}$$
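  • The smoothing, normalization and weighting steps can be sketched as follows. The Python fragment is an illustration only and assumes that both smoothing passes are first-order recursions with the weights λ and 1 − λ; the function names, the default `lam = 0.5` and the floor `eps` are hypothetical, and the weighting values are passed in by the caller because the description does not give them numerically.

```python
def smooth_envelope(s_hat, lam=0.5):
    """Forward pass followed by a backward pass over the frequency index."""
    n_bins = len(s_hat)
    s_tilde = [0.0] * n_bins
    s_tilde[0] = s_hat[0]
    for mu in range(1, n_bins):
        s_tilde[mu] = lam * s_tilde[mu - 1] + (1.0 - lam) * s_hat[mu]
    s_bar = [0.0] * n_bins
    s_bar[n_bins - 1] = s_tilde[n_bins - 1]
    for mu in range(n_bins - 2, -1, -1):
        s_bar[mu] = lam * s_bar[mu + 1] + (1.0 - lam) * s_tilde[mu]
    return s_bar


def normalize_and_weight(s_hat, weights, lam=0.5, eps=1e-12):
    """Divide the spectrum by its envelope and apply the weighting per bin."""
    envelope = smooth_envelope(s_hat, lam)
    return [w * s / max(e, eps) for s, e, w in zip(s_hat, envelope, weights)]
```

    Running the recursion once in each direction avoids the shift towards higher bins that a single forward pass would introduce into the envelope.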
  • The weighting function $W(e^{j\Omega_\mu},n)$ has been chosen such that the attenuation rises with rising frequency. This choice is motivated by the fact that speech exhibits a speech fundamental frequency structure mainly at low frequencies, and it in turn improves the estimation of the speech fundamental frequency. Fig. 4 shows the functional principle of a method for speech fundamental frequency estimation.
  • The autocorrelation function
    $$\hat{r}_{yy}(m,n) = \frac{1}{N} \sum_{\mu=0}^{N-1} \hat{S}_{yy,\mathrm{norm}}(\Omega_\mu,n)\, e^{j\frac{2\pi}{N}\mu m} \tag{13}$$
    is determined by an inverse transform of the normalized and weighted power density spectrum from equation 12. The autocorrelation function $\hat{r}_{yy}(m,n)$ is used in order to estimate the speech fundamental frequency $f_p(n)$. The index m denotes the autocorrelation offset and the index n denotes the frame currently under analysis. For each single frame the preliminary speech fundamental frequency $f'_p(n)$ can be determined by searching for the maximum in a selected range of indices, for example 30 ≤ m ≤ 100. The speech fundamental frequency is then obtained as the sampling frequency $f_s$ divided by the index at which the maximum of the autocorrelation occurs:
    $$f'_p(n) = \frac{f_s}{\tau_p(n)} \tag{14}$$
    with
    $$\tau_p(n) = \underset{30 \le m \le 100}{\operatorname{argmax}}\; \hat{r}_{yy}(m,n) \tag{15}$$
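  • The inverse transform and the maximum search can be sketched as follows. This Python fragment is an illustration only: the function names are hypothetical, the inverse DFT is evaluated directly and only for the offsets that are actually searched, and the real part is taken because the power density spectrum of a real signal is symmetric.

```python
import cmath
import math


def autocorr_from_psd(s_norm, m):
    """r(m) = (1/N) * sum_mu S_norm(mu) * exp(+j * 2*pi/N * mu * m)."""
    n_bins = len(s_norm)
    acc = sum(s_norm[mu] * cmath.exp(2j * math.pi * mu * m / n_bins)
              for mu in range(n_bins))
    return acc.real / n_bins


def preliminary_pitch(s_norm, fs, m_min=30, m_max=100):
    """Return (tau, fs / tau) for the offset tau with the largest r(m)."""
    tau = max(range(m_min, m_max + 1), key=lambda m: autocorr_from_psd(s_norm, m))
    return tau, fs / tau
```

    As a usage example, a synthetic spectrum with a ripple of period N/50 over the frequency index yields the offset 50 and thus a frequency of 11025 Hz / 50 = 220.5 Hz.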
  • Furthermore a reliability $p_{f_p}(n)$ of the estimated speech fundamental frequency is determined. For this purpose the value of the normalized autocorrelation at the maximum point (i.e. at the index where the autocorrelation function becomes maximal) is used:
    $$p_{f_p}(n) = \frac{\hat{r}_{yy}(\tau_p(n),n)}{\hat{r}_{yy}(0,n)} \tag{16}$$
  • Large values, i.e. values close to one, indicate a very reliable detection, whereas small values indicate a doubtful detection. For this reason a detection only takes place for values of the normalized autocorrelation function which are larger than a predefined threshold value $p_0$:
    $$f_p(n) = \begin{cases} f'_p(n), & \text{for } p_{f_p}(n) > p_0, \\ \text{not detectable}, & \text{else.} \end{cases} \tag{17}$$
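  • The reliability measure and the threshold decision can be sketched as follows. The Python fragment is an illustration only; the function name is hypothetical and `None` is used here as a stand-in for the outcome "not detectable".

```python
def gated_pitch(r_zero, r_peak, f_prelim, p0=0.2):
    """Return (f_p, p): the accepted frequency (or None) and the reliability.

    r_zero   : autocorrelation value at offset 0
    r_peak   : autocorrelation value at the detected maximum
    f_prelim : preliminary speech fundamental frequency in Hz
    p0       : detection threshold
    """
    p = r_peak / r_zero if r_zero > 0.0 else 0.0
    return (f_prelim if p > p0 else None), p
```

    The reliability value is returned together with the decision so that a subsequent stage can use it, for example to control how quickly the estimate is tracked.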
  • A threshold value of p 0 ∈ [0.2,0.3] has turned out to be favourable. The value of the normalized autocorrelation at the location τp (n) can be of large significance as reliability information, for example for a speech signal reconstruction. Hereby the desired value of the speech fundamental frequency can be either slowly or quickly traced, dependent on how sure a speech fundamental frequency can be estimated.
  • Finally the inventive method proposed here is further presented in more detail by an example. Therefore 10 sinusoidal signals of equal amplitude are summed up. The frequencies of the sinusoidal signals have been chosen equidistantly. At the beginning of the signal a fundamental frequency of 300 Hz has been chosen, subsequently this frequency has been decreased linearly over the time to an end value of 60 Hz. In the upper diagram of Fig. 5 the development of the normalized autocorrelation vectors is shown and in the lower diagram of Fig. 5 a time-frequency-analysis of the corresponding input signals y(n) is shown.
  • For the analysis a DFT of order N = 256 (= DFT block length), a sampling frequency of fs = 11025 Hz and a frame offset of r = 64 are used. The analysis of the autocorrelation $\hat{r}_{yy}(m,n)$ has been performed in the range from m = 40 to m = 128. Detection results have been considered valid if the reliability information $p_{f_p}(n)$ is larger than $p_0$ = 0.2. Finally, the time-frequency analysis has been restricted to the frequency range of interest up to 1000 Hz.
  • In the analysis of the autocorrelation it can be recognized that the speech fundamental frequency can be estimated reliably up to an offset of about m = 95; this corresponds to a speech fundamental frequency of about fp(n) = 120 Hz (at a sampling frequency of fs = 11025 Hz). The graph of this detection with decreasing frequency can also be seen in the time-frequency-analysis up to about t = 3.8 s. However, if the speech fundamental frequency is below fp(n) = 120 Hz (which is often the case for men having a low speech fundamental frequency), it cannot be determined in a reliable way.
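  • As a check on the quoted figures, the conversion between autocorrelation offset and frequency can be evaluated directly for the stated sampling frequency. The offset m = 95 and the upper search limit m = 128 give
    $$f_p = \frac{f_s}{m} = \frac{11025\ \text{Hz}}{95} \approx 116\ \text{Hz}, \qquad \frac{11025\ \text{Hz}}{128} \approx 86\ \text{Hz},$$
    so that the value of about 120 Hz is a rounded figure and the searched range of offsets itself extends down to roughly 86 Hz.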
  • Contrary to the approaches mentioned in the previous description of the invention the approach disclosed subsequently has the following further advantages:
    • a sure and reliable estimation can also be performed for very low voices;
    • a better robustness in environments with background noise can be reached; and
    • the speech fundamental frequency can be determined with a significantly higher degree of precision.
  • Firstly, a method for estimating the speech fundamental frequency having an additional spectral refinement is described in more detail, and it is shown how the detection robustness can be increased by a noise reduction which is integrated in the estimation method (no pre-processing). Subsequently an additional part of the method is presented which makes it possible to also detect very low speech fundamental frequencies by means of an additional delay correction structure. Finally, approaches for adaptive post-processing and interpolation are disclosed which enable an error correction and an improvement of the preciseness of the speech fundamental frequency, respectively. It has to be mentioned, however, that all the disclosed aspects can also be used independently, so that the present invention does not require all of the aspects mentioned above to be implemented. For example, the spectral refinement can be used without the post-processing or the interpolation, or the approach having the additional delay correction structure can be used without the spectral refinement approach. Nevertheless, all the individual aspects jointly contribute to a much improved estimation of the speech fundamental frequency and shall be described herein as an embodiment.
  • Speech fundamental frequency estimation with spectral refinement
  • In the preceding section it has been shown that a speech fundamental frequency which is below 120 Hz can not be estimated. In the following an approach is presented which solves the described problem.
  • In addition to the already mentioned method according to the state of the art, the newly proposed method uses an additional spectral refinement of the input spectrum $Y(e^{j\Omega_\mu},n)$. The functional principle of this approach is disclosed in Fig. 6. The short-time spectrum $Y(e^{j\Omega_\mu},n)$ is first filtered subband-wise by an FIR-filter (FIR = finite impulse response). Such a filtering serves the purpose of obtaining a more precise spectral resolution of the input spectrum $Y(e^{j\Omega_\mu},n)$.
  • It was shown in Patent Application No. EP 06024940.6 that a spectral refinement within one subband can be achieved by a short FIR-filter, and how the individual filter coefficients have to be determined. The FIR-filter used for the µ-th subband can be described as follows:
    $$\mathbf{g}_\mu = [\,g_{\mu,0},\, g_{\mu,1},\, \dots,\, g_{\mu,M-1}\,]^T \tag{18}$$
  • The parameter µ denotes herein the µ-th frequency sampling point of a short-time spectrum $\tilde{Y}(e^{j\Omega_\mu},n)$ having a higher resolution and the parameter M denotes the order of the used FIR-filters. A memory length M of the short FIR-filter is chosen between 3 and 5. For the frequency subbands of interest the spectral refinement finally can be determined as follows:
    $$\tilde{Y}(e^{j\Omega_\mu},n) = g_{\mu,0}\, Y(e^{j\Omega_\mu},n) + \dots + g_{\mu,M-1}\, Y(e^{j\Omega_\mu},\, n-(M-1)\,r) \tag{19}$$
  • A spectral refinement in the whole frequency range is not necessary for speech signals. Usually the speech fundamental frequency structure is only present in the lower frequency range that means it is sufficient to perform the refinement up to, for example, 1000 Hz. Above this threshold it is possible to only introduce a delay of (M-1)/2 samples (down-sampled). The numerical effort necessary for such a refinement can thus be kept low. In Fig. 7 the analysis-synthesis-system with additional calculation of the spectral refinement in a low frequency range is shown.
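  • The subband-wise refinement with a pure delay above the limiting frequency can be sketched as follows. The Python fragment is an illustration under assumptions: the filter coefficients are placeholders (their derivation is given in Patent Application No. EP 06024940.6 and is not reproduced here), the function name and the parameter `mu_limit` are hypothetical, the filter length M is taken to be odd so that (M-1)/2 is an integer number of frames, and only the bins passed in are processed.

```python
def refine_spectrum(history, coeffs, mu_limit):
    """Refine the bins below mu_limit and delay the remaining bins.

    history[i][mu] : short-time spectrum of the frame i frame offsets in the
                     past, i = 0..M-1 (history[0] is the current frame)
    coeffs[mu]     : FIR coefficients g_{mu,0..M-1} of subband mu (placeholders)
    mu_limit       : first bin that is only delayed instead of filtered
    """
    m_len = len(history)
    delay = (m_len - 1) // 2  # assumes an odd filter length M
    refined = []
    for mu in range(len(history[0])):
        if mu < mu_limit:
            refined.append(sum(coeffs[mu][i] * history[i][mu] for i in range(m_len)))
        else:
            refined.append(history[delay][mu])
    return refined
```

    Delaying the unfiltered bins by the same number of frames keeps the refined and the unrefined part of the spectrum aligned in time.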
  • However, it has to be mentioned that by the calculation of a spectral refinement a low delay is introduced into the signal path. A detailed derivation of this part of the new approach is explained in more detail in Patent Application No. EP 06024940.6 .
  • Subsequently the determination of the speech fundamental frequency can be performed analogously to the procedure already described above. However, the refined short-time spectrum $\tilde{Y}(e^{j\Omega_\mu},n)$ is now used in order to calculate the estimated and refined power density spectrum $\hat{S}_{\tilde{y}\tilde{y}}(\Omega_\mu,n)$ according to the following equation:
    $$\hat{S}_{\tilde{y}\tilde{y}}(\Omega_\mu,n) = \tilde{Y}(e^{j\Omega_\mu},n)\, \tilde{Y}^*(e^{j\Omega_\mu},n) = \left| \tilde{Y}(e^{j\Omega_\mu},n) \right|^2 \tag{20}$$
  • Subsequently the power density spectrum $\hat{S}_{\tilde{y}\tilde{y}}(\Omega_\mu,n)$ is also smoothed and weighted, and the autocorrelation function $\hat{r}_{\tilde{y}\tilde{y}}(m,n)$ for the estimation of the speech fundamental frequency is determined. For this purpose an approach corresponding to equations 9 to 17 can be used.
  • In Fig. 8 the analysis of the autocorrelation as well as the time-frequency-analysis with spectral refinement is shown. For the analyses the same parameters as previously mentioned have been used, namely a DFT of order N = 256, a sampling frequency of fs = 11025 Hz, a frame offset of r = 64 and a detection threshold of p0 = 0.2. Furthermore, the same test signal has been used, i.e. the combination of sinusoidal signals whose frequency distance varies from 300 Hz to 60 Hz. The black graph in the upper diagram of Fig. 8 and the white graph in the lower diagram of Fig. 8 show the estimated pitch period duration and the estimated speech fundamental frequency, respectively, when using the spectral refinement approach.
  • A comparison of Fig. 5 and 8 shows very clearly that the spectral refinement provides the possibility of a far better detection of the speech fundamental frequency. Very desirable is the fact that the sure and reliable detection rises up to an offset of m=N/2 =128 - this corresponds to a speech fundamental frequency of about 90 Hz. At lower frequencies fp < 90 Hz several detection errors occur. Finally it has to be mentioned that in many applications it is only of interest whether a speech fundamental frequency is present or not - an exact speech fundamental frequency would be of minor importance. Just in these application scenarios the previously presented approach would provide significant advantages.
  • In the following it will be the aim to present a new approach which works robustly in terms of erroneous estimations at very low speech fundamental frequencies. Additionally it is shown in the following section how noise reduction can be advantageously incorporated into the presently known method.
  • Speech fundamental frequency estimation with noise suppression
  • Fig. 9A shows a block diagram of an embodiment of a speech fundamental frequency estimator 900. The speech fundamental frequency estimator 900 comprises a power density spectrum calculator 902 and an analyzer 904. The power density spectrum calculator 902 has two inputs, one for receiving a set of values and one for receiving background noise information. The set of values Ỹ1 is a frequency-domain representation of a set of time domain signal values y1 in a time interval t1. The background noise information can, for example, be determined in speech pauses, in which only a noise signal and no speech signal is provided to the power density spectrum calculator 902. The power density spectrum calculator 902 has two outputs, one for outputting a noise suppression factor V(e^{jΩµ},n) and one for outputting values of a power density spectrum. The analyzer 904 has two inputs for receiving both outputs of the power density spectrum calculator 902. Furthermore, the analyzer 904 has one output for outputting the determined speech fundamental frequency fp(n).
  • The function of the speech fundamental frequency estimator 900 is described in more detail with reference to Fig. 9B, which discloses a flow diagram of a method for estimating the speech fundamental frequency. The method 940 comprises a first step 950 in which a power density spectrum is provided by multiplying a version of the set of values Ỹ2 with a complex conjugate version of the second set of values. In parallel (or in series), an estimate of a power density spectrum of background noise is determined in a second step 952. This step 952 uses the background noise information, which may originate, for example, from a speech pause detector or from other means which provide information about the background noise only in the absence of speech. In a third step 954 a noise suppression factor is determined, which is explained in more detail below. In a fourth step 956 the power density spectrum is multiplied with the noise suppression factor V(e^{jΩµ},n), before a frequency-time-transform is carried out in a fifth step 958. Subsequently, in a sixth step 960, the speech fundamental frequency is determined from the frequency-time-transformed signal resulting from step 958.
  • Such an approach has the advantage that, by taking background noise information into account, both the precision and the robustness of the detection can be improved, since for example in speech pauses, when only background noise occurs, no speech fundamental frequency shall be estimated. Thus, the reliability of an estimated speech fundamental frequency can be significantly improved, because erroneous detections of speech fundamental frequencies in speech pauses can be avoided. Furthermore, multiplying the noise suppression factor with the power density spectrum prior to the frequency-time-transform has the advantage that such a multiplication in the frequency domain requires very little computational and numerical effort, in contrast to a comparable combination in the time domain. It is also possible to additionally apply other calculations or normalizations to the noise-compensated signal prior to said frequency-time-transform.
  • To be more precise, methods for noise reduction are mostly based on modified Wiener filters whose frequency response in the respective frequency intervals is determined by
    $$V\left(e^{j\Omega_\mu},n\right) = \max\left\{V_0,\; 1 - \beta\,\frac{\hat{S}_{nn}(\Omega_\mu,n)}{\hat{S}_{yy}(\Omega_\mu,n)}\right\}$$
    (see also S. F. Boll: Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Trans. Acoust. Speech Signal Process., Vol. 27, No. 2, pp. 113-120, 1979; E. Hänsler: Statistische Signale - Grundlagen und Anwendungen, Springer, Berlin, Germany, 2001; or T. Haulick, K. Linhard: Noise Subtraction with Parametric Recursive Gain Curves, Proceed. of the European Conf. on Speech Communications and Technology, Vol. 6, pages 2611-2614, 1999). The value Ŝnn(Ωµ,n) denotes an estimate of the auto power density spectrum of a disturbance (background noise), V0 describes the maximal attenuation, and the parameter β is used for overestimating the power density spectrum of the disturbance. Because the disturbance has to be regarded as non-stationary, a short-time estimate has to be used for it. However, signal and disturbance are available only as a sum in the microphone signal y(n). The estimate of the power density spectrum of the background noise can be obtained in two different ways: firstly, the power of the microphone signal can be estimated in speech pauses, which requires a speech pause detector; secondly, an estimate of the power of the disturbance can be determined from the segment-wise estimated minima of the power of the microphone signal. As noise estimation is not the main focus of this patent application, further details are not explained here; reference is made to P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006.
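  • As a minimal sketch of the noise suppression factor described above (not the implementation of the patent), the following Python/NumPy function evaluates V per frequency bin. The function name, the default values of beta and V0, and the small eps guard against division by zero are illustrative assumptions that do not appear in the source.

```python
import numpy as np


def noise_suppression_factor(S_yy, S_nn, beta=2.0, V0=0.1, eps=1e-12):
    """Per-bin factor V = max(V0, 1 - beta * S_nn / S_yy).

    S_yy: estimated power density spectrum of the noisy microphone signal.
    S_nn: estimated power density spectrum of the background noise.
    beta, V0, eps: hypothetical example values, not taken from the source.
    """
    S_yy = np.asarray(S_yy, dtype=float)
    S_nn = np.asarray(S_nn, dtype=float)
    ratio = S_nn / np.maximum(S_yy, eps)  # eps only avoids division by zero
    return np.maximum(V0, 1.0 - beta * ratio)


# Three bins with decreasing noise-to-signal ratio.
V = noise_suppression_factor([1.0, 4.0, 100.0], [1.0, 1.0, 1.0])
```

    Bins dominated by noise are clamped to the maximal attenuation V0, while bins with a high signal-to-noise ratio pass almost unchanged.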
  • Normally, noise reduction is used as a pre-processing stage for a speech fundamental frequency estimation; that is, instead of the input subband signals Y(e^{jΩµ},n) the noise-reduced signals Y(e^{jΩµ},n)·V(e^{jΩµ},n) are processed. The present approach proceeds in a similar way: firstly, a noise-reduced power density spectrum (see equation 12), where applicable after a subsequent spectral refinement, is determined according to the following equation:
    $$\hat{S}_{\tilde{y}\tilde{y},\mathrm{norm},g}(\Omega_\mu,n) = \frac{\hat{S}_{\tilde{y}\tilde{y}}(\Omega_\mu,n)}{S_{\tilde{y}\tilde{y}}(\Omega_\mu,n)}\,W\left(e^{j\Omega_\mu}\right)V\left(e^{j\Omega_\mu},n\right)$$
  • For detection, the inverse transform is then calculated as follows:
    $$\hat{r}_{yy,g}(m,n) = \frac{1}{N}\sum_{\mu=0}^{N-1}\hat{S}_{\tilde{y}\tilde{y},\mathrm{norm},g}(\Omega_\mu,n)\,e^{j\frac{2\pi}{N}\mu m}$$
  • As normalization factor, the value r̂yy(0,n) from equation 16 is used again, which is the normalization value of the autocorrelation including noise. This results in the following modified detection:
    $$p_{f_p}(n) = \frac{\hat{r}_{yy,g}\left(\tau_p(n),n\right)}{\hat{r}_{yy}(0,n)}$$
  • As a result, a more robust detection in speech pauses is obtained. To show this effect more clearly, Fig. 10 shows results of the speech fundamental frequency estimation with spectral refinement in terms of a time-frequency-analysis with and without noise reduction. All parameters of the methods were identical to the previously described parameters. As can be seen very clearly, erroneous detections (denoted by black ellipses in the upper diagram of Fig. 10) can be suppressed when the above-mentioned noise reduction is used. In passages with speech activity nearly nothing changes.
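  • The chain of noise-reduced normalization, inverse transform and detection described in this section can be sketched as follows. This is a hedged illustration only: the function and parameter names are hypothetical, the search range m_min..m_max and the rule "a pitch is reported only if the detection value exceeds p0" are assumptions, and the toy signal uses a plain FFT without smoothing, weighting or spectral refinement.

```python
import numpy as np


def detect_pitch(S_hat, S_smooth, W, V, r_yy0, fs, m_min, m_max, p0=0.2):
    """Noise-reduced detection sketch (names and decision rule are assumptions)."""
    # Normalized, weighted and noise-reduced power density spectrum.
    S_norm_g = S_hat / S_smooth * W * V
    # np.fft.ifft computes (1/N) * sum_mu S[mu] * exp(+j*2*pi*mu*m/N).
    r_g = np.fft.ifft(S_norm_g).real
    # Offset of the maximum inside the assumed search range.
    tau_p = m_min + int(np.argmax(r_g[m_min:m_max + 1]))
    # Normalize by the zero-lag value of the autocorrelation including noise.
    p = r_g[tau_p] / r_yy0
    f_p = fs / tau_p if p > p0 else None
    return f_p, p


# Toy example: a pure tone with a period of 32 samples.
N, fs = 256, 11025.0
x = np.cos(2 * np.pi * np.arange(N) / 32)
S_hat = np.abs(np.fft.fft(x)) ** 2
ones = np.ones(N)
r_yy0 = float(np.sum(x ** 2))

f_voiced, p_voiced = detect_pitch(S_hat, ones, ones, ones, r_yy0, fs, 20, 50)
# With V = 0 in every bin (as in a fully suppressed pause) nothing is detected.
f_pause, p_pause = detect_pitch(S_hat, ones, ones, 0.0 * ones, r_yy0, fs, 20, 50)
```

    Because the numerator is attenuated by V while the normalization value still contains the noise, the detection value drops in noise-only frames, which is the effect described above.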
  • Speech fundamental frequency estimation on the basis of a plurality of subband vectors
  • In this section a further part of the approach for the inventive speech fundamental frequency estimation is described.
  • Fig. 11A shows a block diagram of an embodiment of the inventive speech fundamental frequency estimator 1100. The speech fundamental frequency estimator 1100 comprises a first power density spectrum calculator 1102, a second power density spectrum calculator 1104 and an analyzer 1106. The first power density spectrum calculator 1102 and the second power density spectrum calculator 1104 are both fed by a common input of width N, on which a first set of values Ỹ1 and a second set of values Ỹ2 are provided in succession. Herein, the first set of values Ỹ1 is a frequency domain representation of a first set of time domain signal values y1 within a first time interval t1. The second set of values Ỹ2 is a frequency domain representation of a second set of time domain signal values y2 within a second time interval t2. In the embodiment shown in Fig. 11A the first and second time intervals overlap. The first power density spectrum calculator 1102 is configured for storing a version of the first set of values and for providing values of a first power density spectrum Ŝỹỹd(Ωµ,n) by multiplying the stored version of the first set of values Ỹ1 with a complex conjugate version of the second set of values Ỹ2. The second power density spectrum calculator 1104 is configured for providing values of a second power density spectrum Ŝỹỹ(Ωµ,n) by multiplying a version of the second set of values with a complex conjugate version of the second set of values. The analyzer 1106 is configured for receiving the first and second power density spectra from the first and second power density spectrum calculators 1102, 1104, respectively, and for determining the speech fundamental frequency estimate fp(n) on the basis of the values of the first power density spectrum Ŝỹỹd(Ωµ,n) and the values of the second power density spectrum Ŝỹỹ(Ωµ,n).
  • Fig. 11B shows the functionality of the speech fundamental frequency estimator of Fig. 11A in more detail. To be more precise, Fig. 11B discloses a method 1140 for estimating the speech fundamental frequency fp(n). Firstly, first and second sets of values Ỹ1 and Ỹ2 are provided, each of which has N individual values (that is, a width of N). In a first step 1150 a version of the first set of values Ỹ1 is stored. In a second step 1152 the stored version of the first set of values Ỹ1 is multiplied with a version of the second set of values Ỹ2, which is fed directly to the multiplication step without a storing step. The result of the multiplication step 1152 is said first power density spectrum Ŝỹỹd(Ωµ,n). In parallel to the multiplication step 1152, a further multiplication step 1154 is performed in which versions of the second set of values Ỹ2 are multiplied with each other, which results in the second power density spectrum. In a final step 1156 the speech fundamental frequency estimate fp(n) is determined.
  • The inventive approach shown in Fig. 11A and 11B has the advantage that it is now possible to estimate lower speech fundamental frequencies than would be possible according to the state of the art. This is mainly due to the fact that (conventional, existing) short sets of frequency domain values can be used for a precise speech fundamental frequency estimation, since the multiplication in step 1152 with a stored, that is delayed, version of a previous set of frequency domain values results in a kind of elongated analysis time interval for estimating the low speech fundamental frequency. Moreover, it is also possible to correct errors which might result from the time offset of the first and second time intervals, because the determination of the speech fundamental frequency estimate also uses the second power density spectrum, which is based on a multiplication of versions of the second set of values. The first power density spectrum can therefore be compared with the information resulting from the second power density spectrum, such that a kind of normalization can be performed or possible errors in the first power density spectrum can be recognized and corrected.
  • To be more specific, it has been shown in the previous description that a speech fundamental frequency below 120 Hz (fp(n) < 120 Hz) cannot be detected correctly anymore. Therefore, in the first approach a subsequent spectral refinement has been applied. However, this spectral refinement improved the speech fundamental frequency estimation only down to about fp(n) = 90 Hz. The reason for this threshold is that, with the DFT of order N that is used, a maximal autocorrelation offset of m = N/2 + 1 is available for the analysis of the speech fundamental frequency; this corresponds to a lowest detectable speech fundamental frequency of about 90 Hz. It has been assumed that the power density spectra and the autocorrelation functions used are real (not complex) and, furthermore, symmetric.
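  • As a worked check of this limit, assuming the relation fp = fs/m between autocorrelation offset and fundamental frequency, the parameters fs = 11025 Hz and N = 256 used in the analyses, and the offset m = N/2 = 128 quoted earlier in the description:

```latex
f_{p,\min} \approx \frac{f_s}{N/2} = \frac{11025\ \mathrm{Hz}}{128} \approx 86.1\ \mathrm{Hz}
```

    This is consistent with the stated limit of about 90 Hz.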
  • A further inventive idea is that not only the present signal frame y(n) is used for the estimation of the speech fundamental frequency but also a signal frame y(n-d), which is a signal frame delayed by d clock cycles. For example, the speech fundamental frequency estimation can be significantly improved by using the present signal frame together with a signal frame delayed by one frame cycle, d = r, with an overlap of 75%; this corresponds to a frame offset of r = 64 and a signal block length of N = 256.
  • Fig. 12 shows the functional principle of the method for estimating the speech fundamental frequency. In addition to the method already described, the inventive approach uses a cross correlation with the delayed input frame. It can be seen from Fig. 12 that, in addition to the estimated auto power density spectrum Ŝỹỹ(Ωµ,n), a variant of the cross power density spectrum
    $$\hat{S}_{\tilde{y}\tilde{y}_d}(\Omega_\mu,n) = \tilde{Y}\left(e^{j\Omega_\mu},n\right)\,\tilde{Y}^{*}\left(e^{j\Omega_\mu},n-d\right)$$
    is determined in the lower path of Fig. 12. For the determination of the cross power density spectrum Ŝỹỹd(Ωµ,n), the present short-time spectrum Ỹ(e^{jΩµ},n) and the delayed short-time spectrum Ỹ*(e^{jΩµ},n-d) are used. In the following, only the short-time spectrum delayed by one frame clock, that is d = r, is dealt with further; however, other delays can also be used here.
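  • A minimal sketch of how the auto and cross power density spectra could be formed from the current and the delayed short-time spectrum is given below. The function name is hypothetical, and a plain FFT of unwindowed frames stands in for the spectrally refined short-time spectrum, which is a simplifying assumption.

```python
import numpy as np


def auto_and_cross_psd(Y_cur, Y_del):
    """Return (auto PSD of the current frame, cross PSD with the delayed frame)."""
    S_auto = (Y_cur * np.conj(Y_cur)).real   # |Y(n)|^2, real-valued
    S_cross = Y_cur * np.conj(Y_del)         # Y(n) * conj(Y(n - d)), complex
    return S_auto, S_cross


# Frames of length N = 256 with a frame offset of r = 64 (values from the text).
N, r = 256, 64
rng = np.random.default_rng(0)
x = rng.standard_normal(4 * N)               # arbitrary test signal (assumption)
end = 2 * N
Y_cur = np.fft.fft(x[end - N:end])            # present frame
Y_del = np.fft.fft(x[end - r - N:end - r])    # frame delayed by d = r samples
S_auto, S_cross = auto_and_cross_psd(Y_cur, Y_del)
```

    In a frame-based implementation the delayed spectrum would simply be the stored spectrum of the previous frame, so no additional transform is required.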
  • The cross power density spectrum thus determined is divided by the smoothed auto power density spectrum Sỹỹ(Ωµ,n) and multiplied by a weighting function as shown below:
    $$\tilde{S}_{\tilde{y}\tilde{y}_d}(\Omega_\mu,n) = \frac{\hat{S}_{\tilde{y}\tilde{y}_d}(\Omega_\mu,n)}{S_{\tilde{y}\tilde{y}}(\Omega_\mu,n)}\,W\left(e^{j\Omega_\mu}\right)$$
  • After a subsequent noise reduction and an inverse transform according to equation 23, the cross-correlation function r̂ỹỹd,g(m,n) is determined according to equation 13. In the following, the aim is to determine an extended autocorrelation function r̂ỹỹ,erw(k,n) of order N/2 + r from the autocorrelation function r̂ỹỹ,g(m,n) and the cross-correlation function r̂ỹỹd,g(m,n), each of which has the order N/2. The index k of the term r̂ỹỹ,erw(k,n) denotes the offset of the autocorrelation, where
    $$k \in \left\{0, \ldots, \frac{N}{2} + r - 1\right\}$$
  • By using an adaptive compensation control, an attempt can be made to correct the error terms of the cross-correlation function r̂ỹỹd,g(m,n). For this purpose a correction value Δ(m,n) is determined for each time frame in order to compensate, for example, the undesired amplitudes which occur at an offset of m = r = 64 and to correct the remaining amplitude values, so that a later combination with the autocorrelation function r̂ỹỹ,g(m,n) can be performed:
    $$\hat{r}_{\tilde{y}\tilde{y}_d,g,\mathrm{mod}}(m,n) = \hat{r}_{\tilde{y}\tilde{y}_d,g}(m,n) - \Delta(m,n) = \hat{r}_{\tilde{y}\tilde{y}_d,g}(m,n) - c(n)\,\hat{r}_{\tilde{y}\tilde{y},g}(m-r,n)$$
  • The adaptive constant c(n) is derived from the ratio of the cross-correlation function r̂ỹỹd,g(m,n) at the location m = r to the autocorrelation function r̂ỹỹ,g(m,n) at the location m = 0. For a robust speech fundamental frequency estimation this ratio should not fall below a minimum value c0. Therefore, the adaptive parameter c(n) is determined as follows:
    $$c(n) = \max\left\{\frac{\hat{r}_{\tilde{y}\tilde{y}_d,g}(r,n)}{\hat{r}_{\tilde{y}\tilde{y},g}(0,n)},\; c_0\right\}$$
  • Tests have shown that good results can be obtained in the case the constant c0 is set to a value of c0=0.4.
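  • The adaptive compensation can be sketched as follows. The function name is hypothetical; negative offsets m - r are mapped to |m - r|, which relies on the symmetry of the autocorrelation function assumed earlier in the description, and a non-zero value of the autocorrelation at offset 0 is assumed.

```python
import numpy as np


def compensate_cross_correlation(r_cross, r_auto, r, c0=0.4):
    """Subtract the scaled, shifted autocorrelation from the cross correlation.

    Returns the compensated cross correlation and the adaptive constant c.
    """
    r_cross = np.asarray(r_cross, dtype=float)
    r_auto = np.asarray(r_auto, dtype=float)
    # Ratio of cross correlation at m = r to autocorrelation at m = 0,
    # bounded from below by c0 (r_auto[0] is assumed to be non-zero).
    c = max(r_cross[r] / r_auto[0], c0)
    m = np.arange(len(r_cross))
    # Symmetry assumption: r_auto(m - r) = r_auto(|m - r|).
    delta = c * r_auto[np.abs(m - r)]
    return r_cross - delta, c


# Small example with a frame offset of r = 2.
r_auto_ex = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.1])
r_cross_ex = np.array([1.0, 1.0, 3.0, 1.0, 1.0, 1.0])
r_mod_ex, c_ex = compensate_cross_correlation(r_cross_ex, r_auto_ex, r=2)
```

    When the ratio exceeds c0, the value at m = r is cancelled exactly; when the lower bound c0 takes effect, more than the measured amplitude is subtracted there.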
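  • Subsequently, the auto- and cross-correlation coefficients r̂ỹỹ,g(m,n) and r̂ỹỹd,g,mod(m,n) are weighted by a weighting function and combined as follows:
    $$\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}(k,n) = \begin{cases} \hat{r}_{\tilde{y}\tilde{y},g}(k,n), & \text{for } 0 \le k < \frac{N}{2} - r, \\ a(k-r)\,\hat{r}_{\tilde{y}\tilde{y},g}(k,n) + \left[1 - a(k-r)\right]\hat{r}_{\tilde{y}\tilde{y}_d,g,\mathrm{mod}}(k-r,n), & \text{for } \frac{N}{2} - r \le k < \frac{N}{2}, \\ \hat{r}_{\tilde{y}\tilde{y}_d,g,\mathrm{mod}}(k-r,n), & \text{for } \frac{N}{2} \le k < \frac{N}{2} + r. \end{cases}$$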
  • Herein the linear function a(m) was chosen such that the weight of the coefficients decreases with increasing offset m. The extended autocorrelation function r̂ỹỹ,erw(k,n) thus obtained is finally used for the estimation of the speech fundamental frequency. As in the methods mentioned before, the speech fundamental frequency is determined by searching for the maximum in each single frame, but now in an elongated range, for example 30 ≤ k ≤ 180.
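  • The combination and the subsequent maximum search can be sketched as follows. The text only states that a(m) is linear and decreasing; the specific ramp used here (from 1 towards 0 across the overlap region) and the function names are assumptions made for illustration.

```python
import numpy as np


def extended_autocorrelation(r_auto, r_cross_mod, r):
    """Combine two correlation functions of order N/2 into one of order N/2 + r."""
    r_auto = np.asarray(r_auto, dtype=float)
    r_cross_mod = np.asarray(r_cross_mod, dtype=float)
    half = len(r_auto)                                    # N/2
    if len(r_cross_mod) != half or not 0 < 2 * r <= half:
        raise ValueError("expected equal lengths and 0 < 2*r <= N/2")
    ext = np.empty(half + r)
    # 0 <= k < N/2 - r: autocorrelation only.
    ext[:half - r] = r_auto[:half - r]
    # N/2 - r <= k < N/2: cross-fade with an assumed linear weight.
    k = np.arange(half - r, half)
    a = 1.0 - (k - (half - r)) / r
    ext[half - r:half] = a * r_auto[k] + (1.0 - a) * r_cross_mod[k - r]
    # N/2 <= k < N/2 + r: compensated cross correlation only, read at k - r.
    ext[half:] = r_cross_mod[half - r:]
    return ext


def search_pitch(r_ext, fs, k_min=30, k_max=180):
    """Maximum search in the elongated range; returns (offset, frequency)."""
    tau_p = k_min + int(np.argmax(r_ext[k_min:k_max + 1]))
    return tau_p, fs / tau_p


# Tiny example with N/2 = 8 and r = 2.
ext_ex = extended_autocorrelation(np.arange(8.0), 10.0 + np.arange(8.0), r=2)

# Search example: a single peak at offset 150 in a function of order 128 + 64.
r_peak = np.zeros(192)
r_peak[150] = 1.0
tau_ex, f_ex = search_pitch(r_peak, fs=11025.0)
```

    With N = 256 and r = 64 the extended function covers offsets up to 191, so the example range 30 ≤ k ≤ 180 lies entirely within it.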
  • In order to clarify the functioning of the described method, Fig. 13 shows two examples of the analysis of the speech fundamental frequency. The left section of Fig. 13 shows the analysis of a speech fundamental frequency of about 270 Hz, whereas the right section of Fig. 13 shows the analysis of a speech fundamental frequency of about 60 Hz.
  • Firstly, each of the left and right sections of Fig. 13 shows the correlation of the present signal frame with itself (left) and with a preceding signal frame (right). In each section the grey graph denotes the cross-correlation function r̂ỹỹd,g(m,n) before the adaptive compensation control and the dark grey graph denotes the cross-correlation function r̂ỹỹd,g,mod(m,n) after the adaptive compensation control. It can be clearly seen that significant error terms, especially the error terms at the location k = r, are corrected by the adaptive compensation control.
  • The lower graph in each of the two sections of Fig. 13 shows the extended autocorrelation function r̂ỹỹ,erw(k,n) across an elongated autocorrelation offset, which is generated by combining the two correlation functions r̂ỹỹ,g(m,n) and r̂ỹỹd,g,mod(m,n) according to equation 30. At a high speech fundamental frequency the corresponding speech fundamental period can be determined and detected quite well using the autocorrelation function r̂ỹỹ,g(m,n) (left section of Fig. 13). In contrast, at the low speech fundamental frequency of about 60 Hz the corresponding speech fundamental period cannot be determined any longer by the standard autocorrelation r̂ỹỹ,g(m,n). The lower part of the right section of Fig. 13 shows that, by combining the correlation of the signal frame with itself and the correlation with a preceding signal frame, the speech fundamental period can still be determined and detected.
  • Fig. 14 shows the analysis of the extended autocorrelation function r̂ỹỹ,erw(k,n) when a preceding spectral refinement in the low-frequency region is used, together with a time-frequency-analysis of the input signal. A comparison of the analyses in Fig. 5 and Fig. 14 indicates that significant improvements can be achieved by using the previously described approach. With this approach an existing speech fundamental period up to an offset of about k=125 can still be detected. Moreover, no erroneous detections occur at low speech fundamental frequencies. Thus, the described approach allows a reliable estimation down to a speech fundamental frequency of about fp(n) = 60 Hz.
  • Adaptive post-processing
  • At several locations erroneous estimations of the speech fundamental frequency fp(n) still occur. For these values a half, respectively a third, of the speech fundamental frequency are often estimated. A subsequent post-processing is then preferably used to correct the occurring erroneous detections.
  • After estimation of the speech fundamental frequency fp(n), a test can be made whether this estimate is below a threshold fk. The post-processing shall only be performed if the condition
    $$f_p(n) < f_k$$
    is fulfilled. Values between fk = 140 Hz and fk = 160 Hz have proven suitable in practice. Subsequently, a normalized speech fundamental period is estimated by searching for the index of the maximum of the autocorrelation function
    $$\tilde{\tau}_p(n) = \underset{k}{\operatorname{argmax}}\left\{\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}(k,n)\right\}$$
    in a selected range
    $$\frac{f_s}{f_{p,\max}} \le k < \frac{f_s}{2 f_p(n)} + k_0$$
  • To determine this range, the previously determined speech fundamental frequency fp(n) is first doubled. The parameter fp,max in equation 33 is a predefined value of the maximal possible speech fundamental frequency. Finally, the value k0 is a constant which ensures that a search for a maximum lying slightly above
    $$k = \frac{f_s}{2 f_p(n)}$$
    is also allowed.
  • If the newly determined maximum is higher than 60 percent of the previously determined maximum, that is
    $$\tilde{\tau}_p(n) > 0.6\,\tau_p(n)$$
    and if the amplitude of this newly determined maximum is also above a predetermined amplitude value
    $$\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tilde{\tau}_p(n),n\right) > \tilde{p}_0$$
    a correction of the previously estimated speech fundamental frequency is performed according to
    $$f_p(n) = \frac{f_s}{\tilde{\tau}_p(n)}$$
  • In order to clarify the improvements which result from such a post-processing, Fig. 15 shows a time-frequency-analysis of an input signal together with the detection results of the speech fundamental frequency estimation. In the upper part of Fig. 15 the post-processing was deactivated, and erroneous detections (halvings of the frequency) can be observed at two locations (at 0.7 and at 0.75 seconds). As can be seen from the lower part of Fig. 15, such erroneous detections can be corrected by the post-processing.
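  • The restricted maximum search used by the post-processing can be sketched as follows. Only the candidate search is shown; the acceptance tests described above are left to the caller. The function name and the values chosen for f_p_max and k0 are hypothetical, and f_k = 150 Hz is merely one value from the range named in the text.

```python
import math

import numpy as np


def search_shorter_period(r_ext, f_p, fs, f_k=150.0, f_p_max=400.0, k0=3):
    """Candidate search for a shorter period (sketch, parameters are assumptions).

    Returns (candidate offset, correlation value there), or None if the
    post-processing does not apply or the search range is empty.
    """
    if f_p >= f_k:                      # post-processing only below the threshold
        return None
    k_lo = math.ceil(fs / f_p_max)      # smallest admissible offset
    # Integers k with k < fs / (2 f_p) + k0, written as an exclusive bound.
    k_hi = min(math.ceil(fs / (2.0 * f_p) + k0), len(r_ext))
    if k_lo >= k_hi:
        return None
    tau_new = k_lo + int(np.argmax(r_ext[k_lo:k_hi]))
    return tau_new, float(r_ext[tau_new])


# Example: the first search found offset 80, but a strong peak also exists at 40.
fs = 11025.0
r_ext_ex = np.zeros(192)
r_ext_ex[80] = 1.0
r_ext_ex[40] = 0.9
candidate = search_shorter_period(r_ext_ex, f_p=fs / 80, fs=fs)
```

    If the caller accepts the candidate, the corrected estimate follows from fs divided by the candidate offset, here 11025 / 40 ≈ 275.6 Hz instead of about 137.8 Hz.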
  • Interpolation
  • When applying the approach described up to now, it could be observed that the speech fundamental frequency is estimated only inaccurately. The estimation results show staircase-like graphs of the estimated speech fundamental frequency, as can be seen in Fig. 14 for example. Up to now it was only possible to determine a quantized speech fundamental frequency estimate: when the exact speech fundamental period lies between two autocorrelation offsets k of the autocorrelation function r̂ỹỹ,erw(k,n), it is rounded to the nearest autocorrelation offset k in order to determine the estimated speech fundamental period τp(n) or τ̃p(n), respectively. Therefore, quantization errors occur.
  • In numerous applications, as for example for a speech signal construction, an exact speech fundamental frequency estimation is of significant importance. One possible approach to solve the described problem is to perform an interpolation of the estimated speech fundamental frequency which is described in more detail in the following.
  • For the interpolation, an approximated si(x)-function is used first, which can be written as a simple polynomial of order 2 according to the following approximation:
    $$f(x) = \frac{\sin x}{x} \approx 1 - \frac{x^2}{6}$$
  • Furthermore, the interpolation uses the autocorrelation coefficient at which the extended autocorrelation function r̂ỹỹ,erw(k,n) has its maximum, together with the adjacent autocorrelation coefficients, that is, those at the autocorrelation offsets to the left and right of the maximum. The interpolated speech fundamental period τ̃p,mod(n) can thus be written as a function of the quantized speech fundamental period τp(n) and the considered autocorrelation coefficients according to the following equation:
    $$\tilde{\tau}_{p,\mathrm{mod}}(n) = \mathrm{Fkt}\left\{\tau_p(n),\; \hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)-1,n\right),\; \hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n),n\right),\; \hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)+1,n\right)\right\}$$
  • In this context it has to be noted that, if a correction according to the post-processing described in the previous section is performed, the value τp(n) has to be replaced by the value τ̃p(n). Finally, the estimated and interpolated speech fundamental period can be determined according to
    $$\hat{\tau}_{p,\mathrm{mod}}(n) = \tau_p(n) - \Delta_p(n),$$
    wherein Δp(n) is a correction value for the quantized speech fundamental period τp(n) which has to be determined in every frame clock n according to the following equation:
    $$\Delta_p(n) = \frac{\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)+1,n\right) - \hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)-1,n\right)}{2\left[\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)+1,n\right) + \hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n)-1,n\right) - 2\,\hat{r}_{\tilde{y}\tilde{y},\mathrm{erw}}\left(\tau_p(n),n\right)\right]}$$
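  • The correction value can be illustrated with a short sketch that fits a parabola through the maximum and its two neighbours, which is what the second-order approximation above amounts to. The function name and the fallback for a vanishing denominator are assumptions.

```python
import numpy as np


def interpolate_pitch_period(r_ext, tau_p):
    """Refine the integer offset tau_p using its two neighbours.

    Assumes 1 <= tau_p <= len(r_ext) - 2 so that both neighbours exist.
    """
    r_minus = float(r_ext[tau_p - 1])
    r_zero = float(r_ext[tau_p])
    r_plus = float(r_ext[tau_p + 1])
    denom = 2.0 * (r_plus + r_minus - 2.0 * r_zero)
    if denom == 0.0:                 # flat neighbourhood: keep the integer offset
        return float(tau_p)
    delta_p = (r_plus - r_minus) / denom
    return tau_p - delta_p


# Example: samples of a parabola whose true maximum lies at k = 10.3.
k = np.arange(20)
r_parabola = -(k - 10.3) ** 2
tau_quantized = int(np.argmax(r_parabola))
tau_refined = interpolate_pitch_period(r_parabola, tau_quantized)
```

    For an exactly parabolic peak the refinement recovers the true maximum; for measured correlation functions it only reduces the quantization error.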
  • Finally, the interpolation presented here for improving the speech fundamental frequency estimation is illustrated by two examples. The upper part of Fig. 16 shows the time-frequency-analysis of a portion of several sinusoidal signals of equal amplitude. In contrast, the lower part of Fig. 16 shows a portion of a speech signal of a female voice. In both the upper and the lower part of Fig. 16 the white graph denotes the estimated quantized speech fundamental frequency. The grey graph in the upper part and the black graph in the lower part show the estimated speech fundamental frequency after the interpolation. It can be seen from the upper part of Fig. 16 that the interpolation yields nearly the desired straight graph of the estimated speech fundamental frequency. In the lower part it can be seen that, when the interpolation is used, the estimated speech fundamental frequency closely follows the fundamental frequency structure of the speech signal.
  • Furthermore, the analysis has shown that, when the previously described interpolation is used, the speech fundamental frequency estimation can be improved by up to about 30 Hz for female voices and by about 10 Hz for male voices.
  • In summary, the problem presented in the introductory portion is solved by an approach having four independent steps, each of which contributes to the overall improvement and each of which can also be implemented independently of the others:
    • For improvement of the spectral resolution short FIR-filters can be used in portions of the spectrum having low frequencies. This results in a significant improvement for medium speech fundamental frequencies.
    • After the determination of necessary scaling values a noise reduction is performed. Thus, the method becomes more robust against background noise.
    • In addition to the correlation of the current signal frame with itself, a correlation with the preceding signal frame is also calculated. However, this generates significant error terms. By means of an adaptive correlation compensation those terms can largely be removed, and the second correlation can thus be used for the estimation of very low speech fundamental frequencies.
    • By means of a simple interpolation a more precise estimation can be obtained. Finally erroneous detections which lead to doublings, respectively triplications, of the estimation are also corrected by means of adaptive post-processing.
  • Expressed in other words, this invention describes a method for estimating the fundamental frequency (pitch frequency) of speech signals. This is achieved in the DFT domain by analyzing the current input spectrum as well as past input spectra. To achieve an estimation performance that is improved compared to standard methods, a four-stage algorithm is proposed, whose steps can also be used independently: First, pre-processing (called spectral refinement) is applied to the input spectrum at low frequencies. Second, a noise reduction is applied when computing normalization values. Third, estimates of the autocorrelation of the current frame and of the cross correlation of the current frame with the previous frame are adaptively combined in order to obtain an extended range. Fourth, post-processing is applied to reduce estimation errors and to achieve an improved pitch accuracy.

Claims (42)

  1. Speech fundamental frequency estimator (1100) being configured for receiving a first set of values (Ỹ1) and a second set of values (Ỹ2), the first set of values (Ỹ1) being a frequency domain representation of a first set of time domain signal values (y1) within a first time interval (t1) and the second set of values (Ỹ2) being a frequency domain representation of a second set of time domain signal values (y2) within a second time interval (t2), the second time interval (t2) being later than and offset from the first time interval (t1), the speech fundamental frequency estimator (1100) comprising:
    - a first power density spectrum calculator (1102) being configured for storing a version of the first set of values (Ỹ1) and being configured for providing values of a first power density spectrum (Ŝỹỹd(Ωµ,n)) by multiplying the stored version of the first set of values (Ỹ1) with a complex conjugate version of the second set of values (Ỹ2);
    - a second power density spectrum calculator (1104) being configured for providing values of a second power density spectrum (Ŝỹỹ(Ωµ,n)) by multiplying a version of the second set of values (Ỹ2) with a complex conjugate version of the second set of values (Ỹ2);
    - an analyzer (1106) being configured for determining the speech fundamental frequency estimate (fp(n)) on the basis of the values of the first power density spectrum (Ŝỹỹd(Ωµ,n)) and the values of the second power density spectrum (Ŝỹỹ(Ωµ,n)),
    wherein the analyzer is further configured
    for performing a first frequency-time-transform of the first power density spectrum (Ŝỹỹd(Ωµ,n)) in order to obtain a first set of correlation function values (r̂ỹỹd,g(m,n)),
    for performing a second frequency-time-transform of the second power density spectrum (Ŝỹỹ(Ωµ,n)) in order to obtain a second set of correlation function values (r̂ỹỹ,g(m,n)), and
    for determining the speech fundamental frequency estimate (fp(n)) on the basis of the first and second sets of correlation function values (r̂ỹỹd,g(m,n)), (r̂ỹỹ,g(m,n)).
  2. Speech fundamental frequency estimator (1100) according to claim 1, characterized in that the first power density spectrum calculator (1102) is configured for multiplying versions of the sets of values (Ỹ1, Ỹ2) which represent sets of time domain signal values (y1, y2) having overlapping time intervals (t1, t2).
  3. Speech fundamental frequency estimator (1100) according to claim 2, characterized in that the first power density spectrum calculator (1102) is configured for multiplying versions of the sets of values (Ỹ1, Ỹ2) which represent time domain signal values (y1, y2) having overlapping time intervals (t1, t2) of at least 25 percent.
  4. Speech fundamental frequency estimator (1100) according to one of claims 1 to 3, characterized in that the second power density spectrum calculator (1104) is configured for providing a conjugate complex version of the second set of values (Ỹ2) to the first power density spectrum calculator (1102) and wherein the first power density spectrum calculator (1102) is configured for using the provided conjugate complex version of the second set of values (Ỹ2) as the version with which the stored version of the first set of values (Ỹ1) is to be multiplied.
  5. Speech fundamental frequency estimator (1100) according to any of the preceding claims, characterized in that the analyzer (1106) is configured for performing a first frequency-time-transform of the first power density spectrum (Ŝỹỹd(Ωµ,n)) in order to obtain a first set of correlation function values (r̂ỹỹd,g(m,n)) and for performing a second frequency-time-transform of the second power density spectrum (Ŝỹỹ(Ωµ,n)) in order to obtain a second set of correlation function values (r̂ỹỹ,g(m,n)), wherein the analyzer (1106) is furthermore configured for determining a set of normalization values (Sỹỹ(Ωµ,n)) and a set of weighting values (V(e^{jΩµ},n)) from the second power density spectrum (Ŝỹỹ(Ωµ,n)) and for using the set of normalization values (Sỹỹ(Ωµ,n)) and the set of weighting values (V(e^{jΩµ},n)) in the first and second frequency-time-transform and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (fp(n)) on the basis of the first and second sets of correlation function values (r̂ỹỹd,g(m,n)), (r̂ỹỹ,g(m,n)).
  6. Speech fundamental frequency estimator (1100) according to claim 5, characterized in that the analyzer (1106) further comprises a compensator being configured for adaptively compensating the values of the first set of correlation function values (r̂ỹỹd,g(m,n)) by a correction factor (Δ(m,n)) being based on a value of the second set of correlation function values (r̂ỹỹ,g(m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (fp(n)) on the basis of the compensated first set of correlation function values (r̂ỹỹd,g,mod(m,n)) and the second set of correlation function values (r̂ỹỹ,g(m,n)).
  7. Speech fundamental frequency estimator (1100) according to claim 6, characterized in that the compensator is configured for multiplying the second set of correlation function values (r̂ỹỹ,g(m,n)) by a lower bounded quotient between a value of the first set of correlation function values (r̂ỹỹd,g(m,n)) and a value of the second set of correlation function values (r̂ỹỹ,g(m,n)) in order to obtain said compensated first set of correlation function values (r̂ỹỹd,g,mod(m,n)).
  8. Speech fundamental frequency estimator (1100) according to claim 7, characterized in that the analyzer (1106) is configured for combining the compensated first set of correlation function values (r̂ỹỹd,g,mod(m,n)) and the second set of correlation function values (r̂ỹỹ,g(m,n)) in order to obtain an extended set of correlation function values (r̂ỹỹ,erw(k,n)), wherein the values of the extended set of correlation function values (r̂ỹỹ,erw(k,n)) assume corresponding values from the compensated first set of correlation function values (r̂ỹỹd,g,mod(m,n)), the second set of correlation function values (r̂ỹỹ,g(m,n)) or values between the compensated first set of correlation function values (r̂ỹỹd,g,mod(m,n)) and the second set of correlation function values (r̂ỹỹ,g(m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (fp(n)) on the basis of said extended set of correlation function values (r̂ỹỹ,erw(k,n)).
  9. Speech fundamental frequency estimator (1100) according to one of claims 5 to 8, characterized in that the analyzer (1106) is configured for determining the speech fundamental frequency estimate (fp(n)) by searching the index of a maximum value (τp (n)) from the extended set of correlation function values (ỹỹ,erw (k,n)) within a predetermined number of indices (k) of the values of the extended set of correlation values (ỹỹ,erw (k,n)), from the first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)) within a predetermined number of indices (m) of values of the first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)), respectively, or from the compensated first set of correlation function values (ỹỹd,g,mod(m,n)) within the predetermined number of indices (m) of values of the compensated first set of correlation function values (ỹỹd,g,mod(m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (fp(n)) as the product of a sampling frequency (fs) and a reciprocal value of said searched index (τp (n)).
  10. Speech fundamental frequency estimator (1100) according to claim 9, characterized in that the analyzer (1106) is furthermore configured for determining a reliability factor (pfp (n)) for the determined speech fundamental frequency estimate and for blocking an output of the determined speech fundamental frequency estimate (fp(n)) in the case the determined reliability factor (pfp (n)) for the determined speech fundamental frequency estimate is below a predetermined reliability factor (po).
  11. Speech fundamental frequency estimator (1100) according to claim 10, characterized in that the analyzer (1106) is furthermore configured for determining said reliability factor (pfp (n)) by dividing the maximum value (τ̃p (n)) at said searched index by the first value of the extended set of correlation function values (ỹỹ,erw (k,n)) or, respectively, of the first, the compensated first or the second set of correlation function values (ỹỹd,g (m,n), ỹỹd,g,mod(m,n), ỹỹ,g (m,n)).
  12. Speech fundamental frequency estimator (1100) according to one of claims 5 to 11, characterized in that the second power density spectrum calculator (1104) is configured for determining an estimate of the power density spectrum of background noise (nn µ,n)) and for determining a noise suppression factor (V(e jΩµ ,n)) on the basis of said power density spectrum of background noise (nn µ,n)), and wherein the analyzer (1106) is configured for multiplying the first and second power density spectrum with said noise suppression factor (V(e jΩµ ,n)) prior to the frequency-time-transform of the first respectively second power density spectrum (ỹỹd µ,n), ỹỹ µ,n)).
  13. Speech fundamental frequency estimator (1100) according to claim 12, characterized in that the second power density spectrum calculator (1104) is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient (V0) and a term being dependent on a quotient of the estimate of the power density spectrum of background noise (nn µ,n)) and the second power density spectrum (ỹỹ µ,n)).
  14. Speech fundamental frequency estimator (1100) according to one of claims 12 or 13, characterized in that the second power density spectrum calculator (1104) is configured for determining the estimate of the power density spectrum of background noise (nn µ,n)) in speech pauses or for determining the estimate of the power density spectrum of background noise (nn µ,n)) from a segment-wise estimation of the minima of the power of a microphone signal.
  15. Speech fundamental frequency estimator (1100) according to claim 13 or claims 13 and 14, characterized in that the noise suppression factor is defined by
    $$V(e^{j\Omega_\mu}, n) = \max\left\{ V_0,\; 1 - \beta\, \frac{\hat{S}_{nn}(\Omega_\mu, n)}{\hat{S}_{yy}(\Omega_\mu, n)} \right\}$$
    wherein Ŝnn(Ωµ,n) denotes the estimate of the power density spectrum of the background noise, Ŝyy(Ωµ,n) denotes the second power density spectrum, V0 denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise (Ŝnn(Ωµ,n)).
  16. Speech fundamental frequency estimator (1100) according to one of claims 5 to 15, characterized in that the analyzer (1106) is furthermore configured for reestimating the speech fundamental frequency estimate in the case the determined speech fundamental frequency estimate is below a predefined frequency value (fk) wherein the analyzer (1106) is configured for performing the reestimation by searching a further index (k, m) of a further maximum value (τ̃p (n)) of the extended set of correlation function values (ỹỹ,erw (k,n)), the first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)) or the compensated first set of correlation function values (ỹỹd,g,mod(m,n)) within a further number of values of said sets of correlation function values and for outputting a product of a sampling frequency (fs) and a reciprocal value of said further index (τ̃p (n)) as the determined speech fundamental frequency estimate.
  17. Speech fundamental frequency estimator (1100) according to claim 16, characterized in that the analyzer (1106) is configured for searching said index (k, m) of said further maximum value (τ̃p (n)) using a number of values k of said sets of correlation function values which is defined by
    $$\frac{f_s}{f_{p,\max}} \le k < \frac{f_s}{2\, f_p(n)} + k_0$$
    wherein k denotes the number of values of said sets of correlation function values, fp(n) denotes the previously determined speech fundamental frequency estimate, fp,max denotes a predefined value of a maximal possible speech fundamental frequency, fs denotes a sampling frequency and k0 denotes a constant.
  18. Speech fundamental frequency estimator (1100) according to claim 16 or 17, characterized in that the analyzer (1106) is configured for outputting said product as the predetermined speech fundamental frequency estimate only in the case the further index (τ̃p (n)) is larger than 60 percent of the previously searched maximal index (τp (n)) as well as a value (ỹỹ,erw (τ̃p (n),n)) of the extended set of correlation function values (ỹỹ,erw (k,n)) at said further index (τ̃p (n)) is larger than a previously defined amplitude value ( 0).
  19. Speech fundamental frequency estimator (1100) according to one of claims 5 to 18, characterized in that the analyzer (1106) is configured for modifying a speech fundamental period (τ̃p (n)) corresponding to said determined speech fundamental frequency estimate by an interpolation correction term (Δp(n)) prior to outputting a modified speech fundamental frequency estimate (fp(n)), wherein said interpolation correction term (Δp(n)) is dependent on values of said first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)), of said extended set of correlation function values (ỹỹ,erw (k,n)) or said compensated first set of correlation function values (ỹỹd,g,mod(m,n)), respectively.
  20. Speech fundamental frequency estimator (1100) according to one of claims 1 to 19, characterized by a frequency domain filtering unit being configured for receiving the frequency domain versions (Y1, Y2) of the first and second set of time domain signal values (y1, y2), for frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values ( 1, 2), respectively, and for providing said first and second sets of values ( 1, 2) to the first and second power density spectrum calculator respectively.
  21. Speech fundamental frequency estimator (1100) according to claim 20, characterized in that the frequency domain filtering unit is configured for filtering only frequencies below a predefined limiting frequency.
  22. Speech fundamental frequency estimator (1100) according to claim 21, characterized in that the frequency domain filtering unit is configured for delaying values of said frequency domain versions being above said predefined limiting frequency.
  23. Method (1140) for estimating a speech fundamental frequency (fp(n)), the method using a first set of values ( 1) and a second set of values ( 2), the first set of values ( 1) being a received frequency domain representation of a first set of time domain signal values (y1) within a first time interval (t1) and the second set of values ( 2) being a received frequency domain representation of a second set of time domain signal values (y2) within a second time interval (t2), the second time interval (t2) being later than and offset from the first time interval (t1), the method for estimating the speech fundamental frequency (fp(n)) comprising the steps of:
    - storing (1150) a version of the first set of values ( 1) and providing values of a first power density spectrum (ỹỹd µ ,n)) by multiplying (1152) the stored version of the first set of values ( 1) with a complex conjugate version of the second set of values ( 2);
    - providing values of a second power density spectrum (Ŝ µ ,n)) by multiplying (1153) a version of the second set of values ( 2) with a complex conjugate version of the second set of values ( 2);
    - determining (1156) the speech fundamental frequency estimate (fp) on the basis of the values of the first power density spectrum (ỹỹd µ ,n)) and the values of the second power density spectrum (ỹỹ µ ,n)),
    wherein the step of determining the speech fundamental frequency estimate (fp(n)) comprises
    performing a first frequency-time-transform of the first power density spectrum (ỹỹd µ ,n)) in order to obtain a first set of correlation function values (ỹỹd,g (m,n)),
    performing a second frequency-time-transform of the second power density spectrum (ỹỹ µ ,n)) in order to obtain a second set of correlation function values (ỹỹ,g (m,n)), and
    determining the speech fundamental frequency estimate (fp(n)) on the basis of the first and second sets of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)).
  24. Method (1140) according to claim 23, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises:
    • performing a first frequency-time-transform of the first power density spectrum (ỹỹd µ ,n)) in order to obtain a first set of correlation function values (ỹỹd,g (m,n));
    • performing a second frequency-time-transform of the second power density spectrum (ỹỹ µ ,n)) in order to obtain a second set of correlation function values (ỹỹ,g (m,n)), wherein the step of determining (1156) further comprises determining a set of normalization values (ỹỹ µ ,n)) and a set of weighting values (V(e jΩ µ ,n)) from the second power density spectrum (ỹỹ µ ,n)) and using the set of normalization values (ỹỹ µ ,n)) and the set of weighting values (V(e jΩµ ,n)) in the first and second frequency-time-transform and wherein the determination of the speech fundamental frequency estimate (fp(n)) is performed on the basis of the first and second sets of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)).
  25. Method (1140) according to claim 24, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises adaptively compensating the values of the first set of correlation function values (ỹỹd,g (m,n)) by a correction factor (Δ(m,n)) being based on a value of the second set of correlation function values (ỹỹ,g (m,n)) in order to obtain a compensated first set of values and determining the speech fundamental frequency estimate (fp(n)) on the basis of the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) and the second set of correlation function values (ỹỹ,g (m,n)).
  26. Method (1140) according to claim 25, characterized in that the step of compensating comprises multiplying the second set of correlation function values (ỹỹ,g (m,n)) by a lower bounded quotient between a value of the first set of correlation function values (ỹỹd,g (m,n)) and a value of the second set of correlation function values (ỹỹ,g (m,n)) in order to obtain said compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)).
  27. Method (1140) according to claim 26, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises combining the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) and the second set of correlation function values (ỹỹ,g (m,n)) in order to obtain an extended set of correlation function values (ỹỹ,erw (k,n)), wherein the values of the extended set of correlation function values (ỹỹ,erw (k,n)) assume corresponding values from the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)), the second set of correlation function values (ỹỹ,g (m,n)) or values between the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) and the second set of correlation function values (ỹỹ,g (m,n)) and wherein the step of determining (1156) the speech fundamental frequency estimate (fp(n)) further comprises determining the speech fundamental frequency estimate (fp(n)) on the basis of said extended set of correlation function values (ỹỹ,erw (k,n)).
  28. Method (1140) according to one of claims 23 to 27, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises determining the speech fundamental frequency estimate (fp(n)) by searching the index of a maximum value (τp (n)) from the extended set of correlation function values (ỹỹ,erw (k,n)) within a predetermined number of indices (k) of the values of the extended set of correlation values (ỹỹ,erw (k,n)), from the first or second set of correlation function values (ỹỹd,g (m,n), (ỹỹ,g (m,n)) within a predetermined number of indexes (m) of values of the first respectively second set of correlation function values (ỹỹd,g (m,n), (ỹỹ,g (m,n)) or from the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) within the predetermined number of indices (m) of values of the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) and wherein the step of determining (1156) the speech fundamental frequency estimate (fp(n)) furthermore comprises determining the speech fundamental frequency estimate (fp(n)) as the product of a sampling frequency (fs) and a reciprocal value of said searched index (τp (n)).
  29. Method (1140) according to claim 28, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises determining a reliability factor (pfp (n)) for the determined speech fundamental frequency estimate (fp(n)) and blocking an output of the determined speech fundamental frequency estimate (fp(n)) in the case the determined reliability factor (pfp (n)) for the determined speech fundamental frequency estimate (fp(n)) is below a predetermined reliability factor (po).
  30. Method (1140) according to claim 29, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises determining said reliability factor (pfp (n)) by dividing the maximum value (τ̃p (n)) at said searched index by the first value of the extended set of correlation function values (ỹỹ,erw (k,n)) or, respectively, of the first, the compensated first or the second set of correlation function values (ỹỹd,g (m,n), ỹỹd ,g,mod(m,n), ỹỹ,g (m,n)).
  31. Method (1140) according to one of claims 23 to 30 and claim 24, characterized in that the step of providing values of a second power density spectrum (ỹỹ µ ,n)) comprises determining an estimate of the power density spectrum of background noise (nn Ω µ ,n)) and determining a noise suppression factor (V(e jΩµ ,n)) on the basis of said power density spectrum of background noise (nn µ ,n)), and the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises multiplying the first and second power density spectrum with said noise suppression factor (V(e jΩµ ,n)) prior to the frequency-time-transform of the first respectively second power density spectrum (ỹỹd µ ,n), ỹỹ µ ,n)).
  32. Method (1140) according to claim 31, characterized in that the step of providing values of a second power density spectrum (ỹỹ µ ,n)) comprises determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient (Vo) and a term being dependent on a quotient of the estimate of the power density spectrum of background noise (nn µ ,n)) and the second power density spectrum (ỹỹ µ ,n)).
  33. Method (1140) according to claim 32, characterized in that the step of providing values of a second power density spectrum (ỹỹ µ ,n)) comprises determining the estimate of the power density spectrum of background noise (nn µ ,n)) in speech pauses or determining the estimate of the power density spectrum of background noise (nn µ ,n)) from a segment-wise estimation of the minima of the power of a microphone signal.
  34. Method (1140) according to one of claims 31 to 33, characterized in that the noise suppression factor is defined by
    $$V(e^{j\Omega_\mu}, n) = \max\left\{ V_0,\; 1 - \beta\, \frac{\hat{S}_{nn}(\Omega_\mu, n)}{\hat{S}_{yy}(\Omega_\mu, n)} \right\}$$
    wherein Ŝnn(Ωµ,n) denotes the estimate of the power density spectrum of the background noise, Ŝyy(Ωµ,n) denotes the second power density spectrum, V0 denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise (Ŝnn(Ωµ,n)).
  35. Method (1140) according to one of claims 24 to 34, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises reestimating the speech fundamental frequency estimate (fp(n)) in the case the determined speech fundamental frequency estimate is below a predefined frequency value (fk) wherein the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises performing the reestimation by searching a further index (k, m) of a further maximum value (τ̃p (n)) of the extended set of correlation function values (ỹỹ,erw (k,n)), the first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)) or the compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)) within a further number of values of said sets of correlation function values and outputting a product of a sampling frequency (fs) and a reciprocal value of said further index (τ̃p (n)) as the determined speech fundamental frequency estimate.
  36. Method (1140) according to claim 35, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises searching said index (k, m) of said further maximum value (τ̃p (n)) using a number of values k of said sets of correlation function values which is defined by
    $$\frac{f_s}{f_{p,\max}} \le k < \frac{f_s}{2\, f_p(n)} + k_0$$
    wherein k denotes the number of values of said sets of correlation function values, fp(n) denotes the previously determined speech fundamental frequency estimate, fp,max denotes a predefined value of a maximal possible speech fundamental frequency, fs denotes a sampling frequency and k0 denotes a constant.
  37. Method (1140) according to one of claims 35 or 36, characterized in that the step of determining (1156) the speech fundamental frequency estimate (fp(n)) comprises outputting said product as the predetermined speech fundamental frequency estimate (fp(n)) only in the case that the further index (τ̃p (n)) is larger than 60 percent of the previously searched maximal index (τp (n)) as well as the value (ỹỹ,erw (τ̃p (n),n)) of the extended set of correlation function values (ỹỹ,erw (k,n)) at said further index (τ̃p (n)) is larger than a previously defined amplitude value ( 0).
  38. Method (1140) according to one of claims 24 to 37, characterized in that the step of determining the speech fundamental frequency estimate (fp(n)) comprises modifying a speech fundamental period (τ̃p (n)) corresponding to said determined speech fundamental frequency estimate (fp(n)) by an interpolation correction term (Δp(n)) prior to outputting said speech fundamental frequency estimate (fp(n)), wherein said interpolation correction term (Δp(n)) is dependent on values of said first or second set of correlation function values (ỹỹd,g (m,n), ỹỹ,g (m,n)), of said extended set of correlation function values (ỹỹ,erw (k,n)) or said compensated first set of correlation function values ( ỹỹd ,g,mod(m,n)), respectively.
  39. Method (1140) according to one of the preceding claims, characterized in that the method further comprises a step of receiving the frequency domain versions (Y1, Y2) of the first and second set of time domain signal values (y1, y2), frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values ( 1, 2), respectively, and providing said first and second sets of values ( 1, 2) to the first and second power density spectrum calculator respectively.
  40. Method (1140) according to claim 39, characterized in that the step of frequency domain filtering is only performed for frequencies below a predefined limiting frequency.
  41. Method (1140) according to claim 40, characterized in that the step of frequency domain filtering comprises delaying values of said frequency domain versions being above said predefined limiting frequency.
  42. Computer program product having a program code for performing the method according to one of claims 23 to 41, when the computer program runs on a computer.
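As an illustration of the steps recited in claim 23, the following Python sketch computes the two power density spectra from two offset frames, transforms them into correlation function values, and derives a frequency estimate from the lag of the maximum. It is a minimal sketch under stated assumptions, not the patented implementation: the naive `dft` helper stands in for an FFT (in practice `numpy.fft` would be the usual choice), the frames are zero-padded so that the inverse transform yields a linear rather than circular correlation, the search limits `f_min` and `f_max` are assumed values, and the peak is taken from the second set of correlation values only, which claim 28 lists as one option. The normalization, weighting, compensation and extended set of claims 24 to 27 are omitted.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; a stand-in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(x))
        out.append(acc / n if inverse else acc)
    return out


def estimate_pitch(y1, y2, fs, f_min=50.0, f_max=500.0):
    """y1: earlier frame, y2: later, offset frame of equal length.

    Returns (frequency estimate, first correlation set, second correlation set).
    """
    n = len(y2)
    # Frequency-domain representations of both frames (zero-padding is an assumption).
    Y1 = dft(list(y1) + [0.0] * n)  # "stored" first set of values
    Y2 = dft(list(y2) + [0.0] * n)  # second set of values

    # First power density spectrum: first set times complex conjugate of second set.
    s_cross = [a * b.conjugate() for a, b in zip(Y1, Y2)]
    # Second power density spectrum: second set times its own complex conjugate.
    s_auto = [b * b.conjugate() for b in Y2]

    # Frequency-time transforms yield the two sets of correlation function values.
    r_cross = [v.real for v in dft(s_cross, inverse=True)]
    r_auto = [v.real for v in dft(s_auto, inverse=True)]

    # Search the lag of the maximum within an assumed range of pitch periods.
    lag_lo = max(1, int(fs / f_max))
    lag_hi = min(int(fs / f_min), n - 1)
    lag = max(range(lag_lo, lag_hi + 1), key=lambda m: r_auto[m])

    # Estimate = sampling frequency times the reciprocal of the searched index.
    return fs / lag, r_cross, r_auto
```

Calling `estimate_pitch(frame_a, frame_b, fs)` with two overlapping frames of a voiced signal returns the estimate together with both correlation sets, so that a caller could apply further steps such as a reliability check to them.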
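The noise suppression factor of claims 13, 15, 32 and 34 can be sketched per frequency bin as follows. The sketch assumes the formula reads V = max{V0, 1 − β·Ŝnn/Ŝyy}, which matches the claim wording (a maximum of V0 and a term dependent on the quotient of the noise spectrum and the second power density spectrum). The default values of `v0` and `beta`, the guard `eps` against division by zero, and the function names are illustrative assumptions and do not come from the patent.

```python
def noise_suppression_factor(s_nn, s_yy, v0=0.1, beta=1.5, eps=1e-12):
    """Per-bin factor V = max(V0, 1 - beta * S_nn / S_yy).

    s_nn: estimated noise power density spectrum, one value per bin
    s_yy: second power density spectrum, one value per bin
    v0:   maximum attenuation (lower bound of the factor), assumed value
    beta: overestimation factor for the noise spectrum, assumed value
    """
    return [max(v0, 1.0 - beta * nn / max(yy, eps))
            for nn, yy in zip(s_nn, s_yy)]


def apply_weighting(spectrum, v):
    """Multiply a power density spectrum bin-wise with the factor (claims 12 and 31)."""
    return [s * w for s, w in zip(spectrum, v)]
```

Bins dominated by noise are attenuated down to at most `v0`, while bins with a high signal-to-noise ratio pass almost unchanged into the subsequent frequency-time transform.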
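The reliability check of claims 10, 11, 29 and 30 divides the correlation value at the searched index by the first value of the same set and blocks the output when the result falls below a threshold. A minimal sketch, assuming the "first value" is the value at lag 0 and using a hypothetical default threshold `p0=0.5` (the claims do not give a number):

```python
def reliability_factor(r, lag):
    """Correlation value at the searched lag divided by the first (lag 0) value."""
    return r[lag] / r[0] if r[0] > 0.0 else 0.0


def gated_estimate(r, lag, fs, p0=0.5):
    """Return fs / lag, or None when the reliability factor is below p0."""
    if reliability_factor(r, lag) < p0:
        return None  # output blocked: the peak is too weak relative to lag 0
    return fs / lag
```

Returning `None` is only one way to model "blocking an output"; a real system might instead hold the previous estimate or flag the frame as unvoiced.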
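The restricted search range used for the reestimation in claims 17 and 36 can be turned into concrete integer lags as sketched below. The sketch assumes the range reads fs/fp,max ≤ k < fs/(2·fp(n)) + k0, that is, from the shortest admissible period up to roughly half the previously found period plus a margin; the function name and the example numbers are hypothetical.

```python
import math


def reestimation_lags(fs, fp_prev, fp_max, k0):
    """Integer lags k with fs / fp_max <= k < fs / (2 * fp_prev) + k0."""
    lo = math.ceil(fs / fp_max)                  # smallest integer >= lower bound
    hi = math.ceil(fs / (2.0 * fp_prev) + k0)    # exclusive upper limit
    return range(lo, hi)
```

For example, with fs = 8000 Hz, a previous estimate of 100 Hz, fp,max = 500 Hz and k0 = 5, the candidate lags run from 16 to 44 samples, which covers the period of a possible fundamental near 200 Hz.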
EP07000568.1A 2007-01-12 2007-01-12 Speech fundamental frequency estimator and method for estimating a speech fundamental frequency Not-in-force EP1944754B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07000568.1A EP1944754B1 (en) 2007-01-12 2007-01-12 Speech fundamental frequency estimator and method for estimating a speech fundamental frequency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP07000568.1A EP1944754B1 (en) 2007-01-12 2007-01-12 Speech fundamental frequency estimator and method for estimating a speech fundamental frequency

Publications (2)

Publication Number Publication Date
EP1944754A1 EP1944754A1 (en) 2008-07-16
EP1944754B1 EP1944754B1 (en) 2016-08-31

Family

ID=37898474

Family Applications (1)

Application Number Title Priority Date Filing Date
EP07000568.1A Not-in-force EP1944754B1 (en) 2007-01-12 2007-01-12 Speech fundamental frequency estimator and method for estimating a speech fundamental frequency

Country Status (1)

Country Link
EP (1) EP1944754B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021138201A1 (en) * 2019-12-30 2021-07-08 Texas Instruments Incorporated Background noise estimation and voice activity detection system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2249333B1 (en) 2009-05-06 2014-08-27 Nuance Communications, Inc. Method and apparatus for estimating a fundamental frequency of a speech signal
BR112013011312A2 (en) * 2010-11-10 2019-09-24 Koninl Philips Electronics Nv method for estimating a pattern in a signal (s) having a periodic, semiperiodic or virtually periodic component, device for estimating a pattern in a signal (s) having a periodic, semiperiodic or virtually periodic component and computer program
CN111400883B (en) * 2020-03-10 2023-05-09 南昌航空大学 Magnetic acoustic emission signal characteristic extraction method based on frequency spectrum compression
CN114257917B (en) * 2022-01-26 2024-09-24 恒玄科技(上海)股份有限公司 Noise processing method and system for earphone and earphone
CN114974231A (en) * 2022-01-01 2022-08-30 昆明理工大学 Pitch period extraction method in noise environment
CN117688371B (en) * 2024-02-04 2024-04-19 安徽至博光电科技股份有限公司 Secondary joint generalized cross-correlation time delay estimation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6587816B1 (en) * 2000-07-14 2003-07-01 International Business Machines Corporation Fast frequency-domain pitch estimation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021138201A1 (en) * 2019-12-30 2021-07-08 Texas Instruments Incorporated Background noise estimation and voice activity detection system
US11270720B2 (en) 2019-12-30 2022-03-08 Texas Instruments Incorporated Background noise estimation and voice activity detection system

Also Published As

Publication number Publication date
EP1944754A1 (en) 2008-07-16

Similar Documents

Publication Publication Date Title
US11031029B2 (en) Pitch detection algorithm based on multiband PWVT of teager energy operator
EP2031583B1 (en) Fast estimation of spectral noise power density for speech signal enhancement
EP1918910B1 (en) Model-based enhancement of speech signals
US7286980B2 (en) Speech processing apparatus and method for enhancing speech information and suppressing noise in spectral divisions of a speech signal
KR100330230B1 (en) Noise suppression for low bitrate speech coder
EP1944754B1 (en) Speech fundamental frequency estimator and method for estimating a speech fundamental frequency
KR100770839B1 (en) Method and apparatus for estimating harmonic information, spectral envelope information, and voiced speech ratio of speech signals
JP5791092B2 (en) Noise suppression method, apparatus, and program
EP1806739A1 (en) Noise suppressor
EP2249333B1 (en) Method and apparatus for estimating a fundamental frequency of a speech signal
KR19980701735A (en) Spectral subtraction noise suppression method
Yuo et al. Robust features for noisy speech recognition based on temporal trajectory filtering of short-time autocorrelation sequences
CN102612711A (en) Signal processing method, information processor, and signal processing program
US7957964B2 (en) Apparatus and methods for noise suppression in sound signals
EP4128225B1 (en) Noise supression for speech enhancement
US10083705B2 (en) Discrimination and attenuation of pre echoes in a digital audio signal
US7003452B1 (en) Method and device for detecting voice activity
EP1635331A1 (en) Method for estimating a signal to noise ratio
Funaki Speech enhancement based on iterative wiener filter using complex speech analysis
JP4125322B2 (en) Basic frequency extraction device, method thereof, program thereof, and recording medium recording the program
Patil et al. Use of baseband phase structure to improve the performance of current speech enhancement algorithms
Lim et al. Acoustic blur kernel with sliding window for blind estimation of reverberation time
JP2880683B2 (en) Noise suppression device
Iwai et al. Formant frequency estimation with windowless autocorrelation in the presence of noise
JPS5876891A (en) Audio pitch extraction method

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20080527

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK RS

AKX Designation fees paid

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: NUANCE COMMUNICATIONS, INC.

17Q First examination report despatched

Effective date: 20120118

REG Reference to a national code

Ref country code: DE

Ref legal event code: R079

Ref document number: 602007047685

Country of ref document: DE

Free format text: PREVIOUS MAIN CLASS: G10L0011040000

Ipc: G10L0025900000

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 25/90 20130101AFI20160229BHEP

Ipc: G10L 21/0216 20130101ALN20160229BHEP

INTG Intention to grant announced

Effective date: 20160315

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602007047685

Country of ref document: DE

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 825594

Country of ref document: AT

Kind code of ref document: T

Effective date: 20161015

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

REG Reference to a national code

Ref country code: NL

Ref legal event code: MP

Effective date: 20160831

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 825594

Country of ref document: AT

Kind code of ref document: T

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161201

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161130

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170102

Ref country code: BE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602007047685

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 602007047685

Country of ref document: DE

26N No opposition filed

Effective date: 20170601

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

GBPC GB: European patent ceased through non-payment of renewal fee

Effective date: 20170112

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20170929

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170131

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170131

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170131

REG Reference to a national code

Ref country code: IE

Ref legal event code: MM4A

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170801

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170112

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170112

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170112

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: HU

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO

Effective date: 20070112

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: CY

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20160831

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161231