EP1944754B1

EP1944754B1 - Speech fundamental frequency estimator and method for estimating a speech fundamental frequency

Info

Publication number: EP1944754B1
Application number: EP07000568.1A
Authority: EP
Inventors: Mohamed Krini; Gerhard Schmidt
Original assignee: Nuance Communications Inc
Current assignee: Nuance Communications Inc
Priority date: 2007-01-12
Filing date: 2007-01-12
Publication date: 2016-08-31
Anticipated expiration: 2027-01-12
Also published as: EP1944754A1

Description

This invention relates to speech analysis systems and especially to a speech fundamental frequency estimator and a method for estimating a speech fundamental frequency.

Background of the invention

An estimation of the speech fundamental frequency is necessary in various applications:

For example, the estimated speech fundamental frequency of distorted speech signals can be used for a partial speech recognition (model-based noise suppression) in order to improve speech quality.
As a further example, a speech recognizer in a speech recognition system can perform a rough preselection of model parameters using the temporal average of this frequency. Thus, the recognition rate of the speech recognizer can be increased significantly.

For further fields of application reference is made to the following literature:

K. Fellbaum: Sprachverarbeitung und Sprachübertragung, Springer, Berlin, Germany, 1984
D.K. Freeman, G. Cosier, C.B. Southcott and I. Boyd: The Voice Activity Detector for the PAN-European Digital Cellular Mobile Telephone Service, Proceed. of the Intern. Conf. on Acoust., Speech, and Signal Process., Vol. 1, pages 369-372, 1989
W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Germany, 1983
P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006
P. Vary, U. Heute, W. Hess: Digitale Sprachsignalverarbeitung, Teubner, Stuttgart, Germany, 1998

Numerous methods for estimating the speech fundamental frequency exist. A group of methods based on a DFT (discrete Fourier transform) of the input signal is of special importance. Such methods can be integrated at low cost into hands-free speech assistance systems with multi-rate signal processing, because the DFT is already calculated for other algorithms such as noise reduction or echo compensation.
In order to describe the relevant state of the art in more detail, a typical multi-rate system is described which can be used, for example, to perform a speech signal improvement (noise reduction, speech reconstruction). Subsequently, several further fields of application are presented in which a reliable estimation of the speech fundamental frequency is of importance.
In voiced speech portions the corresponding spectrum shows distinct amplitude peaks which are located equidistantly in frequency (see for example Fig. 1). The distance between two adjacent amplitude peaks represents the speech fundamental frequency, which depends on the speaker. For men this frequency varies between 80 Hz and 150 Hz; women and children, in contrast, have a higher speech fundamental frequency, which varies between 150 Hz and 300 Hz for women and between 200 Hz and 600 Hz for children. A good and reliable estimation of the speech fundamental frequency is often not easy to obtain. Difficulties arise mainly in detecting low speech fundamental frequencies, which occur in most cases with male speakers.
In Figure 2 a block diagram of a multi-rate system for speech reconstruction with an analysis and a synthesis filter bank for the signal processing is shown. The speech fundamental frequency estimation is shown as a separate functional block. The aim of such an application is to extract parameters from a distorted speech signal y(n), for example the spectral envelope, the type of stimulation (voiced/unvoiced) and the speech fundamental frequency f_p(n). Subsequently an undistorted speech signal x(n) is resynthesized from these parameters. For this purpose a very precise and reliable estimation of the speech fundamental frequency is necessary. The output signal x(n) after the synthesis filter bank should be nearly without error; the following condition is therefore very desirable: $x(n) \approx s(n),$
where s(n) denotes the undisturbed speech signal.
A reliable estimation of the speech fundamental frequency is also of great importance in speech recognition systems. Figure 3 shows a block diagram of a signal analysis system with subsequent feature extraction and speech fundamental frequency estimation, in order to perform a speech recognition. An adequate estimation of the speech fundamental frequency can, for example, contribute to a significant improvement of the recognition rates of the speech recognizer.
In principle, there is a broad variety of application fields in which a reliable estimation of the speech fundamental frequency is necessary. However, a detailed description of such applications would go beyond the scope of this description. Thus, reference is made to the following literature:

E. Hänsler, G. Schmidt: Acoustic Echo and Noise Control - A Practical Approach, John Wiley & Sons, Hoboken, New Jersey, USA, 2004
E. Hänsler, G. Schmidt: Topics in Acoustic Echo and Noise Control - Selected Methods for the Cancellation of Acoustic Echoes, the Reduction of Background Noise, and Speech Processing, Springer, Berlin, Germany, 2006
P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006
P. Vary, U. Heute, W. Hess: Digitale Sprachsignalverarbeitung, Teubner, Stuttgart, Germany, 1998

The literature describes a broad variety of algorithms for determining the speech fundamental frequency, for example:

Analysis in the cepstral domain - If speech generation is modelled as a source-filter model (see also J. Deller, J. Hansen, J. Proakis: Discrete-Time Processing of Speech Signals, IEEE-Press, New York, USA, 1993), voiced sounds can be described as a convolution of a periodic stimulation signal with the impulse response of the vocal tract. In the spectral domain the convolution becomes the product of the Fourier transforms of both portions. If the logarithm is taken, the product becomes a sum of the separate components. After a further transform (inverse Fourier transform) the cepstral domain is reached. In this domain it is possible to distinguish the spectrally comparatively slowly varying frequency response of the vocal tract from the fundamental frequency of the stimulation signal. Further details can be found for example in W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Germany, 1983.
Harmonic Product-Spectrum - Another method to estimate the speech fundamental frequency is the so-called Harmonic Product-Spectrum. Here the product over several equidistant sampling points of the absolute value of the spectrum is calculated. The product becomes maximal if the increment (in frequency) corresponds exactly to the speech fundamental frequency (or a multiple thereof). Further details can be found for example in M. R. Schroeder: Period Histogram and Product Spectrum: New Methods for Fundamental Frequency Measurements, J. Acoust. Soc. Am., Vol. 43, No. 4, pages 829-834, 1968.
Analysis of the short-time autocorrelation - In voiced speech passages the offset (lag) of the first side lobe of the short-time autocorrelation corresponds to the speech fundamental period.
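The last approach in the above listing can be illustrated by a minimal sketch. It assumes a synthetic voiced frame, a sampling rate of 8000 Hz and a search range of 80 Hz to 600 Hz; the function names and all numeric values are illustrative assumptions and are not taken from the description.

```python
import math

def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[k] of one frame for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def pitch_from_autocorrelation(frame, fs, f_min=80.0, f_max=600.0):
    """Return fs / k_peak, where k_peak is the lag of the largest side lobe."""
    k_lo = int(fs / f_max)   # shortest period searched (assumed upper pitch limit)
    k_hi = int(fs / f_min)   # longest period searched (assumed lower pitch limit)
    r = autocorrelation(frame, k_hi)
    k_peak = max(range(k_lo, k_hi + 1), key=lambda k: r[k])
    return fs / k_peak

# Synthetic "voiced" frame: a 125 Hz fundamental with two harmonics.
fs = 8000
frame = [sum(math.sin(2 * math.pi * 125 * h * i / fs) / h for h in (1, 2, 3))
         for i in range(512)]
estimate = pitch_from_autocorrelation(frame, fs)
print(estimate)
```

The frame must span more than one period at the lowest frequency searched, which is the limitation addressed in the remainder of the description.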

Methods other than those mentioned above also exist. A description of each single algorithm would, however, go far beyond the scope of the present description. Therefore reference is made to further literature, for example K. Fellbaum: Sprachverarbeitung und Sprachübertragung, Springer, Berlin, Germany, 1984, or D.K. Freeman, G. Cosier, C.B. Southcott and I. Boyd: The Voice Activity Detector for the PAN-European Digital Cellular Mobile Telephone Service, Proceed. of the Intern. Conf. on Acoust., Speech, and Signal Process., Vol. 1, pages 369-372, 1989, or W. Hess: Pitch Determination of Speech Signals, Springer, Berlin, Germany, 1983. The approach mentioned last in the above listing has become very popular because short-time DFT portions of an input signal which have already been determined for other applications can be reused, so that the numerical effort is reduced.
However, the above-mentioned approach of the state of the art also has significant disadvantages. In particular, the orders of the DFT (i.e. the DFT block length) used for other purposes are often too small to provide a reliable estimation of the speech fundamental frequency for low voices.
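A rough illustration of this limitation, under stated assumptions: the inverse transform of an N-point power density spectrum yields a cyclic correlation that is symmetric about lag N/2, so at most N/2 distinct lags are available. The sampling rate and block length below are assumed values chosen only for the arithmetic; they do not appear in the description.

```python
def max_usable_lag(block_length):
    """Largest distinct lag of a cyclic correlation of one DFT block (N/2)."""
    return block_length // 2

def lowest_detectable_pitch(fs, block_length):
    """Lowest fundamental frequency whose period still fits into the lag range."""
    return fs / max_usable_lag(block_length)

fs = 11025           # assumed sampling rate in Hz
block_length = 256   # assumed DFT order shared with other algorithms
f_low = lowest_detectable_pitch(fs, block_length)
print(f_low)         # about 86 Hz, i.e. above the 80 Hz lower limit for male voices
```

In practice the usable range is smaller still, since the peak at the largest lags is supported by only a few signal periods.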
Accordingly, a need exists to provide a speech fundamental frequency estimator and a method for estimating a speech fundamental frequency which allow a more precise estimation of the speech fundamental frequency.
This need is met by the features of the independent claims. Further embodiments of the invention are described in the dependent claims.
According to a first aspect of the invention the speech fundamental frequency estimator is configured for receiving a first set of values and a second set of values, the first set of values being a frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the speech fundamental frequency estimator comprising:

a first power density spectrum calculator being configured for storing a version of the first set of values and being configured for providing values of a first power density spectrum by multiplying the stored version of the first set of values with a complex conjugate version of the second set of values;
a second power density spectrum calculator being configured for providing values of a second power density spectrum by multiplying a version of the second set of values with a complex conjugate version of the second set of values;
an analyzer being configured for determining the speech fundamental frequency estimate on the basis of the values of the first power density spectrum and the values of the second power density spectrum.

The analyzer is further configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
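The two power density spectra and their frequency-time-transforms can be sketched as follows. This is a minimal illustration, assuming a plain DFT as the time-frequency-transform and omitting any normalization, weighting or compensation; the function names and the test signal are hypothetical.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) DFT, standing in for the already available analysis transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * m * t / n) for t in range(n))
            for m in range(n)]

def inverse_dft_real(spectrum):
    """Real part of the inverse DFT (frequency-time-transform)."""
    n = len(spectrum)
    return [(sum(spectrum[m] * cmath.exp(2j * cmath.pi * m * k / n)
                 for m in range(n)) / n).real for k in range(n)]

def correlation_sets(first_values, second_values):
    """Return (first set, second set) of correlation function values."""
    conj_second = [v.conjugate() for v in second_values]   # computed once, reused
    first_psd = [a * c for a, c in zip(first_values, conj_second)]    # stored frame x conj(current frame)
    second_psd = [b * c for b, c in zip(second_values, conj_second)]  # current frame x conj(current frame)
    return inverse_dft_real(first_psd), inverse_dft_real(second_psd)

# Two frames of a periodic test signal (period 16), offset by 32 samples.
signal = [math.sin(2 * math.pi * i / 16) for i in range(96)]
first = dft(signal[0:64])     # earlier time interval (would be the stored version)
second = dft(signal[32:96])   # later, overlapping time interval
cross, auto = correlation_sets(first, second)
```

The second set is the cyclic autocorrelation of the later frame, so `auto[0]` equals the frame energy; the first set is the cyclic cross-correlation between the two frames.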
Analogously, according to said first aspect of the invention a method for estimating a speech fundamental frequency is provided, the method using a first set of values and a second set of values, the first set of values being a received frequency domain representation of a first set of time domain signal values within a first time interval and the second set of values being a received frequency domain representation of a second set of time domain signal values within a second time interval, the second time interval being later than and offset from the first time interval, the method for estimating the speech fundamental frequency comprising the steps of:

storing a version of the first set of values and providing values of a first power density spectrum by multiplying the stored version of the first set of values with a complex conjugate version of the second set of values;
providing values of a second power density spectrum by multiplying a version of the second set of values with a complex conjugate version of the second set of values;
determining the speech fundamental frequency estimate on the basis of the values of the first power density spectrum and the values of the second power density spectrum.

The step of determining the speech fundamental frequency estimate comprises performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values, performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, and determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values.
This first aspect of the invention is based on the finding that utilizing the first and second sets of values, which originate from sets of time domain signal values in time intervals which are offset from each other, results in a total analyzed signal portion which is larger than a single signal portion, for example the first or the second time interval alone. In other words, it is now possible to analyze a temporally longer signal portion by means of existing (short) time-frequency-transformed signals without the need to provide a new time-frequency-transform just for the estimation of the speech fundamental frequency. However, it is exactly the combination of a given first and second set of values which enables such a temporally longer analysis interval, that is, the calculation of the first spectrum from the first and second sets of values and of the second spectrum from only the second set of values. Thus, the first spectrum represents the spectrum over the longer time interval, whereas the second spectrum serves to determine the characteristics of the second set of values in order to compensate errors in the first spectrum. Therefore it is necessary to calculate not only the first spectrum but also the second spectrum.
The approach according to the first aspect of the invention provides the advantage that a signal given in a time-frequency-transformed version (provided for other applications than speech fundamental frequency estimation) can still be used also for speech fundamental frequency estimation (even in the case the time-frequency-transformed version of the signal would normally be not appropriate for providing a precise speech fundamental frequency estimation).
According to a second aspect which does not form part of the invention, a speech fundamental frequency estimator is provided which is configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the speech fundamental frequency estimator comprising:

a power density spectrum calculator being configured for providing values of a power density spectrum by multiplying a version of the set of values with a complex conjugate version of the set of values, wherein the power density spectrum calculator is configured for determining an estimate of the power density spectrum of background noise and for determining a noise suppression factor on the basis of said power density spectrum of background noise;
an analyzer being configured for multiplying the power density spectrum with said noise suppression factor and for performing a frequency-time-transform of the multiplied values of the power density spectrum in order to obtain a set of correlation function values, wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the set of correlation function values.

Analogously, according to the second aspect which does not form part of the present invention a method for estimating a speech fundamental frequency is provided, the method being configured for receiving a set of values, the set of values being a frequency domain representation of a set of time domain signal values within a time interval, the method comprising the steps of:

providing values of a power density spectrum by multiplying a version of the set of values with a complex conjugate version of the set of values;
determining an estimate of the power density spectrum of background noise and determining a noise suppression factor on the basis of said power density spectrum of background noise and the power density spectrum of the input signal;
multiplying the power density spectrum with said noise suppression factor;
performing a frequency-time-transform of the multiplied values of the power density spectrum in order to obtain a set of correlation function values; and
determining the speech fundamental frequency estimate on the basis of the set of correlation function values.

The second aspect is based on the finding that a significant improvement in the preciseness of speech fundamental frequency estimation can be realized when background noise is adequately compensated. This is especially the case in a scenario where in speech pauses erroneous detections of speech occur which then falsify the detected result and, in consequence, decrease the reliability of the detected speech fundamental frequency.
The second aspect thus provides the advantage that by simple means, for example a pause detector or just a further analysis of the already existing signal frames a significant improvement in preciseness and reliability of the estimated speech fundamental frequency can be obtained.
According to a further aspect of the present invention the speech fundamental frequency estimator is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent sets of time domain signal values having overlapping time intervals. This provides the advantage that by multiplying said sets of values, which represent portions of overlapping and therefore consecutive time intervals, a signal in a total time interval can be analyzed, in which a low fundamental frequency can be reliably estimated in given short time signal portions.
Furthermore, the speech fundamental frequency estimator according to another aspect of the present invention is characterized in that the first power density spectrum calculator is configured for multiplying versions of the sets of values which represent time domain signal values having time intervals overlapping by at least 25 percent. This makes it possible to determine the speech fundamental frequency estimate reliably, as the first and second sets of values belong to time domain signal values which have a sufficiently overlapping interval structure. Therefore, due to the sufficient overlap of both time intervals, such an estimation can be considered to be an estimation over the "longer" time interval.
According to a further aspect of the present invention the speech fundamental frequency estimator is characterized in that the second power density spectrum calculator is configured for providing a conjugate complex version of the second set of values to the first power density spectrum calculator and wherein the first power density spectrum calculator is configured for using the provided conjugate complex version of the second set of values as the version with which the stored version of the first set of values is to be multiplied. This provides the advantage that a complex conjugate version of one of the sets of values has to be calculated only once such that the numerical or computational effort can be reduced.
In another embodiment of the present invention the speech fundamental frequency estimator is characterized in that the analyzer is configured for performing a first frequency-time-transform of the first power density spectrum in order to obtain a first set of correlation function values and for performing a second frequency-time-transform of the second power density spectrum in order to obtain a second set of correlation function values, wherein the analyzer is furthermore configured for determining a set of normalization values and a set of weighting values from the second power density spectrum and for using the set of normalization values and the set of weighting values in the first and second frequency-time-transform and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the first and second sets of correlation function values. This provides the advantage that, on the one hand, the short-time envelope can be eliminated and, on the other hand, it is possible to increase the attenuation with rising frequency. Thus typical characteristics of speech, especially the speech fundamental frequency structure in the low frequency range, can be adequately dealt with.
Also, the speech fundamental frequency estimator according to a further embodiment can be characterized in that the analyzer further comprises a compensator being configured for adaptively compensating the values of the first set of correlation function values by a correction factor being based on a value of the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of the compensated first set of correlation function values and the second set of correlation function values. Providing such an adaptive compensation control provides the advantage that it is now possible to correct error terms in the cross correlation function as to compensate for example undesired amplitudes which occur at the distinct offsets.
According to another embodiment the speech fundamental frequency estimator can be characterized in that the compensator is configured for multiplying the second set of correlation function values by a lower bounded quotient between a value of the first set of correlation function values and a value of the second set of correlation function values in order to obtain said compensated first set of correlation function values. Such a configuration of the speech fundamental frequency estimator makes sure that a relation between the cross correlation function and the autocorrelation function does not decrease below a minimal value which, in turn, improves the robustness of speech fundamental frequency estimation.
Furthermore, it is also possible according to another embodiment of the present invention that the speech fundamental frequency estimator is characterized in that the analyzer is configured for combining the compensated first set of correlation function values and the second set of correlation function values in order to obtain an extended set of correlation function values, wherein the values of the extended set of correlation function values assume corresponding values from the compensated first set of correlation function values, the second set of correlation function values or values between the compensated first set of correlation function values and the second set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate on the basis of said extended set of correlation function values. This provides the advantage that the extended set of correlation function values comprises now information from the first as well as the second set of correlation function values such that an estimation of the speech fundamental frequency can be based on the information comprised in the first and second time interval as well as a correction of possible errors is also possible by the information of the second time interval. Furthermore, it is also possible to perform a weighting of the values of the first set of correlation function values in contrast to the values of the second set of correlation function values in order to take into account the influence of an offset between the first set of correlation function values (respectively the compensated set of correlation function values) and the second set of correlation function values.
In a further embodiment the speech fundamental frequency estimator is characterized in that the analyzer is configured for determining the speech fundamental frequency estimate by searching the index of a maximum value from the extended set of correlation function values within a predetermined number of indices of the values of the extended set of correlation values, from the first or second set of correlation function values within a predetermined number of indices of values of the first respectively second set of correlation function values or from the compensated first set of correlation function values within the predetermined number of indices of values of the compensated first set of correlation function values and wherein the analyzer is furthermore configured for determining the speech fundamental frequency estimate as the product of a sampling frequency and a reciprocal value of said searched index.
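The maximum search and the conversion from index to frequency can be sketched as below. The search limits `k_min` and `k_max` stand for the predetermined number of indices; their names and the example values are assumptions made for illustration.

```python
def estimate_from_correlation(corr, fs, k_min, k_max):
    """Return (f_p, k_peak): the index of the maximum in [k_min, k_max] and fs / k_peak."""
    k_peak = max(range(k_min, k_max + 1), key=lambda k: corr[k])
    return fs / k_peak, k_peak

# Hand-made set of correlation function values with a side lobe at index 100.
fs = 8000
corr = [0.0] * 129
corr[0] = 1.0
corr[50] = 0.4
corr[100] = 0.7
f_p, k_peak = estimate_from_correlation(corr, fs, k_min=20, k_max=128)
print(f_p, k_peak)
```

Index 0 is excluded from the search because it always holds the largest value of an autocorrelation.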
According to a further embodiment, the speech fundamental frequency estimator is characterized in that the analyzer is furthermore configured for determining a reliability factor for the determined speech fundamental frequency estimate and for blocking an output of the determined speech fundamental frequency estimate in the case the determined reliability factor is below a predetermined reliability factor. Such a configuration improves the reliability of the estimated speech fundamental frequency.
Additionally, in a special embodiment the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for determining said reliability factor by dividing the maximum value at said searched index by the first value of the extended set of correlation function values or, respectively the first, the compensated first or second set of correlation function values. This provides the advantage that the reliability factor is only dependent on the scenario in which the speech fundamental frequency estimator is used and not on just a predefined factor which might be too rough in some situations.
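A minimal sketch of the reliability check, assuming a single set of correlation function values and a hypothetical threshold `min_reliability`; returning `None` models the blocked output.

```python
def reliable_estimate(corr, fs, k_min, k_max, min_reliability):
    """Return (f_p, reliability), or None if the output is blocked."""
    k_peak = max(range(k_min, k_max + 1), key=lambda k: corr[k])
    if corr[0] <= 0.0:
        return None                           # no signal power, nothing to normalize by
    reliability = corr[k_peak] / corr[0]      # maximum value divided by the first value
    if reliability < min_reliability:
        return None                           # estimate judged unreliable, output blocked
    return fs / k_peak, reliability

fs = 8000
voiced = [0.0] * 129
voiced[0], voiced[100] = 1.0, 0.7     # distinct side lobe
noisy = [0.0] * 129
noisy[0], noisy[100] = 1.0, 0.1       # weak side lobe, e.g. in a speech pause

accepted = reliable_estimate(voiced, fs, 20, 128, min_reliability=0.5)
blocked = reliable_estimate(noisy, fs, 20, 128, min_reliability=0.5)
```

Because the peak is divided by the value at index 0, the factor does not depend on the absolute signal level.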
Furthermore, the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining an estimate of the power density spectrum of background noise and for determining a noise suppression factor on the basis of said power density spectrum of background noise, and wherein the analyzer is configured for multiplying the first and second power density spectrum with said noise suppression factor prior to the frequency-time-transform of the first respectively second power density spectrum. This provides the advantage that an additional improvement can be realized, as erroneous detections in speech pauses can then be avoided, which, in turn, improves the reliability of the speech fundamental frequency estimate.
In particular, the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum. This ensures that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the second power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This provides an efficient and numerically simple way of determining the estimate of the power density spectrum of background noise.
In particular, the speech fundamental frequency estimator can be characterized in that the noise suppression factor is defined by $V(e^{j\Omega_{\mu}}, n) = \max\left\{ V_{0},\; 1 - \beta \frac{\hat{S}_{nn}(\Omega_{\mu}, n)}{\hat{S}_{yy}(\Omega_{\mu}, n)} \right\}$
wherein Ŝ_nn(Ω_µ, n) denotes the estimate of the power density spectrum of the background noise, Ŝ_yy(Ω_µ, n) denotes the second power density spectrum of the input signal, V₀ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise.
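The noise suppression factor defined above can be evaluated per frequency bin as in the following sketch. The values chosen for V₀ and β and the handling of bins with zero input power are assumptions for illustration only.

```python
def noise_suppression_factor(s_nn, s_yy, v0=0.1, beta=2.0):
    """V = max(V0, 1 - beta * S_nn / S_yy), evaluated for each frequency bin."""
    factors = []
    for noise, total in zip(s_nn, s_yy):
        if total <= 0.0:
            factors.append(v0)   # assumption: fall back to maximum attenuation
        else:
            factors.append(max(v0, 1.0 - beta * noise / total))
    return factors

s_nn = [1.0, 1.0, 1.0]       # estimated noise power density per bin
s_yy = [1.0, 4.0, 100.0]     # second power density spectrum per bin
factors = noise_suppression_factor(s_nn, s_yy)
print(factors)
```

Bins dominated by noise are limited to V₀, while bins with a high signal-to-noise ratio pass almost unattenuated.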
In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the analyzer is furthermore configured for reestimating the speech fundamental frequency estimate in the case the determined speech fundamental frequency estimate is below a predefined frequency value, wherein the analyzer is configured for performing the reestimation by searching a further index of a further maximum value of the extended set of correlation function values, the first or second set of correlation function values or the compensated first set of correlation function values within a further number of values of said sets of correlation function values and for outputting a product of a sampling frequency and a reciprocal value of said further index as the determined speech fundamental frequency estimate. This provides a further improvement of the speech fundamental frequency estimate, especially in the case when the determined estimate is below said predefined frequency (which means that the estimate may not be as reliable as desired).
Especially the speech fundamental frequency estimator can be characterized in that the analyzer is configured for searching said index of said further maximum value using a number of values k of said sets of correlation function values which is defined by $\frac{f_{s}}{f_{p, \max}} \leq k < \frac{f_{s}}{2 f_{p} (n)} + k_{0}$
wherein k denotes the number of values of said sets of correlation function values, f_p(n) denotes the previously determined speech fundamental frequency estimate, f_p,max denotes a predefined value of a maximal possible speech fundamental frequency, f_s denotes a sampling frequency and k₀ denotes a constant which enables the search of a maximum slightly above $k = \frac{f_{s}}{2 f_{p} (n)} .$
Such a use of the doubled speech fundamental frequency estimate from a previous estimation broadens the region to be searched and thus strengthens the reliability and preciseness of the outputted estimate.
Also, in another embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the analyzer is configured for outputting said product as the determined speech fundamental frequency estimate only in the case the value of the autocorrelation function at the further index is larger than 60 percent of the value of the autocorrelation function at the previously searched maximal index and a value of the extended set of correlation function values at said further index is larger than a previously defined amplitude value. This further strengthens the validity of the outputted speech fundamental frequency estimate, as two separate conditions have to be fulfilled before the result is output.
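The reestimation with its search range and the two acceptance conditions can be sketched as follows. As a simplification, one list `corr` is assumed to serve both as the autocorrelation function and as the extended set of correlation function values; `f_threshold`, `min_amplitude` and all numeric values are hypothetical.

```python
import math

def reestimate(corr, fs, f_p, f_p_max, k0, f_threshold, min_amplitude):
    """Search fs/f_p_max <= k < fs/(2*f_p) + k0 and accept the new peak if it is strong enough."""
    if f_p >= f_threshold:
        return f_p                          # estimate is not suspiciously low
    k_prev = round(fs / f_p)                # previously searched maximal index
    k_start = math.ceil(fs / f_p_max)
    k_stop = fs / (2.0 * f_p) + k0          # exclusive upper bound
    candidates = [k for k in range(k_start, len(corr)) if k < k_stop]
    if not candidates:
        return f_p
    k_new = max(candidates, key=lambda k: corr[k])
    if corr[k_new] > 0.6 * corr[k_prev] and corr[k_new] > min_amplitude:
        return fs / k_new
    return f_p

fs = 8000
corr = [0.0] * 129
corr[0], corr[50], corr[100] = 1.0, 0.75, 0.8   # true period 50, first search picked 100
corrected = reestimate(corr, fs, f_p=80.0, f_p_max=600.0, k0=3,
                       f_threshold=100.0, min_amplitude=0.3)

corr_weak = list(corr)
corr_weak[50] = 0.2                              # peak at half the period too weak
unchanged = reestimate(corr_weak, fs, f_p=80.0, f_p_max=600.0, k0=3,
                       f_threshold=100.0, min_amplitude=0.3)
```

In the first case the estimate is corrected from 80 Hz to 160 Hz; in the second case the original estimate is kept because the 60 percent condition fails.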
Additionally, the speech fundamental frequency estimator in a further embodiment can be characterized in that the analyzer is configured for modifying a speech fundamental period corresponding to said determined speech fundamental frequency estimate by an interpolation correction term prior to outputting a modified speech fundamental frequency estimate, wherein said interpolation correction term is dependent on values of said first or second set of correlation function values, of said extended set of correlation function values or said compensated first set of correlation function values, respectively. Such an interpolation approach provides the advantage that the error terms resulting from the use of a discrete time-frequency-transform respectively a frequency-time-transform can be reduced by a processing of the signals after the inverse transform has been performed.
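The description does not state the form of the interpolation correction term here. As one possible illustration, the sketch below uses parabolic interpolation through the maximum and its two neighbours, which is a common technique substituted for the unspecified term; the function name and the test data are assumptions.

```python
def refine_period(corr, k_peak):
    """Return a fractional period obtained by fitting a parabola around k_peak."""
    if k_peak <= 0 or k_peak >= len(corr) - 1:
        return float(k_peak)                 # no neighbours available
    left, mid, right = corr[k_peak - 1], corr[k_peak], corr[k_peak + 1]
    curvature = left - 2.0 * mid + right
    if curvature >= 0.0:
        return float(k_peak)                 # not a strict local maximum
    return k_peak + 0.5 * (left - right) / curvature

# Correlation values sampled from a parabola whose true maximum lies at 64.25.
fs = 8000
corr = [1.0 - 0.01 * (k - 64.25) ** 2 for k in range(130)]
period = refine_period(corr, 64)
f_refined = fs / period
print(period, f_refined)
```

With the integer index 64 the estimate would be 125 Hz; the refined period of 64.25 samples gives roughly 124.5 Hz.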
In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized by a frequency domain filtering unit being configured for receiving the frequency domain versions of the first and second set of time domain signal values, for frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values, respectively, and for providing said first and second sets of values to the first and second power density spectrum calculator respectively. Such a pre-processing of the received signals provides the advantage that a pre-processed version of the input signal significantly increases the reliability and preciseness of the estimation in contrast to an embodiment of the invention in which no pre-processing is performed. However the computational or numerical burden for this is relatively low, especially if the filter has a little number of coefficients.
In a further embodiment of the present invention the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for filtering only frequencies below a predefined limiting frequency. This relaxes a computational burden as only the parts of the spectrum are filtered which are of the most importance for a reliable estimation of very low speech fundamental frequencies.
Furthermore, in another embodiment the speech fundamental frequency estimator can be characterized in that the frequency domain filtering unit is configured for delaying values of said frequency domain versions being above said predefined limiting frequency. This compensates a delay which might be introduced in a signal flow path for filtering signals having a frequency below said limiting frequency.
The above mentioned aspects and modifications according to the first aspect of the present invention can also be implemented in corresponding methods where the advantages mentioned above come into effect in an analogous manner.
Furthermore, the invention can also be implemented as a computer program having a program code for performing the inventive method, when the computer program runs on a computer.
In an embodiment of the second aspect the speech fundamental frequency estimator can be characterized in that the power density spectrum calculator is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient and a term being dependent on a quotient of the estimate of the power density spectrum of background noise and the second power density spectrum. This provides the advantage that an additional improvement can be realized as then erroneous detections in speech pauses can be avoided which, in turn, improves the reliability of the estimated speech fundamental frequency. Also, it can be made sure that the noise suppression factor is always above a predefined value.
Further, the second aspect may comprise a speech fundamental frequency estimator being characterized in that the power density spectrum calculator is configured for determining the estimate of the power density spectrum of background noise in speech pauses or for determining the estimate of the power density spectrum of background noise from a segment-wise estimation of the minima of the power of a differential signal. This makes sure, that a minimum suppression factor is used and thus an effective suppression of background noise is accomplished.
Furthermore, the speech fundamental frequency estimator according to a further embodiment may be characterized in that the noise suppression factor is defined by $V (e^{j Ω_{µ}}, n) = \max \left\{ V_{0}, 1 - β \frac{{\hat{S}}_{nn} (Ω_{µ}, n)}{{\hat{S}}_{yy} (Ω_{µ}, n)} \right\}$
wherein Ŝ_nn (Ω _µ,n) denotes the estimate of the power density spectrum of the background noise, Ŝ_yy (Ω _µ,n) denotes the second power density spectrum of the input signal, V₀ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise.
The above mentioned aspects and modifications according to the second aspect which in isolation does not form part of the present invention can also be implemented in corresponding methods where the advantages mentioned above come into effect in an analogous manner.
Additional features and advantages of the present invention will become more readily appreciated from the following detailed description of preferred or advantageous embodiments with reference to the accompanying drawings, in which

Figure 1: shows a time-frequency-analysis of a speech signal;
Figure 2: shows a block diagram of a multi-rate system for speech recognition having a speech fundamental frequency estimation;
Figure 3: shows a block diagram of an analysis system for speech recognition having a speech fundamental frequency estimation;
Figure 4: shows a block diagram of a method and a system for speech fundamental frequency estimation;
Figure 5: shows an autocorrelation- and time-frequency-analysis of sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz;
Figure 6: shows a block diagram of a system respectively method for a speech fundamental frequency estimation with spectral refinement;
Figure 7: shows a block diagram of a system for speech fundamental frequency estimation with spectral refinement in the lower frequency band from 0 Hz to 1000 Hz;
Figure 8: shows in the upper section the analysis of the autocorrelation and in the lower section the time-frequency analysis of sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz. The analyses have been performed with a previous spectral refinement;
Figure 9A: shows a block diagram of an embodiment of the inventive speech fundamental frequency estimator;
Figure 9B: shows a flow diagram of an embodiment of the inventive method for estimating the speech fundamental frequency estimate;
Figure 10: shows diagrams with results of the speech fundamental frequency estimation with a spectral refinement as a time-frequency-analysis with and without noise reduction;
Figure 11A: shows a block diagram of another embodiment of the inventive speech fundamental frequency estimator;
Figure 11B: shows a flow diagram of another embodiment of the inventive method for estimating the speech fundamental frequency estimate;
Figure 12: shows a block diagram of a method respectively system for speech fundamental frequency estimation with additional consideration of a passed subband input vector and spectral refinement;
Figure 13: shows in the left section the analysis of the autocorrelation function of a speech fundamental frequency at about 270 Hz. In the right section the analysis of a low speech fundamental frequency of about 60 Hz is shown;
Figure 14: shows in the upper section the analysis of an extended autocorrelation and in the lower section the time-frequency-analysis of several sinusoidal signals with varying frequency distances from 300 Hz to 60 Hz. Additionally a spectral refinement had been performed in a lower frequency range;
Figure 15: shows in the upper section the time-frequency-analysis of a speech signal with additional post-processing and in the lower section the time-frequency-analysis of a speech signal without additional post-processing;
Figure 16: shows in the upper section the time-frequency analysis of several sinusoidal signals of equal amplitude with varying frequency distance (partial section of the signal). In the lower section the time-frequency-analysis of a speech signal (partial section of the signal) is shown.

Description of preferred embodiments

The present invention relies mainly on estimation methods based on the autocorrelation function, which are described herein in advance for a better understanding. However, some aspects of the present invention are also implemented in the conventional autocorrelation methods such that the description in this section is not to be considered as state of the art.
In the following it is assumed that the speech signal s(n) will be recorded by a microphone. To this signal background noise n(n) is often superimposed. Consequently, the microphone signal y(n) is composed by local speech s(n) and disturbances n(n): $y (n) = s (n) + n (n)$
From this signal the short-time autocorrelation function in the time domain can be estimated in a block-based way according to ${\hat{r}}_{yy}^{'} (m, n) = \frac{1}{L} \sum_{k = 0}^{L - 1} y (n - k) y (n - k + m)$
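The direct block-based estimate above can be sketched as follows. This is only a minimal illustration in plain Python; the function name `short_time_autocorr`, the bounds check and the cosine test signal are assumptions made for this example and are not part of the original description.

```python
import math

def short_time_autocorr(y, n, m, L):
    """Direct estimate (1/L) * sum_{k=0}^{L-1} y(n-k) * y(n-k+m) for offset m >= 0."""
    if m < 0 or n - (L - 1) < 0 or n + m >= len(y):
        raise IndexError("samples n-L+1 ... n+m must lie inside the signal")
    return sum(y[n - k] * y[n - k + m] for k in range(L)) / L

# Illustrative test signal (assumption): a cosine with a period of 8 samples.
y = [math.cos(2 * math.pi * i / 8) for i in range(64)]
r0 = short_time_autocorr(y, n=31, m=0, L=8)  # zero offset
r4 = short_time_autocorr(y, n=31, m=4, L=8)  # half a period
r8 = short_time_autocorr(y, n=31, m=8, L=8)  # one full period
```

Each offset m costs L multiplications, so evaluating a wide range of offsets for every frame quickly becomes expensive.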
As this short-time autocorrelation function has to be evaluated for a quite large range of the autocorrelation offset m, the direct estimation requires too much effort for many applications. As hands-free and speech recognition systems with a multi-rate structure calculate a subband transform (for example by a DFT) anyway, an approach which requires less effort can be used here. The analysis filter bank of a multi-rate system can be described as follows:

First the input signal y(n) is partitioned into windowed, overlapping frame blocks [see also J. Benesty, S. Makino, J. Chen: Speech Enhancement, Springer, Berlin, Germany, 2005; E. Hänsler, G. Schmidt: Acoustic Echo and Noise Control - A Practical Approach, John Wiley & Sons, Hoboken, New Jersey, USA, 2004; E. Hänsler, G. Schmidt: Topics in Acoustic Echo and Noise Control - Selected Methods for the Cancellation of Acoustic Echoes, the Reduction of Background Noise, and Speech Processing, Springer, Berlin, Germany, 2006 or P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006]. Depending on a DFT of order N (which is actually the block length of said DFT), one frame block, i.e. one signal input vector y(n), is composed as follows: $y (n) = {[y (n), y (n - 1), \dots, y (n - N + 1)]}^{T}$
Each signal input vector y(n) is subsequently weighted by a window function $h = {[h_{0}, h_{1}, \dots, h_{N - 1}]}^{T}$
and transformed to the frequency domain by a DFT: $Y (e^{j Ω_{µ}}, n) = \sum_{k = 0}^{N - 1} y (n - k) h_{k} e^{- j Ω_{µ} k}$

The sampling points µ are hereby located equidistantly in the normalized frequency domain: $Ω_{µ} = \frac{2 π}{N} µ \quad \text{with } µ \in \{0, \dots, N - 1\}$
From the short-time spectrum Y(e^{jΩ _µ},n) the short-time power density spectrum can be estimated by calculating the square of the absolute value according to the following equation: ${\hat{S}}_{yy} (Ω_{µ}, n) = {|Y (e^{j Ω_{µ}}, n)|}^{2} = Y (e^{j Ω_{µ}}, n) Y^{*} (e^{j Ω_{µ}}, n)$
The thus determined power density spectrum Ŝ_yy (Ω _µ,n) from equation 8 is then smoothed in frequency direction and divided by the thus obtained envelope S̅_yy (Ω _µ,n). Hereby the short-time envelope is removed. The smoothing in frequency direction can be described by ${\tilde{S}}_{yy} (Ω_{µ}, n) = \begin{cases} {\hat{S}}_{yy} (Ω_{µ}, n), & \text{for } µ = 0, \\ λ {\tilde{S}}_{yy} (Ω_{µ - 1}, n) + {\hat{S}}_{yy} (Ω_{µ}, n), & \text{for } µ \in \{1, \dots, N - 1\} \end{cases}$
respectively ${\overline{S}}_{yy} (Ω_{µ}, n) = \begin{cases} {\tilde{S}}_{yy} (Ω_{µ}, n), & \text{for } µ = N - 1, \\ λ {\overline{S}}_{yy} (Ω_{µ + 1}, n) + (1 - λ) {\tilde{S}}_{yy} (Ω_{µ}, n), & \text{for } µ \in \{0, \dots, N - 2\} \end{cases}$
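The windowing, the DFT and the squared magnitude can be sketched in a few lines of plain Python. The sketch is written for readability rather than speed (a real system would use an FFT), and the Hann window, the helper names and the test tone are assumptions for this example, since the description does not fix a particular window function.

```python
import cmath
import math

def hann_window(N):
    """Periodic Hann window (an assumed choice of h)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / N) for k in range(N)]

def short_time_spectrum(y, n, h):
    """Y(e^{j Omega_mu}, n) = sum_k y(n-k) * h_k * exp(-j * 2*pi*mu*k / N)."""
    N = len(h)
    frame = [y[n - k] * h[k] for k in range(N)]
    return [sum(frame[k] * cmath.exp(-2j * math.pi * mu * k / N) for k in range(N))
            for mu in range(N)]

def power_density_spectrum(Y):
    """Squared magnitude Y * conj(Y) for every frequency sampling point."""
    return [(v * v.conjugate()).real for v in Y]

# Illustrative input (assumption): a tone located exactly on bin 4 of an N = 32 DFT.
N = 32
y = [math.cos(2 * math.pi * 4 * i / N) for i in range(64)]
S = power_density_spectrum(short_time_spectrum(y, n=N - 1, h=hann_window(N)))
peak_bin = max(range(N // 2 + 1), key=lambda mu: S[mu])
```

For this test tone the largest value in the lower half of the spectrum lies at bin 4.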
Values for the smoothing constant λ are chosen from the range $0.3 < λ < 0.7$
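The two smoothing passes can be sketched as a forward and a backward first-order recursion over the frequency index. This is a hedged illustration: the function name `spectral_envelope` and the default λ = 0.5 are assumptions, and the forward recursion is taken exactly as printed above, i.e. without a (1 - λ) factor on the new sample.

```python
def spectral_envelope(S_hat, lam=0.5):
    """Forward then backward smoothing of a power density spectrum over frequency."""
    N = len(S_hat)
    forward = [0.0] * N
    forward[0] = S_hat[0]
    for mu in range(1, N):
        # Forward pass, as printed in the description.
        forward[mu] = lam * forward[mu - 1] + S_hat[mu]
    envelope = [0.0] * N
    envelope[N - 1] = forward[N - 1]
    for mu in range(N - 2, -1, -1):
        # Backward pass with the (1 - lam) weighting.
        envelope[mu] = lam * envelope[mu + 1] + (1.0 - lam) * forward[mu]
    return envelope

# Illustrative input (assumption): a flat spectrum of eight bins.
env = spectral_envelope([1.0] * 8, lam=0.5)
```

Running the recursion in both directions avoids the one-sided lag that a single pass would introduce along the frequency axis.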
Subsequently, a linear weighting of the estimated and normalized power density spectrum is performed: ${\hat{S}}_{yy, norm} (Ω_{µ}, n) = \frac{{\hat{S}}_{yy} (Ω_{µ}, n)}{{\overline{S}}_{yy} (Ω_{µ}, n)} W (e^{j Ω_{µ}})$
The weighting function W(e^{jΩ _µ},n) has been chosen such that the attenuation rises with rising frequency. This choice results from the fact that speech mainly at low frequencies has a speech fundamental frequency structure - which in turn results in an improved estimation of the speech fundamental frequency. In Fig. 4 the functional principle of a method for speech fundamental frequency estimation is shown.
The autocorrelation function ${\hat{r}}_{yy} (m, n) = \frac{1}{N} \sum_{µ = 0}^{N - 1} {\hat{S}}_{yy, norm} (Ω_{µ}, n) e^{j \frac{2 π}{N} µm}$
is determined by an inverse transform of the normalized and weighted power density spectrum from equation 12. The autocorrelation function r̂_yy (m,n) is used in order to estimate the speech fundamental frequency f_p (n). The index m describes herein the autocorrelation offset and the index n describes the present frame (under analysis). For each single frame the preliminary speech fundamental frequency f'_p (n) can be determined by a search of the maximum in a selected range of indices, for example 30 ≤ m ≤ 100. The speech fundamental frequency is then determined as the sampling frequency f_s divided by the index at which the maximum of the autocorrelation has occurred: $f_{p}^{'} (n) = \frac{f_{s}}{τ_{p} (n)}$
with $τ_{p} (n) = \underset{30 \leq m \leq 100}{argmax} \{{\hat{r}}_{yy} (m, n)\}$
Furthermore a reliability p_fp (n) of the estimated speech fundamental frequency is determined. For this purpose the value of the normalized autocorrelation at the maximum point (i.e. the index where the autocorrelation function becomes maximal) is used: $p_{f_{p}} (n) = \frac{{\hat{r}}_{yy} (τ_{p} (n), n)}{{\hat{r}}_{yy} (0, n)}$
Large values, that is values close to one, indicate a very sure detection, whereas small values indicate a doubtful detection. For this reason a detection only takes place for values of the normalized autocorrelation function which are larger than p₀ (which is taken as a predefined threshold value): $f_{p} (n) = \begin{cases} f_{p}^{'} (n), & \text{for } p_{f_{p}} (n) > p_{0}, \\ \text{not detectable}, & \text{else} \end{cases}$
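The inverse transform that turns the normalized power density spectrum into an autocorrelation function can be sketched as a plain inverse DFT. The function name `autocorr_from_psd` and the toy spectrum are assumptions for this example. Taking the real part assumes a real and symmetric spectrum, as is the case for the power density spectrum of a real-valued signal.

```python
import cmath
import math

def autocorr_from_psd(S_norm):
    """r(m) = (1/N) * sum_mu S_norm[mu] * exp(j * 2*pi*mu*m / N), for m = 0 ... N-1."""
    N = len(S_norm)
    return [sum(S_norm[mu] * cmath.exp(2j * math.pi * mu * m / N)
                for mu in range(N)).real / N
            for m in range(N)]

# Illustrative spectrum (assumption): two symmetric lines at bins 2 and N - 2.
N = 16
S = [0.0] * N
S[2] = S[N - 2] = N / 2
r = autocorr_from_psd(S)  # equals cos(2*pi*2*m/N), i.e. a period of 8 offsets
```

A single spectral line pair thus yields a periodic autocorrelation whose first maximum after m = 0 sits at the period of the signal, which is the property the maximum search exploits.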
A threshold value of p ₀ ∈ [0.2,0.3] has turned out to be favourable. The value of the normalized autocorrelation at the location τ_p (n) can be of large significance as reliability information, for example for a speech signal reconstruction. Hereby the desired value of the speech fundamental frequency can be either slowly or quickly traced, dependent on how sure a speech fundamental frequency can be estimated.
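The maximum search and the threshold decision can be sketched as follows. The function name `detect_pitch`, the use of `None` for "not detectable" and the synthetic autocorrelation values are assumptions for this example; the search range 30 ≤ m ≤ 100 and the threshold 0.2 are taken from the description.

```python
import math

def detect_pitch(r, fs, m_min=30, m_max=100, p0=0.2):
    """Return (f_p, reliability); f_p is None if the reliability does not exceed p0."""
    tau = max(range(m_min, m_max + 1), key=lambda m: r[m])  # offset of the maximum
    reliability = r[tau] / r[0] if r[0] > 0 else 0.0         # normalized peak value
    f_prelim = fs / tau                                      # preliminary estimate
    return (f_prelim if reliability > p0 else None), reliability

# Illustrative autocorrelation (assumption): period of 50 offsets, slowly decaying.
r_voiced = [math.cos(2 * math.pi * m / 50) * (1 - m / 1000) for m in range(129)]
f_voiced, p_voiced = detect_pitch(r_voiced, fs=11025)

# Illustrative weak case (assumption): flat, low correlation values.
r_weak = [1.0] + [0.1] * 128
f_weak, p_weak = detect_pitch(r_weak, fs=11025)
```

In the first case the maximum at m = 50 yields 11025 / 50 = 220.5 Hz with a reliability close to 0.95; in the second case the reliability of 0.1 stays below the threshold and no frequency is reported.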
Finally the inventive method proposed here is further presented in more detail by an example. Therefore 10 sinusoidal signals of equal amplitude are summed up. The frequencies of the sinusoidal signals have been chosen equidistantly. At the beginning of the signal a fundamental frequency of 300 Hz has been chosen, subsequently this frequency has been decreased linearly over the time to an end value of 60 Hz. In the upper diagram of Fig. 5 the development of the normalized autocorrelation vectors is shown and in the lower diagram of Fig. 5 a time-frequency-analysis of the corresponding input signals y(n) is shown.
For the analysis a DFT of order N = 256 (= DFT block length), a sampling frequency of f_s = 11025 Hz and a frame offset of r = 64 are used. The analysis of the autocorrelation r̂_yy (m,n) has been performed in the range from m=40 to m=128. Detection results have been considered valid if the reliability information p_fp (n) is larger than p₀ = 0.2. Finally the time-frequency analysis was considered only in the frequency range of interest up to 1000 Hz.
In the analysis of the autocorrelation it can be recognized that the speech fundamental frequency can be estimated surely up to an offset of about m=95 - this corresponds to a speech fundamental frequency of about f_p(n) = 120 Hz (at a sampling frequency of f_s=11025 Hz). The graph of this detection with decreasing frequency can also be seen in the time-frequency-analysis up to about t=3.8 s. However, if the speech fundamental frequency is below f_p(n) = 120 Hz (which is often the case with men having a low speech fundamental frequency), this speech fundamental frequency cannot be determined in a reliable way.
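The relation between an autocorrelation offset and the corresponding frequency is a simple division, which can be checked with a short sketch. The helper name `lag_to_frequency` is an assumption for this example; the numbers are those of the analysis above.

```python
def lag_to_frequency(m, fs):
    """Frequency in Hz whose period equals m samples at sampling frequency fs."""
    if m <= 0:
        raise ValueError("the offset m must be positive")
    return fs / m

FS = 11025                               # sampling frequency of the example in Hz
f_at_95 = lag_to_frequency(95, FS)       # offset up to which detection is sure
f_at_128 = lag_to_frequency(128, FS)     # upper end of the analysed offset range
```

An offset of m = 95 corresponds to roughly 116 Hz, which the description rounds to about 120 Hz, and the upper end of the analysed range, m = 128, corresponds to roughly 86 Hz.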
Contrary to the approaches mentioned in the previous description of the invention the approach disclosed subsequently has the following further advantages:

a sure and reliable estimation can also be performed for very low voices;
a better robustness in environments with background noise can be reached; and
the speech fundamental frequency can be determined with a significantly higher degree of precision.

Firstly, a method for estimating the speech fundamental frequency having an additional spectral refinement is described in more detail and it is shown how the detection robustness can be increased by a noise reduction which is integrated in the estimation method (no pre-processing). Subsequently an additional part of the method is presented which also enables the detection of very low speech fundamental frequencies by an additional delay correction structure. Finally approaches for adaptive post-processing and interpolation are disclosed which enable an error correction and an improvement of the preciseness of the speech fundamental frequency, respectively. However, it has to be mentioned here that all the disclosed aspects can also be used independently, such that the present invention does not only work if all the aspects mentioned above are implemented. For example the spectral refinement can be used without using the post-processing or the interpolation, or the approach having the additional delay correction structure can be used without using the spectral refinement approach. However, all the individual aspects jointly contribute to a much improved estimation of the speech fundamental frequency and shall be described herein as an embodiment.

Speech fundamental frequency estimation with spectral refinement

In the preceding section it has been shown that a speech fundamental frequency which is below 120 Hz can not be estimated. In the following an approach is presented which solves the described problem.
In addition to the already mentioned method according to the state of the art the newly proposed method uses an additional spectral refinement of the input spectrum Y(e^{jΩ _µ},n). The functional principle of this approach is disclosed in Fig. 6. The short-time spectrum Y(e^{jΩ _µ},n) is firstly filtered subband-wise by an FIR-filter (FIR = finite impulse response). Such a filtering serves the purpose of obtaining a more precise spectral resolution of the input spectrum Y(e^{jΩ _µ},n).
It was shown in Patent Application No. EP 06024940.6 that a spectral refinement within one subband can be reached by a short FIR-filter, and how the individual filter coefficients have to be determined. The FIR-filter used for the µ-th subband can be described as follows: $g_{µ} = {[g_{µ, 0}, g_{µ, 1}, \dots, g_{µ, M - 1}]}^{T}$
The parameter µ denotes herein the µ-th frequency sampling point of a short-time spectrum Ỹ(e^{jΩ _µ},n) having a higher resolution and the parameter M denotes the order of the used FIR-filters. A memory length M of the short FIR-filter is chosen between 3 and 5. For the frequency subbands of interest the spectral refinement finally can be determined as follows: $\tilde{Y} (e^{j Ω_{µ}}, n) = g_{µ, 0} Y (e^{j Ω_{µ}}, n) + \dots + g_{µ, M - 1} Y (e^{j Ω_{µ}}, n - (M - 1) \cdot r)$
A spectral refinement in the whole frequency range is not necessary for speech signals. Usually the speech fundamental frequency structure is only present in the lower frequency range that means it is sufficient to perform the refinement up to, for example, 1000 Hz. Above this threshold it is possible to only introduce a delay of (M-1)/2 samples (down-sampled). The numerical effort necessary for such a refinement can thus be kept low. In Fig. 7 the analysis-synthesis-system with additional calculation of the spectral refinement in a low frequency range is shown.
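The subband-wise FIR filtering below a limiting bin and the pure delay above it can be sketched as follows. This is only an illustration: the function name `refine_spectrum`, the restriction to an odd filter length and the coefficient values are assumptions for this example. The actual coefficients are derived in Patent Application No. EP 06024940.6 and are not reproduced here.

```python
def refine_spectrum(frames, G, mu_limit):
    """Filter each bin below mu_limit over the last M frames; delay the other bins.

    frames[i][mu] holds Y(e^{j Omega_mu}, n - i*r), newest frame first (i = 0 ... M-1).
    G[mu][i] holds the coefficient g_{mu,i} of the FIR-filter of subband mu.
    """
    M = len(frames)
    if M % 2 == 0:
        raise ValueError("this sketch assumes an odd filter length M")
    delay = (M - 1) // 2  # delay in down-sampled frames for the unfiltered bins
    refined = []
    for mu in range(len(frames[0])):
        if mu < mu_limit:
            refined.append(sum(G[mu][i] * frames[i][mu] for i in range(M)))
        else:
            refined.append(frames[delay][mu])
    return refined

# Illustrative data (assumption): three frames with three bins each, M = 3.
frames = [[1 + 0j, 2 + 0j, 3 + 0j],
          [4 + 0j, 5 + 0j, 6 + 0j],
          [7 + 0j, 8 + 0j, 9 + 0j]]
G = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]  # placeholder coefficients only
refined = refine_spectrum(frames, G, mu_limit=2)
```

Only the bins below `mu_limit` cost M complex multiplications per frame; the remaining bins are simply read from the frame that lies (M - 1)/2 frames in the past, so that all bins stay time-aligned.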
However, it has to be mentioned that by the calculation of a spectral refinement a low delay is introduced into the signal path. A detailed derivation of this part of the new approach is explained in more detail in Patent Application No. EP 06024940.6.
Subsequently the determination of the speech fundamental frequency can be performed analogously to the way as already disclosed in the previously mentioned description. However, the refined short-time spectrum Ỹ(e^{jΩ _µ},n) is now used in order to calculate the estimated and refined power density spectrum Ŝ_ỹỹ (Ω _µ,n) according to the following equation: ${\hat{S}}_{\tilde{y} \tilde{y}} (Ω_{µ}, n) = \tilde{Y} (e^{j Ω_{µ}}, n) {\tilde{Y}}^{*} (e^{j Ω_{µ}}, n) = {|\tilde{Y} (e^{j Ω_{µ}}, n)|}^{2}$
Following the power density spectrum Ŝ_ỹỹ (Ω _µ,n) is also smoothed, weighted and the autocorrelation function r̂_ỹỹ (m,n) for the estimation of the speech fundamental frequency is determined. In order to calculate said power density spectrum an approach corresponding to equations 9 to 17 can be used.
In Fig. 8 the analysis of the autocorrelation as well as the time-frequency-analysis with spectral refinement is shown. For the analyses the same parameters as previously mentioned have been used - namely a DFT of order N=256, a sampling frequency of f_s=11025 Hz, a frame offset of r=64 and a detection threshold of p₀ = 0.2. Furthermore, the same combination of sinusoidal signals with a frequency distance varying from 300 Hz to 60 Hz has been used as test signal. The black graph in the upper diagram of Fig. 8 and the white graph in the lower diagram of Fig. 8 show the estimated pitch period duration and the estimate of the speech fundamental frequency, respectively, when using the spectral refinement approach.
A comparison of Fig. 5 and 8 shows very clearly that the spectral refinement provides the possibility of a far better detection of the speech fundamental frequency. Very desirable is the fact that the sure and reliable detection rises up to an offset of m=N/2 =128 - this corresponds to a speech fundamental frequency of about 90 Hz. At lower frequencies fp < 90 Hz several detection errors occur. Finally it has to be mentioned that in many applications it is only of interest whether a speech fundamental frequency is present or not - an exact speech fundamental frequency would be of minor importance. Just in these application scenarios the previously presented approach would provide significant advantages.
In the following it will be the aim to present a new approach which works robustly in terms of erroneous estimations at very low speech fundamental frequencies. Additionally it is shown in the following section how noise reduction can be advantageously incorporated into the presently known method.

Speech fundamental frequency estimation with noise suppression

Fig. 9A shows a block diagram of an embodiment of a speech fundamental frequency estimator 900. The speech fundamental frequency estimator 900 comprises a power density spectrum calculator 902 and an analyzer 904. The power density spectrum calculator 902 has two inputs, one for receiving a set of values and one for receiving background noise information. The set of values Ỹ ₁ is a frequency-domain representation of a set of time domain signal values y₁ in a time interval t₁. The background noise information can for example be determined in speech pauses in which only a noise signal and no speech signal is provided to the power density spectrum calculator 902. The power density spectrum calculator 902 has two outputs, one for outputting a noise suppression factor V(e^{jΩ _µ},n) and one for outputting values of a power density spectrum. The analyzer 904 has two inputs for receiving both of the outputs of the power density spectrum calculator 902. The analyzer 904 furthermore has one output for outputting the determined speech fundamental frequency f_p(n).
The function of the speech fundamental frequency estimator 900 shall be described in more detail with reference to Fig. 9B. In Fig. 9B a flow diagram of a method for estimating the speech fundamental frequency is disclosed. The method 940 comprises a first step 950 in which a power density spectrum is provided by multiplying a version of the set of values Ỹ ₂ with a complex conjugate version of the second set of values. In parallel (or in series) in a second step 952 an estimate of a power density spectrum of background noise is determined. In this step 952 of determining the estimate of a power density spectrum of background noise the background noise information is used which may originate for example from a speech pause detector or other means which provide only information about the background noise in the absence of speech. In a third step 954 a noise suppression factor is determined which is explained in more detail below. In a fourth step 956 a multiplication of the power density spectrum with the noise suppression factor V(e^{jΩ _µ},n) is performed before in a fifth step 958 a frequency-time-transform is accomplished. Subsequently in a sixth step 960 the speech fundamental frequency is determined from the frequency-time-transformed signal resulting from step 958.
Such an approach provides the advantage that by considering background noise information the detection preciseness as well as the detection robustness can be improved, as for example in speech pauses, when only background noise occurs, no speech fundamental frequency shall be estimated. Thus, the reliability of an estimated speech fundamental frequency can be significantly improved. This results from the fact that the erroneous detections of speech fundamental frequencies in speech pauses can be avoided. Furthermore the multiplication of the noise suppression factor with the power density spectrum prior to the frequency-time-transform provides the advantage that such a multiplication in the frequency domain requires very little computational and numerical effort in contrast to a similar combination in time domain. Furthermore it is also possible to additionally consider other calculations or normalizations of the noise-compensated signal prior to said frequency-time-transform.
To be more precise, methods for the noise reduction are mostly based on modified Wiener filters whose frequency response in the respective frequency intervals is determined by $V (e^{j Ω_{µ}}, n) = \max \left\{ V_{0}, 1 - β \frac{{\hat{S}}_{nn} (Ω_{µ}, n)}{{\hat{S}}_{yy} (Ω_{µ}, n)} \right\}$
(see also S. F. Boll: Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Trans. Acoust. Speech Signal Process., Vol. 27, No. 2, pages 113-120, 1979; E. Hänsler: Statistische Signale - Grundlagen und Anwendungen, Springer, Berlin, Germany, 2001 or T. Haulick, K. Linhard: Noise Subtraction with Parametric Recursive Gain Curves, Proceed. of the European Conf. on Speech Communications and Technology, Vol. 6, pages 2611-2614, 1999). The value Ŝ_nn (Ω _µ,n) denotes an estimation of the auto power density spectrum of a disturbance (background noise), V₀ describes a maximal attenuation and the parameter β is used for overestimating the power density spectrum of the disturbance. Because the disturbance can be considered to be non-stationary, a short-time estimation value has to be used for this disturbance value. However, signal and disturbance are available only as a sum in the microphone signal y(n). The estimation of the power density spectrum of the background noise can be obtained in two different ways: firstly, the power of the microphone signal can be estimated in speech pauses, which requires a speech pause detector, or, secondly, an estimated value for the power of the disturbance can be determined from the segment-wise estimated minima of the power of the microphone signal. As the noise estimation is not the main focus in this patent application other details shall not be explained here; however reference is made to P. Vary, R. Martin: Digital Speech Transmission, John Wiley & Sons, Chichester, England, 2006.
Normally, noise reduction is used as a pre-processing stage for a speech fundamental frequency estimation, that is, instead of the input subband signals Y(e^{jΩ _µ},n) the noise-reduced signals Y(e^{jΩ _µ},n)·V(e^{jΩ _µ},n) are processed. The present approach follows a similar way, which means that firstly a noise-reduced power density spectrum (see equation 12), here after a subsequent spectral refinement, is determined according to the following equation: ${\hat{S}}_{\tilde{y} \tilde{y}, norm, g} (Ω_{µ}, n) = \frac{{\hat{S}}_{\tilde{y} \tilde{y}} (Ω_{µ}, n)}{{\overline{S}}_{\tilde{y} \tilde{y}} (Ω_{µ}, n)} W (e^{j Ω_{µ}}) \cdot V (e^{j Ω_{µ}}, n)$
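The noise suppression factor can be sketched per frequency sampling point as follows. The function name and the numerical values chosen for V₀ and β are assumptions for this example only; the description does not specify them.

```python
def noise_suppression_factor(S_nn, S_yy, V0=0.1, beta=1.5):
    """V = max(V0, 1 - beta * S_nn / S_yy), evaluated for every bin."""
    factors = []
    for s_n, s_y in zip(S_nn, S_yy):
        if s_y <= 0.0:
            factors.append(V0)  # no input power: fall back to maximal attenuation
        else:
            factors.append(max(V0, 1.0 - beta * s_n / s_y))
    return factors

# Illustrative spectra (assumption): strong speech, weak speech, noise only.
V = noise_suppression_factor(S_nn=[1.0, 1.0, 1.0], S_yy=[100.0, 2.0, 1.0])
```

For a bin dominated by speech the factor stays close to one (0.985 here), for a bin with little speech it drops (0.25), and for a bin containing only noise the overestimation by β drives the term below zero so that the floor V₀ = 0.1 applies.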
For detection the inverse transform is then calculated as follows: ${\hat{r}}_{yy, g} (m, n) = \frac{1}{N} \sum_{µ = 0}^{N - 1} {\hat{S}}_{\tilde{y} \tilde{y}, norm, g} (Ω_{µ}, n) e^{j \frac{2 π}{N} µm}$
As standardization factor the value r̂_yy (0,n) from the equation 16 is again used which is the standardization value of the autocorrelation including noise. This results in the following modified detection $p_{f_{p}} (n) = \frac{{\hat{r}}_{yy, g} (τ_{p} (n), n)}{{\hat{r}}_{yy} (0, n)}$
As a result a more robust detection in speech pauses is obtained. In order to more clearly show this effect Fig. 10 shows results of the speech fundamental frequency estimation with spectral refinement in terms of time-frequency-analysis with and without noise reduction. All parameters of the methods have been identical to the previously described parameters. As can be seen very clearly erroneous detections (denoted by black ellipses in the upper diagram of Fig. 10) can be suppressed in the case when the above-mentioned active noise reduction is used. In speech activity passages nearly nothing changes.
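Why this normalization suppresses detections in speech pauses can be seen from a small sketch: the numerator is taken from the noise-reduced autocorrelation, while the denominator still contains the noise. The function name, the assumption that the maximum search runs over the noise-reduced values, and the numbers are illustrative only.

```python
def noise_reduced_reliability(r_g, r_noisy, m_min=30, m_max=100):
    """Return (tau, p) with p = r_g[tau] / r_noisy[0] and tau the offset of the maximum of r_g."""
    tau = max(range(m_min, m_max + 1), key=lambda m: r_g[m])
    return tau, r_g[tau] / r_noisy[0]

# Illustrative speech pause (assumption): the suppression factor has removed
# most of the energy, so the noise-reduced peak is small compared with the
# zero-offset value of the autocorrelation that still includes the noise.
r_noisy = [2.0] + [0.0] * 100
r_g = [0.0] * 101
r_g[60] = 0.3
tau, p = noise_reduced_reliability(r_g, r_noisy)
detected = p > 0.2  # threshold p0 = 0.2 as used in the examples of the description
```

Here the reliability is 0.3 / 2.0 = 0.15, which stays below the threshold, so no speech fundamental frequency is reported for this frame.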

Speech fundamental frequency estimation on the basis of a plurality of subband vectors

In this section a further part of the approach for the inventive speech fundamental frequency estimation is described.
Fig. 11A shows a block diagram of an embodiment of the inventive speech fundamental frequency estimator 1100. The speech fundamental frequency estimator 1100 comprises a first power density spectrum calculator 1102, a second power density spectrum calculator 1104 and an analyzer 1106. The first power density spectrum calculator 1102 and second power density spectrum calculator 1104 are both fed by a common input of width N, on which a first set of values Ỹ ₁ and subsequently a second set of values Ỹ ₂ are provided. Herein, the first set of values Ỹ ₁ is a frequency domain representation of a first set of time domain signal values y₁ within a first time interval t₁. The second set of values Ỹ ₂ is a frequency domain representation of a second set of time domain signal values y₂ within a second time interval t₂. In the embodiment as shown in Fig. 11A the first and second time intervals overlap. The first power density spectrum calculator 1102 is configured for storing a version of the first set of values and for providing values of a first power density spectrum Ŝ_ỹỹd (Ω _µ,n) by multiplying the stored version of the first set of values Ỹ ₁ with a complex conjugate version of the second set of values Ỹ ₂. The second power density spectrum calculator 1104 is configured for providing values of a second power density spectrum Ŝ_ỹỹ (Ω _µ,n) by multiplying a version of the second set of values with a complex conjugate version of the second set of values. The analyzer 1106 is configured for receiving the first and second power density spectra of the first and second power density spectrum calculators 1102, 1104, respectively, and for determining the speech fundamental frequency estimate f_p(n) on the basis of the values of the first power density spectrum Ŝ_ỹỹd (Ω _µ,n) and the values of the second power density spectrum Ŝ_ỹỹ (Ω _µ,n).
Fig. 11B shows the functionality of the speech fundamental frequency estimator as shown in Fig. 11A in more detail. To be more precise, Fig. 11B discloses a method 1140 for estimating the speech fundamental frequency f_p(n). Firstly, first and second sets of values Ỹ ₁ and Ỹ ₂ are provided, each of which has N individual values (that is a width of N). In a first step 1150 a version of the first set of values Ỹ ₁ is stored. In a second step 1152 the stored version of the first set of values Ỹ ₁ is multiplied with a version of the second set of values Ỹ ₂, which is fed directly to the multiplication step without a storing step. The result of the multiplication step 1152 is said first power density spectrum Ŝ_ỹỹd (Ω _µ,n). In parallel to the step of multiplying 1152 a further step of multiplying 1154 is performed in which versions of the second set of values Ỹ ₂ are multiplied with each other, which results in the second power density spectrum. In a final step 1156 the speech fundamental frequency estimate f_p(n) is determined.
The inventive approach as shown in Fig. 11A and 11B has the advantage that it is now possible to estimate lower speech fundamental frequencies than would be possible according to the state of the art. This is mainly due to the fact that (conventional existing) short frequency domain values can be used for a precise speech fundamental frequency estimation, as the multiplication in step 1152 with a stored, i.e. delayed, version of a previous set of frequency domain values results in a kind of elongated analysis time interval for estimating the low speech fundamental frequency. However, it is also possible to correct possible errors which might result from the time offset of the first and second time intervals, because for the determination of the speech fundamental frequency estimate also the second power density spectrum is used, which is based on a multiplication of versions of the second set of values. Therefore the first power density spectrum can be compared with the information resulting from the second power density spectrum such that a kind of normalization can be performed or possible errors in the first power density spectrum can be recognized and corrected.
To be more specific, in the previous description it has been shown that a speech fundamental frequency below f_p(n)=120 Hz cannot be detected correctly anymore. Therefore, in the first approach a subsequent spectral refinement has been applied. However, this spectral refinement provided the possibility for an improvement of the speech fundamental frequency estimation only down to about f_p(n)=90 Hz. The reason for this threshold can be seen in the fact that in the used DFT of order N a maximal autocorrelation offset of m = N/2 + 1 for the analysis of this speech fundamental frequency is possible - this corresponds to a lowest detectable speech fundamental frequency of about 90 Hz. It has been assumed that the used power density spectra and the autocorrelation functions are only real (and not complex) and are furthermore also symmetric.
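The two multiplication steps 1152 and 1154 can be sketched bin by bin as follows. The function names and the two-bin example values are assumptions for this illustration. The conjugation of the second set in the first product follows the description of the first power density spectrum calculator 1102.

```python
def cross_power_spectrum(Y_stored, Y_current):
    """Step 1152: stored first set times the complex conjugate of the second set."""
    return [a * b.conjugate() for a, b in zip(Y_stored, Y_current)]

def auto_power_spectrum(Y_current):
    """Step 1154: second set times its own complex conjugate (real-valued result)."""
    return [(b * b.conjugate()).real for b in Y_current]

# Illustrative values (assumption): two frequency bins of two consecutive frames.
Y1 = [1 + 1j, 0 + 2j]   # stored (delayed) first set of values
Y2 = [1 - 1j, 1 + 0j]   # second set of values, fed in directly
S_first = cross_power_spectrum(Y1, Y2)
S_second = auto_power_spectrum(Y2)
```

Unlike the second power density spectrum, the first one is in general complex, because it carries the phase difference between the stored and the current frame.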
A further inventive idea it can be seen in the fact that not only the present signal frame y(n) is used for the estimation of the speech fundamental frequency but also a signal frame y(n-d) which is a signal frame delayed by d clock cycles. For example the speech fundamental frequency estimation can be significantly improved by utilizing of the present signal frame and a signal of frame delayed by one frame cycle, d = r, with an overlap of 75% - this corresponds to a frame offset of r = 64 and a signal block length of N = 256.
In Fig. 12 the functional principle of the method for estimating the speech fundamental frequency is shown. In addition to the method already described, the inventive approach uses a cross correlation with the delayed input frame. Firstly, it can be seen from Fig. 12 that in addition to the estimated auto power density spectrum Ŝ_ỹỹ (Ω _µ,n), a variant of the cross power density spectrum ${\hat{S}}_{\tilde{y} {\tilde{y}}_{d}} (Ω_{µ}, n) = \tilde{Y} (e^{j Ω_{µ}}, n) \, \tilde{Y}^{*} (e^{j Ω_{µ}}, n - d)$
is also determined in the lower path of Fig. 12. For the determination of the cross power density spectrum Ŝ_ỹỹd (Ω _µ,n) the present short-time spectrum Ỹ(e ^{jΩ _µ},n) and the complex conjugate of the delayed short-time spectrum, Ỹ*(e ^{jΩ _µ},n-d), are used. In the following only the short-time spectrum delayed by one frame clock, that is d = r, is dealt with further; however, other delays can also be used here.
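The two spectral products described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: a plain Hann-windowed DFT stands in for the short-time spectrum Ỹ, and the function name, the 100 Hz test tone and the sampling frequency of 11025 Hz are hypothetical choices that are not taken from the description. The block length N = 256 and the frame offset r = 64 follow the example given above.

```python
import numpy as np

def power_density_spectra(frame_now, frame_delayed):
    """Return (auto, cross) power density spectrum estimates for two frames.

    auto  = Y(n) * conj(Y(n))      -> real and non-negative
    cross = Y(n) * conj(Y(n - d))  -> complex in general
    """
    frame_now = np.asarray(frame_now, dtype=float)
    frame_delayed = np.asarray(frame_delayed, dtype=float)
    # Assumption: a Hann window and an unrefined DFT replace the
    # short-time spectrum used in the description.
    window = np.hanning(len(frame_now))
    Y_now = np.fft.fft(window * frame_now)
    Y_delayed = np.fft.fft(window * frame_delayed)
    s_auto = (Y_now * np.conj(Y_now)).real
    s_cross = Y_now * np.conj(Y_delayed)
    return s_auto, s_cross

# Hypothetical test signal: two frames of length N offset by r samples.
N, r, fs = 256, 64, 11025.0
t = np.arange(N + r) / fs
signal = np.sin(2.0 * np.pi * 100.0 * t)
frame_delayed = signal[:N]        # frame n - d with d = r
frame_now = signal[r:r + N]       # present frame n
s_auto, s_cross = power_density_spectra(frame_now, frame_delayed)
```

With d = 0 both products coincide, which is a simple sanity check for such a sketch.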
The cross power density spectrum thus determined is divided by the smoothed auto power density spectrum S̅_ỹỹ (Ω _µ,n) and is multiplied with a weighting function as shown below: ${\tilde{S}}_{\tilde{y} {\tilde{y}}_{d}} (Ω_{μ}, n) = \frac{{\hat{S}}_{\tilde{y} {\tilde{y}}_{d}} (Ω_{μ}, n)}{{\overline{S}}_{\tilde{y} \tilde{y}} (Ω_{μ}, n)} W (e^{j Ω_{μ}})$
After a subsequent noise reduction and an inverse transform according to equation 23 the cross-correlation function r̂_ỹỹd,g (m,n) is determined according to equation 13. In the following, the aim is to determine an extended autocorrelation function r̂_ỹỹ,erw (k,n) of order N/2 + r from the autocorrelation function r̂_ỹỹ,g (m,n) and the cross-correlation function r̂_ỹỹd,g (m,n), each of which has the order N/2. The index k of the term r̂_ỹỹ,erw (k,n) herein describes the offset of the autocorrelation, wherein the following is valid: $k \in \{0, ..., \frac{N}{2} + r - 1\}$
By using an adaptive compensation control an attempt can be made to correct the error terms of the cross-correlation function r̂_ỹỹd,g (m,n). For this purpose a correction value Δ(m, n) is determined for each time frame in order to compensate, for example, the undesired amplitudes which occur at an offset of m = r = 64, and to correct the remaining amplitude values so that a later combination with the autocorrelation function r̂_ỹỹ,g (m,n) can be performed: ${\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g, \mathrm{mod}} (m, n) = {\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g} (m, n) - Δ (m, n) = {\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g} (m, n) - c (n) \, {\hat{r}}_{\tilde{y} \tilde{y}, g} (m - r, n)$
The adaptive constant c(n) is derived from the ratio of the cross-correlation function r̂_ỹỹd,g (m,n) at the location m = r to the autocorrelation function r̂_ỹỹ,g (m,n) at the location m = 0. In order to perform a robust speech fundamental frequency estimation this ratio should not fall below a minimum value c₀. Therefore the adaptive parameter c(n) is determined as follows: $c (n) = \max \{\frac{{\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g} (r, n)}{{\hat{r}}_{\tilde{y} \tilde{y}, g} (0, n)}, c_{0}\}$
Tests have shown that good results can be obtained in the case the constant c₀ is set to a value of c₀=0.4.
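The compensation step can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the function name is hypothetical, the correlation functions are assumed to be given as real-valued arrays indexed by the offset m, and for offsets m < r the symmetry of the autocorrelation function (r(-m) = r(m), as assumed earlier in the description) is used to evaluate the term at m - r. The guard against a vanishing zero-lag value is likewise an added assumption.

```python
import numpy as np

def compensate_cross_correlation(r_cross, r_auto, r, c0=0.4):
    """Subtract c(n) * r_auto(m - r) from the cross-correlation values.

    Returns the compensated cross-correlation and the factor c(n).
    """
    r_cross = np.asarray(r_cross, dtype=float)
    r_auto = np.asarray(r_auto, dtype=float)
    # Ratio of the cross-correlation at m = r to the autocorrelation
    # at m = 0, bounded below by c0.
    if r_auto[0] > 0.0:
        c = max(r_cross[r] / r_auto[0], c0)
    else:
        c = c0  # assumption: fall back to c0 for an all-zero frame
    m = np.arange(len(r_cross))
    # Assumption: symmetric autocorrelation, so r_auto(m - r) = r_auto(|m - r|).
    delta = c * r_auto[np.abs(m - r)]
    return r_cross - delta, c

# Hypothetical example: a cross-correlation that consists only of the
# shifted and scaled autocorrelation is removed completely.
r_auto = 0.5 ** np.arange(8)
r_cross = 0.8 * r_auto[np.abs(np.arange(8) - 2)]
r_mod, c = compensate_cross_correlation(r_cross, r_auto, r=2)
```

In this example the ratio exceeds c₀, so c(n) equals 0.8 and the compensated values vanish.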
Subsequently, the auto- and cross-correlation coefficients r̂_ỹỹ,g (m,n) and r̂ _{ỹỹ_d,g,mod}(m,n) are weighted by a weighting function and are combined as follows: ${\hat{r}}_{\tilde{y} \tilde{y}, erw} (k, n) = \begin{cases} {\hat{r}}_{\tilde{y} \tilde{y}, g} (k, n), & \text{for } 0 \leq k < \frac{N}{2} - r, \\ a (k - r) \, {\hat{r}}_{\tilde{y} \tilde{y}, g} (k, n) + (1 - a (k - r)) \, {\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g, \mathrm{mod}} (k - r, n), & \text{for } \frac{N}{2} - r \leq k < \frac{N}{2}, \\ {\hat{r}}_{\tilde{y} {\tilde{y}}_{d}, g, \mathrm{mod}} (k - r, n), & \text{for } \frac{N}{2} \leq k < \frac{N}{2} + r. \end{cases}$
Herein the linear function a(m) was chosen such that the weight of the coefficients decreases with an increasing offset m. The extended autocorrelation function r̂_ỹỹ,erw (k,n) thus obtained is finally used for the estimation of the speech fundamental frequency. In contrast to the methods mentioned before, the speech fundamental frequency is determined for each single frame by a search for the maximum in an elongated range, for example in the range 30 ≤ k ≤ 180.
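The combination rule and the subsequent maximum search can be sketched as follows. The sketch rests on several assumptions that are not part of the description: the function names are hypothetical, the linear weight is modelled as a ramp that falls from 1 towards 0 across the blending region (the description only states that a(m) is linear and decreasing), N/2 ≥ 2r is required so that all indices k - r are valid, and the sampling frequency of 11025 Hz in the example is a hypothetical value. The frequency is obtained as the sampling frequency divided by the index of the maximum.

```python
import numpy as np

def extend_autocorrelation(r_auto, r_cross_mod, r):
    """Combine auto- and compensated cross-correlation (each of length N/2)
    into an extended autocorrelation of length N/2 + r."""
    r_auto = np.asarray(r_auto, dtype=float)
    r_cross_mod = np.asarray(r_cross_mod, dtype=float)
    half = len(r_auto)  # N/2
    if len(r_cross_mod) != half or not 0 < 2 * r <= half:
        raise ValueError("expected equal lengths and 0 < 2r <= N/2")
    ext = np.empty(half + r)
    # 0 <= k < N/2 - r: autocorrelation only.
    ext[:half - r] = r_auto[:half - r]
    # N/2 - r <= k < N/2: weighted blend of both functions.
    k = np.arange(half - r, half)
    a = 1.0 - np.arange(r) / r  # assumed linear, decreasing weight
    ext[k] = a * r_auto[k] + (1.0 - a) * r_cross_mod[k - r]
    # N/2 <= k < N/2 + r: shifted cross-correlation only.
    k = np.arange(half, half + r)
    ext[k] = r_cross_mod[k - r]
    return ext

def estimate_fundamental_frequency(r_ext, fs, k_min=30, k_max=180):
    """Search the maximum in k_min <= k <= k_max and return (f_p, tau_p)."""
    k_max = min(k_max, len(r_ext) - 1)
    tau_p = k_min + int(np.argmax(r_ext[k_min:k_max + 1]))
    return fs / tau_p, tau_p

# Hypothetical example with N = 256, r = 64 and a single peak at k = 100.
ext = extend_autocorrelation(np.ones(128), np.zeros(128), r=64)
peak = np.zeros(192)
peak[100] = 1.0
f_p, tau_p = estimate_fundamental_frequency(peak, fs=11025.0)
```

With these example values the extended function has 192 coefficients, so offsets up to k = 180 fall inside the searchable range.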
In order to clarify the functioning of the described method in Fig. 13 two examples for the analysis of the speech fundamental frequency are shown. For this purpose the left section of Fig. 13 discloses the analysis of the speech fundamental frequency at about 270 Hz whereas in the right section of Fig. 13 the analysis of a speech fundamental frequency at about 60 Hz is shown.
Firstly, each of the left and right sections of Fig. 13 shows the correlation of the present signal frame with itself (left) and with a preceding signal frame (right). In each section the grey graph denotes the cross-correlation function r̂_ỹỹd,g (m,n) before the adaptive compensation control and the dark grey graph denotes the cross-correlation function r̂ _{ỹỹd,g,mod}(m,n) after the adaptive compensation control. It can be clearly seen that significant error terms, especially the error terms at the location k = r, are corrected by the adaptive compensation control.
The lower graph in each of the two sections of Fig. 13 shows the extended autocorrelation function r̂_ỹỹ,erw (k,n) across an elongated autocorrelation offset, which is generated by combining the two correlation functions r̂_ỹỹ,g (m,n) and r̂ _{ỹỹd,g,mod}(m,n) according to equation 30. At a high speech fundamental frequency the corresponding speech fundamental period can be determined and detected quite well using the autocorrelation function r̂_ỹỹ,g (m,n) (left section of Fig. 13). In contrast, at the low speech fundamental frequency of about 60 Hz used here, the corresponding speech fundamental period can no longer be determined by the standard autocorrelation r̂_ỹỹ,g (m,n). The lower part of the right section of Fig. 13 shows that the speech fundamental period can still be determined and detected by combining the correlation of the signal frame with itself and the correlation with a preceding signal frame.
In Fig. 14 the analysis of the extended autocorrelation function r̂_ỹỹ,erw (k,n) is shown when a previous spectral refinement in the low-frequency region as well as a time-frequency analysis of the input signal is used. A comparison of the analyses in Fig. 5 and Fig. 14 indicates that significant improvements can be achieved by using the previously described approach. Through this approach an existing speech fundamental period up to an offset of about k=125 can still be detected. Moreover, no erroneous detections occur at low speech fundamental frequencies. Thus, a reliable estimation can be performed by the described approach down to a speech fundamental frequency of about f_p(n) = 60 Hz.

Adaptive post-processing

At several locations erroneous estimations of the speech fundamental frequency f_p(n) still occur. At these locations one half, or one third, of the speech fundamental frequency is often estimated. A subsequent post-processing is then preferably used to correct these erroneous detections.
After estimation of the speech fundamental frequency f_p(n) a test can be made as to whether this estimate is below a threshold f_k. The post-processing is performed only in the case that the following condition $f_{p} (n) < f_{k}$
is fulfilled. Values between f_k = 140 Hz and f_k = 160 Hz have been found to be suitable in practice. Subsequently a normalized speech fundamental period is estimated by performing a search for the index of the maximum of the extended autocorrelation function ${\tilde{τ}}_{p} (n) = \underset{k}{\operatorname{argmax}} \{{\hat{r}}_{\tilde{y} \tilde{y}, erw} (k, n)\}$
in a selected range $\frac{f_{s}}{f_{p, \max}} \leq k < \frac{f_{s}}{2 f_{p} (n)} + k_{0}$
For the determination of this area the previously determined speech fundamental frequency f_p(n) is firstly doubled. The parameter f_p,max in equation 33 is herein a predefined value of a maximal possible speech fundamental frequency. Finally the value k₀ is a constant which makes sure that also a search for a maximum which is slightly above $k = \frac{f_{s}}{2 f_{p} (n)}$
is allowed.
In the case the newly determined maximum is higher than 60 percent of the previously determined maximum, that is ${\tilde{τ}}_{p} (n) > 0.6 τ_{p} (n)$
and in the case also the amplitude of this newly determined maximum is above a predetermined amplitude value ${\hat{r}}_{\tilde{y} \tilde{y}, erw} ({\tilde{τ}}_{p} (n), n) > {\tilde{p}}_{0}$
a correction of the previously estimated speech fundamental frequency is performed according to $f_{p} (n) = \frac{f_{s}}{{\tilde{τ}}_{p} (n)}$
In order to clarify the improvements which result from such a post-processing, Fig. 15 shows a time-frequency-analysis of an input signal, respectively, the detection results of the speech fundamental frequency estimation. In the upper part of Fig. 15 the post-processing was deactivated and at two locations (at 0.7 and at 0.75 seconds) erroneous detections (bisections of frequency) can be observed. Such erroneous detections can be corrected by the post-processing which can be concluded from the lower part of Fig. 15.

Interpolation

In the application of the approach described up to now it could be observed that only an inaccurate speech fundamental frequency is estimated. In the estimation results stairs-like graphs of the estimated speech fundamental frequency have been generated as can be seen in Fig. 14 for example. Up to now it was only possible to determine the quantized speech fundamental frequency estimate, that means when the exact speech fundamental period is in between two autocorrelation offsets k of the autocorrelation function r̂_ỹỹ,erw (k,n) then a rounding to the nearest autocorrelation offset k is performed in order to determine the estimated speech fundamental period τ_p (n), respectively τ̃_p (n). Therefore quantization errors occur.
In numerous applications, as for example for a speech signal construction, an exact speech fundamental frequency estimation is of significant importance. One possible approach to solve the described problem is to perform an interpolation of the estimated speech fundamental frequency which is described in more detail in the following.
For the interpolation an approximated si(x)-function is used first, which can be written as a simple polynomial of order 2 according to the following approximation: $f (x) = \frac{\sin (x)}{x} \approx 1 - \frac{x^{2}}{6}$
Furthermore, the interpolation uses the autocorrelation coefficient at which the extended autocorrelation function r̂_ỹỹ,erw (k,n) has its maximum, and also the adjacent autocorrelation coefficients, that is, the coefficients at the autocorrelation offsets to the left and right of the maximum. The interpolated speech fundamental period τ _p,mod(n) can hereby be written as a function of the quantized speech fundamental period τ_p (n) and the considered autocorrelation coefficients according to the following equation: ${\tilde{τ}}_{p, \mathrm{mod}} (n) = \mathrm{Fkt} (τ_{p} (n), {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) - 1, n), {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n), n), {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) + 1, n))$
In this context it has to be noted that if a correction according to the post-processing described in the previous section is performed, the value τ_p (n) has to be replaced by the value τ̃_p (n). Finally the estimated and interpolated speech fundamental period can be determined according to ${\hat{τ}}_{p, \mathrm{mod}} (n) = τ_{p} (n) - Δ_{p} (n),$
wherein Δ_p(n) is a correction value for the quantized speech fundamental period τ_p (n) which has to be determined in every frame clock n according to the following equation: $Δ_{p} (n) = \frac{{\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) + 1, n) - {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) - 1, n)}{2 ({\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) + 1, n) + {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n) - 1, n) - 2 {\hat{r}}_{\tilde{y} \tilde{y}, erw} (τ_{p} (n), n))}$
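The correction value has the form of the vertex offset of a parabola fitted through the maximum and its two neighbours, which is consistent with the second-order approximation above. A minimal sketch, assuming the extended autocorrelation function is available as a sequence indexed by the offset k and the maximum is not located at the border of that sequence (the function name, the guard against a vanishing denominator and the test data are hypothetical):

```python
def interpolate_period(r_ext, tau_p):
    """Return the interpolated period tau_p - delta_p for a quantized tau_p."""
    r_minus = r_ext[tau_p - 1]
    r_zero = r_ext[tau_p]
    r_plus = r_ext[tau_p + 1]
    denominator = 2.0 * (r_plus + r_minus - 2.0 * r_zero)
    if denominator == 0.0:
        # Assumption: keep the quantized value if the three points are collinear.
        return float(tau_p)
    delta_p = (r_plus - r_minus) / denominator
    return tau_p - delta_p

# Hypothetical example: samples of a parabola whose true peak lies at k = 50.3.
r_ext = [1.0 - 0.01 * (k - 50.3) ** 2 for k in range(100)]
tau_p = max(range(30, 90), key=lambda k: r_ext[k])  # quantized maximum
tau_interpolated = interpolate_period(r_ext, tau_p)
```

For samples of an exact parabola the interpolation recovers the true peak position; for measured correlation values it only reduces the quantization error.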
Finally, the interpolation presented here for improving the speech fundamental frequency estimation shall be clarified by two examples. In the upper part of Fig. 16 the time-frequency analysis of a portion of several sinusoidal signals of equal amplitude is shown. In contrast, a portion of a speech signal of a female voice is shown in the lower part of Fig. 16. The white graph denotes the estimated quantized speech fundamental frequency in the upper as well as in the lower part of Fig. 16. The grey graph in the upper part and the black graph in the lower part show the estimated speech fundamental frequency after the interpolation. It can be seen from the upper part of Fig. 16 that, due to the interpolation, the desired straight graph of the estimated speech fundamental frequency is nearly obtained. In the lower part it can be seen that the estimated speech fundamental frequency closely follows the speech fundamental frequency structure of the speech signal when the interpolation is used.
Furthermore, the analysis has shown that an improvement of the speech fundamental frequency estimation of up to about 30 Hz for female voices and of about 10 Hz for male voices can be achieved when the previously described interpolation is used.
In summary, the problem presented in the introductory portion is solved here by an approach having four steps, each of which contributes to the total improvement and each of which can also be implemented independently of the others:

For improvement of the spectral resolution short FIR-filters can be used in portions of the spectrum having low frequencies. This results in a significant improvement for medium speech fundamental frequencies.
After the determination of necessary scaling values a noise reduction is performed. Thus, the method becomes more robust against background noise.
In addition to the correlation of the current signal frame with itself, a correlation with the preceding signal frame is also calculated. However, significant error terms are generated hereby. By means of an adaptive correlation compensation those terms can be largely removed, and the latter correlation can thus be used for the estimation of very low speech fundamental frequencies.
By means of a simple interpolation a more precise estimation can be obtained. Finally erroneous detections which lead to doublings, respectively triplications, of the estimation are also corrected by means of adaptive post-processing.

Expressed in other words, this invention describes a method for estimating the fundamental frequency (pitch frequency) of speech signals. This is achieved in the DFT domain by analyzing the current input spectrum as well as past input spectra. To achieve an estimation performance that is improved compared to standard methods, a four-stage algorithm is proposed, whereby the stages can also be used independently: First, pre-processing (called spectral refinement) is applied to the input spectrum at low frequencies. Second, a noise reduction is applied when computing normalization values. Third, estimates of the autocorrelation of the current frame and of the cross correlation of the current frame with the previous frame are adaptively combined in order to obtain an extended range. Fourth, post-processing is applied to reduce estimation errors and to achieve an improved pitch accuracy.

Claims

Speech fundamental frequency estimator (1100) being configured for receiving a first set of values (Ỹ ₁) and a second set of values (Ỹ ₂), the first set of values (Ỹ ₁) being a frequency domain representation of a first set of time domain signal values (y₁) within a first time interval (t₁) and the second set of values (Ỹ ₂) being a frequency domain representation of a second set of time domain signal values (y₂) within a second time interval (t₂), the second time interval (t₂) being later than and offset from the first time interval (t₁), the speech fundamental frequency estimator (1100) comprising:
- a first power density spectrum calculator (1102) being configured for storing a version of the first set of values (Ỹ ₁) and being configured for providing values of a first power density spectrum (Ŝ_ỹỹd (Ω _µ,n)) by multiplying the stored version of the first set of values (Ỹ ₁) with a complex conjugate version of the second set of values (Ỹ ₂);

- a second power density spectrum calculator (1104) being configured for providing values of a second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)) by multiplying a version of the second set of values (Ỹ ₂) with a complex conjugate version of the second set of values (Ỹ ₂);

- an analyzer (1106) being configured for determining the speech fundamental frequency estimate (f_p(n)) on the basis of the values of the first power density spectrum (Ŝ_ỹỹd (Ω _µ,n)) and the values of the second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)),
wherein the analyzer is further configured
for performing a first frequency-time-transform of the first power density spectrum (Ŝ_ỹỹd (Ω _µ,n)) in order to obtain a first set of correlation function values (r̂_ỹỹd,g (m,n)),
for performing a second frequency-time-transform of the second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)) in order to obtain a second set of correlation function values (r̂_ỹỹ,g (m,n)), and
for determining the speech fundamental frequency estimate (f_p(n)) on the basis of the first and second sets of correlation function values (r̂_ỹỹd,g (m,n)),(r̂_ỹỹ,g (m,n)).
Speech fundamental frequency estimator (1100) according to claim 1, characterized in that the first power density spectrum calculator (1102) is configured for multiplying versions of the sets of values (Ỹ ₁, Ỹ ₂) which represent sets of time domain signal values (y₁, y₂) having overlapping time intervals (t₁, t₂).
Speech fundamental frequency estimator (1100) according to claim 2, characterized in that the first power density spectrum calculator (1102) is configured for multiplying versions of the sets of values (Ỹ ₁, Ỹ ₂) which represent time domain signal values (y₁, y₂) having overlapping time intervals (t₁, t₂) with an overlap of at least 25 percent.
Speech fundamental frequency estimator (1100) according to one of claims 1 to 3, characterized in that the second power density spectrum calculator (1104) is configured for providing a conjugate complex version of the second set of values (Ỹ ₂) to the first power density spectrum calculator (1102) and wherein the first power density spectrum calculator (1102) is configured for using the provided conjugate complex version of the second set of values (Ỹ ₂) as the version with which the stored version of the first set of values (Ỹ ₁) is to be multiplied.
Speech fundamental frequency estimator (1100) according to any of the preceding claims, characterized in that the analyzer (1106) is configured for performing a first frequency-time-transform of the first power density spectrum (Ŝ_ỹỹd (Ω _µ,n)) in order to obtain a first set of correlation function values (r̂_ỹỹd,g (m,n)) and for performing a second frequency-time-transform of the second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)) in order to obtain a second set of correlation function values (r̂_ỹỹ,g (m,n)), wherein the analyzer (1106) is furthermore configured for determining a set of normalization values (Ŝ_ỹỹ (Ω _µ,n)) and a set of weighting values (V(e ^{jΩ _µ},n)) from the second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)) and for using the set of normalization values (Ŝ_ỹỹ (Ω _µ,n)) and the set of weighting values (V(e ^{jΩ _µ},n)) in the first and second frequency-time-transform and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (f_p(n)) on the basis of the first and second sets of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)).
Speech fundamental frequency estimator (1100) according to claim 5, characterized in that the analyzer (1106) further comprises a compensator being configured for adaptively compensating the values of the first set of correlation function values (r̂_ỹỹd,g (m,n)) by a correction factor (Δ(m,n)) being based on a value of the second set of correlation function values (r̂_ỹỹ,g (m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (f_p(n)) on the basis of the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)).
Speech fundamental frequency estimator (1100) according to claim 6, characterized in that the compensator is configured for multiplying the second set of correlation function values (r̂_ỹỹ,g (m,n)) by a lower bounded quotient between a value of the first set of correlation function values (r̂_ỹỹd,g (m,n)) and a value of the second set of correlation function values (r̂_ỹỹ,g (m,n)) in order to obtain said compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)).
Speech fundamental frequency estimator (1100) according to claim 7, characterized in that the analyzer (1106) is configured for combining the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)) in order to obtain an extended set of correlation function values (r̂_ỹỹ,erw (k,n)), wherein the values of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) assume corresponding values from the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)), the second set of correlation function values (r̂_ỹỹ,g (m,n)) or values between the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (f_p(n)) on the basis of said extended set of correlation function values (r̂_ỹỹ,erw (k,n)).
Speech fundamental frequency estimator (1100) according to one of claims 5 to 8, characterized in that the analyzer (1106) is configured for determining the speech fundamental frequency estimate (f_p(n)) by searching the index of a maximum value (τ_p (n)) from the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) within a predetermined number of indices (k) of the values of the extended set of correlation values (r̂_ỹỹ,erw (k,n)), from the first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)) within a predetermined number of indices (m) of values of the first respectively second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)) or from the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) within the predetermined number of indices (m) of values of the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) and wherein the analyzer (1106) is furthermore configured for determining the speech fundamental frequency estimate (f_p(n)) as the product of a sampling frequency (f_s) and a reciprocal value of said searched index (τ_p (n)).
Speech fundamental frequency estimator (1100) according to claim 9, characterized in that the analyzer (1106) is furthermore configured for determining a reliability factor (p_fp (n)) for the determined speech fundamental frequency estimate and for blocking an output of the determined speech fundamental frequency estimate (f_p(n)) in the case the determined reliability factor (p_fp (n)) for the determined speech fundamental frequency estimate is below a predetermined reliability factor (p₀).
Speech fundamental frequency estimator (1100) according to claim 10, characterized in that the analyzer (1106) is furthermore configured for determining said reliability factor (p_fp (n)) by dividing the maximum value (τ̃_p (n)) at said searched index by the first value of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) or, respectively, of the first, the compensated first or the second set of correlation function values (r̂_ỹỹd,g (m,n), r̂ _{ỹỹ_d,g,mod}(m,n), r̂_ỹỹ,g (m,n)).
Speech fundamental frequency estimator (1100) according to one of claims 5 to 11, characterized in that the second power density spectrum calculator (1104) is configured for determining an estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ,n)) and for determining a noise suppression factor (V(e ^{jΩ _µ},n)) on the basis of said power density spectrum of background noise (Ŝ_nn (Ω _µ,n)), and wherein the analyzer (1106) is configured for multiplying the first and second power density spectrum with said noise suppression factor (V(e ^{jΩ _µ},n)) prior to the frequency-time-transform of the first respectively second power density spectrum (Ŝ_ỹỹd (Ω _µ,n), Ŝ_ỹỹ (Ω _µ,n)).
Speech fundamental frequency estimator (1100) according to claim 12, characterized in that the second power density spectrum calculator (1104) is configured for determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient (V₀) and a term being dependent on a quotient of the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ,n)) and the second power density spectrum (Ŝ_ỹỹ (Ω _µ,n)).
Speech fundamental frequency estimator (1100) according to one of claims 12 or 13, characterized in that the second power density spectrum calculator (1104) is configured for determining the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ,n)) in speech pauses or for determining the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ,n)) from a segment-wise estimation of the minima of the power of a microphone signal.
Speech fundamental frequency estimator (1100) according to claim 13 or claims 13 and 14, characterized in that the noise suppression factor is defined by $V (e^{{j Ω}_{μ}}, n) = \max \{V_{0}, 1 - β \frac{{\hat{S}}_{nn} (Ω_{μ}, n)}{{\hat{S}}_{yy} (Ω_{μ}, n)}\}$
wherein Ŝ_nn (Ω _µ,n) denotes the estimate of the power density spectrum of the background noise, Ŝ_yy (Ω _µ,n) denotes the second power density spectrum, V₀ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise (Ŝ_nn (Ω _µ,n)).
Speech fundamental frequency estimator (1100) according to one of claims 5 to 15, characterized in that the analyzer (1106) is furthermore configured for reestimating the speech fundamental frequency estimate in the case the determined speech fundamental frequency estimate is below a predefined frequency value (f_k) wherein the analyzer (1106) is configured for performing the reestimation by searching a further index (k, m) of a further maximum value (τ̃_p (n)) of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)), the first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)) or the compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)) within a further number of values of said sets of correlation function values and for outputting a product of a sampling frequency (f_s) and a reciprocal value of said further index (τ̃_p (n)) as the determined speech fundamental frequency estimate.
Speech fundamental frequency estimator (1100) according to claim 16, characterized in that the analyzer (1106) is configured for searching said index (k, m) of said further maximum value (τ̃_p (n)) using a number of values k of said sets of correlation function values which is defined by $\frac{f_{s}}{f_{p, \max}} \leq k < \frac{f_{s}}{2 f_{p} (n)} + k_{0}$
wherein k denotes the number of values of said sets of correlation function values, f_p(n) denotes the previously determined speech fundamental frequency estimate, f_p,max denotes a predefined value of a maximal possible speech fundamental frequency, f_s denotes a sampling frequency and k₀ denotes a constant.
Speech fundamental frequency estimator (1100) according to claim 16 or 17, characterized in that the analyzer (1106) is configured for outputting said product as the predetermined speech fundamental frequency estimate only in the case the further index (τ̃_p (n)) is larger than 60 percent of the previously searched maximal index (τ_p (n)) as well as a value (r̂_ỹỹ,erw (τ̃_p (n),n)) of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) at said further index (τ̃_p (n)) is larger than a previously defined amplitude value (p̃ ₀).
Speech fundamental frequency estimator (1100) according to one of claims 5 to 18, characterized in that the analyzer (1106) is configured for modifying a speech fundamental period (τ̃_p (n)) corresponding to said determined speech fundamental frequency estimate by an interpolation correction term (Δ_p(n)) prior to outputting a modified speech fundamental frequency estimate (f_p(n)), wherein said interpolation correction term (Δ_p(n)) is dependent on values of said first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)), of said extended set of correlation function values (r̂_ỹỹ,erw (k,n)) or of said compensated first set of correlation function values (r̂ _{ỹỹ_d,g,mod}(m,n)), respectively.
Speech fundamental frequency estimator (1100) according to one of claims 1 to 19, characterized by a frequency domain filtering unit being configured for receiving the frequency domain versions (Y₁, Y₂) of the first and second set of time domain signal values (y₁, y₂), for frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values (Ỹ ₁,Ỹ ₂), respectively, and for providing said first and second sets of values (Ỹ ₁, Ỹ ₂) to the first and second power density spectrum calculator respectively.
Speech fundamental frequency estimator (1100) according to claim 20, characterized in that the frequency domain filtering unit is configured for filtering only frequencies below a predefined limiting frequency.
Speech fundamental frequency estimator (1100) according to claim 21, characterized in that the frequency domain filtering unit is configured for delaying values of said frequency domain versions being above said predefined limiting frequency.
Method (1140) for estimating a speech fundamental frequency (f_p(n)), the method using a first set of values (Ỹ ₁) and a second set of values (Ỹ ₂), the first set of values (Ỹ ₁) being a received frequency domain representation of a first set of time domain signal values (y₁) within a first time interval (t₁) and the second set of values (Ỹ ₂) being a received frequency domain representation of a second set of time domain signal values (y₂) within a second time interval (t₂), the second time interval (t₂) being later than and offset from the first time interval (t₁), the method for estimating the speech fundamental frequency (f_p(n)) comprising the steps of:
- storing (1150) a version of the first set of values (Ỹ ₁) and providing values of a first power density spectrum (Ŝ_ỹỹd (Ω _µ ,n)) by multiplying (1152) the stored version of the first set of values (Ỹ ₁) with a complex conjugate version of the second set of values (Ỹ ₂);

- providing values of a second power density spectrum (Ŝ_ỹỹ(Ω _µ ,n)) by multiplying (1153) a version of the second set of values (Ỹ ₂) with a complex conjugate version of the second set of values (Ỹ ₂);

- determining (1156) the speech fundamental frequency estimate (f_p(n)) on the basis of the values of the first power density spectrum (Ŝ_ỹỹd (Ω _µ ,n)) and the values of the second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)),
wherein the step of determining the speech fundamental frequency estimate (f_p(n)) comprises
performing a first frequency-time-transform of the first power density spectrum (Ŝ_ỹỹd (Ω _µ ,n)) in order to obtain a first set of correlation function values (r̂_ỹỹd,g (m,n)),
performing a second frequency-time-transform of the second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) in order to obtain a second set of correlation function values (r̂_ỹỹ,g (m,n)), and
determining the speech fundamental frequency estimate (f_p(n)) on the basis of the first and second sets of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)).
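The steps of the method of claim 23 can be illustrated by the following Python sketch, which forms the first power density spectrum as the product of the first set of values with the complex conjugate of the second set, forms the second power density spectrum from the second set alone, and transforms both back to correlation function values. This is only a sketch under assumptions: the function names, the naive DFT helpers and the plain unweighted inverse transform are hypothetical choices made for illustration, and the normalization, weighting and filtering of the dependent claims are omitted.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT; illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * mu * t / n) for t in range(n))
            for mu in range(n)]


def idft_real(spectrum):
    """Naive inverse DFT; only the real part is kept because the inputs are real signals."""
    n = len(spectrum)
    return [(sum(spectrum[mu] * cmath.exp(2j * math.pi * mu * m / n)
                 for mu in range(n)) / n).real
            for m in range(n)]


def correlation_sets(y1, y2):
    """Return (first, second) sets of correlation function values.

    y1: time domain values of the earlier interval (hypothetical input)
    y2: time domain values of the later, offset interval (hypothetical input)
    """
    Y1, Y2 = dft(y1), dft(y2)
    # First power density spectrum: stored first set times conjugate of second set.
    s_cross = [a * b.conjugate() for a, b in zip(Y1, Y2)]
    # Second power density spectrum: second set times its own conjugate.
    s_auto = [b * b.conjugate() for b in Y2]
    # Frequency-time-transforms yield the two sets of correlation function values.
    return idft_real(s_cross), idft_real(s_auto)
```

For a periodic test signal, the second set peaks at lags that are multiples of the period, while the peak of the first set is displaced by the offset between the two intervals.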
Method (1140) according to claim 23, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises:
• performing a first frequency-time-transform of the first power density spectrum (Ŝ_ỹỹd (Ω _µ ,n)) in order to obtain a first set of correlation function values (r̂_ỹỹd,g (m,n));

• performing a second frequency-time-transform of the second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) in order to obtain a second set of correlation function values (r̂_ỹỹ,g (m,n)), wherein the step of determining (1156) further comprises determining a set of normalization values (Ŝ_ỹỹ (Ω _µ ,n)) and a set of weighting values (V(e ^{jΩ _µ},n)) from the second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) and using the set of normalization values (Ŝ_ỹỹ (Ω _µ ,n)) and the set of weighting values (V(e ^{jΩ _µ},n)) in the first and second frequency-time-transform, and wherein the determination of the speech fundamental frequency estimate (f_p(n)) is performed on the basis of the first and second sets of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)).
Method (1140) according to claim 24, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises adaptively compensating the values of the first set of correlation function values (r̂_ỹỹd,g (m,n)) by a correction factor (Δ(m,n)) being based on a value of the second set of correlation function values (r̂_ỹỹ,g (m,n)) in order to obtain a compensated first set of values and determining the speech fundamental frequency estimate (f_p(n)) on the basis of the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)).
Method (1140) according to claim 25, characterized in that the step of compensating comprises multiplying the second set of correlation function values (r̂_ỹỹ,g (m,n)) by a lower bounded quotient between a value of the first set of correlation function values (r̂_ỹỹd,g (m,n)) and a value of the second set of correlation function values (r̂_ỹỹ,g (m,n)) in order to obtain said compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)).
Method (1140) according to claim 26, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises combining the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)) in order to obtain an extended set of correlation function values (r̂_ỹỹ,erw (k,n)), wherein the values of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) assume corresponding values from the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)), the second set of correlation function values (r̂_ỹỹ,g (m,n)) or values between the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)) and the second set of correlation function values (r̂_ỹỹ,g (m,n)), and wherein the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) further comprises determining the speech fundamental frequency estimate (f_p(n)) on the basis of said extended set of correlation function values (r̂_ỹỹ,erw (k,n)).
Method (1140) according to one of claims 23 to 27, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises determining the speech fundamental frequency estimate (f_p(n)) by searching the index of a maximum value (τ_p (n)) from the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) within a predetermined number of indices (k) of the values of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)), from the first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)) within a predetermined number of indices (m) of values of the first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)), respectively, or from the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)) within the predetermined number of indices (m) of values of the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)), and wherein the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) furthermore comprises determining the speech fundamental frequency estimate (f_p(n)) as the product of a sampling frequency (f_s) and a reciprocal value of said searched index (τ_p (n)).
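The maximum search of claim 28 can be illustrated by the following Python sketch, which looks for the index of the largest correlation function value within a given index range and returns the product of the sampling frequency and the reciprocal of that index. The function name, the half-open search range and the example values are assumptions made for illustration; the claim itself does not fix how the predetermined number of indices is chosen.

```python
def estimate_pitch(r, f_s, k_min, k_max):
    """Return (tau_p, f_p) from a set of correlation function values r.

    r     : correlation function values indexed by lag (hypothetical input)
    f_s   : sampling frequency in Hz
    k_min : first lag searched (assumed > 0 so that f_s / tau_p is defined)
    k_max : one past the last lag searched
    """
    if not 0 < k_min < k_max <= len(r):
        raise ValueError("search range must satisfy 0 < k_min < k_max <= len(r)")
    # Index of the maximum value within the predetermined range of indices.
    tau_p = max(range(k_min, k_max), key=lambda k: r[k])
    # Estimate = sampling frequency times the reciprocal of the searched index.
    return tau_p, f_s / tau_p
```

With a sampling frequency of 8000 Hz and a correlation peak at lag 80, this sketch yields an estimate of 100 Hz.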
Method (1140) according to claim 28, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises determining a reliability factor (p_fp (n)) for the determined speech fundamental frequency estimate (f_p(n)) and blocking an output of the determined speech fundamental frequency estimate (f_p(n)) in the case the determined reliability factor (p_fp (n)) for the determined speech fundamental frequency estimate (f_p(n)) is below a predetermined reliability factor (p₀).
Method (1140) according to claim 29, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises determining said reliability factor (p_fp (n)) by dividing the maximum value (τ̃_p (n)) at said searched index by the first value of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) or, respectively, of the first, the compensated first or the second set of correlation function values (r̂_ỹỹd,g (m,n), r̂ _{ỹỹ_d ,g,mod}(m,n), r̂_ỹỹ,g (m,n)).
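The reliability check of claims 29 and 30 can be illustrated by the following Python sketch, which divides the correlation function value at the searched index by the first value of the set and blocks the output when the quotient falls below a threshold. The function name, the use of `None` to represent a blocked output and the guard against a non-positive first value are assumptions made for illustration.

```python
def gated_pitch(r, tau_p, f_s, p_0):
    """Return f_s / tau_p if the estimate is deemed reliable, otherwise None.

    r     : correlation function values indexed by lag (hypothetical input)
    tau_p : previously searched index of the maximum value (assumed > 0)
    p_0   : predetermined reliability threshold
    """
    if r[0] <= 0:
        # Assumption: without positive signal power no reliability can be formed.
        return None
    # Reliability factor: maximum value divided by the first value of the set.
    reliability = r[tau_p] / r[0]
    # Block the output when the reliability factor is below the threshold.
    return f_s / tau_p if reliability >= p_0 else None
```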
Method (1140) according to one of claims 23 to 30 and claim 24, characterized in that the step of providing values of a second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) comprises determining an estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ ,n)) and determining a noise suppression factor (V(e ^{jΩ _µ},n)) on the basis of said power density spectrum of background noise (Ŝ_nn (Ω _µ ,n)), and the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises multiplying the first and second power density spectrum with said noise suppression factor (V(e ^{jΩ _µ},n)) prior to the frequency-time-transform of the first or second power density spectrum (Ŝ_ỹỹd (Ω _µ ,n), Ŝ_ỹỹ (Ω _µ ,n)), respectively.
Method (1140) according to claim 31, characterized in that the step of providing values of a second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) comprises determining the noise suppression factor as the maximum of a predetermined maximum suppression coefficient (V₀) and a term being dependent on a quotient of the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ ,n)) and the second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)).
Method (1140) according to claim 32, characterized in that the step of providing values of a second power density spectrum (Ŝ_ỹỹ (Ω _µ ,n)) comprises determining the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ ,n)) in speech pauses or determining the estimate of the power density spectrum of background noise (Ŝ_nn (Ω _µ ,n)) from a segment-wise estimation of the minima of the power of a microphone signal.
Method (1140) according to one of claims 31 to 33, characterized in that the noise suppression factor is defined by $V (e^{j Ω_{μ}}, n) = \max \{V_{0}, 1 - β \frac{\hat{S}_{nn} (Ω_{μ}, n)}{\hat{S}_{yy} (Ω_{μ}, n)}\}$
wherein Ŝ_nn (Ω _µ ,n) denotes the estimate of the power density spectrum of the background noise, Ŝ_yy (Ω _µ ,n) denotes the second power density spectrum, V₀ denotes a predefined maximum attenuation factor and β denotes a value for overestimating the power density spectrum of the background noise (Ŝ_nn (Ω _µ ,n)).
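The noise suppression factor of claim 34 can be evaluated per frequency bin as in the following Python sketch. The function name, the list-based representation of the spectra and the fallback to V₀ for bins with zero power are assumptions made for illustration; only the max expression itself is taken from the claim.

```python
def noise_suppression_factor(s_nn, s_yy, v_0, beta):
    """Evaluate V = max{V_0, 1 - beta * S_nn / S_yy} for every frequency bin.

    s_nn : estimated background noise power density spectrum (one value per bin)
    s_yy : second power density spectrum (one value per bin)
    v_0  : predefined maximum attenuation factor (lower bound of V)
    beta : overestimation value for the noise power density spectrum
    """
    factors = []
    for noise, power in zip(s_nn, s_yy):
        if power > 0:
            factors.append(max(v_0, 1.0 - beta * noise / power))
        else:
            # Assumption: a bin without signal power receives maximum attenuation.
            factors.append(v_0)
    return factors
```

Bins in which the signal power clearly exceeds the overestimated noise power pass almost unchanged, whereas noise-dominated bins are attenuated down to V₀.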
Method (1140) according to one of claims 24 to 34, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises reestimating the speech fundamental frequency estimate (f_p(n)) in the case the determined speech fundamental frequency estimate is below the predefined frequency value (f_k), wherein the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises performing the reestimation by searching a further index (k, m) of a further maximum value (τ̃_p (n)) of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)), the first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)) or the compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)) within a further number of values of said sets of correlation function values and outputting a product of a sampling frequency (f_s) and a reciprocal value of said further index (τ̃_p (n)) as the determined speech fundamental frequency estimate.
Method (1140) according to claim 35, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises searching said index (k, m) of said further maximum value (τ̃_p (n)) using a number of values k of said sets of correlation function values which is defined by $\frac{f_{s}}{f_{p, \max}} \leq k < \frac{f_{s}}{2 f_{p} (n)} + k_{0}$
wherein k denotes the number of values of said sets of correlation function values, f_p(n) denotes the previously determined speech fundamental frequency estimate, f_p,max denotes a predefined value of a maximal possible speech fundamental frequency, f_s denotes a sampling frequency and k₀ denotes a constant.
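The reestimation of claims 35 and 36 can be illustrated by the following Python sketch, which repeats the maximum search over the restricted index range given by the inequality above whenever the first estimate lies below the frequency value f_k. The function name, the fallback to the original estimate when the range is empty and the example parameter values are assumptions made for illustration; the additional acceptance conditions of claim 37 are not modelled.

```python
import math


def reestimate_pitch(r, f_s, f_p, f_k, f_p_max, k_0):
    """Return a reestimated fundamental frequency, or f_p if no reestimation applies.

    r       : correlation function values indexed by lag (hypothetical input)
    f_p     : previously determined estimate in Hz
    f_k     : frequency value below which reestimation is triggered
    f_p_max : maximal possible speech fundamental frequency in Hz
    k_0     : constant extending the upper end of the search range
    """
    if f_p >= f_k:
        return f_p  # estimate is not suspiciously low; keep it
    k_low = max(1, math.ceil(f_s / f_p_max))   # f_s / f_p_max <= k
    k_high = f_s / (2.0 * f_p) + k_0           # k < f_s / (2 f_p) + k_0
    candidates = [k for k in range(k_low, len(r)) if k < k_high]
    if not candidates:
        return f_p  # assumption: keep the first estimate if the range is empty
    tau_further = max(candidates, key=lambda k: r[k])
    return f_s / tau_further
```

The upper end of the range lies near half of the previously found period, so the search targets a peak at roughly twice the previously estimated frequency.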
Method (1140) according to one of claims 35 or 36, characterized in that the step of determining (1156) the speech fundamental frequency estimate (f_p(n)) comprises outputting said product as the predetermined speech fundamental frequency estimate (f_p(n)) only in the case that the further index (τ̃_p (n)) is larger than 60 percent of the previously searched maximal index (τ_p (n)) as well as the value (r̂_ỹỹ,erw (τ̃_p (n),n)) of the extended set of correlation function values (r̂_ỹỹ,erw (k,n)) at said further index (τ̃_p (n)) is larger than a previously defined amplitude value (p̃ ₀).
Method (1140) according to one of claims 24 to 37, characterized in that the step of determining the speech fundamental frequency estimate (f_p(n)) comprises modifying a speech fundamental period (τ̃_p (n)) corresponding to said determined speech fundamental frequency estimate (f_p(n)) by an interpolation correction term (Δ_p(n)) prior to outputting said speech fundamental frequency estimate (f_p(n)), wherein said interpolation correction term (Δ_p(n)) is dependent on values of said first or second set of correlation function values (r̂_ỹỹd,g (m,n), r̂_ỹỹ,g (m,n)), of said extended set of correlation function values (r̂_ỹỹ,erw (k,n)) or said compensated first set of correlation function values (r̂ _{ỹỹ_d ,g,mod}(m,n)), respectively.
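Claim 38 does not state how the interpolation correction term is computed. Purely as an illustration, the following Python sketch uses parabolic interpolation through the correlation function values at the searched index and its two neighbours, which is a common refinement technique and is not asserted to be the formula of this patent. The function name and the additive application of the correction term to the period are likewise assumptions.

```python
def refine_period(r, tau_p):
    """Return tau_p plus a fractional correction obtained by parabolic interpolation.

    r     : correlation function values indexed by lag (hypothetical input)
    tau_p : integer index of the searched maximum value
    """
    if tau_p <= 0 or tau_p >= len(r) - 1:
        return float(tau_p)  # no neighbours on both sides; leave unchanged
    left, mid, right = r[tau_p - 1], r[tau_p], r[tau_p + 1]
    curvature = left - 2.0 * mid + right
    if curvature >= 0:
        return float(tau_p)  # not a strict local maximum; leave unchanged
    # Vertex offset of the parabola through the three points (assumed correction term).
    delta_p = 0.5 * (left - right) / curvature
    return tau_p + delta_p
```

The modified estimate would then be obtained as the sampling frequency divided by the refined period.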
Method (1140) according to one of the preceding claims, characterized in that the method further comprises a step of receiving the frequency domain versions (Y₁, Y₂) of the first and second set of time domain signal values (y₁, y₂), frequency domain filtering said frequency domain versions in order to obtain said first and second sets of values (Ỹ ₁, Ỹ ₂), respectively, and providing said first and second sets of values (Ỹ ₁, Ỹ ₂) to the first and second power density spectrum calculator, respectively.
Method (1140) according to claim 39, characterized in that the step of frequency domain filtering is only performed for frequencies below a predefined limiting frequency.
Method (1140) according to claim 40, characterized in that the step of frequency domain filtering comprises delaying values of said frequency domain versions being above said predefined limiting frequency.
Computer program product having a program code for performing the method according to one of claims 23 to 41, when the computer program runs on a computer.