Background
Voice endpoint detection refers to distinguishing voice segments from non-voice segments in a noisy environment, and is a key technology in voice signal processing fields such as voice coding, voice enhancement and voice recognition.
Current voice endpoint detection methods fall mainly into two categories: feature-based methods [1] and methods based on machine learning and pattern recognition. Feature-based methods are widely studied and applied because they are simple and fast.
1. Voice endpoint detection based on voice short-time characteristics
Early features for voice endpoint detection were mainly short-time energy, average zero-crossing rate, spectral entropy and cepstral distance. Methods based on these features detect endpoints well in environments with a high signal-to-noise ratio, but their performance drops sharply when the signal-to-noise ratio is low. To improve noise resistance and robustness, researchers have proposed a series of new methods, such as voice endpoint detection based on noise suppression and a method combining Fisher linear discriminant analysis with Mel-frequency cepstral coefficients.
2. Voice endpoint detection based on voice long-term characteristics
The above methods are mostly based on short-time characteristics of speech and do not fully exploit long-term change information. To make better use of the long-term characteristics of speech, Ghosh et al. proposed a detection method based on Long-Term Signal Variability (LTSV), which adapts well to noise and can still effectively distinguish voice segments from non-voice segments at an extremely low signal-to-noise ratio (-10 dB). Ma et al. proposed voice endpoint detection based on Long-term Spectral Flatness Measure (LSFM) features, which distinguishes voice from noise by measuring the spectral flatness of long-term speech in different frequency bands, improving accuracy under non-stationary noise such as babble and machine-gun noise as well as robustness across different noise environments. Although both methods are robust under different noises, their detection performance at low signal-to-noise ratio still leaves room for improvement, especially for non-stationary noises such as babble and machine-gun noise, where performance remains somewhat poor.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice endpoint detection method based on long-term signal power spectrum change, which improves the robustness of long-term-feature-based voice endpoint detection algorithms in different noise environments and improves detection performance in noise environments such as babble and machine-gun noise.
The technical scheme adopted by the invention is as follows. A voice endpoint detection method based on long-term signal power spectrum change comprises the following steps:
1) framing and windowing the input signal;
2) calculating the power spectrum of each framed and windowed frame;
3) calculating the long-term signal power spectrum change value;
4) performing a threshold decision using the long-term signal power spectrum change value;
5) updating the threshold, i.e. adaptively updating the threshold using the threshold decision results of the signals of the past 80 frames;
6) voting decision: let the current target frame be the m-th frame; since the long-term signal power spectrum change value L_x(m) at this moment is determined by the current frame and all R-1 frames before it, the current target frame participates in R threshold decisions, whose results are denoted D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that a speech frame is present, the current target frame is judged to be a speech frame, otherwise it is judged to be a noise frame;
7) repeating steps 1) to 6) until the input signal ends.
In step 2), for each frame of the input signal x(n), a classical periodogram method is used: the short-time discrete Fourier transform of the frame is computed, and the power spectrum of the i-th frame signal at frequency ω_k is expressed as follows:

S_x(i, ω_k) = (1/N_W) · | Σ_{l=0}^{N_W-1} h(l) · x(i·N_SH + l) · exp(-j·ω_k·l) |^2

where N_W denotes the data length of each frame, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.
The specific calculation process of step 3) is as follows: the long-term signal power spectrum change value L_x(m) of the m-th frame signal is obtained by averaging, over the frequency points of the N_FFT-point Fourier transform, the power spectrum variation degree of all the past R frame signals at each frequency point, where N_FFT denotes the number of Fourier transform points. The power spectrum variation degree of all the past R frame signals at the k-th frequency point is obtained by averaging the power spectrum variation between any two of the past R frames at the k-th frequency point, i.e. over all R(R-1)/2 frame pairs, where S_x(j, ω_k) and S_x(i, ω_k) denote the power spectra of the j-th and i-th frame signals at the k-th frequency point.
In step 4), the long-term signal power spectrum change value L_x(m) is used to judge whether the current R frames contain a speech frame: if L_x(m) is greater than the preset threshold, a speech frame is present and the flag D_m is set to 1; otherwise no speech frame is present and D_m is set to 0.
In step 5), two buffers B_N(m) and B_{S+N}(m) are designed to store the long-term signal power spectrum change values of the frames judged in the past 80 frames to be noise frames and speech frames, respectively. The adaptive threshold update formula is:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter.
Taking the first 50 frames as initial background noise, the threshold is initialized from this background noise as:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote, respectively, the mean and the standard deviation of the signal power spectrum change value over the 50 background-noise frames, and p is a weighting coefficient.
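As a purely illustrative numerical example: if the 50 background-noise frames give μ_N = 0.4 and σ_N = 0.1 and the weighting coefficient is p = 3, the initial threshold is T_init = 0.4 + 3·0.1 = 0.7; later, if for instance α = 0.3, min(B_{S+N}(m)) = 2.0 and max(B_N(m)) = 0.5, the adaptive update gives T(m) = 0.3·2.0 + 0.7·0.5 = 0.95, a value lying between the largest change value observed for noise frames and the smallest observed for speech frames.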
The voice endpoint detection method based on long-term signal power spectrum change can significantly improve detection accuracy in babble and machine-gun noise environments. By updating the threshold adaptively, it overcomes the poor environmental adaptability of a traditional fixed threshold. In tests, the accuracy of this voice endpoint detection method is overall better than that of the LTSV- and LSFM-based methods; under machine-gun noise its accuracy is markedly better than both, with the average detection accuracy improved by more than 10 percent.
Detailed Description
The following describes a speech endpoint detection method based on long-term signal power spectrum changes in detail with reference to embodiments and drawings.
The invention discloses a voice endpoint detection method based on long-term signal power spectrum change, which comprises the following steps:
1) Framing and windowing the input signal. Although a speech signal is a typical non-stationary signal, the motion of the articulatory organs is very slow compared with the sound-wave vibration, so a speech signal is generally regarded as stationary over a period of 10 ms to 30 ms; the signal to be detected is therefore divided into frames and windowed;
2) Calculating the power spectrum of the framed and windowed signal. Specifically, a classical periodogram method is adopted: for each frame of the input signal x(n), the short-time discrete Fourier transform is computed, and the power spectrum of the i-th frame signal at frequency ω_k is expressed as follows:

S_x(i, ω_k) = (1/N_W) · | Σ_{l=0}^{N_W-1} h(l) · x(i·N_SH + l) · exp(-j·ω_k·l) |^2

where N_W denotes the data length of each frame, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.
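As an illustration of steps 1) and 2), the following Python sketch frames a signal, applies a Hamming window and computes the per-frame periodogram power spectrum. The function name, the use of numpy, and the 1/N_W normalization of the periodogram are choices made for this sketch rather than details taken from the patent text.

```python
import numpy as np

def frame_power_spectrum(x, n_w=512, n_sh=256, n_fft=512):
    """Frame the signal, apply a Hamming window and return the
    periodogram power spectrum of every frame
    (shape: num_frames x (n_fft // 2 + 1))."""
    h = np.hamming(n_w)                        # window function h(l) of length N_W
    num_frames = 1 + (len(x) - n_w) // n_sh    # number of complete frames
    S = np.empty((num_frames, n_fft // 2 + 1))
    for i in range(num_frames):
        frame = x[i * n_sh : i * n_sh + n_w] * h   # windowed i-th frame
        X = np.fft.rfft(frame, n_fft)              # short-time DFT
        S[i] = np.abs(X) ** 2 / n_w                # periodogram estimate
    return S
```

With TIMIT recordings sampled at 16 kHz, n_w = 512 and n_sh = 256 correspond to 32 ms frames with 50% overlap, which is the framing used in the example later in this description.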
3) Calculating the long-term signal power spectrum change value. The long-term signal power spectrum change parameter is determined by the power spectra of the current frame of the input signal x(n) and of all R-1 frames before it, and reflects the non-stationarity of the signal power spectrum over the past R frames. It is calculated as follows: the long-term signal power spectrum change value L_x(m) of the m-th frame is obtained by averaging, over the frequency points of the N_FFT-point Fourier transform, the power spectrum variation degree of all the past R frame signals at each frequency point, where N_FFT denotes the number of Fourier transform points. The power spectrum variation degree of all the past R frame signals at the k-th frequency point is in turn obtained by averaging the power spectrum variation between any two of the past R frames at that frequency point, i.e. over all R(R-1)/2 frame pairs, where S_x(j, ω_k) and S_x(i, ω_k) denote the power spectra of the j-th and i-th frame signals at the k-th frequency point.
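The text describes the per-frequency variation degree as the average of the pairwise power spectrum variation over the past R frames but does not spell out the variation measure itself, so the sketch below assumes the absolute difference of the power spectra at each frequency bin and then averages over bins; it should be read as one plausible reading of step 3), not as the exact patented formula.

```python
import numpy as np

def ltspv(S, R):
    """Long-term signal power spectrum change value L_x(m) for each frame.

    S : array of per-frame power spectra (num_frames x num_bins),
        e.g. the output of frame_power_spectrum() above.
    R : number of past frames taken into account.

    Assumption of this sketch: the pairwise 'variation quantity' is the
    absolute difference of the power spectra at each frequency bin."""
    num_frames = S.shape[0]
    L = np.zeros(num_frames)
    num_pairs = R * (R - 1) / 2
    for m in range(R - 1, num_frames):
        block = S[m - R + 1 : m + 1]               # the past R frames, incl. frame m
        acc = 0.0
        for a in range(R):
            for b in range(a + 1, R):
                acc = acc + np.abs(block[b] - block[a])   # per-bin variation
        variation_degree = acc / num_pairs         # average over all frame pairs
        L[m] = variation_degree.mean()             # average over frequency bins
    return L
```

Averaging over all R(R-1)/2 frame pairs makes the cost quadratic in R; for the small window lengths typically used for long-term features this is negligible.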
4) Performing a threshold decision using the long-term signal power spectrum change value: L_x(m) is used to judge whether the current R frames contain a speech frame; if L_x(m) is greater than the preset threshold, a speech frame is present and the flag D_m is set to 1, otherwise no speech frame is present and D_m is set to 0.
5) Updating the threshold, i.e. adaptively updating the threshold using the threshold decision results of the signals of the past 80 frames. Specifically, two buffers B_N(m) and B_{S+N}(m) are designed to store the long-term signal power spectrum change values of the frames judged in the past 80 frames to be noise frames and speech frames, respectively. The adaptive threshold update formula is:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter; simulation experiments show that the best results are obtained with α = 0.3.

Taking the first 50 frames as initial background noise, the threshold is initialized from this background noise as:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote, respectively, the mean and the standard deviation of the signal power spectrum change value over the 50 background-noise frames, and p is a weighting coefficient; simulation experiments show that the best results are obtained with p = 3.
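The following sketch covers the threshold decision of step 4) together with the adaptive update of step 5). The values n_init = 50, buf_len = 80, α = 0.3 and p = 3 come from the text above; keeping the last 80 values of each class in a deque (rather than restricting strictly to the last 80 frames) and the handling of the first frames before both buffers are filled are approximations made for this sketch.

```python
import numpy as np
from collections import deque

def threshold_decisions(L, n_init=50, buf_len=80, alpha=0.3, p=3.0):
    """Threshold decision (step 4) with adaptive threshold update (step 5).

    L : long-term signal power spectrum change values L_x(m).
    Returns the flags D_m and the threshold T(m) used at each frame."""
    # Initial threshold from the first n_init background-noise frames:
    # T_init = mu_N + p * sigma_N
    t = float(np.mean(L[:n_init]) + p * np.std(L[:n_init]))

    buf_noise = deque(maxlen=buf_len)    # B_N: values judged to be noise frames
    buf_speech = deque(maxlen=buf_len)   # B_S+N: values judged to be speech frames
    D = np.zeros(len(L), dtype=int)
    T = np.zeros(len(L))

    for m in range(len(L)):
        T[m] = t
        if L[m] > t:                     # a speech frame is present
            D[m] = 1
            buf_speech.append(L[m])
        else:                            # no speech frame is present
            buf_noise.append(L[m])
        if buf_speech and buf_noise:     # adaptive update once both buffers have data
            t = alpha * min(buf_speech) + (1 - alpha) * max(buf_noise)
    return D, T
```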
6) Voting decision. Since long-term characteristics of the signal are used, information from preceding and following frames must be considered in the endpoint detection decision. Fig. 2 shows a diagram of the voting decision: let the current target frame be the m-th frame; since the long-term signal power spectrum change value L_x(m) at this moment is determined by the current frame and all R-1 frames before it, the current target frame participates in R threshold decisions, whose results are denoted D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that a speech frame is present, the current target frame is judged to be a speech frame, otherwise it is judged to be a noise frame;
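A sketch of the voting decision of step 6): frame m is labeled a speech frame if more than 80% of the decisions D_m, …, D_{m+R-1} flagged speech. Voting over whatever decisions are available for the last frames of the signal, where fewer than R decisions exist, is a choice made for this sketch.

```python
import numpy as np

def vote(D, R, ratio=0.8):
    """Voting decision (step 6): frame m is a speech frame if more than
    `ratio` of the decisions D_m, ..., D_{m+R-1} flagged speech."""
    D = np.asarray(D)
    vad = np.zeros(len(D), dtype=int)
    for m in range(len(D)):
        window = D[m : m + R]            # the R decisions that involve frame m
        if window.mean() > ratio:        # more than 80 % voted "speech"
            vad[m] = 1
    return vad
```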
7) Repeating steps 1) to 6) until the input signal ends.
Specific examples are given below:
Following the flowchart shown in Fig. 1, an example analysis of the voice endpoint detection method based on long-term signal power spectrum change according to the present invention is carried out. The speech signals are taken from 20 speakers, 10 male and 10 female, in the TIMIT speech corpus, with 10 sentences per speaker, and the endpoints of each sentence are labeled manually (0 denotes a noise segment and 1 denotes a speech segment). Since the sentences in TIMIT are short (about 3.5 seconds) and consist mostly of speech, a 1-second silence segment is added before each sentence in the experiment in order to estimate the noise feature parameters and initialize the decision threshold. The noise is taken from the NOISEX-92 noise library; four noise types are used: white, pink, babble and machine gun. The performance of the algorithm is tested in noise environments of -5, 0, 5 and 10 dB, with the detection accuracy as the performance index, defined as:

detection accuracy = (total number of frames - number of error frames) / total number of frames × 100%

where the number of error frames comprises the speech frames wrongly judged as noise frames and the noise frames wrongly judged as speech frames.
The example proceeds as follows:
1. Read the speech signal and perform framing and windowing, with 512 sampling points per frame, a 512-point Hamming window, and a frame shift of 256 sampling points.
2. Apply a 512-point Fourier transform to each windowed frame and compute the power spectrum parameter S_x(i, ω_k) of each frame.
3. From the signal power spectrum S_x(i, ω_k), compute the long-term signal power spectrum change value L_x(m) of each frame, and initialize the threshold T_init using the background-noise information of the starting stage.
4. Perform the threshold decision using L_x(m) to judge whether the current R frames contain a speech frame: if L_x(m) is greater than the set threshold, a speech frame is present and D_m is set to 1, otherwise no speech frame is present and D_m is set to 0.
5. Adaptively update the decision threshold using the threshold decision results of the past 80 frames.
6. Use the D_m flags to carry out the voting decision for the current target frame. As shown in Fig. 2, among the R threshold decisions that contain the target frame's information, if more than 80% of the results indicate a speech frame, the target frame is judged to be a speech frame; otherwise it is judged to be a noise frame.
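Putting the sketches above together with the parameters of this example (512-point Hamming frames, 256-point shift, 512-point FFT, 50 initialization frames, 80-frame buffers, α = 0.3, p = 3) gives the following illustrative run. The value of R is not stated in the text, so R = 10 below is only a placeholder, and the file name and the use of scipy for reading the wav file are likewise illustrative.

```python
from scipy.io import wavfile

# Illustrative end-to-end run; reuses frame_power_spectrum, ltspv,
# threshold_decisions and vote from the sketches above.
fs, x = wavfile.read("utterance_with_1s_leading_silence.wav")  # hypothetical file
x = x.astype(float)

R = 10                                    # placeholder: R is not given in the text
S = frame_power_spectrum(x, n_w=512, n_sh=256, n_fft=512)
L = ltspv(S, R)
D, T = threshold_decisions(L, n_init=50, buf_len=80, alpha=0.3, p=3.0)
vad = vote(D, R, ratio=0.8)               # 1 = speech frame, 0 = noise frame
```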
Two speech segments were randomly picked from the TIMIT speech corpus, and the VAD results in a 0 dB noise environment are shown in Fig. 3, where a1, b1, c1 and d1 show the speech waveforms after adding 0 dB white, pink, babble and machine-gun noise respectively, and a2, b2, c2 and d2 show the corresponding VAD results.
The voice endpoint detection accuracies of the LTSV-based, LSFM-based and long-term-signal-power-spectrum-change-based methods were measured in noise environments with different signal-to-noise ratios, as shown in Table 1. The table shows that under white, pink and babble noise the detection performance of the three methods is fairly close, with the accuracy of the method based on the long-term signal power spectrum change value slightly better than that of the other two. Under machine-gun noise, however, the accuracy of the method based on the long-term signal power spectrum change value is markedly better than that of the other two methods.
TABLE 1 statistical table of results