Background
Voice endpoint detection refers to distinguishing voice segments from non-voice segments in a noisy environment, and is a key technology in voice signal processing fields such as voice coding, voice enhancement and voice recognition.
Current voice endpoint detection methods fall mainly into two categories: feature-based methods [1] and methods based on machine learning and pattern recognition. Feature-based methods are widely studied and applied because they are simple and fast.
1. Voice endpoint detection based on voice short-time characteristics
Early features for voice endpoint detection were mainly short-time energy, average zero-crossing rate, spectral entropy and cepstral distance. Methods based on these features detect endpoints well in environments with a high signal-to-noise ratio, but their performance drops sharply when the signal-to-noise ratio is low. To improve noise resistance and robustness, researchers have proposed a series of new methods, such as voice endpoint detection based on noise suppression and a method combining Fisher linear discriminant analysis with Mel-frequency cepstral coefficients.
2. Voice endpoint detection based on voice long-term characteristics
The above methods are mostly based on short-time characteristics of speech and do not fully exploit long-term change information. To make better use of the long-term characteristics of speech, Ghosh et al. proposed a detection method based on Long-Term Signal Variability (LTSV), which adapts well to noise and can still effectively distinguish voice segments from non-voice segments at an extremely low signal-to-noise ratio (-10 dB). Ma et al. proposed voice endpoint detection based on Long-term Spectral Flatness Measure (LSFM) features, which distinguishes voice from noise by measuring the spectral flatness of long-term speech in different frequency bands, improving accuracy under non-stationary noise such as babble and machine-gun noise as well as robustness across different noise environments. Although both methods are robust under different noises, their detection performance at low signal-to-noise ratio still leaves room for improvement, especially for non-stationary noises such as babble and machine-gun noise, where performance remains somewhat poor.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice endpoint detection method based on long-term signal power spectrum change, which improves the robustness of long-term-feature-based voice endpoint detection algorithms in different noise environments and improves detection performance in noise environments such as babble and machine-gun noise.
The technical scheme adopted by the invention is as follows. A voice endpoint detection method based on long-term signal power spectrum change comprises the following steps:
1) framing and windowing the input signal;
2) calculating the power spectrum of each framed and windowed frame;
3) calculating the long-term signal power spectrum change value;
4) performing a threshold decision using the long-term signal power spectrum change value;
5) updating the threshold, i.e. adaptively updating the threshold using the threshold decision results of the signals of the past 80 frames;
6) voting decision: let the current target frame be the m-th frame; since the long-term signal power spectrum change value L_x(m) at this moment is determined by the current frame and all R-1 frames before it, the current target frame participates in R threshold decisions, whose results are denoted D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that a speech frame is present, the current target frame is judged to be a speech frame, otherwise it is judged to be a noise frame;
7) repeating steps 1) to 6) until the input signal ends.
In step 2), for each frame of the input signal x(n), a classical periodogram method is used: the short-time discrete Fourier transform of the frame is computed, and the power spectrum of the i-th frame signal at frequency ω_k is expressed as follows:

S_x(i, ω_k) = (1/N_W) · | Σ_{l=0}^{N_W-1} h(l) · x(i·N_SH + l) · exp(-j·ω_k·l) |^2

where N_W denotes the data length of each frame, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.
The specific calculation process of step 3) is as follows: the long-term signal power spectrum change value L_x(m) of the m-th frame signal is obtained by averaging, over the frequency points of the N_FFT-point Fourier transform, the power spectrum variation degree of all the past R frame signals at each frequency point, where N_FFT denotes the number of Fourier transform points. The power spectrum variation degree of all the past R frame signals at the k-th frequency point is obtained by averaging the power spectrum variation between any two of the past R frames at the k-th frequency point, i.e. over all R(R-1)/2 frame pairs, where S_x(j, ω_k) and S_x(i, ω_k) denote the power spectra of the j-th and i-th frame signals at the k-th frequency point.
In step 4), the long-term signal power spectrum change value L_x(m) is used to judge whether the current R frames contain a speech frame: if L_x(m) is greater than the preset threshold, a speech frame is present and the flag D_m is set to 1; otherwise no speech frame is present and D_m is set to 0.
In step 5), two buffers B_N(m) and B_{S+N}(m) are designed to store the long-term signal power spectrum change values of the frames judged in the past 80 frames to be noise frames and speech frames, respectively. The adaptive threshold update formula is:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter.
Taking the first 50 frames as initial background noise, the threshold is initialized from this background noise as:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote, respectively, the mean and the standard deviation of the signal power spectrum change value over the 50 background-noise frames, and p is a weighting coefficient.
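As a purely illustrative numerical example: if the 50 background-noise frames give μ_N = 0.4 and σ_N = 0.1 and the weighting coefficient is p = 3, the initial threshold is T_init = 0.4 + 3·0.1 = 0.7; later, if for instance α = 0.3, min(B_{S+N}(m)) = 2.0 and max(B_N(m)) = 0.5, the adaptive update gives T(m) = 0.3·2.0 + 0.7·0.5 = 0.95, a value lying between the largest change value observed for noise frames and the smallest observed for speech frames.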
The voice endpoint detection method based on long-term signal power spectrum change can significantly improve detection accuracy in babble and machine-gun noise environments. By updating the threshold adaptively, it overcomes the poor environmental adaptability of a traditional fixed threshold. In tests, the accuracy of this voice endpoint detection method is overall better than that of the LTSV- and LSFM-based methods; under machine-gun noise its accuracy is markedly better than both, with the average detection accuracy improved by more than 10 percent.
Detailed Description
The following describes a speech endpoint detection method based on long-term signal power spectrum changes in detail with reference to embodiments and drawings.
The invention discloses a voice endpoint detection method based on long-term signal power spectrum change, which comprises the following steps:
1) Framing and windowing the input signal. Although a speech signal is a typical non-stationary signal, the motion of the articulatory organs is very slow compared with the sound-wave vibration, so a speech signal is generally regarded as stationary over a period of 10 ms to 30 ms; the signal to be detected is therefore divided into frames and windowed;
2) Calculating the power spectrum of the framed and windowed signal. Specifically, a classical periodogram method is adopted: for each frame of the input signal x(n), the short-time discrete Fourier transform is computed, and the power spectrum of the i-th frame signal at frequency ω_k is expressed as follows:

S_x(i, ω_k) = (1/N_W) · | Σ_{l=0}^{N_W-1} h(l) · x(i·N_SH + l) · exp(-j·ω_k·l) |^2

where N_W denotes the data length of each frame, N_SH denotes the frame shift, and h(l) denotes a window function of length N_W.
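As an illustration of steps 1) and 2), the following Python sketch frames a signal, applies a Hamming window and computes the per-frame periodogram power spectrum. The function name, the use of numpy, and the 1/N_W normalization of the periodogram are choices made for this sketch rather than details taken from the patent text.

```python
import numpy as np

def frame_power_spectrum(x, n_w=512, n_sh=256, n_fft=512):
    """Frame the signal, apply a Hamming window and return the
    periodogram power spectrum of every frame
    (shape: num_frames x (n_fft // 2 + 1))."""
    h = np.hamming(n_w)                        # window function h(l) of length N_W
    num_frames = 1 + (len(x) - n_w) // n_sh    # number of complete frames
    S = np.empty((num_frames, n_fft // 2 + 1))
    for i in range(num_frames):
        frame = x[i * n_sh : i * n_sh + n_w] * h   # windowed i-th frame
        X = np.fft.rfft(frame, n_fft)              # short-time DFT
        S[i] = np.abs(X) ** 2 / n_w                # periodogram estimate
    return S
```

With TIMIT recordings sampled at 16 kHz, n_w = 512 and n_sh = 256 correspond to 32 ms frames with 50% overlap, which is the framing used in the example later in this description.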
3) Calculating the long-term signal power spectrum change value. The long-term signal power spectrum change parameter is determined by the power spectra of the current frame of the input signal x(n) and of all R-1 frames before it, and reflects the non-stationarity of the signal power spectrum over the past R frames. It is calculated as follows: the long-term signal power spectrum change value L_x(m) of the m-th frame is obtained by averaging, over the frequency points of the N_FFT-point Fourier transform, the power spectrum variation degree of all the past R frame signals at each frequency point, where N_FFT denotes the number of Fourier transform points. The power spectrum variation degree of all the past R frame signals at the k-th frequency point is in turn obtained by averaging the power spectrum variation between any two of the past R frames at that frequency point, i.e. over all R(R-1)/2 frame pairs, where S_x(j, ω_k) and S_x(i, ω_k) denote the power spectra of the j-th and i-th frame signals at the k-th frequency point.
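The text describes the per-frequency variation degree as the average of the pairwise power spectrum variation over the past R frames but does not spell out the variation measure itself, so the sketch below assumes the absolute difference of the power spectra at each frequency bin and then averages over bins; it should be read as one plausible reading of step 3), not as the exact patented formula.

```python
import numpy as np

def ltspv(S, R):
    """Long-term signal power spectrum change value L_x(m) for each frame.

    S : array of per-frame power spectra (num_frames x num_bins),
        e.g. the output of frame_power_spectrum() above.
    R : number of past frames taken into account.

    Assumption of this sketch: the pairwise 'variation quantity' is the
    absolute difference of the power spectra at each frequency bin."""
    num_frames = S.shape[0]
    L = np.zeros(num_frames)
    num_pairs = R * (R - 1) / 2
    for m in range(R - 1, num_frames):
        block = S[m - R + 1 : m + 1]               # the past R frames, incl. frame m
        acc = 0.0
        for a in range(R):
            for b in range(a + 1, R):
                acc = acc + np.abs(block[b] - block[a])   # per-bin variation
        variation_degree = acc / num_pairs         # average over all frame pairs
        L[m] = variation_degree.mean()             # average over frequency bins
    return L
```

Averaging over all R(R-1)/2 frame pairs makes the cost quadratic in R; for the small window lengths typically used for long-term features this is negligible.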
4) Performing a threshold decision using the long-term signal power spectrum change value: L_x(m) is used to judge whether the current R frames contain a speech frame; if L_x(m) is greater than the preset threshold, a speech frame is present and the flag D_m is set to 1, otherwise no speech frame is present and D_m is set to 0.
5) Updating the threshold, i.e. adaptively updating the threshold using the threshold decision results of the signals of the past 80 frames. Specifically, two buffers B_N(m) and B_{S+N}(m) are designed to store the long-term signal power spectrum change values of the frames judged in the past 80 frames to be noise frames and speech frames, respectively. The adaptive threshold update formula is:

T(m) = α·min(B_{S+N}(m)) + (1-α)·max(B_N(m))

where α is a weight parameter; simulation experiments show that the best results are obtained with α = 0.3.

Taking the first 50 frames as initial background noise, the threshold is initialized from this background noise as:

T_init = μ_N + p·σ_N

where μ_N and σ_N denote, respectively, the mean and the standard deviation of the signal power spectrum change value over the 50 background-noise frames, and p is a weighting coefficient; simulation experiments show that the best results are obtained with p = 3.
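The following sketch covers the threshold decision of step 4) together with the adaptive update of step 5). The values n_init = 50, buf_len = 80, α = 0.3 and p = 3 come from the text above; keeping the last 80 values of each class in a deque (rather than restricting strictly to the last 80 frames) and the handling of the first frames before both buffers are filled are approximations made for this sketch.

```python
import numpy as np
from collections import deque

def threshold_decisions(L, n_init=50, buf_len=80, alpha=0.3, p=3.0):
    """Threshold decision (step 4) with adaptive threshold update (step 5).

    L : long-term signal power spectrum change values L_x(m).
    Returns the flags D_m and the threshold T(m) used at each frame."""
    # Initial threshold from the first n_init background-noise frames:
    # T_init = mu_N + p * sigma_N
    t = float(np.mean(L[:n_init]) + p * np.std(L[:n_init]))

    buf_noise = deque(maxlen=buf_len)    # B_N: values judged to be noise frames
    buf_speech = deque(maxlen=buf_len)   # B_S+N: values judged to be speech frames
    D = np.zeros(len(L), dtype=int)
    T = np.zeros(len(L))

    for m in range(len(L)):
        T[m] = t
        if L[m] > t:                     # a speech frame is present
            D[m] = 1
            buf_speech.append(L[m])
        else:                            # no speech frame is present
            buf_noise.append(L[m])
        if buf_speech and buf_noise:     # adaptive update once both buffers have data
            t = alpha * min(buf_speech) + (1 - alpha) * max(buf_noise)
    return D, T
```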
6) Voting decision. Since long-term characteristics of the signal are used, information from preceding and following frames must be considered in the endpoint detection decision. Fig. 2 shows a diagram of the voting decision: let the current target frame be the m-th frame; since the long-term signal power spectrum change value L_x(m) at this moment is determined by the current frame and all R-1 frames before it, the current target frame participates in R threshold decisions, whose results are denoted D_m, D_{m+1}, …, D_{m+R-1}; if more than 80% of these R threshold decisions indicate that a speech frame is present, the current target frame is judged to be a speech frame, otherwise it is judged to be a noise frame;
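A sketch of the voting decision of step 6): frame m is labeled a speech frame if more than 80% of the decisions D_m, …, D_{m+R-1} flagged speech. Voting over whatever decisions are available for the last frames of the signal, where fewer than R decisions exist, is a choice made for this sketch.

```python
import numpy as np

def vote(D, R, ratio=0.8):
    """Voting decision (step 6): frame m is a speech frame if more than
    `ratio` of the decisions D_m, ..., D_{m+R-1} flagged speech."""
    D = np.asarray(D)
    vad = np.zeros(len(D), dtype=int)
    for m in range(len(D)):
        window = D[m : m + R]            # the R decisions that involve frame m
        if window.mean() > ratio:        # more than 80 % voted "speech"
            vad[m] = 1
    return vad
```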
7) Repeating steps 1) to 6) until the input signal ends.
Specific examples are given below:
Following the flowchart shown in Fig. 1, an example analysis of the voice endpoint detection method based on long-term signal power spectrum change according to the present invention is carried out. The speech signals are taken from 20 speakers, 10 male and 10 female, in the TIMIT speech corpus, with 10 sentences per speaker, and the endpoints of each sentence are labeled manually (0 denotes a noise segment and 1 denotes a speech segment). Since the sentences in TIMIT are short (about 3.5 seconds) and consist mostly of speech, a 1-second silence segment is added before each sentence in the experiment in order to estimate the noise feature parameters and initialize the decision threshold. The noise is taken from the NOISEX-92 noise library; four noise types are used: white, pink, babble and machine gun. The performance of the algorithm is tested in noise environments of -5, 0, 5 and 10 dB, with the detection accuracy as the performance index, defined as:

detection accuracy = (total number of frames - number of error frames) / total number of frames × 100%

where the number of error frames comprises the speech frames wrongly judged as noise frames and the noise frames wrongly judged as speech frames.
The example proceeds as follows:
1. Read the speech signal and perform framing and windowing, with 512 sampling points per frame, a 512-point Hamming window, and a frame shift of 256 sampling points.
2. Apply a 512-point Fourier transform to each windowed frame and compute the power spectrum parameter S_x(i, ω_k) of each frame.
3. From the signal power spectrum S_x(i, ω_k), compute the long-term signal power spectrum change value L_x(m) of each frame, and initialize the threshold T_init using the background-noise information of the starting stage.
4. Perform the threshold decision using L_x(m) to judge whether the current R frames contain a speech frame: if L_x(m) is greater than the set threshold, a speech frame is present and D_m is set to 1, otherwise no speech frame is present and D_m is set to 0.
5. Adaptively update the decision threshold using the threshold decision results of the past 80 frames.
6. Use the D_m flags to carry out the voting decision for the current target frame. As shown in Fig. 2, among the R threshold decisions that contain the target frame's information, if more than 80% of the results indicate a speech frame, the target frame is judged to be a speech frame; otherwise it is judged to be a noise frame.
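Putting the sketches above together with the parameters of this example (512-point Hamming frames, 256-point shift, 512-point FFT, 50 initialization frames, 80-frame buffers, α = 0.3, p = 3) gives the following illustrative run. The value of R is not stated in the text, so R = 10 below is only a placeholder, and the file name and the use of scipy for reading the wav file are likewise illustrative.

```python
from scipy.io import wavfile

# Illustrative end-to-end run; reuses frame_power_spectrum, ltspv,
# threshold_decisions and vote from the sketches above.
fs, x = wavfile.read("utterance_with_1s_leading_silence.wav")  # hypothetical file
x = x.astype(float)

R = 10                                    # placeholder: R is not given in the text
S = frame_power_spectrum(x, n_w=512, n_sh=256, n_fft=512)
L = ltspv(S, R)
D, T = threshold_decisions(L, n_init=50, buf_len=80, alpha=0.3, p=3.0)
vad = vote(D, R, ratio=0.8)               # 1 = speech frame, 0 = noise frame
```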
Two speech segments were randomly picked from the TIMIT speech corpus, and the VAD results in a 0 dB noise environment are shown in Fig. 3, where a1, b1, c1 and d1 show the speech waveforms after adding 0 dB white, pink, babble and machine-gun noise respectively, and a2, b2, c2 and d2 show the corresponding VAD results.
The voice endpoint detection accuracies of the LTSV-based, LSFM-based and long-term-signal-power-spectrum-change-based methods were measured in noise environments with different signal-to-noise ratios, as shown in Table 1. The table shows that under white, pink and babble noise the detection performance of the three methods is fairly close, with the accuracy of the method based on the long-term signal power spectrum change value slightly better than that of the other two. Under machine-gun noise, however, the accuracy of the method based on the long-term signal power spectrum change value is markedly better than that of the other two methods.
TABLE 1 statistical table of results