CN103646649A

CN103646649A - High-efficiency voice detecting method

Info

Publication number: CN103646649A
Application number: CN201310743203.5A
Authority: CN
Inventors: 陶建华; 刘斌
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Zhongke Extreme Element Hangzhou Intelligent Technology Co Ltd
Priority date: 2013-12-30
Filing date: 2013-12-30
Publication date: 2014-03-19
Anticipated expiration: 2033-12-30
Also published as: CN103646649B

Abstract

The invention discloses a high-efficiency voice detecting method. The method comprises the following steps: analyzing the short-time energy and the short-time zero-crossing rate of an original audio frequency on a time domain and removing parts of non-voice signals; analyzing the spectral envelop characteristic and the entropy characteristic of a preserved audio frequency signal subband on a frequency domain and further removing parts of non-voice signals; forming an audio frequency segment by continuous frames with similar characteristics in each preserved frame of audio frequency signals; calculating the average value of Mel-frequency Cepstral coefficient of each frame in each audio frequency, respectively inputting the average values into a voice gaussian mixture model and various non-voice gaussian mixture models, and performing band-level judgment on whether the audio frequency segment contains voice data according to the output probability of each model, thereby finally obtaining a voice detecting result. The method is capable of detecting voice signals from audio frequency data streams under various complex environments, and positioning the boundary between voice segment data and non-voice segment data relatively correctly.

Description

A kind of efficient speech detection method

Technical field

The present invention relates to Intelligent Information Processing field, especially a kind of efficient speech detection method.

Background technology

Voice are one of Main Means of mankind's exchange of information, and speech detection technology occupies consequence always in field of voice signal; Speech detection system is as pretreatment module such as speech recognition, Speaker Identification, voice codings, and its robustness will directly affect the performance of other speech processing module.How random noise in the face of under various complex environments, navigate to speech segments accurately by a kind of efficient means, effectively distinguishes voice and non-speech audio, become current study hotspot both domestic and external, is more and more subject to extensive concern.Speech detection system has great practical value, and high-quality robust speech detection technique has all obtained general application in various communication systems, multimedia system, speech recognition system and Voiceprint Recognition System.

The speech detection method of main flow mainly comprises speech detection method and the speech detection method based on model based on parameter at present.Speech detection method based on parameter is analyzed voice signal from signals layer, in time domain, frequency domain or other transform domain, calculates speech parameter, by arranging in rational threshold test audio stream whether comprise voice; Conventional speech parameter comprises energy proportion, harmonic components of short-time energy, short-time zero-crossing rate, each frequency band etc.Speech detection method based on model, by extensive speech data training pattern, is distinguished voice signal and various non-speech audio accurately by intelligentized mathematical model; Conventional method comprises speech detection method based on gauss hybrid models, the speech detection method based on artificial neural network, the speech detection method based on Hidden Markov Model (HMM) etc.Speech detection method based on model need to mark to train reliable speech detection model to large-scale data, belongs to the speech detection method that has supervision; Speech detection method based on parameter, without training mathematical model, belongs to unsupervised speech detection method.The speech detection method of current various main flows can detect fast and accurately voice signal under various quiet environment; Under stationary noise environment and under the nonstationary noise environment of various high s/n ratios, speech detection system has higher accuracy rate; But in the face of the various Non-Stationary random noises under various complex environments, the hydraulic performance decline of speech detection system is serious.

Summary of the invention

For solving above-mentioned one or more problems, the invention provides a kind of efficient speech detection method, under various complex environments, voice signal can from audio stream, be detected fast and accurately, location speech segments that can be relatively accurate and the border between non-speech segment data.

A kind of speech detection method provided by the invention comprises the following steps:

Step S10, obtains original audio, analyzes short-time energy and the short-time zero-crossing rate of described original audio in time domain, by short-time energy and short-time zero-crossing rate, rejects the part non-speech audio in original audio;

Step S20, the sound signal remaining for described step S10 is analyzed the spectrum envelope characteristic of its subband and the entropy characteristic of subband on frequency domain, further rejects the part non-speech audio in described sound signal;

Step S30, for the sound signal of the respectively frame to be screened remaining, forms an audio section by continuous some frames of feature similarity;

Step S40, for each audio section to be screened, for whether comprising the decision-making of the speech data section of carrying out level in this audio section, finally obtains speech detection result by gauss hybrid models.

From technique scheme, can find out, the invention provides a kind of speech detection method of efficient robust, it has following beneficial effect:

(1) speech detection method provided by the invention can be applied to the front-end module of various speech recognition systems, by this module, can reject accurately the non-speech data in audio stream to be identified, improves efficiency and the robustness of speech recognition system;

(2) speech detection method provided by the invention can be applied to the front-end module of various speech coding systems, by this module, can locate accurately the border of speech segments and non-speech segment data, speech coding system is only transmitted speech segments, improve communication efficiency;

(3) speech detection method provided by the invention can steadily and under Non-Stationary random noise environment detect speech data fast and accurately various; Can effectively distinguish voice signal and various non-speech audio, not be subject to the restriction of speaker, environment and languages.

Accompanying drawing explanation

Fig. 1 is the process flow diagram of speech detection method according to an embodiment of the invention;

Fig. 2 is the process flow diagram of time-domain analysis part in speech detection method according to an embodiment of the invention;

Fig. 3 is the process flow diagram of speech detection method frequency domain analysis part according to an embodiment of the invention;

Fig. 4 is the process flow diagram of speech detection method sound intermediate frequency frame cluster part according to an embodiment of the invention;

Fig. 5 in speech detection method according to an embodiment of the invention by the process flow diagram of the gauss hybrid models section of carrying out grade decision-making;

Fig. 6 is the process flow diagram of the off-line training process of gauss hybrid models in speech detection method according to an embodiment of the invention.

Embodiment

For making the object, technical solutions and advantages of the present invention clearer, below in conjunction with specific embodiment, and with reference to accompanying drawing, the present invention is described in more detail.

It should be noted that, in accompanying drawing or instructions description, similar or identical part is all used identical figure number.The implementation that does not illustrate in accompanying drawing or describe is form known to a person of ordinary skill in the art in affiliated technical field.In addition, although the demonstration of the parameter that comprises particular value can be provided herein, should be appreciated that, parameter is without definitely equaling corresponding value, but can in acceptable error margin or design constraint, be similar to corresponding value.

The present invention proposes a kind of efficient speech detection mechanism.This mechanism is carried out the speech detection in two stages to audio stream.First by temporal signatures and frequency domain character, original audio is divided into non-speech data and data to be screened, then by sound spectrograph feature, treat examination data and carry out segmentation, by the gauss hybrid models of speech data and the gauss hybrid models of non-speech data, carry out piecemeal speech detection.

Generally speaking, described speech detection method comprises time-domain analysis step, frequency-domain analysis step, Audio clustering step and section level steps in decision-making, Fig. 1 is the process flow diagram of speech detection method according to an embodiment of the invention, and as shown in Figure 1, described speech detection method comprises the following steps:

Utilize short-time energy effectively to detect voiced sound, utilize short-time zero-crossing rate effectively to detect voiceless sound, merge these two kinds of parameters and just can effectively reject part non-speech audio.

Fig. 2 is the process flow diagram of time-domain analysis part in speech detection method according to an embodiment of the invention, and as shown in Figure 2, described step S10 further comprises the steps:

Step S11, is uniformly-spaced divided into some frames by original audio, calculates short-time energy and the short-time zero-crossing rate of every frame original audio;

Step S12, the short-time energy of every frame original audio and short-time zero-crossing rate are compared with predefined low, high two thresholdings respectively, according to comparative result, every frame original audio is divided into quiet section, transition section and voice segments, remove quiet section and transition section signal in described original audio, only retain voice segments signal.

Described the short-time energy of every frame original audio and short-time zero-crossing rate are compared with predefined low, high two thresholdings respectively, the step that every frame original audio is divided into quiet section, transition section and voice segments according to comparative result is specially: if described short-time energy or short-time zero-crossing rate surpass low threshold, mark enters transition section; In transition section, if all falling back to low threshold, two parameters enter into quiet section with next; In transition section, if any one in two parameters surpasses high threshold, think and enter voice segments; In voice segments, if two parameters all drop to below low threshold, and the duration surpass a predetermined threshold, think that voice segments finishes.

The spectrum envelope characteristic of analyzing subband on frequency domain comprises the following steps:

First, described sound signal is divided into some subbands;

Then, in the frequency range of each subband, carry out respectively bandpass filtering, obtain the sound signal of each subband;

Then, each subband sound signal is carried out to Hilbert transform, obtain the spectrum envelope of each subband;

Finally, the statistical property with its spectrum envelope signal of Substrip analysis that comprises more noise contribution to the subband that comprises obvious resonance peak characteristic.

The statistical property of described spectrum envelope signal comprises average and the variance of spectrum envelope, concrete calculative being characterized as: the subband spectrum envelope variance that (1) comprises obvious resonance peak characteristic; (2) the equal value difference of the subband spectrum envelope that comprises obvious resonance peak characteristic and the subband spectrum envelope that comprises more noise contributions.

The entropy characteristic of analyzing subband on frequency domain comprises the following steps:

First, under long span pattern, utilize present frame and the some frames that are adjacent to calculate the entropy of each frequencies of present frame;

Then, within the scope of particular sub-band the average of statistical entropy and variance to determine the complexity of current speech frame.

Subband spectrum envelope characteristic under the short span pattern of fusion and the subband entropy characteristic under long span pattern just can further be rejected part non-speech audio like this, are specially:

For every frame voice signal, utilize the spectrum envelope characteristic of subband and the entropy characteristic of subband, under the ground unrest of various complexity, voice signal is carried out to frequency-domain analysis, and then voice signal and non-speech audio are classified, further reject part non-speech audio.

Fig. 3 is the process flow diagram of speech detection method frequency domain analysis part according to an embodiment of the invention, as shown in Figure 3, according to the entropy characteristic of the spectrum envelope characteristic of subband and subband, the step of further rejecting the part non-speech audio in described sound signal comprises the following steps:

Step S21, for every frame voice signal, first it is carried out to high-pass filtering to remove the interference of power frequency component, in an embodiment of the present invention, described Hi-pass filter is selected 4 rank Chebyshev's Hi-pass filters, then the sound signal through high-pass filtering is carried out to windowing process, in an embodiment of the present invention, window function is selected Hamming window;

Step S22, sound signal after windowing process is divided into N frequency range, in an embodiment of the present invention, described sound signal is divided into 0-500Hz, 500-1000Hz, 1000-2000Hz, 2000-3000Hz and 3000-4000Hz be totally five frequency ranges, in these band limits, respectively described sound signal carried out to bandpass filtering, obtains the sound signal of N subband, in an embodiment of the present invention, bandpass filter adopts 6 rank Butterworth filters;

Step S23, carries out Hilbert transform to the sound signal of each subband, obtains corresponding spectrum envelope signal;

For voiced sound signal, the spectrum envelope of 500-1000Hz frequency band comprises obvious resonance peak characteristic; And under band is made an uproar environment, the spectrum envelope of 3000-4000Hz frequency band comprises more noise contribution, in an embodiment of the present invention, only 500-1000Hz and two son bands of 3000-4000Hz are carried out to Hilbert transform.

Step S24, the spectrum envelope signal that described step S23 is obtained carries out statistical characteristic analysis, calculates their average and variance within the scope of respective sub-bands, and then obtains spectrum envelope judgement output;

If μ ₁the average that represents 500-1000Hz subband spectrum envelope, μ ₂the average that represents 3000-4000Hz subband spectrum envelope, σ ₁and σ ₂the variance that represents respectively above-mentioned two subbands, arranges spectrum envelope judgement and is output as VAD _envelope, it can be expressed as:

VAD _envelope=σ ₂-(μ ₂-μ ₅)，

This step, just by the analysis of antithetical phrase band spectrum envelope, has obtained judgement output VAD like this _envelope.

Step S25, calculates Fourier modulus spectrum to the sound signal of current frame voice frequency signal and adjacent some frames, obtains Fourier's amplitude of each Frequency point of different frame; For different Frequency points, utilize adjacent some frames to calculate present frame at the entropy at this Frequency point place; Within the scope of the subband that comprises obvious resonance peak characteristic, (in an embodiment of the present invention, selection 500-1000Hz frequency band) calculates the variance of each Frequency point entropy, as long span judgement output VAD _entropy;

Step S26, merges two judgement outputs that described step S24 and step S25 obtain and comprehensively adjudicates, and obtains final frequency domain decision result VAD _freq, be expressed as:

VAD _freq=ω ₁VAD _entropy+ω ₂VAD _entropy，

If frequency domain decision result VAD _freqhigher than a threshold value, this frame is labeled as to speech frame, if VAD _freqlower than this threshold value, this frame is labeled as to non-speech frame, in addition, the data that are labeled as speech frame need to be carried out extended length, and the start frame of voice segments is expanded forward to 3 frames, and the ending frame of voice segments is expanded to 3 frames backward.

Sound signal after processing like this has just further been rejected part non-speech audio.

Step S30, for the sound signal of the respectively frame to be screened remaining, forms an audio section by continuous some frames of feature similarity, follow-uply take audio section and carries out speech detection as unit;

Fig. 4 is the process flow diagram of speech detection method sound intermediate frequency frame cluster part according to an embodiment of the invention, and as shown in Figure 4, described step S30 is further comprising the steps:

Step S31, for the sound signal of frame to be screened respectively, considers human auditory system apperceive characteristic, in Mel territory, described sound signal is divided into some subbands, by Mel wave filter, obtains the sound signal of each subband;

Step S32, every frame sound signal is calculated to the entropy of each subband, to measure the proportion of each sub belt energy, the weight of each subband is set according to auditory perception property, the low frequency sub-band weight that can reflect resonance peak characteristic is relatively large, and the weight of high-frequency sub-band is relatively little;

Step S33, the entropy of each subband of take is characteristic parameter, calculate the similarity of adjacent speech frame, in computation process, consider the weight of each subband, then according to metric function conventional in prior art, the consecutive frame of feature similarity is classified as to an audio section, for each frame data in each audio section, the distance between them is less than threshold value T.

By said method, just can the entropy based on each subband of speech frame sound signal be divided into some audio sections, in each audio section, comprise similar speech frame, follow-uply take audio section and carry out speech detection as unit.

Step S40, for each audio section to be screened, the average of each frame Mel cepstrum coefficient in difference compute segment, the Mean Parameters obtaining is input to respectively in voice gauss hybrid models and various non-voice gauss hybrid models, according to the output probability of each model, for whether comprising the decision-making of the speech data section of carrying out level in this audio section, finally obtain speech detection result.

Fig. 5 in speech detection method according to an embodiment of the invention by the process flow diagram of the gauss hybrid models section of carrying out grade decision-making, as shown in Figure 5, described step S40 is specially: first extract the M rank of each frame in audio section to be screened such as the static Mel cepstrum coefficient in 13 rank, then calculate respectively their first order difference and second order difference, finally obtain 3*M Jan Vermeer cepstrum coefficient; Calculate the average of each frame Mel cepstrum coefficient, utilize the average of 3*M Jan Vermeer cepstrum coefficient to carry out speech detection: the average of 3*M Jan Vermeer cepstrum coefficient is input to respectively in the gauss hybrid models of voice signal and the gauss hybrid models of various non-speech audios, if the maximum probability of exporting while being input to the gauss hybrid models of voice signal, judges that this section is as voice signal, otherwise is judged to be non-speech audio.

In described step S40, also need to select various types of audio frequency to train the gauss hybrid models of the gauss hybrid models of voice signal and various non-speech audios, can guarantee the robustness of model like this, improve the accuracy rate of speech detection, when training, need to mark the classification of each audio file.

Fig. 6 is the process flow diagram of the off-line training process of gauss hybrid models in speech detection method according to an embodiment of the invention, as shown in Figure 6, further comprising the steps for the training of gauss hybrid models:

Step S41, carries out filtered audio for whole training audio repositories; Adopt respectively the method for described step S10 and step S20 to carry out time and frequency domain analysis to sound signal, reject part non-speech audio wherein, subsequent step is only trained sound signal remaining to be screened;

Step S42, according to audio categories mark, the sound signal after filtering is classified, the sound signal being about to after filtering is divided into voice signal and non-speech audio, for non-speech audio, need to them, further classify (in an embodiment of the present invention according to the feature of sound signal, non-speech audio is divided into background music, animal sounds, stationary noise and nonstationary noise, dissimilar non-voice is trained respectively to gauss hybrid models);

Step S43, sorted sound signal Yi Zhengwei unit is extracted to Mel cepstrum coefficient, first extract M rank static parameter, then calculate respectively their first order difference and second order difference, final extraction obtains 3*M dimension parameter, adopt the method for described step S30 that continuous some frames of feature similarity are formed to an audio section, the average of each frame Mel cepstrum coefficient in difference compute segment, the characteristic parameter using it as training gauss hybrid models;

Step S44, to voice signal and different classes of non-speech audio, adopt the Mel cepstrum coefficient on 3*M rank to carry out respectively the training of gauss hybrid models, by the training of EM iteration, determine weight, average and the variance of each gauss component in different gauss hybrid models, in an embodiment of the present invention, in each gauss hybrid models, comprise 32 gauss components.

In sum, the present invention proposes a kind of efficient speech detection method, this mechanism is carried out the speech detection in two stages to audio stream.First in time domain He on frequency domain, voice signal is analyzed, by rational parameter threshold is set, voice signal is divided into non-speech data and data to be screened.Then by the parameter model of robust, treat examination data and detect, whether judgement wherein comprises voice.Speech detection method provided by the invention can steadily and under Non-Stationary random noise environment detect speech data fast and accurately various; Can effectively distinguish voice signal and various non-speech audio, not be subject to the restriction of speaker, environment and languages.

It should be noted that, the above-mentioned implementation to each parts is not limited in the various implementations of mentioning in embodiment, and those of ordinary skill in the art can know simply and replace it, for example:

(1) while voice signal being carried out to frequency-domain analysis, according to the sense of hearing characteristic of people's ear, by frequency band division become 0-500Hz, 500-1000Hz, 1000-2000Hz, 2000-3000Hz, 3000-4000Hz totally five subbands on frequency domain, voice signal is analyzed.Can use other sub-band division method to substitute, as used Mel wave filter to divide each subband.

(2) set up in gauss hybrid models process, the mixed Gauss model number of regulation also can be adjusted, and as voice gauss hybrid models comprises 32 Gaussian distribution, non-voice gauss hybrid models comprises 64 Gaussian distribution.

Above-described specific embodiment; object of the present invention, technical scheme and beneficial effect are further described; institute is understood that; the foregoing is only specific embodiments of the invention; be not limited to the present invention; within the spirit and principles in the present invention all, any modification of making, be equal to replacement, improvement etc., within all should being included in protection scope of the present invention.

Claims

1. A voice detection method, characterized in that the method may further comprise the steps:

Step S10, obtaining the original audio, analyzing the short-term energy and short-time zero-crossing rate of the original audio in the time domain, and removing part of the non-speech signals in the original audio through the short-term energy and short-time zero-crossing rate;

Step S20, for the audio signal retained in the step S10, analyze the spectral envelope characteristics of its subbands and the entropy characteristics of the subbands in the frequency domain, and further eliminate part of the non-speech signals in the audio signal;

Step S30, for the retained audio signals of the frames to be discriminated, a plurality of consecutive frames with similar characteristics are formed into an audio segment;

Step S40, for each audio segment to be discriminated, a Gaussian mixture model is used to make a segment-level decision on whether the audio segment contains voice data, and finally obtain a voice detection result.

2. The method according to claim 1, wherein said step S10 further comprises the following steps:

Step S11, divide the original audio into several frames at equal intervals, and calculate the short-term energy and short-time zero-crossing rate of each frame of original audio;

Step S12, compare the short-term energy and short-time zero-crossing rate of each frame of original audio with the preset low and high thresholds respectively, and divide each frame of original audio into silence segment, transition segment and speech segment according to the comparison result. segment, removing the silent segment and the transition segment signal in the original audio, and only retaining the speech segment signal.

3. The method according to claim 2, wherein if the short-term energy or the short-term zero-crossing rate exceeds a low threshold, then the flag enters a transition section; in the transition section, if both parameters fall back to a low If the threshold is below the threshold, it enters the silent segment; in the transition segment, if any one of the two parameters exceeds the high threshold, it is considered to enter the speech segment; in the speech segment, if both parameters drop below the low threshold, and the duration If a predetermined threshold is exceeded, the speech segment is considered to be terminated.

4. The method according to claim 1, characterized in that, in the step S20, analyzing the statistical characteristics of the spectral envelope of each subband in the frequency domain comprises the following steps:

First, the audio signal is divided into several subbands;

Then, band-pass filtering is carried out in the frequency range of each sub-band respectively to obtain the audio signal of each sub-band;

Then, carry out Hilbert transform to each sub-band audio signal, obtain the spectral envelope of each sub-band;

Finally, the statistical properties of the spectral envelope signal are analyzed for the subbands containing obvious formant characteristics and containing more noise components.

5. The method according to claim 4, wherein the statistical properties of the spectral envelope signal comprise the mean value and the variance of the spectral envelope, and the features that need to be calculated specifically are: subband spectral envelopes that contain obvious formant characteristics The variance of the envelope; the mean difference between the subband spectral envelope containing distinct formant characteristics and the subband spectral envelope containing more noise components.

6. The method according to claim 1, characterized in that, in the step S20, analyzing the entropy characteristics of the subbands in the frequency domain comprises the following steps:

First, in the long-span mode, use the current frame and several frames adjacent to it to calculate the entropy of each frequency point of the current frame;

Then, the mean value and variance of the entropy are counted within a specific sub-band range to determine the complexity of the current speech frame.

7. The method according to claim 1, wherein, in the step S20, according to the spectral envelope statistical characteristics and entropy characteristics of each subband, the step of further removing part of the non-speech signal in the audio signal comprises The following steps:

Step S21, for each frame of speech signal, first perform high-pass filtering to remove the interference of the power frequency signal, and then perform windowing processing on the high-pass filtered audio signal;

Step S22, dividing the audio signal after windowing processing into N frequency bands, performing bandpass filtering on the audio signal within these frequency bands respectively, to obtain audio signals of N subbands;

Step S23, performing Hilbert transform on the audio signal of each subband to obtain the corresponding spectral envelope signal;

Step S24, performing statistical characteristic analysis on the spectral envelope signal obtained in step S23, to obtain a spectral envelope decision output;

Step S25, calculate the Fourier amplitude spectrum of the audio signal of the current frame and the audio signals of several adjacent frames, and obtain the Fourier amplitude of each frequency point in different frames; for different frequency points, use several adjacent frames to calculate the current frame Entropy at the frequency point; calculate the variance of the entropy of each frequency point within the subband range containing obvious formant characteristics, and use it as a long-span judgment output;

Step S26, combining the two judgment outputs obtained in step S24 and step S25 for comprehensive judgment to obtain the final frequency domain judgment result; if the frequency domain judgment result is higher than a threshold value, then mark the frame as a speech frame, If it is lower than the threshold, the frame is marked as a non-speech frame.

8. The method according to claim 1, wherein the step S30 further comprises the following steps:

Step S31, for the audio signal of each frame to be discriminated, considering the auditory perception characteristics of the human ear, divide the audio signal into several subbands in the Mel domain;

Step S32, calculating the entropy of each sub-band for each frame of audio signal to measure the proportion of energy of each sub-band, and setting the weight of each sub-band according to the auditory perception characteristics;

Step S33, using the entropy of each subband as a characteristic parameter, calculating the similarity of adjacent speech frames, considering the weight of each subband in the calculation process, and then classifying adjacent frames with similar features into an audio segment according to the measurement function.

9. The method according to claim 1, wherein the step S40 is specifically:

For each audio segment to be discriminated, calculate the mean value of the Mel cepstral coefficient of each frame in the segment, and input the mean value parameters into the speech Gaussian mixture model and various non-speech Gaussian mixture models, according to the output probability of each model A segment-level decision is made on whether the audio segment contains voice data, and finally a voice detection result is obtained.

10. The method according to claim 1, wherein the training of the Gaussian mixture model in the step S40 is specifically:

Step S41, perform audio filtering on all training audio libraries, respectively use the methods of step S10 and step S20 to analyze the audio signals in the time domain and frequency domain, and remove some of the non-speech signals, and the subsequent steps are only for the remaining to be screened audio signal for training;

Step S42, classifying the filtered audio signal according to the audio category label, that is, dividing the filtered audio signal into a speech signal and a non-speech signal;

Step S43, extracting the Mel cepstral coefficients from the classified audio signal in units of frames, first extracting M-order static parameters, then calculating their first-order difference and second-order difference respectively, and finally extracting 3*M-dimensional parameters, using The method of described step S30 forms an audio segment with similar continuous frames of features, calculates the mean value of each frame Mel cepstral coefficient in the segment respectively, and uses it as the characteristic parameter of training Gaussian mixture model;

In step S44, Gaussian mixture models are trained on speech signals and non-speech signals of different categories, that is, weights, mean values, and variances of each Gaussian component in different Gaussian mixture models are determined through EM iterative training.