
CN119132337B - Effective voice detection method and device based on feature enhancement pre-training model - Google Patents

Effective voice detection method and device based on feature enhancement pre-training model

Info

Publication number
CN119132337B
Authority
CN
China
Prior art keywords
voice
model
training
effective
training model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411031589.1A
Other languages
Chinese (zh)
Other versions
CN119132337A (en)
Inventor
吴石松
董召杰
李轩昂
梁寿愚
卢志良
陈柔伊
陈骞
赵必美
李紫京
苏立伟
刘振华
赵翔宇
郑桦
李成
冯勤宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Artificial Intelligence Technology Co ltd
Original Assignee
China Southern Power Grid Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Artificial Intelligence Technology Co ltd filed Critical China Southern Power Grid Artificial Intelligence Technology Co ltd
Priority to CN202411031589.1A priority Critical patent/CN119132337B/en
Publication of CN119132337A publication Critical patent/CN119132337A/en
Application granted granted Critical
Publication of CN119132337B publication Critical patent/CN119132337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/173Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract


This application relates to an effective speech detection method and apparatus based on a feature-enhanced pre-trained model. The method includes: acquiring speech to be detected containing different types of noise; inputting the speech to be detected into a first pre-trained model, and extracting effective speech features from the speech to be detected through the first pre-trained model, where the first training data used by the first pre-trained model is obtained by enhancing the data features of unlabeled sample speech; inputting the effective speech features into a second pre-trained model, and performing effective speech classification through the second pre-trained model to obtain a classification result sequence; and outputting effective speech segments of the speech to be detected based on the classification result sequence, the effective speech segments being speech segments from which noise has been removed. This method adapts to more application scenarios and noise types, effectively improves effective speech detection, and thereby enhances the performance of the speech recognition system.

Description

Effective voice detection method and device based on feature enhancement pre-training model
Technical Field
The present application relates to the field of speech processing technology, and in particular, to an effective speech detection method, apparatus, computer device, computer readable storage medium and computer program product based on a feature-enhanced pre-training model.
Background
With the development of speech recognition technology, its application in power production activities has become increasingly widespread, for example in voice analysis and processing on intelligent power customer-service platforms. However, the complexity of the actual application environment also presents a significant challenge to speech recognition technology.
In the related art, conventional speech recognition generally relies on VAD (Voice Activity Detection, also called effective speech detection) technology to remove the environmental noise contained in speech. Because noise types are complex and application scenarios numerous in practical environments, traditional methods struggle to remove certain noises completely, and the residual noise has a considerable influence on the performance of the speech recognition system.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an effective speech detection method, apparatus, computer device, computer readable storage medium, and computer program product based on a feature-enhanced pre-training model that can enhance the effective speech detection effect.
In a first aspect, the present application provides an effective speech detection method based on a feature-enhanced pre-training model, comprising:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In one embodiment, the outputting the valid voice segment of the voice to be detected according to the classification result sequence includes:
determining a starting time point and an ending time point of the effective voice frame in the classification result sequence;
and obtaining the effective voice fragment according to the sequence fragment corresponding to the starting time point and the ending time point of the effective voice frame.
In one embodiment, the method further comprises:
acquiring unlabeled sample voice based on a voice recognition task;
converting the unlabeled sample voice to obtain a Mel frequency spectrum matrix, and processing the matrix in the time dimension and the frequency dimension to obtain the enhanced data characteristics of the unlabeled sample voice;
and taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
In one embodiment, the method further comprises:
acquiring a first model to be trained based on an encoder-decoder structure;
And combining the first training data and a first loss function, and performing self-supervision model training on the first model to be trained to obtain the first pre-training model for extracting effective voice features, wherein the first loss function comprises contrast loss and diversity loss.
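The combination of a contrastive loss and a diversity loss resembles the objective used in wav2vec 2.0-style self-supervised pre-training. As a rough, hypothetical sketch only (the patent does not disclose its exact formulation), the two terms could look like the following; the function names, shapes, temperature, and weighting are all assumptions:

```python
import numpy as np

def contrastive_loss(context, quantized, temperature=0.1):
    """InfoNCE-style loss: each context vector c_t should be closer to its
    own quantized target q_t than to distractors taken from other frames."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    sim = (c @ q.T) / temperature                    # (T, T) cosine similarities
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))  # -log softmax of positives

def diversity_loss(code_probs):
    """Penalize uneven codebook usage: zero when the average code-assignment
    distribution is uniform, positive otherwise."""
    p = code_probs.mean(axis=0)                      # average usage of each code
    entropy = -np.sum(p * np.log(p + 1e-9))
    return float(np.log(len(p)) - entropy)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(50, 16))                      # 50 frames of 16-dim features
total = contrastive_loss(ctx, ctx) + 0.1 * diversity_loss(np.full((50, 8), 1 / 8))
```

The contrastive term pulls each context frame toward its own target and away from the other frames, while the diversity term vanishes exactly when all codebook entries are used uniformly.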
In one embodiment, the method further comprises:
acquiring labeled sample voice based on a voice recognition task, and taking the enhanced data characteristics of the labeled sample voice as the second training data;
and inputting the second training data into the first pre-training model for feature extraction processing, to obtain the sample effective voice features.
In one embodiment, the method further comprises:
Acquiring a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model;
training the effective voice classification model according to a second loss function by taking the effective voice characteristics of the sample as input to obtain a classification result output model;
and combining the classification result output model and the effective voice fragment output module to obtain the second pre-training model.
In a second aspect, the present application further provides an effective speech detection apparatus based on a feature-enhanced pre-training model, including:
the to-be-detected voice acquisition module is used for acquiring to-be-detected voices containing different types of noise;
the effective voice feature extraction module is used for inputting the voice to be detected into a first pre-training model, extracting the effective voice feature of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data feature enhancement on unlabeled sample voice;
the effective voice classification module is used for inputting the effective voice characteristics into a second pre-training model, and performing effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and the effective voice segment output module is used for outputting the effective voice segments of the voice to be detected according to the classification result sequence, wherein the effective voice segments are voice segments from which the noise in the voice to be detected has been removed.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
According to the above method, apparatus, computer device, computer readable storage medium and computer program product for effective voice detection based on a feature-enhanced pre-training model, voice to be detected containing different types of noise is first acquired. The voice to be detected is then input into the first pre-training model, which extracts its effective voice features; the first training data adopted by the first pre-training model is obtained by performing data feature enhancement on unlabeled sample voice. The effective voice features are further input into the second pre-training model, which performs effective voice classification to obtain a classification result sequence; the second training data adopted by the second pre-training model is obtained by performing data feature enhancement on labeled sample voice. The classification result sequence characterizes, frame by frame, the probability that the voice is effective voice, and the effective voice segments of the voice to be detected are output according to the classification result sequence.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other related drawings may be obtained from them without inventive effort by those of ordinary skill in the art.
FIG. 1 is a flow diagram of an efficient speech detection method based on a feature-enhanced pre-training model in one embodiment;
FIG. 2 is a schematic diagram of an efficient speech detection process based on a feature-enhanced pre-training model in one embodiment;
FIG. 3a is a schematic diagram of a training process based on a feature-enhanced pre-training model in one embodiment;
FIG. 3b is a schematic diagram of a model structure in one embodiment;
FIG. 4 is a flow chart of an effective speech detection method based on a feature-enhanced pre-training model in another embodiment;
FIG. 5 is a block diagram of an active speech detection device based on a feature-enhanced pre-training model in one embodiment;
fig. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In an exemplary embodiment, as shown in fig. 1, an effective speech detection method based on a feature-enhanced pre-training model is provided. The method is described here as applied to a terminal by way of illustration; it may equally be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. In this embodiment, the method includes the following steps 101 to 104:
Step 101, obtaining the voice to be detected containing different types of noise.
The voice to be detected can be obtained through a voice recognition system whose processing pipeline includes an effective voice detection stage. Such a system can be applied in fields such as intelligent customer-service voice quality inspection and analysis, intelligent voice conference systems, and multimedia audio analysis.
As an example, the different types of noise may be various types of noise in a practical application environment, such as ambient music, ambient human voice, channel noise, and the like.
In practical application, as shown in fig. 2, taking the test stage as an example, an input test voice may be used as the voice to be detected, on which effective voice detection is then performed based on the feature-enhanced pre-training model.
Step 102, inputting the voice to be detected into a first pre-training model, extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model are obtained by carrying out data characteristic enhancement on voice without marked samples.
The first pre-training model can be a model obtained through data enhancement and self-supervised pre-training with a specific algorithm. The first training data is obtained by applying data enhancement to unlabeled sample voice, transforming original features into enhanced features; training on these enhanced features strengthens the robustness of the pre-training model.
In a specific implementation, a trained pre-training model (i.e., a first pre-training model) may be used as a feature extractor, such as the robust VAD feature extraction module based on the pre-training model in fig. 2, and by inputting the voice to be detected, the first pre-training model may be used to extract the effective voice features of the voice to be detected.
And step 103, inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voice.
The classification result sequence may be used to characterize a probability of whether the speech of each frame in the speech to be detected is a valid speech, for example, frame-by-frame processing may be performed on the speech to be detected to determine whether each frame is a valid speech.
After the effective voice features are obtained, the trained classifier model (i.e., the second pre-training model) can be used to perform effective voice classification. With the extracted effective voice features as input, the neural-network-classifier-based module in fig. 2 can output a probability sequence (i.e., the classification result sequence) indicating whether each frame is effective voice.
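As a minimal, hypothetical illustration of this classification step (not the patent's actual network), a single logistic output layer mapping per-frame features to a per-frame effective-speech probability can be sketched as follows; the shapes and parameters are assumptions:

```python
import numpy as np

def frame_probabilities(features, weights, bias):
    """Map per-frame feature vectors of shape (T, D) to a per-frame probability
    of being effective speech via a logistic output layer. In practice the
    weights would come from fine-tuning on labeled, feature-enhanced data."""
    logits = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid: one probability per frame

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 32))         # 100 frames of extracted features
probs = frame_probabilities(feats, rng.normal(size=32), 0.0)
```

The resulting `probs` plays the role of the classification result sequence consumed by the segment output step.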
In an example, the second training data is obtained by applying data enhancement to labeled sample voice, and can be used to train the second pre-training model. Specifically, since the trained pre-training model can serve as a robust voice activity detection feature extractor (i.e., the first pre-training model), the first pre-training model performs feature extraction on the second training data during training, and a nonlinear neural network classifier can then be obtained for judging effective voice.
Step 104, outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
After the classification result sequence is obtained, it is input to the effective voice segment output module in the second pre-training model, so that the starting time point and the ending time point of each effective voice segment in the voice to be detected can be determined, and the effective voice segments with the noise removed can then be obtained. Performing effective voice detection based on the feature-enhanced pre-training model in this way can effectively improve the performance of a voice recognition system and thereby the voice recognition effect.
According to the effective voice detection method based on the feature-enhanced pre-training model, voice to be detected containing different types of noise is acquired and input into the first pre-training model, which extracts its effective voice features. These features are input into the second pre-training model, which performs effective voice classification to obtain a classification result sequence, and the effective voice segments of the voice to be detected are output according to that sequence. This optimizes effective voice detection: the feature-enhanced pre-training models strengthen the robustness of the detection model, adapt it to more application scenarios and noise types, effectively improve the detection effect, and thereby improve the performance of the voice recognition system.
In an exemplary embodiment, the outputting the valid speech segment of the speech to be detected according to the classification result sequence may include the following steps:
And obtaining the effective voice fragments according to the sequence fragments corresponding to the starting time point and the ending time point of the effective voice frame.
In practical application, an endpoint search algorithm can be adopted to determine the start point and the tail point of each effective voice segment in the classification result sequence. For example, when the number of consecutive effective voice frames following a detected effective voice frame exceeds a threshold, that frame can be confirmed as the start point of an effective voice segment (i.e., the starting time point of the effective voice frames); when the number of consecutive noise frames following a noise frame exceeds a threshold, that noise frame can be confirmed as the tail point of the effective voice segment (i.e., the ending time point of the effective voice frames).
In this embodiment, the starting time point and the ending time point of the valid voice frame are determined in the classification result sequence, so that the valid voice fragment is obtained according to the sequence fragments corresponding to the starting time point and the ending time point of the valid voice frame, and the valid voice fragment can be effectively determined.
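The start/end search described above can be sketched as a simple state machine over the classification result sequence. This is an illustrative reconstruction under assumed parameters (the 0.5 probability cut-off and the frame-count thresholds are not specified by the patent):

```python
def find_segments(probs, speech_thresh=5, noise_thresh=8, p=0.5):
    """Turn a per-frame probability sequence into (start, end) frame indices.

    A frame becomes a segment start once `speech_thresh` consecutive frames
    score above p; a segment ends once `noise_thresh` consecutive frames
    score at or below p.
    """
    segments, start, run = [], None, 0
    for i, prob in enumerate(probs):
        if start is None:
            run = run + 1 if prob > p else 0
            if run >= speech_thresh:             # confirmed start point
                start, run = i - speech_thresh + 1, 0
        else:
            run = run + 1 if prob <= p else 0
            if run >= noise_thresh:              # confirmed tail point
                segments.append((start, i - noise_thresh))
                start, run = None, 0
    if start is not None:                        # segment runs to the end
        segments.append((start, len(probs) - 1))
    return segments

probs = [0.9] * 10 + [0.1] * 10 + [0.9] * 10
segments = find_segments(probs)                  # [(0, 9), (20, 29)]
```

Requiring a run of frames before confirming a start or tail point makes the segmentation robust to isolated misclassified frames.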
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining unlabeled sample voice based on a voice recognition task, converting the unlabeled sample voice into a Mel frequency spectrum matrix, processing the matrix in the time dimension and the frequency dimension to obtain the enhanced data characteristics of the unlabeled sample voice, and taking these enhanced data characteristics as the first training data.
In a specific implementation, as shown in fig. 3a, for the training stage, the overall flow of the effective speech detection system based on the feature-enhanced pre-training model may include: an unlabeled-training-data enhancement module based on a specific algorithm, a large-model pre-training module based on unlabeled data, a robust VAD feature extraction module based on the pre-training model, a labeled-data enhancement module based on a specific algorithm, a neural-network-based effective speech classifier finetune (fine-tuning) module, and a neural-network-classifier-based effective speech segment output module.
For example, based on a historical speech recognition task, unlabeled sample speech can be acquired through the speech recognition system; the unlabeled-training-data enhancement module then applies data enhancement to this speech, and the enhanced unlabeled data (i.e., the first training data) is further input into the large-model pre-training module for model training.
In one example, a data enhancement method at the log mel-spectrogram level may be employed by converting an audio segment (i.e., the unlabeled sample speech) into a mel spectrum matrix of size V×τ, where V represents the frequency dimension and τ represents the time dimension. The following steps can then be used:
1. Zero mean normalization x-x.mean () can be performed on the mel spectrum, so that when masking is performed subsequently, the masking position can be set to 0 directly, and the method is equivalent to filling the mean of the matrix;
2. for time dimension translation, horizontal left-right torsion can be performed on the frequency spectrum;
3. For a time dimension mask, if the maximum range of the time dimension continuous mask is T, a uniform sampling of T can be performed within the range of [0, T ], and then the sampling can be performed within the range of [0, T ] Randomly determining a point t 0 in the range, and then continuously performing t times of masking (such as setting the matrix value to 0) along the time axis from the position t 0;
4. For the frequency dimension mask, if the maximum range of the time dimension continuous mask is F, a uniform sampling of F can be performed within the range of [0,F ], a point F 0 can be randomly determined within the range of [0, v-F ], and then F times of masking can be continuously performed along the time axis from the position F 0 (for example, the matrix value is set to 0).
Therefore, after data enhancement processing, the original features can be transformed into enhanced features, which is beneficial to enhancing the robustness of subsequent model training.
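The four enhancement steps above can be sketched in NumPy as follows. This is a minimal illustrative sketch: the default maximum shift and mask sizes are assumptions, not values specified by the source.

```python
import numpy as np

def augment_mel(mel, max_shift=5, max_t=10, max_f=8, rng=None):
    """SpecAugment-style enhancement of a (V, tau) log-mel matrix:
    zero-mean normalization, time translation, a time-dimension mask,
    and a frequency-dimension mask. Masked cells are set to 0, which
    after normalization equals filling with the matrix mean."""
    if rng is None:
        rng = np.random.default_rng()
    v, total_t = mel.shape
    x = mel - mel.mean()                           # 1. zero-mean normalization
    shift = int(rng.integers(-max_shift, max_shift + 1))
    x = np.roll(x, shift, axis=1)                  # 2. horizontal time translation
    t = int(rng.integers(0, max_t + 1))            # 3. width t ~ U[0, T]
    t0 = int(rng.integers(0, total_t - t + 1))     #    start t0 in [0, tau - t]
    x[:, t0:t0 + t] = 0.0                          #    mask t consecutive frames
    f = int(rng.integers(0, max_f + 1))            # 4. height f ~ U[0, F]
    f0 = int(rng.integers(0, v - f + 1))           #    start f0 in [0, V - f]
    x[f0:f0 + f, :] = 0.0                          #    mask f consecutive bins
    return x
```

The output has the same shape as the input and can be used directly as an enhanced feature for pre-training.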
In this embodiment, by acquiring the unlabeled sample speech based on the speech recognition task, then converting the unlabeled sample speech to obtain the mel spectrum matrix, and processing the unlabeled sample speech in the time dimension and the frequency dimension of the mel spectrum matrix to obtain the enhanced data feature of the unlabeled sample speech, further using the enhanced data feature of the unlabeled sample speech as the first training data, the data support can be provided for further model training.
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining a first model to be trained based on a coder and decoder structure, combining the first training data with a first loss function, and performing self-supervision model training on the first model to be trained to obtain a first pre-training model for extracting effective voice features, wherein the first loss function comprises contrast loss and diversity loss.
In an example, feature enhanced features (i.e., first training data) may be employed for unsupervised pre-training, and a feature enhanced pre-trained large model, i.e., a first pre-training model, may be obtained by self-supervised pre-training using a pre-training model.
Optionally, for the large-model pre-training process based on feature enhancement of unlabeled data, the network structure of the pre-trained large model (i.e., the first model to be trained) adopted is shown in fig. 3b, where the context network part uses a Transformer (encoder-decoder) structure. The feature vectors extracted by the feature encoder can, on one hand, be input directly into the context Transformer network and, on the other hand, be quantized by a quantization module for the subsequent calculation of the loss function (i.e., comparing continuous inputs against quantized targets).
For example, the vector Z output by the encoder network may be discretized by product quantization: the vector Z may be split into G subspaces (each subspace corresponding to a codebook). If each codebook has V entries, the length of each entry is d/G, and the entry most similar to the input sub-vector can be found in each codebook by Gumbel-softmax or a clustering method, so that the discretized vectors output by the codebooks can be concatenated to obtain a d-dimensional quantized version of Z. The main effect of the quantization process is to compress the feature vector and remove redundancy; meanwhile, clustering within each subspace makes the features more robust and less susceptible to small disturbances.
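The product quantization described above can be sketched as follows. This illustrative sketch uses a hard nearest-neighbour lookup per codebook in place of the differentiable Gumbel-softmax selection used during training; the codebook shapes are assumptions.

```python
import numpy as np

def product_quantize(z, codebooks):
    """Product quantization of a d-dimensional vector z with G codebooks.
    codebooks has shape (G, V, d // G): V entries per codebook. z is split
    into G sub-vectors and each is replaced by its most similar codebook
    entry; the results are concatenated back into a d-dimensional vector."""
    g, v, sub_d = codebooks.shape
    parts = z.reshape(g, sub_d)                          # split z into G subspaces
    quantized = []
    for i in range(g):
        dists = np.linalg.norm(codebooks[i] - parts[i], axis=1)
        quantized.append(codebooks[i][np.argmin(dists)])  # most similar entry
    return np.concatenate(quantized)                     # quantized d-dim vector
```

Replacing each sub-vector by a shared codebook entry is what gives the compression and redundancy-removal effect described above.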
In yet another example, the first loss function may include two parts, a contrast loss and a diversity loss, and the final loss value may be obtained as a weighted sum of the two parts.
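A sketch of such a two-part loss is given below. The exact forms are assumptions: an InfoNCE-style contrastive term with a temperature kappa, a diversity term equal to the negative entropy of codebook usage, and the weight alpha are all illustrative choices, not values fixed by the source.

```python
import numpy as np

def contrastive_loss(c, q_pos, q_negs, kappa=0.1):
    """InfoNCE-style contrast term: the context output c should be closer
    (by cosine similarity) to the true quantized target q_pos than to the
    K distractor targets q_negs (shape (K, d))."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(c, q_pos)] + [cos(c, q) for q in q_negs]) / kappa
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                 # low when the true target dominates

def diversity_loss(codebook_counts):
    """Diversity term: negative entropy of the average codebook-entry
    usage distribution, encouraging all entries to be used equally."""
    p = np.asarray(codebook_counts, dtype=float)
    p = p / p.sum()
    return float(np.sum(p * np.log(p + 1e-12)))

def total_loss(c, q_pos, q_negs, codebook_counts, alpha=0.1):
    # weighted sum of the two parts, as described in the text
    return contrastive_loss(c, q_pos, q_negs) + alpha * diversity_loss(codebook_counts)
```

Uniform codebook usage minimizes the diversity term, so the weighted sum pushes the model toward both accurate and diverse quantization.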
In an alternative embodiment, when applied to a downstream task (such as an effective speech detection task), a linear layer may be added on top of the pre-trained model for fine tuning: the parameters of the linear layer and of the Transformer portion may be updated, while the parameters of the encoder portion may be frozen and kept unchanged.
In this embodiment, by acquiring a first model to be trained based on the encoder and decoder structures, and then performing self-supervised model training on the first model to be trained in combination with the first training data and the first loss function to obtain the first pre-training model for extracting effective speech features, a feature-enhancement-based pre-training model can be obtained, which effectively improves the robustness of the effective speech detection model.
In an exemplary embodiment, the method may further include the steps of:
and obtaining the marked sample voice based on the voice recognition task, taking the enhanced data characteristic of the marked sample voice as the second training data, and obtaining the effective voice characteristic of the sample by inputting the second training data into the first pre-training model for characteristic extraction processing.
In practical application, as shown in fig. 3a, for the robust effective speech detection feature extraction process based on the pre-training model, data enhancement can be performed on the labeled sample speech to obtain the second training data, and the trained pre-training model can be used as a robust feature extractor to extract robust characterization vectors from the second training data. In this way, the data-enhanced acoustic features can be input into the trained pre-training model, and the output vectors obtained (i.e., the sample effective speech features) can robustly characterize effective speech detection.
In this embodiment, the enhanced data features of the labeled sample speech are used as the second training data by obtaining the labeled sample speech based on the speech recognition task, and then the second training data is input into the first pre-training model to perform feature extraction processing, so as to obtain effective speech features of the sample, and provide data support for further classifier model training.
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model, training the effective voice classification model according to a second loss function by taking effective voice characteristics of a sample as input to obtain a classification result output model, wherein the second loss function comprises a cross entropy function, and combining the classification result output model and an effective voice fragment output module to obtain a second pre-training model.
In a specific implementation, the extracted sample effective speech features can be used to train a neural-network effective speech classifier: an effective speech classification model based on a neural network is trained with labeled data carrying effective speech segment labels and used as the classification result output model. Segment judgment of effective speech is then performed according to the result sequence output by the classifier, and the starting time and ending time of the effective speech can be output.
In one example, as shown in fig. 3a, for the neural-network-based effective speech classifier fine-tuning process, accurately labeled data (i.e., labeled sample speech) can be input into the robust effective speech detection feature extraction module based on the pre-training model (i.e., the first pre-training model), and the extracted effective speech detection features can then be fed to the neural network as input. The neural network may be a fully connected neural network, a time-delay neural network, or a convolutional neural network, and a cross-entropy function may be selected as the loss function (i.e., the second loss function) to fine-tune the network, so that it can judge frame by frame whether the input speech is effective speech.
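The frame-by-frame fine-tuning step can be sketched as follows. This minimal sketch substitutes a single linear layer trained by plain gradient descent on the cross-entropy loss for the fully connected / time-delay / convolutional network options; the pre-trained feature extractor is treated as frozen, and the learning rate and epoch count are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def finetune_classifier(features, labels, epochs=200, lr=0.5):
    """Fine-tune a linear classification head on frozen pre-trained
    features with cross-entropy loss. features: (N, d) robust feature
    vectors; labels: (N,) in {0, 1} (invalid / valid speech per frame)."""
    n, d = features.shape
    w, b = np.zeros((d, 2)), np.zeros(2)
    onehot = np.eye(2)[labels]
    for _ in range(epochs):
        probs = softmax(features @ w + b)   # frame-by-frame class probabilities
        grad = (probs - onehot) / n         # gradient of mean cross-entropy
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def classify_frames(features, w, b):
    """Frame-by-frame valid/invalid decision (the classification result sequence)."""
    return softmax(features @ w + b).argmax(axis=1)
```

The `classify_frames` output is the per-frame label sequence that the segment output module then post-processes.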
In yet another example, for the valid voice segment output process based on the neural network classifier, after obtaining the label sequence (valid voice, invalid voice) calculated by the neural network classifier, burrs in the sequence (such as very short speech runs inside silence segments or very short silence gaps inside speech segments) can be located and removed according to a set threshold, so that the rationality of the valid voice detection segments can be ensured.
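The burr-removal step can be sketched as follows. The minimum-duration thresholds (in frames) are illustrative assumptions; the source only states that burrs are removed according to a set threshold.

```python
def remove_burrs(labels, min_speech=3, min_silence=3):
    """Smooth a frame-label sequence (1 = valid speech, 0 = invalid):
    interior speech runs shorter than min_speech frames and interior
    silence gaps shorter than min_silence frames are flipped."""
    labels = list(labels)
    # collect (value, start, length) runs of identical labels
    runs, i = [], 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        runs.append((labels[i], i, j - i))
        i = j
    out = labels[:]
    for k, (val, start, length) in enumerate(runs):
        if 0 < k < len(runs) - 1:          # only interior runs can be burrs
            limit = min_speech if val == 1 else min_silence
            if length < limit:
                out[start:start + length] = [1 - val] * length
    return out
```

Runs touching the sequence boundary are left untouched, since there is no surrounding context to justify flipping them.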
The technical scheme of this embodiment addresses the problems of diverse usage scenarios and complex noise faced by effective speech detection in practical applications. By utilizing data enhancement with a specific algorithm together with a self-supervised pre-training model, a pre-training model with robust effective-speech-detection feature extraction capability can be obtained, so that the performance of effective speech detection can be effectively improved through the training of a nonlinear classifier.
In this embodiment, by acquiring the second model to be trained based on the neural network, then using the effective speech feature of the sample as input, training the effective speech classification model according to the second loss function to obtain a classification result output model, and further combining the classification result output model and the effective speech segment output module to obtain a second pre-training model, the performance of effective speech detection can be improved to improve the performance of the speech recognition system.
In one exemplary embodiment, as shown in FIG. 4, a flow diagram of another method for efficient speech detection based on a feature-enhanced pre-training model is provided. In this embodiment, the method includes the steps of:
In step 401, unlabeled sample speech based on a speech recognition task is obtained, a mel spectrum matrix is obtained according to conversion of the unlabeled sample speech, enhanced data features of the unlabeled sample speech are obtained by processing in a time dimension and a frequency dimension of the mel spectrum matrix, and the enhanced data features of the unlabeled sample speech are used as first training data.

In step 402, a first model to be trained based on the encoder and decoder structure is obtained, and the first model to be trained is subjected to self-supervision model training in combination with the first training data and the first loss function, so as to obtain a first pre-training model for extracting effective speech features.

In step 403, labeled sample speech based on the speech recognition task is obtained, the enhanced data feature of the labeled sample speech is used as second training data, and the second training data is input into the first pre-training model to perform feature extraction processing, so as to obtain effective speech features of the sample.

In step 404, a second model to be trained based on the neural network is obtained, the effective speech characteristics of the sample are used as input, the effective speech classification model is trained according to the second loss function, a classification result output model is obtained, and the classification result output model and the effective speech segment output module are combined to obtain a second pre-training model.

In step 405, the to-be-detected voice containing different types of noise is obtained, the to-be-detected voice is input into the first pre-training model, and the effective voice characteristics of the to-be-detected voice are extracted through the first pre-training model.
In step 406, the valid speech features are input to a second pre-training model, and valid speech classification is performed by the second pre-training model, resulting in a classification result sequence.

In step 407, in the classification result sequence, a start time point and an end time point of the valid voice frame are determined, and the valid voice fragment is obtained according to the sequence fragments corresponding to the start time point and the end time point of the valid voice frame.

It should be noted that, for specific limitations of the above steps, reference may be made to the specific limitations of the effective speech detection method based on the feature-enhanced pre-training model described above, which are not repeated here.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times; the order of execution of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides an effective voice detection device based on the feature enhancement pre-training model, which is used for realizing the effective voice detection method based on the feature enhancement pre-training model. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the effective speech detection device based on the feature-enhanced pre-training model provided below may be referred to above for the limitation of the effective speech detection method based on the feature-enhanced pre-training model, which is not described herein.
In one exemplary embodiment, as shown in fig. 5, there is provided an effective speech detection apparatus based on a feature-enhanced pre-training model, comprising:
The to-be-detected voice obtaining module 501 is configured to obtain to-be-detected voices containing different types of noise;
The effective voice feature extraction module 502 is configured to input the voice to be detected into a first pre-training model, extract effective voice features of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by performing data feature enhancement on unlabeled sample voice;
An effective speech classification module 503, configured to input the effective speech feature into a second pre-training model, and perform effective speech classification through the second pre-training model to obtain a classification result sequence; the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on marked sample voices, and the classification result sequence is used for representing the probability of whether the voices of each frame in the voices to be detected are effective voices or not;
and the effective voice segment output module 504 is configured to output an effective voice segment of the voice to be detected according to the classification result sequence, where the effective voice segment is a voice segment for removing noise in the voice to be detected.
In one embodiment, the active speech segment output module 504 includes:
A time point determining sub-module, configured to determine a start time point and an end time point of the valid voice frame in the classification result sequence;
The effective voice segment obtaining submodule is used for obtaining the effective voice segment according to the sequence segment corresponding to the starting time point and the ending time point of the effective voice frame.
In one embodiment, the apparatus further comprises:
the non-labeling sample voice acquisition module is used for acquiring non-labeling sample voice based on voice recognition tasks;
The data characteristic enhancement module is used for obtaining a Mel frequency spectrum matrix according to the voice conversion of the unlabeled sample, and obtaining the enhanced data characteristic of the voice of the unlabeled sample through processing in the time dimension and the frequency dimension of the Mel frequency spectrum matrix;
And the first training data obtaining module is used for taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
In one embodiment, the apparatus further comprises:
The first model to be trained acquisition module is used for acquiring a first model to be trained based on the encoder and decoder structure;
The first pre-training model obtaining module is used for combining the first training data and a first loss function, performing self-supervision model training on the first model to be trained to obtain the first pre-training model for extracting effective voice features, wherein the first loss function comprises a comparison loss and a diversity loss.
In one embodiment, the apparatus further comprises:
the second training data obtaining module is used for obtaining marked sample voice based on a voice recognition task and taking the reinforced data characteristics of the marked sample voice as the second training data;
And the sample effective voice feature obtaining module is used for obtaining sample effective voice features by inputting the second training data into the first pre-training model to perform feature extraction processing.
In one embodiment, the apparatus further comprises:
The system comprises a first training model acquisition module, a second training model acquisition module and a training module, wherein the first training model acquisition module is used for acquiring a first training model based on a neural network;
the classification model training module is used for training the effective voice classification model according to a second loss function by taking the effective voice characteristics of the sample as input to obtain a classification result output model;
and the second pre-training model obtaining module is used for combining the classification result output model and the effective voice segment output module to obtain the second pre-training model.
The various modules in the effective voice detection device based on the feature enhancement pre-training model can be fully or partially implemented by software, hardware and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In an exemplary embodiment, a computer device, which may be a terminal, is provided, and an internal structure diagram thereof may be as shown in fig. 6. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, Near Field Communication (NFC), or other technologies. The computer program, when executed by a processor, implements an effective speech detection method based on a feature-enhanced pre-training model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one exemplary embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the processor, when executing the computer program, further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the computer program when executed by the processor further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the computer program when executed by the processor further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, an artificial intelligence (AI) processor, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the present application.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (8)

1. A method for efficient speech detection based on a feature-enhanced pre-training model, the method comprising:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by adopting a data enhancement method of a log Mel sound spectrum layer and carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
Outputting an effective voice fragment of the voice to be detected according to the classification result sequence, wherein the effective voice fragment is a voice fragment for removing noise in the voice to be detected;
Wherein the method further comprises:
obtaining a first model to be trained based on the structures of an encoder and a decoder, wherein the first model to be trained comprises a quantization module for quantizing the feature vector;
Performing self-supervision model training on the first model to be trained by combining the first training data and a first loss function to obtain the first pre-training model for extracting effective voice characteristics, wherein the first loss function comprises contrast loss and diversity loss;
the method further comprises the steps of:
acquiring unlabeled sample voice based on voice recognition task;
converting the non-marked sample voice to obtain a Mel frequency spectrum matrix, and performing translation and masking processing on the time dimension of the Mel frequency spectrum matrix and masking processing on the frequency dimension of the Mel frequency spectrum matrix to obtain enhanced data characteristics of the non-marked sample voice;
and taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
2. The method according to claim 1, wherein outputting the valid speech segments of the speech to be detected according to the classification result sequence comprises:
determining a starting time point and an ending time point of the effective voice frame in the classification result sequence;
and obtaining the effective voice fragment according to the sequence fragment corresponding to the starting time point and the ending time point of the effective voice frame.
3. The method according to claim 1, wherein the method further comprises:
acquiring labeled sample voice based on a voice recognition task, and taking the enhanced data features of the labeled sample voice as the second training data;
and inputting the second training data into the first pre-training model for feature extraction processing, to obtain sample effective voice features.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model;
training the effective voice classification model according to a second loss function by taking the sample effective voice features as input, to obtain a classification result output model;
and combining the classification result output model and the effective voice segment output module to obtain the second pre-training model.
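Claim 4 trains the effective voice classification model on sample effective voice features with a second loss function. As a minimal sketch, assume the classifier head is logistic regression per frame and the second loss function is binary cross-entropy; the patent's actual network and loss may differ.

```python
import numpy as np

def train_frame_classifier(feats, labels, lr=0.1, epochs=200):
    """Minimal effective-voice frame classifier: logistic regression
    trained with binary cross-entropy (one possible 'second loss
    function'). feats: (n_frames, dim) sample effective voice features;
    labels: (n_frames,) 0/1 effective-voice labels. Returns (w, b).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = feats @ w + b
        p = 1.0 / (1.0 + np.exp(-z))        # sigmoid probability of "effective"
        grad = p - labels                    # gradient of BCE w.r.t. z
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def classify(feats, w, b):
    """Return the classification result sequence (0/1 per frame)."""
    return (feats @ w + b > 0).astype(int)
```

Combining `classify` with a segment-output step (claim 2) mirrors how the classification result output model and the effective voice segment output module are combined into the second pre-training model.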
5. An efficient speech detection apparatus based on a feature-enhanced pre-training model, the apparatus comprising:
the to-be-detected voice acquisition module is used for acquiring to-be-detected voices containing different types of noise;
the effective voice feature extraction module is used for inputting the voice to be detected into a first pre-training model, and extracting the effective voice features of the voice to be detected through the first pre-training model;
the effective voice classification module is used for inputting the effective voice features into a second pre-training model, and performing effective voice classification through the second pre-training model to obtain a classification result sequence;
the effective voice segment output module is used for outputting the effective voice segment of the voice to be detected according to the classification result sequence, wherein the effective voice segment is a segment of the voice to be detected with the noise removed;
wherein the apparatus further comprises:
the first to-be-trained model acquisition module is used for acquiring a first model to be trained based on an encoder-decoder structure;
the first pre-training model obtaining module is used for performing self-supervised model training on the first model to be trained in combination with the first training data and a first loss function, to obtain the first pre-training model for extracting effective voice features;
the apparatus further comprises:
the unlabeled sample voice acquisition module is used for acquiring unlabeled sample voice based on a voice recognition task;
the data feature enhancement module is used for converting the unlabeled sample voice to obtain a Mel spectrum matrix, and obtaining the enhanced data features of the unlabeled sample voice through shifting and masking in the time dimension of the Mel spectrum matrix and masking in the frequency dimension of the Mel spectrum matrix;
and the first training data obtaining module is used for taking the enhanced data features of the unlabeled sample voice as the first training data.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
8. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN202411031589.1A 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model Active CN119132337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411031589.1A CN119132337B (en) 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model

Publications (2)

Publication Number Publication Date
CN119132337A CN119132337A (en) 2024-12-13
CN119132337B true CN119132337B (en) 2025-11-11

Family

ID=93764599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411031589.1A Active CN119132337B (en) 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model

Country Status (1)

Country Link
CN (1) CN119132337B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119993202B (en) * 2025-01-23 2025-10-17 东北大学 A method for detecting sound events

Citations (2)

Publication number Priority date Publication date Assignee Title
CN115881103A (en) * 2022-11-23 2023-03-31 镁佳(北京)科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN115985347A (en) * 2023-02-22 2023-04-18 南方电网数字电网研究院有限公司 Speech endpoint detection method, device and computer equipment based on deep learning

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN111862953B (en) * 2019-12-05 2023-08-22 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
US11803758B2 (en) * 2020-04-17 2023-10-31 Microsoft Technology Licensing, Llc Adversarial pretraining of machine learning models
CN111816218B (en) * 2020-07-31 2024-05-28 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium
CN115762489B (en) * 2022-10-27 2025-11-04 阿里巴巴达摩院(杭州)科技有限公司 Data processing system and methods for speech recognition models, speech recognition methods
CN116504234B (en) * 2023-05-29 2023-10-13 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116564287A (en) * 2023-05-31 2023-08-08 中国人民解放军战略支援部队信息工程大学 Semi-supervised Speech Recognition Method Based on Pre-trained Model and Reinforcement Learning Fine-tuning
CN116913325B (en) * 2023-08-11 2025-01-10 广东省生态环境监测中心 Noise event detection method and device
CN118098220A (en) * 2024-03-20 2024-05-28 中国科学院声学研究所 End-to-end bilingual mixed speech recognition training method and system

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN115881103A (en) * 2022-11-23 2023-03-31 镁佳(北京)科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN115985347A (en) * 2023-02-22 2023-04-18 南方电网数字电网研究院有限公司 Speech endpoint detection method, device and computer equipment based on deep learning

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Deng et al. Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration
CN117475038A (en) An image generation method, device, equipment and computer-readable storage medium
CN112287672B (en) Text intent recognition method and device, electronic device, and storage medium
CN109493881A (en) A kind of labeling processing method of audio, device and calculate equipment
CN111933187B (en) Emotion recognition model training method and device, computer equipment and storage medium
CN112599123B (en) Lightweight speech keyword recognition network, method, device and storage medium
Shen et al. Knowledge distillation-based representation learning for short-utterance spoken language identification
CN119132337B (en) Effective voice detection method and device based on feature enhancement pre-training model
Kumar et al. Intelligent Audio Signal Processing for Detecting Rainforest Species Using Deep Learning.
CN110490304A (en) A kind of data processing method and equipment
CN116884435A (en) Voice event detection method and device based on audio prompt learning
Benamer et al. Database for Arabic speech commands recognition
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN117976006A (en) Audio processing method, device, computer equipment and storage medium
Wazir et al. Acoustic pornography recognition using recurrent neural network
Zhang et al. Learning audio sequence representations for acoustic event classification
Feng et al. Spatiotemporal prediction based on feature classification for multivariate floating-point time series lossy compression
CN114999525A (en) Light-weight environment voice recognition method based on neural network
Raj et al. Audio signal quality enhancement using multi-layered convolutional neural network based auto encoder–decoder
CN120148488A (en) Speech recognition model acquisition method, device, computer equipment, readable storage medium and program product
CN114664313B (en) Speech recognition method, device, computer equipment, storage medium and program product
CN115985347B (en) Voice endpoint detection method and device based on deep learning and computer equipment
Weychan et al. Implementation aspects of speaker recognition using Python language and Raspberry Pi platform
CN117174082A (en) Training and execution method, device, equipment and storage medium of voice wake-up model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant