
CN119132337B - Effective voice detection method and device based on feature enhancement pre-training model - Google Patents

Effective voice detection method and device based on feature enhancement pre-training model

Info

Publication number
CN119132337B
Authority
CN
China
Prior art keywords
voice
model
training
effective
training model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411031589.1A
Other languages
Chinese (zh)
Other versions
CN119132337A (en)
Inventor
吴石松
董召杰
李轩昂
梁寿愚
卢志良
陈柔伊
陈骞
赵必美
李紫京
苏立伟
刘振华
赵翔宇
郑桦
李成
冯勤宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Artificial Intelligence Technology Co ltd
Original Assignee
China Southern Power Grid Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Artificial Intelligence Technology Co ltd filed Critical China Southern Power Grid Artificial Intelligence Technology Co ltd
Priority to CN202411031589.1A priority Critical patent/CN119132337B/en
Publication of CN119132337A publication Critical patent/CN119132337A/en
Application granted granted Critical
Publication of CN119132337B publication Critical patent/CN119132337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/173Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract


This application relates to an effective speech detection method and apparatus based on a feature-enhanced pre-trained model. The method includes: acquiring speech to be detected containing different types of noise; inputting the speech to be detected into a first pre-trained model, and extracting effective speech features from the speech to be detected through the first pre-trained model, where the first training data used by the first pre-trained model is obtained by enhancing the data features of unlabeled sample speech; inputting the effective speech features into a second pre-trained model, and performing effective speech classification through the second pre-trained model to obtain a classification result sequence; and outputting effective speech segments of the speech to be detected based on the classification result sequence, the effective speech segments being speech segments from which noise has been removed. This method adapts to more application scenarios and noise types, effectively improves effective speech detection, and thereby enhances the performance of the speech recognition system.

Description

Effective voice detection method and device based on feature enhancement pre-training model
Technical Field
The present application relates to the field of speech processing technology, and in particular, to an effective speech detection method, apparatus, computer device, computer readable storage medium and computer program product based on a feature-enhanced pre-training model.
Background
With the development of speech recognition technology, its application in power production activities has become increasingly widespread, for example in voice analysis and processing on intelligent power customer-service platforms. However, the complexity of the actual application environment also presents a significant challenge to speech recognition technology.
In the related art, conventional speech recognition generally relies on VAD (Voice Activity Detection, also called effective speech detection) technology to remove the environmental noise contained in speech. Because noise types are complex and application scenarios numerous in practical environments, traditional methods struggle to remove certain noises completely, and the residual noise has a considerable influence on the performance of the speech recognition system.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an effective speech detection method, apparatus, computer device, computer readable storage medium, and computer program product based on a feature-enhanced pre-training model that can enhance the effective speech detection effect.
In a first aspect, the present application provides an effective speech detection method based on a feature-enhanced pre-training model, comprising:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In one embodiment, the outputting the valid voice segment of the voice to be detected according to the classification result sequence includes:
determining a starting time point and an ending time point of the effective voice frame in the classification result sequence;
and obtaining the effective voice fragment according to the sequence fragment corresponding to the starting time point and the ending time point of the effective voice frame.
In one embodiment, the method further comprises:
acquiring unlabeled sample voice based on a voice recognition task;
converting the unlabeled sample voice to obtain a Mel frequency spectrum matrix, and processing the matrix in the time dimension and the frequency dimension to obtain the enhanced data characteristics of the unlabeled sample voice;
and taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
In one embodiment, the method further comprises:
acquiring a first model to be trained based on an encoder-decoder structure;
And combining the first training data and a first loss function, and performing self-supervision model training on the first model to be trained to obtain the first pre-training model for extracting effective voice features, wherein the first loss function comprises contrast loss and diversity loss.
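The combination of a contrastive loss and a diversity loss resembles the objective used in wav2vec 2.0-style self-supervised pre-training. As a rough, hypothetical sketch only (the patent does not disclose its exact formulation), the two terms could look like the following; the function names, shapes, temperature, and weighting are all assumptions:

```python
import numpy as np

def contrastive_loss(context, quantized, temperature=0.1):
    """InfoNCE-style loss: each context vector c_t should be closer to its
    own quantized target q_t than to distractors taken from other frames."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    sim = (c @ q.T) / temperature                    # (T, T) cosine similarities
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))  # -log softmax of positives

def diversity_loss(code_probs):
    """Penalize uneven codebook usage: zero when the average code-assignment
    distribution is uniform, positive otherwise."""
    p = code_probs.mean(axis=0)                      # average usage of each code
    entropy = -np.sum(p * np.log(p + 1e-9))
    return float(np.log(len(p)) - entropy)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(50, 16))                      # 50 frames of 16-dim features
total = contrastive_loss(ctx, ctx) + 0.1 * diversity_loss(np.full((50, 8), 1 / 8))
```

The contrastive term pulls each context frame toward its own target and away from the other frames, while the diversity term vanishes exactly when all codebook entries are used uniformly.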
In one embodiment, the method further comprises:
acquiring labeled sample voice based on a voice recognition task, and taking the enhanced data characteristics of the labeled sample voice as the second training data;
and inputting the second training data into the first pre-training model for feature extraction processing, to obtain the sample effective voice features.
In one embodiment, the method further comprises:
Acquiring a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model;
training the effective voice classification model according to a second loss function by taking the effective voice characteristics of the sample as input to obtain a classification result output model;
and combining the classification result output model and the effective voice fragment output module to obtain the second pre-training model.
In a second aspect, the present application further provides an effective speech detection apparatus based on a feature-enhanced pre-training model, including:
the to-be-detected voice acquisition module is used for acquiring to-be-detected voices containing different types of noise;
the effective voice feature extraction module is used for inputting the voice to be detected into a first pre-training model, extracting the effective voice feature of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data feature enhancement on unlabeled sample voice;
the effective voice classification module is used for inputting the effective voice characteristics into a second pre-training model, and performing effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and the effective voice segment output module is used for outputting the effective voice segments of the voice to be detected according to the classification result sequence, wherein the effective voice segments are voice segments from which the noise in the voice to be detected has been removed.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein the first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on unlabeled sample voice;
inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on labeled sample voice;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
According to the above method, apparatus, computer device, computer readable storage medium and computer program product for effective voice detection based on a feature-enhanced pre-training model, voice to be detected containing different types of noise is first acquired. The voice to be detected is then input into the first pre-training model, which extracts its effective voice features; the first training data adopted by the first pre-training model is obtained by performing data feature enhancement on unlabeled sample voice. The effective voice features are further input into the second pre-training model, which performs effective voice classification to obtain a classification result sequence; the second training data adopted by the second pre-training model is obtained by performing data feature enhancement on labeled sample voice. The classification result sequence characterizes, frame by frame, the probability that the voice is effective voice, and the effective voice segments of the voice to be detected are output according to the classification result sequence.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other related drawings may be obtained from them without inventive effort by those of ordinary skill in the art.
FIG. 1 is a flow diagram of an efficient speech detection method based on a feature-enhanced pre-training model in one embodiment;
FIG. 2 is a schematic diagram of an efficient speech detection process based on a feature-enhanced pre-training model in one embodiment;
FIG. 3a is a schematic diagram of a training process based on a feature-enhanced pre-training model in one embodiment;
FIG. 3b is a schematic diagram of a model structure in one embodiment;
FIG. 4 is a flow chart of an effective speech detection method based on a feature-enhanced pre-training model in another embodiment;
FIG. 5 is a block diagram of an active speech detection device based on a feature-enhanced pre-training model in one embodiment;
fig. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In an exemplary embodiment, as shown in fig. 1, an effective speech detection method based on a feature-enhanced pre-training model is provided. The method is described here as applied to a terminal by way of illustration; it may equally be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. In this embodiment, the method includes the following steps 101 to 104:
Step 101, obtaining the voice to be detected containing different types of noise.
The voice to be detected can be obtained through a voice recognition system whose processing pipeline includes an effective voice detection stage. Such a system can be applied in fields such as intelligent customer-service voice quality inspection and analysis, intelligent voice conference systems, and multimedia audio analysis.
As an example, the different types of noise may be various types of noise in a practical application environment, such as ambient music, ambient human voice, channel noise, and the like.
In practical application, as shown in fig. 2, taking the test stage as an example, an input test voice may be used as the voice to be detected, on which effective voice detection is then performed based on the feature-enhanced pre-training model.
Step 102, inputting the voice to be detected into a first pre-training model, extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model are obtained by carrying out data characteristic enhancement on voice without marked samples.
The first pre-training model can be a model obtained through data enhancement and self-supervised pre-training with a specific algorithm. The first training data is obtained by applying data enhancement to unlabeled sample voice, transforming original features into enhanced features; training on these enhanced features strengthens the robustness of the pre-training model.
In a specific implementation, a trained pre-training model (i.e., a first pre-training model) may be used as a feature extractor, such as the robust VAD feature extraction module based on the pre-training model in fig. 2, and by inputting the voice to be detected, the first pre-training model may be used to extract the effective voice features of the voice to be detected.
And step 103, inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voice.
The classification result sequence may be used to characterize a probability of whether the speech of each frame in the speech to be detected is a valid speech, for example, frame-by-frame processing may be performed on the speech to be detected to determine whether each frame is a valid speech.
After the effective voice features are obtained, the trained classifier model (i.e., the second pre-training model) can be used to perform effective voice classification. With the extracted effective voice features as input, the neural-network-classifier-based module in fig. 2 can output a probability sequence (i.e., the classification result sequence) indicating whether each frame is effective voice.
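As a minimal, hypothetical illustration of this classification step (not the patent's actual network), a single logistic output layer mapping per-frame features to a per-frame effective-speech probability can be sketched as follows; the shapes and parameters are assumptions:

```python
import numpy as np

def frame_probabilities(features, weights, bias):
    """Map per-frame feature vectors of shape (T, D) to a per-frame probability
    of being effective speech via a logistic output layer. In practice the
    weights would come from fine-tuning on labeled, feature-enhanced data."""
    logits = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid: one probability per frame

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 32))         # 100 frames of extracted features
probs = frame_probabilities(feats, rng.normal(size=32), 0.0)
```

The resulting `probs` plays the role of the classification result sequence consumed by the segment output step.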
In an example, the second training data is obtained by applying data enhancement to labeled sample voice, and can be used to train the second pre-training model. Specifically, since the trained pre-training model can serve as a robust voice activity detection feature extractor (i.e., the first pre-training model), the first pre-training model performs feature extraction on the second training data during training, and a nonlinear neural network classifier can then be obtained for judging effective voice.
Step 104, outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments from which the noise in the voice to be detected has been removed.
After the classification result sequence is obtained, it is input to the effective voice segment output module in the second pre-training model, so that the starting time point and the ending time point of each effective voice segment in the voice to be detected can be determined, and the effective voice segments with the noise removed can then be obtained. Performing effective voice detection based on the feature-enhanced pre-training model in this way can effectively improve the performance of a voice recognition system and thereby the voice recognition effect.
According to the effective voice detection method based on the feature-enhanced pre-training model, voice to be detected containing different types of noise is acquired and input into the first pre-training model, which extracts its effective voice features. These features are input into the second pre-training model, which performs effective voice classification to obtain a classification result sequence, and the effective voice segments of the voice to be detected are output according to that sequence. This optimizes effective voice detection: the feature-enhanced pre-training models strengthen the robustness of the detection model, adapt it to more application scenarios and noise types, effectively improve the detection effect, and thereby improve the performance of the voice recognition system.
In an exemplary embodiment, the outputting the valid speech segment of the speech to be detected according to the classification result sequence may include the following steps:
And obtaining the effective voice fragments according to the sequence fragments corresponding to the starting time point and the ending time point of the effective voice frame.
In practical application, an endpoint search algorithm can be adopted to determine the start point and the tail point of each effective voice segment in the classification result sequence. For example, when the number of consecutive effective voice frames following a detected effective voice frame exceeds a threshold, that frame can be confirmed as the start point of an effective voice segment (i.e., the starting time point of the effective voice frames); when the number of consecutive noise frames following a noise frame exceeds a threshold, that noise frame can be confirmed as the tail point of the effective voice segment (i.e., the ending time point of the effective voice frames).
In this embodiment, the starting time point and the ending time point of the valid voice frame are determined in the classification result sequence, so that the valid voice fragment is obtained according to the sequence fragments corresponding to the starting time point and the ending time point of the valid voice frame, and the valid voice fragment can be effectively determined.
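The start/end search described above can be sketched as a simple state machine over the classification result sequence. This is an illustrative reconstruction under assumed parameters (the 0.5 probability cut-off and the frame-count thresholds are not specified by the patent):

```python
def find_segments(probs, speech_thresh=5, noise_thresh=8, p=0.5):
    """Turn a per-frame probability sequence into (start, end) frame indices.

    A frame becomes a segment start once `speech_thresh` consecutive frames
    score above p; a segment ends once `noise_thresh` consecutive frames
    score at or below p.
    """
    segments, start, run = [], None, 0
    for i, prob in enumerate(probs):
        if start is None:
            run = run + 1 if prob > p else 0
            if run >= speech_thresh:             # confirmed start point
                start, run = i - speech_thresh + 1, 0
        else:
            run = run + 1 if prob <= p else 0
            if run >= noise_thresh:              # confirmed tail point
                segments.append((start, i - noise_thresh))
                start, run = None, 0
    if start is not None:                        # segment runs to the end
        segments.append((start, len(probs) - 1))
    return segments

probs = [0.9] * 10 + [0.1] * 10 + [0.9] * 10
segments = find_segments(probs)                  # [(0, 9), (20, 29)]
```

Requiring a run of frames before confirming a start or tail point makes the segmentation robust to isolated misclassified frames.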
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining unlabeled sample voice based on a voice recognition task, converting the unlabeled sample voice into a Mel frequency spectrum matrix, processing the matrix in the time dimension and the frequency dimension to obtain the enhanced data characteristics of the unlabeled sample voice, and taking these enhanced data characteristics as the first training data.
In a specific implementation, as shown in fig. 3a, for the training stage, the overall flow of the effective speech detection system based on the feature-enhanced pre-training model may include: an unlabeled-training-data enhancement module based on a specific algorithm, a large-model pre-training module based on unlabeled data, a robust VAD feature extraction module based on the pre-training model, a labeled-data enhancement module based on a specific algorithm, a neural-network-based effective speech classifier finetune (fine-tuning) module, and a neural-network-classifier-based effective speech segment output module.
For example, based on a historical speech recognition task, unlabeled sample speech can be acquired through the speech recognition system; the unlabeled-training-data enhancement module then applies data enhancement to this speech, and the enhanced unlabeled data (i.e., the first training data) is further input into the large-model pre-training module for model training.
In one example, a data enhancement method at the log mel-spectrogram level may be employed by converting an audio segment (i.e., the unlabeled sample speech) into a mel spectrum matrix of size V×τ, where V represents the frequency dimension and τ represents the time dimension. The following steps can then be used:
1. Zero mean normalization x-x.mean () can be performed on the mel spectrum, so that when masking is performed subsequently, the masking position can be set to 0 directly, and the method is equivalent to filling the mean of the matrix;
2. for time dimension translation, horizontal left-right torsion can be performed on the frequency spectrum;
3. For a time dimension mask, if the maximum range of the time dimension continuous mask is T, a uniform sampling of T can be performed within the range of [0, T ], and then the sampling can be performed within the range of [0, T ] Randomly determining a point t 0 in the range, and then continuously performing t times of masking (such as setting the matrix value to 0) along the time axis from the position t 0;
4. For the frequency dimension mask, if the maximum range of the time dimension continuous mask is F, a uniform sampling of F can be performed within the range of [0,F ], a point F 0 can be randomly determined within the range of [0, v-F ], and then F times of masking can be continuously performed along the time axis from the position F 0 (for example, the matrix value is set to 0).
Therefore, after data enhancement processing, the original features can be transformed into enhanced features, which is beneficial to enhancing the robustness of subsequent model training.
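The four enhancement steps above can be sketched in NumPy as follows. This is a minimal illustrative sketch: the default maximum shift and mask sizes are assumptions, not values specified by the source.

```python
import numpy as np

def augment_mel(mel, max_shift=5, max_t=10, max_f=8, rng=None):
    """SpecAugment-style enhancement of a (V, tau) log-mel matrix:
    zero-mean normalization, time translation, a time-dimension mask,
    and a frequency-dimension mask. Masked cells are set to 0, which
    after normalization equals filling with the matrix mean."""
    if rng is None:
        rng = np.random.default_rng()
    v, total_t = mel.shape
    x = mel - mel.mean()                           # 1. zero-mean normalization
    shift = int(rng.integers(-max_shift, max_shift + 1))
    x = np.roll(x, shift, axis=1)                  # 2. horizontal time translation
    t = int(rng.integers(0, max_t + 1))            # 3. width t ~ U[0, T]
    t0 = int(rng.integers(0, total_t - t + 1))     #    start t0 in [0, tau - t]
    x[:, t0:t0 + t] = 0.0                          #    mask t consecutive frames
    f = int(rng.integers(0, max_f + 1))            # 4. height f ~ U[0, F]
    f0 = int(rng.integers(0, v - f + 1))           #    start f0 in [0, V - f]
    x[f0:f0 + f, :] = 0.0                          #    mask f consecutive bins
    return x
```

The output has the same shape as the input and can be used directly as an enhanced feature for pre-training.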
In this embodiment, by acquiring the unlabeled sample speech based on the speech recognition task, then converting the unlabeled sample speech to obtain the mel spectrum matrix, and processing the unlabeled sample speech in the time dimension and the frequency dimension of the mel spectrum matrix to obtain the enhanced data feature of the unlabeled sample speech, further using the enhanced data feature of the unlabeled sample speech as the first training data, the data support can be provided for further model training.
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining a first model to be trained based on a coder and decoder structure, combining the first training data with a first loss function, and performing self-supervision model training on the first model to be trained to obtain a first pre-training model for extracting effective voice features, wherein the first loss function comprises contrast loss and diversity loss.
In an example, feature enhanced features (i.e., first training data) may be employed for unsupervised pre-training, and a feature enhanced pre-trained large model, i.e., a first pre-training model, may be obtained by self-supervised pre-training using a pre-training model.
Optionally, for the large-model pre-training process based on feature enhancement of unlabeled data, the network structure of the pre-trained large model (i.e., the first model to be trained) adopted is shown in fig. 3b, where the context network part uses a Transformer (encoder-decoder) structure. The feature vectors extracted by the feature encoder can, on one hand, be input directly into the context Transformer network and, on the other hand, be quantized by a quantization module for the subsequent calculation of the loss function (i.e., comparing continuous inputs against quantized targets).
For example, the vector Z output by the encoder network may be discretized by product quantization: the vector Z may be split into G subspaces (each subspace corresponding to a codebook). If each codebook has V entries, the length of each entry is d/G, and the entry most similar to the input sub-vector can be found in each codebook by Gumbel-softmax or a clustering method, so that the discretized vectors output by the codebooks can be concatenated to obtain a d-dimensional quantized version of Z. The main effect of the quantization process is to compress the feature vector and remove redundancy; meanwhile, clustering within each subspace makes the features more robust and less susceptible to small disturbances.
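The product quantization described above can be sketched as follows. This illustrative sketch uses a hard nearest-neighbour lookup per codebook in place of the differentiable Gumbel-softmax selection used during training; the codebook shapes are assumptions.

```python
import numpy as np

def product_quantize(z, codebooks):
    """Product quantization of a d-dimensional vector z with G codebooks.
    codebooks has shape (G, V, d // G): V entries per codebook. z is split
    into G sub-vectors and each is replaced by its most similar codebook
    entry; the results are concatenated back into a d-dimensional vector."""
    g, v, sub_d = codebooks.shape
    parts = z.reshape(g, sub_d)                          # split z into G subspaces
    quantized = []
    for i in range(g):
        dists = np.linalg.norm(codebooks[i] - parts[i], axis=1)
        quantized.append(codebooks[i][np.argmin(dists)])  # most similar entry
    return np.concatenate(quantized)                     # quantized d-dim vector
```

Replacing each sub-vector by a shared codebook entry is what gives the compression and redundancy-removal effect described above.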
In yet another example, the first loss function may include two parts, a contrast loss and a diversity loss, and the final loss value may be obtained as a weighted sum of the two parts.
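A sketch of such a two-part loss is given below. The exact forms are assumptions: an InfoNCE-style contrastive term with a temperature kappa, a diversity term equal to the negative entropy of codebook usage, and the weight alpha are all illustrative choices, not values fixed by the source.

```python
import numpy as np

def contrastive_loss(c, q_pos, q_negs, kappa=0.1):
    """InfoNCE-style contrast term: the context output c should be closer
    (by cosine similarity) to the true quantized target q_pos than to the
    K distractor targets q_negs (shape (K, d))."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(c, q_pos)] + [cos(c, q) for q in q_negs]) / kappa
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                 # low when the true target dominates

def diversity_loss(codebook_counts):
    """Diversity term: negative entropy of the average codebook-entry
    usage distribution, encouraging all entries to be used equally."""
    p = np.asarray(codebook_counts, dtype=float)
    p = p / p.sum()
    return float(np.sum(p * np.log(p + 1e-12)))

def total_loss(c, q_pos, q_negs, codebook_counts, alpha=0.1):
    # weighted sum of the two parts, as described in the text
    return contrastive_loss(c, q_pos, q_negs) + alpha * diversity_loss(codebook_counts)
```

Uniform codebook usage minimizes the diversity term, so the weighted sum pushes the model toward both accurate and diverse quantization.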
In an alternative embodiment, when applied to a downstream task (such as an effective speech detection task), a linear layer may be added on top of the pre-trained model for fine tuning: the parameters of the linear layer and of the Transformer portion may be updated, while the parameters of the encoder portion may be frozen and kept unchanged.
In this embodiment, by acquiring a first model to be trained based on the encoder and decoder structures, and then performing self-supervised model training on the first model to be trained in combination with the first training data and the first loss function to obtain the first pre-training model for extracting effective speech features, a feature-enhancement-based pre-training model can be obtained, which effectively improves the robustness of the effective speech detection model.
In an exemplary embodiment, the method may further include the steps of:
and obtaining the marked sample voice based on the voice recognition task, taking the enhanced data characteristic of the marked sample voice as the second training data, and obtaining the effective voice characteristic of the sample by inputting the second training data into the first pre-training model for characteristic extraction processing.
In practical application, as shown in fig. 3a, for the robust effective speech detection feature extraction process based on the pre-training model, data enhancement can be performed on the labeled sample speech to obtain the second training data, and the trained pre-training model can be used as a robust feature extractor to extract robust characterization vectors from the second training data. In this way, the data-enhanced acoustic features can be input into the trained pre-training model, and the output vectors obtained (i.e., the sample effective speech features) can robustly characterize effective speech detection.
In this embodiment, the enhanced data features of the labeled sample speech are used as the second training data by obtaining the labeled sample speech based on the speech recognition task, and then the second training data is input into the first pre-training model to perform feature extraction processing, so as to obtain effective speech features of the sample, and provide data support for further classifier model training.
In an exemplary embodiment, the method may further include the steps of:
The method comprises the steps of obtaining a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model, training the effective voice classification model according to a second loss function by taking effective voice characteristics of a sample as input to obtain a classification result output model, wherein the second loss function comprises a cross entropy function, and combining the classification result output model and an effective voice fragment output module to obtain a second pre-training model.
In a specific implementation, the extracted sample effective speech features can be used to train a neural-network effective speech classifier: an effective speech classification model based on a neural network is trained with labeled data carrying effective speech segment labels and used as the classification result output model. Segment judgment of effective speech is then performed according to the result sequence output by the classifier, and the starting time and ending time of the effective speech can be output.
In one example, as shown in fig. 3a, for the neural-network-based effective speech classifier fine-tuning process, accurately labeled data (i.e., labeled sample speech) can be input into the robust effective speech detection feature extraction module based on the pre-training model (i.e., the first pre-training model), and the extracted effective speech detection features can then be fed to the neural network as input. The neural network may be a fully connected neural network, a time-delay neural network, or a convolutional neural network, and a cross-entropy function may be selected as the loss function (i.e., the second loss function) to fine-tune the network, so that it can judge frame by frame whether the input speech is effective speech.
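The frame-by-frame fine-tuning step can be sketched as follows. This minimal sketch substitutes a single linear layer trained by plain gradient descent on the cross-entropy loss for the fully connected / time-delay / convolutional network options; the pre-trained feature extractor is treated as frozen, and the learning rate and epoch count are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def finetune_classifier(features, labels, epochs=200, lr=0.5):
    """Fine-tune a linear classification head on frozen pre-trained
    features with cross-entropy loss. features: (N, d) robust feature
    vectors; labels: (N,) in {0, 1} (invalid / valid speech per frame)."""
    n, d = features.shape
    w, b = np.zeros((d, 2)), np.zeros(2)
    onehot = np.eye(2)[labels]
    for _ in range(epochs):
        probs = softmax(features @ w + b)   # frame-by-frame class probabilities
        grad = (probs - onehot) / n         # gradient of mean cross-entropy
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def classify_frames(features, w, b):
    """Frame-by-frame valid/invalid decision (the classification result sequence)."""
    return softmax(features @ w + b).argmax(axis=1)
```

The `classify_frames` output is the per-frame label sequence that the segment output module then post-processes.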
In yet another example, for the valid voice segment output process based on the neural network classifier, after obtaining the label sequence (valid voice, invalid voice) calculated by the neural network classifier, burrs in the sequence (such as very short speech runs inside silence segments or very short silence gaps inside speech segments) can be located and removed according to a set threshold, so that the rationality of the valid voice detection segments can be ensured.
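The burr-removal step can be sketched as follows. The minimum-duration thresholds (in frames) are illustrative assumptions; the source only states that burrs are removed according to a set threshold.

```python
def remove_burrs(labels, min_speech=3, min_silence=3):
    """Smooth a frame-label sequence (1 = valid speech, 0 = invalid):
    interior speech runs shorter than min_speech frames and interior
    silence gaps shorter than min_silence frames are flipped."""
    labels = list(labels)
    # collect (value, start, length) runs of identical labels
    runs, i = [], 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        runs.append((labels[i], i, j - i))
        i = j
    out = labels[:]
    for k, (val, start, length) in enumerate(runs):
        if 0 < k < len(runs) - 1:          # only interior runs can be burrs
            limit = min_speech if val == 1 else min_silence
            if length < limit:
                out[start:start + length] = [1 - val] * length
    return out
```

Runs touching the sequence boundary are left untouched, since there is no surrounding context to justify flipping them.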
The technical scheme of this embodiment addresses the problems of diverse usage scenarios and complex noise faced by effective speech detection in practical applications. By utilizing data enhancement with a specific algorithm together with a self-supervised pre-training model, a pre-training model with robust effective-speech-detection feature extraction capability can be obtained, so that the performance of effective speech detection can be effectively improved through the training of a nonlinear classifier.
In this embodiment, by acquiring the second model to be trained based on the neural network, then using the effective speech feature of the sample as input, training the effective speech classification model according to the second loss function to obtain a classification result output model, and further combining the classification result output model and the effective speech segment output module to obtain a second pre-training model, the performance of effective speech detection can be improved to improve the performance of the speech recognition system.
In one exemplary embodiment, as shown in FIG. 4, a flow diagram of another method for efficient speech detection based on a feature-enhanced pre-training model is provided. In this embodiment, the method includes the steps of:
In step 401, unlabeled sample speech based on a speech recognition task is obtained, a mel spectrum matrix is obtained according to conversion of the unlabeled sample speech, enhanced data features of the unlabeled sample speech are obtained by processing in a time dimension and a frequency dimension of the mel spectrum matrix, and the enhanced data features of the unlabeled sample speech are used as first training data.

In step 402, a first model to be trained based on the encoder and decoder structure is obtained, and the first model to be trained is subjected to self-supervision model training in combination with the first training data and the first loss function, so as to obtain a first pre-training model for extracting effective speech features.

In step 403, labeled sample speech based on the speech recognition task is obtained, the enhanced data feature of the labeled sample speech is used as second training data, and the second training data is input into the first pre-training model to perform feature extraction processing, so as to obtain effective speech features of the sample.

In step 404, a second model to be trained based on the neural network is obtained, the effective speech characteristics of the sample are used as input, the effective speech classification model is trained according to the second loss function, a classification result output model is obtained, and the classification result output model and the effective speech segment output module are combined to obtain a second pre-training model.

In step 405, the to-be-detected voice containing different types of noise is obtained, the to-be-detected voice is input into the first pre-training model, and the effective voice characteristics of the to-be-detected voice are extracted through the first pre-training model.
In step 406, the valid speech features are input to a second pre-training model, and valid speech classification is performed by the second pre-training model, resulting in a classification result sequence.

In step 407, in the classification result sequence, a start time point and an end time point of the valid voice frame are determined, and the valid voice fragment is obtained according to the sequence fragments corresponding to the start time point and the end time point of the valid voice frame.

It should be noted that, for specific limitations of the above steps, reference may be made to the specific limitations of the effective speech detection method based on the feature-enhanced pre-training model described above, which are not repeated here.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times; the order of execution of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides an effective voice detection device based on the feature enhancement pre-training model, which is used for realizing the effective voice detection method based on the feature enhancement pre-training model. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the effective speech detection device based on the feature-enhanced pre-training model provided below may be referred to above for the limitation of the effective speech detection method based on the feature-enhanced pre-training model, which is not described herein.
In one exemplary embodiment, as shown in fig. 5, there is provided an effective speech detection apparatus based on a feature-enhanced pre-training model, comprising:
The to-be-detected voice obtaining module 501 is configured to obtain to-be-detected voices containing different types of noise;
The effective voice feature extraction module 502 is configured to input the voice to be detected into a first pre-training model, extract effective voice features of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by performing data feature enhancement on unlabeled sample voice;
An effective speech classification module 503, configured to input the effective speech feature into a second pre-training model, and perform effective speech classification through the second pre-training model to obtain a classification result sequence; the second training data adopted by the second pre-training model is obtained by carrying out data characteristic enhancement on marked sample voices, and the classification result sequence is used for representing the probability of whether the voices of each frame in the voices to be detected are effective voices or not;
and the effective voice segment output module 504 is configured to output an effective voice segment of the voice to be detected according to the classification result sequence, where the effective voice segment is a voice segment for removing noise in the voice to be detected.
In one embodiment, the active speech segment output module 504 includes:
A time point determining sub-module, configured to determine a start time point and an end time point of the valid voice frame in the classification result sequence;
The effective voice segment obtaining submodule is used for obtaining the effective voice segment according to the sequence segment corresponding to the starting time point and the ending time point of the effective voice frame.
In one embodiment, the apparatus further comprises:
the non-labeling sample voice acquisition module is used for acquiring non-labeling sample voice based on voice recognition tasks;
The data characteristic enhancement module is used for obtaining a Mel frequency spectrum matrix according to the voice conversion of the unlabeled sample, and obtaining the enhanced data characteristic of the voice of the unlabeled sample through processing in the time dimension and the frequency dimension of the Mel frequency spectrum matrix;
And the first training data obtaining module is used for taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
In one embodiment, the apparatus further comprises:
The first model to be trained acquisition module is used for acquiring a first model to be trained based on the encoder and decoder structure;
The first pre-training model obtaining module is used for combining the first training data and a first loss function, performing self-supervision model training on the first model to be trained to obtain the first pre-training model for extracting effective voice features, wherein the first loss function comprises a comparison loss and a diversity loss.
In one embodiment, the apparatus further comprises:
the second training data obtaining module is used for obtaining marked sample voice based on a voice recognition task and taking the reinforced data characteristics of the marked sample voice as the second training data;
And the sample effective voice feature obtaining module is used for obtaining sample effective voice features by inputting the second training data into the first pre-training model to perform feature extraction processing.
In one embodiment, the apparatus further comprises:
The system comprises a first training model acquisition module, a second training model acquisition module and a training module, wherein the first training model acquisition module is used for acquiring a first training model based on a neural network;
the classification model training module is used for training the effective voice classification model according to a second loss function by taking the effective voice characteristics of the sample as input to obtain a classification result output model;
and the second pre-training model obtaining module is used for combining the classification result output model and the effective voice segment output module to obtain the second pre-training model.
The various modules in the effective voice detection device based on the feature enhancement pre-training model can be fully or partially implemented by software, hardware and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In an exemplary embodiment, a computer device, which may be a terminal, is provided, and an internal structure diagram thereof may be as shown in fig. 6. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, Near Field Communication (NFC), or other technologies. The computer program, when executed by a processor, implements an effective speech detection method based on a feature-enhanced pre-training model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one exemplary embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the processor, when executing the computer program, further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the computer program when executed by the processor further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
and outputting the effective voice fragments of the voice to be detected according to the classification result sequence, wherein the effective voice fragments are voice fragments for removing noise in the voice to be detected.
In one embodiment, the computer program when executed by the processor further implements the steps of the efficient speech detection method based on the feature-enhanced pre-training model in the other embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, an artificial intelligence (AI) processor, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the present application.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (8)

1. A method for efficient speech detection based on a feature-enhanced pre-training model, the method comprising:
acquiring voices to be detected containing different types of noise;
Inputting the voice to be detected into a first pre-training model, and extracting effective voice characteristics of the voice to be detected through the first pre-training model, wherein first training data adopted by the first pre-training model is obtained by adopting a data enhancement method of a log Mel sound spectrum layer and carrying out data characteristic enhancement on voice without a marked sample;
the method comprises the steps of inputting the effective voice characteristics into a second pre-training model, and carrying out effective voice classification through the second pre-training model to obtain a classification result sequence, wherein second training data adopted by the second pre-training model are obtained by carrying out data characteristic enhancement on marked sample voices;
Outputting an effective voice fragment of the voice to be detected according to the classification result sequence, wherein the effective voice fragment is a voice fragment for removing noise in the voice to be detected;
Wherein the method further comprises:
obtaining a first model to be trained based on the structures of an encoder and a decoder, wherein the first model to be trained comprises a quantization module for quantizing the feature vector;
Performing self-supervision model training on the first model to be trained by combining the first training data and a first loss function to obtain the first pre-training model for extracting effective voice characteristics, wherein the first loss function comprises contrast loss and diversity loss;
the method further comprises the steps of:
acquiring unlabeled sample voice based on voice recognition task;
converting the non-marked sample voice to obtain a Mel frequency spectrum matrix, and performing translation and masking processing on the time dimension of the Mel frequency spectrum matrix and masking processing on the frequency dimension of the Mel frequency spectrum matrix to obtain enhanced data characteristics of the non-marked sample voice;
and taking the enhanced data characteristics of the unlabeled sample voice as the first training data.
2. The method according to claim 1, wherein outputting the valid speech segments of the speech to be detected according to the classification result sequence comprises:
determining a starting time point and an ending time point of the effective voice frame in the classification result sequence;
and obtaining the effective voice fragment according to the sequence fragment corresponding to the starting time point and the ending time point of the effective voice frame.
3. The method according to claim 1, wherein the method further comprises:
acquiring labeled sample voice based on a voice recognition task, and taking the enhanced data features of the labeled sample voice as the second training data;
and inputting the second training data into the first pre-training model for feature extraction processing, to obtain sample effective voice features.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a second model to be trained based on a neural network, wherein the second model to be trained comprises an effective voice classification model;
training the effective voice classification model according to a second loss function by taking the sample effective voice features as input, to obtain a classification result output model;
and combining the classification result output model and the effective voice segment output module to obtain the second pre-training model.
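Claim 4 trains the effective voice classification model on sample effective voice features with a second loss function. As a minimal sketch, assume the classifier head is logistic regression per frame and the second loss function is binary cross-entropy; the patent's actual network and loss may differ.

```python
import numpy as np

def train_frame_classifier(feats, labels, lr=0.1, epochs=200):
    """Minimal effective-voice frame classifier: logistic regression
    trained with binary cross-entropy (one possible 'second loss
    function'). feats: (n_frames, dim) sample effective voice features;
    labels: (n_frames,) 0/1 effective-voice labels. Returns (w, b).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = feats @ w + b
        p = 1.0 / (1.0 + np.exp(-z))        # sigmoid probability of "effective"
        grad = p - labels                    # gradient of BCE w.r.t. z
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def classify(feats, w, b):
    """Return the classification result sequence (0/1 per frame)."""
    return (feats @ w + b > 0).astype(int)
```

Combining `classify` with a segment-output step (claim 2) mirrors how the classification result output model and the effective voice segment output module are combined into the second pre-training model.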
5. An efficient speech detection apparatus based on a feature-enhanced pre-training model, the apparatus comprising:
the to-be-detected voice acquisition module is used for acquiring to-be-detected voices containing different types of noise;
the effective voice feature extraction module is used for inputting the voice to be detected into a first pre-training model, and extracting the effective voice features of the voice to be detected through the first pre-training model;
the effective voice classification module is used for inputting the effective voice features into a second pre-training model, and performing effective voice classification through the second pre-training model to obtain a classification result sequence;
the effective voice segment output module is used for outputting the effective voice segment of the voice to be detected according to the classification result sequence, wherein the effective voice segment is a segment of the voice to be detected with the noise removed;
wherein the apparatus further comprises:
the first to-be-trained model acquisition module is used for acquiring a first model to be trained based on an encoder-decoder structure;
the first pre-training model obtaining module is used for performing self-supervised model training on the first model to be trained in combination with the first training data and a first loss function, to obtain the first pre-training model for extracting effective voice features;
the apparatus further comprises:
the unlabeled sample voice acquisition module is used for acquiring unlabeled sample voice based on a voice recognition task;
the data feature enhancement module is used for converting the unlabeled sample voice to obtain a Mel spectrum matrix, and obtaining the enhanced data features of the unlabeled sample voice through shifting and masking in the time dimension of the Mel spectrum matrix and masking in the frequency dimension of the Mel spectrum matrix;
and the first training data obtaining module is used for taking the enhanced data features of the unlabeled sample voice as the first training data.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
8. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN202411031589.1A 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model Active CN119132337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411031589.1A CN119132337B (en) 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model

Publications (2)

Publication Number Publication Date
CN119132337A CN119132337A (en) 2024-12-13
CN119132337B true CN119132337B (en) 2025-11-11

Family

ID=93764599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411031589.1A Active CN119132337B (en) 2024-07-30 2024-07-30 Effective voice detection method and device based on feature enhancement pre-training model

Country Status (1)

Country Link
CN (1) CN119132337B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119993202B (en) * 2025-01-23 2025-10-17 东北大学 A method for detecting sound events

Citations (2)

Publication number Priority date Publication date Assignee Title
CN115881103A (en) * 2022-11-23 2023-03-31 镁佳(北京)科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN115985347A (en) * 2023-02-22 2023-04-18 南方电网数字电网研究院有限公司 Speech endpoint detection method, device and computer equipment based on deep learning

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN111862953B (en) * 2019-12-05 2023-08-22 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
US11803758B2 (en) * 2020-04-17 2023-10-31 Microsoft Technology Licensing, Llc Adversarial pretraining of machine learning models
CN111816218B (en) * 2020-07-31 2024-05-28 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium
CN115762489B (en) * 2022-10-27 2025-11-04 阿里巴巴达摩院(杭州)科技有限公司 Data processing system and methods for speech recognition models, speech recognition methods
CN116504234B (en) * 2023-05-29 2023-10-13 镁佳(北京)科技有限公司 Method, device, equipment and medium for generating voice awakening and detecting model
CN116564287A (en) * 2023-05-31 2023-08-08 中国人民解放军战略支援部队信息工程大学 Semi-supervised Speech Recognition Method Based on Pre-trained Model and Reinforcement Learning Fine-tuning
CN116913325B (en) * 2023-08-11 2025-01-10 广东省生态环境监测中心 Noise event detection method and device
CN118098220A (en) * 2024-03-20 2024-05-28 中国科学院声学研究所 End-to-end bilingual mixed speech recognition training method and system

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN115881103A (en) * 2022-11-23 2023-03-31 镁佳(北京)科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN115985347A (en) * 2023-02-22 2023-04-18 南方电网数字电网研究院有限公司 Speech endpoint detection method, device and computer equipment based on deep learning

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Deng et al. Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration
CN117475038A (en) An image generation method, device, equipment and computer-readable storage medium
CN112287672B (en) Text intent recognition method and device, electronic device, and storage medium
CN109493881A (en) A kind of labeling processing method of audio, device and calculate equipment
CN111933187B (en) Emotion recognition model training method and device, computer equipment and storage medium
CN112599123B (en) Lightweight speech keyword recognition network, method, device and storage medium
Shen et al. Knowledge distillation-based representation learning for short-utterance spoken language identification
CN119132337B (en) Effective voice detection method and device based on feature enhancement pre-training model
Kumar et al. Intelligent Audio Signal Processing for Detecting Rainforest Species Using Deep Learning.
CN110490304A (en) A kind of data processing method and equipment
CN116884435A (en) Voice event detection method and device based on audio prompt learning
Benamer et al. Database for Arabic speech commands recognition
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN117976006A (en) Audio processing method, device, computer equipment and storage medium
Wazir et al. Acoustic pornography recognition using recurrent neural network
Zhang et al. Learning audio sequence representations for acoustic event classification
Feng et al. Spatiotemporal prediction based on feature classification for multivariate floating-point time series lossy compression
CN114999525A (en) Light-weight environment voice recognition method based on neural network
Raj et al. Audio signal quality enhancement using multi-layered convolutional neural network based auto encoder–decoder
CN120148488A (en) Speech recognition model acquisition method, device, computer equipment, readable storage medium and program product
CN114664313B (en) Speech recognition method, device, computer equipment, storage medium and program product
CN115985347B (en) Voice endpoint detection method and device based on deep learning and computer equipment
Weychan et al. Implementation aspects of speaker recognition using Python language and Raspberry Pi platform
CN117174082A (en) Training and execution method, device, equipment and storage medium of voice wake-up model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant