
CN114464168B - Speech processing model training method, speech data noise reduction method and device - Google Patents

Speech processing model training method, speech data noise reduction method and device Download PDF

Info

Publication number
CN114464168B
Authority
CN
China
Prior art keywords
voice
neural network
features
noise
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210225936.9A
Other languages
Chinese (zh)
Other versions
CN114464168A (en)
Inventor
关海欣
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202411883863.8A priority Critical patent/CN119724162A/en
Priority to CN202210225936.9A priority patent/CN114464168B/en
Publication of CN114464168A publication Critical patent/CN114464168A/en
Application granted granted Critical
Publication of CN114464168B publication Critical patent/CN114464168B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract


The present application discloses a training method for a speech processing model, and a method and device for reducing noise in speech data. The method includes: obtaining a speech data sample, wherein the sample includes multiple frames of speech data obtained by mixing clean speech with noise; obtaining label information corresponding to the speech data sample, wherein the label information is used to mark the pure speech features, noisy speech features and speech activity features in the sample; determining a preset neural network model; and training the preset neural network model using the speech data sample and the label information, so that the preset neural network model learns the correspondence between the pure speech features and the noisy speech features, and between the noisy speech features and the speech activity features, to obtain a speech processing model. During model training, the present application uses samples in which noise-free speech data is mixed with noise, and combines the speech activity features of the samples in training, which achieves better noise reduction performance under the same model computation budget.

Description

Speech processing model training method, speech data noise reduction method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method for a voice processing model, and a noise reduction method and device for voice data.
Background
In deep-learning-based noise reduction, the input is usually noisy voice and its transform features, and the output is clean voice and its transform features. A network model with a complex structure is usually required to obtain good noise reduction performance, and such a model consumes a large amount of computation and storage resources in use. This reduces the flexibility of the network model and makes it unsuitable for low-resource end-side devices.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides a training method of a speech processing model, a noise reduction method of speech data, and a device thereof.
According to an aspect of an embodiment of the present application, there is provided a training method of a speech processing model, including:
obtaining voice data samples, wherein the voice sample data includes multiple frames of voice data obtained by mixing clean voice with noise;
Acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
Determining a preset neural network model;
Training the preset neural network model using the voice data sample and the label information, so that the preset neural network model learns the correspondence between the pure voice features and the noisy voice features, and between the noisy voice features and the voice activity features, to obtain a voice processing model.
Further, the obtaining the voice data sample includes:
acquiring initial voice data, wherein the initial voice data is voice data carrying pure voice characteristics and voice activity characteristics;
determining various types of initial noise characteristics and preset signal-to-noise ratios;
and mixing the initial voice data and the initial noise characteristics according to the preset signal-to-noise ratio to obtain the voice data sample.
Further, the preset neural network model comprises a convolutional neural network, a first recurrent neural network, a second recurrent neural network, a first deep neural network and a second deep neural network.
The convolutional neural network is connected to the first recurrent neural network and to the second recurrent neural network; the first recurrent neural network is connected to the first deep neural network, and the second recurrent neural network is connected to the second deep neural network. The convolutional neural network is connected to the first recurrent neural network through a connection unit, and the second recurrent neural network is also connected to the first recurrent neural network through the same connection unit.
Further, training the preset neural network model by using the voice data sample and the label information, so that the preset neural network model learns the correspondence between the pure voice features and the noisy voice features, and between the noisy voice features and the voice activity features, to obtain a voice processing model, includes:
inputting the voice data sample into the preset neural network model, so that the convolutional neural network in the preset neural network model extracts first features of the voice data sample; inputting the first features into the connection unit and the second recurrent neural network respectively; detecting, by the second recurrent neural network, first voice activity features and first noisy voice features in the first features, and extracting the first voice activity features; inputting the first voice activity features into the connection unit and the second deep neural network respectively; concatenating, by the connection unit, the first features and the first voice activity features to obtain second features; inputting the second features into the first recurrent neural network; detecting, by the first recurrent neural network, second voice activity features and second noisy voice features in the second features, and extracting the second voice activity features; inputting the second voice activity features into the first deep neural network for feature superposition to obtain third features; and meanwhile determining the target value output by the second deep neural network;
Under the condition that the target value is used for indicating that the voice activity characteristic meets a preset characteristic, determining initial voice data corresponding to the voice data sample, and extracting a fourth characteristic corresponding to the initial voice data;
and calculating a loss function value based on the third feature and the fourth feature, and determining the preset neural network model as the voice processing model under the condition that the loss function value is smaller than a preset threshold value.
According to another aspect of the embodiment of the present application, there is also provided a noise reduction method for voice data, including:
acquiring original voice data to be processed;
inputting the original voice data into a pre-trained voice processing model, so that the voice processing model extracts the characteristics of the original voice data and outputs target values and target voice characteristics based on the characteristics;
And generating target voice data based on the target voice characteristics under the condition that the target value is used for representing that the target voice characteristics meet preset characteristics, wherein the preset characteristics are characteristics of voice data without noise.
Further, inputting the original voice data into a pre-trained voice processing model, so that the voice processing model outputs a target value and target voice features, includes:
inputting the original voice data into the voice processing model, so that the convolutional neural network in the voice processing model extracts original features of the original voice data; inputting the original features into the connection unit and the second recurrent neural network respectively; detecting, by the second recurrent neural network, original voice activity features and original noisy voice features in the original features, and extracting the original voice activity features; inputting the original voice activity features into the connection unit and the second deep neural network respectively; concatenating, by the connection unit, the original features and the original voice activity features to obtain concatenated features; inputting the concatenated features into the first recurrent neural network; detecting, by the first recurrent neural network, target voice activity features and target noisy voice features in the concatenated features, and extracting the target voice activity features; inputting the target voice activity features into the first deep neural network for feature superposition to obtain the target voice features; and determining the target value output by the second deep neural network.
According to another aspect of the embodiment of the present application, there is also provided a training apparatus for a speech processing model, including:
The first acquisition module is used for acquiring voice data samples, wherein the voice sample data includes multiple frames of voice data obtained by mixing clean voice with noise;
The second acquisition module is used for acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
the determining module is used for determining a preset neural network model;
The training module is used for training the preset neural network model using the voice data sample and the label information, so that the preset neural network model learns the correspondence between the pure voice features and the noisy voice features, and between the noisy voice features and the voice activity features, to obtain a voice processing model.
According to another aspect of the embodiment of the present application, there is also provided a noise reduction apparatus for voice data, including:
the acquisition module is used for acquiring the original voice data to be processed;
The extraction module is used for inputting the original voice data into a pre-trained voice processing model so that the voice processing model extracts the characteristics of the original voice data and outputs target values and target voice characteristics based on the characteristics;
And the processing module is used for generating target voice data based on the target voice characteristics under the condition that the target value is used for representing that the target voice characteristics meet preset characteristics, wherein the preset characteristics are characteristics of voice data without noise.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program which, when run, performs the steps of the above method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is used to store a computer program, and the processor is used to perform the steps of the above method by running the program stored on the memory.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the above method.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantage: during model training, samples obtained by mixing noise-free voice data with noise are used, and the voice activity features in the samples are combined in training, so that better noise reduction performance is achieved under the same model computation budget.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a training method of a speech processing model according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a preset neural network model according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for noise reduction of voice data according to another embodiment of the present application;
FIG. 4 is a block diagram of a training device for a speech processing model according to an embodiment of the present application;
FIG. 5 is a block diagram of a noise reduction device for voice data according to another embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present application; the illustrative embodiments and their descriptions are used to explain the present application and do not constitute undue limitations on it. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
It should be noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.
The embodiments of the present application provide a training method for a voice processing model, and a noise reduction method and device for voice data. The method provided by the embodiments can be applied to any suitable electronic device, for example a server or a terminal, without particular limitation; for convenience of description, it is hereinafter referred to simply as the electronic device.
According to an aspect of the embodiment of the application, a method embodiment of a training method of a speech processing model is provided. Fig. 1 is a flowchart of a training method of a speech processing model according to an embodiment of the present application, as shown in fig. 1, where the method includes:
Step S11, a voice data sample is obtained, wherein the voice sample data includes multiple frames of voice data obtained by mixing clean voice with noise.
In the embodiment of the present application, step S11, obtaining a voice data sample, includes the following steps A1-A3:
and A1, acquiring initial voice data, wherein the initial voice data are voice data carrying pure voice characteristics and voice activity characteristics.
The method provided by the embodiment of the present application is applied to a voice processing device, which may be a smartphone, a computer, or the like. The voice processing device may control a voice acquisition device, such as a recorder or a recording pen, to acquire the initial voice data. It will be appreciated that the initial voice data is voice data that carries no noise (i.e., clean voice data). The initial voice data carries pure voice features and voice activity features, the voice activity features being used to represent the sounds present in the initial voice data.
And step A2, determining various types of initial noise characteristics and preset signal-to-noise ratios.
In the embodiment of the present application, in order to construct voice data samples carrying noise, multiple types of noise data also need to be determined; these may include white noise, additive noise, and multiplicative noise. The preset signal-to-noise ratio may be a preset noise mixing ratio.
And step A3, mixing the initial voice data and the initial noise characteristics according to a preset signal-to-noise ratio to obtain a voice data sample.
In the embodiment of the present application, the voice processing device may mix the initial voice data and the noise data according to the preset signal-to-noise ratio to obtain the voice data sample; the sample obtained in this way consists of multiple frames of clean voice mixed with noise.
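As an illustration of this mixing step, a minimal NumPy sketch is given below. The function name `mix_at_snr` and the single-channel, equal-length handling are assumptions for illustration, not the patented implementation.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the preset SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    p_clean = np.mean(clean ** 2)          # clean signal power
    p_noise = np.mean(noise ** 2)          # noise power before scaling
    # Solve for gain g such that 10*log10(p_clean / (g^2 * p_noise)) == snr_db
    g = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + g * noise

# Example: mix a clean 440 Hz tone with white noise at a preset 10 dB SNR
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Training samples at several preset SNRs can then be produced by looping this function over the noise types and SNR values chosen in step A2.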
Step S12, label information corresponding to the voice data sample is obtained, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample.
In the embodiment of the application, the label information of the voice data sample is used for marking the voice activity characteristic and the noise voice characteristic in the voice data sample. Wherein the speech data samples may be speech codes, the speech activity features may be coding features of the speech codes, etc.
Step S13, determining a preset neural network model.
In the embodiment of the present application, fig. 2 is a schematic structural diagram of the preset neural network model provided in the embodiment of the present application. As shown in fig. 2, the preset neural network model includes a convolutional neural network CNN (Convolutional Neural Network), a first recurrent neural network RNN1 (Recurrent Neural Network), a second recurrent neural network RNN2, a first deep neural network DNN1 (Deep Neural Network), and a second deep neural network DNN2.
The convolutional neural network (CNN) is connected to the first recurrent neural network (RNN1) and to the second recurrent neural network (RNN2); the first recurrent neural network (RNN1) is connected to the first deep neural network (DNN1), and the second recurrent neural network (RNN2) is connected to the second deep neural network (DNN2). The convolutional neural network is connected to the first recurrent neural network (RNN1) through a connection unit (Cat), and the second recurrent neural network (RNN2) is also connected to the first recurrent neural network (RNN1) through the same connection unit (Cat).
It should be noted that the CNN is used for deep feature extraction: deep voice features are extracted through a multi-layer CNN network. The embodiment of the present application uses a 7-layer CNN, with Batch Norm added between the layers and PReLU activation functions.
RNN2 includes an LSTM (Long short-term memory) unit (or GRU unit) to extract the speech activity features.
Cat is the connection unit, which concatenates the features output by the CNN with the voice activity features into a combined feature.
RNN1 and DNN1 form the noise reduction path. The activation function of DNN1 is a Sigmoid, whose output is multiplied element-wise with the original noisy features to obtain the estimated clean features; the clean voice is then obtained through the inverse transform and overlap-add operations.
DNN2 is a 1-layer fully connected network that classifies the voice activity features to obtain a target value of 0 or 1, where 0 indicates that the current voice activity features do not match the preset activity features, and 1 indicates that they match.
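The two-branch dataflow described above (CNN feeding both the Cat unit and RNN2; RNN2 feeding DNN2 and the Cat unit; the Cat unit feeding RNN1 and then DNN1) can be sketched at the shape level as follows. This is only a dataflow sketch under assumed dimensions: the single matrix multiplications are stand-ins for the 7-layer CNN, the LSTM/GRU recurrences, and the trained weights of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, H, V = 100, 64, 32, 8   # frames, feature dim, hidden dim, VAD feature dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder parameters (stand-ins for the trained layers of the embodiment).
W_cnn  = rng.standard_normal((F, H)) * 0.1       # "CNN": deep feature extractor
W_rnn2 = rng.standard_normal((H, V)) * 0.1       # "RNN2": VAD feature extractor
W_dnn2 = rng.standard_normal((V, 1)) * 0.1       # "DNN2": 1-layer FC -> target value
W_rnn1 = rng.standard_normal((H + V, H)) * 0.1   # "RNN1": runs on concatenated features
W_dnn1 = rng.standard_normal((H, F)) * 0.1       # "DNN1": Sigmoid mask over features

def forward(noisy_feats):
    first = np.tanh(noisy_feats @ W_cnn)               # CNN output: (T, H)
    vad_feats = np.tanh(first @ W_rnn2)                # RNN2 output: (T, V)
    target = (sigmoid(vad_feats @ W_dnn2) > 0.5).astype(int)  # DNN2: 0/1 per frame
    second = np.concatenate([first, vad_feats], axis=1)       # Cat unit: (T, H+V)
    hidden = np.tanh(second @ W_rnn1)                  # RNN1 output: (T, H)
    mask = sigmoid(hidden @ W_dnn1)                    # DNN1 Sigmoid output in (0, 1)
    clean_est = mask * noisy_feats                     # element-wise masking
    return clean_est, target

noisy_feats = rng.standard_normal((T, F))
clean_est, target = forward(noisy_feats)
```

Because the mask lies in (0, 1), the estimated clean features never exceed the noisy features in magnitude, which matches the element-wise multiplication described for DNN1.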
Step S14, training the preset neural network model by utilizing the voice data sample and the label information so that the preset neural network model learns the corresponding relation between the pure voice characteristic and the noise voice characteristic and between the noise voice characteristic and the voice activity characteristic to obtain a voice processing model.
In the embodiment of the present application, step S14 trains the preset neural network model using the voice data sample and the label information, so that the preset neural network model learns the correspondence between the pure voice features and the noisy voice features, and between the noisy voice features and the voice activity features, to obtain the voice processing model, and includes the following steps B1-B3:
Step B1, inputting the voice data sample into the preset neural network model, so that the convolutional neural network in the preset neural network model extracts first features of the voice data sample; the first features are input into the connection unit and the second recurrent neural network respectively; the second recurrent neural network detects first voice activity features and first noisy voice features in the first features, and extracts the first voice activity features; the first voice activity features are input into the connection unit and the second deep neural network respectively; the connection unit concatenates the first features and the first voice activity features to obtain second features; the second features are input into the first recurrent neural network; the first recurrent neural network detects second voice activity features and second noisy voice features in the second features, and extracts the second voice activity features; the second voice activity features are input into the first deep neural network for feature superposition to obtain third features; and at the same time the target value output by the second deep neural network is determined.
It should be noted that the preset neural network model includes two branches. The first branch is used to extract the voice activity features and includes the second recurrent neural network and the second deep neural network. The second branch is used for noise reduction of the voice data and includes the connection unit, the first recurrent neural network and the first deep neural network; the first and second recurrent neural networks detect the voice activity features and the noisy voice features within the input features.
And B2, under the condition that the target value is used for indicating that the voice activity characteristic meets the preset characteristic, determining initial voice data corresponding to the voice data sample, and extracting fourth characteristic corresponding to the initial voice data.
And B3, calculating a loss function value based on the third feature and the fourth feature, and determining the preset neural network model as a voice processing model under the condition that the loss function value is smaller than a preset threshold value.
In the embodiment of the application, the value of the loss function is calculated as follows:
s_target = (⟨s, ŝ⟩ · ŝ) / ‖ŝ‖²
e_noise = s − s_target
SI-SNR = 10 · log₁₀(‖s_target‖² / ‖e_noise‖²)
where ⟨·,·⟩ represents the dot product of two vectors, ŝ is the fourth feature (the clean reference), s is the third feature (the estimate output by the network), s_target is the target signal component of the estimate, e_noise is the residual noise component, and SI-SNR is the value of the loss function.
In the model training process, the application uses samples in which noise-free voice data is mixed with noise, and combines the voice activity features of the samples in training, thereby improving the noise reduction effect under the same model computation cost. In addition, different loss functions are adopted for the different branches during training: the noise reduction branch is trained with SI-SNR as its loss function, and the voice activity branch is trained with cross entropy as its loss function. The final voice processing model is thus obtained through multitask learning, giving better noise reduction performance.
Fig. 3 is a flowchart of a method for noise reduction of voice data according to an embodiment of the present application, as shown in fig. 3, the method may include the following steps:
Step S21, obtaining the original voice data to be processed.
Step S22, inputting the original voice data into a pre-trained voice processing model, so that the voice processing model extracts characteristic information of the original voice data, and outputting a target value and target voice characteristics based on the characteristic information.
In the embodiment of the present application, step S22, inputting the original speech data into a pre-trained speech processing model, so that the speech processing model outputs a target value and a target speech feature, includes:
Inputting the original voice data into the voice processing model, so that the convolutional neural network in the voice processing model extracts original features of the original voice data; the original features are input into the connection unit and the second recurrent neural network respectively; the second recurrent neural network detects original voice activity features and original noisy voice features in the original features, and extracts the original voice activity features; the original voice activity features are input into the connection unit and the second deep neural network respectively; the connection unit concatenates the original features and the original voice activity features to obtain concatenated features; the concatenated features are input into the first recurrent neural network; the first recurrent neural network detects target voice activity features and target noisy voice features in the concatenated features, and extracts the target voice activity features; the target voice activity features are input into the first deep neural network for feature superposition to obtain the target voice features; and the target value output by the second deep neural network is determined.
It should be noted that, in the embodiment of the present application, the speech processing model splits into two branches after the convolutional neural network. The first branch is used to extract voice activity features and includes the second recurrent neural network and the second deep neural network. The second branch is used for noise reduction of the voice data and includes the connecting unit, the first recurrent neural network and the first deep neural network.
Specifically, the convolutional neural network extracts the original features of the original voice data and inputs them into the connecting unit and the second recurrent neural network respectively. The second recurrent neural network extracts the original voice activity features from the original features and inputs them into the second deep neural network, and the second deep neural network evaluates the original voice activity features to obtain the target value. At the same time, the second recurrent neural network also passes the original voice activity features to the connecting unit.
The connecting unit splices the original voice activity features and the original features to obtain spliced features, and the spliced features are then noise-reduced by the first recurrent neural network and the first deep neural network to obtain the target voice features.
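The two-branch flow described above can be sketched in a few lines. This is a minimal illustrative sketch in plain Python, not the patent's implementation: the function names (`conv_net`, `vad_rnn`, `vad_dnn`, `denoise_branch`), the toy threshold, and the per-frame arithmetic are all hypothetical stand-ins for the actual trained networks.

```python
# Hypothetical sketch of the two-branch forward pass: a shared front end,
# a voice-activity branch producing the target value, and a noise-reduction
# branch consuming the spliced (feature, activity) pairs.

def conv_net(frames):
    # Stand-in for the convolutional feature extractor:
    # here it simply averages the samples of each frame.
    return [sum(f) / len(f) for f in frames]

def vad_rnn(features):
    # Stand-in for the second recurrent network: flags frames whose
    # feature magnitude exceeds a toy threshold as "voice active".
    return [1.0 if abs(x) > 0.1 else 0.0 for x in features]

def vad_dnn(activity):
    # Stand-in for the second deep network: the target value is 1 when
    # any frame shows voice activity, otherwise 0.
    return 1 if any(a > 0.5 for a in activity) else 0

def denoise_branch(features, activity):
    # Connecting unit: splice each feature with its activity flag, then a
    # stand-in for the first recurrent + deep networks that keeps feature
    # content only where voice is active.
    spliced = list(zip(features, activity))
    return [x * a for x, a in spliced]

def forward(frames):
    features = conv_net(frames)             # shared front end
    activity = vad_rnn(features)            # branch 1: voice activity
    target_value = vad_dnn(activity)        # target value (1 = speech present)
    target_features = denoise_branch(features, activity)  # branch 2
    return target_value, target_features
```

Because the activity flags feed both the target value and the splice, the noise-reduction branch sees exactly the voice-activity evidence that the first branch extracted, mirroring the shared connecting unit in the patent's figure.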
Step S23: when the target value indicates that the target voice feature satisfies a preset feature, target voice data is generated based on the target voice feature, where the preset feature is a feature of voice data that does not carry noise.
In the embodiment of the application, when the target value is 1, the target value indicates that the target voice feature satisfies the preset feature, and the target voice data can then be generated according to the target voice feature.
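The gate in step S23 can be sketched as follows. This is a hedged illustration: the function name and the list-copy stand-in for waveform reconstruction are hypothetical, since the patent does not specify how target voice data is synthesized from the features.

```python
def generate_target_voice_data(target_value, target_features):
    # Hypothetical gate corresponding to step S23: a target value of 1
    # indicates the target voice features satisfy the preset
    # (noise-free) feature, so voice data may be generated from them.
    if target_value == 1:
        return list(target_features)  # stand-in for waveform reconstruction
    return None  # target value 0: no voice activity, skip further processing
```

Skipping reconstruction when the target value is 0 is what gives the first branch its efficiency benefit: silent or noise-only input never reaches the heavier noise-reduction path.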
According to the embodiment of the application, two branches are used to process the voice data. On one hand, whether the original voice data contains voice activity features can be determined directly from the target value; if it does not, the voice data is not processed further, which improves the efficiency of the voice processing model. On the other hand, the branch that extracts voice activity features has a simple structure whose parameter count is far smaller than that of the noise reduction branch, which reduces the model's computation and the consumption of storage resources.
Fig. 4 is a block diagram of a training apparatus for a speech processing model according to an embodiment of the present application, where the training apparatus may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 4, the apparatus includes:
A first obtaining module 41, configured to obtain a voice data sample, where the voice data sample includes voice data obtained by mixing multiple frames of noise;
the second obtaining module 42 is configured to obtain tag information corresponding to the voice data sample, where the tag information is used to mark pure voice features, noise voice features, and voice activity features in the voice data sample.
A determining module 43, configured to determine a preset neural network model.
The training module 44 is configured to train the preset neural network model by using the voice data sample and the tag information, so that the preset neural network model learns the pure voice feature and the noise voice feature, and the corresponding relationship between the noise voice feature and the voice activity feature, to obtain a voice processing model.
In the embodiment of the present application, the first obtaining module 41 is configured to: acquire initial voice data, where the initial voice data is voice data carrying pure voice features and voice activity features; determine multiple types of initial noise features and a preset signal-to-noise ratio; and mix the initial voice data and the initial noise features according to the preset signal-to-noise ratio to obtain the voice data sample.
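Mixing speech and noise at a preset signal-to-noise ratio is commonly done by scaling the noise to hit the target power ratio before adding it. The patent does not give a formula, so the sketch below is one standard construction; the function name and the per-sample power estimate are assumptions for illustration.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that the speech-to-noise power ratio equals
    # snr_db (in decibels), then add it to the speech sample by sample.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power is p_speech / 10**(snr_db / 10); solve for the
    # amplitude scale that achieves it.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Repeating this with several noise types and several SNR values yields a diverse set of voice data samples while the clean speech (and hence the labels) stays known.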
The preset neural network model comprises a convolutional neural network, a first recurrent neural network, a second recurrent neural network, a first deep neural network and a second deep neural network, where the convolutional neural network is respectively connected with the first recurrent neural network and the second recurrent neural network, the first recurrent neural network is connected with the first deep neural network, the second recurrent neural network is connected with the second deep neural network, the convolutional neural network is connected with the first recurrent neural network through a connecting unit, and the second recurrent neural network is also connected with the first recurrent neural network through the connecting unit.
In the embodiment of the present application, the training module 44 is configured to: input the voice data sample into the preset neural network model, so that the convolutional neural network in the preset neural network model extracts a first feature of the voice data sample; input the first feature into the connecting unit and the second recurrent neural network; detect, by the second recurrent neural network, a first voice activity feature and a first noise voice feature in the first feature, and extract the first voice activity feature; input the first voice activity feature into the connecting unit and the second deep neural network; splice, by the connecting unit, the first feature and the first voice activity feature to obtain a second feature; input the second feature into the first recurrent neural network, which detects a second voice activity feature and a second noise voice feature in the second feature and extracts the second voice activity feature; input the second voice activity feature into the first deep neural network for feature superposition to obtain a third feature; and determine the target value output by the second deep neural network.
Fig. 5 is a block diagram of a noise reduction device for voice data according to an embodiment of the present application, where the device may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 5, the apparatus includes:
the obtaining module 51 is configured to obtain original voice data to be processed.
The extracting module 52 is configured to input the original voice data into a pre-trained voice processing model, so that the voice processing model extracts features of the original voice data and outputs a target value and target voice features based on the features.
The processing module 53 is configured to generate target voice data based on the target voice feature if the target value is used to indicate that the target voice feature satisfies a preset feature, where the preset feature is a feature of the voice data that does not carry noise.
The embodiment of the application also provides an electronic device, as shown in fig. 6, where the electronic device may include a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, where the processor 1501, the communication interface 1502 and the memory 1503 complete communication with each other through the communication bus 1504.
A memory 1503 for storing a computer program;
The processor 1501, when executing the computer program stored in the memory 1503, implements the steps of the above embodiments.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the method for training a speech processing model according to any of the above embodiments.
In yet another embodiment of the present application, a computer program product comprising instructions is also provided, which, when run on a computer, causes the computer to perform the method for training a speech processing model according to any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A method of training a speech processing model, comprising:
obtaining a voice data sample, wherein the voice data sample comprises voice data obtained by mixing multiple frames of noise;
Acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
Determining a preset neural network model;
training the preset neural network model by utilizing the voice data sample and the label information so that the preset neural network model learns the pure voice characteristics, the noise voice characteristics and the corresponding relation between the noise voice characteristics and the voice activity characteristics to obtain a voice processing model;
the obtaining the voice data sample includes:
acquiring initial voice data, wherein the initial voice data is voice data carrying pure voice characteristics and voice activity characteristics;
determining various types of initial noise characteristics and preset signal-to-noise ratios;
mixing the initial voice data and the initial noise characteristics according to the preset signal-to-noise ratio to obtain the voice data sample;
the preset neural network model comprises a convolutional neural network, a first recurrent neural network, a second recurrent neural network, a first deep neural network and a second deep neural network;
The convolutional neural network is respectively connected with the first recurrent neural network and the second recurrent neural network, the first recurrent neural network is connected with the first deep neural network, the second recurrent neural network is connected with the second deep neural network, the convolutional neural network is connected with the first recurrent neural network through a connecting unit, and the second recurrent neural network is also connected with the first recurrent neural network through the connecting unit;
Training the preset neural network model by using the voice data sample and the tag information, so that the preset neural network model learns the pure voice feature and the noise voice feature, and the corresponding relation between the noise voice feature and the voice activity feature, to obtain a voice processing model, including:
Inputting the voice data sample into the preset neural network model, so that the convolutional neural network in the preset neural network model extracts a first feature of the voice data sample; inputting the first feature into the connecting unit and the second recurrent neural network respectively; detecting, by the second recurrent neural network, a first voice activity feature and a first noise voice feature in the first feature, and extracting the first voice activity feature; inputting the first voice activity feature into the connecting unit and the second deep neural network respectively; splicing, by the connecting unit, the first feature and the first voice activity feature to obtain a second feature; inputting the second feature into the first recurrent neural network, detecting, by the first recurrent neural network, a second voice activity feature and a second noise voice feature in the second feature, and extracting the second voice activity feature; inputting the second voice activity feature into the first deep neural network to perform feature superposition to obtain a third feature; and meanwhile determining the target value output by the second deep neural network;
Under the condition that the target value is used for indicating that the voice activity characteristic meets a preset characteristic, determining initial voice data corresponding to the voice data sample, and extracting a fourth characteristic corresponding to the initial voice data;
and calculating a loss function value based on the third feature and the fourth feature, and determining the preset neural network model as the voice processing model under the condition that the loss function value is smaller than a preset threshold value.
2. A training apparatus for a speech processing model employing the method of claim 1, comprising:
The first acquisition module is used for acquiring voice data samples, wherein the voice data samples comprise voice data obtained by mixing multiple frames of noise;
The second acquisition module is used for acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
the determining module is used for determining a preset neural network model;
The training module is used for training the preset neural network model by utilizing the voice data sample and the label information so that the preset neural network model learns the pure voice characteristics, the noise voice characteristics and the corresponding relation between the noise voice characteristics and the voice activity characteristics to obtain a voice processing model.
3. A storage medium comprising a stored program, wherein the program when run performs the method steps of claim 1.
4. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus, and wherein:
a memory for storing a computer program;
a processor for executing the method steps of claim 1 by running a program stored on a memory.
CN202210225936.9A 2022-03-07 2022-03-07 Speech processing model training method, speech data noise reduction method and device Active CN114464168B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202411883863.8A CN119724162A (en) 2022-03-07 2022-03-07 Method and device for reducing noise of speech data
CN202210225936.9A CN114464168B (en) 2022-03-07 2022-03-07 Speech processing model training method, speech data noise reduction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210225936.9A CN114464168B (en) 2022-03-07 2022-03-07 Speech processing model training method, speech data noise reduction method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202411883863.8A Division CN119724162A (en) 2022-03-07 2022-03-07 Method and device for reducing noise of speech data

Publications (2)

Publication Number Publication Date
CN114464168A CN114464168A (en) 2022-05-10
CN114464168B true CN114464168B (en) 2025-01-28

Family

ID=81417610

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202411883863.8A Pending CN119724162A (en) 2022-03-07 2022-03-07 Method and device for reducing noise of speech data
CN202210225936.9A Active CN114464168B (en) 2022-03-07 2022-03-07 Speech processing model training method, speech data noise reduction method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202411883863.8A Pending CN119724162A (en) 2022-03-07 2022-03-07 Method and device for reducing noise of speech data

Country Status (1)

Country Link
CN (2) CN119724162A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273880A (en) * 2022-07-21 2022-11-01 百果园技术(新加坡)有限公司 Voice noise reduction method, model training method, device, equipment, medium and product
CN116825123B (en) * 2023-06-19 2024-06-07 广东保伦电子股份有限公司 Tone quality optimization method and system based on audio push
CN117457017B (en) * 2023-12-20 2024-03-01 浙江华创视讯科技有限公司 Voice data cleaning method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444214A (en) * 2017-11-24 2019-11-12 深圳市腾讯计算机系统有限公司 Speech processing model training method, device, electronic equipment and storage medium
CN111223493A (en) * 2020-01-08 2020-06-02 北京声加科技有限公司 Voice signal noise reduction processing method, microphone and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9786270B2 (en) * 2015-07-09 2017-10-10 Google Inc. Generating acoustic models
EP3381033B1 (en) * 2016-03-23 2020-08-12 Google LLC Adaptive audio enhancement for multichannel speech recognition
CN110600018B (en) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN111261146B (en) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN112820324B (en) * 2020-12-31 2024-06-25 平安科技(深圳)有限公司 Multi-label voice activity detection method, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444214A (en) * 2017-11-24 2019-11-12 深圳市腾讯计算机系统有限公司 Speech processing model training method, device, electronic equipment and storage medium
CN111223493A (en) * 2020-01-08 2020-06-02 北京声加科技有限公司 Voice signal noise reduction processing method, microphone and electronic equipment

Also Published As

Publication number Publication date
CN119724162A (en) 2025-03-28
CN114464168A (en) 2022-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant