Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the present application provides a training method for a speech processing model, a noise reduction method for speech data, and corresponding apparatuses.
According to an aspect of an embodiment of the present application, there is provided a training method of a speech processing model, including:
obtaining a voice data sample, wherein the voice data sample comprises voice data obtained by mixing clean voice data with multiple frames of noise;
acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
determining a preset neural network model; and
training the preset neural network model by using the voice data sample and the label information, so that the preset neural network model learns the pure voice characteristics, the noise voice characteristics, and the corresponding relation between the noise voice characteristics and the voice activity characteristics, to obtain a voice processing model.
Further, the obtaining the voice data sample includes:
acquiring initial voice data, wherein the initial voice data is voice data carrying pure voice characteristics and voice activity characteristics;
determining various types of initial noise characteristics and preset signal-to-noise ratios;
and mixing the initial voice data and the initial noise characteristics according to the preset signal-to-noise ratio to obtain the voice data sample.
Further, the preset neural network model comprises a convolutional neural network, a first cyclic neural network, a second cyclic neural network, a first deep neural network and a second deep neural network;
The convolutional neural network is respectively connected with the first cyclic neural network and the second cyclic neural network, the first cyclic neural network is connected with the first deep neural network, the second cyclic neural network is connected with the second deep neural network, the convolutional neural network is connected with the first cyclic neural network through a connecting unit, and the second cyclic neural network is also connected with the first cyclic neural network through the connecting unit.
Further, training the preset neural network model by using the voice data sample and the tag information, so that the preset neural network model learns the pure voice feature and the noise voice feature, and the corresponding relation between the noise voice feature and the voice activity feature, to obtain a voice processing model, including:
Inputting the voice data sample into the preset neural network model, so that a convolutional neural network in the preset neural network model extracts first features of the voice data sample, the first features are respectively input into the connecting unit and the second cyclic neural network, the second cyclic neural network detects first voice activity features and first noise voice features in the first features, the first voice activity features are extracted, the first voice activity features are respectively input into the connecting unit and the second deep neural network, the connecting unit splices the first features and the first voice activity features to obtain second features, the second features are input into the first cyclic neural network, the first cyclic neural network detects second voice activity features and second noise voice features in the second features, the second voice activity features are extracted, the second voice activity features are input into the first deep neural network to perform feature superposition to obtain third features, and meanwhile target values output by the second deep neural network are determined;
Under the condition that the target value is used for indicating that the voice activity characteristic meets a preset characteristic, determining initial voice data corresponding to the voice data sample, and extracting a fourth characteristic corresponding to the initial voice data;
and calculating a loss function value based on the third feature and the fourth feature, and determining the preset neural network model as the voice processing model under the condition that the loss function value is smaller than a preset threshold value.
According to another aspect of the embodiment of the present application, there is also provided a noise reduction method for voice data, including:
acquiring original voice data to be processed;
inputting the original voice data into a pre-trained voice processing model, so that the voice processing model extracts the characteristics of the original voice data and outputs target values and target voice characteristics based on the characteristics;
And generating target voice data based on the target voice characteristics under the condition that the target value is used for representing that the target voice characteristics meet preset characteristics, wherein the preset characteristics are characteristics of voice data without noise.
Further, the inputting the original voice data into a pre-trained voice processing model, so that the voice processing model outputs a target value and a target voice feature, includes:
Inputting the original voice data into the voice processing model, so that a convolutional neural network in the voice processing model extracts original features of the original voice data; inputting the original features into the connecting unit and the second cyclic neural network respectively; detecting, by the second cyclic neural network, original voice activity features and original noise voice features in the original features, and extracting the original voice activity features; inputting the original voice activity features into the connecting unit and the second deep neural network respectively; splicing, by the connecting unit, the original features and the original voice activity features to obtain spliced features; inputting the spliced features into the first cyclic neural network; detecting, by the first cyclic neural network, target voice activity features and target noise voice features in the spliced features, and extracting the target voice activity features; inputting the target voice activity features into the first deep neural network to perform feature superposition to obtain target voice features; and determining the target values output by the second deep neural network.
According to another aspect of the embodiment of the present application, there is also provided a training apparatus for a speech processing model, including:
The first acquisition module is used for acquiring a voice data sample, wherein the voice data sample comprises voice data obtained by mixing clean voice data with multiple frames of noise;
The second acquisition module is used for acquiring label information corresponding to the voice data sample, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample;
the determining module is used for determining a preset neural network model;
The training module is used for training the preset neural network model by utilizing the voice data sample and the label information so that the preset neural network model learns the pure voice characteristics, the noise voice characteristics and the corresponding relation between the noise voice characteristics and the voice activity characteristics to obtain a voice processing model.
According to another aspect of the embodiment of the present application, there is also provided a noise reduction apparatus for voice data, including:
the acquisition module is used for acquiring the original voice data to be processed;
The extraction module is used for inputting the original voice data into a pre-trained voice processing model so that the voice processing model extracts the characteristics of the original voice data and outputs target values and target voice characteristics based on the characteristics;
And the processing module is used for generating target voice data based on the target voice characteristics under the condition that the target value is used for representing that the target voice characteristics meet preset characteristics, wherein the preset characteristics are characteristics of voice data without noise.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program that performs the above steps when running.
According to another aspect of the embodiment of the application, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus, where the memory is used to store a computer program, and the processor is used to execute the steps in the above method by running the program stored on the memory.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the above method.
Compared with the prior art, the technical solution provided by the embodiment of the present application has the advantage that, during model training, samples obtained by mixing noise-free voice data with noise are used, and the voice activity characteristics in the samples are used for training at the same time, so that noise reduction performance is better for the same amount of model computation.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present application; the illustrative embodiments and their descriptions are used to explain the present application and do not constitute undue limitations on it. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another similar entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the application provides a training method of a voice processing model, and a noise reduction method and apparatus for voice data. The method provided by the embodiment of the application can be applied to any electronic device that requires it, for example a server or a terminal, without particular limitation; for convenience of description, it is hereinafter referred to simply as the electronic device.
According to an aspect of the embodiment of the application, a method embodiment of a training method of a speech processing model is provided. Fig. 1 is a flowchart of a training method of a speech processing model according to an embodiment of the present application, as shown in fig. 1, where the method includes:
Step S11, a voice data sample is obtained, wherein the voice data sample comprises voice data obtained by mixing clean voice data with multiple frames of noise.
In the embodiment of the present application, step S11, obtaining a voice data sample, includes the following steps A1-A3:
and A1, acquiring initial voice data, wherein the initial voice data are voice data carrying pure voice characteristics and voice activity characteristics.
The method provided by the embodiment of the application is applied to a voice processing device, which may be a smart phone, a computer, or the like. The voice processing device may control a voice acquisition device, such as a recorder or a recording pen, to acquire the initial voice data. It will be appreciated that the initial voice data is voice data that does not carry noise (i.e., clean voice data). The initial voice data carries pure voice features and voice activity features, and the voice activity features are used for representing the sounds present in the initial voice data.
And step A2, determining various types of initial noise characteristics and preset signal-to-noise ratios.
In the embodiment of the application, in order to construct the voice data sample carrying noise, multiple types of noise data are also required to be determined, wherein the multiple types of noise data can be white noise type noise data, additive noise type noise data and multiplicative noise type noise data. The preset signal to noise ratio may be a preset noise mixing ratio.
And step A3, mixing the initial voice data and the initial noise characteristics according to a preset signal-to-noise ratio to obtain a voice data sample.
In the embodiment of the application, the voice processing device may mix the initial voice data and the noise data according to the preset signal-to-noise ratio to obtain the voice data sample; the voice data sample obtained in this way is multi-frame clean voice data mixed with noise.
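As an illustration, the mixing in steps A1-A3 can be sketched as follows. This is a minimal, hypothetical implementation (the function name `mix_at_snr` and the power-based dB scaling are assumptions, not taken from the application): the noise is rescaled so that the resulting mixture attains the preset signal-to-noise ratio.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the requested
    signal-to-noise ratio (in dB), then return the mixture."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # power the noise must have to achieve the requested SNR
    target_noise_power = p_clean / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / p_noise)
    return clean + scaled_noise
```

Repeating this with several noise types and several preset SNR values would produce a pool of training samples whose labels (clean features, noise features, voice activity) are known by construction.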
Step S12, label information corresponding to the voice data sample is obtained, wherein the label information is used for marking pure voice characteristics, noise voice characteristics and voice activity characteristics in the voice data sample.
In the embodiment of the application, the label information of the voice data sample is used for marking the voice activity features and the noise voice features in the voice data sample. The voice data samples may be speech codes, and the voice activity features may be coding features of those speech codes, for example.
Step S13, determining a preset neural network model.
In the embodiment of the present application, fig. 2 is a schematic structural diagram of a preset neural network model provided in the embodiment of the present application, and as shown in fig. 2, the preset neural network model includes a convolutional neural network CNN (Convolutional Neural Networks), a first recurrent neural network RNN1 (Recurrent Neural Network), a second recurrent neural network RNN2, a first deep neural network DNN1 (Deep Neural Networks), and a second deep neural network DNN2.
The Convolutional Neural Network (CNN) is respectively connected with a first cyclic neural network (RNN 1) and a second cyclic neural network (RNN 2), the first cyclic neural network (RNN 1) is connected with the first deep neural network (DNN 1), the second cyclic neural network (RNN 2) is connected with the second deep neural network (DNN 2), wherein the convolutional neural network is connected with the first cyclic neural network (RNN 1) through a connecting unit (Cat), and the second cyclic neural network (RNN 2) is also connected with the first cyclic neural network (RNN 1) through a connecting unit (Cat).
It should be noted that the CNN is used for deep feature extraction: deep voice features are extracted through a multi-layer CNN. In the embodiment of the application, a 7-layer CNN is used, with Batch Norm added between layers and PReLU activation functions.
RNN2 includes an LSTM (Long Short-Term Memory) unit (or a GRU unit) to extract the voice activity features.
The Cat unit splices the feature output by the CNN with the voice activity feature to obtain a combined feature.
RNN1 and DNN1 form the noise-reduction path: DNN1 uses a Sigmoid activation function to produce a mask that is multiplied point-wise with the original noisy features to obtain the estimated clean features, and the clean voice is then obtained through an inverse transform and an overlap-add operation.
DNN2 is a one-layer fully connected network that identifies the voice activity features to obtain a target value of 0 or 1, where 0 indicates that the current voice activity features do not match the preset activity features and 1 indicates that they do.
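The topology described above can be illustrated with a toy forward pass. This is only a structural sketch under stated assumptions: the real CNN/LSTM/DNN layers are replaced by random linear maps, and all names and dimensions (`TwoBranchSketch`, `feat_dim`, `vad_dim`) are hypothetical. Only the two-branch data flow, the Cat concatenation, and the Sigmoid mask multiplication follow the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoBranchSketch:
    """Toy stand-in for the CNN / RNN1 / RNN2 / DNN1 / DNN2 topology.
    Each real layer is replaced by a random linear map; only the
    branching and feature flow mirror the described model."""
    def __init__(self, feat_dim=16, vad_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w_cnn  = rng.standard_normal((feat_dim, feat_dim)) * 0.1            # "CNN"
        self.w_rnn2 = rng.standard_normal((feat_dim, vad_dim)) * 0.1             # "RNN2"
        self.w_dnn2 = rng.standard_normal((vad_dim, 1)) * 0.1                    # "DNN2"
        self.w_rnn1 = rng.standard_normal((feat_dim + vad_dim, feat_dim)) * 0.1  # "RNN1"
        self.w_dnn1 = rng.standard_normal((feat_dim, feat_dim)) * 0.1            # "DNN1"

    def forward(self, frames):
        first = frames @ self.w_cnn                       # CNN: first features
        vad = np.tanh(first @ self.w_rnn2)                # RNN2: voice activity features
        target_value = sigmoid(vad @ self.w_dnn2).mean()  # DNN2: scalar target value
        second = np.concatenate([first, vad], axis=-1)    # Cat: splice both features
        hidden = np.tanh(second @ self.w_rnn1)            # RNN1 on spliced features
        mask = sigmoid(hidden @ self.w_dnn1)              # DNN1: Sigmoid mask in (0, 1)
        enhanced = mask * frames                          # point-wise multiply noisy features
        return enhanced, target_value
```

Because the mask lies in (0, 1), the enhanced features never exceed the noisy input in magnitude, which matches the masking-based noise reduction the text describes.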
Step S14, training the preset neural network model by utilizing the voice data sample and the label information so that the preset neural network model learns the corresponding relation between the pure voice characteristic and the noise voice characteristic and between the noise voice characteristic and the voice activity characteristic to obtain a voice processing model.
In the embodiment of the present application, step S14 trains a preset neural network model by using a voice data sample and tag information, so that the preset neural network model learns the pure voice feature and the noise voice feature, and the corresponding relationship between the noise voice feature and the voice activity feature, to obtain a voice processing model, which includes the following steps B1-B3:
Step B1, inputting the voice data sample into a preset neural network model, so that a convolutional neural network in the preset neural network model extracts first features of the voice data sample; inputting the first features into the connecting unit and the second cyclic neural network respectively; detecting, by the second cyclic neural network, first voice activity features and first noise voice features in the first features, and extracting the first voice activity features; inputting the first voice activity features into the connecting unit and the second deep neural network respectively; splicing, by the connecting unit, the first features and the first voice activity features to obtain second features; inputting the second features into the first cyclic neural network; detecting, by the first cyclic neural network, second voice activity features and second noise voice features in the second features, and extracting the second voice activity features; inputting the second voice activity features into the first deep neural network to perform feature superposition to obtain third features; and determining the target values output by the second deep neural network.
It should be noted that the preset neural network model includes two branches, the first branch is used for extracting the voice activity feature, and the first branch includes a second cyclic neural network and a second deep neural network. The second branch is used for noise reduction of voice data and comprises a connecting unit, a first circulating neural network and a first deep neural network, wherein the first circulating neural network and the second circulating neural network are used for detecting voice activity characteristics and noise voice characteristics in the characteristics.
And B2, under the condition that the target value is used for indicating that the voice activity characteristic meets the preset characteristic, determining initial voice data corresponding to the voice data sample, and extracting fourth characteristic corresponding to the initial voice data.
And B3, calculating a loss function value based on the third feature and the fourth feature, and determining the preset neural network model as a voice processing model under the condition that the loss function value is smaller than a preset threshold value.
In the embodiment of the application, the loss function value is calculated as follows:

$s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s \rangle}{\|s\|^2}\, s, \qquad e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}, \qquad \mathrm{SI\text{-}SNR} = 10 \log_{10} \dfrac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{noise}}\|^2}$

where $\langle \cdot, \cdot \rangle$ denotes the vector dot product, $s$ is the fourth feature (the clean reference), $\hat{s}$ is the third feature (the estimate output by the model), $s_{\mathrm{target}}$ is the component of the estimate along the reference, $e_{\mathrm{noise}}$ is the residual error, and SI-SNR is the loss function value.
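The SI-SNR comparison of the third and fourth features can be sketched numerically as follows. This is the standard scale-invariant SNR formulation; the function name and the small `eps` stabilizer are assumptions for illustration.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-Invariant SNR in dB between an estimated signal (third
    feature) and a clean reference (fourth feature)."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # project the estimate onto the reference to obtain s_target
    s_target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))
```

During training the negative of this value would be minimized, so that a higher SI-SNR (a cleaner estimate) corresponds to a lower loss.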
In the model training process, the application uses samples obtained by mixing noise-free voice data with noise, and trains with the voice activity features in the samples at the same time, thereby improving the noise reduction effect for the same amount of model computation. In addition, during training, different loss functions are adopted for the different branches: the branch that extracts the voice activity features is trained using cross entropy as its loss function, and the noise-reduction branch is trained using SI-SNR as its loss function. A final voice processing model is thus obtained through multitask learning, giving better noise reduction performance.
Fig. 3 is a flowchart of a method for noise reduction of voice data according to an embodiment of the present application, as shown in fig. 3, the method may include the following steps:
Step S21, obtaining the original voice data to be processed.
Step S22, inputting the original voice data into a pre-trained voice processing model, so that the voice processing model extracts characteristic information of the original voice data, and outputting a target value and target voice characteristics based on the characteristic information.
In the embodiment of the present application, step S22, inputting the original speech data into a pre-trained speech processing model, so that the speech processing model outputs a target value and a target speech feature, includes:
The original voice data is input into the voice processing model, so that a convolutional neural network in the voice processing model extracts original features of the original voice data; the original features are input into the connecting unit and the second cyclic neural network respectively; the second cyclic neural network detects original voice activity features and original noise voice features in the original features and extracts the original voice activity features; the original voice activity features are input into the connecting unit and the second deep neural network respectively; the connecting unit splices the original features and the original voice activity features to obtain spliced features; the spliced features are input into the first cyclic neural network; the first cyclic neural network detects target voice activity features and target noise voice features in the spliced features and extracts the target voice activity features; the target voice activity features are input into the first deep neural network to perform feature superposition to obtain target voice features; and the target value output by the second deep neural network is determined.
It should be noted that, in the embodiment of the present application, two branches appear after the convolutional neural network through the speech processing model, where the first branch is used to extract the speech activity feature, and the first branch includes a second recurrent neural network and a second deep neural network. The second branch is used for noise reduction of voice data and comprises a connecting unit, a first circulating neural network and a first deep neural network.
Specifically, the convolutional neural network extracts the original features of the original voice data and inputs them into the connecting unit and the second cyclic neural network respectively. The second cyclic neural network extracts the original voice activity features from the original features and inputs them into the second deep neural network, and the second deep neural network evaluates the original voice activity features to obtain the target value. At the same time, the second cyclic neural network also passes the original voice activity features to the connecting unit.
The connecting unit splices the original voice activity features and the original features to obtain spliced features, and the spliced features are then denoised through the first cyclic neural network and the first deep neural network to obtain the target voice features.
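Turning the target voice features back into a waveform uses the inverse transform and overlap-add step mentioned in the description of DNN1. A minimal sketch of the overlap-add itself, assuming rectangular (unweighted) frames and a hypothetical function name:

```python
import numpy as np

def overlap_add(frames, hop):
    """Reconstruct a 1-D signal from successive frames by summing each
    frame into the output buffer at multiples of the hop size."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out
```

With non-overlapping frames (hop equal to the frame length) this reduces to simple concatenation; with overlap, adjacent frames sum in the overlapped region.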
In step S23, in the case where the target value is used to indicate that the target voice feature satisfies the preset feature, the target voice data is generated based on the target voice feature, and the preset feature is a feature of the voice data that does not carry noise.
In the embodiment of the application, when the target value is 1, the target value is determined to be used for indicating that the target voice characteristic meets the preset characteristic, and at the moment, the target voice data can be generated according to the target voice characteristic.
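The gating behavior in this step can be sketched as follows. The threshold and the function name are assumptions for illustration (the application describes a binary target value of 0 or 1): when the target value does not indicate voice activity, the frame is passed through untouched and the generation step is skipped.

```python
import numpy as np

def gate_denoised_output(noisy_frame, enhanced_frame, target_value, threshold=0.5):
    """Use the target value from DNN2 to decide whether to emit the
    denoising branch's output or leave the frame unprocessed."""
    if target_value >= threshold:
        return enhanced_frame
    return noisy_frame
```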
According to the embodiment of the application, the two branches are adopted to process the voice data, so that on one hand, whether the original voice data has voice activity features can be directly determined through the target value, if the original voice data does not have the voice activity features, the voice data is not processed, and the efficiency of a voice processing model is improved. On the other hand, the model used by the branch for extracting the voice activity features has a simple structure, and the parameter quantity is far smaller than that of the noise reduction branch, so that the calculated quantity of the model is reduced, and the consumption of storage resources is reduced.
Fig. 4 is a block diagram of a training apparatus for a speech processing model according to an embodiment of the present application, where the training apparatus may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 4, the apparatus includes:
A first obtaining module 41, configured to obtain a voice data sample, where the voice data sample includes voice data obtained by mixing clean voice data with multiple frames of noise;
the second obtaining module 42 is configured to obtain tag information corresponding to the voice data sample, where the tag information is used to mark pure voice features, noise voice features, and voice activity features in the voice data sample.
A determining module 43, configured to determine a preset neural network model.
The training module 44 is configured to train the preset neural network model by using the voice data sample and the tag information, so that the preset neural network model learns the pure voice feature and the noise voice feature, and the corresponding relationship between the noise voice feature and the voice activity feature, to obtain a voice processing model.
In the embodiment of the present application, the first obtaining module 41 is configured to obtain initial voice data, where the initial voice data is voice data carrying pure voice features and voice activity features, determine multiple types of initial noise features and preset signal-to-noise ratios, and mix the initial voice data and the initial noise features according to the preset signal-to-noise ratios to obtain the voice data sample.
The preset neural network model comprises a convolutional neural network, a first cyclic neural network, a second cyclic neural network, a first deep neural network and a second deep neural network, wherein the convolutional neural network is respectively connected with the first cyclic neural network and the second cyclic neural network, the first cyclic neural network is connected with the first deep neural network, the second cyclic neural network is connected with the second deep neural network, the convolutional neural network is connected with the first cyclic neural network through a connecting unit, and the second cyclic neural network is also connected with the first cyclic neural network through the connecting unit.
In the embodiment of the present application, the training module 44 is configured to input the voice data sample into a preset neural network model, so that the convolutional neural network in the preset neural network model extracts a first feature of the voice data sample, inputs the first feature into the connection unit and the second recurrent neural network, and the second recurrent neural network detects a first voice activity feature and a first noise voice feature in the first feature, and extracts the first voice activity feature, inputs the first voice activity feature into the connection unit and the second deep neural network, and the connection unit splices the first feature and the first voice activity feature to obtain a second feature, inputs the second feature into the first recurrent neural network, detects a second voice activity feature and a second noise voice feature in the second feature by the first recurrent neural network, extracts the second voice activity feature, inputs the second voice activity feature into the first deep neural network, performs feature superposition to obtain a third feature, and determines a target value output by the second deep neural network.
Fig. 5 is a block diagram of a noise reduction device for voice data according to an embodiment of the present application, where the device may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 5, the apparatus includes:
the obtaining module 51 is configured to obtain original voice data to be processed.
The extracting module 52 is configured to input the original speech data into a pre-trained speech processing model, so that the speech processing model extracts features of the original speech data and outputs target values and target speech features based on the features.
The processing module 53 is configured to generate target voice data based on the target voice feature if the target value is used to indicate that the target voice feature satisfies a preset feature, where the preset feature is a feature of the voice data that does not carry noise.
The embodiment of the application also provides an electronic device, as shown in fig. 6, where the electronic device may include a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, where the processor 1501, the communication interface 1502 and the memory 1503 complete communication with each other through the communication bus 1504.
A memory 1503 for storing a computer program;
The processor 1501, when executing the computer program stored in the memory 1503, implements the steps of the above embodiments.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the method for training a speech processing model according to any of the above embodiments.
In yet another embodiment of the present application, a computer program product comprising instructions, which when run on a computer, causes the computer to perform the method of training a speech processing model as described in any of the above embodiments is also provided.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, these produce, in whole or in part, a flow or function in accordance with embodiments of the present application. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), etc.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.