
CN112712816A - Training method and device of voice processing model and voice processing method and device - Google Patents

Training method and device of voice processing model and voice processing method and device

Info

Publication number
CN112712816A
Authority
CN
China
Prior art keywords
signal
speech
processing model
noise
clean
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011537956.7A
Other languages
Chinese (zh)
Other versions
CN112712816B (en)
Inventor
任新蕾
郑羲光
李楠
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011537956.7A
Publication of CN112712816A
Application granted
Publication of CN112712816B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The disclosure relates to a training method and apparatus for a speech processing model, and a speech processing method and apparatus. The training method comprises the following steps: acquiring audio sample data, wherein each piece of audio sample data comprises a clean speech signal and a pop-noise speech signal, the pop-noise speech signal being obtained by additively mixing the clean speech signal with a microphone pop-noise signal; obtaining an estimated speech signal based on the pop-noise speech signal by using the speech processing model; calculating a loss function of the speech processing model based on the clean speech signal and the estimated speech signal; and training the speech processing model by using the calculated loss function.

Description

Training method and device of voice processing model and voice processing method and device
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method and an apparatus for training a speech processing model and a method and an apparatus for speech processing.
Background
Many users generate their own works with karaoke software on mobile phones, and one unavoidable problem is that the recorded songs often contain microphone pop noise. Pop noise arises when the mouth is too close to the microphone, so that the airflow from the mouth strikes the microphone, and it seriously degrades the quality of the karaoke recording. At present, neither academia nor industry has carried out dedicated research on this problem. Conventional signal processing techniques can analyze the time-frequency characteristics of pop noise in order to suppress it; however, the pop noise in real recordings has diverse characteristics, and such methods can hardly achieve an ideal effect.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus, so as to at least address the problems in the related art described above.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a speech processing model, the training method comprising: acquiring audio sample data, wherein each piece of audio sample data comprises a clean speech signal and a pop-noise speech signal, the pop-noise speech signal being obtained by additively mixing the clean speech signal with a pop-noise signal; obtaining an estimated speech signal based on the pop-noise speech signal by using the speech processing model; calculating a loss function of the speech processing model based on the clean speech signal and the estimated speech signal; and training the speech processing model by using the calculated loss function.
Optionally, the pop-noise signal may be obtained by performing enhancement processing on an originally collected pop-noise signal.
Optionally, the enhancement processing may include: removing non-pop-noise components from the originally collected pop-noise signal; and/or low-pass filtering the originally collected pop-noise signal.
Optionally, the non-pop-noise components may include unvoiced components.
Optionally, the pop-noise speech signal may be obtained by additively mixing the clean speech signal with the pop-noise signal according to a predetermined signal-to-noise ratio.
Optionally, the input of the speech processing model may be the magnitude spectrum signal of the pop-noise speech signal; wherein obtaining an estimated speech signal based on the pop-noise speech signal by using the speech processing model may include: performing a time-frequency transform on the pop-noise speech signal to obtain the magnitude spectrum signal of the pop-noise speech signal; and obtaining a magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the pop-noise speech signal; and wherein calculating a loss function of the speech processing model based on the clean speech signal and the estimated speech signal may include: performing a time-frequency transform on the clean speech signal to obtain a magnitude spectrum signal of the clean speech signal; and calculating the loss function based on the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal.
Optionally, in a case where the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, obtaining the magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the pop-noise speech signal may include: taking the output of the speech processing model as the magnitude spectrum signal of the estimated speech signal; or, in a case where the output of the speech processing model is an estimated mask ratio, obtaining the magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the pop-noise speech signal may include: multiplying the magnitude spectrum signal of the pop-noise speech signal by the estimated mask ratio output by the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal.
Alternatively, the loss function may be calculated from frequency-bin-wise losses between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal, wherein a weighting coefficient is set for the loss at each frequency bin, and the lower the frequency of a frequency bin, the larger its weighting coefficient.
Alternatively, the loss function may be expressed as:
loss = α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, the loss function may be expressed as the total of the weighted losses over all frequency bins of a frame:
loss = Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, the loss function may be expressed as the average of the weighted losses over all frequency bins of a frame:
loss = (1/K) * Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, a may be the natural constant e.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing method, comprising: acquiring a speech signal to be processed; and obtaining an estimated speech signal based on the speech signal to be processed by using a speech processing model trained with the training method according to the present disclosure.
Optionally, the input of the speech processing model may be the magnitude spectrum signal of the speech signal to be processed; wherein obtaining an estimated speech signal may include: performing a time-frequency transform on the speech signal to be processed to obtain a magnitude spectrum signal and a phase spectrum signal of the speech signal to be processed; obtaining a magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the speech signal to be processed; and combining the magnitude spectrum signal of the estimated speech signal with the phase spectrum signal of the speech signal to be processed and performing an inverse time-frequency transform to obtain the estimated speech signal.
Optionally, in a case where the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, obtaining the magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the speech signal to be processed may include: taking the output of the speech processing model as the magnitude spectrum signal of the estimated speech signal; or, in a case where the output of the speech processing model is an estimated mask ratio, it may include: multiplying the magnitude spectrum signal of the speech signal to be processed by the estimated mask ratio output by the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech processing model, the training apparatus comprising: an acquisition unit configured to acquire audio sample data, each piece of audio sample data comprising a clean speech signal and a pop-noise speech signal, the pop-noise speech signal being obtained by additively mixing the clean speech signal with a pop-noise signal; an estimation unit configured to obtain an estimated speech signal based on the pop-noise speech signal by using the speech processing model; a calculation unit configured to calculate a loss function of the speech processing model based on the clean speech signal and the estimated speech signal; and a training unit configured to train the speech processing model by using the calculated loss function.
Optionally, the pop-noise signal may be obtained by performing enhancement processing on an originally collected pop-noise signal.
Optionally, the enhancement processing may include: removing non-pop-noise components from the originally collected pop-noise signal; and/or low-pass filtering the originally collected pop-noise signal.
Optionally, the non-pop-noise components may include unvoiced components.
Optionally, the pop-noise speech signal may be obtained by additively mixing the clean speech signal with the pop-noise signal according to a predetermined signal-to-noise ratio.
Optionally, the input of the speech processing model may be the magnitude spectrum signal of the pop-noise speech signal; wherein the estimation unit may be configured to: perform a time-frequency transform on the pop-noise speech signal to obtain the magnitude spectrum signal of the pop-noise speech signal; and obtain a magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the pop-noise speech signal; and wherein the calculation unit may be configured to: perform a time-frequency transform on the clean speech signal to obtain a magnitude spectrum signal of the clean speech signal; and calculate the loss function based on the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal.
Optionally, in a case where the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, the estimation unit may be configured to take the output of the speech processing model as the magnitude spectrum signal of the estimated speech signal; or, in a case where the output of the speech processing model is an estimated mask ratio, the estimation unit may be configured to multiply the magnitude spectrum signal of the pop-noise speech signal by the estimated mask ratio output by the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal.
Alternatively, the loss function may be calculated from frequency-bin-wise losses between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal, wherein a weighting coefficient is set for the loss at each frequency bin, and the lower the frequency of a frequency bin, the larger its weighting coefficient.
Alternatively, the loss function may be expressed as:
loss = α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, the loss function may be expressed as the total of the weighted losses over all frequency bins of a frame:
loss = Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, the loss function may be expressed as the average of the weighted losses over all frequency bins of a frame:
loss = (1/K) * Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)],
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, from α(0) = a at the lowest frequency bin to α(K-1) = 1 at the highest frequency bin,
where k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, α(k) denotes the weighting coefficient at frequency bin k, and a is a constant greater than 1.
Alternatively, a may be the natural constant e.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, comprising: an acquisition unit configured to acquire a speech signal to be processed; and an estimation unit configured to obtain an estimated speech signal based on the speech signal to be processed by using a speech processing model trained with the training method according to the present disclosure.
Optionally, the input of the speech processing model may be the magnitude spectrum signal of the speech signal to be processed; wherein the estimation unit may be configured to: perform a time-frequency transform on the speech signal to be processed to obtain a magnitude spectrum signal and a phase spectrum signal of the speech signal to be processed; obtain a magnitude spectrum signal of the estimated speech signal by using the speech processing model based on the magnitude spectrum signal of the speech signal to be processed; and combine the magnitude spectrum signal of the estimated speech signal with the phase spectrum signal of the speech signal to be processed and perform an inverse time-frequency transform to obtain the estimated speech signal.
Optionally, in a case where the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, the estimation unit may be configured to take the output of the speech processing model as the magnitude spectrum signal of the estimated speech signal; or, in a case where the output of the speech processing model is an estimated mask ratio, the estimation unit may be configured to multiply the magnitude spectrum signal of the speech signal to be processed by the estimated mask ratio output by the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a method of training a speech processing model or a method of speech processing according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of training a speech processing model or a method of speech processing according to the present disclosure.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executable by a processor of a computer device to perform a method of training a speech processing model or a method of speech processing according to the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
According to the training method and training apparatus for a speech processing model and the speech processing method and apparatus of the present disclosure, training data can be generated by additively mixing clean speech with pop noise, a neural network can be trained on these data, and the trained neural network can then be used to suppress pop noise, thereby improving the pop-noise suppression effect.
In addition, according to the training method and training apparatus for a speech processing model and the speech processing method and apparatus of the present disclosure, the pop-noise data set used for generating the training data can be enhanced, which improves the pop-noise suppression capability, protects the high-frequency components of speech, and ensures that normal speech is not distorted.
In addition, according to the training method and training apparatus for a speech processing model and the speech processing method and apparatus of the present disclosure, the speech processing model is trained with a loss function whose weighting coefficients depend on the frequency bin, designed for the time-frequency distribution characteristics of pop noise; a speech processing model trained with this loss function can suppress pop noise more effectively.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is an overall schematic diagram illustrating a speech processing scheme according to an exemplary embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating a method of training a speech processing model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a voice processing method according to an exemplary embodiment of the present disclosure.
FIG. 4 is a block diagram illustrating a training apparatus of a speech processing model according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a voice processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram of an electronic device 600 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that, in the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Conventional signal processing techniques have limitations for the problem of pop-noise suppression. For example, in related patents, conventional signal processing techniques are used to analyze the time-frequency characteristics of pop noise and to decide whether the current signal belongs to pop noise; if so, the current signal is suppressed. Such schemes preset the time-frequency characteristics of pop noise and regard any signal that satisfies these characteristics as pop noise. However, pop noise in real recordings has diverse characteristics that such a preset cannot cover, so it is difficult to achieve an ideal effect. Meanwhile, neural networks have been increasingly applied to the field of audio signal processing, such as noise reduction and source separation, and achieve better performance than conventional signal processing techniques.
Therefore, in order to improve the pop-noise suppression effect, the present disclosure proposes a neural-network-based scheme for suppressing pop noise in singing. The scheme uses an additive-noise formulation and a neural network to suppress pop noise in karaoke recordings. Specifically, clean speech and pop noise may be recorded separately and then additively mixed to generate training data for training a speech processing model that comprises a neural network. In addition, the present disclosure enhances the pop-noise training set, which improves the pop-noise suppression capability, protects the high-frequency components of speech, and ensures that normal speech is not distorted. Furthermore, a loss function with frequency-bin-dependent weighting coefficients is designed for the time-frequency distribution characteristics of pop noise, and a speech processing model trained with this loss function can suppress pop noise effectively. Hereinafter, the training method and training apparatus of the speech processing model and the speech processing method and apparatus according to the present disclosure will be described in detail with reference to FIGS. 1 to 6.
Fig. 1 is an overall schematic diagram illustrating a speech processing scheme according to an exemplary embodiment of the present disclosure.
Referring to FIG. 1, training data may be generated by additively mixing clean speech from a clean speech data set with pop noise from a pop-noise data set at a certain signal-to-noise ratio. Here, the training data include the clean speech as the training target and the mixed pop-noise speech as the training sample. Considering how pop noise is produced, the ideal way to generate training data would be to record a set of utterances containing pop noise and, at the same time, record the corresponding clean utterances; in practice, however, this is essentially infeasible. The speech containing pop noise and the corresponding clean speech would have to be identical in every respect, including speaking rate, volume and emotion, except that the former contains pop noise, and such data cannot actually be obtained. Therefore, according to an exemplary embodiment of the present disclosure, clean speech and pop noise may be recorded separately and then additively mixed to generate the mixed pop-noise speech. The resulting pop-noise speech is identical to the clean speech except that it additionally contains pop noise. A speech processing model trained with training data generated in this way can therefore effectively suppress pop noise in karaoke recordings.
In addition, the pop-noise data set may be enhanced to improve the pop-noise suppression effect. For example, because of the diversity of the people who record it, the collected pop-noise data set often contains some non-pop-noise components (e.g., unvoiced components). In order not to suppress non-pop-noise components such as unvoiced sounds in speech, the non-pop-noise components (e.g., unvoiced components) in the pop-noise data set can be removed. As another example, pop noise in real speech is usually distributed only at low frequencies, whereas the collected pop-noise data set is distributed over the whole effective frequency band permitted by the sampling rate; low-pass filtering the collected pop-noise data set therefore further protects the high-frequency components of speech and ensures that normal speech is not distorted. The two enhancement steps can be applied in either order: the non-pop-noise components can be removed first and the low-pass filtering applied afterwards, or the low-pass filtering can be applied first and the non-pop-noise components removed afterwards.
After the training data are obtained, a time-frequency transform (e.g., a short-time Fourier transform (STFT)) may be applied to the clean speech signal and the pop-noise speech signal in the training data, and the magnitude spectrum signal (abs()) may be extracted from each of them in the time-frequency domain. The magnitude spectrum signal of the clean speech signal may then be used as the training target (label), and the magnitude spectrum signal of the pop-noise speech signal may be fed as the training feature (feature) into a neural network (e.g., a fully connected network, or a combination of a recurrent neural network and a fully connected network) for training, so as to obtain a trained speech processing model. Here, the training data may include a plurality of clean speech signals and a plurality of corresponding pop-noise speech signals, and the above operations may be performed on each clean speech signal and each corresponding pop-noise speech signal, so that the parameters of the speech processing model are updated iteratively.
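As an illustration of this training pipeline, the following is a minimal sketch in PyTorch. The network structure (a GRU followed by a fully connected layer with a sigmoid, outputting a mask ratio), the STFT settings, the optimizer, and names such as PopNoiseSuppressor and train_step are illustrative assumptions rather than details given in this disclosure; the plain L1 loss used here is a placeholder for the frequency-weighted loss described below.

```python
import torch
import torch.nn as nn

# Illustrative STFT settings (assumed; not fixed by the text).
N_FFT, HOP = 512, 256
K = N_FFT // 2 + 1  # number of effective frequency bins


def magnitude_spectrum(wave: torch.Tensor) -> torch.Tensor:
    """Time-frequency transform followed by magnitude extraction, i.e. abs(STFT(x))."""
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT), return_complex=True)
    return spec.abs().transpose(1, 2)  # (batch, frames, K)


class PopNoiseSuppressor(nn.Module):
    """Recurrent plus fully connected network mapping noisy magnitudes to a mask ratio in [0, 1]."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(K, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, K), nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(noisy_mag)
        return self.fc(h)  # estimated mask ratio per frame and frequency bin


model = PopNoiseSuppressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def train_step(noisy_wave: torch.Tensor, clean_wave: torch.Tensor) -> float:
    """One parameter update on a batch of (pop-noise speech, clean speech) waveforms."""
    noisy_mag = magnitude_spectrum(noisy_wave)    # training feature
    clean_mag = magnitude_spectrum(clean_wave)    # training target (label)
    est_mag = model(noisy_mag) * noisy_mag        # mask ratio times noisy magnitude
    loss = torch.mean(torch.abs(clean_mag - est_mag))  # placeholder L1; weighted loss shown later
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```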
In addition, since pop noise is concentrated mostly at low frequencies and its energy gradually decays towards high frequencies, a loss function with frequency-bin-dependent weighting coefficients can be designed for this characteristic, with relatively small weights at high frequencies and relatively large weights at low frequencies. Training the speech processing model with this loss function enables the trained model to suppress pop noise effectively.
After the trained speech processing model is obtained, the magnitude spectrum signal of a speech signal to be processed can be input into the model to obtain the magnitude spectrum signal of the estimated speech signal, and hence the estimated speech signal itself. The estimated speech signal obtained in this way, with the pop noise removed, achieves effective and accurate pop-noise suppression.
FIG. 2 is a flowchart illustrating a method of training a speech processing model according to an exemplary embodiment of the present disclosure.
Referring to FIG. 2, in step 201, audio sample data may be acquired, where each piece of audio sample data may include a clean speech signal and a pop-noise speech signal. Here, the clean speech signal may be a clean speech signal from a clean speech data set. The clean speech data set may be obtained by recording clean speech and/or via the Internet, or in any other feasible way. Here, clean speech refers to speech that contains essentially no noise, and a clean speech data set may be generated by collecting clean speech from many different speakers (e.g., men, women, children, the elderly, etc.).
Furthermore, the pop-noise speech signal may be obtained by additively mixing the clean speech signal with a pop-noise signal. Here, the pop-noise signal may be a pop-noise signal from a pop-noise data set. The pop-noise data set may be obtained by recording pop noise and/or via the Internet, or in any other feasible way. Here, pop noise refers to noise that contains only microphone pop noise, without any other noise or speech. A pop-noise data set may be generated by collecting pop noise from many different speakers (e.g., men, women, children, the elderly, etc.).
According to an exemplary embodiment of the present disclosure, when the pop-noise data set is acquired, the originally collected pop-noise signals in the pop-noise data set may first be enhanced to generate the pop-noise signals required for training.
For example, because of the diversity of the people who record it, the originally collected pop-noise signals often contain some non-pop-noise components (e.g., unvoiced components). In order not to suppress such non-pop-noise components in speech, the non-pop-noise components in the originally collected pop-noise signals can be removed. This can be done by distinguishing the pop-noise components from the non-pop-noise components in the originally collected pop-noise signal, for example by inspecting the spectrum of the signal and/or by listening to it: the spectrum of a pop-noise component tends to have relatively strong energy, and a pop-noise component sounds relatively heavy.
As another example, the pop noise in real speech is usually distributed only at low frequencies, whereas the collected pop noise is distributed over the effective frequency band permitted by the sampling rate; therefore, the originally collected pop-noise signals can be low-pass filtered, further protecting the high-frequency components of speech.
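As a sketch of the low-pass filtering step, the following uses a Butterworth filter from SciPy; the cutoff frequency and filter order are assumed values for illustration, since the text does not specify them.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def lowpass_pop_noise(noise: np.ndarray, sample_rate: int,
                      cutoff_hz: float = 1000.0, order: int = 6) -> np.ndarray:
    """Low-pass filter a collected pop-noise clip so that only its low-frequency
    content is kept before it is mixed with clean speech. cutoff_hz and order
    are assumed values; the text does not specify them."""
    sos = butter(order, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, noise)
```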
According to an exemplary embodiment of the present disclosure, the clean speech signal may be additively mixed with the pop-noise signal at a predetermined signal-to-noise ratio (SNR) to obtain the pop-noise speech signal. Here, the SNR (in dB) can be expressed as the following equation:
SNR = 10 × log10(x(t) / y(t))    (1)
where x(t) represents the energy of the clean speech signal and y(t) represents the energy of the pop-noise signal.
For example, during training of the speech processing model, the SNR may cover four values: 0 dB, 5 dB, 10 dB and 15 dB. For each pair of a clean speech signal and a pop-noise signal, one of the four SNRs can be selected at random for additively mixing the two, so that pop-noise speech signals covering all four SNRs are generated. Of course, the SNRs of the present disclosure are not limited to these four values, and any feasible SNR may be used for the additive mixing.
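The following sketch illustrates equation (1) and the additive mixing step: the pop-noise clip is scaled so that the mixture reaches the chosen signal-to-noise ratio before being added to the clean speech. The function name and the computation of signal energy as a sum of squared samples are illustrative assumptions.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, pop_noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix clean speech with pop noise so that
    10 * log10(E_clean / E_noise) equals snr_db (equation (1))."""
    # Make the noise the same length as the clean speech.
    if len(pop_noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(pop_noise)))
        pop_noise = np.tile(pop_noise, reps)
    pop_noise = pop_noise[:len(clean)]

    e_clean = np.sum(clean ** 2) + 1e-12      # energy of the clean speech signal
    e_noise = np.sum(pop_noise ** 2) + 1e-12  # energy of the pop-noise signal
    target_e_noise = e_clean / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_e_noise / e_noise)  # scale factor applied to the noise
    return clean + gain * pop_noise


# Example: pick one of the four SNRs mentioned in the text at random.
# snr = np.random.choice([0.0, 5.0, 10.0, 15.0])
# noisy_speech = mix_at_snr(clean_speech, pop_noise_clip, snr)
```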
After the audio sample data are acquired, the speech processing model may be trained based on the clean speech signals and the pop-noise speech signals included in the audio sample data.
Specifically, in step 202, an estimated speech signal may be obtained based on the pop-noise speech signal by using the speech processing model. In step 203, a loss function of the speech processing model may be calculated based on the clean speech signal and the estimated speech signal. In step 204, the speech processing model may be trained by using the calculated loss function.
According to an exemplary embodiment of the present disclosure, the input of the speech processing model is the magnitude spectrum signal of the pop-noise speech signal, and the output may be the magnitude spectrum signal of the estimated speech signal or an estimated mask ratio (Mask). Here, the mask ratio may refer to the ratio of the magnitude spectrum of the clean speech signal to the magnitude spectrum of the pop-noise speech signal. The magnitude spectrum signal of the pop-noise speech signal may be obtained by performing a time-frequency transform (e.g., STFT) on the pop-noise speech signal; that is, the pop-noise speech signal is transformed from the time domain to the frequency domain and its magnitude is extracted in the frequency domain. After the magnitude spectrum signal of the pop-noise speech signal is obtained, it can be input into the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal. For example, when the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, the model directly outputs that magnitude spectrum signal. As another example, when the output of the speech processing model is the estimated mask ratio, the magnitude spectrum signal of the pop-noise speech signal can be multiplied by the estimated mask ratio output by the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal.
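The two output conventions can be summarized in a small helper; the function and argument names are illustrative assumptions.

```python
import numpy as np


def estimated_magnitude(noisy_mag: np.ndarray, model_output: np.ndarray,
                        output_is_mask: bool) -> np.ndarray:
    """Return the magnitude spectrum of the estimated speech: either the model
    output already is that magnitude, or it is a mask ratio that is multiplied
    element-wise with the magnitude spectrum of the pop-noise speech."""
    return noisy_mag * model_output if output_is_mask else model_output
```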
According to an exemplary embodiment of the present disclosure, a time-frequency transform (e.g., STFT) may be performed on the clean speech signal to obtain the magnitude spectrum signal of the clean speech signal; that is, the clean speech signal is transformed from the time domain to the frequency domain and its magnitude is extracted in the frequency domain. Subsequently, the loss function is calculated based on the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal, and the speech processing model is trained using the calculated loss function.
According to an exemplary embodiment of the present disclosure, since pop noise is concentrated mostly at low frequencies and its energy gradually decays towards high frequencies, a loss function with frequency-bin-dependent weighting coefficients can be designed for this characteristic, with relatively small weights at high frequencies and relatively large weights at low frequencies. That is, the loss function may be calculated from frequency-bin-wise losses between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal, where a weighting coefficient is set for the loss at each frequency bin, and the lower the frequency of a frequency bin, the larger its weighting coefficient.
For example, the loss function can be designed as:
loss = α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)]    (2)
where α(k) is a frequency-dependent weighting coefficient that decreases monotonically with the frequency-bin index k, k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, and a is a constant greater than 1. For example, a may be set to the natural constant e.
For example, taking a 512-point STFT as an example, the weighting coefficient α(0) of the loss at the first (lowest) frequency bin equals a, and the weighting coefficient α(K-1) of the loss at the last (highest) frequency bin equals 1.
Therefore, the loss at each frequency bin of a frame between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal can be obtained from equation (2), and the parameters of the speech processing model are updated based on the loss at each frequency bin to train the speech processing model.
As another example, the loss function may be designed as the total of the weighted losses over all frequency bins of a frame:
loss = Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)]    (3)
where α(k) is the same frequency-dependent weighting coefficient as in equation (2), k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, and a is a constant greater than 1. For example, a may be set to the natural constant e.
Therefore, the overall loss over all frequency bins of a frame between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal can be obtained from equation (3), and the parameters of the speech processing model are updated based on this overall loss to train the speech processing model.
As another example, the loss function may be designed as the average of the weighted losses over all frequency bins of a frame:
loss = (1/K) * Σ_{k=0}^{K-1} α(k) * [Mag_clean(t, k) - Mag_clean_est(t, k)]    (4)
where α(k) is the same frequency-dependent weighting coefficient as in equation (2), k denotes the frequency-bin index with 0 ≤ k < K, K denotes the total number of effective frequency bins, Mag_clean(t, k) denotes the magnitude spectrum signal of the clean speech signal, Mag_clean_est(t, k) denotes the magnitude spectrum signal of the estimated speech signal, t denotes the frame index, and a is a constant greater than 1. For example, a may be set to the natural constant e.
Therefore, the average loss over all frequency bins of a frame between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal can be obtained from equation (4), and the parameters of the speech processing model are updated based on this average loss to train the speech processing model.
Of course, the loss function and the weighting coefficients of the present disclosure are not limited to the above examples; any feasible loss function, any feasible loss function with frequency-bin-dependent weighting coefficients, and any feasible weighting coefficients may be designed.
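To make the frequency-weighted loss concrete, the following sketch implements the per-bin, summed and averaged variants corresponding to equations (2) to (4), in the same PyTorch setting as the training sketch above. Since the text specifies only that α(k) decreases from α(0) = a to α(K-1) = 1, the code assumes an exponential decay between these endpoints; using the absolute difference so that the loss is non-negative is also an assumption.

```python
import math
import torch


def bin_weights(num_bins: int, a: float = math.e) -> torch.Tensor:
    """Weighting coefficients alpha(k), k = 0..K-1, decreasing from alpha(0) = a
    to alpha(K-1) = 1; the exponential shape between these endpoints is assumed."""
    k = torch.arange(num_bins, dtype=torch.float32)
    return torch.pow(torch.tensor(float(a)), (num_bins - 1 - k) / (num_bins - 1))


def weighted_loss(clean_mag: torch.Tensor, est_mag: torch.Tensor,
                  reduction: str = "mean") -> torch.Tensor:
    """Frequency-weighted magnitude-spectrum loss.

    clean_mag, est_mag: (frames, K) magnitude spectrograms of the clean and the
    estimated speech. reduction selects the per-bin variant (equation (2),
    reduction="none"), the per-frame total (equation (3), reduction="sum") or
    the per-frame average (equation (4), reduction="mean")."""
    alpha = bin_weights(clean_mag.shape[-1])
    per_bin = alpha * (clean_mag - est_mag).abs()  # abs() assumed so the loss is non-negative
    if reduction == "none":
        return per_bin                      # loss at every frequency bin of every frame
    if reduction == "sum":
        return per_bin.sum(dim=-1).mean()   # total over bins of a frame, averaged over frames
    return per_bin.mean()                   # average over bins and frames
```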
Fig. 3 is a flowchart illustrating a voice processing method according to an exemplary embodiment of the present disclosure.
Referring to FIG. 3, in step 301, a speech signal to be processed may be acquired. Here, the speech signal to be processed may be a karaoke work, a recorded speech, or the like recorded by a user through client software using the microphone of a terminal, or it may be a speech audio file acquired from a local storage, a local database, or an external data source (e.g., the Internet, a server, a database, etc.) through an input device or a transmission medium.
In step 302, an estimated speech signal may be obtained based on the speech signal to be processed by using the speech processing model trained with the training method of the present disclosure. Here, the estimated speech signal refers to the speech signal from which the pop noise has been removed by the speech processing model. During inference (application), the input of the speech processing model may be the magnitude spectrum signal of the speech signal to be processed, and the output may be the magnitude spectrum signal of the estimated speech signal or an estimated mask ratio. Here, the mask ratio may refer to the ratio of the magnitude spectrum of the clean speech signal to the magnitude spectrum of the speech signal to be processed.
According to an exemplary embodiment of the present disclosure, a time-frequency transform (e.g., a short-time Fourier transform (STFT)) may be performed on the speech signal to be processed to obtain its magnitude spectrum signal and phase spectrum signal; then the magnitude spectrum signal of the estimated speech signal is obtained by using the speech processing model based on the magnitude spectrum signal of the speech signal to be processed; and finally the magnitude spectrum signal of the estimated speech signal is combined with the phase spectrum signal of the speech signal to be processed and an inverse time-frequency transform (e.g., an inverse short-time Fourier transform (ISTFT)) is applied to obtain the estimated speech signal. For example, when the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, that magnitude spectrum signal is obtained directly from the output of the model. As another example, when the output of the speech processing model is the estimated mask ratio, the magnitude spectrum signal of the speech signal to be processed can be multiplied by the estimated mask ratio output by the model to obtain the magnitude spectrum signal of the estimated speech signal.
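A minimal inference sketch under the same assumptions as the training sketch above (PyTorch STFT/ISTFT, a model that outputs a mask ratio); the 512-point STFT parameters and the function name are illustrative.

```python
import torch

N_FFT, HOP = 512, 256  # illustrative STFT settings, matching the training sketch


def suppress_pop_noise(model: torch.nn.Module, wave: torch.Tensor) -> torch.Tensor:
    """Apply the trained model to a speech signal to be processed and rebuild the
    time-domain estimate from the estimated magnitude and the original phase."""
    window = torch.hann_window(N_FFT)
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP, window=window,
                      return_complex=True)             # (K, frames)
    mag, phase = spec.abs(), torch.angle(spec)
    with torch.no_grad():
        mask = model(mag.T.unsqueeze(0)).squeeze(0).T  # model expects (batch, frames, K)
    est_mag = mask * mag                               # estimated magnitude spectrum
    est_spec = torch.polar(est_mag, phase)             # recombine with the original phase
    return torch.istft(est_spec, n_fft=N_FFT, hop_length=HOP,
                       window=window, length=wave.shape[-1])
```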
FIG. 4 is a block diagram illustrating a training apparatus of a speech processing model according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, a training apparatus 400 of a speech processing model according to an exemplary embodiment of the present disclosure may include an acquisition unit 401, an estimation unit 402, a calculation unit 403, and a training unit 404.
The acquisition unit 401 may acquire audio sample data, where each piece of audio sample data may include a clean speech signal and a pop-noise speech signal. Here, the clean speech signal may be a clean speech signal from a clean speech data set. The clean speech data set may be obtained by recording clean speech and/or via the Internet, or in any other feasible way. Here, clean speech refers to speech that contains essentially no noise, and a clean speech data set may be generated by collecting clean speech from many different speakers (e.g., men, women, children, the elderly, etc.).
Furthermore, the pop-noise speech signal may be obtained by additively mixing the clean speech signal with a pop-noise signal. For example, the acquisition unit 401 may itself obtain the pop-noise speech signal by additively mixing the clean speech signal with a pop-noise signal, or the acquisition unit 401 may directly acquire a pop-noise speech signal that has already been obtained by such additive mixing. Here, the pop-noise signal may be a pop-noise signal from a pop-noise data set. The pop-noise data set may be obtained by recording pop noise and/or via the Internet, or in any other feasible way. Here, pop noise refers to noise that contains only microphone pop noise, without any other noise or speech. A pop-noise data set may be generated by collecting pop noise from many different speakers (e.g., men, women, children, the elderly, etc.).
According to an exemplary embodiment of the present disclosure, when the pop-noise data set is acquired, the originally collected pop-noise signals in the pop-noise data set may first be enhanced to generate the pop-noise signals required for training.
For example, because of the diversity of the people who record it, the originally collected pop-noise signals often contain some non-pop-noise components (e.g., unvoiced components). In order not to suppress such non-pop-noise components in speech, the non-pop-noise components in the originally collected pop-noise signals can be removed. This can be done by distinguishing the pop-noise components from the non-pop-noise components in the originally collected pop-noise signal, for example by inspecting the spectrum of the signal and/or by listening to it: the spectrum of a pop-noise component tends to have relatively strong energy, and a pop-noise component sounds relatively heavy.
As another example, the pop noise in real speech is usually distributed only at low frequencies, whereas the collected pop noise is distributed over the effective frequency band permitted by the sampling rate; therefore, the originally collected pop-noise signals can be low-pass filtered, further protecting the high-frequency components of speech.
According to an exemplary embodiment of the present disclosure, the clean speech signal may be additively mixed with the pop-noise signal at a predetermined signal-to-noise ratio to obtain the pop-noise speech signal.
For example, during training of the speech processing model, the signal-to-noise ratio may cover four values: 0 dB, 5 dB, 10 dB and 15 dB. For each pair of a clean speech signal and a pop-noise signal, one of the four signal-to-noise ratios can be selected at random for additively mixing the two, so that pop-noise speech signals covering all four signal-to-noise ratios are generated. Of course, the signal-to-noise ratios of the present disclosure are not limited to these four values, and any feasible signal-to-noise ratio may be used for the additive mixing.
After the audio sample data are acquired by the acquisition unit 401, the speech processing model may be trained based on the clean speech signals and the pop-noise speech signals included in the audio sample data.
Specifically, the estimation unit 402 may obtain an estimated speech signal based on the pop-noise speech signal by using the speech processing model. The calculation unit 403 may calculate a loss function of the speech processing model based on the clean speech signal and the estimated speech signal. The training unit 404 may train the speech processing model by using the calculated loss function.
According to an exemplary embodiment of the present disclosure, the input of the speech processing model is the magnitude spectrum signal of the pop-noise speech signal, and the output may be the magnitude spectrum signal of the estimated speech signal or an estimated mask ratio (Mask). Here, the mask ratio may refer to the ratio of the magnitude spectrum of the clean speech signal to the magnitude spectrum of the pop-noise speech signal. The estimation unit 402 may obtain the magnitude spectrum signal of the pop-noise speech signal by performing a time-frequency transform (e.g., STFT) on the pop-noise speech signal; that is, the estimation unit 402 transforms the pop-noise speech signal from the time domain to the frequency domain and extracts its magnitude in the frequency domain. After obtaining the magnitude spectrum signal of the pop-noise speech signal, the estimation unit 402 may input it into the speech processing model to obtain the magnitude spectrum signal of the estimated speech signal. For example, when the output of the speech processing model is the magnitude spectrum signal of the estimated speech signal, the estimation unit 402 may directly take the output of the model as that magnitude spectrum signal. As another example, when the output of the speech processing model is the estimated mask ratio, the estimation unit 402 may multiply the magnitude spectrum signal of the pop-noise speech signal by the estimated mask ratio output by the model to obtain the magnitude spectrum signal of the estimated speech signal.
According to an exemplary embodiment of the present disclosure, the calculation unit 403 may perform a time-frequency transform (e.g., STFT) on the clean speech signal to obtain the magnitude spectrum signal of the clean speech signal; that is, the calculation unit 403 transforms the clean speech signal from the time domain to the frequency domain and extracts its magnitude in the frequency domain. Subsequently, the calculation unit 403 calculates the loss function based on the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal.
According to an exemplary embodiment of the present disclosure, since pop noise is concentrated mostly at low frequencies and its energy gradually decays towards high frequencies, a loss function with frequency-bin-dependent weighting coefficients can be designed for this characteristic, with relatively small weights at high frequencies and relatively large weights at low frequencies. That is, the loss function may be calculated from frequency-bin-wise losses between the magnitude spectrum signal of the clean speech signal and the magnitude spectrum signal of the estimated speech signal, where a weighting coefficient is set for the loss at each frequency bin, and the lower the frequency of a frequency bin, the larger its weighting coefficient.
For example, the loss function may be designed as any one of equations (2) to (4) shown above. In the case where the loss function is designed as shown in equation (2), the training unit 404 may obtain, through the loss function of equation (2), the loss at each frequency bin of a frame signal between the amplitude spectrum signal of the clean speech signal and the amplitude spectrum signal of the estimated speech signal, and update the parameters of the speech processing model based on the loss at each frequency bin to train the speech processing model.
In the case where the loss function is designed as shown in equation (3), the training unit 404 may obtain, through the loss function of equation (3), the total loss over the frequency bins of a frame signal between the amplitude spectrum signal of the clean speech signal and the amplitude spectrum signal of the estimated speech signal, and update the parameters of the speech processing model based on the total loss to train the speech processing model.
In the case where the loss function is designed as shown in equation (4), the training unit 404 may obtain, through the loss function of equation (4), the average loss over the frequency bins of a frame signal between the amplitude spectrum signal of the clean speech signal and the amplitude spectrum signal of the estimated speech signal, and update the parameters of the speech processing model based on the average loss to train the speech processing model.
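Since equations (2) to (4) are given earlier in the document and are not reproduced here, the following sketch only illustrates the general structure of such a frequency-weighted spectral loss; the squared-error term and the `1/(1+k)` weighting are illustrative assumptions, as the disclosure only requires that lower-frequency bins receive larger weights:

```python
import numpy as np

def weighted_spectral_loss(clean_mag, est_mag, reduction="mean"):
    """Frequency-weighted loss between clean and estimated amplitude spectra.

    clean_mag, est_mag: arrays of shape (num_bins, num_frames).
    The 1/(1+k) weighting and the squared-error term are illustrative only;
    the key property is that low-frequency bins receive larger weights.
    """
    num_bins = clean_mag.shape[0]
    weights = 1.0 / (1.0 + np.arange(num_bins))     # low bins weighted more
    per_bin = weights[:, None] * (clean_mag - est_mag) ** 2
    if reduction == "none":    # per-bin losses (cf. a form like equation (2))
        return per_bin
    if reduction == "sum":     # total loss over all bins (cf. equation (3))
        return per_bin.sum()
    return per_bin.mean()      # average loss over bins (cf. equation (4))
```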
Of course, the loss function and the weighting coefficients of the present disclosure are not limited to the above examples, and any suitable loss function with frequency-bin-dependent weighting coefficients, and any suitable weighting coefficients, may be designed.
Fig. 5 is a block diagram illustrating a voice processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, a voice processing apparatus 500 according to an exemplary embodiment of the present disclosure may include an acquisition unit 501 and an estimation unit 502.
The acquisition unit 501 may acquire a speech signal to be processed. Here, the speech signal to be processed may be a karaoke work, a recorded speech, or the like recorded by a user through client software using the microphone of a terminal, or may be a speech audio file acquired from a local storage, a local database, or an external data source (e.g., the Internet, a server, a database, etc.) through an input device or a transmission medium.
The estimation unit 502 may obtain an estimated speech signal based on the speech signal to be processed by using the speech processing model trained according to the training method of the present disclosure. Here, the estimated speech signal refers to a speech signal from which pop noise has been removed by the speech processing model. In the inference (application) stage, the input of the speech processing model may be the amplitude spectrum signal of the speech signal to be processed, and the output may be the amplitude spectrum signal of the estimated speech signal or the estimated mask ratio. Here, the mask ratio may refer to the ratio of the amplitude spectrum of the clean speech signal to the amplitude spectrum of the speech signal to be processed.
According to an exemplary embodiment of the present disclosure, the estimation unit 502 may perform a time-frequency transform (e.g., a short-time Fourier transform (STFT)) on the speech signal to be processed to obtain the amplitude spectrum signal and the phase spectrum signal of the speech signal to be processed; then obtain the amplitude spectrum signal of the estimated speech signal based on the amplitude spectrum signal of the speech signal to be processed by using the speech processing model; and then combine the amplitude spectrum signal of the estimated speech signal with the phase spectrum signal of the speech signal to be processed and perform an inverse time-frequency transform (e.g., an inverse short-time Fourier transform (ISTFT)) to obtain the estimated speech signal. For example, in the case where the output of the speech processing model is the amplitude spectrum signal of the estimated speech signal, the estimation unit 502 may directly take the output of the speech processing model as the amplitude spectrum signal of the estimated speech signal. For another example, in the case where the output of the speech processing model is the estimated mask ratio, the estimation unit 502 may multiply the amplitude spectrum signal of the speech signal to be processed by the estimated mask ratio output by the speech processing model to obtain the amplitude spectrum signal of the estimated speech signal.
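Under the same illustrative assumptions (SciPy STFT/ISTFT, a placeholder `model` that outputs a mask ratio), the inference path described above might look as follows:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(speech, model, fs=16000, nperseg=512):
    """Remove pop noise from a time-domain speech signal using a trained model."""
    # STFT: obtain the complex spectrum, then split into magnitude and phase.
    _, _, spec = stft(speech, fs=fs, nperseg=nperseg)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Assumed mask-type output; if the model outputs a spectrum, use it directly.
    est_magnitude = magnitude * model(magnitude)
    # Recombine with the phase of the input and transform back to the time domain.
    _, estimated_speech = istft(est_magnitude * np.exp(1j * phase),
                                fs=fs, nperseg=nperseg)
    return estimated_speech
```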
Fig. 6 is a block diagram of an electronic device 600 according to an example embodiment of the present disclosure.
Referring to fig. 6, the electronic device 600 includes at least one memory 601 and at least one processor 602, the at least one memory 601 storing a set of computer-executable instructions which, when executed by the at least one processor 602, cause the at least one processor 602 to perform the training method of a speech processing model or the speech processing method according to exemplary embodiments of the present disclosure.
By way of example, the electronic device 600 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device 600 need not be a single electronic device and can be any arrangement or collection of circuits capable of executing the above instructions (or instruction sets), either individually or in combination. The electronic device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 600, the processor 602 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 602 may execute instructions or code stored in the memory 601, wherein the memory 601 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 601 may be integrated with the processor 602, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 601 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 601 and the processor 602 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 602 can read files stored in the memory.
Further, the electronic device 600 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 600 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of a speech processing model or the speech processing method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising instructions executable by a processor of a computer device to perform the training method of a speech processing model or the speech processing method according to exemplary embodiments of the present disclosure.
According to the training method and training apparatus for a speech processing model and the speech processing method and speech processing apparatus of the present disclosure, pop noise is treated as additive noise: training data can be generated by additively mixing clean speech with pop noise, a neural network is trained on these data, and the trained neural network is then used to suppress pop noise, thereby improving the pop noise suppression effect.
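As a hedged illustration of this additive mixing step (the function name and the truncation strategy are assumptions; the predetermined SNR corresponds to claim 5):

```python
import numpy as np

def mix_at_snr(clean, pop_noise, snr_db):
    """Additively mix clean speech with pop noise at a predetermined SNR (dB),
    producing the pop-noise speech signal used as a training input."""
    # Align lengths (illustrative choice: truncate to the shorter signal).
    n = min(len(clean), len(pop_noise))
    clean, pop_noise = clean[:n], pop_noise[:n]
    # Scale the noise so that the clean-to-noise power ratio matches snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(pop_noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * pop_noise
```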
In addition, according to the training method and training apparatus for a speech processing model and the speech processing method and speech processing apparatus of the present disclosure, the pop noise set used for generating the training data can be enhanced, which improves the pop noise suppression capability while protecting the high-frequency components of speech and ensuring that normal speech is not distorted.
In addition, according to the training method and training apparatus for a speech processing model and the speech processing method and speech processing apparatus of the present disclosure, the speech processing model is trained with a loss function whose weighting coefficients depend on the frequency bin, designed for the time-frequency distribution characteristics of pop noise; a speech processing model trained with such a loss function can suppress pop noise more effectively.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a speech processing model, the method comprising:
acquiring audio sample data, wherein each piece of audio sample data comprises a clean speech signal and a pop-noise speech signal, and the pop-noise speech signal is obtained by additively mixing the clean speech signal with a pop noise signal;
obtaining an estimated speech signal based on the pop-noise speech signal by using the speech processing model;
calculating a loss function of the speech processing model based on the clean speech signal and the estimated speech signal;
and training the speech processing model by using the calculated loss function.
2. The training method according to claim 1, wherein the pop noise signal is obtained by performing enhancement processing on an originally acquired pop noise signal.
3. The training method according to claim 2, wherein the enhancement processing comprises:
removing non-pop-noise components from the originally acquired pop noise signal; and/or
performing low-pass filtering on the originally acquired pop noise signal.
4. The training method according to claim 3, wherein the non-pop-noise components comprise unvoiced components.
5. The training method according to claim 1, wherein the pop-noise speech signal is obtained by additively mixing the clean speech signal with the pop noise signal according to a predetermined signal-to-noise ratio.
6. A method of speech processing, comprising:
acquiring a speech signal to be processed;
obtaining an estimated speech signal based on the speech signal to be processed by using a speech processing model trained according to the training method of any one of claims 1 to 5.
7. An apparatus for training a speech processing model, the apparatus comprising:
an acquisition unit configured to acquire audio sample data, each piece of audio sample data comprising a clean speech signal and a pop-noise speech signal, the pop-noise speech signal being obtained by additively mixing the clean speech signal with a pop noise signal;
an estimation unit configured to obtain an estimated speech signal based on the pop-noise speech signal by using the speech processing model;
a calculation unit configured to calculate a loss function of the speech processing model based on the clean speech signal and the estimated speech signal;
a training unit configured to train the speech processing model using the calculated loss function.
8. A speech processing apparatus, comprising:
an acquisition unit configured to acquire a speech signal to be processed;
an estimation unit configured to obtain an estimated speech signal based on the speech signal to be processed by using a speech processing model trained according to the training method of any one of claims 1 to 5.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of a speech processing model according to any one of claims 1 to 5 or the speech processing method according to claim 6.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of training a speech processing model according to any one of claims 1 to 5 or a method of speech processing according to claim 6.
CN202011537956.7A 2020-12-23 2020-12-23 Training method and device for voice processing model and voice processing method and device Active CN112712816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011537956.7A CN112712816B (en) 2020-12-23 2020-12-23 Training method and device for voice processing model and voice processing method and device

Publications (2)

Publication Number Publication Date
CN112712816A true CN112712816A (en) 2021-04-27
CN112712816B CN112712816B (en) 2023-06-20

Family

ID=75543509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011537956.7A Active CN112712816B (en) 2020-12-23 2020-12-23 Training method and device for voice processing model and voice processing method and device

Country Status (1)

Country Link
CN (1) CN112712816B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104409081A (en) * 2014-11-25 2015-03-11 广州酷狗计算机科技有限公司 Speech signal processing method and device
CN109841220A (en) * 2017-11-24 2019-06-04 深圳市腾讯计算机系统有限公司 Speech processing model training method, device, electronic equipment and storage medium
US20190311711A1 (en) * 2018-04-10 2019-10-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Speech enhancement method, system, computer equipment and storage medium
CN110675888A (en) * 2019-09-25 2020-01-10 电子科技大学 Speech enhancement method based on RefineNet and evaluation loss
CN111161752A (en) * 2019-12-31 2020-05-15 歌尔股份有限公司 Echo cancellation method and device
CN111261148A (en) * 2020-03-13 2020-06-09 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111554321A (en) * 2020-04-20 2020-08-18 北京达佳互联信息技术有限公司 Noise reduction model training method and device, electronic equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284507A (en) * 2021-05-14 2021-08-20 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113284507B (en) * 2021-05-14 2024-02-13 北京达佳互联信息技术有限公司 Training method and device for voice enhancement model and voice enhancement method and device
CN113707163A (en) * 2021-08-31 2021-11-26 北京达佳互联信息技术有限公司 Speech processing method and apparatus, and model training method and apparatus
CN113707163B (en) * 2021-08-31 2024-05-14 北京达佳互联信息技术有限公司 Speech processing method and device and model training method and device
CN113593594A (en) * 2021-09-01 2021-11-02 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113593594B (en) * 2021-09-01 2024-03-08 北京达佳互联信息技术有限公司 Training method and equipment for voice enhancement model and voice enhancement method and equipment

Also Published As

Publication number Publication date
CN112712816B (en) 2023-06-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant