CN119864047B - Audio separation method, system and related device - Google Patents
Audio separation method, system and related device
- Publication number
- CN119864047B (Application No. CN202411781377.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- target
- sound
- candidate
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L21/0272: Voice signal separating (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation; G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal in order to modify its quality or intelligibility)
- G10L15/063: Training (under G10L15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; G10L15/00 Speech recognition)
- G10L17/04: Training, enrolment or model building (under G10L17/00 Speaker identification or verification techniques)
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Filters That Use Time-Delay Elements (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The application discloses an audio separation method, an audio separation system and a related device. The method includes: obtaining audio to be separated; and inputting the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio. The target separation model is trained with a plurality of target training samples, which are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories; each initial training audio includes reference sound parts respectively matched with the audio track categories, and the candidate sound parts are used to replace at least one reference sound part in the initial training audio. In this way, the application improves the accuracy of audio separation.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio separation method, system, and related apparatus.
Background
Audio separation technology aims to accurately extract a target sound signal from a complex multi-source environment. It is widely used for speech recognition in noisy environments, background-noise removal in music production, speaker separation in meeting recordings, sound-event detection in security monitoring, and the like. With the rapid development of artificial intelligence and machine learning, modern audio separation continues to break through the limitations of traditional methods, achieving more accurate and real-time separation and greatly improving the naturalness and efficiency of human-computer interaction. Existing audio separation methods are mainly implemented with neural network models, and the training quality of the model directly affects the accuracy of audio separation. To improve the training effect, a large number of training samples are typically constructed, which consumes considerable cost.
In view of this, how to improve the efficiency of audio separation and reduce its cost is an urgent issue to be resolved.
Disclosure of Invention
The application mainly solves the technical problem of providing an audio separation method, an audio separation system and a related device, which can improve the accuracy of audio separation.
To solve the above technical problem, one technical solution adopted by the present application is to provide an audio separation method, including: obtaining audio to be separated; and inputting the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, where the target separation model is trained with a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories, the initial training audios include reference sound parts respectively matched with the audio track categories, and the candidate sound parts are used to replace at least one reference sound part in the initial training audios.
To solve the above technical problem, another technical solution adopted by the present application is to provide an audio separation system including an acquisition module and a separation module. The acquisition module is used to acquire audio to be separated. The separation module is used to input the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, where the target separation model is trained with a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories, the initial training audios include reference sound parts respectively matched with the audio track categories, and the candidate sound parts are used to replace at least one reference sound part in the initial training audios.
To solve the above technical problem, a further technical solution adopted by the present application is to provide an electronic device including a memory and a processor coupled to each other, where the memory stores program instructions and the processor is used to execute the program instructions to implement the audio separation method in the above technical solution.
To solve the above technical problem, yet another technical solution adopted by the present application is to provide a computer-readable storage medium storing program instructions that, when executed by a processor, implement the audio separation method in the above technical solution.
The beneficial effects are as follows: unlike the prior art, the present application performs sample expansion on the initial training audios using the candidate sound parts respectively matched with the plurality of audio track categories, which increases the number of target training samples for model training and thereby improves the training effect on the target separation model and the stability of the trained model. Accordingly, once the audio to be separated is obtained, the trained target separation model separates it with higher stability and accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flowchart of an embodiment of an audio separation method according to the present application;
FIG. 2 is a flowchart of an embodiment of constructing a target training sample;
FIG. 3 is a flowchart of step S204 in FIG. 2 according to another embodiment;
FIG. 4 is a flowchart of step S102 in FIG. 1 according to another embodiment;
FIG. 5 is a flowchart of step S402 in FIG. 4 according to another embodiment;
FIG. 6 is a schematic structural diagram of a first processing network according to an embodiment;
FIG. 7 is a flowchart of step S404 in FIG. 4 according to another embodiment;
FIG. 8 is a flowchart of an embodiment of a training method of the target separation model of the present application;
FIG. 9 is a schematic structural diagram of an embodiment of an audio separation system according to the present application;
FIG. 10 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 11 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments, and that different embodiments may be adaptively combined. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flow chart of an embodiment of an audio separation method according to the present application, the method includes:
s101, acquiring audio to be separated.
In one embodiment, the audio to be separated, i.e., the audio on which a separation task is to be performed, is obtained.
In an implementation scenario, the audio to be separated is a musical composition including a musical accompaniment and a vocal part, and the separating task is to separate the musical accompaniment part from the vocal part in the audio to be separated.
In another embodiment, after the audio to be separated is obtained, noise reduction is performed on it to improve the accuracy of subsequent processing.
Of course, in other embodiments, the audio to be separated may include other types of sounds, and the separation task may be determined according to the types of sounds it contains; for example, when the audio to be separated includes speech and environmental noise, the separation task is to separate the speech from the environmental noise.
And S102, inputting the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, wherein the target separation model is obtained by training by using a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories, the initial training audios comprise reference sound parts respectively matched with the plurality of audio track categories, and the candidate sound parts are used for replacing at least one reference sound part in the initial training audios.
In an embodiment, a trained target separation model is obtained, and the obtained audio to be separated is input to the target separation model to obtain a first sub-audio and a second sub-audio corresponding to the audio to be separated.
In an implementation scenario, the audio to be separated is a music song, the first sub-audio obtained after separation is an accompaniment part, and the second sub-audio is a human voice part other than the accompaniment part. The target separation model is obtained by training a plurality of target training samples, and the target training samples are obtained by sample expansion by using a plurality of initial training audios and candidate sound parts with a plurality of audio track categories respectively matched.
Specifically, all the initial training audios are taken as target training samples. In addition, because each initial training audio includes reference sound parts respectively matched with a plurality of audio track categories, candidate sound parts are used to replace the corresponding reference sound parts in the initial training audios to generate new target training samples. The above track categories are determined based on the categories of human voice and musical instruments; for example, the track categories include human voice as well as drums, bass, zither, flute, and the like.
According to the audio separation method provided by the application, sample expansion is performed on the initial training audios using the candidate sound parts respectively matched with the plurality of audio track categories, which increases the number of target training samples for model training and thereby improves the training effect on the target separation model and the stability of the trained model. Therefore, after the audio to be separated is obtained, the trained target separation model separates it with higher stability and accuracy.
Referring to fig. 2, fig. 2 is a flow chart of an embodiment of constructing a target training sample. Specifically, the construction process of the target training sample comprises the following steps:
S201, constructing a candidate database, wherein the candidate database comprises a plurality of candidate sound parts with different track categories matched respectively.
In one embodiment, a plurality of track categories are predetermined. For each track category, a corresponding plurality of candidate vocal parts is acquired. And constructing a candidate database according to the candidate vocal parts matched with each track category.
In one implementation, when the track class corresponds to the instrument, the step of acquiring the corresponding candidate sound part comprises acquiring corresponding performance audio through an open source database according to the name of the instrument, and taking the performance audio as the candidate sound part. Or the playing audio of the corresponding musical instrument is acquired on site, and the acquired playing audio is used as a corresponding candidate sound part.
When the track category is human voice, the step of acquiring the corresponding candidate sound parts includes: acquiring a plurality of target texts, having relevant personnel read the target texts aloud, and taking the recorded reading audio as the corresponding candidate sound parts. Alternatively, a plurality of target texts are acquired and converted into audio through a speech conversion model, and that audio is taken as the corresponding candidate sound parts.
S202, acquiring a plurality of initial training audios, and taking at least one reference sound part in the initial training audios as a target sound part.
In one embodiment, a plurality of initial training audio is obtained, each initial training audio including a reference sound portion that matches a respective one of a plurality of audio track categories.
Further, for each initial training audio, at least one of all its corresponding reference sound parts is selected as a target sound part.
In a specific application scenario, the initial training audio is obtained through an open source database.
In another embodiment, a predetermined number threshold is obtained, and for each initial training audio, the number of target sound parts selected from all its reference sound parts is kept smaller than or equal to the number threshold. This helps avoid replacing too many reference sound parts in the initial training audio, which would weaken the correlation between the different sound parts. For example, in response to each initial training audio including four different track categories, the number threshold is set to 3.
And S203, acquiring the candidate sound parts matched with the first sound track category from a candidate database based on the first sound track category matched with the target sound part.
In one embodiment, the track category matched with the target sound part is taken as the first track category, and any candidate sound part matched with the first track category is selected from the candidate database.
In a specific application scenario, when the first track category matched with the selected target sound part is "drum", a candidate sound part matched with "drum" is selected from the candidate database. Or when the first track category matched with the target sound part comprises 'drum' and 'human voice', selecting the candidate sound part matched with 'drum' and the candidate sound part matched with 'human voice' from the candidate database.
S204, replacing the target vocal part by utilizing the candidate vocal part matched with the first sound track category to obtain a target training sample.
In one embodiment, the target training sample is obtained by replacing the corresponding target sound part with the candidate sound part which is selected from the candidate database and matches with the first sound track category.
In a specific application scenario, when the first track category is "drum", the candidate sound part matched with "drum" is used to replace the target sound part matched with that category, and the replaced initial training audio is taken as a target training sample.
According to the scheme, the candidate database is constructed in advance, so that after a plurality of initial training audios are acquired, at least one reference vocal part in the initial training audios is replaced by the candidate vocal part in the candidate database, and the number of training samples is increased.
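To make this expansion concrete, the following is a minimal Python sketch of steps S201 to S204; the data layout (sound parts stored as 1-D numpy arrays keyed by track category) and all names are assumptions of this sketch, not part of the disclosure:

```python
import random
import numpy as np

def expand_sample(stems, candidate_db, max_replace=3):
    """Build one expanded target training sample from an initial training audio.

    stems: dict mapping a track category (e.g. "drums", "vocals") to its
        reference sound part (1-D numpy waveform) in the initial audio.
    candidate_db: dict mapping a track category to a list of candidate
        sound parts (the candidate database of step S201).
    max_replace: number threshold bounding how many reference parts may be
        replaced, so the remaining parts stay correlated (step S202).
    """
    # Select at most `max_replace` target sound parts (step S202).
    k = random.randint(1, min(max_replace, len(stems)))
    targets = random.sample(list(stems), k=k)
    new_stems = dict(stems)
    for category in targets:
        # Pick a candidate matched with the same (first) track category (S203/S204).
        if candidate_db.get(category):
            new_stems[category] = random.choice(candidate_db[category])
    # Crop to a common length and mix; the stems double as separation labels.
    length = min(len(s) for s in new_stems.values())
    mixture = np.sum([s[:length] for s in new_stems.values()], axis=0)
    return mixture, new_stems
```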
Referring to fig. 3, fig. 3 is a flowchart of step S204 in fig. 2 according to another embodiment. Specifically, the implementation procedure of step S204 includes:
S301, adjusting parameter information corresponding to the candidate sound part matched with the first track category to obtain an adjusted sound part, where the parameter information includes at least one of the playing speed, audio amplitude, audio sampling rate and audio format of the candidate sound part.
In one embodiment, in response to determining the candidate sound part matched with the first track category, the corresponding parameter information is acquired and adjusted to obtain the adjusted sound part.
In an implementation scenario, the parameter information includes at least one of the playing speed, audio amplitude, audio sampling rate and audio format of the corresponding candidate sound part.
In a specific application scenario, when the parameter information includes the playing speed, audio amplitude, audio sampling rate and audio format of the candidate sound part, the adjustment includes: changing the playing speed of the candidate sound part; adjusting the audio amplitude to change its playing loudness; changing the audio sampling rate; and converting the audio format, for example from flac to mp3.
Alternatively, in other embodiments, only part of the parameter information may be adjusted. For example, when the parameter information includes the playing speed, audio amplitude, audio sampling rate and audio format of the candidate sound part, only the audio amplitude is adjusted.
S302, replacing the corresponding target sound part with the adjusted sound part to obtain a target training sample.
In one embodiment, the target training sample is obtained by replacing the corresponding target sound part with the adjusted sound part.
In this scheme, adjusting the parameter information of the corresponding candidate sound parts improves the flexibility of acquiring target training samples.
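A minimal sketch of the adjustment in step S301, assuming waveforms as numpy arrays; speed and sampling-rate changes are done here by naive linear interpolation as a stand-in for a proper resampler, and file-format conversion is omitted:

```python
import numpy as np

def adjust_candidate(stem, sr, gain=1.0, speed=1.0, target_sr=None):
    """Adjust a candidate sound part before it replaces the target sound part.

    gain scales the audio amplitude (playing loudness), speed changes the
    playing speed, and target_sr changes the audio sampling rate.
    """
    out = stem * gain
    if speed != 1.0:
        # Resample the time axis: speed > 1 plays faster (shorter signal).
        idx = np.arange(0, len(out) - 1, speed)
        out = np.interp(idx, np.arange(len(out)), out)
    if target_sr is not None and target_sr != sr:
        n = int(round(len(out) * target_sr / sr))
        idx = np.linspace(0, len(out) - 1, n)
        out = np.interp(idx, np.arange(len(out)), out)
        sr = target_sr
    return out, sr
```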
In yet another embodiment, after constructing the candidate database, the step of constructing the target training sample may further include determining a plurality of second track categories, and selecting a candidate vocal part matching each second track category from the candidate database. And acquiring a target training sample based on the candidate vocal part matched with the second sound track category.
Specifically, a plurality of track categories corresponding to the candidate database are randomly selected as the second track categories. For each second track category, any matched candidate sound part is selected, and all the selected candidate sound parts are fused to obtain a target training sample.
It should be noted that, to improve the degree of fusion between different candidate sound parts, before all the candidate sound parts matched with the second track categories are fused, their playing speeds are adjusted so that they all require the same playing duration. Alternatively, after a matched candidate sound part is selected for each second track category, the candidate sound parts are cropped so that those matched with each second track category have the same length.
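A short sketch of this fusion-based construction, under the same assumed data layout as the sketch above (candidate parts as numpy arrays), cropping all selected parts to a common length before mixing:

```python
import random
import numpy as np

def fuse_candidates(candidate_db, second_categories):
    """Fuse one candidate sound part per second track category into a sample."""
    stems = {c: random.choice(candidate_db[c]) for c in second_categories}
    length = min(len(s) for s in stems.values())    # align lengths by cropping
    stems = {c: s[:length] for c, s in stems.items()}
    mixture = np.sum(list(stems.values()), axis=0)  # fused target training sample
    return mixture, stems
```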
Referring to fig. 4, fig. 4 is a flowchart of step S102 in fig. 1 according to another embodiment. Specifically, the implementation procedure of step S102 includes:
s401, inputting the audio to be separated into a feature extraction network in the target separation model to obtain initial frequency domain features corresponding to the audio to be separated.
In an embodiment, the audio to be separated is input to a feature extraction network in the target separation model, so that feature extraction is performed on the audio to be separated by using the feature extraction network, and extraction features corresponding to the initial frequency domain dimension are obtained.
Further, to reduce the computational cost of subsequent processing, the extracted features are compressed in the frequency-domain dimension to obtain initial frequency-domain features corresponding to the first reference dimension.
In a specific application scenario, the sampling rate of the audio to be separated is 48 kHz, and feature extraction is performed on it using a Hann window with a frame length of 2048 and a frame shift of 1024 to obtain the extracted features. Dimension compression is then performed on the extracted features to obtain the initial frequency-domain features. The calculation formulas are as follows:
$$Y(T,F) = \mathrm{FFT}_{2048}\big(x(t)\big), \quad F \in (0, 1024)$$
$$Y'(T,F) = \log\Big(\mathrm{FtoBark}\big(|Y(T,F)|^{2}\big)\Big)$$
where $Y(T,F)$ denotes the extracted feature, $x(t)$ the audio to be separated, $T$ the time-frame index, $F$ the frequency-point index, and $Y'(T,F)$ the initial frequency-domain feature, whose first reference dimension is 128.
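For illustration, a sketch of this feature-extraction step using scipy; the exact FtoBark grouping is not detailed here, so a uniform bins-per-band averaging is assumed as a stand-in:

```python
import numpy as np
from scipy.signal import stft

def initial_frequency_feature(x, sr=48000, n_bands=128):
    """Compute Y'(T, F): 2048-point Hann STFT with a 1024-sample frame shift,
    then log power compressed from 1024 frequency points to n_bands bands."""
    _, _, Z = stft(x, fs=sr, window='hann', nperseg=2048, noverlap=1024)
    Y = Z[:1024, :]                              # keep F in (0, 1024)
    power = np.abs(Y) ** 2
    # Stand-in for FtoBark: average groups of 1024 // n_bands adjacent bins.
    bark = power.reshape(n_bands, 1024 // n_bands, -1).mean(axis=1)
    return np.log(bark + 1e-8)                   # first reference dimension: 128
```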
S402, inputting the initial frequency domain features into a first processing network in the target separation model to obtain a real number mask matched with the initial frequency domain features.
In one embodiment, the initial frequency domain features are input to a first processing network in the target separation model to extract time domain and frequency domain related features from the initial frequency domain features using the first processing network, and a real mask for audio separation is obtained after decoding.
S403, based on the real number mask and the initial frequency domain feature, obtaining the enhancement feature corresponding to the initial frequency domain feature.
In an embodiment, the obtained real number mask is dimension-lifted, and the enhanced feature is obtained according to the real number mask after dimension lifting and the extracted feature obtained in step S401.
Specifically, the real number mask with the lifted dimension is multiplied by the extracted feature, so as to obtain the enhanced feature.
S404, inputting the enhanced features into a second processing network in the target separation model to obtain a complex mask.
In one embodiment, the obtained enhanced features are input to a second processing network in the target separation model, so that the second processing network performs further feature extraction on the enhanced features to obtain a complex mask.
S405, based on the complex mask and the enhancement features, acquiring a first sub-audio and a second sub-audio corresponding to the audio to be separated.
In one embodiment, the complex mask and the enhanced feature are multiplied to obtain the target feature, and an inverse Fourier transform is performed on the target feature to obtain the first sub-audio.
Further, the audio to be separated and the first sub-audio are utilized to obtain a second sub-audio, namely, the part of the first sub-audio is removed from the audio to be separated to serve as the second sub-audio.
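A sketch of this reconstruction step, assuming the masked target feature is a complex spectrogram in the same STFT layout as the extraction sketch above:

```python
import numpy as np
from scipy.signal import istft

def recover_sub_audios(x, target_feature, sr=48000):
    """First sub-audio via inverse transform; second as the mixture residual."""
    _, first = istft(target_feature, fs=sr, window='hann',
                     nperseg=2048, noverlap=1024)
    first = first[:len(x)]
    second = x - first        # remove the first sub-audio from the mixture
    return first, second
```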
Referring to fig. 5, fig. 5 is a flowchart of step S402 in fig. 4 according to another embodiment. Specifically, the implementation procedure of step S402 includes:
S501, dividing the initial frequency domain feature into a plurality of frequency domain sub-features.
In an embodiment, the initial frequency domain feature is divided according to a first reference dimension corresponding to the frequency domain point index in the initial frequency domain feature, and the dimension of the feature obtained after division is lifted to a second reference dimension to obtain a plurality of corresponding frequency domain sub-features. The feature obtained after division is subjected to dimension lifting, so that the expression capability of the feature in the subsequent process is improved.
Specifically, referring to fig. 6, fig. 6 is a schematic structural diagram of a first processing network according to an embodiment. And inputting the initial frequency domain characteristics into a frequency band separation layer in the first processing network, so that the frequency band separation layer is utilized to divide and dimension up the initial frequency domain characteristics, and a plurality of corresponding frequency domain sub-characteristics are obtained.
In an implementation scenario, in response to the initial frequency-domain feature having a first reference dimension of 128 along the frequency-point index, every four adjacent frequency-domain points are taken as one sub-band, dividing the initial frequency-domain feature into 32 sub-band features, each of dimension 4. The feature dimension of each sub-band feature is then lifted to the second reference dimension, which is 64, giving the frequency-domain sub-feature corresponding to each sub-band feature.
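A minimal PyTorch sketch of this band-separation layer; whether the up-projection is shared across sub-bands is not specified, so a single shared linear layer is assumed here:

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Split 128 frequency points into 32 sub-bands of 4 adjacent bins and
    lift each sub-band to the second reference dimension (64)."""
    def __init__(self, band=4, dim=64):
        super().__init__()
        self.band = band
        self.proj = nn.Linear(band, dim)

    def forward(self, feat):                               # feat: (batch, time, 128)
        b, t, f = feat.shape
        sub = feat.reshape(b, t, f // self.band, self.band)  # (b, t, 32, 4)
        return self.proj(sub)                              # (b, t, 32, 64)
```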
S502, sequentially inputting the frequency domain sub-features into a first coding layer in a first processing network to obtain first coding features corresponding to the frequency domain sub-features.
In an embodiment, the plurality of frequency domain sub-features are sequentially input to a first coding layer in the first processing network according to a time sequence order, so that the frequency domain sub-features are coded by the first coding layer to obtain a first coding feature.
Specifically, as shown in FIG. 6, the first coding layer consists of a normalization layer, a bidirectional GRU (Gated Recurrent Unit) network, and a fully connected layer. The plurality of frequency-domain sub-features are sequentially input into the first coding layer so that it performs further feature extraction on them, strengthening the time-domain- and frequency-domain-related features in the frequency-domain sub-features and yielding first coding features with stronger expressive power, which improves the subsequent audio separation effect.
And S503, inputting the first coding features into a first decoding layer in a first processing network to obtain first decoding features corresponding to each first coding feature.
In an embodiment, the first coding feature is input to a first decoding layer in the first processing network, so that the first decoding layer is utilized to perform dimension reduction processing on the first coding feature to obtain a corresponding first decoding feature.
Specifically, the first decoding layer consists of a normalization layer and a fully connected layer; for each input first coding feature, the first decoding layer performs decoding and dimension reduction to obtain the first decoding feature, whose dimension corresponds to the dimension of each sub-band feature.
S504, based on all the first decoding features, a real mask is obtained.
In an embodiment, the first decoding features obtained in step S503 are spliced in order along the frequency domain to obtain a real mask of the first reference dimension.
In another embodiment, to save computing resources, step S501 may omit the dimension lifting of the features obtained after division.
According to the scheme, the real number mask is obtained by utilizing the first processing network, so that the enhancement features matched with the audio to be separated can be obtained through subsequent extraction, and the accuracy of audio separation is improved.
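The following PyTorch sketch mirrors the described first coding layer (normalization, bidirectional GRU, fully connected layer) and first decoding layer (normalization, fully connected layer); the sequence axis over which the GRU runs and all sizes are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class FirstCodingLayer(nn.Module):
    """Normalization -> bidirectional GRU -> fully connected layer."""
    def __init__(self, dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x):                 # x: (batch, 32 sub-bands, 64)
        h, _ = self.gru(self.norm(x))
        return self.fc(h)                 # first coding feature, (batch, 32, 64)

class FirstDecodingLayer(nn.Module):
    """Normalization -> fully connected layer, reducing each sub-band back to
    its 4-bin width; flattening the 32 outputs yields the 128-point real mask."""
    def __init__(self, dim=64, band=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, band)

    def forward(self, x):                 # (batch, 32, 64)
        out = self.fc(self.norm(x))       # (batch, 32, 4)
        return out.flatten(-2)            # real mask E(T, F), 128 points
```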
In another embodiment, in response to the real mask being obtained as in the foregoing embodiment, the implementation of step S403 includes: to make the dimension of the real mask consistent with that of the extracted feature obtained in step S401, performing dimension lifting on the real mask, i.e., the dimension of the lifted real mask matches the initial frequency-domain dimension of the extracted feature.
Further, the real number mask with the lifted dimension is multiplied by the extracted feature, so that the enhanced feature is obtained. The specific calculation formula of the enhancement features is as follows:
$$Y(T,F) = Y_{\mathrm{real}}(T,F) + j \cdot Y_{\mathrm{imag}}(T,F)$$
$$Y_{1}(T,F) = \big[Y_{\mathrm{real}}(T,F) + j \cdot Y_{\mathrm{imag}}(T,F)\big] \cdot E(T,F)$$
where $Y(T,F)$ denotes the extracted feature, $E(T,F)$ the dimension-lifted real mask, and $Y_{1}(T,F)$ the enhanced feature.
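As a small numerical sketch of this step, each of the 128 mask bands is simply repeated over its frequency bins, a stand-in for the unspecified inverse Bark expansion:

```python
import numpy as np

def enhance(extracted, real_mask):
    """Y1(T,F) = [Y_real + j*Y_imag] * E: lift the 128-band real mask to the
    1024-bin layout of the complex extracted feature, then multiply."""
    repeat = extracted.shape[0] // real_mask.shape[0]   # e.g. 1024 // 128 = 8
    E = np.repeat(real_mask, repeat, axis=0)
    return extracted * E
```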
Referring to fig. 7, fig. 7 is a flowchart of step S404 in fig. 4 according to another embodiment. Specifically, the implementation procedure of step S404 includes:
S601, acquiring real part features and imaginary part features corresponding to the enhancement features, and obtaining fusion features based on the real part features and the imaginary part features.
In one embodiment, the enhancement features obtained in step S403 are converted into complex form, i.e. include corresponding real and imaginary features.
Further, the real part feature and the imaginary part feature are spliced to obtain a fusion feature. The specific calculation formula of the fusion characteristic is as follows:
$$Y_{1}(T,F) = Y_{1\text{-}\mathrm{real}}(T,F) + j \cdot Y_{1\text{-}\mathrm{imag}}(T,F)$$
$$Y_{1}'(T,F) = \mathrm{cat}\big[Y_{1\text{-}\mathrm{real}}(T,F),\, Y_{1\text{-}\mathrm{imag}}(T,F)\big]$$
where $Y_{1}(T,F)$ denotes the enhanced feature, $Y_{1\text{-}\mathrm{real}}(T,F)$ its real part, $Y_{1\text{-}\mathrm{imag}}(T,F)$ its imaginary part, $Y_{1}'(T,F)$ the fusion feature, and $\mathrm{cat}$ the concatenation operation.
S602, obtaining the complex mask based on the fusion features.
In one embodiment, the second processing network is used for extracting features of the fusion features, and dimension reduction processing is performed on the fusion features after the feature extraction to obtain the target real number mask. The target real mask is converted into complex form to obtain complex mask.
In an implementation scenario, the second processing network includes a GRU network and a fully connected layer. During the forward processing, the GRU network and the fully connected layer extract the fusion features so that information in the time dimension is incorporated; the extracted features then undergo dimension reduction to obtain the target real mask, which is converted into complex form to obtain the complex mask. The specific calculation formula of the complex mask is as follows:
$$E'(T,F) = \mathrm{cat}\big[E_{\mathrm{real}}(T,F),\, E_{\mathrm{imag}}(T,F)\big]$$
$$E_{1}(T,F) = E_{\mathrm{real}}(T,F) + j \cdot E_{\mathrm{imag}}(T,F)$$
where $E'(T,F)$ denotes the target real mask and $E_{1}(T,F)$ the complex mask.
In another embodiment, the structure of the second processing network may refer to another neural network structure capable of performing feature extraction.
In the above scheme, the second processing network is used to obtain a complex mask matched with the format of the enhanced feature, so that the separated first sub-audio and second sub-audio are obtained by multiplying the enhanced feature by the complex mask.
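A PyTorch sketch of the described GRU-plus-fully-connected second processing network; hidden sizes and the feature layout are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SecondProcessingNetwork(nn.Module):
    """Concatenate real/imaginary parts of the enhanced feature (Y1'),
    extract along time with a GRU, reduce to the target real mask E', and
    reassemble it into the complex mask E1."""
    def __init__(self, n_freq=128, hidden=256):
        super().__init__()
        self.gru = nn.GRU(2 * n_freq, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2 * n_freq)

    def forward(self, y1):                # y1: complex tensor (batch, time, n_freq)
        fused = torch.cat([y1.real, y1.imag], dim=-1)   # fusion feature Y1'(T, F)
        h, _ = self.gru(fused)
        e = self.fc(h)                    # target real mask E'(T, F)
        e_real, e_imag = e.chunk(2, dim=-1)
        return torch.complex(e_real, e_imag)            # complex mask E1(T, F)
```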
Referring to fig. 8, fig. 8 is a flowchart illustrating an embodiment of a training method of the object separation model according to the present application. The specific implementation process of the training method comprises the following steps:
S701, acquiring a plurality of first target training samples, inputting the first target training samples into an initial separation model to obtain a first separation result, and adjusting parameters of a first processing network in the initial separation model based on the first separation result to obtain a reference separation model.
In an embodiment, a plurality of first target training samples are obtained, the first target training samples are input into a constructed initial separation model, so that the initial separation model performs audio separation on the first target training samples, and a first separation result is obtained.
Specifically, the process of obtaining the first target training sample may refer to the process of obtaining the target training sample in the above-described corresponding embodiment. And determining a training label corresponding to the first target training sample according to the sound track category corresponding to each sound part in the first target training sample. For example, when the audio separation task is to separate accompaniment and human voice, a first reference sub-audio and a second reference sub-audio are determined according to the vocal part in the first target training sample, the first reference sub-audio is an accompaniment part corresponding to the first target training sample, and the second reference sub-audio is a human voice part corresponding to the first target training sample.
In addition, the specific structure of the initial separation model may refer to the structure of the target separation model mentioned in the above-described respective embodiments, and the specific process of audio separation by the initial separation model may refer to the audio separation method mentioned in any of the above-described embodiments.
Further, the first separation result output by the initial separation model includes the enhanced features corresponding to the first target training sample. A first training loss is calculated from the first separation result and the corresponding training label, and the parameters of the first processing network in the initial separation model are adjusted according to the first training loss to obtain a preliminarily trained reference separation model.
In an implementation scenario, the first training loss is calculated by using the MSE loss function, and the specific calculation formula is as follows:
$$\mathrm{Loss}_{1} = \mathrm{MSE}\big(|Y_{a}(T,F)|,\, |Z_{a}(T,F)|\big)$$
where $\mathrm{Loss}_{1}$ denotes the first training loss, $Y_{a}(T,F)$ the enhanced feature corresponding to the first target training sample, and $Z_{a}(T,F)$ the Fourier transform of the first reference sub-audio. In addition, during the preliminary training of the initial separation model, the learning rate is set to 1e-3.
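In code, the first training loss is a magnitude-spectrogram MSE; a short PyTorch sketch (tensor layouts assumed):

```python
import torch.nn.functional as F

def first_training_loss(enhanced, reference_spec):
    """Loss_1 = MSE(|Ya(T,F)|, |Za(T,F)|), with complex-valued inputs."""
    return F.mse_loss(enhanced.abs(), reference_spec.abs())
```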
S702, acquiring a plurality of second target training samples, inputting the second target training samples into the reference separation model to obtain a second separation result, and adjusting parameters of at least the first processing network and the second processing network in the reference separation model based on the second separation result to obtain the trained target separation model.
In one embodiment, a plurality of second target training samples is obtained. The process of obtaining the second target training sample may refer to the process of obtaining the first target training sample in the above-described corresponding embodiment. And determining a training label corresponding to the second target training sample according to the sound track category corresponding to each sound part in the second target training sample. The training label corresponding to the second target training sample comprises a third reference sub-audio and a fourth reference sub-audio, wherein the third reference sub-audio corresponds to an accompaniment part in the second target training sample, and the fourth reference sub-audio corresponds to a voice part in the second target training sample.
Further, the second target training samples are input into the preliminarily trained reference separation model to obtain a second separation result. The second separation result includes the enhanced features generated during separation for the second target training sample and the third predicted sub-audio finally output by the reference separation model, where the third predicted sub-audio is the accompaniment part of the second target training sample as predicted by the reference separation model.
Further, according to the enhancement features, the third reference sub-audio and the third prediction sub-audio corresponding to the second target training sample, a second training loss is calculated. And adjusting parameters of the first processing network and the second processing network in the reference separation model at least by utilizing the second training loss until the model convergence condition is met, so as to obtain the trained target separation model. The specific calculation formula of the second training loss is as follows:
$$l_{1} = \mathrm{MSE}\big(|Y_{b}(T,F)|,\, |Z_{b}(T,F)|\big)$$
$$l_{2} = \mathrm{SISDR}\big(y(t),\, z(t)\big)$$
$$\mathrm{Loss}_{2} = l_{1} + \alpha \cdot l_{2}$$
where $l_{1}$ denotes the first sub-loss, $Y_{b}(T,F)$ the enhanced feature corresponding to the second target training sample, $Z_{b}(T,F)$ the Fourier transform of the third reference sub-audio, $l_{2}$ the second sub-loss, $y(t)$ the third predicted sub-audio, $z(t)$ the third reference sub-audio, $\mathrm{Loss}_{2}$ the second training loss, $\mathrm{SISDR}$ the scale-invariant SDR loss function, and $\alpha$ the weight of the second sub-loss; the specific weight value may be set by a skilled person or derived through repeated experiments. In addition, during training of the reference separation model, the learning rate is set to 5e-4.
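A sketch of the second training loss, assuming SISDR denotes the usual negative scale-invariant SDR loss on time-domain signals; the weight alpha and all shapes are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(pred, target, eps=1e-8):
    """Negative scale-invariant SDR between predicted and reference sub-audio."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    scale = (pred * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s = scale * target                    # projection of pred onto target
    e = pred - s                          # residual error
    sisdr = 10 * torch.log10((s.pow(2).sum(-1) + eps) / (e.pow(2).sum(-1) + eps))
    return -sisdr.mean()

def second_training_loss(enhanced, ref_spec, pred_audio, ref_audio, alpha=0.1):
    """Loss_2 = l1 + alpha * l2 (alpha = 0.1 is illustrative only)."""
    l1 = F.mse_loss(enhanced.abs(), ref_spec.abs())   # spectral sub-loss
    l2 = si_sdr_loss(pred_audio, ref_audio)           # time-domain sub-loss
    return l1 + alpha * l2
```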
In this scheme, after the initial separation model is constructed, training proceeds in stages until the target separation model is obtained, which optimizes the training effect and improves the audio separation performance of the target separation model.
Of course, in other embodiments, to improve training efficiency, the model training process may also include: obtaining target training samples, inputting them into the initial separation model to obtain a separation result, and calculating a training loss based on the separation result; the training loss is then used to adjust parameters of at least the first processing network and the second processing network in the initial separation model to obtain the trained target separation model. The specific calculation of this training loss may refer to the calculation of the second training loss in step S702.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an audio separation system according to an embodiment of the application. Specifically, the audio separation system comprises an acquisition module 10 and a separation module 20 coupled to each other.
Specifically, the acquisition module 10 is configured to acquire audio to be separated.
The separation module 20 is configured to input the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, where the target separation model is trained with a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories, the initial training audios include reference sound parts respectively matched with the audio track categories, and the candidate sound parts are used to replace at least one reference sound part in the initial training audios.
In an embodiment, referring still to FIG. 9, the audio separation system provided by the present application further includes a sample construction module 30 coupled to the separation module 20. The sample construction module 30 constructs the target training samples by: constructing a candidate database, where the candidate database includes a plurality of candidate sound parts respectively matched with different track categories; acquiring a plurality of initial training audios and taking at least one reference sound part in the initial training audios as a target sound part; acquiring, based on the first track category matched with the target sound part, the candidate sound part matched with the first track category from the candidate database; and replacing the target sound part with the candidate sound part matched with the first track category to obtain the target training sample.
In one embodiment, the sample construction module 30 replaces the target sound part with the candidate sound part matched with the first audio track category to obtain a target training sample, and the sample construction module comprises adjusting parameter information corresponding to the candidate sound part matched with the first audio track category to obtain an adjusted sound part, wherein the parameter information comprises at least one of a playing rate, an audio amplitude, an audio sampling rate and an audio format of the candidate sound part, and replaces the corresponding target sound part with the adjusted sound part to obtain the target training sample.
In one embodiment, the sample construction module 30 further comprises a step of determining a plurality of second audio track categories, selecting candidate sound parts matching each second audio track category from the candidate database, and acquiring the target training sample based on the candidate sound parts matching the second audio track category.
In one embodiment, the separation module 20 inputs the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, and the method comprises the steps of inputting the audio to be separated into a feature extraction network in the target separation model to obtain initial frequency domain features corresponding to the audio to be separated, inputting the initial frequency domain features into a first processing network in the target separation model to obtain a real mask matched with the initial frequency domain features, obtaining enhancement features corresponding to the initial frequency domain features based on the real mask and the initial frequency domain features, inputting the enhancement features into a second processing network in the target separation model to obtain a complex mask, and obtaining the first sub-audio and the second sub-audio corresponding to the audio to be separated based on the complex mask and the enhancement features.
In one embodiment, the separation module 20 inputs the initial frequency domain feature to a first processing network in the target separation model to obtain a real number mask matched with the initial frequency domain feature, and the separation module includes dividing the initial frequency domain feature into a plurality of frequency domain sub-features, sequentially inputting the plurality of frequency domain sub-features to a first coding layer in the first processing network to obtain first coding features corresponding to the frequency domain sub-features, inputting the first coding features to a first decoding layer in the first processing network to obtain first decoding features corresponding to each first coding feature, and obtaining the real number mask based on all the first decoding features.
In one embodiment, the separation module 20 inputs the enhanced feature to a second processing network in the object separation model to obtain a complex mask, including obtaining real and imaginary features corresponding to the enhanced feature, obtaining a fused feature based on the real and imaginary features, and obtaining the complex mask based on the fused feature.
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the application. The electronic device includes a memory 40 and a processor 50 coupled to each other. The memory 40 stores program instructions, and the processor 50 is configured to execute the program instructions to implement the method of any of the embodiments described above. In particular, the electronic device includes, but is not limited to, a desktop computer, a notebook computer, a tablet computer, a server, etc., without limitation. Further, the processor 50 may also be referred to as a CPU (Central Processing Unit). The processor 50 may be an integrated circuit chip having signal processing capabilities. The processor 50 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 50 may be jointly implemented by integrated circuit chips.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of a computer readable storage medium 60 according to the present application, where the computer readable storage medium 60 stores program instructions 70 that can be executed by a processor, and the program instructions 70 implement the method mentioned in any of the above embodiments when executed by the processor.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and the drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the present application.
Claims (9)
1. An audio separation method, comprising:
Acquiring audio to be separated;
The audio to be separated is input into a trained target separation model to obtain a first sub-audio and a second sub-audio, wherein the target separation model is obtained by training a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of audio track categories, the initial training audios comprise reference sound parts respectively matched with the audio track categories, and the candidate sound parts are used for replacing at least one of the reference sound parts in the initial training audios;
The target training sample constructing step comprises: constructing a candidate database, wherein the candidate database comprises a plurality of candidate sound parts respectively matched with different track categories; acquiring a plurality of initial training audios, and taking at least one reference sound part in the initial training audios as a target sound part; acquiring the candidate sound part matched with a first track category from the candidate database based on the first track category matched with the target sound part; and replacing the target sound part with the candidate sound part matched with the first track category to obtain the target training sample.
2. The method of claim 1, wherein replacing the target sound part with the candidate sound part matched with the first track category to obtain the target training sample comprises:
Adjusting parameter information corresponding to the candidate sound part matched with the first sound track category to obtain an adjusted sound part, wherein the parameter information comprises at least one of play rate, audio amplitude, audio sampling rate and audio format of the candidate sound part;
And replacing the corresponding target sound part with the adjusted sound part to obtain the target training sample.
3. The method of claim 1, wherein the step of constructing the target training sample further comprises:
Determining a plurality of second track categories, and selecting the candidate sound parts matched with each second track category from the candidate database;
and acquiring the target training sample based on the candidate sound parts matched with the second track categories.
4. The method of claim 1, wherein inputting the audio to be separated into a trained target separation model results in a first sub-audio and a second sub-audio, comprising:
inputting the audio to be separated into a feature extraction network in the target separation model to obtain initial frequency domain features corresponding to the audio to be separated;
Inputting the initial frequency domain features into a first processing network in the target separation model to obtain a real number mask matched with the initial frequency domain features;
based on the real number mask and the initial frequency domain feature, obtaining an enhancement feature corresponding to the initial frequency domain feature;
inputting the enhanced features to a second processing network in the target separation model to obtain a complex mask;
And acquiring the first sub-audio and the second sub-audio corresponding to the audio to be separated based on the complex mask and the enhancement features.
5. The method of claim 4, wherein said inputting the initial frequency domain features into the first processing network in the target separation model to obtain the real mask matched with the initial frequency domain features comprises:
dividing the initial frequency domain feature into a plurality of frequency domain sub-features;
Sequentially inputting the frequency domain sub-features into a first coding layer in the first processing network to obtain first coding features corresponding to the frequency domain sub-features;
inputting the first coding features into a first decoding layer in the first processing network to obtain first decoding features corresponding to each first coding feature;
The real mask is obtained based on all of the first decoding features.
6. The method of claim 4, wherein said inputting the enhanced features to the second processing network in the target separation model to obtain the complex mask comprises:
Acquiring real part characteristics and imaginary part characteristics corresponding to the enhancement characteristics, and acquiring fusion characteristics based on the real part characteristics and the imaginary part characteristics;
and obtaining the complex mask based on the fusion feature.
7. An audio separation system, comprising:
an acquisition module, used for acquiring the audio to be separated;
a separation module, used for inputting the audio to be separated into a trained target separation model to obtain a first sub-audio and a second sub-audio, wherein the target separation model is obtained by training with a plurality of target training samples, the target training samples are determined based on a plurality of initial training audios and candidate sound parts respectively matched with a plurality of sound track categories, the initial training audios comprise reference sound parts respectively matched with the sound track categories, and the candidate sound parts are used for replacing at least one of the reference sound parts in the initial training audios;
wherein the step of constructing the target training samples comprises: constructing a candidate database, wherein the candidate database comprises a plurality of candidate sound parts respectively matched with different sound track categories; acquiring a plurality of initial training audios, and taking at least one reference sound part in the initial training audios as a target sound part; acquiring, from the candidate database, the candidate sound part matched with a first sound track category based on the first sound track category matched with the target sound part; and replacing the target sound part with the candidate sound part matched with the first sound track category to obtain the target training sample.
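As a complement to the earlier sketches, this sample-construction step might be indexed and applied as below; the list-of-pairs input format and the random selection are illustrative assumptions.

```python
import random
import numpy as np

def build_candidate_db(stems_by_track: list) -> dict:
    """Index candidate sound parts by their sound track category."""
    db: dict = {}
    for track_category, stem in stems_by_track:
        db.setdefault(track_category, []).append(stem)
    return db

def replace_reference_part(initial_stems: dict, target_track: str,
                           candidate_db: dict) -> np.ndarray:
    """Swap the target sound part for a same-category candidate, then remix."""
    stems = dict(initial_stems)
    stems[target_track] = random.choice(candidate_db[target_track])
    n = min(len(s) for s in stems.values())   # align stem lengths
    return sum(s[:n] for s in stems.values()).astype(np.float32)
```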
8. An electronic device comprising a memory and a processor coupled to each other, the memory having program instructions stored therein, the processor configured to execute the program instructions to implement the audio separation method of any of claims 1-6.
9. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the audio separation method according to any of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411781377.5A (CN119864047B) | 2024-12-05 | 2024-12-05 | Audio separation method, system and related device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119864047A (en) | 2025-04-22 |
| CN119864047B (en) | 2025-09-23 |
Family
ID=95390266
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411781377.5A (CN119864047B, Active) | 2024-12-05 | 2024-12-05 | Audio separation method, system and related device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119864047B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120564744B (en) * | 2025-07-30 | 2025-11-18 | 科大讯飞股份有限公司 | Speaker voice enhancement method, speaker voice enhancement device, electronic equipment and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111724807A (en) * | 2020-08-05 | 2020-09-29 | 字节跳动有限公司 | Audio separation method, apparatus, electronic device, and computer-readable storage medium |
| CN118486298A (en) * | 2023-02-13 | 2024-08-13 | 北京欧珀通信有限公司 | Training data determining method and device, storage medium and electronic equipment |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107680582B (en) * | 2017-07-28 | 2021-03-26 | 平安科技(深圳)有限公司 | Acoustic model training method, voice recognition method, device, equipment and medium |
| CN109346056B (en) * | 2018-09-20 | 2021-06-11 | 中国科学院自动化研究所 | Speech synthesis method and device based on depth measurement network |
| JP7103957B2 (en) * | 2019-01-09 | 2022-07-20 | 株式会社Nttドコモ | Data generator |
| CN111246285A (en) * | 2020-03-24 | 2020-06-05 | 北京奇艺世纪科技有限公司 | Method for separating sound in comment video and method and device for adjusting volume |
| CN114842827B (en) * | 2022-04-28 | 2025-09-12 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, electronic device and readable storage medium |
| CN116186323A (en) * | 2022-11-23 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Audio matching method, device, equipment and storage medium |
- 2024-12-05: CN application CN202411781377.5A granted as patent CN119864047B (status: Active)
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |