Disclosure of Invention
The present disclosure provides an audio processing method, apparatus, medium, computing device, and program product to meet more mixing requirements.
In a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including:
Acquiring tone characteristics of the dry audio;
For each of N preset original audios, determining the tone similarity between the dry audio and the original audio according to the tone characteristics of the original audio and the tone characteristics of the dry audio, and selecting M original audios from the N original audios as target audios according to the tone similarity, wherein N and M are positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters;
For each target audio of the M target audios, carrying out mixing processing on the dry audio according to the mixing parameters corresponding to the target audio to obtain a wet audio corresponding to the target audio;
And combining the M wet audios according to the tone similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio.
In another embodiment of the present disclosure, combining the M wet audios according to the tone similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio includes:
for each target audio, determining a weight value corresponding to the target audio according to the tone similarity between the target audio and the dry audio;
And carrying out weighted combination on the M wet audios according to the weight value of each target audio to obtain the mixed audio corresponding to the dry audio.
In another embodiment of the present disclosure, determining a weight value corresponding to the target audio according to a timbre similarity between the target audio and the dry audio includes:
And mapping the tone similarity corresponding to the target audio into a weight value in a preset interval.
In another embodiment of the present disclosure, the mixing processing of the dry audio according to the mixing parameters corresponding to the target audio includes:
Adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio, and/or adjusting the loudness-related parameters in the mixing parameters according to the loudness of the dry audio;
and carrying out mixing processing on the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
In another embodiment of the present disclosure, the mixing parameters include a reverberation time, and the adjusting a speed-related parameter among the mixing parameters according to an audio speed of the dry audio includes:
if the audio speed of the dry audio is greater than or equal to a preset speed threshold, shortening the reverberation time;
and if the audio speed of the dry audio is smaller than the speed threshold value, increasing the reverberation time.
In another embodiment of the present disclosure, the mixing parameters include a compressor threshold, and the adjusting the parameters related to the loudness in the mixing parameters according to the loudness of the dry audio includes:
If the loudness of the dry audio is greater than or equal to a preset loudness threshold, increasing the compressor threshold;
and if the loudness of the dry audio is smaller than the loudness threshold, reducing the compressor threshold.
In another embodiment of the present disclosure, the mixing parameters include effector parameters, and the mixing parameters corresponding to the original audio are obtained by:
Adjusting the original audio through a preset effector, wherein the effector comprises one or more of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-segment exciter, a post-compressor, a multi-segment compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster;
if the adjustment result meets the preset condition, recording that the effector parameter corresponding to the adjustment result is the mixing parameter corresponding to the original audio.
In another embodiment of the present disclosure, the original audio is generated by a preset timbre model trained based on pre-labeled dry sound data.
In another embodiment of the present disclosure, acquiring timbre characteristics of a dry audio comprises:
and inputting the dry audio to a pre-trained deep learning model so that the deep learning model outputs tone characteristics corresponding to the dry audio, wherein the deep learning model is obtained by training according to a pre-labeled audio sample, and the audio sample is a Mel spectrum generated according to preset audio.
In another embodiment of the present disclosure, the audio processing method further includes:
acquiring energy characteristics of the mixed audio and masking characteristics of accompaniment audio corresponding to the mixed audio;
and adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy characteristic and the masking characteristic so as to obtain the mixed audio with balanced volume.
In another embodiment of the present disclosure, the audio processing method further includes:
acquiring the track loudness of the mixed audio;
And if the track loudness does not fall into the preset loudness interval, adjusting the mixed audio through a preset effector so that the track loudness of the adjusted mixed audio falls into the loudness interval.
In a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
The acquisition module is used for acquiring tone characteristics of the dry audio;
the similarity determining module is used for, for each of N preset original audios, determining the tone similarity between the dry audio and the original audio according to the tone characteristics of the original audio and the tone characteristics of the dry audio, and selecting M original audios from the N original audios as target audios according to the tone similarity, wherein N and M are positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters;
the wet sound processing module is used for, for each target audio of the M target audios, carrying out mixing processing on the dry audio according to the mixing parameters corresponding to the target audio to obtain a wet audio corresponding to the target audio;
And the mixing combination module is used for combining the M wet audios according to the tone similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio.
In a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method according to any of the first aspects.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising at least one processor;
and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor to cause the computing device to perform the method of any of the first aspects.
In a fifth aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any of the first aspects.
According to the audio processing method, apparatus, medium, computing device, and program product of the embodiments of the present disclosure, by acquiring the tone characteristics of the dry audio, the tone similarity between the dry audio and each original audio can be determined according to the tone characteristics of the dry audio and the tone characteristics of the original audio. M target audios can be selected from the N original audios according to the tone similarity, the mixing parameters corresponding to each target audio can be applied to the dry audio to obtain M wet audios, and the M wet audios can be combined based on the tone similarity to obtain the mixed audio of the dry audio. With this method, only N original audios need to be prepared in advance, with the mixing parameters of each original audio recorded after mixing, so that the mixing parameters of original audios with high tone similarity to the dry audio can be applied when mixing the dry audio. There is no need to repeatedly listen to the dry audio to tune the mixing parameters, which improves mixing efficiency, shortens mixing time, and meets more mixing requirements.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software.
According to embodiments of the present disclosure, an audio processing method, apparatus, medium, computing device, and program product are presented.
It should be noted that the user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of such data must comply with relevant laws, regulations, and standards, and corresponding operation entries are provided for the user to choose to authorize or refuse.
Furthermore, any number of elements in the figures is for illustration and not limitation, and any naming is used for distinction only and not for any limiting sense.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of the Invention
Mixing is a means of processing audio: audio effectors are used to process the vocals and accompaniment so that the voice sounds better, more harmonious, and better blended. During mixing, several audio effectors such as a compressor, an equalizer, and a reverberator are usually used; manual mixing requires a certain grasp of music theory and experience with audio effectors, and is therefore difficult.
However, experienced mixing engineers are few in number, and it is difficult to meet people's demand for mixing. In particular, with the development of AIGC (Artificial Intelligence Generated Content) and other technologies, the number of audio works sung by AI is increasing, and these works also need to be mixed to improve their quality. Therefore, there is a need for an audio processing method capable of automatic mixing to meet these mixing requirements.
Through research and experiments, the inventors found that a plurality of original audios with different timbres can be prepared in advance, each original audio can be manually mixed, and the mixing parameters corresponding to the final mixing result can be recorded. For a dry audio that needs to be mixed, several original audios with higher tone similarity to the dry audio can be selected according to the tone similarity between the dry audio and the original audios; the mixing parameters of the selected original audios can each be used to mix the dry audio, and the tone similarities can be mapped to weights for a weighted combination of the resulting mixed versions, yielding the final mixing result. In this way, only a limited number of pre-prepared original audios are manually mixed and their mixing parameters recorded, and the corresponding mixing parameters are applied to the dry audio based on its similarity to the original audios. Tedious manual mixing of the dry audio is therefore not needed, automatic mixing of the dry audio can be realized, mixing efficiency is improved, and more mixing requirements can be met. The dry audio may be unmixed vocal audio, which may be sung either by a human or by an AI.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Application scene overview
Referring first to fig. 1, fig. 1 is a schematic view of an application scenario provided in the present disclosure. As shown in fig. 1, a terminal 102 may communicate with a server 101 through a network. The data storage system may store data that the server 101 needs to process. The data storage system may be integrated on the server 101 or may be placed on a cloud or other network server. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 101 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
For a scenario in which a dry audio needs to be mixed, a plurality of original audios may be prepared in advance, the original audios may be manually mixed, mixing parameters may be recorded, the original audios and the mixing parameters may be stored in the server 101, the terminal 102 or the data storage system, and after the dry audio to be mixed is acquired, the server 101 or the terminal 102 may mix the dry audio based on the stored original audios and mixing parameters.
The application scenario mentioned above is only partially exemplified, and those skilled in the art may expand the application based on the audio processing procedure, which is not particularly limited by the embodiments of the present disclosure.
Exemplary method
An audio processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 6 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Fig. 2 is a flow chart of an audio processing method according to an embodiment of the disclosure. As shown in fig. 2, the method may include:
in step S201, the tone characteristics of the dry audio are acquired.
The audio processing method of the embodiment of the present disclosure may be applied to the server 101 or the terminal 102 in fig. 1.
Wherein the timbre characteristic may be a parameter characterizing the timbre of the human voice. For example, dry audio sung with a bel canto technique typically has a full, deep tone, while dry audio sung with a pop technique typically has a light, quick, and gentle tone.
Optionally, the dry audio is input to a pre-trained deep learning model, such that the deep learning model outputs tone characteristics corresponding to the dry audio. The deep learning model is trained from pre-labeled audio samples. The audio sample may be a mel-spectrum generated from a preset audio.
In some possible implementations, acquiring the timbre features may also include obtaining the frequency spectrum of the dry audio through a Fourier transform or the like, and selecting certain frequency-domain features as the timbre features.
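As an illustrative sketch of such frequency-domain timbre extraction (a stand-in for the trained deep-learning extractor described above, not the disclosure's actual model), the following computes a few classic spectral descriptors with NumPy; the choice of centroid, bandwidth, and rolloff as the feature vector is an assumption for illustration:

```python
import numpy as np

def timbre_features(signal: np.ndarray, sr: int = 44100) -> np.ndarray:
    """Extract a small frequency-domain timbre vector from a mono signal.

    Returns [spectral centroid, spectral bandwidth, spectral rolloff],
    all in Hz. Illustrative only; a production system might instead use
    a mel spectrogram fed to a trained model.
    """
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    power = spectrum ** 2
    total = power.sum() + 1e-12
    # Spectral centroid: power-weighted mean frequency.
    centroid = (freqs * power).sum() / total
    # Spectral bandwidth: power-weighted spread around the centroid.
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    # Spectral rolloff: frequency below which 85% of the power lies.
    cumulative = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cumulative, 0.85 * cumulative[-1])]
    return np.array([centroid, bandwidth, rolloff])
```

A pure 440 Hz tone, for instance, yields a centroid and rolloff near 440 Hz and a small bandwidth.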
Step S202, for each of N preset original audios, determining the tone similarity between the dry audio and the original audio according to the tone characteristics of the original audios and the tone characteristics of the dry audio, and selecting M original audios from the N original audios according to the tone similarity as target audios.
N and M are positive integers, N is greater than or equal to M, and the original audio frequency corresponds to preset mixing parameters. The mixing parameters may include parameters of audio effectors such as a front compressor, a parameter equalizer, a dynamic equalizer, a multi-stage exciter, a rear compressor, a multi-stage compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
Optionally, based on the preset N original audios, calculating the similarity between the tone characteristic of each original audio and the tone characteristic of the dry audio, so as to obtain N tone similarities, and selecting M target audios from the N original audios through a preset similarity threshold.
In one example, as shown in fig. 3, the N original audios include audio 1, audio 2, …, audio N. If the tone similarity between the tone characteristics of audio 1 and the tone characteristics of the dry audio is greater than or equal to a preset first threshold, the tone similarity between the tone characteristics of audio 2 and the tone characteristics of the dry audio is also greater than or equal to the first threshold, and the tone similarities of the other audios are all less than the first threshold, then M is 2 and the M target audios are audio 1 and audio 2.
In some possible implementations, after obtaining the similarity between the timbre feature of each original audio and the timbre feature of the dry audio, the N original audios may be ranked in descending order of similarity, and the first M original audios may be taken as the target audios.
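The ranking-and-selection step above might be sketched as follows, using cosine similarity over timbre feature vectors (the function name `select_targets` is illustrative):

```python
import numpy as np

def select_targets(dry_feat: np.ndarray, original_feats: np.ndarray, m: int):
    """Rank N original audios by cosine similarity to the dry audio's
    timbre feature vector; return indices and similarities of the top M."""
    dry = dry_feat / (np.linalg.norm(dry_feat) + 1e-12)
    originals = original_feats / (
        np.linalg.norm(original_feats, axis=1, keepdims=True) + 1e-12
    )
    sims = originals @ dry               # N cosine similarities in [-1, 1]
    top = np.argsort(sims)[::-1][:m]     # indices of the M most similar
    return top, sims[top]
```

For example, with three original audios and M = 2, the two feature vectors closest in direction to the dry audio's vector are returned.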
In some possible implementations, the timbre characteristics of the original audio and the timbre characteristics of the dry audio may be extracted in the same manner. For example, a feature extraction model may be trained in advance from the audio sample, and timbre features of the original audio and the dry audio may be extracted by the trained extraction model, respectively.
In some possible implementations, the original audios may be sung by a person, with different original audios having different timbres.
In some possible implementations, the original audio may be generated by a preset timbre model trained on pre-labeled dry sound data. The timbre model may be an AI model based on artificial intelligence techniques such as AIGC.
Step S203, for each target audio in the M target audio, performing mixing processing on the dry audio according to the mixing parameters corresponding to the target audio, so as to obtain wet audio corresponding to the target audio.
Alternatively, for each target audio, parameters of the target audio when using various audio effectors in the mixing process may be obtained, and the dry audio may be processed by using the corresponding audio effector according to the parameters, so as to obtain the wet audio.
In one example, as shown in fig. 3, the target audios include audio 1, audio 2, …, audio M. Taking M=2 as an example, the target audios include audio 1 and audio 2. During the mixing of audio 1, a plurality of audio effectors such as a pre-compressor, a parametric equalizer, and a dynamic equalizer were applied; the mixing engineer continuously adjusted the effector parameters by listening to audio 1 under their effect, and the effector parameters after this tuning are recorded as mixing parameters 1. Correspondingly, the effector parameters obtained after mixing audio 2 are recorded as mixing parameters 2. For the dry audio, the same audio effectors (pre-compressor, parametric equalizer, dynamic equalizer, and so on) can then be applied with mixing parameters 1 and mixing parameters 2 respectively, yielding two different wet audios: one is the result of mixing the dry audio with the mixing parameters of audio 1 (the wet audio corresponding to audio 1), and the other is the result of mixing the dry audio with the mixing parameters of audio 2 (the wet audio corresponding to audio 2).
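The per-target rendering of step S203 can be sketched as below. The effector chain here is reduced to a single hypothetical gain stage so the control flow is visible; in a real system each entry would be a DSP effector (compressor, equalizer, reverberator, …) driven by the recorded parameters:

```python
import numpy as np

def apply_gain(signal: np.ndarray, params: dict) -> np.ndarray:
    """Toy effector: scale the signal. Stands in for a real DSP effect."""
    return signal * params["gain"]

# Hypothetical effector chain; a real chain would list compressor, EQ, etc.
EFFECT_CHAIN = [("gain", apply_gain)]

def render_wet(dry: np.ndarray, mixing_params: dict) -> np.ndarray:
    """Apply every effector in the chain using one target audio's
    recorded mixing parameters, producing one wet audio."""
    wet = dry.copy()
    for name, effect in EFFECT_CHAIN:
        wet = effect(wet, mixing_params[name])
    return wet

# One wet audio per target audio, as in step S203 (M = 2 here).
dry = np.ones(4)
params_per_target = [{"gain": {"gain": 0.5}}, {"gain": {"gain": 2.0}}]
wets = [render_wet(dry, p) for p in params_per_target]
```

Each recorded parameter set thus produces its own wet rendering of the same dry audio.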
Step S204, according to the tone similarity between each target audio and the dry audio, M wet audio is combined to obtain the audio mixture corresponding to the dry audio.
Optionally, audio processing software is used to align the audio tracks of the M wet audios on a time axis, adjust the volume of each wet audio's track based on the tone similarity between its corresponding target audio and the dry audio, and stack all the adjusted tracks together to obtain the track of the final mixed audio.
In some possible implementations, the timbre similarity between the target audio corresponding to the wet audio and the dry audio may be positively correlated with the volume of the audio track of the wet audio. For example, the target audio corresponding to the wet audio is audio 1, and the greater the tone similarity between audio 1 and the dry audio, the greater the volume of the wet audio after adjustment.
In the above embodiment, by acquiring the tone characteristics of the dry audio, the tone similarity between the dry audio and each original audio may be determined according to the tone characteristics of the dry audio and of the original audio; M target audios may be selected from the N original audios according to the tone similarity; the mixing parameters corresponding to each target audio may be applied to the dry audio to obtain M wet audios; and the M wet audios may then be combined based on the tone similarity to obtain the mixed audio of the dry audio. With this method, only N original audios need to be prepared in advance, with the mixing parameters of each original audio recorded after mixing, so that the mixing parameters of original audios with high tone similarity to the dry audio can be applied when mixing the dry audio. There is no need to repeatedly listen to the dry audio to tune the mixing parameters, which improves mixing efficiency, shortens mixing time, and meets more mixing requirements.
In one embodiment, as shown in fig. 4, combining the M wet audios according to the tone similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio includes:
step S401, for each target audio, determining a weight value corresponding to the target audio according to the tone similarity between the target audio and the dry audio.
The tone similarity between the target audio and the dry audio may be cosine similarity calculated based on tone characteristics of the target audio and tone characteristics of the dry audio.
In some possible implementations, the timbre similarity between the target audio and the dry audio may be directly used as the weight value corresponding to the target audio.
In some possible implementations, the timbre similarity corresponding to the target audio may also be mapped to a weight value within a preset interval. For example, the tone similarity between the target audio and the dry audio lies in the interval [-1, 1]; it can be mapped to a value in the interval [0, 1], and the mapped value can be used as the weight value.
Illustratively, the mapping relationship between the tone similarity and the weight value may be established according to the following formula (1) and formula (2):

weight(k) = (similarity(k) + 1) / Σ_{j=1}^{M} (similarity(j) + 1)    formula (1)

Σ_{k=1}^{M} weight(k) = 1    formula (2)

In the above formulas, weight(k) may represent the weight value corresponding to the kth target audio, and similarity(k) may represent the timbre similarity between the kth target audio and the dry audio.
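A minimal sketch of this weighting scheme, assuming the affine map from [-1, 1] to [0, 1] followed by normalization so that the M weights sum to 1 (formula (2)):

```python
def similarity_weights(similarities):
    """Map cosine similarities in [-1, 1] to [0, 1], then normalize so
    the weights sum to 1. The affine map (s + 1) / 2 is one plausible
    realization of the mapping described in the text."""
    mapped = [(s + 1.0) / 2.0 for s in similarities]
    total = sum(mapped)
    return [m / total for m in mapped]
```

For example, similarities [1.0, 0.0] map to [1.0, 0.5] and normalize to weights [2/3, 1/3].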
And step S402, carrying out weighted combination on M wet sound frequencies according to the weight value of each target audio to obtain the mixed audio corresponding to the dry sound frequency.
Optionally, the M wet audios are input into audio processing software, the tracks of all wet audios are aligned on a time axis, and the tracks of the wet audios are superposed together, each weighted by the weight value of its corresponding target audio, to obtain the track of the mixed audio.
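Assuming the M wet audios are already time-aligned and of equal length, the weighted combination above might be sketched as:

```python
import numpy as np

def combine_wet(wets, weights) -> np.ndarray:
    """Weighted sum of M time-aligned wet audios (rows), producing the
    final mixed vocal track. `weights` are the per-target weight values."""
    wets = np.asarray(wets, dtype=float)                   # shape (M, T)
    weights = np.asarray(weights, dtype=float).reshape(-1, 1)
    return (weights * wets).sum(axis=0)                    # shape (T,)
```

With weights from the similarity mapping summing to 1, the output stays in the same amplitude range as the inputs.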
In one embodiment, the mixing processing of the dry audio according to the mixing parameters corresponding to the target audio includes:
Adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio; and mixing the dry audio by the adjusted mixing parameters to obtain the corresponding wet audio.
Wherein the mixing parameters may include a reverberation time.
The present inventors have found that, if mixing a dry audio using a mixing parameter of a target audio, it is necessary to consider the influence of the audio speed on the mixing. The audio speed of the target audio may be different from the audio speed of the dry audio, and some mixing parameters of the target audio are determined based on the audio speed of the target audio, and if the mixing parameters are directly used for the dry audio, the mixing effect of the dry audio may be affected.
Therefore, the inventors think that parameters affected by the audio speed, such as the reverberation time, in the mixing parameters can be adjusted according to the speed difference between the target audio and the dry audio, so that the adjusted parameters are more suitable for the audio speed of the dry audio. For example, for fast songs the reverberation time needs to be suitably shortened, and for slow songs the reverberation time needs to be suitably lengthened.
In some possible implementations, adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio may include shortening the reverberation time if the audio speed of the dry audio is greater than or equal to a preset speed threshold, and increasing the reverberation time if the audio speed of the dry audio is less than the speed threshold. The speed threshold may be determined from the audio speed of the target audio.
In the above embodiment, the speed threshold may be determined according to the audio speed of the target audio, and the mixing parameters such as the reverberation time may be adjusted according to the relation between the audio speed of the dry audio and the speed threshold, so that the adjusted parameters are more suitable for the dry audio, and thus the mixing effect may be enhanced.
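A minimal sketch of this rule, where the scaling factor is an illustrative assumption rather than a value from the disclosure:

```python
def adjust_reverb_time(reverb_time: float, dry_bpm: float,
                       speed_threshold: float, factor: float = 0.8) -> float:
    """Shorten the recorded reverberation time for fast dry audio and
    lengthen it for slow dry audio. `speed_threshold` would be derived
    from the target audio's tempo; `factor` is illustrative."""
    if dry_bpm >= speed_threshold:
        return reverb_time * factor   # fast song: shorter reverb tail
    return reverb_time / factor       # slow song: longer reverb tail
```

For example, a 2.0 s reverberation time recorded for the target audio would be shortened for a 140 BPM dry audio against a 120 BPM threshold, and lengthened for a 100 BPM one.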
In one embodiment, the mixing processing of the dry audio according to the mixing parameters corresponding to the target audio includes:
Adjusting the loudness-related parameters in the mixing parameters according to the loudness of the dry audio; and mixing the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
Wherein the mixing parameters may include a compressor threshold.
The inventor finds that if the mixing parameters of the target audio are used for mixing the dry audio, the influence of the loudness of the audio track on the mixing is also considered. The track loudness of the target audio may be different from the track loudness of the dry audio, and some mixing parameters of the target audio (such as parameters of audio effectors such as reverberator and multi-stage compressor) are determined based on the track loudness of the target audio, where the mixing parameters may affect the mixing effect of the dry audio if directly applied to the dry audio.
Therefore, the inventor thinks that parameters affected by the track loudness, such as a compressor threshold value, in the mixing parameters can be adjusted according to the track loudness difference between the target audio and the dry audio, so that the adjusted parameters are more suitable for the track loudness of the dry audio. For example, for low level tracks, the compressor threshold needs to be appropriately lowered, and for high level tracks, the compressor threshold needs to be appropriately increased.
In some possible implementations, adjusting the loudness-related parameters in the mixing parameters according to the loudness of the dry audio may include increasing the compressor threshold if the loudness of the dry audio is greater than or equal to a preset loudness threshold, and decreasing the compressor threshold if the loudness of the dry audio is less than the loudness threshold. The loudness threshold may be determined from the track loudness of the target audio.
In the above embodiment, the loudness threshold may be determined according to the track loudness of the target audio, and the mixing parameters such as the compressor threshold may be adjusted according to the relation between the track loudness of the dry audio and the loudness threshold, so that the adjusted parameters are more suitable for the dry audio, and thus the mixing effect may be enhanced.
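Analogously, the compressor-threshold rule might be sketched as follows; the 3 dB step is an illustrative assumption, not a value from the disclosure:

```python
def adjust_compressor_threshold(threshold_db: float, dry_loudness_db: float,
                                loudness_threshold_db: float,
                                step_db: float = 3.0) -> float:
    """Raise the compressor threshold for loud dry audio and lower it
    for quiet dry audio. `loudness_threshold_db` would be derived from
    the target audio's track loudness; `step_db` is illustrative."""
    if dry_loudness_db >= loudness_threshold_db:
        return threshold_db + step_db   # loud track: compress less material
    return threshold_db - step_db       # quiet track: compress more material
```

For example, a recorded threshold of -18 dB would move up for a dry track louder than the target's loudness and down for a quieter one.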
In one embodiment, the mixing parameters include effector parameters, as shown in fig. 5, and the mixing parameters corresponding to the original audio are obtained by:
in step S501, the original audio is adjusted through a preset effector.
Wherein the effecter comprises one or more of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-segment exciter, a post-compressor, a multi-segment compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
Optionally, the original audio is input into the audio processing software, the selected effector is hung on an audio track of the original audio, the mixing effect is evaluated through a manual listening test and the like, and parameters of the effector can be continuously adjusted according to the mixing effect.
In some possible implementations, all the effectors needed for mixing and the order in which they process the audio may be predefined. As shown in fig. 6, the effectors may be hung on the audio track of the original audio and tuned in order along an effect chain consisting of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-segment exciter, a post-compressor, a multi-segment compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
In some possible implementations, an accompaniment matched to the original audio may also be determined; the audio tracks of the original audio and the accompaniment are aligned in audio processing software, listening is performed while the original audio and the accompaniment play together, and the effector parameters are adjusted according to the listening effect.
In step S502, if the tuning result meets a preset condition, the effector parameters corresponding to the tuning result are recorded as the mixing parameters corresponding to the original audio.
Optionally, if the listening effect of the tuned original audio meets the requirement, the effector parameters used for the tuning can be recorded. For N original audios, N sets of parameters may be obtained.
In some possible implementations, if the tuning result meets the preset condition, the audio speed and the audio loudness of the original audio under that condition may also be recorded. By recording the speed and loudness at which the mixing parameters were calibrated, those effector parameters that relate to audio speed and loudness can later be flexibly adjusted based on the recorded speed and loudness of the original audio and on the speed and loudness of the dry audio to be mixed, so that the adjusted parameters are better suited to mixing the dry audio.
In one embodiment, the audio processing method may further include:
Acquiring energy characteristics of the mixed audio and masking characteristics of the accompaniment audio corresponding to the mixed audio; and adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy characteristics and the masking characteristics, so as to obtain mixed audio with balanced volume.
The energy characteristics can be calculated according to psychoacoustic model 2 in the standard ISO/IEC 11172-3. The masking characteristics can be calculated according to a masking threshold curve of each frame of audio in the accompaniment audio, and the masking threshold curve can likewise be calculated according to psychoacoustic model 2 in ISO/IEC 11172-3. The masking threshold may refer to a per-frequency-band value that measures the ability of sound A to mask sound B, and may be calculated from the psychoacoustic model.
Optionally, the energy characteristic of each frame of audio in the mixed audio and the masking characteristic of each frame of audio in the accompaniment audio are acquired. The sum of the per-frame energy characteristics of the mixed audio and the sum of the per-frame masking characteristics of the accompaniment audio can then be calculated; an optimal volume ratio of the mixed audio to the accompaniment audio can be determined from these two sums in combination with a preset mapping relation, and the mixed audio and the accompaniment audio can be superimposed according to that ratio to obtain final mixed audio with balanced volume. The preset mapping relation can be obtained from a plurality of audio samples, each of which has a known good volume ratio; by analyzing the sum of the energy characteristics of the vocal and the sum of the masking characteristics of the accompaniment in these samples, a mapping relation between the two parameters (the energy sum and the masking sum) and the volume ratio can be established.
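The lookup step above can be sketched as follows. This is a hedged sketch under stated assumptions: the per-frame energy and masking values are assumed to be precomputed (the ISO/IEC 11172-3 psychoacoustic model itself is out of scope here), and the mapping table and nearest-key rule are illustrative, not the disclosed mapping.

```python
# Hedged sketch: determine a vocal/accompaniment volume ratio from the
# sums of per-frame energy and masking features. The mapping dict keys
# are (energy_sum, masking_sum) pairs from hypothetical audio samples,
# and the values are their known-good volume ratios.

def volume_ratio(energy_per_frame, masking_per_frame, mapping):
    e_sum = sum(energy_per_frame)    # total vocal energy characteristic
    m_sum = sum(masking_per_frame)   # total accompaniment masking characteristic
    # Pick the mapping entry whose key is nearest to the observed sums
    # (a simple stand-in for the preset mapping relation).
    key = min(mapping, key=lambda k: (k[0] - e_sum) ** 2 + (k[1] - m_sum) ** 2)
    return mapping[key]

# Illustrative mapping built from three hypothetical calibrated samples.
mapping = {(10.0, 5.0): 1.2, (20.0, 5.0): 0.9, (10.0, 15.0): 1.6}
ratio = volume_ratio([3.0, 4.0, 4.5], [1.5, 1.8, 1.6], mapping)
```

A production implementation would presumably interpolate between calibrated points rather than snap to the nearest one; the nearest-key rule is used only to keep the sketch short.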
In one embodiment, the audio processing method may further include:
Acquiring the track loudness of the mixed audio; and if the track loudness does not fall into a preset loudness interval, adjusting the mixed audio through a preset effector so that the track loudness of the adjusted mixed audio falls into the loudness interval.
Wherein the loudness interval may be determined based on audio release criteria.
For example, some song release standards require a track loudness in [−12 LUFS, −8 LUFS]. For such a standard, a Maximizer effector can be used to boost the track loudness to −9 LUFS, with the two Maximizer parameters set as follows:
Threshold_Maximizer = LUFS_mixed − (−9); Ceiling_Maximizer = −0.2.
In this way, the loudness of the mixed audio can be increased to meet the requirements of the audio release standard without distorting the audio waveform.
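The parameter formulas above can be expressed directly in code. The −9 LUFS target and −0.2 dB ceiling are taken from the example release standard in the text; the function name is a convenience of this sketch.

```python
# Compute the two Maximizer parameters from the measured integrated
# loudness (LUFS) of the mixed audio, per the formulas in the text:
#   Threshold_Maximizer = LUFS_mixed - (-9)
#   Ceiling_Maximizer   = -0.2

def maximizer_params(lufs_mixed, target_lufs=-9.0, ceiling_db=-0.2):
    threshold = lufs_mixed - target_lufs   # negative when below target
    return threshold, ceiling_db

# E.g. a mix measured at -14 LUFS needs its threshold set 5 dB below 0:
threshold, ceiling = maximizer_params(-14.0)
# threshold = -14 - (-9) = -5.0; ceiling = -0.2
```

The threshold thus equals the loudness deficit relative to the target, so quieter mixes receive proportionally more maximization.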
In one embodiment, the audio processing method may further include:
Adding new original audio, and performing mixing processing on the new original audio by means such as adjusting effector parameters, so as to obtain the mixing parameters corresponding to the new original audio.
The new original audio can be sung by a real person or generated by a new AI model.
In the above embodiment, by adding the original audio, the range of parameters used for the dry audio mixing can be enlarged, so that the overall mixing effect can be enhanced.
Exemplary Medium
Having described the method of the exemplary embodiments of the present disclosure, next, a storage medium of the exemplary embodiments of the present disclosure will be described with reference to fig. 7.
Referring to fig. 7, a storage medium 70, in which a program product for implementing the above-described method according to an embodiment of the present disclosure is stored, may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the context of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
Exemplary apparatus
Having described the media of the exemplary embodiments of the present disclosure, the audio processing device of the exemplary embodiments of the present disclosure will be described with reference to fig. 8, so as to implement the method in any of the foregoing method embodiments, and the implementation principle and technical effect are similar, and are not repeated herein.
As shown in fig. 8, the audio processing apparatus 800 may include:
an acquisition module 801 is configured to acquire a tone characteristic of the dry audio.
The similarity determining module 802 is configured to determine, for each of N preset original audios, a timbre similarity between the dry audio and the original audio according to timbre features of the original audio and timbre features of the dry audio, and to select M original audios from the N original audios according to the timbre similarity as target audios, where N and M are positive integers, N is greater than or equal to M, and the original audios correspond to preset mixing parameters.
The wet sound processing module 803 is configured to mix, for each target audio of the M target audios, the dry audio according to the mixing parameters corresponding to the target audio, so as to obtain the wet audio corresponding to the target audio.
And the mixing combination module 804 is configured to combine the M wet audios according to the timbre similarity between each target audio and the dry audio, so as to obtain the mixed audio corresponding to the dry audio.
In yet another embodiment of the present disclosure, the mixing and combining module 804 may include:
the weight determining unit may be configured to determine, for each target audio, a weight value corresponding to the target audio according to a timbre similarity between the target audio and the dry audio.
And the wet sound combination unit may be configured to perform weighted combination on the M wet audios according to the weight value of each target audio, to obtain the mixed audio corresponding to the dry audio.
In yet another embodiment of the present disclosure, the weight determining unit is further configured to map a timbre similarity corresponding to the target audio to a weight value in a preset interval.
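The similarity-to-weight mapping and the weighted combination described by the units above can be sketched together. The linear rescale into the preset interval and the normalization step are assumptions of this sketch, not the disclosed formulas.

```python
# Sketch: map each target audio's timbre similarity into a weight in a
# preset interval, normalize the weights, and mix the M wet audios
# sample-by-sample. Interval bounds [0.2, 1.0] are illustrative.

def similarity_to_weight(sim, lo=0.2, hi=1.0):
    # Linearly map a similarity in [0, 1] into the preset interval [lo, hi].
    return lo + (hi - lo) * sim

def combine_wet_audios(wet_audios, similarities):
    weights = [similarity_to_weight(s) for s in similarities]
    total = sum(weights)
    norm = [w / total for w in weights]        # normalize so weights sum to 1
    length = len(wet_audios[0])
    # Weighted per-sample sum across the M wet audios.
    return [
        sum(norm[i] * wet_audios[i][n] for i in range(len(wet_audios)))
        for n in range(length)
    ]

# Two toy wet audios; the more similar target contributes more.
mixed = combine_wet_audios([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.4])
```

Normalizing the weights keeps the combined signal at a comparable level regardless of how many target audios are selected.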
In yet another embodiment of the present disclosure, the wet sound processing module 803 is further configured to adjust a speed-related parameter of the mixing parameters according to the audio speed of the dry audio, and/or adjust a loudness-related parameter of the mixing parameters according to the loudness of the dry audio, and to mix the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
In yet another embodiment of the present disclosure, the wet sound processing module 803 is further configured to shorten the reverberation time if the audio speed of the dry audio is greater than or equal to a preset speed threshold, and to increase the reverberation time if the audio speed of the dry audio is less than the speed threshold.
In yet another embodiment of the present disclosure, the wet sound processing module 803 is further configured to increase the compressor threshold if the loudness of the dry audio is greater than or equal to a preset loudness threshold, and to decrease the compressor threshold if the loudness of the dry audio is less than the loudness threshold.
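The two adjustment rules above reduce to simple threshold comparisons. The thresholds and step sizes in this sketch are illustrative placeholders; the disclosure does not specify numeric values.

```python
# Sketch of the rules above: reverberation time shortens for fast dry
# audio and lengthens for slow dry audio; the compressor threshold is
# raised for loud dry audio and lowered for quiet dry audio.
# speed_thr (BPM), loud_thr (dB), and the 0.8x/1.2x and +/-2 dB steps
# are illustrative, not from the disclosure.

def adjust_params(params, bpm, loudness, speed_thr=120.0, loud_thr=-18.0):
    adjusted = dict(params)                    # leave the preset untouched
    if bpm >= speed_thr:
        adjusted["reverb_time"] *= 0.8         # shorten reverberation time
    else:
        adjusted["reverb_time"] *= 1.2         # increase reverberation time
    if loudness >= loud_thr:
        adjusted["comp_threshold"] += 2.0      # raise compressor threshold
    else:
        adjusted["comp_threshold"] -= 2.0      # lower compressor threshold
    return adjusted

base = {"reverb_time": 1.5, "comp_threshold": -20.0}
fast_loud = adjust_params(base, bpm=140.0, loudness=-12.0)
```

Copying the preset before adjusting means the recorded calibration parameters stay intact and can be reused for other dry audios.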
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
And the tuning module is configured to tune the original audio through a preset effector, wherein the effector comprises one or more of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-band exciter, a post-compressor, a multi-band compressor, a stereo enhancer, a delayer, a reverberator, and a volume adjuster.
And the recording module is configured to record the effector parameters corresponding to the tuning result as the mixing parameters corresponding to the original audio in a case where the tuning result meets a preset condition.
In yet another embodiment of the present disclosure, the original audio is generated by a pre-set timbre model trained based on pre-labeled dry sound data.
In yet another embodiment of the present disclosure, the obtaining module 801 is further configured to input a dry audio to a pre-trained deep learning model, so that the deep learning model outputs a tone characteristic corresponding to the dry audio, where the deep learning model is trained according to a pre-labeled audio sample, and the audio sample is a mel spectrum generated according to a preset audio.
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
and the characteristic acquisition module is used for acquiring the energy characteristic of the mixed audio and the masking characteristic of the accompaniment audio corresponding to the mixed audio.
And the volume balancing module is used for adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy characteristics and the masking characteristics so as to obtain the mixed audio after volume balancing.
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
And the loudness acquisition module is used for acquiring the track loudness of the mixed audio.
And the loudness adjusting module is used for adjusting the mixed audio through a preset effector under the condition that the track loudness does not fall into a preset loudness interval, so that the track loudness of the adjusted mixed audio falls into the loudness interval.
Exemplary computing device
Having described the methods, media, and apparatus of exemplary embodiments of the present disclosure, a computing device of exemplary embodiments of the present disclosure is next described with reference to fig. 9.
The computing device 90 shown in fig. 9 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, the computing device 90 is in the form of a general purpose computing device. Components of computing device 90 may include, but are not limited to, at least one processing unit 901, at least one storage unit 902, and a bus 903 that connects the different system components, including the processing unit 901 and the storage unit 902. The at least one storage unit 902 has stored therein computer-executable instructions, and the at least one processing unit 901 comprises a processor that executes the computer-executable instructions to implement the methods described above.
Bus 903 includes a data bus, a control bus, and an address bus.
The storage unit 902 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 9021 and/or cache memory 9022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 9023.
The storage unit 902 may also include a program/utility 9025 having a set (at least one) of program modules 9024, such program modules 9024 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 90 may also communicate with one or more external devices 904 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 905. Moreover, the computing device 90 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through the network adapter 906. As shown in fig. 9, the network adapter 906 communicates with other modules of the computing device 90 over the bus 903. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computing device 90, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of an audio processing device are mentioned, such a division is only exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed, and that the division into aspects does not mean that features in these aspects cannot be combined to advantage; this division is for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.