
CN119068891B - Audio processing method, device, medium, computing device and program product - Google Patents

Audio processing method, device, medium, computing device and program product

Info

Publication number
CN119068891B
CN119068891B (application CN202411067965.2A)
Authority
CN
China
Prior art keywords
audio
dry
mixing
original
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411067965.2A
Other languages
Chinese (zh)
Other versions
CN119068891A (en)
Inventor
熊贝尔
孙校珩
张柏达
徐扬帆
雷童净
黄安麒
王燕凤
刘华平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd filed Critical Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202411067965.2A priority Critical patent/CN119068891B/en
Publication of CN119068891A publication Critical patent/CN119068891A/en
Application granted
Publication of CN119068891B publication Critical patent/CN119068891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/01: Correction of time axis
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement by changing the amplitude
    • G10L21/0356: Speech enhancement by changing the amplitude for synchronising with other signals, e.g. video signals
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/21: the extracted parameters being power information
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/48: specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Tone Control, Compression And Expansion, Limiting Amplitude (AREA)

Abstract


This disclosure provides an audio processing method, apparatus, medium, computing device, and program product, relating to the field of audio processing. The method includes: acquiring the timbre characteristics of a dry audio; determining the timbre similarity between the dry audio and each original audio based on the timbre characteristics; selecting M target audios from N original audios based on the timbre similarity; applying the mixing parameters corresponding to each target audio to the dry audio to obtain M wet audios; and combining the M wet audios based on the timbre similarity to obtain a mixed audio. This approach only requires preparing N original audios in advance and recording the mixing parameters of each, so that the mixing parameters of original audios whose timbre closely matches the dry audio can be applied when mixing it, eliminating repeated listening tests to tune mixing parameters. This improves mixing efficiency, shortens mixing time, and meets more mixing requirements.

Description

Audio processing method, device, medium, computing device and program product
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and more particularly, to an audio processing method, apparatus, medium, computing device, and program product.
Background
This section is intended to provide a background or context for embodiments of the present disclosure. The description herein is not admitted to be prior art by inclusion in this section.
Mixing (audio mixing) is a key step in the audio production process: multiple audio tracks, such as vocals, instruments, and effects, are combined to form a harmonious, balanced final audio work.
At present, when mixing, various audio effectors, such as a compressor, an equalizer, a reverberator, and a fader for adjusting volume, are applied to the audio to be mixed, and the parameters of these effectors are adjusted mainly by repeated manual listening in order to improve audio quality.
However, mixing is technically difficult and usually requires a highly experienced engineer to operate the mixer personally, so mixing efficiency is low and it is hard to meet the growing demand for mixing.
Disclosure of Invention
The present disclosure provides an audio processing method, apparatus, medium, computing device, and program product to meet more mixing requirements.
In a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including:
acquiring the timbre characteristics of the dry audio;
for each of N preset original audios, determining the timbre similarity between the dry audio and the original audio according to the timbre characteristics of the original audio and the timbre characteristics of the dry audio, and selecting M original audios from the N original audios as target audios according to the timbre similarity, where N and M are positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters;
for each of the M target audios, mixing the dry audio according to the mixing parameters corresponding to that target audio to obtain the wet audio corresponding to that target audio;
and combining the M wet audios according to the timbre similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio.
In another embodiment of the present disclosure, combining the M wet audios according to the timbre similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio includes:
for each target audio, determining a weight value for the target audio according to the timbre similarity between the target audio and the dry audio;
and weighting and combining the M wet audios according to the weight value of each target audio to obtain the mixed audio corresponding to the dry audio.
In another embodiment of the present disclosure, determining a weight value for the target audio according to the timbre similarity between the target audio and the dry audio includes:
mapping the timbre similarity corresponding to the target audio to a weight value within a preset interval.
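The mapping and the weighted combination from the preceding embodiments can be sketched as below. This is a minimal illustration, not the patent's implementation: the linear map, the interval bounds [0.2, 1.0], and the plain normalized weighting are all assumptions.

```python
def similarity_to_weight(sim, lo=0.2, hi=1.0):
    """Linearly map a similarity in [0, 1] into the preset interval [lo, hi].
    The interval bounds are illustrative assumptions."""
    sim = min(max(sim, 0.0), 1.0)
    return lo + (hi - lo) * sim

def weighted_mix(wet_audios, sims):
    """Weight each wet track by its mapped similarity, normalise the
    weights, and sum sample-by-sample into one combined track."""
    weights = [similarity_to_weight(s) for s in sims]
    total = sum(weights)
    n = len(wet_audios[0])
    return [sum(w * track[i] for w, track in zip(weights, wet_audios)) / total
            for i in range(n)]

# The wet track with the higher similarity dominates the result.
mixed = weighted_mix([[1.0, 1.0], [0.0, 0.0]], sims=[1.0, 0.5])
print(mixed)
```

With sims = [1.0, 0.5] the weights become 1.0 and 0.6, so the first track contributes 1.0/1.6 of each sample.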
In another embodiment of the present disclosure, mixing the dry audio according to the mixing parameters corresponding to the target audio includes:
adjusting the speed-related parameters among the mixing parameters according to the tempo of the dry audio, and/or adjusting the loudness-related parameters among the mixing parameters according to the loudness of the dry audio;
and mixing the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
In another embodiment of the present disclosure, the mixing parameters include a reverberation time, and adjusting the speed-related parameters among the mixing parameters according to the tempo of the dry audio includes:
if the tempo of the dry audio is greater than or equal to a preset tempo threshold, shortening the reverberation time;
and if the tempo of the dry audio is less than the tempo threshold, lengthening the reverberation time.
In another embodiment of the present disclosure, the mixing parameters include a compressor threshold, and adjusting the loudness-related parameters among the mixing parameters according to the loudness of the dry audio includes:
if the loudness of the dry audio is greater than or equal to a preset loudness threshold, raising the compressor threshold;
and if the loudness of the dry audio is less than the loudness threshold, lowering the compressor threshold.
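The tempo-based and loudness-based adjustment rules above can be sketched as follows. All names, thresholds, and step sizes here are illustrative assumptions; the patent states only the direction of each adjustment, not concrete values.

```python
def adjust_mix_params(params, bpm, loudness_db,
                      bpm_threshold=120.0, loudness_threshold=-18.0,
                      reverb_step_s=0.2, threshold_step_db=3.0):
    """Fast songs get a shorter reverb tail so the reverb does not smear
    the beat; louder vocals get a higher compressor threshold so the
    compressor does not over-compress them. Threshold and step values
    are illustrative, not from the patent."""
    adjusted = dict(params)
    if bpm >= bpm_threshold:
        adjusted["reverb_time_s"] = max(0.1, params["reverb_time_s"] - reverb_step_s)
    else:
        adjusted["reverb_time_s"] = params["reverb_time_s"] + reverb_step_s
    if loudness_db >= loudness_threshold:
        adjusted["compressor_threshold_db"] = params["compressor_threshold_db"] + threshold_step_db
    else:
        adjusted["compressor_threshold_db"] = params["compressor_threshold_db"] - threshold_step_db
    return adjusted

base = {"reverb_time_s": 1.2, "compressor_threshold_db": -20.0}
# A fast (140 BPM), loud (-14 dB) vocal: shorter reverb, higher threshold.
adjusted = adjust_mix_params(base, bpm=140, loudness_db=-14.0)
print(adjusted)
```

The original parameter dictionary is copied rather than mutated, so the recorded mixing parameters of the target audio stay intact for reuse with other dry audios.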
In another embodiment of the present disclosure, the mixing parameters include effector parameters, and the mixing parameters corresponding to an original audio are obtained by:
adjusting the original audio with preset effectors, the effectors including one or more of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-band exciter, a post-compressor, a multi-band compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster;
and if the adjustment result meets a preset condition, recording the effector parameters corresponding to the adjustment result as the mixing parameters corresponding to the original audio.
In another embodiment of the present disclosure, the original audio is generated by a preset timbre model trained based on pre-labeled dry sound data.
In another embodiment of the present disclosure, acquiring the timbre characteristics of the dry audio includes:
inputting the dry audio into a pre-trained deep learning model so that the model outputs the timbre characteristics corresponding to the dry audio, where the deep learning model is trained on pre-labeled audio samples, each audio sample being a mel spectrum generated from preset audio.
In another embodiment of the present disclosure, the audio processing method further includes:
acquiring energy characteristics of the mixed audio and masking characteristics of the accompaniment audio corresponding to the mixed audio;
and adjusting the volume ratio between the mixed audio and the accompaniment audio according to the energy characteristics and the masking characteristics to obtain volume-balanced mixed audio.
In another embodiment of the present disclosure, the audio processing method further includes:
acquiring the track loudness of the mixed audio;
and if the track loudness does not fall within a preset loudness interval, adjusting the mixed audio with preset effectors so that the track loudness of the adjusted mixed audio falls within the interval.
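One hedged way to realize this loudness check is a flat make-up gain driven by RMS level, rather than the effector-based adjustment the patent describes. The interval bounds and the use of simple RMS in place of a standardized loudness measure are simplifying assumptions.

```python
import math

def rms_db(samples):
    """RMS level of a track in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def normalize_loudness(samples, lo_db=-16.0, hi_db=-12.0):
    """If the track loudness falls outside [lo_db, hi_db], apply a flat
    gain that moves it to the centre of the interval; otherwise return
    the track unchanged. The interval is an illustrative assumption."""
    level = rms_db(samples)
    if lo_db <= level <= hi_db:
        return samples
    gain = 10 ** (((lo_db + hi_db) / 2 - level) / 20)
    return [s * gain for s in samples]

quiet = [0.01] * 100              # a constant signal at about -40 dB
leveled = normalize_loudness(quiet)
print(round(rms_db(leveled), 1))  # → -14.0
```

A production system would use a gated loudness measure and an effector chain instead of a bare gain, but the interval test and the correction direction are the same.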
In a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
an acquisition module, configured to acquire the timbre characteristics of the dry audio;
a similarity determination module, configured to determine, for each of N preset original audios, the timbre similarity between the dry audio and the original audio according to the timbre characteristics of the original audio and of the dry audio, and to select M original audios from the N original audios as target audios according to the timbre similarity, where N and M are positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters;
a wet audio processing module, configured to mix, for each of the M target audios, the dry audio according to the mixing parameters corresponding to that target audio to obtain the wet audio corresponding to that target audio;
and a mixing combination module, configured to combine the M wet audios according to the timbre similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio.
In a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having computer-executable instructions stored therein which, when executed by a processor, implement the method according to any of the first aspects.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising at least one processor;
and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor to cause the computing device to perform the method of any of the first aspects.
In a fifth aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any of the first aspects.
According to the audio processing method, apparatus, medium, computing device, and program product of the embodiments of the present disclosure, by acquiring the timbre characteristics of the dry audio, the timbre similarity between the dry audio and each original audio can be determined from the timbre characteristics of both; M target audios can then be selected from the N original audios according to the timbre similarity, the mixing parameters corresponding to each target audio can be applied to the dry audio to obtain M wet audios, and the M wet audios can be combined based on the timbre similarity to obtain the mixed audio of the dry audio. In this way, only N original audios need to be prepared in advance, with the mixing parameters of each recorded after mixing, so that the mixing parameters of original audios whose timbre is highly similar to the dry audio can be applied to the mixing of the dry audio. Repeated listening tests on the dry audio to tune mixing parameters are no longer needed, which improves mixing efficiency, shortens mixing time, and meets more mixing demand.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario diagram of an audio processing method according to an embodiment of the present disclosure;
fig. 2 schematically illustrates a flow diagram of an audio processing method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow diagram of an audio processing method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a wet audio combining method according to an embodiment of the present disclosure;
Fig. 5 schematically illustrates a flowchart of an original audio mixing parameter acquisition method according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a structural schematic of an effector chain according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a schematic diagram of a storage medium of an embodiment of the present disclosure;
Fig. 8 schematically illustrates a structural diagram of an audio processing apparatus according to an embodiment of the present disclosure;
fig. 9 schematically illustrates a schematic diagram of a computing device according to an embodiment of the disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software.
According to embodiments of the present disclosure, an audio processing method, apparatus, medium, computing device, and program product are presented.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and be provided with corresponding operation entries for the user to select authorization or rejection.
Furthermore, any number of elements in the figures is for illustration and not limitation, and any naming is used for distinction only and not for any limiting sense.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
Mixing is a means of processing audio: audio effectors are used to process the vocals and the accompaniment so that they sound better, more harmonious, and better blended. Mixing usually involves several audio effectors, such as a compressor, an equalizer, and a reverberator, and manual mixing requires some music theory and experience with these effectors, so it is difficult.
However, experienced mixing engineers are few, and it is difficult to meet people's demand for mixing. In particular, with the development of AIGC (Artificial Intelligence Generated Content) and related technologies, the number of audio works sung by AI keeps growing, and these works also need mixing to improve their quality. There is therefore a need for an audio processing method that realizes automatic mixing to meet this demand.
Through research and experiment, the inventors found that several original audios with different timbres can be prepared in advance; each original audio is mixed manually, and the mixing parameters of the final mixing result are recorded. For a dry audio that needs mixing, several original audios with high timbre similarity to it are selected; the mixing parameters of each selected original audio are applied to the dry audio separately, and the timbre similarities are mapped to weights for a weighted combination of the resulting mixed versions, producing the final mixing result. In this way, only a limited set of pre-prepared original audios is mixed manually, their mixing parameters are recorded, and suitable parameters are selected for the dry audio based on its similarity to the original audios. The dry audio itself needs no laborious manual mixing, so automatic mixing is realized, mixing efficiency is improved, and more mixing demand can be met. The dry audio may be unmixed vocal audio, sung either by a human or by an AI.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Application scene overview
Referring first to fig. 1, fig. 1 is a schematic view of an application scenario provided in the present disclosure. As shown in fig. 1, a terminal 102 may communicate with a server 101 through a network. The data storage system may store data that the server 101 needs to process; it may be integrated on the server 101 or placed on a cloud or other network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device, or portable wearable device, where the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device, and the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 101 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
For a scenario in which a dry audio needs to be mixed, a plurality of original audios may be prepared in advance, the original audios may be manually mixed, mixing parameters may be recorded, the original audios and the mixing parameters may be stored in the server 101, the terminal 102 or the data storage system, and after the dry audio to be mixed is acquired, the server 101 or the terminal 102 may mix the dry audio based on the stored original audios and mixing parameters.
The application scenario mentioned above is only partially exemplified, and those skilled in the art may expand the application based on the audio processing procedure, which is not particularly limited by the embodiments of the present disclosure.
Exemplary method
An audio processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 6 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Fig. 2 is a flow chart of an audio processing method according to an embodiment of the disclosure. As shown in fig. 2, the method may include:
In step S201, the timbre characteristics of the dry audio are acquired.
The audio processing method of the embodiment of the present disclosure may be applied to the server 101 or the terminal 102 in fig. 1.
The timbre characteristics may be parameters characterizing the timbre of the human voice. For example, dry audio sung in a bel canto style typically has a full, rich timbre, while dry audio sung in a pop style typically has a light, agile, and soft timbre.
Optionally, the dry audio is input into a pre-trained deep learning model so that the model outputs the timbre characteristics corresponding to the dry audio. The deep learning model is trained on pre-labeled audio samples; each audio sample may be a mel spectrum generated from preset audio.
In some possible implementations, the timbre characteristics may also be obtained by computing the spectrum of the dry audio, for example via a Fourier transform, and selecting certain frequency-domain features as the timbre characteristics.
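As a rough illustration of such frequency-domain timbre features, the sketch below splits the magnitude spectrum into equal-width bands and uses the log band energies as a descriptor. This is a minimal stand-in for the patent's feature extractor; the band count and the band-energy descriptor are assumptions.

```python
import numpy as np

def timbre_features(signal: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Split the magnitude spectrum into n_bands equal-width bands and
    return the log energy of each band as a crude timbre descriptor.
    A real system would use a mel spectrum or a learned embedding."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bands)
    energies = np.array([np.sum(b ** 2) for b in bands])
    return np.log1p(energies)

# Example: a 440 Hz tone and a 3000 Hz tone at 16 kHz sampling rate
# concentrate their energy in different bands of the descriptor.
sr = 16000
t = np.arange(sr) / sr
low = timbre_features(np.sin(2 * np.pi * 440 * t))
high = timbre_features(np.sin(2 * np.pi * 3000 * t))
print(low.argmax(), high.argmax())
```

With 8 bands over an 8 kHz analysis range, each band spans about 1 kHz, so the 440 Hz tone peaks in the first band and the 3000 Hz tone in the third.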
Step S202, for each of the N preset original audios, determining the timbre similarity between the dry audio and the original audio according to the timbre characteristics of the original audio and the timbre characteristics of the dry audio, and selecting M original audios from the N original audios as target audios according to the timbre similarity.
N and M are positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters. The mixing parameters may include the parameters of audio effectors such as a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-band exciter, a post-compressor, a multi-band compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
Optionally, based on the N preset original audios, the similarity between the timbre characteristics of each original audio and those of the dry audio is computed, yielding N timbre similarities, and M target audios are selected from the N original audios using a preset similarity threshold.
In one example, as shown in fig. 3, the N original audios include audio 1, audio 2, …, audio N. If the similarity between the timbre characteristics of audio 1 and those of the dry audio is greater than or equal to a preset first threshold, the similarity for audio 2 is also greater than or equal to the first threshold, and the similarity for every other audio is less than the first threshold, then M is 2 and the M target audios are audio 1 and audio 2.
In some possible implementations, after obtaining the similarity between the timbre characteristics of each original audio and those of the dry audio, the N original audios may be ranked in descending order of similarity and the first M original audios taken as the target audios.
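The ranking described above can be sketched with cosine similarity over timbre feature vectors. The function name and the choice of cosine similarity are illustrative assumptions; the patent does not fix a particular similarity measure.

```python
import numpy as np

def top_m_by_timbre(dry_feat, original_feats, m):
    """Rank N original audios by cosine similarity between their timbre
    features and the dry vocal's timbre feature, and return the indices
    of the top-M matches together with all N similarities."""
    dry = dry_feat / np.linalg.norm(dry_feat)
    sims = np.array([f @ dry / np.linalg.norm(f) for f in original_feats])
    order = np.argsort(sims)[::-1]   # highest similarity first
    return order[:m].tolist(), sims

# Toy 2-D features: originals 0 and 1 resemble the dry vocal, 2 does not.
feats = [np.array([1.0, 0.0]), np.array([0.9, 0.4]), np.array([0.0, 1.0])]
idx, sims = top_m_by_timbre(np.array([1.0, 0.1]), feats, m=2)
print(idx)  # → [0, 1]
```

The threshold-based selection from the optional step is the same computation with `order[:m]` replaced by `np.nonzero(sims >= threshold)`.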
In some possible implementations, the timbre characteristics of the original audio and of the dry audio may be extracted in the same manner. For example, a feature extraction model may be trained in advance on audio samples, and the timbre characteristics of the original audio and of the dry audio may each be extracted with the trained model.
In some possible implementations, the original audio may be sung by a person, with different original audios having different timbres.
In some possible implementations, the original audio may be generated by a preset timbre model trained on pre-labeled dry sound data. The timbre model may be an AI model based on artificial intelligence techniques such as AIGC.
Step S203, for each target audio in the M target audio, performing mixing processing on the dry audio according to the mixing parameters corresponding to the target audio, so as to obtain wet audio corresponding to the target audio.
Alternatively, for each target audio, parameters of the target audio when using various audio effectors in the mixing process may be obtained, and the dry audio may be processed by using the corresponding audio effector according to the parameters, so as to obtain the wet audio.
In one example, as shown in fig. 3, the target audios include audio 1, audio 2, …, audio M. Taking M = 2 as an example, the target audios are audio 1 and audio 2. During the mixing of audio 1, several effectors such as a pre-compressor, a parametric equalizer, and a dynamic equalizer were applied; the mixing engineer tuned the effector parameters by listening to audio 1 under their effect, and the final tuned effector parameters were recorded as mixing parameters 1. Likewise, the effector parameters obtained after mixing audio 2 were recorded as mixing parameters 2. For the dry audio, the same effector chain can then be run twice, once with mixing parameters 1 and once with mixing parameters 2, yielding two different wet audios: one is the result of mixing the dry audio with the parameters of audio 1 (the wet audio corresponding to audio 1), and the other is the result of mixing it with the parameters of audio 2 (the wet audio corresponding to audio 2).
Step S204, combining the M wet audios according to the timbre similarity between each target audio and the dry audio, so as to obtain the mixed audio corresponding to the dry audio.
Optionally, audio processing software is used to align the audio tracks of the M wet audios on a time axis, the volume of each wet audio's track is adjusted based on the timbre similarity between its corresponding target audio and the dry audio, and all the adjusted tracks are stacked together to obtain the track of the final mixed audio.
In some possible implementations, the timbre similarity between the target audio corresponding to the wet audio and the dry audio may be positively correlated with the volume of the audio track of the wet audio. For example, the target audio corresponding to the wet audio is audio 1, and the greater the tone similarity between audio 1 and the dry audio, the greater the volume of the wet audio after adjustment.
In the above embodiment, by acquiring the timbre characteristics of the dry audio, the timbre similarity between the dry audio and each original audio can be determined from the timbre characteristics of both, M target audios can be selected from the N original audios according to that similarity, the mixing parameters corresponding to each target audio can be applied to the dry audio to obtain M wet audios, and the M wet audios can then be combined based on the timbre similarity to obtain the mixed audio of the dry audio. With this method, only N original audios need to be prepared in advance, with the mixing parameters of each original audio recorded after it is mixed; the mixing parameters of the original audios whose timbre is most similar to the dry audio can then be applied to the mixing of the dry audio. There is no need to repeatedly listen to the dry audio to debug the mixing parameters, so mixing efficiency is improved, mixing time is shortened, and more mixing demands can be met.
In one embodiment, as shown in fig. 4, combining the M wet audios according to the timbre similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio includes:
step S401, for each target audio, determining a weight value corresponding to the target audio according to the tone similarity between the target audio and the dry audio.
The tone similarity between the target audio and the dry audio may be cosine similarity calculated based on tone characteristics of the target audio and tone characteristics of the dry audio.
In some possible implementations, the timbre similarity between the target audio and the dry audio may be directly used as the weight value corresponding to the target audio.
In some possible implementations, the timbre similarity corresponding to the target audio may also be mapped to a weight value within a preset interval. For example, the timbre similarity between the target audio and the dry audio lies in the range [-1, 1]; it can be mapped into the interval [0, 1], and the mapped similarity value is used as the weight value.
Illustratively, the mapping relationship between the timbre similarity and the weight value may be established according to the following formula (1) and formula (2):

weight(k) = similarity(k) / Σ_{j=1}^{M} similarity(j)    formula (1)

Σ_{k=1}^{M} weight(k) = 1    formula (2)

In the above formulas, weight(k) represents the weight value corresponding to the k-th target audio, and similarity(k) represents the timbre similarity (after mapping into [0, 1]) between the k-th target audio and the dry audio.
Step S402, performing weighted combination on the M wet audios according to the weight value of each target audio, to obtain the mixed audio corresponding to the dry audio.
Optionally, the M wet audios are input into audio processing software, the tracks of all the wet audios are aligned on a time axis, and the tracks of the wet audios are superposed together according to the weight value of each corresponding target audio, so as to obtain the track of the mixed audio.
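A compact sketch of steps S401–S402 follows. It assumes the wet audios are already time-aligned, equal-length sample lists; the similarity-to-weight mapping shown (shift [-1, 1] into [0, 1], then normalize so the weights sum to 1) is one plausible reading of formulas (1) and (2), not a definitive one.

```python
# Sketch: map timbre similarities to normalized weights, then do a
# per-sample weighted sum of the aligned wet tracks.

def similarity_to_weights(similarities):
    """Shift cosine similarities from [-1,1] into [0,1] and normalize."""
    shifted = [(s + 1.0) / 2.0 for s in similarities]
    total = sum(shifted)
    return [s / total for s in shifted]

def combine_wet_audios(wet_audios, weights):
    """Per-sample weighted sum of the aligned wet tracks."""
    return [
        sum(w * track[i] for w, track in zip(weights, wet_audios))
        for i in range(len(wet_audios[0]))
    ]

wet_audios = [[0.2, 0.4], [0.6, -0.2]]          # two aligned wet tracks
weights = similarity_to_weights([0.8, 0.2])     # timbre similarities
mixed = combine_wet_audios(wet_audios, weights)
```

A wet audio whose target audio sounds more like the dry audio contributes more to the mix, matching the positive correlation described above.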
In one embodiment, the mixing processing of the dry audio according to the mixing parameters corresponding to the target audio includes:
Adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio; and mixing the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
Wherein the mixing parameters may include a reverberation time.
The inventors have found that, when mixing the dry audio with the mixing parameters of a target audio, the influence of audio speed on the mixing needs to be considered. The audio speed of the target audio may differ from that of the dry audio, and some mixing parameters of the target audio were determined based on the target audio's speed; using those parameters directly on the dry audio may therefore degrade the mixing effect.
Therefore, the inventors think that parameters affected by the audio speed, such as the reverberation time, in the mixing parameters can be adjusted according to the speed difference between the target audio and the dry audio, so that the adjusted parameters are more suitable for the audio speed of the dry audio. For example, for fast songs the reverberation time needs to be suitably shortened, and for slow songs the reverberation time needs to be suitably lengthened.
In some possible implementations, adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio may include shortening the reverberation time if the audio speed of the dry audio is greater than or equal to a preset speed threshold, and increasing the reverberation time if the audio speed of the dry audio is less than the speed threshold. The speed threshold may be determined from the audio speed of the target audio.
In the above embodiment, the speed threshold may be determined according to the audio speed of the target audio, and the mixing parameters such as the reverberation time may be adjusted according to the relation between the audio speed of the dry audio and the speed threshold, so that the adjusted parameters are more suitable for the dry audio, and thus the mixing effect may be enhanced.
In one embodiment, the mixing processing of the dry audio according to the mixing parameters corresponding to the target audio includes:
Adjusting the loudness-related parameters in the mixing parameters according to the track loudness of the dry audio; and mixing the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
Wherein the mixing parameters may include a compressor threshold.
The inventors have also found that, when the mixing parameters of the target audio are used to mix the dry audio, the influence of track loudness on the mixing must be considered. The track loudness of the target audio may differ from that of the dry audio, and some mixing parameters of the target audio (such as parameters of effectors like the reverberator and the multi-segment compressor) were determined based on the target audio's track loudness; applying those parameters directly to the dry audio may degrade the mixing effect.
Therefore, the inventor thinks that parameters affected by the track loudness, such as a compressor threshold value, in the mixing parameters can be adjusted according to the track loudness difference between the target audio and the dry audio, so that the adjusted parameters are more suitable for the track loudness of the dry audio. For example, for low level tracks, the compressor threshold needs to be appropriately lowered, and for high level tracks, the compressor threshold needs to be appropriately increased.
In some possible implementations, adjusting the loudness-related parameters in the mixing parameters according to the loudness of the dry audio may include increasing the compressor threshold if the track loudness of the dry audio is greater than or equal to a preset loudness threshold, and decreasing the compressor threshold if the track loudness of the dry audio is less than the loudness threshold. The loudness threshold may be determined from the track loudness of the target audio.
In the above embodiment, the loudness threshold may be determined according to the track loudness of the target audio, and the mixing parameters such as the compressor threshold may be adjusted according to the relation between the track loudness of the dry audio and the loudness threshold, so that the adjusted parameters are more suitable for the dry audio, and thus the mixing effect may be enhanced.
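The loudness-related adjustment is a direct companion to the speed-related one. In this sketch the loudness threshold is taken from the target audio's track loudness; the 3 dB step is an illustrative assumption.

```python
# Sketch: raise or lower the recorded compressor threshold depending on
# whether the dry audio is louder or quieter than the loudness threshold
# derived from the target audio. The 3 dB step is illustrative.

def adjust_comp_threshold(comp_threshold_db, dry_lufs, target_lufs):
    loudness_threshold = target_lufs  # threshold from the target audio
    if dry_lufs >= loudness_threshold:
        return comp_threshold_db + 3.0  # hotter track: raise threshold
    return comp_threshold_db - 3.0      # quieter track: lower threshold

hot = adjust_comp_threshold(-18.0, dry_lufs=-10.0, target_lufs=-14.0)
quiet = adjust_comp_threshold(-18.0, dry_lufs=-20.0, target_lufs=-14.0)
```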
In one embodiment, the mixing parameters include effector parameters, as shown in fig. 5, and the mixing parameters corresponding to the original audio are obtained by:
In step S501, the original audio is tuned through preset effectors.
The effectors include one or more of a pre-compressor, a parametric equalizer, a dynamic equalizer, a multi-segment exciter, a post-compressor, a multi-segment compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
Optionally, the original audio is input into the audio processing software, the selected effector is hung on an audio track of the original audio, the mixing effect is evaluated through a manual listening test and the like, and parameters of the effector can be continuously adjusted according to the mixing effect.
In some possible implementations, all the effectors needed for mixing and the order in which they process the audio may be predefined. As shown in fig. 6, the effectors may be hung on the audio track of the original audio and tuned according to an effector chain consisting of a pre-compressor, parameter equalizer, dynamic equalizer, multi-segment exciter, post-compressor, multi-segment compressor, stereo enhancer, delay, reverberator and clipper.
In some possible implementations, the accompaniment matched to the original audio may also be determined, the original audio and the accompaniment may be track-aligned in the audio processing software, listening may be performed with the original audio and the accompaniment played together, and the effector parameters may be adjusted according to the listening effect.
Step S502, if the adjustment result meets the preset condition, recording the effector parameter corresponding to the adjustment result as the mixing parameter corresponding to the original audio.
Optionally, if the listening effect of the original audio after tuning meets the requirement, the effector parameters used in the tuning can be recorded. For N original audios, N sets of parameters may be obtained.
In some possible implementations, if the tuning result meets the preset condition, the audio speed and the track loudness of the original audio under that condition may also be recorded. Recording the speed and loudness at which the original audio's mixing parameters were calibrated makes it possible, for effector parameters related to speed or loudness, to flexibly adjust those parameters based on the recorded speed and loudness of the original audio and the speed and loudness of the dry audio to be mixed, so that the adjusted parameters are better suited to mixing the dry audio.
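The parameter bank built in steps S501–S502 might be organized as follows. This is a sketch only; all field names ("audio_speed_bpm", "track_loudness_lufs", etc.) are hypothetical.

```python
# Sketch: once tuning of an original audio passes the listening check,
# store its effector parameters together with the audio's speed and
# loudness, so speed/loudness-dependent parameters can be re-scaled
# later for the dry audio.

preset_bank = []

def record_preset(audio_id, effector_params, bpm, lufs, passed_check):
    if not passed_check:      # only keep tunings that met the condition
        return False
    preset_bank.append({
        "audio_id": audio_id,
        "effector_params": effector_params,
        "audio_speed_bpm": bpm,          # for later speed adjustment
        "track_loudness_lufs": lufs,     # for later loudness adjustment
    })
    return True

record_preset("orig_1", {"reverb_time": 1.8, "comp_threshold": -18.0},
              bpm=120, lufs=-14.0, passed_check=True)
record_preset("orig_2", {"reverb_time": 2.2, "comp_threshold": -20.0},
              bpm=95, lufs=-16.0, passed_check=False)
```

Only presets that satisfied the preset condition end up in the bank, matching step S502.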
In one embodiment, the audio processing method may further include:
Acquiring energy characteristics of the mixed audio and masking characteristics of the accompaniment audio corresponding to the mixed audio; and adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy characteristics and the masking characteristics, so as to obtain volume-balanced mixed audio.
The energy characteristics can be calculated according to a second psychoacoustic model in the standard ISO/IEC 11172-3. The masking characteristics can be calculated according to a masking threshold curve of each frame of audio in the accompaniment audio, and the masking threshold curve can also be calculated according to a second psychoacoustic model in standard ISO/IEC 11172-3. The masking threshold may refer to a value that measures the masking ability of sound a to sound B in units of frequency bands, and may be calculated from a psychoacoustic model.
Optionally, the energy characteristic of each audio frame in the mixed audio and the masking characteristic of each audio frame in the accompaniment audio are obtained; the sum of the per-frame energy characteristics of the mixed audio and the sum of the per-frame masking characteristics of the accompaniment audio are calculated; the optimal volume ratio between the mixed audio and the accompaniment audio is determined from these two sums in combination with a preset mapping relation; and the mixed audio and the accompaniment audio are superposed according to that ratio, giving the final volume-balanced mixed audio. The preset mapping relation can be obtained from a number of audio samples, each of which has a well-balanced volume ratio: by analyzing, for each sample, the sum of the vocal's energy characteristics and the sum of the accompaniment's masking characteristics, a mapping from these two quantities to the volume ratio can be established.
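The balancing step can be outlined in a few lines. This is a rough sketch: the linear mapping used here is a placeholder for the preset mapping relation learned from well-balanced samples, and the frame features would in practice come from the psychoacoustic model mentioned above.

```python
# Sketch: sum the per-frame energy features of the mixed vocal and the
# per-frame masking features of the accompaniment, then derive a volume
# ratio. The mapping below is an illustrative placeholder.

def balance_volume(vocal_energy_frames, accomp_masking_frames):
    energy_sum = sum(vocal_energy_frames)
    masking_sum = sum(accomp_masking_frames)
    # placeholder mapping: the ratio grows as the accompaniment masks more
    return masking_sum / (energy_sum + masking_sum)

ratio = balance_volume([0.4, 0.6, 0.5], [0.3, 0.2, 0.5])
```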
In one embodiment, the audio processing method may further include:
Acquiring the track loudness of the mixed audio; and if the track loudness does not fall into a preset loudness interval, adjusting the mixed audio through a preset effector so that the track loudness of the adjusted mixed audio falls into the loudness interval.
Wherein the loudness interval may be determined based on audio release criteria.
For example, some song release standards require the track loudness to be within [-12 LUFS, -8 LUFS]. In this case, a Maximizer effector can be used to raise the track loudness to -9 LUFS, with the two Maximizer parameters set as follows:

Threshold_Maximizer = LUFS_mixed - (-9); Ceiling_Maximizer = -0.2.
in this way, the loudness of mixed audio can be increased to meet the requirements of the audio release standard without distorting the audio waveform.
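The two formulas above translate directly into code. This sketch assumes the [-12, -8] LUFS interval and -9 LUFS target from the example; the function name and return shape are illustrative.

```python
# Sketch: if the mixed audio's loudness misses the release interval,
# compute the Maximizer's two parameters from the formulas in the text:
# Threshold = LUFS_mixed - (-9), Ceiling = -0.2 dB.

TARGET_LUFS = -9.0
LOUDNESS_INTERVAL = (-12.0, -8.0)

def maximizer_params(mixed_lufs):
    low, high = LOUDNESS_INTERVAL
    if low <= mixed_lufs <= high:
        return None                             # already compliant
    return {
        "threshold": mixed_lufs - TARGET_LUFS,  # LUFS_mixed - (-9)
        "ceiling": -0.2,
    }

params = maximizer_params(-16.0)  # too quiet: needs maximizing
```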
In one embodiment, the audio processing method may further include:
And adding new original audio, and performing audio mixing processing on the new original audio by means of adjusting parameters of an effector and the like to obtain audio mixing parameters corresponding to the new original audio.
The new original audio may be sung by a person or generated by a new AI model.
In the above embodiment, by adding the original audio, the range of parameters used for the dry audio mixing can be enlarged, so that the overall mixing effect can be enhanced.
Exemplary Medium
Having described the method of the exemplary embodiments of the present disclosure, next, a storage medium of the exemplary embodiments of the present disclosure will be described with reference to fig. 7.
Referring to fig. 7, a storage medium 70, in which a program product for implementing the above-described method according to an embodiment of the present disclosure is stored, may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the context of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
Exemplary apparatus
Having described the media of the exemplary embodiments of the present disclosure, the audio processing device of the exemplary embodiments of the present disclosure will be described with reference to fig. 8, so as to implement the method in any of the foregoing method embodiments, and the implementation principle and technical effect are similar, and are not repeated herein.
As shown in fig. 8, the audio processing apparatus 800 may include:
an acquisition module 801 is configured to acquire a tone characteristic of the dry audio.
The similarity determining module 802 is configured to determine, for each of N preset original audios, a timbre similarity between the dry audio and the original audio according to timbre features of the original audio and timbre features of the dry audio, and select M original audios from the N original audios according to the timbre similarity as target audios, where N and M are positive integers, where N is greater than or equal to M, and the original audios correspond to preset mixing parameters.
The wet audio processing module 803 is configured to, for each of the M target audios, mix the dry audio according to the mixing parameters corresponding to the target audio, so as to obtain the wet audio corresponding to the target audio.
The mixing combination module 804 is configured to combine the M wet audios according to the timbre similarity between each target audio and the dry audio, so as to obtain the mixed audio corresponding to the dry audio.
In yet another embodiment of the present disclosure, the mixing and combining module 804 may include:
the weight determining unit may be configured to determine, for each target audio, a weight value corresponding to the target audio according to a timbre similarity between the target audio and the dry audio.
The wet audio combination unit may be configured to perform weighted combination on the M wet audios according to the weight value of each target audio, to obtain the mixed audio corresponding to the dry audio.
In yet another embodiment of the present disclosure, the weight determining unit is further configured to map a timbre similarity corresponding to the target audio to a weight value in a preset interval.
In yet another embodiment of the present disclosure, the wet audio processing module 803 is further configured to adjust speed-related parameters in the mixing parameters according to the audio speed of the dry audio, and/or adjust loudness-related parameters in the mixing parameters according to the loudness of the dry audio, and to mix the dry audio with the adjusted mixing parameters to obtain the corresponding wet audio.
In yet another embodiment of the present disclosure, the wet audio processing module 803 is further configured to shorten the reverberation time if the audio speed of the dry audio is greater than or equal to a preset speed threshold, and increase the reverberation time if the audio speed of the dry audio is less than the speed threshold.
In yet another embodiment of the present disclosure, the wet audio processing module 803 is further configured to increase the compressor threshold if the loudness of the dry audio is greater than or equal to a preset loudness threshold, and decrease the compressor threshold if the loudness of the dry audio is less than the loudness threshold.
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
The tuning module is configured to tune the original audio through preset effectors, where the effectors include one or more of a pre-compressor, a parameter equalizer, a dynamic equalizer, a multi-segment exciter, a post-compressor, a multi-segment compressor, a stereo enhancer, a delay, a reverberator, and a volume adjuster.
The recording module is configured to record, when the tuning result meets a preset condition, the effector parameters corresponding to the tuning result as the mixing parameters corresponding to the original audio.
In yet another embodiment of the present disclosure, the original audio is generated by a pre-set timbre model trained based on pre-labeled dry sound data.
In yet another embodiment of the present disclosure, the obtaining module 801 is further configured to input a dry audio to a pre-trained deep learning model, so that the deep learning model outputs a tone characteristic corresponding to the dry audio, where the deep learning model is trained according to a pre-labeled audio sample, and the audio sample is a mel spectrum generated according to a preset audio.
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
and the characteristic acquisition module is used for acquiring the energy characteristic of the mixed audio and the masking characteristic of the accompaniment audio corresponding to the mixed audio.
And the volume balancing module is used for adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy characteristics and the masking characteristics so as to obtain the mixed audio after volume balancing.
In yet another embodiment of the present disclosure, the audio processing apparatus 800 may further include:
And the loudness acquisition module is used for acquiring the track loudness of the mixed audio.
And the loudness adjusting module is used for adjusting the mixed audio through a preset effector under the condition that the track loudness does not fall into a preset loudness interval, so that the track loudness of the adjusted mixed audio falls into the loudness interval.
Exemplary computing device
Having described the methods, media, and apparatus of exemplary embodiments of the present disclosure, a computing device of exemplary embodiments of the present disclosure is next described with reference to fig. 9.
The computing device 90 shown in fig. 9 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, the computing device 90 is in the form of a general purpose computing device. Components of computing device 90 may include, but are not limited to, at least one processing unit 901, at least one storage unit 902, and a bus 903 that connects the different system components, including processing unit 901 and storage unit 902. Wherein at least one memory unit 902 has stored therein computer-executable instructions, and at least one processing unit 901 comprises a processor that executes the computer-executable instructions to implement the methods described above.
Bus 903 includes a data bus, a control bus, and an address bus.
The storage unit 902 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 9021 and/or cache memory 9022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 9023.
The storage unit 902 may also include a program/utility 9025 having a set (at least one) of program modules 9024, such program modules 9024 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 90 may also communicate with one or more external devices 904 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 905. Moreover, the computing device 90 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through the network adapter 906. As shown in fig. 9, the network adapter 906 communicates with other modules of the computing device 90 over the bus 903. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computing device 90, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of an audio processing device are mentioned, such a division is only exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that this disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined; that division is made for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (15)

1.一种音频处理方法,其特征在于,包括:1. An audio processing method, characterized in that it includes: 获取干声音频的音色特征;To obtain the timbre characteristics of dry audio; 针对预设的N个原始音频中的每个原始音频,根据所述原始音频的音色特征与所述干声音频的音色特征,确定所述干声音频与所述原始音频之间的音色相似度,并根据音色相似度从N个原始音频中选取M个原始音频为目标音频,N和M均为正整数,且N大于或等于M,所述原始音频对应有预设的混音参数;For each of the N preset original audio files, the timbre similarity between the dry audio file and the original audio file is determined based on the timbre characteristics of the original audio file and the timbre characteristics of the dry audio file. Based on the timbre similarity, M original audio files are selected as target audio files from the N original audio files. N and M are both positive integers, and N is greater than or equal to M. The original audio files correspond to preset mixing parameters. 针对M个目标音频中的每个目标音频,根据所述目标音频对应的混音参数对所述干声音频进行混音处理,得到所述目标音频对应的湿声音频;For each of the M target audios, the dry audio is mixed according to the mixing parameters corresponding to the target audio to obtain the wet audio corresponding to the target audio. 根据每个目标音频与所述干声音频之间的音色相似度,对M个湿声音频进行组合,得到所述干声音频对应的混音音频。Based on the timbre similarity between each target audio and the dry audio, the M wet audios are combined to obtain the mixed audio corresponding to the dry audio. 2.根据权利要求1所述的音频处理方法,其特征在于,所述根据每个目标音频与所述干声音频之间的音色相似度,对M个湿声音频进行组合,得到所述干声音频对应的混音音频,包括:2. The audio processing method according to claim 1, characterized in that, the step of combining M wet audio files based on the timbre similarity between each target audio file and the dry audio file to obtain the mixed audio file corresponding to the dry audio file includes: 针对每个目标音频,根据所述目标音频与所述干声音频之间的音色相似度,确定所述目标音频对应的权重值;For each target audio, a weight value corresponding to the target audio is determined based on the timbre similarity between the target audio and the dry audio. 
按照每个目标音频的权重值对M个湿声音频进行加权组合,得到所述干声音频对应的混音音频。The M wet audio samples are weighted and combined according to the weight value of each target audio sample to obtain the mixed audio corresponding to the dry audio sample. 3.根据权利要求2所述的音频处理方法,其特征在于,所述根据所述目标音频与所述干声音频之间的音色相似度,确定所述目标音频对应的权重值,包括:3. The audio processing method according to claim 2, characterized in that, determining the weight value corresponding to the target audio based on the timbre similarity between the target audio and the dry audio includes: 将所述目标音频对应的音色相似度映射为预设区间内的权重值。The timbre similarity of the target audio is mapped to a weight value within a preset range. 4.根据权利要求1至3任一项所述的音频处理方法,其特征在于,所述根据所述目标音频对应的混音参数对所述干声音频进行混音处理,包括:4. The audio processing method according to any one of claims 1 to 3, characterized in that, the step of mixing the dry audio according to the mixing parameters corresponding to the target audio includes: 根据所述干声音频的音频速度调整所述混音参数中与速度相关的参数,和/或,根据所述干声音频的响度调整所述混音参数中与响度相关的参数;Adjust the speed-related parameters in the mixing parameters according to the audio speed of the dry audio, and/or adjust the loudness-related parameters in the mixing parameters according to the loudness of the dry audio; 通过调整后的混音参数对所述干声音频进行混音处理,得到对应的湿声音频。The dry audio is mixed using the adjusted mixing parameters to obtain the corresponding wet audio. 5.根据权利要求4所述的音频处理方法,其特征在于,所述混音参数包括混响时间,所述根据所述干声音频的音频速度调整所述混音参数中与速度相关的参数,包括:5. The audio processing method according to claim 4, characterized in that the mixing parameters include reverberation time, and the step of adjusting the speed-related parameters in the mixing parameters according to the audio speed of the dry audio includes: 若所述干声音频的音频速度大于或等于预设的速度阈值,缩短所述混响时间;If the audio velocity of the dry audio is greater than or equal to a preset velocity threshold, the reverberation time is shortened; 若所述干声音频的音频速度小于所述速度阈值,增加所述混响时间。If the audio velocity of the dry audio is less than the velocity threshold, the reverberation time is increased. 
6.根据权利要求4所述的音频处理方法,其特征在于,所述混音参数包括压缩器阈值,所述根据所述干声音频的响度调整所述混音参数中与响度相关的参数,包括:6. The audio processing method according to claim 4, characterized in that the mixing parameters include a compressor threshold, and the step of adjusting the loudness-related parameters in the mixing parameters according to the loudness of the dry audio includes: 若所述干声音频的响度大于或等于预设的响度阈值,增加所述压缩器阈值;If the loudness of the dry audio is greater than or equal to a preset loudness threshold, the compressor threshold is increased; 若所述干声音频的响度小于所述响度阈值,降低所述压缩器阈值。If the loudness of the dry audio is less than the loudness threshold, the compressor threshold is reduced. 7.根据权利要求1至3任一项所述的音频处理方法,其特征在于,所述混音参数包括效果器参数,所述原始音频对应的混音参数通过以下方式得到:7. The audio processing method according to any one of claims 1 to 3, characterized in that the mixing parameters include effects parameters, and the mixing parameters corresponding to the original audio are obtained in the following manner: 通过预设的效果器对所述原始音频进行调校,所述效果器包括前置压缩器、参数均衡器、动态均衡器、多段激励器、后置压缩器、多段压缩器、立体声增强器、延迟器、混响器和音量调节器中的一项或多项;The original audio is tuned using preset effects, which include one or more of the following: pre-compressor, parametric equalizer, dynamic equalizer, multi-band exciter, post-compressor, multi-band compressor, stereo enhancer, delay, reverb, and volume adjuster. 若调校结果满足预设条件,记录所述调校结果对应的效果器参数为所述原始音频对应的混音参数。If the tuning result meets the preset conditions, the effect parameters corresponding to the tuning result are recorded as the mixing parameters corresponding to the original audio. 8.根据权利要求1至3任一项所述的音频处理方法,其特征在于,所述原始音频是通过预设的音色模型生成的,所述音色模型基于预先标注的干声数据训练得到。8. The audio processing method according to any one of claims 1 to 3, wherein the original audio is generated by a preset timbre model, and the timbre model is trained based on pre-labeled dry audio data. 9.根据权利要求1至3任一项所述的音频处理方法,其特征在于,所述获取干声音频的音色特征,包括:9. 
The audio processing method according to any one of claims 1 to 3, characterized in that obtaining the timbre features of the dry audio comprises: inputting the dry audio into a pre-trained deep learning model so that the deep learning model outputs the timbre features corresponding to the dry audio, the deep learning model being trained on pre-labeled audio samples, the audio samples being Mel spectrograms generated from preset audio.

10. The audio processing method according to any one of claims 1 to 3, characterized in that it further comprises: obtaining the energy features of the mixed audio and the masking features of the accompaniment audio corresponding to the mixed audio; and adjusting the volume ratio of the mixed audio to the accompaniment audio according to the energy features and the masking features to obtain a volume-balanced mixed audio.

11. The audio processing method according to claim 10, characterized in that it further comprises: obtaining the track loudness of the mixed audio; and if the track loudness does not fall within a preset loudness interval, adjusting the mixed audio with a preset effect unit so that the track loudness of the adjusted mixed audio falls within the loudness interval.

12. An audio processing apparatus, characterized in that it comprises: an acquisition module, configured to obtain the timbre features of dry audio;
a similarity determination module, configured to, for each of N preset original audio signals, determine the timbre similarity between the dry audio and the original audio according to the timbre features of the original audio and the timbre features of the dry audio, and to select M original audio signals from the N original audio signals as target audio according to the timbre similarity, where N and M are both positive integers, N is greater than or equal to M, and each original audio corresponds to preset mixing parameters;

a wet-audio processing module, configured to, for each of the M target audio signals, mix the dry audio according to the mixing parameters corresponding to the target audio to obtain the wet audio corresponding to the target audio; and

a mixing combination module, configured to combine the M wet audio signals according to the timbre similarity between each target audio and the dry audio to obtain the mixed audio corresponding to the dry audio.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 11.

14.
A computing device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the computing device to perform the method according to any one of claims 1 to 11.

15. A computer program product comprising a computer program, characterized in that, when executed by a processor, the computer program implements the method according to any one of claims 1 to 11.
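The selection step recited in claims 1 and 12 — computing a timbre similarity between the dry audio and each of N preset original audio signals, then keeping the M most similar as target audio — can be sketched as follows. Cosine similarity is assumed purely for illustration, since the claims do not fix a particular similarity measure, and the function names are hypothetical:

```python
import numpy as np

def timbre_similarity(feat_a, feat_b):
    """Cosine similarity between two timbre-feature vectors (one possible similarity measure)."""
    a = np.asarray(feat_a, dtype=float)
    b = np.asarray(feat_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_target_audio(dry_feat, original_feats, m):
    """Pick the M of N preset original audio entries most similar in timbre to the dry vocal.

    Returns the selected indices and their similarity scores, highest first.
    """
    sims = [timbre_similarity(dry_feat, f) for f in original_feats]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    top = order[:m]
    return top, [sims[i] for i in top]
```

The returned similarities would then feed the weight mapping of claim 3 and the weighted combination of claim 2.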
CN202411067965.2A 2024-08-06 2024-08-06 Audio processing method, device, medium, computing device and program product Active CN119068891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411067965.2A CN119068891B (en) 2024-08-06 2024-08-06 Audio processing method, device, medium, computing device and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411067965.2A CN119068891B (en) 2024-08-06 2024-08-06 Audio processing method, device, medium, computing device and program product

Publications (2)

Publication Number Publication Date
CN119068891A CN119068891A (en) 2024-12-03
CN119068891B true CN119068891B (en) 2025-10-31

Family

ID=93640558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411067965.2A Active CN119068891B (en) 2024-08-06 2024-08-06 Audio processing method, device, medium, computing device and program product

Country Status (1)

Country Link
CN (1) CN119068891B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119626186B (en) * 2024-12-17 2025-11-14 腾讯音乐娱乐科技(深圳)有限公司 Song recording methods, electronic devices and computer-readable storage media

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109920446A (en) * 2019-03-12 2019-06-21 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio data processing method, device and computer storage medium
CN113870873A (en) * 2021-09-14 2021-12-31 杭州网易云音乐科技有限公司 Intelligent tuning method, device, medium and computing device based on tone color

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP4156252B2 (en) * 2002-03-06 2008-09-24 大日本印刷株式会社 Method for encoding an acoustic signal
EP4379715A3 (en) * 2013-09-12 2024-08-21 Dolby Laboratories Licensing Corporation Loudness adjustment for downmixed audio content
CN109785820B (en) * 2019-03-01 2022-12-27 腾讯音乐娱乐科技(深圳)有限公司 Processing method, device and equipment
WO2024081957A1 (en) * 2022-10-14 2024-04-18 Virtuel Works Llc Binaural externalization processing
CN117153131A (en) * 2023-09-19 2023-12-01 广州酷狗计算机科技有限公司 Sound mixing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN119068891A (en) 2024-12-03

Similar Documents

Publication Publication Date Title
Manilow et al. Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity
CN103597543B (en) Semantic Track Mixer
CN103189915B (en) Decomposition of music signals using basis functions with time-evolution information
CN105612510B (en) Systems and methods for performing automated audio production using semantic data
WO2021229197A1 (en) Time-varying and nonlinear audio processing using deep neural networks
CN110211556B (en) Music file processing method, device, terminal and storage medium
WO2015092492A1 (en) Audio information processing
Taenzer et al. Investigating CNN-based Instrument Family Recognition for Western Classical Music Recordings.
CN115866487B (en) Sound power amplification method and system based on balanced amplification
CN119068891B (en) Audio processing method, device, medium, computing device and program product
US20230057082A1 (en) Electronic device, method and computer program
Rocchesso et al. Bandwidth of perceived inharmonicity for physical modeling of dispersive strings
US20230186782A1 (en) Electronic device, method and computer program
Ziemer et al. Using psychoacoustic models for sound analysis in music
Itoyama et al. Integration and adaptation of harmonic and inharmonic models for separating polyphonic musical signals
d'Escriván Music technology
Hinrichs et al. Convolutional neural networks for the classification of guitar effects and extraction of the parameter settings of single and multi-guitar effects from instrument mixes
Mu et al. A timbre matching approach to enhance audio quality of psychoacoustic bass enhancement system
Kunekar et al. Audio feature extraction: Foreground and background audio separation using knn algorithm
CN116959478A (en) Sound source separation method, device, equipment and storage medium
Shankar et al. Disentangling overlapping sources: Improving vocal and violin source separation in carnatic music
US20250191560A1 (en) Playback device and playback system
CN119132284B (en) Audio recognition methods, devices, media and computing equipment
Zhang et al. Mixing or extracting? Further exploring necessity of music separation for singer identification
Özkeleş et al. Comparison of analog processors and digital signal processors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant