
CN113963707A - Audio processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113963707A
Authority
CN
China
Prior art keywords
song
audio frame
audio
target
gain
Prior art date
Legal status
Pending
Application number
CN202111196144.5A
Other languages
Chinese (zh)
Inventor
闫震海
林慧镔
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111196144.5A
Publication of CN113963707A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, characterised by the type of extracted parameters
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/0272: Voice signal separating
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract



The present application discloses an audio processing method, apparatus, device, and storage medium, belonging to the field of computer technology. The method includes: inputting multiple song audio frames of a target song into a song element extraction model to obtain, for each song audio frame, an initial audio frame of a first type element; performing gain processing on the initial audio frame with different gain coefficients to obtain gain-processed initial audio frames; determining the difference audio frame between the song audio frame and each gain-processed initial audio frame, and determining the loudness value of each difference audio frame; determining a target gain coefficient among the different gain coefficients, and thereby the target audio frame of the first type element; and composing the target audio frames of the first type element of all frames into the audio segment of the first type element corresponding to the target song. With the present application, target audio frames of the first type element whose loudness values are closer to the actual loudness values can be obtained, so that an audio segment of the first type element with better audio quality can be obtained.


Description

Audio processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method, apparatus, device, and storage medium.
Background
With the development of science and technology, song audio that contains both human voice and accompaniment can be processed so that the voice and the accompaniment are separated, yielding the human voice audio and the accompaniment audio corresponding to the song. Some music applications use the human voice audio and the accompaniment audio separated from song audio to provide users with richer and more varied music entertainment modes. For example, in a karaoke application, a user can choose between an original-singing mode and an accompaniment mode: the original-singing mode plays the song audio containing both the human voice audio and the accompaniment audio, while the accompaniment mode plays only the accompaniment audio. In a music-listening application, a user can choose by operation to play only the human voice audio or only the accompaniment audio, and so on.
The traditional way of separating the human voice audio or the accompaniment audio from a song is to use two different machine learning models to extract the human voice audio and the accompaniment audio respectively.
However, the loudness value of the human voice or accompaniment audio extracted by such a machine learning model deviates from the actual loudness value of the human voice or the accompaniment in the song audio, and the deviation differs from one human voice audio frame to the next in the human voice audio; the same holds for each accompaniment audio frame in the accompaniment audio. As a result, the audio quality of the extracted human voice audio or accompaniment audio is poor.
Disclosure of Invention
The embodiment of the application provides an audio processing method, which can solve the technical problem that the audio quality of human voice audio or accompaniment audio extracted by a song element extraction model in the prior art is poor.
In a first aspect, an audio processing method is provided, the method including:
inputting a plurality of song audio frames of a target song into a trained song element extraction model to obtain an initial audio frame of a first type of element corresponding to the song audio frame output by the song element extraction model, wherein the first type of element is a voice or an accompaniment;
respectively carrying out gain processing on the initial audio frame by using different gain coefficients to obtain gain-processed initial audio frames corresponding to the different gain coefficients;
respectively determining a difference audio frame of the song audio frame and each gain-processed initial audio frame, and determining the loudness value of the difference audio frame corresponding to each gain coefficient;
determining a target gain coefficient corresponding to the actual loudness value of the first class element in the song audio frame in the different gain coefficients based on the loudness value of the difference audio frame corresponding to each gain coefficient, and determining the initial audio frame after gain processing corresponding to the target gain coefficient as the target audio frame of the first class element corresponding to the song audio frame;
and forming the audio clip of the first-class element corresponding to the target song by using the target audio frame of the first-class element corresponding to each song audio frame.
In one possible implementation, the different gain coefficients are a plurality of gain coefficients distributed with equal difference values within a preset value range.
In one possible implementation manner, the determining a loudness value of the difference audio frame corresponding to each gain coefficient includes:
and determining the root mean square of the loudness values of all sampling points in the difference audio frame as the loudness value of the difference audio frame for the difference audio frame corresponding to each gain coefficient.
In one possible implementation, the determining, based on the loudness value of the difference audio frame corresponding to each gain coefficient, a target gain coefficient corresponding to an actual loudness value of a first type element in the song audio frame among the different gain coefficients includes:
and determining the gain coefficient corresponding to the minimum loudness value in the loudness values of the difference audio frames corresponding to the gain coefficient as a target gain coefficient corresponding to the actual loudness value of the first-class element in the song audio frame.
In one possible implementation manner, after determining the gain-processed initial audio frame corresponding to the target gain coefficient as the target audio frame of the first class element corresponding to the song audio frame, the method further includes:
determining the difference audio frame corresponding to the target gain coefficient as a target audio frame of a second type of element corresponding to the song audio frame, wherein the second type of element is human voice or accompaniment, and the second type of element is different from the first type of element;
and forming the target audio frames of the second type elements corresponding to the song audio frames into audio segments of the second type elements corresponding to the target songs.
In one possible implementation, the method further includes:
for each song audio frame, determining a target adjustment coefficient corresponding to the song audio frame based on a time interval between the song audio frame and a starting time point of the target song, wherein the target adjustment coefficient of the song audio frame is positively or negatively correlated with the time interval;
performing gain processing on the initial audio frame of the first type element corresponding to the song audio frame by using the target adjustment coefficient corresponding to the song audio frame and the target gain coefficient corresponding to the song audio frame to obtain an adjusted audio frame of the first type element corresponding to the song audio frame;
and determining, respectively, the difference audio frames between the plurality of song audio frames and the corresponding adjusted audio frames of the first type element, to form an adjusted audio clip corresponding to the target song.
In a second aspect, there is provided an audio processing method, the method comprising:
displaying a loudness adjustment interface corresponding to the target song, wherein a human voice loudness adjustment control and an accompaniment loudness adjustment control are arranged in the loudness adjustment interface;
acquiring a target voice adjusting coefficient input through the voice loudness adjusting control and a target accompaniment adjusting coefficient input through the accompaniment loudness adjusting control;
sending an adjustment request to a server, wherein the adjustment request carries identification information of the target song, the target voice adjustment coefficient and the target accompaniment adjustment coefficient;
and receiving the adjusted audio corresponding to the target song sent by the server.
In a third aspect, an audio processing method is provided, the method comprising:
receiving an adjustment request sent by a target terminal, wherein the adjustment request carries identification information of a target song, a target voice adjustment coefficient and a target accompaniment adjustment coefficient;
acquiring a multi-frame song audio frame of the target song based on the identification information of the target song;
determining a voice audio frame and a corresponding accompaniment audio frame corresponding to the multi-frame song audio frame;
respectively using the target voice adjusting coefficient to perform gain processing on the voice audio frame corresponding to each song audio frame to obtain a gain-processed voice audio frame corresponding to each song audio frame;
respectively performing gain processing on the accompaniment audio frames corresponding to each song audio frame by using the target accompaniment adjustment coefficients to obtain the accompaniment audio frames after the gain processing corresponding to each song audio frame;
the voice audio frame after the gain processing corresponding to each song audio frame and the accompaniment audio frame after the corresponding gain processing form the adjusting audio corresponding to the target song;
and sending the adjusted audio corresponding to the target song to a target terminal.
In a fourth aspect, an audio processing apparatus is provided, the apparatus comprising:
a first determining module, used for inputting a plurality of song audio frames of a target song into a trained song element extraction model to obtain an initial audio frame of a first type element corresponding to the song audio frame output by the song element extraction model, wherein the first type element is human voice or accompaniment;
the gain module is used for respectively carrying out gain processing on the initial audio frames by using different gain coefficients to obtain gain-processed initial audio frames corresponding to the different gain coefficients;
the second determining module is used for respectively determining the difference audio frame between the song audio frame and each gain-processed initial audio frame and determining the loudness value of the difference audio frame corresponding to each gain coefficient;
a third determining module, configured to determine, based on a loudness value of the difference audio frame corresponding to each gain coefficient, a target gain coefficient corresponding to an actual loudness value of the first type element in the song audio frame among the different gain coefficients, and determine, as a target audio frame of the first type element corresponding to the song audio frame, an initial audio frame after gain processing corresponding to the target gain coefficient;
and the composition module is used for composing the target audio frames of the first type elements corresponding to the audio frames of the songs into audio clips of the first type elements corresponding to the target songs.
In one possible implementation, the different gain coefficients are a plurality of gain coefficients distributed with equal difference values within a preset value range.
In a possible implementation manner, the second determining module is configured to:
and determining the root mean square of the loudness values of all sampling points in the difference audio frame as the loudness value of the difference audio frame for the difference audio frame corresponding to each gain coefficient.
In a possible implementation manner, the third determining module is configured to:
and determining the gain coefficient corresponding to the minimum loudness value in the loudness values of the difference audio frames corresponding to the gain coefficient as a target gain coefficient corresponding to the actual loudness value of the first-class element in the song audio frame.
In one possible implementation manner, the apparatus further includes a fourth determining module configured to:
determining the difference audio frame corresponding to the target gain coefficient as a target audio frame of a second type of element corresponding to the song audio frame, wherein the second type of element is human voice or accompaniment, and the second type of element is different from the first type of element;
and forming the target audio frames of the second type elements corresponding to the song audio frames into audio segments of the second type elements corresponding to the target songs.
In a possible implementation manner, the apparatus further includes a fifth determining module, configured to:
for each song audio frame, determining a target adjustment coefficient corresponding to the song audio frame based on a time interval between the song audio frame and a starting time point of the target song, wherein the target adjustment coefficient of the song audio frame is positively or negatively correlated with the time interval;
performing gain processing on the initial audio frame of the first type element corresponding to the song audio frame by using the target adjustment coefficient corresponding to the song audio frame and the target gain coefficient corresponding to the song audio frame to obtain an adjusted audio frame of the first type element corresponding to the song audio frame;
and determining, respectively, the difference audio frames between the plurality of song audio frames and the corresponding adjusted audio frames of the first type element, to form an adjusted audio clip corresponding to the target song.
The technical scheme provided by the embodiments of the application has the following beneficial effects: an initial audio frame of the first type element corresponding to a song audio frame can be extracted based on the song element extraction model; a target gain coefficient corresponding to the actual loudness value of the first type element in the song audio frame is then determined according to the loudness values of the difference audio frames obtained after gain processing with different gain coefficients; and, based on the target gain coefficient, a target audio frame of the first type element whose loudness value is closer to the actual loudness value is obtained, so that an audio segment of the first type element with better audio quality can be obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of an audio processing method provided in an embodiment of the present application;
fig. 2 is a flowchart of an audio processing method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a relationship between a gain factor and a loudness value of a difference audio frame according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for determining an adjusted audio clip according to an embodiment of the present application;
fig. 5 is a flowchart of an audio processing method provided in an embodiment of the present application;
fig. 6 is a flowchart of an audio processing method provided in an embodiment of the present application;
fig. 7 is a flowchart of an audio processing method provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of a terminal according to an embodiment of the present disclosure;
fig. 10 is a block diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiments of the application provide an audio processing method, which can be implemented by a computer device. The computer device may be a device that extracts the human voice audio or the accompaniment audio of the target song, for example, a background server of a music application program, or a user terminal on which music can be played. The computer device may be a terminal or a server, and the terminal may be a desktop computer, a notebook computer, a tablet computer, a mobile phone, or the like. The computer device may include a processor, a memory, a communication component, and so on.
The processor may be a Central Processing Unit (CPU), and the processor may be configured to determine an initial audio frame of the first type element corresponding to the song audio frame based on the song element extraction model, determine a loudness value of the difference audio frame corresponding to each gain coefficient, determine a target audio frame of the first type element corresponding to the song audio frame, and so on.
The memory may be various volatile memories or nonvolatile memories, such as a Solid State Disk (SSD), a Dynamic Random Access Memory (DRAM), and the like. The memory may be used for data storage, such as data storage of song audio frames, data storage of song element extraction models, data storage of initial audio frames of the first type elements corresponding to the determined song audio frames, data storage of difference audio frames corresponding to the determined different gain coefficients, data storage of the loudness values of the difference audio frames corresponding to each gain coefficient, data storage of target audio frames of the first type elements corresponding to the determined song audio frames, and so forth.
The communication means may be a wired network connector, a wireless fidelity (WiFi) module, a bluetooth module, a cellular network communication module, etc. The communication component may be configured to perform data transmission with other devices, for example, the communication component may be configured to send the determined audio piece of the first type element or the determined audio piece of the second type element to a specific device, and so on.
Fig. 1 and fig. 2 are flowcharts of an audio processing method according to an embodiment of the present application. Referring to fig. 1 and 2, the embodiment includes:
101. and inputting the multi-frame song audio frame of the target song into the trained song element extraction model to obtain an initial audio frame of the first type of element corresponding to the song audio frame output by the song element extraction model.
Wherein, the first type element is human voice or accompaniment.
In implementation, when certain song audio needs to be processed, the song audio to be processed may be referred to as the target song for convenience of description, and an audio frame included in the target song may be referred to as a song audio frame. The target song comprises human voice audio and accompaniment audio, and the human voice audio or the accompaniment audio in the target song can be extracted by using the song element extraction model to obtain the human voice audio or the accompaniment audio corresponding to the target song. The first type element in the embodiments of the application can be the human voice or the accompaniment, the audio of the first type element can be human voice audio or accompaniment audio, and the song element extraction model can correspondingly include a voice extraction model and an accompaniment extraction model. When the first type element is the human voice, the human voice audio in the target song can be extracted by using the voice extraction model to obtain the human voice audio corresponding to the target song. When the first type element is the accompaniment, the accompaniment audio in the target song can be extracted by using the accompaniment extraction model to obtain the accompaniment audio corresponding to the target song.
Optionally, there may be a plurality of methods for extracting the audio of the first type element in the target song through the song element extraction model, and one of the following methods is:
and inputting at least one song audio frame of the target song into the trained song element extraction model to obtain an initial audio frame of the first type of elements corresponding to each song audio frame in the at least one song audio frame.
In implementation, the song audio of the target song may be input into the trained song element extraction model, which processes the song audio frames and outputs processed audio frames. To distinguish them from other audio frames, the audio frames output by the song element extraction model may be referred to as initial audio frames, and the audio formed by the initial audio frames is the initial audio of the first type element corresponding to the target song audio.
Alternatively, a preset number of song audio frames may be input into the song element extraction model each time; that is, the song audio is divided into a plurality of input data according to the preset number of audio frames. If the number of song audio frames in the last input data is less than the preset number, silent audio frames may be used to pad it. Each input data is then input into the trained song element extraction model to obtain a plurality of corresponding output data, where each output data contains the initial audio frames of the first type element corresponding to the song audio frames in that input data. The initial audio frames in the last output data that correspond to the padded silent audio frames are deleted, and the remaining output data are the initial audio frames of the first type element corresponding to each song audio frame in the song audio.
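As an illustration, the batching just described can be sketched as follows, assuming the song audio has already been framed into a NumPy array and that the model consumes a fixed number of frames per input; the function name, array shapes, and the frames_per_input parameter are illustrative, not taken from the patent:

```python
import numpy as np

def split_into_model_inputs(song_frames: np.ndarray, frames_per_input: int):
    """Split song audio frames into fixed-size model inputs, padding the
    last input with silent (all-zero) audio frames when it falls short.

    song_frames: array of shape (num_frames, samples_per_frame).
    Returns a list of arrays of shape (frames_per_input, samples_per_frame).
    """
    inputs = []
    for start in range(0, len(song_frames), frames_per_input):
        chunk = song_frames[start:start + frames_per_input]
        if len(chunk) < frames_per_input:
            # Last input data: complete it with silent audio frames.
            pad = np.zeros((frames_per_input - len(chunk), song_frames.shape[1]),
                           dtype=song_frames.dtype)
            chunk = np.vstack([chunk, pad])
        inputs.append(chunk)
    return inputs
```

The initial audio frames produced for the padded silent frames would then be discarded, as described above.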
102. And respectively carrying out gain processing on the initial audio frame by using different gain coefficients to obtain the initial audio frame after the gain processing corresponding to the different gain coefficients.
In implementation, for each initial audio frame, different gain coefficients are used to perform gain processing on the amplitude of each time-domain sampling point in the initial audio frame, and the amplitude is multiplied by the gain coefficient to obtain the initial audio frame after gain processing corresponding to the different gain coefficients.
103. And respectively determining the difference audio frame of the song audio frame and each gain-processed initial audio frame, and determining the loudness value of the difference audio frame corresponding to each gain coefficient.
In the implementation, the amplitude of each time domain sampling point in the song audio frame is subtracted by the amplitude of the corresponding time domain sampling point in the gain-processed initial audio frame, so as to obtain a difference audio frame between the song audio frame and the gain-processed initial audio frame. For each different gain coefficient, the difference audio frame corresponding to each initial audio frame can be obtained in the above manner. If the song audio frame is denoted by Y, the initial audio frame by X, the difference audio frame by R, and the gain coefficient by a, the formula for the difference audio frame can be expressed as: R = Y - aX.
Because the initial audio frames are subjected to gain processing by using a plurality of different gain coefficients, a plurality of initial audio frames subjected to gain processing according to the gain coefficients can be obtained, and thus, difference audio frames corresponding to a plurality of different gain coefficients are obtained. Optionally, the plurality of different gain coefficients may be set to a plurality of increasing or decreasing values, so that the influence of the different gain coefficients on the loudness value of the difference audio frame can be obtained by comparison, and then the setting of the different gain coefficients may be: the different gain factors are a plurality of gain factors distributed with equal difference values within a preset value range.
Since the loudness value of the initial audio frame may deviate from the actual loudness value of the first type element in the song audio frame, a preset value range may be set in advance for the gain coefficient, and the gain coefficients may be a plurality of values distributed with equal differences within the preset value range. In this embodiment of the application, the gain coefficient may take a plurality of values uniformly distributed with a spacing of 0.01 within the preset range [0, 2], that is, the gain coefficient may take the values 0, 0.01, 0.02, 0.03, ..., 1.98, 1.99, 2. Of course, the preset value range of the gain coefficient may also be other ranges, which is not limited in this embodiment of the application.
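For illustration, the gain sweep and the difference-frame computation R = Y - aX can be sketched as follows; the 0.01 step and [0, 2] range follow the example above, and all names are illustrative:

```python
import numpy as np

# Candidate gain coefficients: equally spaced in [0, 2] with step 0.01,
# matching the example range above (201 values: 0, 0.01, ..., 1.99, 2).
gains = np.arange(0.0, 2.0 + 1e-9, 0.01)

def difference_frames(song_frame: np.ndarray, initial_frame: np.ndarray,
                      gains: np.ndarray) -> np.ndarray:
    """Compute the difference audio frame R = Y - a*X for every candidate
    gain coefficient a. song_frame (Y) and initial_frame (X) are 1-D arrays
    of time-domain sample amplitudes; the result has one row per gain."""
    return song_frame[None, :] - gains[:, None] * initial_frame[None, :]
```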
Optionally, after obtaining the difference audio frames corresponding to a plurality of different gain coefficients, the loudness value of each difference audio frame may be calculated. There are various methods for calculating the loudness value of the difference audio frame, one of which is as follows:
and determining the root mean square of the loudness values of the sampling points in the difference audio frame as the loudness value of the difference audio frame for the difference audio frame corresponding to each gain coefficient.
In implementation, the initial audio frame is subjected to gain processing by a plurality of different gain coefficients, so that the initial audio frame after gain processing corresponding to the plurality of different gain coefficients can be obtained, and then the difference audio frames corresponding to the plurality of different gain coefficients can be correspondingly obtained. For the difference audio frame corresponding to each gain coefficient, the loudness values of the sampling points in the difference audio frame may be obtained, and then the root mean square of the loudness values of the sampling points of the difference audio frame is calculated as the loudness value of the difference audio frame, where the corresponding formula may be as follows:
RMS(R) = 10 · lg((R1^2 + R2^2 + ... + Rn^2) / n)
wherein R is the difference audio frame, RMS(R) is the loudness value of the difference audio frame, Ri is the amplitude of the i-th sampling point, and n is the number of sampling points.
Optionally, other calculation manners may be further selected to represent the loudness value of the difference audio frame, which is not limited in this application embodiment.
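A minimal sketch of this loudness computation, following the formula above (the small epsilon guard against taking the logarithm of zero is an added assumption, not part of the patent):

```python
import numpy as np

def loudness_db(difference_frame: np.ndarray) -> float:
    """Loudness value of a difference audio frame per the formula above:
    10 * lg of the mean of the squared sample amplitudes, in dB."""
    eps = 1e-12  # assumption: guard against log10(0) for an all-zero frame
    return 10.0 * np.log10(np.mean(difference_frame ** 2) + eps)
```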
104. And determining a target gain coefficient corresponding to the actual loudness value of the first class element in the song audio frame in different gain coefficients based on the loudness value of the difference audio frame corresponding to each gain coefficient, and determining the initial audio frame after the gain processing corresponding to the target gain coefficient as the target audio frame of the first class element corresponding to the song audio frame.
In implementation, after obtaining loudness values of the difference audio frame corresponding to a plurality of different gain coefficients, a loudness value equal to or closest to an actual loudness value of the first type element in the song audio frame may be determined among the plurality of loudness values, and a gain coefficient corresponding to the determined loudness value equal to or closest to the actual loudness value is determined as a target gain coefficient.
Optionally, the processing procedure for determining the target gain factor in the embodiment of the present application may be as follows:
and determining the gain coefficient corresponding to the minimum loudness value in the loudness values of the difference audio frames corresponding to the gain coefficient as a target gain coefficient corresponding to the actual loudness value of the first-class element in the song audio frame.
In implementation, different gain coefficients correspond to different loudness values of the difference audio frame. The minimum among the loudness values of the difference audio frames corresponding to the plurality of different gain coefficients indicates that, after the initial audio frame is gain-processed with the gain coefficient corresponding to that minimum loudness value, the loudness value of the gain-processed initial audio frame is closest to the actual loudness value of the first type element in the song audio frame.
When the selected gain coefficient is smaller than the target gain coefficient, the loudness value of the difference audio frame corresponding to the gain coefficient does not reach the minimum value yet, which indicates that more sounds of the first type elements exist in the difference audio frame.
When the selected gain coefficient is larger than the target gain coefficient, the loudness value of the initial audio frame after the gain processing is performed by using the gain coefficient is larger than the actual loudness value of the first type element in the song audio frame, and then the difference audio frame obtained by subtracting the initial audio frame after the gain processing from the song audio frame is the audio frame obtained by subtracting the first type element from the song audio frame and adding the audio frame of the first type element with the reversed amplitude.
Therefore, when the selected gain coefficient is not equal to the target gain coefficient, the difference between the loudness value of the initial audio frame after the gain processing is performed by using the gain coefficient and the actual loudness value of the first type element in the song audio frame is obvious, which indicates that the difference audio frame also contains the sound of the first type element at this time. Therefore, the gain coefficient corresponding to the minimum loudness value is determined as the target gain coefficient, and the loudness value of the initial audio frame after the gain processing is performed by using the target gain coefficient is equal to or closest to the actual loudness value of the first-class element in the song audio frame.
For example, as shown in fig. 3, the abscissa is a plurality of gain coefficients distributed with equal differences within the preset value range [0, 2], and the ordinate is the loudness value of the difference audio frame corresponding to each gain coefficient. As can be seen from fig. 3, when the gain coefficient is 1.52, the loudness value of the difference audio frame is the minimum, -18.63 dB; the target gain coefficient may then be determined to be 1.52.
As can be seen from the above, the loudness value of the initial audio frame after the gain processing is performed by using the target gain coefficient is closest to the actual loudness value of the first type element in the song audio frame, so that the initial audio frame after the gain processing corresponding to the target gain coefficient may be determined as the target audio frame of the first type element corresponding to the song audio frame.
When the first type element is human voice, the human voice audio frame corresponding to the song audio frame can be obtained through the steps; when the first type element is the accompaniment, the accompaniment audio frame corresponding to the song audio frame can be obtained through the steps.
Optionally, after determining the target audio frame of the first type element corresponding to the song audio frame, a difference audio frame corresponding to the target audio frame of the first type element may also be determined, and the corresponding processing may be as follows:
and determining the difference audio frame corresponding to the target gain coefficient as a target audio frame of a second type of element corresponding to the song audio frame, wherein the second type of element is human voice or accompaniment, and the second type of element is different from the first type of element.
In implementation, after the target gain coefficient is determined, the difference audio frame corresponding to the target gain coefficient may be determined as the audio frame of the second type element corresponding to the song audio frame, that is, the difference audio frame between the song audio frame and the target audio frame of the first type element is determined as the audio frame of the second type element. When the first type element is a voice, the target audio frame of the first type element is a voice audio frame in the song audio frame, and the audio frame of the second type element is an accompaniment audio frame in the song audio frame; when the first type element is the accompaniment, the target audio frame of the first type element is the accompaniment audio frame in the song audio frame, and the audio frame of the second type element is the human voice audio frame in the song audio frame.
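Putting the pieces together, the selection of the target gain coefficient and the two target audio frames for one song audio frame might look like the following sketch (illustrative names; the argmin over candidate gains implements the minimum-loudness rule described above):

```python
import numpy as np

def separate_frame(song_frame: np.ndarray, initial_frame: np.ndarray,
                   gains: np.ndarray):
    """For one song audio frame: pick the target gain coefficient as the
    candidate whose difference frame has the minimum loudness, then return
    the target audio frame of the first type element (the gain-processed
    initial frame) and of the second type element (the difference frame)."""
    diffs = song_frame[None, :] - gains[:, None] * initial_frame[None, :]
    loudness = 10.0 * np.log10(np.mean(diffs ** 2, axis=1) + 1e-12)
    best = int(np.argmin(loudness))    # minimum-loudness rule from step 104
    target_gain = gains[best]          # e.g. 1.52 in the Fig. 3 example
    first_type_frame = target_gain * initial_frame
    second_type_frame = diffs[best]
    return target_gain, first_type_frame, second_type_frame
```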
105. And forming the target audio frames of the first-class elements corresponding to the audio frames of the songs into audio clips of the first-class elements corresponding to the target songs.
In implementation, after a target audio frame of a first type element corresponding to each song audio frame in the multi-frame song audio frames is obtained, the target audio frames of the first type elements corresponding to the multi-frame song audio frames may be arranged and combined according to the order of the song audio frames in the target song to form an audio clip of the first type element corresponding to the target song.
Similarly, for the target audio frames of the second type elements corresponding to the obtained multi-frame song audio frames, the target audio frames of the second type elements corresponding to the target song can also be combined according to the arrangement sequence of the multi-frame song audio frames in the target song to form the audio clip of the second type elements corresponding to the target song.
The human voice audio and the accompaniment audio obtained by the above method can be used independently; in addition, the target gain coefficient can be further adjusted to obtain dynamic song effects in which the human voice gradually appears, gradually disappears, or even flickers in and out, and similarly dynamic song effects in which the accompaniment gradually appears, gradually disappears, or flickers in and out.
As shown in fig. 4, the corresponding process flow may be as follows:
401. for each song audio frame, determining a target adjustment coefficient corresponding to the song audio frame based on a time interval between the song audio frame and the starting time point of the target song.
Wherein the target adjustment coefficient of the audio frame of the song is positively or negatively correlated with the time interval.
In implementation, after the target gain coefficient corresponding to each initial audio frame is determined, a target adjustment coefficient corresponding to each song audio frame may be determined, where the target adjustment coefficient may be a value in the range of [0, 1] and is in positive correlation or negative correlation with a time interval, that is, the target adjustment coefficients corresponding to a plurality of consecutive song audio frames may be larger when the distance from the start time point of the target song is farther, or may be smaller when the distance from the start time point of the target song is farther.
402. And performing gain processing on the initial audio frame of the first type element corresponding to the song audio frame by using the target adjustment coefficient corresponding to the song audio frame and the target gain coefficient corresponding to the song audio frame to obtain the adjustment audio frame of the first type element corresponding to the song audio frame.
In implementation, for a song audio frame, gain processing is performed on the initial audio frame of the first type element corresponding to the song audio frame by using the target adjustment coefficient corresponding to the song audio frame and the target gain coefficient corresponding to the initial audio frame of the first type element corresponding to the song audio frame, so that the adjusted audio frame of the first type element corresponding to the song audio frame can be obtained. By the method, each song audio frame is subjected to gain processing, and the adjusted audio frame of the first type element corresponding to each song audio frame in the plurality of song audio frames can be obtained.
403. And respectively determining the difference audio frames between the multiple song audio frames and the corresponding adjusted audio frames of the first type element to form an adjusted audio clip corresponding to the target song.
In implementation, the amplitude of each time domain sampling point in the song audio frame is subtracted by the amplitude of the time domain sampling point of the adjusted audio frame corresponding to the song audio frame to obtain a difference audio frame between the song audio frame and the corresponding adjusted audio frame. By processing each song audio frame in the above manner, a difference audio frame of each adjusted audio frame in the plurality of song audio frames can be obtained.
Arranging the difference audio frames corresponding to the plurality of adjusted audio frames according to the arrangement sequence of the song audio frames in the target song to obtain an audio frequency band, namely the adjusted audio segment corresponding to the target song.
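A sketch of steps 401-403 under the assumption of a linear ramp of target adjustment coefficients; the patent only requires a positive or negative correlation with the time interval, so the linear form and all names here are illustrative:

```python
import numpy as np

def adjusted_clip(song_frames: np.ndarray, initial_frames: np.ndarray,
                  target_gains: np.ndarray, positive: bool) -> np.ndarray:
    """song_frames, initial_frames: shape (num_frames, samples_per_frame);
    target_gains: one target gain coefficient per frame, shape (num_frames,).

    Builds per-frame target adjustment coefficients in [0, 1] that grow
    linearly with the time interval from the song start (positive=True) or
    shrink with it (positive=False), gain-processes each initial frame with
    adjustment_coefficient * target_gain, and subtracts the result from the
    song frame. Positive correlation fades the first type element out of
    the adjusted clip; negative correlation fades it in."""
    ramp = np.linspace(0.0, 1.0, len(song_frames))
    adjustment = ramp if positive else ramp[::-1]
    adjusted_frames = adjustment[:, None] * target_gains[:, None] * initial_frames
    return (song_frames - adjusted_frames).reshape(-1)  # frames in song order
```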
When the target adjustment coefficients corresponding to the song audio frames are positively correlated with the time interval, that is, when the target adjustment coefficients corresponding to the song audio frames are larger in value at a distance from the starting time point, the loudness of the obtained adjusted audio frames of the first-class elements corresponding to the song audio frames is larger and larger, but because the value range of the target adjustment coefficients is [0, 1], the maximum value of the loudness of the adjusted audio frames is not larger than the loudness of the target audio frames of the first-class elements corresponding to the adjusted audio frames.
In the adjusted audio band determined based on the song audio frames and the adjusted audio frames, the loudness of the first type element in the resulting adjusted audio band becomes smaller and smaller, because the loudness of the subtracted adjusted audio frames of the first type element becomes larger and larger. Taking the human voice as the first type element, if the target adjustment coefficient corresponding to the song audio frame is positively correlated with the time interval, the voice will become smaller and smaller over time in the finally obtained adjusted audio segment, and thus a dynamic song effect in which the voice gradually disappears can be obtained.
Similarly, when the target adjustment coefficient corresponding to the audio frame of the song is inversely related to the time interval, the loudness of the obtained adjusted audio frame of the first type element is smaller and smaller, and further, the loudness of the first type element in the obtained adjusted audio segment is larger and larger. Taking the first type of element as the example of the human voice, if the target adjustment coefficient corresponding to the audio frame of the song is negatively correlated with the time interval, the human voice will be larger and larger along with the change of time in the finally obtained adjusted audio segment, so that a dynamic song effect in which the human voice gradually appears can be obtained.
The target song can be divided into a plurality of audio frequency bands, the voice gradually disappears for the odd audio frequency segments, the voice gradually appears for the even audio frequency segments, or the voice gradually disappears for the even audio frequency segments, and the voice gradually appears for the odd audio frequency segments, so that a dynamic song effect with the flickering voice can be obtained.
When the first type element is the accompaniment, the processing method is the same as above, and a dynamic song effect in which the accompaniment gradually appears, gradually disappears, or flickers in and out can likewise be obtained; this is not repeated here.
The embodiment of the present application further provides an audio processing method, referring to fig. 5, the corresponding processing flow is as follows:
501. and displaying a loudness adjustment interface corresponding to the target song, wherein a human voice loudness adjustment control and an accompaniment loudness adjustment control are arranged in the loudness adjustment interface.
In implementation, a music application is installed on the user's target terminal; the user can open the music application and enter a loudness adjustment interface (also referred to as a volume adjustment interface) for the target song. In the loudness adjustment interface, a human voice loudness adjustment control and an accompaniment loudness adjustment control are arranged. Each loudness adjustment control can be a slider control, with a minimum selectable value of 0 and a maximum selectable value of 1. The user can control the loudness of the human voice in the target song by sliding the button of the human voice loudness adjustment control, and control the loudness of the accompaniment in the target song by sliding the button of the accompaniment loudness adjustment control.
502. And acquiring a target human sound adjusting coefficient input through the human sound loudness adjusting control and a target accompaniment adjusting coefficient input through the accompaniment loudness adjusting control.
In implementation, after the user adjusts the human loudness adjustment control or the accompaniment loudness adjustment control, the target terminal can acquire the target human loudness adjustment coefficient input through the human loudness adjustment control and the target accompaniment adjustment coefficient input through the accompaniment loudness adjustment control.
503. And sending a regulation request to the server.
The adjustment request carries identification information of the target song, a target voice adjustment coefficient and a target accompaniment adjustment coefficient.
504. And receiving the adjusted audio corresponding to the target song sent by the server.
In implementation, after the adjustment request is sent to the server, the server processes a plurality of song audio frames in the target song based on the target vocal adjustment coefficient and the target accompaniment adjustment coefficient, and sends the adjustment audio corresponding to the target song obtained after the processing is completed back to the target terminal, and the target terminal receives and plays the adjustment audio.
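As an illustrative sketch only: the patent does not specify a transport protocol, so the following assumes a JSON-over-HTTP API with a hypothetical /adjust endpoint and hypothetical field names:

```python
import requests  # assumed transport; the patent does not specify a protocol

def send_adjustment_request(server_url: str, song_id: str,
                            vocal_coef: float, accompaniment_coef: float) -> bytes:
    """Steps 503-504 as a sketch: send the target song's identification
    information and the two adjustment coefficients, then receive the
    adjusted audio returned by the server."""
    response = requests.post(f"{server_url}/adjust", json={
        "song_id": song_id,                      # identification information
        "vocal_adjustment": vocal_coef,          # from the voice loudness control
        "accompaniment_adjustment": accompaniment_coef,
    })
    response.raise_for_status()
    return response.content  # adjusted audio corresponding to the target song
```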
The embodiment of the present application further provides an audio processing method, referring to fig. 6, where the corresponding processing flow is as follows:
601. and receiving an adjustment request sent by the target terminal.
The adjustment request carries identification information of the target song, a target voice adjustment coefficient and a target accompaniment adjustment coefficient.
602. And acquiring multi-frame song audio frames of the target song based on the identification information of the target song.
603. And determining a human voice audio frame and a corresponding accompaniment audio frame corresponding to the audio frames of the multi-frame song.
In implementation, the target audio frames of voices and the target audio frames of accompaniments corresponding to all the song audio frames of the target song may be obtained through steps 101-105, that is, the voice audio frames and the accompaniment audio frames corresponding to the song audio frames.
604. And respectively using the target voice adjusting coefficient to perform gain processing on the voice audio frame corresponding to each song audio frame to obtain the gain-processed voice audio frame corresponding to each song audio frame.
605. And respectively performing gain processing on the accompaniment audio frames corresponding to each song audio frame by using the target accompaniment adjustment coefficients to obtain the gain-processed accompaniment audio frames corresponding to each song audio frame. In practice, steps 604 and 605 may be performed in either order.
606. And combining the gain-processed human voice audio frame corresponding to each frame of song audio frame and the corresponding gain-processed accompaniment audio frame into the adjusting audio corresponding to the target song.
607. And sending the adjusted audio corresponding to the target song to the target terminal.
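The server-side gain processing and recombination of steps 604-606 reduce to a per-frame weighted sum, sketched below (illustrative names; this assumes the human voice and accompaniment frames have already been separated, as in steps 101-105):

```python
import numpy as np

def build_adjusted_audio(vocal_frames: np.ndarray, accompaniment_frames: np.ndarray,
                         vocal_coef: float, accompaniment_coef: float) -> np.ndarray:
    """Gain-process the per-frame human voice and accompaniment audio with
    the requested adjustment coefficients and recombine them frame by frame
    into the adjusted audio for the target song (steps 604-606)."""
    adjusted = vocal_coef * vocal_frames + accompaniment_coef * accompaniment_frames
    return adjusted.reshape(-1)  # concatenate frames in song order
```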
The embodiment of the present application further provides an audio processing method, referring to fig. 7, where the corresponding processing flow is as follows:
701. the target terminal displays a loudness adjustment interface corresponding to the target song, and a human voice loudness adjustment control and an accompaniment loudness adjustment control are arranged in the loudness adjustment interface.
702. And the target terminal acquires a target voice adjusting coefficient input through the voice loudness adjusting control and a target accompaniment adjusting coefficient input through the accompaniment loudness adjusting control.
703. The target terminal sends a regulation request to the server.
The adjustment request carries identification information of the target song, a target voice adjustment coefficient and a target accompaniment adjustment coefficient.
704. And the server receives the adjustment request sent by the target terminal.
705. And the server acquires the multi-frame song audio frames of the target song based on the identification information of the target song.
706. The server determines a voice audio frame and a corresponding accompaniment audio frame corresponding to the audio frames of the multi-frame song.
707. And the server performs gain processing on the voice audio frame corresponding to each song audio frame by using the target voice adjustment coefficient respectively to obtain the gain-processed voice audio frame corresponding to each song audio frame.
708. And the server respectively uses the target accompaniment adjustment coefficients to carry out gain processing on the accompaniment audio frames corresponding to each song audio frame to obtain the accompaniment audio frames after the gain processing corresponding to each song audio frame.
709. And the server combines the gain-processed human voice audio frame corresponding to each frame of song audio frame and the corresponding gain-processed accompaniment audio frame into the regulation audio corresponding to the target song.
710. And the server sends the adjusted audio corresponding to the target song to the target terminal.
711. And the target terminal receives the adjusted audio corresponding to the target song sent by the server.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
According to the scheme provided by the embodiment of the application, an initial audio frame of the first class element corresponding to the song audio frame can be extracted based on the song element extraction model, then a target gain coefficient corresponding to the actual loudness value of the first class element in the song audio frame is determined according to the loudness value of the difference audio frame after gain processing is carried out by using different gain coefficients, and based on the target gain coefficient, a target audio frame of the first class element with the loudness value closer to the actual loudness value is obtained, so that an audio segment of the first class element with better audio quality can be obtained.
An audio processing apparatus, which may be a computer device in the foregoing embodiments, as shown in fig. 8, includes:
a first determining module 810, configured to input a multi-frame song audio frame of a target song into a trained song element extraction model, to obtain an initial audio frame of a first type of element corresponding to the song audio frame output by the song element extraction model, where the first type of element is a human voice or an accompaniment;
a gain module 820, configured to perform gain processing on the initial audio frame by using different gain coefficients, respectively, to obtain gain-processed initial audio frames corresponding to the different gain coefficients;
a second determining module 830, configured to determine a difference audio frame between the song audio frame and each gain-processed initial audio frame, and determine a loudness value of the difference audio frame corresponding to each gain coefficient;
a third determining module 840, configured to determine, based on a loudness value of the difference audio frame corresponding to each gain coefficient, a target gain coefficient corresponding to an actual loudness value of the first type element in the song audio frame among the different gain coefficients, and determine, as a target audio frame of the first type element corresponding to the song audio frame, an initial audio frame after gain processing corresponding to the target gain coefficient;
and a composing module 850, configured to compose the target audio frames of the first type elements corresponding to the audio frames of the songs into audio clips of the first type elements corresponding to the target songs.
In one possible implementation, the different gain coefficients are a plurality of gain coefficients distributed with equal difference values within a preset value range.
In a possible implementation manner, the second determining module 830 is configured to:
and determining the root mean square of the loudness values of all sampling points in the difference audio frame as the loudness value of the difference audio frame for the difference audio frame corresponding to each gain coefficient.
In a possible implementation manner, the third determining module 840 is configured to:
determine the gain coefficient corresponding to the minimum loudness value among the loudness values of the difference audio frames as the target gain coefficient corresponding to the actual loudness value of the first type of element in the song audio frame.
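Combining the two sketches above, the per-frame search for the target gain coefficient might proceed as follows: sweep the candidate coefficients, measure the loudness of each resulting difference frame, and keep the coefficient whose difference frame is quietest. The helper below is a hypothetical illustration, not the claimed implementation.

```python
import numpy as np

def select_target_gain(song_frame, initial_frame, gain_coefficients):
    """Return the gain coefficient whose difference audio frame has the
    minimum RMS loudness, plus the resulting target audio frame."""
    def rms(x):
        return float(np.sqrt(np.mean(np.square(x))))

    # Loudness of the difference frame for each candidate gain.
    loudness = [rms(song_frame - g * initial_frame) for g in gain_coefficients]
    best = int(np.argmin(loudness))             # index of the quietest residual
    target_gain = float(gain_coefficients[best])
    target_frame = target_gain * initial_frame  # target frame of the element
    return target_gain, target_frame
```

Intuitively, when the gain-processed initial frame best matches the first type of element actually present in the song frame, subtracting it removes that element most completely, so the residual is at its quietest.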
In one possible implementation manner, the apparatus further includes a fourth determining module configured to:
determine the difference audio frame corresponding to the target gain coefficient as a target audio frame of a second type of element corresponding to the song audio frame, wherein the second type of element is a human voice or an accompaniment and is different from the first type of element; and
form the target audio frames of the second type of element corresponding to the song audio frames into an audio segment of the second type of element corresponding to the target song.
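Under the same assumptions, the second type of element falls out of the same subtraction: the difference frame at the target gain is the song frame with the first type of element removed. A hypothetical helper:

```python
import numpy as np

def second_type_target_frame(song_frame: np.ndarray,
                             initial_frame: np.ndarray,
                             target_gain: float) -> np.ndarray:
    # The difference audio frame at the target gain coefficient is the
    # song frame minus the gain-processed first-type frame, i.e. the
    # target audio frame of the other (second-type) element.
    return song_frame - target_gain * initial_frame
```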
In a possible implementation manner, the apparatus further includes a fifth determining module, configured to:
for each song audio frame, determine a target adjustment coefficient corresponding to the song audio frame based on a time interval between the song audio frame and the start time point of the target song, wherein the target adjustment coefficient of the song audio frame is positively or negatively correlated with the time interval;
perform gain processing on the initial audio frame of the first type of element corresponding to the song audio frame by using the target adjustment coefficient and the target gain coefficient corresponding to the song audio frame, to obtain an adjusted audio frame of the first type of element corresponding to the song audio frame; and
determine the difference audio frames between the multiple song audio frames and the corresponding adjusted audio frames of the first type of element, to form an adjusted audio segment corresponding to the target song.
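One plausible realization of such a time-dependent target adjustment coefficient is a linear ramp over the opening seconds of the song (a fade-in when positively correlated with the time interval, a fade-out when negatively correlated); the ramp shape and duration below are assumptions, since the embodiment only requires a monotone relationship.

```python
import numpy as np

def adjustment_coefficient(frame_start_s: float,
                           ramp_duration_s: float = 5.0,
                           fade_in: bool = True) -> float:
    """Target adjustment coefficient for a frame, from its time interval
    to the song's start point (the linear ramp is an assumed choice)."""
    t = min(frame_start_s / ramp_duration_s, 1.0)  # clamp to [0, 1]
    return t if fade_in else 1.0 - t

def adjusted_first_type_frame(initial_frame: np.ndarray,
                              target_gain: float,
                              adjust_coef: float) -> np.ndarray:
    # Gain-process the initial frame with both the target gain coefficient
    # and the time-dependent target adjustment coefficient.
    return adjust_coef * target_gain * initial_frame
```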
It should be noted that the division of the functional modules in the audio processing apparatus provided in the above embodiment is merely illustrative. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio processing apparatus and the audio processing method provided by the above embodiments belong to the same concept; the specific implementation process is described in the method embodiments and is not repeated here.
Fig. 9 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal may be the computer device in the above embodiments. The terminal 900 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, for example a 4-core or an 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (digital signal processor), an FPGA (field-programmable gate array), or a PLA (programmable logic array). The processor 901 may also include a main processor and a coprocessor: the main processor, also called a CPU (central processing unit), processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 901 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (artificial intelligence) processor for handling computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the audio processing methods provided by the method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 904, display screen 905, camera 906, audio circuitry 907, positioning component 908, and power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (input/output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 904 is used for receiving and transmitting RF (radio frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi networks. In some embodiments, the radio frequency circuit 904 may also include NFC (near field communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it can also capture touch signals on or over its surface. A touch signal may be input to the processor 901 as a control signal for processing. In this case, the display screen 905 may also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, provided on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the terminal 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved or folded surface of the terminal 900. The display screen 905 may even be arranged in a non-rectangular irregular shape, i.e., an irregularly-shaped screen. The display screen 905 may be made of materials such as an LCD (liquid crystal display) or an OLED (organic light-emitting diode).
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (virtual reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 901 for processing, or to the radio frequency circuit 904 for voice communication. For stereo acquisition or noise reduction purposes, there may be multiple microphones disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can not only convert an electrical signal into a sound wave audible to humans, but also convert an electrical signal into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (location-based services). The positioning component 908 may be based on the GPS (global positioning system) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 909 is used to supply power to the various components in terminal 900. The power supply 909 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
Proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the bright-screen state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in Fig. 9 does not constitute a limitation of terminal 900; it may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one instruction that is loaded and executed by the processors 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the audio processing method in the above-described embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. An audio processing method, wherein the method comprises:
inputting multiple song audio frames of a target song into a trained song element extraction model to obtain an initial audio frame of a first type of element corresponding to each song audio frame output by the song element extraction model, wherein the first type of element is a vocal or an accompaniment;
performing gain processing on the initial audio frame by using different gain coefficients respectively, to obtain gain-processed initial audio frames corresponding to the different gain coefficients;
determining a difference audio frame between the song audio frame and each gain-processed initial audio frame, and determining a loudness value of the difference audio frame corresponding to each gain coefficient;
determining, among the different gain coefficients and based on the loudness value of the difference audio frame corresponding to each gain coefficient, a target gain coefficient corresponding to an actual loudness value of the first type of element in the song audio frame, and determining the gain-processed initial audio frame corresponding to the target gain coefficient as a target audio frame of the first type of element corresponding to the song audio frame; and
composing the target audio frames of the first type of element corresponding to the song audio frames into an audio segment of the first type of element corresponding to the target song.
2. The method according to claim 1, wherein the different gain coefficients are a plurality of gain coefficients distributed at equal intervals within a preset value range.
3. The method according to claim 1, wherein determining the loudness value of the difference audio frame corresponding to each gain coefficient comprises:
for the difference audio frame corresponding to each gain coefficient, determining the root mean square of the loudness values of the sampling points in the difference audio frame as the loudness value of the difference audio frame.
4. The method according to claim 1, wherein determining, among the different gain coefficients and based on the loudness value of the difference audio frame corresponding to each gain coefficient, the target gain coefficient corresponding to the actual loudness value of the first type of element in the song audio frame comprises:
determining the gain coefficient corresponding to the minimum loudness value among the loudness values of the difference audio frames corresponding to the gain coefficients as the target gain coefficient corresponding to the actual loudness value of the first type of element in the song audio frame.
5. The method according to claim 1, wherein after determining the gain-processed initial audio frame corresponding to the target gain coefficient as the target audio frame of the first type of element corresponding to the song audio frame, the method further comprises:
determining the difference audio frame corresponding to the target gain coefficient as a target audio frame of a second type of element corresponding to the song audio frame, wherein the second type of element is a vocal or an accompaniment and is different from the first type of element; and
composing the target audio frames of the second type of element corresponding to the song audio frames into an audio segment of the second type of element corresponding to the target song.
6. The method according to claim 1, wherein the method further comprises:
for each song audio frame, determining a target adjustment coefficient corresponding to the song audio frame based on a time interval between the song audio frame and a start time point of the target song, wherein the target adjustment coefficient of the song audio frame is positively or negatively correlated with the time interval;
performing gain processing on the initial audio frame of the first type of element corresponding to the song audio frame by using the target adjustment coefficient corresponding to the song audio frame and the target gain coefficient corresponding to the song audio frame, to obtain an adjusted audio frame of the first type of element corresponding to the song audio frame; and
determining the difference audio frames between the multiple song audio frames and the corresponding adjusted audio frames of the first type of element, to compose an adjusted audio segment corresponding to the target song.
7. An audio processing method, wherein the method comprises:
displaying a loudness adjustment interface corresponding to a target song, wherein a vocal loudness adjustment control and an accompaniment loudness adjustment control are provided in the loudness adjustment interface;
obtaining a target vocal adjustment coefficient input through the vocal loudness adjustment control and a target accompaniment adjustment coefficient input through the accompaniment loudness adjustment control;
sending an adjustment request to a server, wherein the adjustment request carries identification information of the target song, the target vocal adjustment coefficient, and the target accompaniment adjustment coefficient, so that the server obtains adjusted audio corresponding to the target song by using the target vocal adjustment coefficient and the target accompaniment adjustment coefficient; and
receiving the adjusted audio corresponding to the target song sent by the server.
8. An audio processing method, wherein the method comprises:
receiving an adjustment request sent by a target terminal, wherein the adjustment request carries identification information of a target song, a target vocal adjustment coefficient, and a target accompaniment adjustment coefficient;
obtaining multiple song audio frames of the target song based on the identification information of the target song;
determining vocal audio frames and accompaniment audio frames corresponding to the multiple song audio frames;
performing gain processing on the vocal audio frame corresponding to each song audio frame by using the target vocal adjustment coefficient, to obtain a gain-processed vocal audio frame corresponding to each song audio frame;
performing gain processing on the accompaniment audio frame corresponding to each song audio frame by using the target accompaniment adjustment coefficient, to obtain a gain-processed accompaniment audio frame corresponding to each song audio frame;
composing the gain-processed vocal audio frames and the corresponding gain-processed accompaniment audio frames of the song audio frames into the adjusted audio corresponding to the target song; and
sending the adjusted audio corresponding to the target song to the target terminal.
9. An audio processing apparatus, wherein the apparatus comprises:
a first determining module, configured to input multiple song audio frames of a target song into a trained song element extraction model to obtain an initial audio frame of a first type of element corresponding to each song audio frame output by the song element extraction model, wherein the first type of element is a vocal or an accompaniment;
a gain module, configured to perform gain processing on the initial audio frame by using different gain coefficients respectively, to obtain gain-processed initial audio frames corresponding to the different gain coefficients;
a second determining module, configured to determine a difference audio frame between the song audio frame and each gain-processed initial audio frame, and determine a loudness value of the difference audio frame corresponding to each gain coefficient;
a third determining module, configured to determine, among the different gain coefficients and based on the loudness value of the difference audio frame corresponding to each gain coefficient, a target gain coefficient corresponding to an actual loudness value of the first type of element in the song audio frame, and determine the gain-processed initial audio frame corresponding to the target gain coefficient as a target audio frame of the first type of element corresponding to the song audio frame; and
a composing module, configured to compose the target audio frames of the first type of element corresponding to the song audio frames into an audio segment of the first type of element corresponding to the target song.
10. A computer device, wherein the computer device comprises a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the operations performed by the audio processing method according to any one of claims 1-6, claim 7, or claim 8.
11. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the audio processing method according to any one of claims 1-6, claim 7, or claim 8.
CN202111196144.5A 2021-10-14 2021-10-14 Audio processing method, device, equipment and storage medium Pending CN113963707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111196144.5A CN113963707A (en) 2021-10-14 2021-10-14 Audio processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113963707A true CN113963707A (en) 2022-01-21

Family

ID=79464196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111196144.5A Pending CN113963707A (en) 2021-10-14 2021-10-14 Audio processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113963707A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1835487A2 (en) * 2003-05-28 2007-09-19 Dolby Laboratories Licensing Corporation Method, apparatus and computer program for calculating and adjusting the perceived loudness of an audio signal
CN102098606A (en) * 2009-12-10 2011-06-15 腾讯科技(深圳)有限公司 Method and device for dynamically adjusting volume
CN111192594A (en) * 2020-01-10 2020-05-22 腾讯音乐娱乐科技(深圳)有限公司 Method for separating voice and accompaniment and related product
CN111210850A (en) * 2020-01-10 2020-05-29 腾讯音乐娱乐科技(深圳)有限公司 Lyric alignment method and related product

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114466242A (en) * 2022-01-27 2022-05-10 海信视像科技股份有限公司 Display device and audio processing method
CN115631766A (en) * 2022-09-27 2023-01-20 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for detecting accompaniment stoping
CN115631766B (en) * 2022-09-27 2025-06-13 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for detecting accompaniment recollection
CN116721678A (en) * 2022-09-29 2023-09-08 荣耀终端有限公司 Audio data monitoring method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110688082B (en) Method, device, equipment and storage medium for determining adjustment proportion information of volume
CN108538302B (en) Method and apparatus for synthesizing audio
CN109033335B (en) Audio recording method, device, terminal and storage medium
CN107978321B (en) Audio processing method and device
CN109192218B (en) Method and apparatus for audio processing
CN108965757B (en) Video recording method, device, terminal and storage medium
CN109346111B (en) Data processing method, device, terminal and storage medium
CN109065068B (en) Audio processing method, device and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN109147757B (en) Singing voice synthesis method and device
CN112133332B (en) Method, device and equipment for playing audio
CN111061405B (en) Method, device and equipment for recording song audio and storage medium
CN108922506A (en) Song audio generation method, device and computer readable storage medium
CN113963707A (en) Audio processing method, device, equipment and storage medium
CN112086102B (en) Method, apparatus, device and storage medium for expanding audio frequency band
CN111984222B (en) Volume adjustment method, device, electronic device and readable storage medium
CN110600034B (en) Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN110808021B (en) Audio playing method, device, terminal and storage medium
CN108831425A (en) Sound mixing method, device and storage medium
CN109192223B (en) Audio alignment method and device
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN112133319B (en) Audio generation method, device, equipment and storage medium
CN111092991B (en) Lyric display method and device and computer storage medium
CN109036463B (en) Method, device and storage medium for acquiring difficulty information of songs
CN111081277A (en) Audio evaluation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination