CN106250400B - Audio data processing method, device and system - Google Patents
- Publication number: CN106250400B (application CN201610571692.4A)
- Authority: CN (China)
- Prior art keywords: preset, audio data, data, audio, user
- Legal status: Active
Classifications
- G06F16/637 — Information retrieval of audio data; querying; filtering based on additional data, e.g. user or group profiles; administration of user profiles, e.g. generation, initialization, adaptation or distribution
- G06F16/634 — Information retrieval of audio data; querying; query formulation; query by example, e.g. query by humming
Abstract
The embodiment of the invention discloses an audio data processing method, device and system. The method comprises the following steps: the client acquires user audio data and sends the user audio data to the server; the server extracts user audio features of the user audio data and, according to the user audio features, respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database; the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client; the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area. By adopting the method and the device, the display content related to the analysis result of the audio data can be made richer.
Description
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, and a system for processing audio data.
Background
Current smart terminals (e.g., mobile phones, tablet computers, desktop computers) generally have basic audio processing capabilities, such as recording the user's voice, and can therefore support most current audio processing applications. Most existing audio processing applications can analyze a user's recorded singing voice to calculate a singing score and display it to the user, so that the user can intuitively know his or her singing level. However, because these applications analyze the user's singing voice in only a single dimension (i.e., they can only analyze the user's singing level), the final display content is also single (i.e., only the user's singing score is displayed), so the display effect is not rich enough.
Disclosure of Invention
The embodiment of the invention provides an audio data processing method, device and system, which can enrich the display content associated with the analysis result of audio data.
A first aspect of the present invention provides an audio data processing method, including:
the client acquires user audio data and sends the user audio data to the server;
the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client;
and the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
A second aspect of the present invention provides an audio data processing method, including:
the server receives user audio data sent by the client;
the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
A third aspect of the present invention provides an audio data processing apparatus, comprising:
the receiving module is used for receiving user audio data sent by the client;
the calculation module is used for extracting the user audio features of the user audio data and respectively calculating the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
and the selection sending module is used for selecting a preset matching number of target preset audio data from the plurality of preset audio data, and sending the audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
The fourth aspect of the present invention provides an audio data processing system, comprising a client and a server;
the client is used for acquiring user audio data, sending the user audio data to the server, displaying audio attribute information and tone similarity corresponding to target preset audio data sent by the server in a first preset display area, and displaying audio quality scores corresponding to the user audio data in a second preset display area;
the server comprises the audio data processing device provided by the third aspect.
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database, select target preset audio data according to the ranking of the tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is analyzed not only in the dimension of singing level but also in the dimension of tone similarity, the client can display both the audio quality score of the user audio data and the audio attribute information and tone similarity corresponding to the target preset audio data, so that the display content related to the analysis result of the user audio data is richer.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of an audio data processing system according to an embodiment of the present invention;
FIG. 2 is a flow chart of an audio data processing method according to an embodiment of the present invention;
FIG. 2a is a diagram illustrating a client interface according to an embodiment of the present invention;
FIG. 2b is a diagram illustrating another client interface provided by an embodiment of the present invention;
FIG. 3 is a timing diagram illustrating an audio data processing method according to an embodiment of the invention;
FIG. 4 is a flow chart of another audio data processing method according to an embodiment of the invention;
FIG. 5 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a calculation module according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a label setting unit according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a selection sending module according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another audio data processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic structural diagram of an audio data processing system according to an embodiment of the present invention. The system may include a client 100 and a server 200 connected through a network. The client 100 may be an intelligent terminal having audio processing and network communication functions, such as a mobile phone, tablet computer or desktop computer, and the server 200 may be a background server of an audio processing application. The system may be applied to a scenario of multidimensional analysis of a user's singing. For example, when a user records his or her own singing through the client 100, after the recording is completed the client 100 may calculate and display a singing score for the user singing data (the singing score may be calculated from factors such as pitch accuracy and rhythm), and may simultaneously send the user singing data to the server 200. The server 200 may extract user audio features from the user singing data and, according to these features, respectively calculate the tone similarity between the user singing data and a plurality of preset singer voice data in a preset audio database; the server 200 then selects a preset matching number of target preset singer voice data from the preset singer voice data, and sends the singer name, singer avatar, tone similarity and other information corresponding to each target preset singer voice data to the client 100. The system thus analyzes the user singing data both in the dimension of singing level and in the dimension of tone similarity with singers' voices, so the client 100 can simultaneously display the singing score and the singer name, singer avatar, tone similarity and other information corresponding to each target preset singer voice data, making the display content related to the analysis result of the user singing data richer.
Referring to fig. 2, a flow chart of an audio data processing method according to an embodiment of the present invention is shown, where the method includes:
S201, a client acquires user audio data and sends the user audio data to the server;
specifically, the client may obtain user audio data input by the user. For example, when a user sings a song, the client may obtain a song recording audio of the user through a microphone, where the song recording audio is the user audio data. When the user finishes inputting the user audio data (for example, when recording is finished), the client may calculate the audio quality score corresponding to the acquired complete user audio data, and display the audio quality score and the timbre similarity calculation prompt information. Meanwhile, the client can also add the acquired complete user audio data into the timbre similarity calculation request and send the timbre similarity calculation request carrying the user audio data to a server.
Please refer to fig. 2a, which is a display diagram of a client interface according to an embodiment of the present invention. As shown in fig. 2a, area A is displaying "SS", "4829 score" and "Your star voice index is being calculated, just a moment". "SS" and "4829 score" represent the audio quality score, and "Your star voice index is being calculated, just a moment" represents the timbre similarity calculation prompt information.
S202, the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
specifically, the server may extract user audio features corresponding to each frame of data in the user audio data, set an effective data tag for the frame of data including voice information in the user audio data, calculate an individualized tone color vector corresponding to the user audio data according to the user audio features corresponding to the frame of data carrying the effective data tag and a preset individualized tone color calculation model, and finally calculate a vector cosine distance between the individualized tone color vector corresponding to the user audio data and the individualized tone color vector corresponding to each preset audio data; the personalized tone calculation model is obtained by training based on a preset common tone calculation model and the preset audio data, and a vector cosine distance refers to the tone similarity between the user audio data and the preset audio data.
Optionally, the user audio feature may be an MFCC (Mel-Frequency Cepstral Coefficient) audio feature, the common tone calculation model may be a UBM (Universal Background Model), and the personalized tone calculation model may be an I-vector calculation model.
The process of setting an effective data tag for frame data containing voice information in the user audio data may be VAD (Voice Activity Detection), and the specific process may be: normalize the first datum in the MFCC audio features corresponding to each frame of data in the user audio data (the first datum of an MFCC feature vector represents the energy of the signal) to obtain the energy value of the signal to be matched; compare the energy value of the signal to be matched for each frame with a preset energy threshold and identify each frame according to the comparison result, so as to distinguish frame data containing voice information from frame data not containing voice information (if the energy value of a frame is greater than the preset energy threshold, the frame is determined to contain voice information; if it is less than or equal to the threshold, the frame is determined not to contain voice information); then set an effective data tag on the frame data containing voice information and delete the frame data not containing voice information.
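As a rough illustration of this VAD step, the sketch below assumes librosa for MFCC extraction, treats the first MFCC coefficient as the frame-energy term, and uses min-max normalization with an illustrative threshold of 0.3; the patent specifies none of these concrete choices.

```python
# Sketch of the VAD step described above, assuming librosa for MFCC
# extraction; the min-max normalization and the 0.3 threshold are
# illustrative assumptions, not values fixed by the patent.
import numpy as np
import librosa

def voice_activity_mask(wav_path: str, energy_threshold: float = 0.3):
    y, sr = librosa.load(wav_path, sr=16000)
    # Rows are coefficients, columns are frames; row 0 tracks frame energy.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    energy = mfcc[0]
    # Normalize to [0, 1] to get the "energy value of the signal to be matched".
    energy = (energy - energy.min()) / (energy.max() - energy.min() + 1e-9)
    # Frames above the preset threshold are tagged as valid (contain voice);
    # the remaining frames would be deleted before further processing.
    valid = energy > energy_threshold
    return mfcc[:, valid], valid
```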
Optionally, the specific process by which the server pre-trains the UBM and the I-vector calculation model may be: the server extracts the preset audio features (which may be MFCC audio features) corresponding to each frame of data in each preset audio data, and normalizes the preset audio features carrying effective data tags; an effective data tag is a tag identifying frame data that contains voice information. The server then trains the UBM on the normalized preset audio features carrying effective data tags through the EM (Expectation-Maximization) algorithm. The UBM is a GMM (Gaussian Mixture Model), which is essentially a multidimensional probability density function. The probability density function of a GMM of order M can be expressed as:

$$p(x) = \sum_{k=1}^{M} c_k\, N(x; \mu_k, \Sigma_k), \qquad \sum_{k=1}^{M} c_k = 1,$$

that is, a GMM of order M is a weighted combination of M single Gaussian distributions, each of which is given by:

$$N(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_k\rvert^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)\Big),$$

i.e., a single Gaussian distribution is a multidimensional normal distribution. Training the GMM means: given N known data points assumed to follow an M-order GMM distribution, estimate the influence factors (mixture weights) $c_k$, means $\mu_k$ and covariances $\Sigma_k$ such that the probability of the N known points under the resulting distribution is maximized. This probability is essentially the product $\prod_{i=1}^{N} p(x_i)$, called the likelihood function. Because the probability of a single point is usually small, the product is converted into a sum by taking logarithms, which also prevents underflow during the calculation. Maximizing the log-likelihood function, i.e., finding the set of parameter values that makes the likelihood function largest, is the process of parameter estimation, that is, model training. The log-likelihood function of the GMM is:

$$\log L = \sum_{i=1}^{N} \log\Big(\sum_{k=1}^{M} c_k\, N(x_i; \mu_k, \Sigma_k)\Big).$$
since there is an addition in the logarithmic function, there is no way to find the maximum using a direct derivation method, but here the EM method can be used. The EM algorithm flow is as follows:
s11: estimate the probability that data is generated from each single gaussian distribution: for each data xiIn other words, the probability that it is generated from the kth single Gaussian distribution is:
s12: by maximum likelihood estimation, mu can be obtainedk、∑kThe value of (c):
wherein,therefore, the temperature of the molten metal is controlled,the above steps of S11 and S12 are repeated until the values of the likelihood function converge to positions. The step S11 is E-step, namely estimation; the step S12 is M-step, i.e., maximum. After the UBM model is trained, the mean vector of the UBM can be obtained, and the mean vector of the UBM can be used for training the I-vector calculation model. I-vector is a cross-channel algorithm based on a single space that contains both speaker and channel space information. For a given speech, the gaussian supervector is represented as follows:
After the UBM is trained, its mean vectors can be obtained and used to train the I-vector calculation model. I-vector is a cross-channel algorithm based on a single space that contains both speaker and channel space information. For a given speech segment, the Gaussian supervector is represented as follows:

$$M = m + Tw,$$

where m is a speaker-independent and channel-independent supervector, usually formed by concatenating the mean vectors $\mu_k$ of the UBM; T is a low-rank matrix; and w is a random vector following a standard normal distribution, referred to as the I-vector. The input parameters of the training algorithm for T comprise the normalized preset audio features carrying effective data tags, the rank R of T, the UBM, and the maximum number of iterations; the output is a CF×R matrix T (C is the number of Gaussian components, F is the feature dimension). The training algorithm of T includes the following steps S21-S27:

S21: calculate the zero-order, first-order and second-order statistics, and randomly initialize T; set the current iteration number It to 0:

$$N_c(h) = \sum_{t} \gamma_t^{(c)}(h), \qquad F_c(h) = \sum_{t} \gamma_t^{(c)}(h)\,\big(x_t(h) - \mu_c\big),$$

where $N_c(h)$ is the zero-order sufficient statistic of speech segment h, $F_c(h)$ is the (mean-centered) first-order sufficient statistic, and $\gamma_t^{(c)}(h)$ is the state-occupation probability of the t-th frame feature of speech segment h on the c-th mixture component of the GMM model.

S23: expand the statistics into matrix form for convenient operation:

$$N(h) = \mathrm{diag}\big(N_1(h)\,I, \ldots, N_C(h)\,I\big), \qquad F(h) = \big[F_1(h)^{\top}, \ldots, F_C(h)^{\top}\big]^{\top},$$

where I is the F×F identity matrix and F(h) is a CF×1 column vector.

S24: calculate the variance and mean of the speaker factor:

$$l(h) = I + T^{\top}\Sigma^{-1}N(h)\,T, \qquad \bar{w}(h) = l^{-1}(h)\,T^{\top}\Sigma^{-1}F(h),$$

where $\Sigma$ is the covariance of the UBM, $l^{-1}(h)$ is the posterior covariance of the factor and $\bar{w}(h)$ its posterior mean.

S25: accumulate the statistics of all speech segments:

$$\mathcal{A}_c = \sum_{h} N_c(h)\,\big[l^{-1}(h) + \bar{w}(h)\,\bar{w}(h)^{\top}\big], \qquad \mathcal{C} = \sum_{h} F(h)\,\bar{w}(h)^{\top}.$$

S26: update T, block by block (the c-th F×R block):

$$T_c = \mathcal{C}_c\, \mathcal{A}_c^{-1}.$$

S27: increment It. If It exceeds the maximum number of iterations, the training ends; otherwise return to step S24. After the training is completed, the total variability matrix T is obtained, and the total variability factor w (the I-vector) of a speech segment is then calculated by the following formula:

$$w = \big(I + T^{\top}\Sigma^{-1}N(h)\,T\big)^{-1}\, T^{\top}\Sigma^{-1}F(h).$$
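Steps S24-S26 can be condensed into a short NumPy routine. The sketch below assumes diagonal UBM covariances and that the Baum-Welch statistics $N_c(h)$ and the centered $F_c(h)$ have already been accumulated; it is one plausible reading of the update, not the patent's reference implementation.

```python
# Minimal sketch of one EM iteration for the total-variability matrix T
# (steps S24-S26), assuming diagonal UBM covariances and precomputed
# Baum-Welch statistics; numerical safeguards are omitted for brevity.
import numpy as np

def update_T(T, Sigma_inv, stats, C_mix, F_dim):
    """T: (C_mix*F_dim, R) total-variability matrix; Sigma_inv: (C_mix*F_dim,)
    inverse of the diagonal UBM covariances; stats: list of (N_h, F_h) where
    N_h has shape (C_mix,) (zero-order) and F_h has shape (C_mix*F_dim,)
    (centered first-order supervector)."""
    R = T.shape[1]
    A = np.zeros((C_mix, R, R))
    Cacc = np.zeros((C_mix * F_dim, R))
    for N_h, F_h in stats:
        N_exp = np.repeat(N_h, F_dim)  # expand to supervector layout
        # S24: l(h) = I + T' Sigma^-1 N(h) T; posterior mean of the factor
        L = np.eye(R) + T.T @ ((Sigma_inv * N_exp)[:, None] * T)
        L_inv = np.linalg.inv(L)
        w_mean = L_inv @ T.T @ (Sigma_inv * F_h)
        # S25: accumulate the statistics over all speech segments
        ww = L_inv + np.outer(w_mean, w_mean)
        A += N_h[:, None, None] * ww[None]
        Cacc += np.outer(F_h, w_mean)
    # S26: update T block by block (one F_dim x R block per mixture component)
    for c in range(C_mix):
        rows = slice(c * F_dim, (c + 1) * F_dim)
        T[rows] = Cacc[rows] @ np.linalg.inv(A[c])
    return T
```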
after the I-vector calculation model is trained, calculating the personalized tone vector corresponding to each preset audio data based on the I-vector calculation model, and storing the personalized tone vector corresponding to each preset audio data, so as to be used for calculating the tone similarity between the user audio data and the user audio data subsequently; the personalized tone vector is the w (I-vector) value.
The specific process of the server calculating the personalized tone vector corresponding to the user audio data may be as follows: and normalizing the user audio features corresponding to the frame data carrying the effective data labels, inputting the normalized user audio features carrying the effective data labels into the I-vector calculation model, and calculating a w (I-vector) value corresponding to the user audio data through the I-vector calculation model (namely the calculation formula of the total variation factor w (I-vector)).
When the server calculates the tone similarity between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data, it may specifically use the cosine distance between the vectors to represent the tone similarity. The cosine distance calculation formula is:

$$k(w_1, w_2) = \frac{\langle w_1, w_2\rangle}{\lVert w_1\rVert\,\lVert w_2\rVert},$$

where $w_1$ represents the personalized tone vector corresponding to the user audio data and $w_2$ represents the personalized tone vector corresponding to one preset audio data. The vector cosine distance between the user audio data and that preset audio data is therefore $k(w_1, w_2)$, i.e., the tone similarity is k.
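Putting the extraction formula for w and the cosine distance together, a minimal NumPy sketch (under the same diagonal-covariance assumptions as the training sketch above):

```python
# Sketch of i-vector extraction (the formula for w above) and of the cosine
# distance used as the tone similarity; diagonal Sigma is assumed as before.
import numpy as np

def extract_ivector(T, Sigma_inv, N_h, F_h, F_dim):
    """w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F"""
    R = T.shape[1]
    N_exp = np.repeat(N_h, F_dim)
    L = np.eye(R) + T.T @ ((Sigma_inv * N_exp)[:, None] * T)
    return np.linalg.solve(L, T.T @ (Sigma_inv * F_h))

def timbre_similarity(w1, w2):
    """Vector cosine distance k(w1, w2) between two personalized tone vectors."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```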
S203, the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client;
specifically, the server may sort the tone similarity between the user audio data and each preset audio data to obtain a tone similarity sorting table, and then sequentially obtain tone similarities of preset matching numbers from the tone similarity sorting table as target tone similarities; the number of the target tone similarity degrees is equal to the preset matching number; the server further obtains preset audio data corresponding to each target tone similarity as target audio data, and sends audio attribute information and tone similarity corresponding to each target preset audio data to the client. For example, if the preset matching number is 3, the server determines the preset audio data corresponding to the first three tone color similarities as target preset audio data, and then sends the audio attribute information and the tone color similarities corresponding to the three target preset audio data to the client. The audio attribute information may include a name of a song, a name of a singer, an avatar of the singer, and a content of a preset document corresponding to the target preset audio data, and a certain content of the preset document may be "you also have dolphin sound".
S204, the client displays the audio attribute information and the tone similarity corresponding to each target preset audio data in a preset display area;
Specifically, when the client receives the audio attribute information and tone similarity corresponding to each target preset audio data, it may dynamically shrink the graphic area corresponding to the audio quality score, display the shrunken audio quality score in the second preset display area, and stop displaying the timbre similarity calculation prompt information. This frees part of the display area of the client's current interface (the freed part is the first preset display area), in which the audio attribute information and tone similarity corresponding to each target preset audio data can then be displayed.
Please refer to fig. 2b, which is a display diagram of another client interface according to an embodiment of the present invention. Area B in fig. 2b is the second preset display area and area C is the first preset display area. When the client receives the audio attribute information and tone similarity corresponding to each target preset audio data, the graphic area corresponding to the audio quality score in area A of fig. 2a is dynamically shrunk, the shrunken audio quality score is displayed in area B of fig. 2b (i.e., the "SS" and "4829" shown in fig. 2b), and the timbre similarity calculation prompt information of fig. 2a is no longer displayed, so that area A of fig. 2a is freed; the audio attribute information and tone similarity corresponding to each target preset audio data can then be displayed in area C of fig. 2b. In fig. 2b, the displayed audio attribute information and tone similarities correspond to 3 target preset audio data, namely those of singer A, singer B and singer C, indicating that the user's timbre is most similar to the timbres of these three singers. The tone similarity between the user and singer A is 0.96 (i.e., the vector cosine distance is 0.96), so the corresponding document content can be displayed under singer A's avatar in fig. 2b: "96% similarity", "you also have dolphin sound". The tone similarity between the user and singer B is 0.9 (i.e., the vector cosine distance is 0.9), so the corresponding document content "90% similarity", "version XXX is you" can be displayed under singer B's avatar. The tone similarity between the user and singer C is 0.88 (i.e., the vector cosine distance is 0.88), so the corresponding document content can be displayed under singer C's avatar: "88% similarity", "do you also dance a woman?"
Optionally, when the server detects that the maximum tone similarity among the tone similarities corresponding to the target preset audio data is greater than a preset similarity threshold, the server sends the audio attribute information of the target preset audio data corresponding to the maximum tone similarity, the maximum tone similarity, and the user information of the client to a plurality of friend clients having a friend association relationship with the user information of the client. For example, if the preset similarity threshold is 0.9 and the maximum tone similarity among the tone similarities corresponding to the target preset audio data is 0.93, the server may send the audio attribute information of the target preset audio data whose tone similarity is 0.93, the tone similarity 0.93, and the user information of the client to those friend clients.
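A sketch of this optional push rule; notify_friends is a hypothetical stand-in for the server's actual messaging path to friend clients:

```python
# Sketch of the optional friend-push rule; notify_friends is a hypothetical
# stand-in for the server's real messaging path to friend clients.
def notify_friends(user_info, target_id, similarity):
    # Placeholder: a real server would push the audio attribute information,
    # the maximum tone similarity and the user info to each friend client.
    print(f"{user_info} sounds {similarity:.0%} like preset track {target_id}")

def maybe_push_to_friends(targets, user_info, threshold=0.9):
    """targets: list of (preset_audio_id, tone_similarity) tuples."""
    best_id, best_sim = max(targets, key=lambda item: item[1])
    if best_sim > threshold:  # preset similarity threshold (0.9 in the example)
        notify_friends(user_info, best_id, best_sim)
```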
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database, select target preset audio data according to the ranking of the tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is analyzed not only in the dimension of singing level but also in the dimension of tone similarity, the client can display both the audio quality score of the user audio data and the audio attribute information and tone similarity corresponding to the target preset audio data, so that the display content related to the analysis result of the user audio data is richer.
Referring to fig. 3, a timing diagram of an audio data processing method according to an embodiment of the present invention is shown, where the method includes:
S301, a client acquires user audio data, and calculates and displays an audio quality score corresponding to the user audio data;
Specifically, the client may obtain user audio data input by the user. For example, when a user sings a song, the client may obtain the user's song recording audio through a microphone; this recording audio is the user audio data. When the user finishes inputting the user audio data (for example, when recording ends), the client may calculate the audio quality score corresponding to the acquired complete user audio data, and display the audio quality score and the timbre similarity calculation prompt information. For example, the timbre similarity calculation prompt information may be a character string: "Your star voice index is being calculated, just a moment".
S302, the client sends the user audio data to a server;
specifically, the client may further add the acquired complete user audio data to a timbre similarity calculation request, and send the timbre similarity calculation request carrying the user audio data to a server.
S303, the server calculates an individualized tone vector corresponding to the user audio data based on the trained I-vector calculation model;
Specifically, before step S301, the server may preset an audio database (containing a plurality of preset audio data), preset an I-vector calculation model, and pre-calculate the personalized tone vector corresponding to each preset audio data based on the I-vector calculation model. The preset process of the server may specifically be: the server extracts the preset audio features corresponding to each frame of data in each preset audio data and normalizes the preset audio features carrying effective data tags; an effective data tag is a tag identifying frame data that contains voice information (VAD detection can determine whether frame data contains voice information). The server trains the model parameters of the UBM based on the expectation-maximization (EM) algorithm and the normalized preset audio features carrying effective data tags, obtains the mean vector of the UBM after this training is finished, trains the low-rank matrix in the I-vector calculation model based on the mean vector of the UBM, the normalized preset audio features carrying effective data tags and a preset number of iterations, and, after the low-rank matrix training is finished, calculates the personalized tone vector corresponding to each preset audio data based on the I-vector calculation model; a personalized tone vector is an I-vector value. For the specific implementation of training the model parameters of the UBM and the low-rank matrix in the I-vector calculation model, reference may be made to the training processes of the UBM and the I-vector calculation model in S202 of the embodiment corresponding to fig. 2, which are not described again here.
When the server receives the user audio data sent by the client, it may extract the user audio feature corresponding to each frame of data in the user audio data (the user audio feature may be an MFCC audio feature). The server may then normalize the first datum in the MFCC audio features corresponding to each frame (the first datum of an MFCC feature vector represents the energy of the signal) to obtain the energy values of the signals to be matched, compare these energy values with a preset energy threshold, and identify each frame according to the comparison result, so as to distinguish frame data containing voice information from frame data not containing voice information; it sets an effective data tag for the frame data containing voice information and deletes the frame data not containing voice information. Finally, the server normalizes the user audio features corresponding to the frame data carrying effective data tags, inputs them into the I-vector calculation model, and calculates the w (I-vector) value corresponding to the user audio data (i.e., the personalized tone vector corresponding to the user audio data) through the model.
S304, the server respectively calculates the tone similarity between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data;
Specifically, when the server calculates the tone similarity between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data, it may use the cosine distance between the vectors to represent the tone similarity:

$$k(w_1, w_2) = \frac{\langle w_1, w_2\rangle}{\lVert w_1\rVert\,\lVert w_2\rVert},$$

where $w_1$ represents the personalized tone vector corresponding to the user audio data and $w_2$ represents the personalized tone vector corresponding to one of the preset audio data; the vector cosine distance between the user audio data and that preset audio data is therefore $k(w_1, w_2)$, i.e., the tone similarity is k.
S305, the server sorts the tone similarity between the user audio data and each preset audio data to obtain a tone similarity sorting table;
S306, the server obtains the first preset matching number of tone similarities in order from the tone similarity ranking table to serve as target tone similarities;
for example, if the preset matching number is 3, the server determines the first three tone similarities as the target tone similarities.
S307, the server acquires preset audio data corresponding to each target tone similarity as target audio data;
S308, the server sends the audio attribute information and the tone similarity corresponding to each target preset audio data to the client;
specifically, the audio attribute information may include a name of a song, a name of a singer, an avatar of the singer, and a content of a preset document corresponding to the target preset audio data, and for example, a certain content of the preset document may be "you also have dolphin sound".
S309, dynamically reducing the graph area corresponding to the audio quality score by the client, displaying the audio quality score after the graph area is reduced in a second preset display area, and displaying the audio attribute information and the tone similarity respectively corresponding to each target preset audio data in a first preset display area;
Specifically, when the client receives the audio attribute information and tone similarity corresponding to each target preset audio data, it may dynamically shrink the graphic area corresponding to the audio quality score, display the shrunken audio quality score in the second preset display area, and stop displaying the timbre similarity calculation prompt information. This frees part of the display area of the client's current interface (the freed part is the first preset display area), in which the audio attribute information and tone similarity corresponding to each target preset audio data can then be displayed.
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database, select target preset audio data according to the ranking of the tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is analyzed not only in the dimension of singing level but also in the dimension of tone similarity, the client can display both the audio quality score of the user audio data and the audio attribute information and tone similarity corresponding to the target preset audio data, so that the display content related to the analysis result of the user audio data is richer.
Referring to fig. 4, a flow chart of another audio data processing method according to an embodiment of the present invention is shown, where the method includes:
S401, a server receives user audio data sent by a client;
S402, the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
and S403, the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
The specific implementation manner of steps S401 to S403 may refer to steps S201 to S204 in the embodiment corresponding to fig. 2, which is not described herein again.
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database, select target preset audio data according to the ranking of the tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is analyzed not only in the dimension of singing level but also in the dimension of tone similarity, the client can display both the audio quality score of the user audio data and the audio attribute information and tone similarity corresponding to the target preset audio data, so that the display content related to the analysis result of the user audio data is richer.
Fig. 5 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present invention. The audio data processing apparatus 1 may be applied to a server, and the audio data processing apparatus 1 may include: the system comprises a preset extraction processing module 40, a preset training module 50, a preset calculation module 60, a receiving module 10, a calculation module 20 and a selection sending module 30;
the preset extraction processing module 40 is configured to extract preset audio features corresponding to each frame of data in each preset audio data, and perform normalization processing on the preset audio features carrying the valid data tags; the effective data label is a label used for identifying frame data containing voice information;
the preset training module 50 is configured to train a model parameter of the UBM based on a maximum expected EM algorithm and a normalized preset audio feature carrying an effective data label, and obtain a mean vector of the UBM after the training of the model parameter of the UBM is completed;
the preset training module 50 is further configured to train a low-rank matrix in an I-vector calculation model based on the average vector of the UBM, the normalized preset audio feature carrying the valid data label, and a preset iteration number;
the preset calculation module 60 is configured to calculate, based on the I-vector calculation model, personalized tone vectors corresponding to the preset audio data, respectively, after the low-rank matrix training is completed; the personalized timbre vector is an I-vector value.
For a specific implementation manner of the preset extraction processing module 40, the preset training module 50, and the preset calculation module 60, reference may be made to a specific process of the server pre-training the UBM and the I-vector calculation model in step S202 in the corresponding embodiment of fig. 2, which is not described herein again.
The receiving module 10 is configured to receive user audio data sent by a client;
specifically, the client may obtain user audio data input by the user. For example, when a user sings a song, the client may obtain a song recording audio of the user through a microphone, where the song recording audio is the user audio data. When the user finishes inputting the user audio data (for example, when recording is finished), the client may calculate the audio quality score corresponding to the acquired complete user audio data, and display the audio quality score and the timbre similarity calculation prompt information. Meanwhile, the client may further add the acquired complete user audio data to the timbre similarity calculation request, so that the receiving module 10 may receive the timbre similarity calculation request carrying the user audio data sent by the client.
The calculation module 20 is configured to extract user audio features of the user audio data and calculate, according to the user audio features, the tone similarity between the user audio data and each of a plurality of preset audio data in a preset audio database;
Specifically, please refer to fig. 6, which is a schematic structural diagram of the calculation module 20. The calculation module 20 may include: a feature extraction unit 201, a label setting unit 202, and a calculation unit 203;
the feature extraction unit 201 is configured to extract a user audio feature corresponding to each frame of data in the user audio data; the user audio feature may be a MFCC audio feature;
the tag setting unit 202 is configured to set an effective data tag for frame data containing voice information in the user audio data;
the calculating unit 203 is configured to calculate an individualized tone vector corresponding to the user audio data according to a user audio feature corresponding to the frame data carrying the valid data tag and a preset individualized tone calculation model; and the personalized tone calculation model is the I-vector calculation model. The calculating unit 203 may be specifically configured to perform normalization processing on the user audio features corresponding to the frame data with the valid data tags, input the normalized user audio features with the valid data tags into the I-vector calculation model, and calculate the personalized tone-color vector corresponding to the user audio data based on the I-vector calculation model.
The calculating unit 203 is further configured to calculate vector cosine distances between the personalized tone color vectors corresponding to the user audio data and the personalized tone color vectors corresponding to the preset audio data, respectively; wherein a vector cosine distance refers to a timbre similarity between the user audio data and a preset audio data.
Further, please refer to fig. 7 together, which is a schematic structural diagram of a tag setting unit 202 according to an embodiment of the present invention, where the tag setting unit 202 may include: a normalization processing subunit 2021, a matching identification subunit 2022, and a setting deletion subunit 2023;
the normalization processing subunit 2021 is configured to perform normalization processing on the first data in the MFCC audio features respectively corresponding to each frame of data in the user audio data to obtain an energy value of a signal to be matched;
the matching identification subunit 2022 is configured to compare energy values of signals to be matched respectively corresponding to each frame of data in the user audio data with a preset energy threshold, and identify each frame of data according to a comparison result, so as to identify frame data including voice information and frame data not including voice information;
the set deleting subunit 2023 is configured to set an effective data tag for the frame data including the voice information, and delete the frame data not including the voice information.
The selection sending module 30 is configured to select a preset matching number of target preset audio data from the plurality of preset audio data, and send the audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
Specifically, please refer to fig. 8, which is a schematic structural diagram of the selection sending module 30 according to an embodiment of the present invention. The selection sending module 30 may include: a sorting unit 301, a selecting unit 302, a data acquiring unit 303, and a sending unit 304;
the sorting unit 301 is configured to sort the tone similarity between the user audio data and each preset audio data to obtain a tone similarity sorting table;
the selecting unit 302 is configured to obtain the first preset matching number of tone similarities in order from the tone similarity ranking table as the target tone similarities; the number of target tone similarities is equal to the preset matching number;
the data obtaining unit 303 is configured to obtain preset audio data corresponding to each target tone similarity as target audio data;
the sending unit 304 is configured to send the audio attribute information and the tone similarity corresponding to each target preset audio data to the client.
When the client receives the audio attribute information and tone similarity corresponding to each target preset audio data, it may dynamically shrink the graphic area corresponding to the audio quality score, display the shrunken audio quality score in the second preset display area, and stop displaying the timbre similarity calculation prompt information. This frees part of the display area of the client's current interface (the freed part is the first preset display area), in which the audio attribute information and tone similarity corresponding to each target preset audio data can then be displayed.
Optionally, the selection sending module 30 is further configured to, when it is detected that the maximum tone similarity among the tone similarities corresponding to the target preset audio data is greater than a preset similarity threshold, send the audio attribute information of the target preset audio data corresponding to the maximum tone similarity, the maximum tone similarity, and the user information of the client to a plurality of friend clients having a friend association relationship with the user information of the client.
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database, select target preset audio data according to the ranking of the tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is analyzed not only in the dimension of singing level but also in the dimension of tone similarity, the client can display both the audio quality score of the user audio data and the audio attribute information and tone similarity corresponding to the target preset audio data, so that the display content related to the analysis result of the user audio data is richer.
Fig. 9 is a schematic structural diagram of another audio data processing apparatus according to an embodiment of the present invention. As shown in fig. 9, the audio data processing apparatus 1000 may be applied to a server and may include: at least one processor 1001 (e.g., a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, it may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the audio data processing apparatus 1000 shown in fig. 9, the network interface 1004 is mainly used for connecting a client; the user interface 1003 is mainly used for providing an input interface for a user and acquiring data output by the user; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement the following:
Receiving user audio data sent by a client;
extracting user audio features of the user audio data, and respectively calculating the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
and selecting a preset matching number of target preset audio data from the plurality of preset audio data, and sending the audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area, and displays the audio quality score corresponding to the user audio data in a second preset display area.
In an embodiment, when the processor 1001 selects a preset matching number of target preset audio data from the plurality of preset audio data and sends the audio attribute information and tone similarity corresponding to each target preset audio data to the client, the following steps are specifically performed:
sorting the tone similarity between the user audio data and each preset audio data to obtain a tone similarity sorting table;
obtaining the first preset matching number of tone similarities in order from the tone similarity ranking table as target tone similarities; the number of target tone similarities is equal to the preset matching number;
acquiring preset audio data corresponding to each target tone similarity as target audio data;
and sending the audio attribute information and the tone similarity corresponding to each target preset audio data to the client.
In one embodiment, when the processor 1001 extracts a user audio feature of the user audio data and calculates a timbre similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio feature, the following steps are specifically performed:
extracting user audio features corresponding to each frame of data in the user audio data;
setting an effective data label for frame data containing voice information in the user audio data;
calculating an individualized tone vector corresponding to the user audio data according to the user audio feature corresponding to the frame data carrying the effective data tag and a preset individualized tone calculation model; the personalized tone color calculation model is obtained by training based on a preset common tone color calculation model and the plurality of preset audio data;
respectively calculating vector cosine distances between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data;
wherein a vector cosine distance refers to a timbre similarity between the user audio data and a preset audio data.
In one embodiment, the user audio features are Mel-Frequency Cepstral Coefficient (MFCC) audio features;
then, when the processor 1001 sets an effective data tag for frame data containing voice information in the user audio data, the following steps are specifically performed:
normalizing the first data in the MFCC audio features respectively corresponding to each frame of data in the user audio data to obtain an energy value of a signal to be matched;
comparing the energy value of the signal to be matched corresponding to each frame of data in the user audio data with a preset energy threshold value respectively, and identifying each frame of data according to the comparison result so as to identify the frame of data containing the voice information and the frame of data not containing the voice information;
and setting an effective data label for the frame data containing the voice information, and deleting the frame data not containing the voice information.
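A minimal sketch of this energy-gated frame selection, assuming min-max normalization of the first MFCC element and an illustrative threshold value (the disclosure fixes neither):

```python
import numpy as np

def keep_voiced_frames(mfcc: np.ndarray, energy_threshold: float = 0.3) -> np.ndarray:
    """mfcc: array of shape (num_frames, num_coeffs); column 0 holds the
    energy-related first element of each frame's MFCC feature."""
    c0 = mfcc[:, 0]
    # Min-max normalization of the first element -> energy value of the signal to be matched
    energy = (c0 - c0.min()) / (c0.max() - c0.min() + 1e-10)
    voiced = energy > energy_threshold  # frames treated as containing voice information
    return mfcc[voiced]                 # frames without voice information are deleted
```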
In one embodiment, the common tone calculation model is a universal background model (UBM), and the personalized tone calculation model is an I-vector calculation model;
the processor 1001, before performing the step of extracting the user audio feature of the user audio data, further performs the following steps:
extracting the preset audio features corresponding to each frame of data in each piece of preset audio data, and performing normalization processing on the preset audio features carrying the effective data label; the effective data label is a label used for identifying frame data containing voice information;
training the model parameters of the UBM based on the expectation-maximization (EM) algorithm and the normalized preset audio features carrying the effective data label, and acquiring the mean vector of the UBM after the training of the model parameters of the UBM is completed;
training a low-rank matrix in the I-vector calculation model based on the mean vector of the UBM, the normalized preset audio features carrying the effective data label, and a preset number of iterations;
after the low-rank matrix training is finished, calculating the personalized tone vector corresponding to each piece of preset audio data based on the I-vector calculation model; the personalized tone vector is an I-vector value.
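A sketch of the UBM side of this pipeline, using `sklearn.mixture.GaussianMixture` (whose `fit` runs the EM algorithm internally) as the UBM, and stopping at the Baum-Welch sufficient statistics on which the iterative training of the low-rank matrix is built; the EM updates of the low-rank matrix itself are omitted, and the data here is a random stand-in for real MFCC features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Pool the normalized, effective-data-labeled MFCC frames of all preset audio.
pooled_frames = np.random.randn(5000, 20)  # placeholder for real MFCC features

# The UBM: a diagonal-covariance Gaussian mixture trained with EM.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=100)
ubm.fit(pooled_frames)
ubm_mean_vectors = ubm.means_  # one mean vector per Gaussian component

def baum_welch_stats(frames: np.ndarray, ubm: GaussianMixture):
    """Zeroth- and centered first-order statistics of one utterance against
    the UBM; these feed the iterative estimation of the low-rank matrix
    and, later, the extraction of the I-vector value."""
    post = ubm.predict_proba(frames)               # frame-level component posteriors
    n = post.sum(axis=0)                           # zeroth-order stats, shape (C,)
    f = post.T @ frames - n[:, None] * ubm.means_  # centered first-order stats, (C, D)
    return n, f

n_stats, f_stats = baum_welch_stats(pooled_frames[:300], ubm)
```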
In an embodiment, when the processor 1001 calculates the personalized tone vector corresponding to the user audio data according to the user audio features corresponding to the frame data carrying the effective data label and the preset personalized tone calculation model, the following steps are specifically performed:
carrying out normalization processing on the user audio features corresponding to the frame data carrying the effective data label;
and inputting the normalized user audio features carrying the effective data labels into the I-vector calculation model, and calculating the personalized tone vector corresponding to the user audio data based on the I-vector calculation model.
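The disclosure does not fix the normalization method; per-utterance cepstral mean and variance normalization (CMVN) is one common choice and is sketched here as an assumption:

```python
import numpy as np

def cmvn(features: np.ndarray) -> np.ndarray:
    # Normalize each MFCC dimension to zero mean and unit variance over the
    # utterance before feeding the features to the I-vector calculation model.
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-10)
```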
In one embodiment, the processor 1001 further performs the steps of:
and when it is detected that the maximum tone similarity among the tone similarities respectively corresponding to the target preset audio data is greater than a preset similarity threshold, sending the audio attribute information of the target preset audio data corresponding to the maximum tone similarity, the maximum tone similarity, and the user information of the client to a plurality of friend clients that have a friend association relationship with the user information of the client.
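A minimal server-side sketch of this sharing rule; the threshold value, the `send` transport callback, and all names are assumptions for illustration:

```python
SIMILARITY_THRESHOLD = 0.8  # the "preset similarity threshold" (value assumed)

def maybe_share(targets, user_info, friend_clients, send):
    """targets: (audio_attribute_info, tone_similarity) pairs for the
    target preset audio data. Pushes the best match to every friend client
    when it clears the threshold."""
    attr, best = max(targets, key=lambda t: t[1])
    if best > SIMILARITY_THRESHOLD:
        for friend in friend_clients:
            send(friend, {"attributes": attr, "similarity": best, "user": user_info})
```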
The client sends the acquired user audio data to the server, so that the server can calculate the tone similarities between the user audio data and a plurality of preset audio data in a preset audio database, select the target preset audio data according to the ranking of those tone similarities, and send the audio attribute information and tone similarity corresponding to the target preset audio data to the client. Because the user audio data is not analyzed only in the dimension of singing level, the client can display not only the audio quality score of the user audio data but also the audio attribute information and tone similarity corresponding to the target preset audio data, so the display content related to the analysis result of the user audio data is richer.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the appended claims.
Claims (18)
1. A method of audio data processing, comprising:
the method comprises the steps that a client side obtains user audio data and sends the user audio data to a server;
the client calculates the audio quality score corresponding to the user audio data, and displays the audio quality score and the tone similarity calculation prompt information;
the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends audio attribute information and tone similarity corresponding to each target preset audio data to the client, wherein the audio attribute information comprises a singer name, a singer avatar, and preset document content corresponding to the target preset audio data;
and when the client receives the audio attribute information and tone similarity corresponding to each target preset audio data, the client dynamically reduces the graphic area corresponding to the audio quality score, displays the audio quality score with the reduced graphic area in a second preset display area, cancels the display of the tone similarity calculation prompt information, and displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area.
2. The method of claim 1, wherein the server selecting a preset matching number of target preset audio data from the plurality of preset audio data and sending the audio attribute information and tone similarity corresponding to each target preset audio data to the client comprises:
sorting the tone similarities between the user audio data and each preset audio data to obtain a tone similarity ranking list;
sequentially obtaining a preset matching number of tone similarities from the tone similarity ranking list as target tone similarities, the number of target tone similarities being equal to the preset matching number;
acquiring the preset audio data corresponding to each target tone similarity as the target preset audio data;
and sending the audio attribute information and the tone similarity corresponding to each target preset audio data to the client.
3. The method of claim 1, wherein the server extracting the user audio features of the user audio data and respectively calculating, according to the user audio features, the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database comprises:
extracting user audio features corresponding to each frame of data in the user audio data;
setting an effective data label for frame data containing voice information in the user audio data;
calculating a personalized tone vector corresponding to the user audio data according to the user audio features corresponding to the frame data carrying the effective data label and a preset personalized tone calculation model; the personalized tone calculation model is obtained by training based on a preset common tone calculation model and the plurality of preset audio data;
respectively calculating vector cosine distances between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data;
where each vector cosine distance represents the tone similarity between the user audio data and one piece of preset audio data.
4. The method of claim 3, wherein the user audio features are Mel-frequency cepstral coefficient (MFCC) audio features;
then, the setting of an effective data label for frame data containing voice information in the user audio data includes:
normalizing the first element of the MFCC audio feature corresponding to each frame of data in the user audio data to obtain the energy value of the signal to be matched;
comparing the energy value of the signal to be matched corresponding to each frame of data in the user audio data with a preset energy threshold, and marking each frame of data according to the comparison result, so as to distinguish the frame data containing voice information from the frame data not containing voice information;
and setting an effective data label for the frame data containing the voice information, and deleting the frame data not containing the voice information.
5. The method of claim 3, wherein the common tone calculation model is a universal background model (UBM), and the personalized tone calculation model is an I-vector calculation model;
and wherein, before the step of extracting the user audio features of the user audio data, the method further comprises:
the server extracts the preset audio features corresponding to each frame of data in each piece of preset audio data, and performs normalization processing on the preset audio features carrying the effective data label; the effective data label is a label used for identifying frame data containing voice information;
training the model parameters of the UBM based on the expectation-maximization (EM) algorithm and the normalized preset audio features carrying the effective data label, and acquiring the mean vector of the UBM after the training of the model parameters of the UBM is completed;
training a low-rank matrix in the I-vector calculation model based on the mean vector of the UBM, the normalized preset audio features carrying the effective data label, and a preset number of iterations;
after the low-rank matrix training is finished, calculating the personalized tone vector corresponding to each piece of preset audio data based on the I-vector calculation model; the personalized tone vector is an I-vector value.
6. The method of claim 5, wherein the calculating of the personalized tone vector corresponding to the user audio data according to the user audio features corresponding to the frame data carrying the effective data label and the preset personalized tone calculation model comprises:
carrying out normalization processing on the user audio features corresponding to the frame data carrying the effective data label;
and inputting the normalized user audio features carrying the effective data labels into the I-vector calculation model, and calculating the personalized tone vector corresponding to the user audio data based on the I-vector calculation model.
7. The method of claim 1, further comprising:
when the server detects that the maximum tone similarity among the tone similarities respectively corresponding to the target preset audio data is greater than a preset similarity threshold, the server sends the audio attribute information of the target preset audio data corresponding to the maximum tone similarity, the maximum tone similarity, and the user information of the client to a plurality of friend clients that have a friend association relationship with the user information of the client.
8. A method of audio data processing, comprising:
the server receives user audio data sent by a client, wherein the user audio data is used for calculating a corresponding audio quality score and displaying the audio quality score and tone similarity calculation prompt information;
the server extracts the user audio features of the user audio data and respectively calculates the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
the server selects a preset matching number of target preset audio data from the plurality of preset audio data, and sends audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area and displays the audio quality score corresponding to the user audio data in a second preset display area, wherein the second preset display area is used for displaying the audio quality score after its graphic area is dynamically reduced, and the audio attribute information comprises a song name, a singer avatar, and preset document content corresponding to the target preset audio data.
9. An audio data processing apparatus, comprising:
the receiving module is used for receiving user audio data sent by a client, wherein the user audio data is used for calculating a corresponding audio quality score and displaying the audio quality score and tone similarity calculation prompt information;
the calculation module is used for extracting the user audio features of the user audio data and respectively calculating the tone similarity between the user audio data and a plurality of preset audio data in a preset audio database according to the user audio features;
the selection sending module is used for selecting a preset matching number of target preset audio data from the plurality of preset audio data and sending audio attribute information and tone similarity corresponding to each target preset audio data to the client, so that the client displays the audio attribute information and tone similarity corresponding to each target preset audio data in a first preset display area and displays the audio quality score corresponding to the user audio data in a second preset display area, wherein the second preset display area displays the audio quality score after its graphic area is dynamically reduced, and the audio attribute information comprises a song name, a singer avatar, and preset document content corresponding to the target preset audio data.
10. The apparatus of claim 9, wherein the selection sending module comprises:
the sorting unit is used for sorting the tone similarities between the user audio data and each preset audio data to obtain a tone similarity ranking list;
the selecting unit is used for sequentially obtaining a preset matching number of tone similarities from the tone similarity ranking list as target tone similarities, the number of target tone similarities being equal to the preset matching number;
the data acquisition unit is used for acquiring the preset audio data corresponding to each target tone similarity as the target preset audio data;
and the sending unit is used for sending the audio attribute information and the tone similarity corresponding to each target preset audio data to the client.
11. The apparatus of claim 9, wherein the calculation module comprises:
the feature extraction unit is used for extracting the user audio features corresponding to each frame of data in the user audio data;
the label setting unit is used for setting an effective data label for frame data containing voice information in the user audio data;
the calculation unit is used for calculating a personalized tone vector corresponding to the user audio data according to the user audio features corresponding to the frame data carrying the effective data label and a preset personalized tone calculation model; the personalized tone calculation model is obtained by training based on a preset common tone calculation model and the plurality of preset audio data;
the calculation unit is further configured to respectively calculate the vector cosine distances between the personalized tone vector corresponding to the user audio data and the personalized tone vector corresponding to each preset audio data;
where each vector cosine distance represents the tone similarity between the user audio data and one piece of preset audio data.
12. The apparatus of claim 11, wherein the user audio features are Mel-frequency cepstral coefficient (MFCC) audio features;
the label setting unit includes:
the normalization processing subunit is configured to normalize the first element of the MFCC audio feature corresponding to each frame of data in the user audio data to obtain the energy value of the signal to be matched;
the matching identification subunit is used for comparing the energy value of the signal to be matched corresponding to each frame of data in the user audio data with a preset energy threshold, and marking each frame of data according to the comparison result, so as to distinguish the frame data containing voice information from the frame data not containing voice information;
and the deleting subunit is configured to set an effective data label for the frame data containing the voice information and delete the frame data not containing the voice information.
13. The apparatus of claim 11, wherein the common tone calculation model is a universal background model (UBM), and the personalized tone calculation model is an I-vector calculation model;
the audio data processing apparatus further comprises:
the preset extraction processing module is used for extracting the preset audio features corresponding to each frame of data in each piece of preset audio data and performing normalization processing on the preset audio features carrying the effective data label; the effective data label is a label used for identifying frame data containing voice information;
the preset training module is used for training the model parameters of the UBM based on the expectation-maximization (EM) algorithm and the normalized preset audio features carrying the effective data label, and acquiring the mean vector of the UBM after the training of the model parameters of the UBM is completed;
the preset training module is further used for training a low-rank matrix in the I-vector calculation model based on the mean vector of the UBM, the normalized preset audio features carrying the effective data label, and a preset number of iterations;
and the preset calculation module is used for calculating, after the low-rank matrix training is finished, the personalized tone vector corresponding to each piece of preset audio data based on the I-vector calculation model; the personalized tone vector is an I-vector value.
14. The apparatus of claim 13,
the calculation unit is specifically configured to perform normalization processing on the user audio features corresponding to the frame data carrying the effective data label, input the normalized user audio features carrying the effective data label into the I-vector calculation model, and calculate the personalized tone vector corresponding to the user audio data based on the I-vector calculation model.
15. The apparatus of claim 9,
the selection sending module is further configured to, when it is detected that the maximum tone similarity among the tone similarities corresponding to the target preset audio data is greater than a preset similarity threshold, send the audio attribute information of the target preset audio data corresponding to the maximum tone similarity, the maximum tone similarity, and the user information of the client to a plurality of friend clients that have a friend association relationship with the user information of the client.
16. An audio data processing system comprising a client and a server;
the client is used for acquiring user audio data, sending the user audio data to the server, calculating the audio quality score corresponding to the user audio data, and displaying the audio quality score and tone similarity calculation prompt information; the client is further used for dynamically reducing the graphic area corresponding to the audio quality score, displaying the audio quality score with the reduced graphic area in a second preset display area, canceling the display of the tone similarity calculation prompt information, and displaying the audio attribute information and tone similarities corresponding to the target preset audio data in a first preset display area, wherein the audio attribute information comprises a song name, a singer avatar, and preset document content corresponding to the target preset audio data;
the server comprising the audio data processing apparatus according to any one of claims 9-15.
17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when loaded and executed by a processor, performs the method steps of claim 8.
18. An audio data processing apparatus, comprising: a processor, memory, and a network interface; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610571692.4A CN106250400B (en) | 2016-07-19 | 2016-07-19 | Audio data processing method, device and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610571692.4A CN106250400B (en) | 2016-07-19 | 2016-07-19 | Audio data processing method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250400A CN106250400A (en) | 2016-12-21 |
CN106250400B true CN106250400B (en) | 2021-03-26 |
Family
ID=57613853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610571692.4A Active CN106250400B (en) | 2016-07-19 | 2016-07-19 | Audio data processing method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250400B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106873773B (en) * | 2017-01-09 | 2021-02-05 | 北京奇虎科技有限公司 | Robot interaction control method, server and robot |
CN108806720B (en) * | 2017-05-05 | 2019-12-06 | 京东方科技集团股份有限公司 | Microphone, data processor, monitoring system and monitoring method |
CN108510812A (en) * | 2018-04-13 | 2018-09-07 | 上海思依暄机器人科技股份有限公司 | A kind of interactive learning methods, device and robot |
CN112863547B (en) * | 2018-10-23 | 2022-11-29 | 腾讯科技(深圳)有限公司 | Virtual resource transfer processing method, device, storage medium and computer equipment |
CN109448735B (en) * | 2018-12-21 | 2022-05-20 | 深圳创维-Rgb电子有限公司 | Method and device for adjusting video parameters based on voiceprint recognition and read storage medium |
CN110265051A (en) * | 2019-06-04 | 2019-09-20 | 福建小知大数信息科技有限公司 | The sightsinging audio intelligent scoring modeling method of education is sung applied to root LeEco |
CN110728972B (en) * | 2019-10-15 | 2022-02-11 | 广州酷狗计算机科技有限公司 | Method and device for determining tone similarity and computer storage medium |
CN110956971B (en) * | 2019-12-03 | 2023-08-01 | 广州酷狗计算机科技有限公司 | Audio processing method, device, terminal and storage medium |
CN111046216B (en) * | 2019-12-06 | 2024-02-09 | 广州国音智能科技有限公司 | Audio information access method, device, equipment and computer readable storage medium |
CN111883106B (en) * | 2020-07-27 | 2024-04-19 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
CN114697759B (en) * | 2022-04-25 | 2024-04-09 | 中国平安人寿保险股份有限公司 | Virtual image video generation method and system, electronic device and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080300702A1 (en) * | 2007-05-29 | 2008-12-04 | Universitat Pompeu Fabra | Music similarity systems and methods using descriptors |
CN101398825B (en) * | 2007-09-29 | 2013-07-03 | 三星电子株式会社 | Rapid music assorting and searching method and device |
JP5115966B2 (en) * | 2007-11-16 | 2013-01-09 | 独立行政法人産業技術総合研究所 | Music retrieval system and method and program thereof |
CN101465107B (en) * | 2008-12-31 | 2010-12-08 | 华为终端有限公司 | Display device and terminal using the same, and display method |
CN104183245A (en) * | 2014-09-04 | 2014-12-03 | 福建星网视易信息系统有限公司 | Method and device for recommending music stars with tones similar to those of singers |
CN104657438A (en) * | 2015-02-02 | 2015-05-27 | 联想(北京)有限公司 | Information processing method and electronic equipment |
CN105243093A (en) * | 2015-09-11 | 2016-01-13 | 福建星网视易信息系统有限公司 | Singer recommendation method and apparatus |
2016-07-19: application CN201610571692.4A filed in China; granted as patent CN106250400B (status: Active).
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177722A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Tone-similarity-based song retrieval method |
CN105677265A (en) * | 2014-11-18 | 2016-06-15 | 中兴通讯股份有限公司 | Display method and terminal |
CN104484426A (en) * | 2014-12-18 | 2015-04-01 | 天津讯飞信息科技有限公司 | Multi-mode music searching method and system |
CN104882147A (en) * | 2015-06-05 | 2015-09-02 | 福建星网视易信息系统有限公司 | Method, device and system for displaying singing score |
CN105068779A (en) * | 2015-08-18 | 2015-11-18 | 北京恒华伟业科技股份有限公司 | Display control method and apparatus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113127676A (en) * | 2021-05-17 | 2021-07-16 | 杭州网易云音乐科技有限公司 | Information matching method, system, device, storage medium and electronic equipment |
CN113127676B (en) * | 2021-05-17 | 2022-07-01 | 杭州网易云音乐科技有限公司 | Information matching method, system, device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN106250400A (en) | 2016-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106250400B (en) | Audio data processing method, device and system | |
CN107357875B (en) | Voice search method and device and electronic equipment | |
US10832686B2 (en) | Method and apparatus for pushing information | |
CN110136727B (en) | Speaker identification method, device and storage medium based on speaking content | |
CN107481720B (en) | Explicit voiceprint recognition method and device | |
US11189277B2 (en) | Dynamic gazetteers for personalized entity recognition | |
US10032454B2 (en) | Speaker and call characteristic sensitive open voice search | |
CN108874895B (en) | Interactive information pushing method and device, computer equipment and storage medium | |
US20170164049A1 (en) | Recommending method and device thereof | |
CN112530408A (en) | Method, apparatus, electronic device, and medium for recognizing speech | |
CN110990685B (en) | Voiceprint-based voice searching method, voiceprint-based voice searching equipment, storage medium and storage device | |
US20090234854A1 (en) | Search system and search method for speech database | |
CN111737414B (en) | Song recommendation method and device, server and storage medium | |
TW202018696A (en) | Voice recognition method and device and computing device | |
EP3593346B1 (en) | Graphical data selection and presentation of digital content | |
JP6996627B2 (en) | Information processing equipment, control methods, and programs | |
CN110728983A (en) | Information display method, device, equipment and readable storage medium | |
CN109119073A (en) | Audio recognition method, system, speaker and storage medium based on multi-source identification | |
CN116959498A (en) | Music adding method, device, computer equipment and computer readable storage medium | |
CN107767862B (en) | Voice data processing method, system and storage medium | |
CN114817582A (en) | Resource information pushing method and electronic device | |
CN108777804B (en) | Media playing method and device | |
JP7200189B2 (en) | Search assistance system, information providing server, information providing method, and program | |
CN116403578A (en) | Voice interaction method and device, electronic equipment and storage medium | |
CN111090769A (en) | Song recommendation method, device, equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||