
CN111309965A - Audio matching method and device, computer equipment and storage medium - Google Patents

Audio matching method and device, computer equipment and storage medium

Info

Publication number
CN111309965A
CN111309965A
Authority
CN
China
Prior art keywords
audio
vector
scale
frequency domain
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010201517.2A
Other languages
Chinese (zh)
Other versions
CN111309965B (en)
Inventor
缪畅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010201517.2A priority Critical patent/CN111309965B/en
Publication of CN111309965A publication Critical patent/CN111309965A/en
Application granted granted Critical
Publication of CN111309965B publication Critical patent/CN111309965B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio matching method and apparatus, a computer device, and a storage medium, relating to the technical field of audio. The method comprises the following steps: acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio; matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors at different scales; splicing the plurality of matched frequency domain vectors at different scales to obtain a prediction vector; and calling a classification layer to predict on the prediction vector and output the similarity probability of the first audio and the second audio. Because the similarity of the two audios is calculated with a neural-network-based matching approach, the similarity between different songs can be calculated, so that a higher-precision similarity calculation result is obtained between different songs.

Description

Audio matching method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of audio, in particular to an audio matching method, an audio matching device, computer equipment and a storage medium.
Background
Audio matching is a technique for measuring the similarity between two audios. By matching type, audio matching includes: audio segment matching and full audio matching. Audio segment matching means that, given an audio segment P, it is judged whether the segment P belongs to a part of an audio D. Full audio matching means that, given an audio A, the similarity between audio A and an audio B is calculated.
The related art provides audio fingerprinting: relatively salient time-frequency points are selected in an audio file and encoded into a digital sequence by hash coding, and this digital sequence serves as the audio fingerprint. Audio fingerprinting thus converts the audio matching problem into a retrieval problem between different digital sequences.
Because audio segment matching mainly matches an audio segment against the full audio of the same song, the signal-processing-based audio fingerprint technology achieves a good matching effect in the audio segment matching scene. In the full audio matching scene, however, what is mostly calculated is the similarity between two different songs; here the applicability of audio fingerprinting is limited, and a good matching effect cannot be obtained.
Disclosure of Invention
The embodiment of the application provides an audio matching method, an audio matching device, computer equipment and a storage medium, and provides a matching scheme suitable for a full audio matching scene. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an audio matching method, where the method includes:
acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
splicing the plurality of matched frequency domain vectors under different scales to obtain a prediction vector;
and calling a classification layer to predict the prediction vector, and outputting the similarity probability of the first audio and the second audio.
In another aspect, an embodiment of the present application provides an audio matching apparatus, where the apparatus includes:
the acquisition module is used for acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
the matching module is used for matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
the splicing module is used for splicing the plurality of matched frequency domain vectors under different scales to obtain a prediction vector;
and the prediction module is used for calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
In another aspect, embodiments of the present application provide a computer device including a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the audio matching method according to the above aspect.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the audio matching method as described in the above aspect.
In another aspect, a computer program product is provided which, when run on a computer, causes the computer to perform the audio matching method as described in the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
because the multi-scale vector sequence uses feature vectors at multiple scales to represent the latent and deep features of the audio, taking the multi-scale vector sequences of the two audios as input and calculating their similarity with a neural-network-based matching approach makes it possible to calculate the similarity between different songs, so that a higher-precision similarity calculation result is obtained between different songs.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a block diagram of an audio matching system provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio feature extraction method provided by an exemplary embodiment of the present application;
FIG. 5 is a spectral diagram of audio provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 7 is a flowchart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 8 is a flow chart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a time domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of frequency domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a concatenation of feature vectors provided by an exemplary embodiment of the present application;
FIG. 12 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 13 is a flow chart of online matching provided by an exemplary embodiment of the present application;
FIG. 14 illustrates a schematic diagram of a song recommendation scenario provided by an exemplary embodiment of the present application;
FIG. 15 illustrates a schematic diagram of a song scoring scenario provided by an exemplary embodiment of the present application;
FIG. 16 is a flow chart of a model training method provided by an exemplary embodiment of the present application;
fig. 17 is a block diagram illustrating an exemplary embodiment of an audio matching apparatus according to the present application;
fig. 18 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Fig. 1 is a block diagram of an audio matching system 100 provided by an exemplary embodiment of the present application. The audio matching system 100 includes: computer device 120, repository 140, server 160, and terminal 180.
The computer device 120 is a computer or server used by the developer. The computer device 120 is able to calculate the multi-scale vector sequence for all audio in the audio library off-line. The computer device 120 stores the multi-scale vector sequence for all audio in the repository 140.
The computer device 120 and the repository 140 are connected using a wired or wireless network.
The repository 140 stores the audio IDs and multi-scale vector sequences of a plurality of audios. The correspondence between audio IDs and multi-scale vector sequences can be regarded as an "audio library". Of course, the audio library may also include the audio files themselves, as well as singer, genre, album, source, and other information.
The server 160 may be implemented as one server, or may be implemented as a server cluster formed by a group of servers, which may be physical servers or cloud servers. In one possible embodiment, the server 160 is a backend server for applications or applets or web programs in the terminal 180.
The server 160 and the repository 140 are connected by a wired network, a wireless network, or a data line.
The terminal 180 is an electronic device used by a user. The terminal 180 may be a mobile terminal such as a tablet computer or a laptop computer, or a fixed terminal such as a desktop computer or a projection computer, which is not limited in this embodiment of the application.
In one application scenario, the terminal 180 provides two audios to the server 160, a first audio and a second audio, and requests the server 160 to calculate the similarity between the first audio and the second audio. The server 160 feeds the similarity between the first audio and the second audio back to the terminal 180.
In another application scenario, the terminal 180 provides the first audio to the server 160, the server 160 determines other audio in the audio library as the second audio, calculates the similarity between the first audio and the second audio, and feeds back the second audio with the highest similarity to the terminal 180.
In the above illustrative example, the entire audio matching process is divided into two parts: an "offline storage stage" and a "retrieval matching stage". The offline storage stage extracts a multi-scale vector sequence from each audio in the audio library and stores the extracted sequences in the repository; the retrieval matching stage queries the corresponding multi-scale vector sequences according to the audio IDs of the first audio and the second audio, and performs multi-scale matching and classification on the multi-scale vector sequences of the two audios.
First, the "search matching stage" is introduced:
fig. 2 is a flowchart of an audio matching method according to an exemplary embodiment of the present application. The present embodiment is illustrated with the method applied to the server 160. The method comprises the following steps:
step 202, obtaining a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
the first multi-scale vector sequence comprises: k first feature vectors of different scales. Each first feature vector is used for representing the frequency distribution of the audio frequency under a certain scale. The scale means the vector dimension and the number of the first feature vectors, and the vector dimensions of the first feature vectors under different scales are different, or the number of the first feature vectors is different, or the vector dimensions and the number of the first feature vectors are different. Here, the scale refers to the size of a convolution kernel used when extracting the (first) feature vector.
The second multi-scale vector sequence comprises: and K second feature vectors with different scales. Each second feature vector is used for representing the frequency distribution of the audio frequency under a certain scale. The scale means the vector dimension and the number of the second feature vectors, and the vector dimensions of the second feature vectors under different scales are different, or the number of the second feature vectors is different, or the vector dimensions and the number of the second feature vectors are different. Here, the scale refers to the size of a convolution kernel used when extracting the (second) feature vector.
The first multi-scale vector sequence and the second multi-scale vector sequence have the same vector dimensions, vector numbers, and physical meanings; they differ only in being extracted from the audio files of two different audios.
The server may calculate the first multi-scale vector sequence and/or the second multi-scale vector sequence in real time, or may read the first multi-scale vector sequence that has been calculated offline from the repository according to the audio ID of the first audio, and read the second multi-scale vector sequence that has been calculated offline from the repository according to the audio ID of the second audio, which is not limited in this embodiment.
Step 204, matching the feature vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched feature vectors under different scales;
since the first multi-scale vector sequence and the second multi-scale vector sequence both include feature vectors at K different scales, there are K groups of feature vectors belonging to the same scale.
For each group of two feature vectors belonging to the same scale, the server performs a matching calculation on the two feature vectors to obtain a matched feature vector. Calculating each of the K groups of feature vectors respectively yields K matched feature vectors at different scales.
Step 206, splicing the matched feature vectors under different scales to obtain a prediction vector;
splicing the matched feature vectors under K different scales according to the order of the scales from large to small to obtain a prediction vector; or splicing the matched feature vectors under the K different scales according to the order of the scales from small to large to obtain the prediction vector.
And step 208, calling a classification layer to predict the prediction vector, and outputting the similarity probability of the first audio and the second audio.
Optionally, the classification layer is a softmax function, the input is a prediction vector for the first audio and the second audio, and the output is a probability of similarity for the first audio and the second audio. The server performs at least one of audio recommendation, audio scoring, audio classification, and audio matching according to the similarity probability of the two audios.
In the personalized recommendation scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with higher similarity to the first audio, and recommending the second audio to the client.
In the audio scoring scene, the server is used for obtaining a first feature vector of a first audio provided by the client and a second feature vector of a second audio in the audio library, calculating the similarity between the first audio and the second audio with the audio matching model, and returning the similarity score of the second audio to the client.
In the audio matching scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with extremely high similarity to the first audio, and recommending audio information (information such as song title, singer, style, year, record company and the like) of the second audio to the client.
In the audio classification scene, the server is used for calculating the similarity between every two songs in the audio library, and classifying the songs with the similarity higher than a threshold value into the same class cluster so as to divide the songs into the same class.
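The audio classification scenario above, where songs whose pairwise similarity exceeds a threshold are placed in the same class cluster, can be sketched with a simple union-find pass over a similarity matrix. The matrix values, threshold, and function names below are illustrative assumptions, not part of the patent:

```python
import numpy as np

def cluster_by_similarity(sim, threshold=0.8):
    # Union-find over songs: any pair whose similarity exceeds the
    # threshold ends up in the same class cluster.
    n = sim.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]     # one cluster label per song

# Toy pairwise-similarity matrix for three songs.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
labels = cluster_by_similarity(sim)        # songs 0 and 1 share a cluster
```

In practice the similarity matrix would be filled by the classification layer's output probabilities for every pair of songs.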
In summary, because the multi-scale vector sequence represents the potential features and deep features of the audio by using the frequency domain vectors under multiple scales, the similarity between two audios is calculated by using the multi-scale vector sequence of the two audios as input and using a matching method based on a neural network, so that the similarity between different songs can be calculated, and thus a similarity calculation result with higher precision is obtained.
In an alternative embodiment based on fig. 2, step 204 includes the following steps 2041 to 2044, as shown in fig. 3:
2041, multiplying element-wise the first eigenvector and the second eigenvector of the same scale to obtain a first vector;
let the first multi-scale vector sequence comprise K first eigenvectors { hA1, hA2, …, hAk }, each eigenvector having a different scale; the second multi-scale vector sequence includes K second eigenvectors { hB1, hB2, …, hBk }, each second eigenvector having a different scale.
The feature vectors in the same position in the two multi-scale vector sequences belong to the same scale. For example, the first eigenvector hA1 and the second eigenvector hB1 belong to the same scale; the first eigenvector hA2 and the second eigenvector hB2 belong to the same scale, …, and the first eigenvector hAk and the second eigenvector hBk belong to the same scale.
The first eigenvector and the second eigenvector belonging to the same scale are multiplied to obtain a first vector. For example: hA1 * hB1, hA2 * hB2, …, hAk * hBk.
Step 2042, subtracting the second eigenvector from the first eigenvector of the same scale to obtain a second vector;
for example, hA1-hB1, hA2-hB2, …, hAk-hBk.
Step 2043, subtracting the first eigenvector from the second eigenvector of the same scale to obtain a third vector;
for example, hB1-hA1, hB2-hA2, …, hBk-hAk.
And 2044, splicing the first vector, the second vector and the third vector in the ith scale to obtain a matched feature vector in the ith scale, wherein i is an integer not greater than K.
The first vector, the second vector, and the third vector at the ith scale are spliced into hAB_i = {hAi * hBi, hAi - hBi, hBi - hAi}.
For example, the first vector, the second vector, and the third vector at the 1st scale are spliced to obtain the matched feature vector {hA1 * hB1, hA1 - hB1, hB1 - hA1} at the 1st scale; the first vector, the second vector, and the third vector at the 2nd scale are spliced to obtain the matched feature vector {hA2 * hB2, hA2 - hB2, hB2 - hA2} at the 2nd scale; …; the first vector, the second vector, and the third vector at the Kth scale are spliced to obtain the matched feature vector {hAk * hBk, hAk - hBk, hBk - hAk} at the Kth scale.
Since the first vector, the second vector and the third vector under each scale are spliced, K matched feature vectors can be obtained.
Since h1 to hk represent feature vectors of different scales, the above process can be called: multi-scale matching.
After the K matched eigenvectors are obtained through calculation, they are spliced in order of scale from large to small, or from small to large, to obtain the prediction vector. The prediction vector is then input into the classification layer for prediction, and the probability output by the classification layer is the similarity of the two audios.
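The multi-scale matching, splicing, and classification steps above can be sketched as follows. The linear classifier weights `w` and the toy dimensions are random stand-ins for the trained classification layer, which the patent does not specify:

```python
import numpy as np

def match_scale(h_a, h_b):
    # One scale's matching: element-wise product plus both signed
    # differences, spliced into a single matched feature vector.
    return np.concatenate([h_a * h_b, h_a - h_b, h_b - h_a])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def similarity_probability(seq_a, seq_b, w, b):
    # seq_a / seq_b: K per-scale feature vectors of the two audios.
    matched = [match_scale(a, b) for a, b in zip(seq_a, seq_b)]
    prediction = np.concatenate(matched)    # spliced prediction vector
    return softmax(w @ prediction + b)[1]   # probability the audios are similar

# Toy run with K = 2 scales and random (untrained) classifier weights.
rng = np.random.default_rng(0)
seq_a = [rng.normal(size=4), rng.normal(size=8)]
seq_b = [rng.normal(size=4), rng.normal(size=8)]
w = rng.normal(size=(2, 3 * (4 + 8)))       # each scale contributes 3x its dimension
p = similarity_probability(seq_a, seq_b, w, np.zeros(2))
```

With trained weights, `p` would be read as the similarity probability of the two audios.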
In summary, in the method provided in this embodiment, by matching the multi-scale vector sequences of the two audios, the two audios can be compared from different feature levels, so that the matching accuracy when the two audios are matched is improved, and a more accurate probability can be output as the similarity of the two audios.
Next, an "offline storage phase" is introduced:
fig. 4 is a flowchart of a method for extracting a multi-scale vector sequence according to an exemplary embodiment of the present application. The embodiment is exemplified by applying the method to the computer device or the server shown in fig. 1.
The method comprises the following steps:
step 402, acquiring a characteristic sequence of the audio;
the sequence of features of the audio includes:n frequency domain vectors arranged in time sequence. Each frequency domain vector is M-dimensional, and each dimension represents the frequency F of the audioMThe frequency difference between adjacent dimensions is the same. Wherein N and M are integers greater than 1. Optionally, the obtaining process of the feature sequence is as follows:
the audio is sampled in the time dimension with a preset sampling interval (e.g., every 0.1 second) to obtain a discrete time sequence T1~TnEach T value represents the size of the audio at that sample point.
The time sequence is grouped into fixed time periods (e.g., every 3 seconds) to obtain a plurality of time series groups G1 to GN. Each time series group Gi includes a plurality of sample points, for example 3 seconds / 0.1 second = 30 samples per group, where i is an integer not greater than N.
The sample points belonging to the same time series group Gi are transformed into a frequency domain vector, yielding N frequency domain vectors arranged in time order. That is, each time series group Gi is transformed from the time domain to the frequency domain to obtain the frequency domain sequence corresponding to Gi. Time-frequency transformation methods include, but are not limited to, FFT (Fast Fourier Transform), DFT (Discrete Fourier Transform), and MFCC (Mel-scale Frequency Cepstral Coefficients). Each frequency domain sequence represents the distribution of the different frequencies contained in its time series group Gi. The N frequency domain sequences are each sampled at the different sampling frequencies to obtain the N frequency domain vectors. The different sampling frequencies refer to: the range between the upper and lower frequency limits of the audio is divided equally into a plurality of frequency points, and these frequency points are the different sampling frequencies.
The N frequency domain vectors arranged in time order form a two-dimensional M x N matrix. On this matrix, the axis corresponding to N represents the time domain direction and the axis corresponding to M represents the frequency domain direction. M is the quotient of the frequency range (upper limit minus lower limit) and the frequency sampling interval.
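The framing and time-frequency transformation of step 402 can be sketched as follows. The sampling rate, 3-second frame length, and number of frequency bins are illustrative choices, and a plain FFT magnitude stands in for whichever transform (FFT, DFT, MFCC) is actually used:

```python
import numpy as np

def feature_sequence(signal, sr, frame_seconds=3.0, n_bins=16):
    # Group the sampled signal into fixed-length frames (time series groups),
    # transform each frame to the frequency domain with an FFT, and keep
    # n_bins evenly spaced magnitude values as the M-dimensional vector.
    frame_len = int(sr * frame_seconds)
    n_frames = len(signal) // frame_len
    vectors = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        idx = np.linspace(0, len(spectrum) - 1, n_bins).astype(int)
        vectors.append(spectrum[idx])       # sample the spectrum at n_bins frequency points
    return np.stack(vectors)                # shape (N, M): N frames, M frequency bins

sr = 100                                    # toy sampling rate
t = np.arange(sr * 9) / sr                  # 9 seconds -> N = 3 frames of 3 s each
audio = np.sin(2 * np.pi * 5 * t)           # 5 Hz test tone
features = feature_sequence(audio, sr)
```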
Step 404, calling a time sequence correlation layer to perform time domain autocorrelation processing on the characteristic sequence to obtain an autocorrelation vector sequence;
the feature sequence of the audio includes N frequency domain vectors arranged in time order. For the ith frequency domain vector of the N frequency domain vectors, the time domain autocorrelation process is a processing operation that measures the correlation of the ith frequency domain vector by other frequency domain vectors.
And calling a time sequence correlation layer to perform time domain autocorrelation processing on the N frequency domain vectors arranged according to the time sequence to obtain an autocorrelation vector sequence. The autocorrelation vector sequence includes N autocorrelation feature vectors.
The N autocorrelation feature vectors arranged in time order form a two-dimensional M x N matrix. On this matrix, the axis corresponding to N represents the time domain direction and the axis corresponding to M represents the frequency domain direction. M is the quotient of the frequency range (upper limit minus lower limit) and the frequency sampling interval.
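The patent describes the time domain autocorrelation only as measuring each frequency domain vector's correlation with the other vectors, without a formula, so the sketch below adopts an attention-style weighting as one plausible reading (the softmax weighting is an assumption, not the patent's definition):

```python
import numpy as np

def temporal_autocorrelation(X):
    # X: (N, M) feature sequence. For each frame i, weight every frame j by
    # its similarity to frame i (softmax over dot products) and sum, so the
    # output vector for frame i reflects its correlation with all frames.
    scores = X @ X.T                                  # (N, N) pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # rows sum to 1
    return weights @ X                                # (N, M) autocorrelation vector sequence

X = np.eye(4)                                         # toy 4-frame, 4-bin sequence
A = temporal_autocorrelation(X)
```

Each output row stays closest to its own input frame (the diagonal weight dominates) while mixing in correlated frames.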
Step 406, calling the multi-scale time-frequency domain convolution layer to perform multi-scale feature extraction on the autocorrelation vector sequence to obtain a multi-scale vector sequence of the audio;
the multi-scale feature extraction comprises the following steps: at least one of a time domain multi-scale feature extraction process and a frequency domain multi-scale feature extraction process.
The time domain multi-scale feature extraction processing refers to multi-scale feature extraction along the time direction, and the frequency domain multi-scale feature extraction processing refers to multi-scale feature extraction along the frequency direction. The two are parallel, distinct multi-scale feature extraction processes.
Optionally, the computer device calls time domain convolution kernels under different scales to perform time domain feature extraction on the autocorrelation vector sequence along a time domain direction to obtain time domain vectors under different scales; calling frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along the frequency domain direction to obtain frequency domain vectors under different scales; splicing the time domain vector and the frequency domain vector under the same scale to obtain a feature vector of the audio under the same scale; and determining a sequence formed by the feature vectors of the audio under different scales as a multi-scale vector sequence of the audio.
Step 408, storing the multi-scale vector sequence of each audio in a repository;
Optionally, the storage form is <ID, {h1, ..., hk}>. ID refers to the audio ID of the audio, and {h1, ..., hk} refers to the multi-scale vector sequence of the audio. k refers to the number of different scales.
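A minimal in-memory sketch of this <ID, {h1, ..., hk}> storage form follows; the class and method names are illustrative, and a production repository would of course use a database or vector index rather than a dict:

```python
import numpy as np

class VectorRepository:
    """Minimal in-memory sketch of the <ID, {h1, ..., hk}> repository."""

    def __init__(self):
        self._store = {}

    def put(self, audio_id, vectors):
        # vectors: the k multi-scale feature vectors of one audio
        self._store[audio_id] = [np.asarray(h) for h in vectors]

    def get(self, audio_id):
        # returns the stored {h1, ..., hk} for the given audio ID
        return self._store[audio_id]
```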
In summary, in the method provided in this embodiment, a time-domain autocorrelation layer is called to perform time-domain autocorrelation processing on a feature sequence to obtain an autocorrelation vector sequence, and a time-frequency-domain processing module is called to perform at least one of time-domain feature extraction processing and frequency-domain feature extraction processing on the autocorrelation vector sequence to obtain a feature vector of an audio, so that characteristics of the audio in the time domain and the frequency domain are comprehensively considered, and meanwhile, substantial features of the audio in the time domain and the frequency domain are extracted, thereby improving the extraction effectiveness of the feature vector of the audio.
In an alternative embodiment based on fig. 4, the feature sequence of the audio mentioned in step 402 is illustrated in fig. 5. The audio file is first sampled in the time dimension, for example once every 0.1 s, yielding a discrete time sequence T1~Tn in which each value represents the amplitude of the audio at that sampling point. These values are then grouped over a fixed time period (e.g. 3 s): with a 3 s group length and a 0.1 s sampling interval, each group contains 30 values, so that T1~T30 forms the first group, called G1, T31~T60 forms G2, and so on. Next, a frequency domain transform (including but not limited to FFT, MFCC, DFT, etc.) is applied to each group of time sequences to obtain a frequency domain signal representing the distribution of the different frequencies contained in that group. This frequency signal is also sampled, for example every 10 Hz, to obtain a discrete frequency sequence. Assuming the frequency range is 0~f, each frequency sequence contains f/10 values, and each Gi can be represented as such a frequency sequence; different Gi differ only in the values taken at the same frequencies. In musical terms, in passages dominated by low, heavy tones the corresponding Gi have large low-frequency values, while in high-pitched passages the corresponding Gi have large high-frequency values. Thus each Gi can be represented either as a time sequence (T1~T30) or as a frequency sequence, and together these frequency sequences form a spectrogram.

The spectrogram illustrated in fig. 5 is obtained by decomposing real audio. The horizontal axis represents time, with a time period of about 1.75 s, that is, a time slice is cut every 1.75 s; the vertical axis represents the frequencies corresponding to each time slice, with frequency limits of 110 Hz~3520 Hz, and the gray level represents the magnitude at the different frequencies.
In an alternative embodiment based on fig. 4, step 404 may alternatively be implemented as steps 404a and 404b as shown in fig. 6:
Step 404a, calculating an ith correlation score between the ith frequency domain vector and the frequency domain vectors other than the ith frequency domain vector, where i is an integer not greater than N;
Assume the feature sequence of the audio includes N frequency domain vectors arranged in time order: {G1, G2, ..., Gn}. Each Gi is a frequency domain vector. In order to measure the correlation between the other frequency domain vectors in the feature sequence and the ith frequency domain vector, the following correlation calculation formula is introduced for the ith frequency domain vector.
score(Gi) = (G1*Gi + G2*Gi + ... + Gn*Gi - Gi*Gi) / (G1^2 + G2^2 + ... + Gn^2 - Gi^2)
That is, the computer device calculates a product sum of the ith frequency-domain vector and frequency-domain vectors other than the ith frequency-domain vector; calculating the sum of squares of other frequency domain vectors except the ith frequency domain vector; the quotient of the product sum and the sum of squares is determined as the ith correlation score between the ith frequency domain vector and the other frequency domain vectors except the ith frequency domain vector.
It should be noted that both the numerator and the denominator subtract Gi*Gi (or Gi^2), because the purpose is to measure the influence of the other frequency domain vectors on the ith frequency domain vector Gi. This is not the only option: in some embodiments, Gi*Gi (or Gi^2) may be retained in the numerator and denominator of the above formula.
And step 404b, calculating a weighted sequence of the N frequency domain vectors by taking the ith correlation fraction as the correlation weight of the ith frequency domain vector to obtain an autocorrelation vector sequence.
After score(Gi) is calculated for each Gi, the ith correlation score is used as the correlation weight of the ith frequency domain vector, and the product of the ith frequency domain vector and the ith correlation score is taken as the ith autocorrelation vector ti. Performing this calculation for all n frequency domain vectors yields the autocorrelation vector sequence {t1, ..., tn}, given by the following formula.
{t1, ..., tn} = {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}
Optionally, the weighted sequence of the N frequency domain vectors refers to: the sequence obtained by arranging the weighted products of the ith correlation score and the ith frequency domain vector in time order.
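The score and weighting formulas above can be sketched with NumPy as follows, interpreting the product of two frequency domain vectors as a dot product (an assumption — the patent text leaves the exact form of the vector product implicit):

```python
import numpy as np

def autocorrelation_sequence(G):
    """G: (M, N) matrix whose columns are the frequency domain vectors Gi.

    Implements score(Gi) = (sum_j Gj.Gi - Gi.Gi) / (sum_j Gj.Gj - Gi.Gi)
    and returns (scores, T) where column i of T is ti = Gi * score(Gi).
    """
    gram = G.T @ G                                  # gram[i, j] = Gi . Gj
    prod_sum = gram.sum(axis=0) - np.diag(gram)     # numerator for each i
    sq_sum = np.diag(gram).sum() - np.diag(gram)    # denominator for each i
    scores = prod_sum / sq_sum
    T = G * scores                                  # weight each column Gi
    return scores, T
```

As a sanity check, if all columns are identical, every vector explains the others perfectly and every score is 1, so T equals G.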
In summary, according to the method provided by this embodiment, time-domain autocorrelation processing is performed on the feature sequence by the time-sequence correlation layer, so that autocorrelation characteristics of different frequency-domain vectors in a time-domain dimension can be extracted, and the feature extraction effectiveness of the audio in the time-domain dimension is improved.
In an alternative embodiment based on fig. 4, step 404 is followed by step 405a and step 405b, as shown in fig. 7:
Step 405a, sampling S autocorrelation vectors from the N autocorrelation vectors in descending order of their corresponding correlation scores, where S is an integer less than N;
The autocorrelation vector sequence includes N autocorrelation vectors. To reduce the amount of computation, a subset of the autocorrelation vectors is screened out to participate in subsequent calculation, in descending order of the correlation scores corresponding to the N autocorrelation vectors.
The value of S is an empirical value, such as 20%~50% of N. Taking N = 100 as an example, the autocorrelation vectors ranked in the top 20 by score(Gi) from high to low, e.g. {t8, t11, t12, ...}, are selected.
Step 405b, determining the S autocorrelation vectors as the sampled autocorrelation vector sequence.
Optionally, the S autocorrelation vectors are combined in descending order of their correlation scores and determined as the sampled autocorrelation vector sequence; alternatively, the S autocorrelation vectors are combined in time order and determined as the sampled autocorrelation vector sequence.
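A sketch of steps 405a and 405b, assuming the top-S vectors are selected by score and recombined in time order (the text also permits score order); the function name and the `ratio` parameter are illustrative:

```python
import numpy as np

def importance_sample(T, scores, ratio=0.2, keep_time_order=True):
    """Keep the top S = ratio*N autocorrelation vectors by correlation score.

    T: (M, N) matrix of autocorrelation vectors (one per column);
    scores: length-N array of score(Gi) values.
    """
    N = T.shape[1]
    S = max(1, int(N * ratio))
    top = np.argsort(scores)[::-1][:S]              # indices of highest scores
    if keep_time_order:
        top = np.sort(top)                          # recombine in time order
    return T[:, top]
```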
In summary, in the method provided in this embodiment, the autocorrelation vector sequence is sampled according to the importance degree, and a part of the important autocorrelation vectors are sampled to form the sampled autocorrelation vector sequence, so that the subsequent calculation workload can be reduced, and the real-time performance of the technical scheme during online audio matching is improved.
In an alternative embodiment based on fig. 4, step 406 includes steps 4061 through 4064, as shown in fig. 8:
step 4061, calling time domain convolution kernels under different scales to extract time domain features of the autocorrelation vector sequence along a time domain direction to obtain time domain extraction vectors under different scales;
the time domain feature extraction comprises the following steps: at least one of temporal direction convolution and temporal direction pooling. In various embodiments, the order of operations of convolution processing and pooling processing can be combined in many ways: for example, convolution before pooling; or pooling first and then convolving; or fully connecting layers, then convolving, fully connecting, and then pooling; multiple iterations are also possible (e.g., ResNet, stacking many layers of convolutions, pooling).
For a time-domain convolution kernel at some scale M × P:
time domain direction convolution:
the time domain direction refers to performing time domain convolution processing on the autocorrelation feature vector sequence along the direction from early to late (or the direction from late to early) of time to obtain a time domain convolution vector.
Alternatively, the autocorrelation vector sequence may be regarded as a matrix of M rows by N columns (the sampled autocorrelation vector sequence may be regarded as a matrix of M rows by S columns), and each column is an M-dimensional frequency domain vector. Assume that the scale size of the time-domain convolution kernel is M x P, P being smaller than N (or S). The time domain direction refers to convolution processing of P adjacent frequency domain vectors along the 0-N direction.
As shown in fig. 9, assuming that the size of the time domain convolution kernel is M x 3: in the first convolution along the time domain direction, the frequency domain vectors t1, t2 and t3 are convolved to obtain t'1; in the second convolution, t2, t3 and t4 are convolved to obtain t'2; in the third convolution, t3, t4 and t5 are convolved to obtain t'3; and so on, finally obtaining N-3+1 time domain convolution vectors t'i, where i is not greater than N-P+1 (or S-P+1).
The physical meaning of each t'i is a new frequency domain vector obtained by compressing P frequency domain vectors through convolution. Each t'i is used to represent the correlation between the P frequency domain vectors before convolution.
Time domain direction pooling:
optionally, the time domain convolution vectors are subjected to pooling along a time domain direction to obtain a pooled time domain extraction vector.
When time domain pooling is performed on a plurality of time domain convolution vectors at the same scale, the pooling is also performed along the time direction, and the pooling dimension is consistent with the vector dimension. As shown in fig. 9, after the time domain pooling operation, the N-P+1 time domain convolution vectors t'1, t'2, ..., t'N-P+1 are compressed along the time domain into a single pooled time domain extraction vector t''. That is, the pooled result is one vector, so the physical meaning of the pooled time domain extraction vector t'' is preserved: it can still be regarded as a new vector in the frequency domain dimension. The time domain extraction vector t'' represents the condensed nature of the plurality of time domain convolution vectors.
It should be noted that, in this embodiment of the present application, the time domain convolution kernels may have K scales, where K is an integer greater than 1 and the value of P differs in the time domain convolution kernel of each scale. The above operation may be performed with each time domain convolution kernel, finally obtaining K pooled time domain extraction vectors.
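The per-scale convolution-then-pooling step can be sketched as follows. This assumes one plausible reading in which the M x P kernel acts as a length-P weight vector slid along the time axis (so each step compresses P adjacent frequency domain vectors into one M-dimensional vector t'i) and the time pooling is a max reduction; the patent does not fix these details:

```python
import numpy as np

def time_domain_conv_pool(T, kernel):
    """T: (M, N) autocorrelation matrix; kernel: length-P weights along time.

    Each step combines P adjacent frequency domain vectors into one vector
    t'i = sum_j kernel[j] * T[:, i+j]; time pooling then max-reduces the
    N-P+1 vectors t'i into a single pooled extraction vector t''.
    """
    M, N = T.shape
    P = len(kernel)
    conv = np.stack([T[:, i:i + P] @ kernel for i in range(N - P + 1)],
                    axis=1)                         # shape (M, N-P+1)
    return conv.max(axis=1)                         # pooled vector t'', (M,)
```

Running this once per kernel scale (different P per kernel) yields the K pooled time domain extraction vectors described above.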
Step 4062, calling frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along a frequency domain direction to obtain frequency domain vectors under different scales;
the frequency domain feature extraction comprises the following steps: at least one of frequency domain direction convolution and frequency domain direction pooling. In various embodiments, the order of operations of convolution processing and pooling processing can be combined in many ways: for example, convolution before pooling; or pooling first and then convolving; or fully connecting layers, then convolving, fully connecting, and then pooling; multiple iterations are also possible (e.g., ResNet, stacking many layers of convolutions, pooling).
For a frequency domain convolution kernel at some scale P x N:
and (3) convolution of the frequency domain direction:
the frequency domain direction refers to performing frequency domain convolution processing on the autocorrelation vector sequence along the direction from small to large (or the direction from large to small) of the sampling frequency to obtain a frequency domain convolution vector.
Alternatively, the autocorrelation vector sequence can be regarded as a matrix of M rows by N columns (the sampled autocorrelation vector sequence can be regarded as a matrix of M rows by S columns), in which each row is an N-dimensional time domain vector. Assume that the size of the frequency domain convolution kernel is P x N, with P smaller than M. The frequency domain direction refers to convolution processing of P adjacent time domain vectors along the 0-M direction.
As shown in fig. 10, assuming that the size of the frequency domain convolution kernel is 3 x N: in the first convolution along the frequency domain direction, the time domain vectors f1, f2 and f3 are convolved to obtain f'1; in the second convolution, f2, f3 and f4 are convolved to obtain f'2; in the third convolution, f3, f4 and f5 are convolved to obtain f'3; and so on, finally obtaining M-3+1 frequency domain convolution vectors f'i, where i is not greater than M-P+1.
The physical meaning of each f'i is a new time domain vector obtained by compressing P time domain vectors through convolution. Each f'i is used to represent the correlation between the P time domain vectors before convolution.
Pooling in frequency domain direction:
When frequency domain pooling is performed on a plurality of frequency domain convolution vectors at the same scale, the pooling is performed along the frequency direction, and the pooling dimension is consistent with the vector dimension. As shown in fig. 10, after the frequency domain pooling operation, the M-P+1 frequency domain convolution vectors f'1, f'2, ..., f'M-P+1 are compressed into a single pooled frequency domain extraction vector f''. That is, the pooled result is one vector, so the physical meaning of the pooled frequency domain extraction vector f'' is preserved: it can still be regarded as being compressed from the time dimension into a new vector. The pooled frequency domain extraction vector f'' represents the condensed nature of the plurality of frequency domain convolution vectors.
In this embodiment of the present application, the frequency domain convolution kernels may have K scales, where K is an integer greater than 1 and the value of P differs in the frequency domain convolution kernel of each scale. The above operation may be performed with each frequency domain convolution kernel, finally obtaining K pooled frequency domain extraction vectors.
Step 4063, splicing the time domain extraction vector and the frequency domain extraction vector at the same scale to obtain a feature vector of the audio at that scale;
As shown in fig. 11, the time domain extraction vector t'' and the frequency domain extraction vector f'' at the same scale are spliced to obtain the feature vector {t'', f''} of the audio at that scale.
Step 4064, determine a sequence formed by feature vectors of the audio at different scales as a multi-scale vector sequence of the audio.
Optionally, for each scale j, the time domain extraction vector t''j and the frequency domain extraction vector f''j are spliced to obtain the feature vector {t''j, f''j} of the audio at scale j. Then, splicing in order of scale from small to large (or from large to small) yields the multi-scale feature vector sequence of the audio: {t''1, f''1, t''2, f''2, ..., t''k, f''k}, or {t''1, t''2, ..., t''k, f''1, f''2, ..., f''k}.
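Steps 4061 through 4064 together can be sketched as follows. The 1-D averaging kernels, the max pooling, and the treatment of the frequency direction as the same sliding operation applied to the transposed matrix are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def conv_pool(X, kernel):
    """Slide a length-P kernel over the columns of X, then max-pool."""
    P = len(kernel)
    steps = X.shape[1] - P + 1
    conv = np.stack([X[:, i:i + P] @ kernel for i in range(steps)], axis=1)
    return conv.max(axis=1)

def multi_scale_sequence(T, scales=(2, 3)):
    """For each scale j build {t''j, f''j} and return the list [h1, ..., hk].

    Frequency direction convolution is the same operation applied to the
    transposed matrix (rows become the sliding axis).
    """
    h = []
    for P in scales:
        kernel = np.ones(P) / P                      # placeholder weights
        t2 = conv_pool(T, kernel)                    # time direction, (M,)
        f2 = conv_pool(T.T, kernel)                  # frequency direction, (N,)
        h.append(np.concatenate([t2, f2]))           # spliced {t''j, f''j}
    return h
```

Note that every hj has the same dimensionality (M + N), which matches the statement below that the k vectors are consistent in dimensionality and physical meaning.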
In summary, in the method provided in this embodiment, a time-domain autocorrelation layer is called to perform time-domain autocorrelation processing on a feature sequence to obtain an autocorrelation vector sequence, and a time-frequency-domain processing module is called to perform at least one of time-domain feature extraction processing and frequency-domain feature extraction processing on the autocorrelation vector sequence to obtain a feature vector of an audio, so that characteristics of the audio in the time domain and the frequency domain are comprehensively considered, and meanwhile, substantial features of the audio in the time domain and the frequency domain are extracted, thereby improving the extraction effectiveness of the feature vector of the audio.
FIG. 12 is a flowchart of an audio matching method of an exemplary embodiment. The whole process is divided into two parts:
The left part is the off-line storage stage: for each piece of music in the music library, features are extracted and then stored in the repository 1260. The right part is the retrieval matching stage: for two pieces of music, their respective features are queried from the repository 1260, then matched, and whether the two pieces of music are similar is output.
An off-line storage stage:
at this stage, feature extraction is performed using a sequence autocorrelation module 1220 and a multi-scale time-frequency domain convolution module 1240.
A spectrogram of a section of audio is input into the sequence autocorrelation module 1220, which outputs the processed autocorrelation vector sequence; sequence importance sampling is then performed. The purpose of this step is to sample the autocorrelation vectors of higher importance from the sequence for subsequent processing, thereby reducing the computational load. The strategy adopted in the present application is to sort by the score(Gi) values obtained by the sequence autocorrelation module 1220 in the previous step and output the first k autocorrelation vectors. k is set empirically, generally to 20%~50% of the whole sequence length; for example, for the sequence {G1, G2, ..., G100} with k = 20, the sequence after sorting by score(Gi) is {G2, G8, G9, ...}, 20 items in total.
After importance sampling, the multi-scale time-frequency domain convolution can be performed: convolution and pooling operations are applied to the two-dimensional matrix representation at k different scales, where k is the number of convolution kernels of different scales, yielding k vector representations h1, h2, ..., hk, which are then input into the repository. Thus, each audio is finally represented by k vectors. The k vectors are consistent in dimensionality and physical meaning: each is formed by splicing a time domain vector and a frequency domain vector, reflecting the important information of the audio in the time and frequency dimensions.
Each piece of music in the music library is processed in this way to obtain the final repository, which stores the feature representation of each piece of music, i.e. m vectors, in the form <ID, {h1, ..., hm}>. Because these vectors are consistent in dimensionality and physical meaning, they are comparable.
The offline storage phase is done offline, resulting in the store 1260 serving the online retrieval matches.
And (3) retrieval matching stage:
For two pieces of music A and B to be queried online, the present application obtains their k multi-scale feature vectors according to their audio IDs, i.e. the feature query 1282 in the block diagram.
Assuming that A corresponds to the k feature vectors {hA1, ..., hAk} and B corresponds to {hB1, ..., hBk}, the present application then matches them pairwise; for example, for hAi and hBi, the following predicted vector is obtained:
hABi = [hAi*hBi, hAi-hBi, hBi-hAi]
The multiplication and subtraction signs indicate that the elements at the same positions of the two feature vectors are operated on element-wise, and the results are finally spliced together to obtain hABi. Since h1~hk represent the results of convolution kernels at different scales, this step is called multi-scale matching.
In this way, {hAB1, ..., hABk} is obtained. The k vectors are spliced together and input to the classification layer 1286, where the classification layer 1286 is a softmax function, and the output Y is a similarity probability representing the degree of similarity between the two audios.
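A minimal sketch of the multi-scale matching and classification step follows, assuming a single linear layer followed by a two-class softmax; in practice the classification layer parameters W and b would be learned, and their shapes here are illustrative:

```python
import numpy as np

def match_probability(hA, hB, W, b):
    """hA, hB: lists of k multi-scale vectors; W, b: classification layer params.

    Builds hABi = [hAi*hBi, hAi-hBi, hBi-hAi] (element-wise operations),
    concatenates the k matching vectors, and applies a softmax layer.
    """
    parts = [np.concatenate([a * v, a - v, v - a]) for a, v in zip(hA, hB)]
    x = np.concatenate(parts)                       # spliced {hAB1, ..., hABk}
    logits = W @ x + b                              # 2 classes: [dissimilar, similar]
    e = np.exp(logits - logits.max())               # numerically stable softmax
    probs = e / e.sum()
    return probs[1]                                 # similarity probability Y
```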
Since the multi-scale vector sequences in the present application are computed offline and stored in the repository 1260, with sequence importance sampling performed during the offline computation, only the multi-scale matching and classification layer prediction, which involve little computation, are needed during online matching.
As shown in fig. 13, when the amount of music in the music library is on the order of millions to tens of millions, an audio matching model for the offline matching scenario is suitable for predicting the similarity probability between two full audios; when the amount of music is on the order of hundreds of thousands to tens of millions, an audio matching model for the online matching scenario is suitable; for amounts in between, an audio matching model for the near-line matching scenario is suitable. The audio matching model provided by the embodiments of the application (multi-scale matching + classification layer, or time sequence autocorrelation layer + multi-scale time-frequency domain convolution layer + multi-scale matching + classification layer) is better suited to the online matching scenario at the hundred-thousand to ten-million order of magnitude.
In one illustrative example, the feature vectors of the audio are used for training and prediction of an audio matching model. The audio matching model is the audio matching model in the above embodiment, and after the feature vector of the audio provided by the embodiment of the application is adopted for training, the audio matching model can be used for predicting the similarity between two audios.
Audio recommendation scenario:
Referring to the example shown in fig. 14, when the terminal 180 used by the user runs an audio playing application and the user plays, collects or likes a first audio (song A) on the application, the server 160 may compare the first multi-scale vector sequence of the first audio (song A) with the second multi-scale vector sequences of a plurality of second audios (e.g. song B) to determine the similarity probability between the first audio and each second audio. In descending order of similarity probability, songs B, C, D and E, which are similar to song A, are taken as recommended songs and sent to the audio playing application on the terminal 180, so that the user can hear more songs matching his or her preferences.
Singing scoring scene:
Referring to the example shown in fig. 15, the terminal 180 used by the user runs a singing application, and the user sings a song. The server 160 may compare the first multi-scale vector sequence of a first audio (the song sung by the user) with the second multi-scale vector sequence of a second audio (the original song, a star's rendition, or the top-scoring rendition) to determine the similarity probability between the two. The user's singing score is given according to the similarity probability and fed back to the singing application for display, helping the user improve his or her singing.
FIG. 16 is a flowchart of a model training method provided in an exemplary embodiment of the present application. The model training method can be used for training the classification layer in the above embodiments. The embodiment is exemplified by applying the method to the server shown in fig. 1. The method comprises the following steps:
step 501, clustering the audios in the audio library according to the audio attribute features to obtain audio clusters, where the audio attribute features include at least two attribute features with different dimensions, and the feature similarity of the audios in different audio clusters is lower than that of the audios in the same audio cluster.
The audio library stores a large amount of audio, where the audio may include songs, pure music, symphony songs, piano songs, or other musical compositions, and the like, and the embodiment of the present application does not limit the types of the audio in the audio library. Optionally, the audio library is a music library of an audio playing application.
Optionally, the audio has respective audio attribute features, the audio attribute features may be attribute features of the audio itself, or attribute features given by human, and the same piece of audio may include attribute features of a plurality of different dimensions.
In one possible embodiment, the audio attribute characteristics of the audio include at least one of: text features, audio features, emotional features, and scene features. Alternatively, the text features may include text features of the audio itself (such as lyrics, composer, word maker, genre, etc.), or may include artificially assigned text features (such as comments); the audio features are used for representing audio characteristics of the audio, such as melody, rhythm, duration and the like; the emotion characteristics are used for representing emotion expressed by the audio; the scene features are used to characterize the playing scene used by the audio. Of course, besides the above audio attribute features, the audio may also include attribute features of other dimensions, which is not limited in this embodiment.
In the embodiment of the application, the process of clustering the audio based on the audio attribute features may be referred to as primary screening, and is used for primarily screening out the audio with similar audio attribute features. In order to improve the quality of primary screening, the computer equipment clusters according to the attribute characteristics of at least two different dimensions, and clustering deviation caused by clustering based on the attribute characteristics of a single dimension is avoided.
After clustering, the computer device obtains a plurality of audio clusters, and the audio in the same audio cluster has similar audio attribute characteristics (compared with the audio in other audio clusters). The number of audio clusters can be preset in a clustering stage (based on an empirical value), so that clustering is prevented from being too generalized or too detailed.
Step 502, generating a candidate audio pair according to the audio in the audio cluster, where the candidate audio pair includes two pieces of audio, and the two pieces of audio belong to the same audio cluster or different audio clusters.
Because the audios in the same audio class cluster have similar audio attribute characteristics, and the audios in different audio class clusters have great difference in audio attribute characteristics, the server may preliminarily generate audio samples based on the audio class clusters, where each audio sample is a candidate audio pair composed of two audio samples.
Since the audio library contains a large amount of audio, the number of candidate audio pairs generated based on the audio clusters is also huge; for example, for an audio library containing y pieces of audio, the number of generated candidate audio pairs is C(y, 2). However, while a large number of candidate audio pairs can be generated based on the audio clusters, not all of them can be used for subsequent model training. For example, when the two audios in a candidate audio pair are the same song (e.g., the same song sung by different singers), or are completely different (e.g., an English ballad versus a suona piece), the candidate audio pair is too easy as a model training sample, and a high-quality model cannot be obtained from it.
In order to improve the quality of the audio samples, in the embodiment of the application, the server further screens high-quality audio pairs from the candidate audio pairs as the audio samples through fine screening.
Step 503, determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play records of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio cluster, and the audio in the audio negative sample pair belongs to different audio clusters.
Analysis shows that users' audio playing behavior is closely related to audio similarity; for example, users often play highly similar, but not identical, audios in succession. Therefore, in the embodiment of the application, the computer device finely screens the generated candidate audio pairs based on the history play records of the audio to obtain the audio sample pairs. The finely screened audio sample pairs include audio positive sample pairs composed of similar audios (screened from candidate audio pairs whose audios belong to the same audio cluster) and audio negative sample pairs composed of dissimilar audios (screened from candidate audio pairs whose audios belong to different audio clusters).
Optionally, a historical play record is the audio play record under a user account, and may take the form of an audio play list ordered by play time. For example, the historical play records may be the song play records of respective users collected by the server of an audio playback application.
In some embodiments, the audio positive sample pairs and audio negative sample pairs screened out based on the historical play records have a low degree of distinction (they are hard to tell apart), which improves the quality of the model trained on these audio sample pairs.
And step 504, performing machine learning training on the classification layer according to the audio positive sample pair and the audio negative sample pair.
A sample is an object used to train and test a model and carries label information, where the label information is the reference value (also called the ground truth or supervision value) for the model output: a sample whose label is 1 is a positive sample, and a sample whose label is 0 is a negative sample. The samples in the embodiment of the present application are audio samples used to train the similarity model, and each takes the form of a sample pair, that is, each audio sample contains two pieces of audio. Optionally, when the label of an audio sample pair is 1, the two pieces of audio in the pair are similar audio, i.e., an audio positive sample pair; when the label is 0, the two pieces of audio are not similar, i.e., an audio negative sample pair.
The similarity probability of the two pieces of audio in an audio positive sample pair can be taken as 1, or the clustering distance between them can be quantized into a similarity probability. The similarity probability of the two pieces of audio in an audio negative sample pair can be taken as 0, or the cluster distance or vector distance between them can be quantized into a similarity probability, for example by taking the inverse of the cluster distance or of the vector distance as the similarity probability of the pair.
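A hypothetical sketch of this labeling scheme follows. The `pair_label` helper, the `eps` smoothing term, and the clamping of the inverse distance into [0, 1] are added assumptions; the text only specifies "label 1 / inverse of the distance":

```python
def pair_label(cluster_dist, same_cluster, eps=1e-6):
    """Quantize a clustering distance into a similarity-probability label.

    Positive pairs (same cluster) are simply labeled 1. Negative pairs take
    the inverse of the cluster (or vector) distance as a soft label; the
    clamp keeps the label inside [0, 1]. A purely illustrative scheme.
    """
    if same_cluster:
        return 1.0
    return float(min(1.0, 1.0 / (cluster_dist + eps)))
```

For instance, a negative pair of audio drawn from clusters at distance 4 would receive the soft label 1/4 = 0.25 rather than a hard 0.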
In summary, in the embodiment of the present application, audio with similar features in the audio library is first clustered according to audio attribute features of different dimensions to obtain audio clusters; audio belonging to the same or different clusters is then combined into a number of candidate audio pairs; and finally, based on the historical play records of the audio, audio positive sample pairs and audio negative sample pairs are screened out from the candidate pairs for subsequent model training. Because clustering fuses multi-dimensional attribute features of the audio, and the positive and negative sample pairs are screened using users' play records, the generated audio sample pairs reflect inter-audio similarity from multiple angles (both the attributes of the audio itself and users' listening habits). This automates the generation of audio sample pairs while improving their quality, which in turn improves the quality of models subsequently trained on these samples.
Fig. 17 is a block diagram of an audio matching apparatus according to an exemplary embodiment of the present application. The audio matching apparatus includes:
an obtaining module 1720 for obtaining a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
a matching module 1740, configured to match frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain multiple matched frequency domain vectors at different scales;
a stitching module 1760, configured to stitch the multiple matched frequency domain vectors at different scales to obtain a prediction vector;
a classification module 1780, configured to invoke a classification layer to predict the prediction vector, and output a similarity probability between the first audio and the second audio.
In an optional embodiment, the first multi-scale vector sequence comprises K first feature vectors of different scales, the second multi-scale vector sequence comprises K second feature vectors of different scales, and K is an integer greater than 1;
the matching module 1740 is configured to multiply the first feature vector and the second feature vector of the same scale to obtain a first vector; subtract the second feature vector from the first feature vector of the same scale to obtain a second vector; subtract the first feature vector from the second feature vector of the same scale to obtain a third vector; and splice the first vector, the second vector, and the third vector at the ith scale to obtain the matched feature vector at the ith scale, where i is an integer not greater than K.
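The per-scale matching and splicing performed by the matching module can be sketched with NumPy as below. The function names are made up, and taking the two subtraction results as the two opposite signed differences is an assumption introduced to distinguish the second and third vectors in the translated text:

```python
import numpy as np

def match_at_scale(u, v):
    """Match two same-scale feature vectors: element-wise product plus the
    two signed differences, spliced into one matched feature vector.
    (The two subtraction orders are an interpretive assumption.)"""
    return np.concatenate([u * v, u - v, v - u])

def build_prediction_vector(seq_a, seq_b, descending=True):
    """Splice the per-scale matched vectors, ordered by scale (vector
    length is used as a proxy for scale in this sketch)."""
    pairs = sorted(zip(seq_a, seq_b), key=lambda p: len(p[0]), reverse=descending)
    return np.concatenate([match_at_scale(u, v) for u, v in pairs])
```

For two scales of dimensions 2 and 3, each matched vector is three times its scale's dimension, so the spliced prediction vector has 3 × (2 + 3) = 15 components.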
In an optional embodiment, the stitching module 1760 is configured to perform a second splicing of the matched feature vectors at the K different scales in descending order of scale to obtain the prediction vector; or to perform a second splicing of the matched feature vectors at the K different scales in ascending order of scale to obtain the prediction vector.
In an alternative embodiment, the obtaining module 1720 is configured to obtain the first multi-scale vector sequence of the first audio in the repository and the second multi-scale vector sequence of the second audio in the repository.
In an optional embodiment, the apparatus further comprises: a feature extraction module 1710;
the feature extraction module 1710 is configured to obtain a feature sequence of audio, where the audio includes the first audio and the second audio; call a time-series correlation layer to perform time-domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence; call a multi-scale time-frequency domain convolution layer to perform multi-scale feature extraction on the autocorrelation vector sequence to obtain the multi-scale vector sequence of the audio; and store the multi-scale vector sequence of the audio in the repository.
In an optional embodiment, the feature sequence includes N frequency domain vectors ordered by time, and the feature extraction module 1710 is configured to calculate an ith correlation score between the ith frequency domain vector and the frequency domain vectors other than the ith frequency domain vector, where i is an integer not greater than N; and to calculate a weighted sequence of the N frequency domain vectors using the ith correlation score as the correlation weight of the ith frequency domain vector, obtaining the autocorrelation vector sequence.
In an optional embodiment, the autocorrelation vector sequence includes N autocorrelation vectors, and the feature extraction module 1710 is configured to sample S autocorrelation vectors from the N autocorrelation vectors in descending order of their corresponding correlation scores, where S is an integer smaller than N; and to determine the S autocorrelation vectors as the sampled autocorrelation vector sequence.
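The correlation weighting and top-S sampling described above can be sketched as follows. The dot-product similarity, the normalization of the scores into weights, and keeping the retained vectors in temporal order are all added assumptions; the text does not fix these details:

```python
import numpy as np

def autocorrelate(features, top_s=None):
    """Time-domain autocorrelation weighting of a feature sequence.

    features: (N, D) array of N time-ordered frequency domain vectors.
    Each vector's correlation score is its summed dot-product similarity
    with the other vectors (self-correlation excluded); the scores weight
    the sequence, and optionally only the top-S highest-scoring vectors
    are kept, which is a hypothetical reading of the sampling step.
    """
    sim = features @ features.T                    # pairwise dot products
    np.fill_diagonal(sim, 0.0)                     # exclude the vector itself
    scores = sim.sum(axis=1)                       # i-th correlation score
    weights = scores / (np.abs(scores).sum() + 1e-9)
    weighted = weights[:, None] * features         # weighted sequence
    if top_s is not None:
        keep = np.argsort(scores)[::-1][:top_s]    # highest scores first
        weighted = weighted[np.sort(keep)]         # preserve temporal order
    return weighted
```

With `top_s=None` the output keeps all N weighted vectors; with `top_s=S` it keeps the S vectors with the highest correlation scores, matching S < N above.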
In an optional embodiment, the feature extraction module 1710 is configured to call time domain convolution kernels at different scales to perform time domain feature extraction on the autocorrelation vector sequence along the time domain direction, obtaining time domain vectors at different scales; call frequency domain convolution kernels at different scales to perform frequency domain feature extraction on the autocorrelation vector sequence along the frequency domain direction, obtaining frequency domain vectors at different scales; splice the time domain vector and the frequency domain vector at the same scale to obtain the feature vector of the audio at that scale; and determine the sequence formed by the feature vectors of the audio at different scales as the multi-scale vector sequence of the audio. A storage module 1714 is configured to store the multi-scale vector sequence of the audio.
In an optional embodiment, the time-domain feature extraction includes: at least one of temporal direction convolution and temporal direction pooling; the frequency domain feature extraction comprises: at least one of frequency domain direction convolution and frequency domain direction pooling.
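Since the text allows either convolution or pooling in each direction, the multi-scale extraction can be sketched with simple non-overlapping average pooling standing in for the learned kernels. The function names, the use of vector length as the scale, and the pooling window sizes are all made up for illustration:

```python
import numpy as np

def pool1d(x, size, axis):
    """Non-overlapping average pooling along one axis (a stand-in for the
    time/frequency direction kernels; trailing remainder is dropped)."""
    n = (x.shape[axis] // size) * size
    x = np.take(x, np.arange(n), axis=axis)
    shape = list(x.shape)
    shape[axis:axis + 1] = [n // size, size]
    return x.reshape(shape).mean(axis=axis + 1)

def multi_scale_features(seq, scales=(2, 4)):
    """seq: (T, F) time-frequency matrix. For each scale, pool along the
    time axis and along the frequency axis, flatten both results, and
    splice them into that scale's feature vector."""
    out = []
    for s in scales:
        t_vec = pool1d(seq, s, axis=0).ravel()   # time-direction features
        f_vec = pool1d(seq, s, axis=1).ravel()   # frequency-direction features
        out.append(np.concatenate([t_vec, f_vec]))
    return out  # the multi-scale vector sequence: one vector per scale
```

For a 4×6 time-frequency matrix, scale 2 yields a 12+12 = 24-dimensional vector and scale 4 a 6+4 = 10-dimensional one, illustrating how each scale produces a feature vector of its own size.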
In an optional embodiment, the apparatus further comprises: a training module 1790;
the training module 1790 is configured to cluster the audios in the audio library according to the audio attribute features to obtain audio clusters, where the audio attribute features include attribute features of at least two different dimensions, and feature similarities of the audios in different audio clusters are lower than that of the audios in the same audio cluster; generating a candidate audio pair according to the audio in the audio cluster, wherein the candidate audio pair comprises two pieces of audio, and the two pieces of audio belong to the same audio cluster or different audio clusters; determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical playing records of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio cluster, and the audio in the audio negative sample pair belongs to different audio clusters; and performing machine learning training on the classification layer according to the audio positive sample pair and the audio negative sample pair.
It should be noted that: the audio matching device provided in the above embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio matching apparatus and the audio matching method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 18 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. The computer device 1800 includes a central processing unit (CPU) 1801, a system memory 1804 including a random access memory 1802 and a read-only memory 1803, and a system bus 1805 that couples the system memory 1804 to the CPU 1801. The computer device 1800 also includes a basic input/output (I/O) system 1806, which facilitates information transfer between devices within the computer, and a mass storage device 1807 for storing an operating system 1813, application programs 1814, and other program modules 1815.
The basic input/output system 1806 includes a display 1808 for displaying information and an input device 1809, such as a mouse or keyboard, through which the user inputs information. The display 1808 and the input device 1809 are both connected to the central processing unit 1801 via an input/output controller 1810 coupled to the system bus 1805. The basic input/output system 1806 may also include the input/output controller 1810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1807 is connected to the central processing unit 1801 through a mass storage controller (not shown) connected to the system bus 1805. The mass storage device 1807 and its associated computer-readable media provide non-volatile storage for the computer device 1800. That is, the mass storage device 1807 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include random access memory (RAM), read-only memory (ROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1804 and the mass storage device 1807 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1801, the one or more programs containing instructions for implementing the methods described above, and the central processing unit 1801 executes the one or more programs to implement the methods provided by the various method embodiments described above.
The computer device 1800 may also operate in accordance with various embodiments of the present application by connecting to remote computers over a network, such as the internet. That is, the computer device 1800 may be connected to the network 1812 through the network interface unit 1811 that is coupled to the system bus 1805, or the network interface unit 1811 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.
The present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio matching method described in any of the above embodiments.
The present application also provides a computer program product, which when run on a computer, causes the computer to execute the audio matching method provided by the above-mentioned method embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, which may be the computer-readable storage medium contained in the memory of the above embodiments, or a stand-alone computer-readable storage medium not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio matching method of any of the above method embodiments.
Optionally, the computer-readable storage medium may include: ROM, RAM, a solid state drive (SSD), an optical disc, or the like. The RAM may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM), among others. The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is intended to be exemplary only, and not to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and scope of the present application are intended to be included therein.

Claims (13)

1. A method of audio matching, the method comprising:
acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
splicing the plurality of matched frequency domain vectors under different scales to obtain a prediction vector;
and calling a classification layer to predict the prediction vector, and outputting the similarity probability of the first audio and the second audio.
2. The method of claim 1, wherein the first multi-scale vector sequence comprises K first feature vectors of different scales, the second multi-scale vector sequence comprises K second feature vectors of different scales, and K is an integer greater than 1;
the matching the feature vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain the matched feature vectors under multiple scales includes:
multiplying the first feature vector and the second feature vector of the same scale to obtain a first vector;
subtracting the second feature vector from the first feature vector of the same scale to obtain a second vector;
subtracting the first feature vector from the second feature vector of the same scale to obtain a third vector;
and splicing the first vector, the second vector and the third vector at the ith scale to obtain a matched feature vector at the ith scale, wherein i is an integer not greater than K.
3. The method according to claim 2, wherein the splicing the matched feature vectors at the plurality of different scales to obtain a prediction vector comprises:
performing second splicing on the matched feature vectors under the K different scales according to the order of the scales from large to small to obtain the prediction vector;
or,
and performing second splicing on the matched feature vectors under the K different scales according to the order of the scales from small to large to obtain the prediction vector.
4. The method of any of claims 1 to 3, wherein obtaining the first multi-scale vector sequence of the first audio and the second multi-scale vector sequence of the second audio comprises:
obtaining the first multi-scale vector sequence of the first audio in a repository and the second multi-scale vector sequence of the second audio in the repository.
5. The method of claim 4, further comprising:
acquiring a characteristic sequence of audio, wherein the audio comprises the first audio and the second audio;
calling a time sequence correlation layer to perform time domain autocorrelation processing on the characteristic sequence to obtain an autocorrelation vector sequence;
calling a multi-scale time-frequency domain convolution layer to perform multi-scale feature extraction on the autocorrelation vector sequence to obtain a multi-scale vector sequence of the audio;
storing the sequence of multi-scale vectors of audio to the repository.
6. The method of claim 5, wherein the feature sequence comprises N frequency domain vectors ordered by time, and the invoking the time-series correlation layer to perform time-domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence comprises:
calculating an ith correlation score between the ith frequency domain vector and other frequency domain vectors except the ith frequency domain vector, wherein i is an integer not more than N;
and calculating a weighted sequence of the N frequency domain vectors by taking the ith correlation score as the correlation weight of the ith frequency domain vector, to obtain the autocorrelation vector sequence.
7. The method of claim 6, wherein the sequence of autocorrelation vectors includes N autocorrelation vectors, the method further comprising:
sampling S autocorrelation vectors from the N autocorrelation vectors in descending order of their corresponding correlation scores, wherein S is an integer smaller than N;
and determining the S autocorrelation vectors as the sampled autocorrelation vector sequence.
8. The method of claim 5, wherein the invoking the multi-scale time-frequency domain convolution layer to perform multi-scale feature extraction on the autocorrelation vector sequence to obtain the multi-scale vector sequence of the audio comprises:
calling time domain convolution kernels under different scales to extract time domain features of the autocorrelation vector sequence along a time domain direction to obtain time domain vectors under different scales;
calling frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along a frequency domain direction to obtain frequency domain vectors under different scales;
splicing the time domain vector and the frequency domain vector under the same scale to obtain a characteristic vector of the audio under the same scale;
and determining a sequence formed by the feature vectors of the audio under different scales as a multi-scale vector sequence of the audio.
9. The method of claim 8,
the time domain feature extraction comprises: at least one of temporal direction convolution and temporal direction pooling;
the frequency domain feature extraction comprises: at least one of frequency domain direction convolution and frequency domain direction pooling.
10. The method of any of claims 1 to 3, further comprising:
clustering the audios in the audio library according to the audio attribute characteristics to obtain audio clusters, wherein the audio attribute characteristics comprise at least two attribute characteristics with different dimensionalities, and the characteristic similarity of the audios in different audio clusters is lower than that of the audios in the same audio cluster;
generating a candidate audio pair according to the audio in the audio cluster, wherein the candidate audio pair comprises two pieces of audio, and the two pieces of audio belong to the same audio cluster or different audio clusters;
determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical playing records of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio cluster, and the audio in the audio negative sample pair belongs to different audio clusters;
and performing machine learning training on the classification layer according to the audio positive sample pair and the audio negative sample pair.
11. An audio matching apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
the matching module is used for matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
the splicing module is used for splicing the plurality of matched frequency domain vectors under different scales to obtain a prediction vector;
and the classification module is used for calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
12. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the audio matching method of any of claims 1 to 10.
13. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the audio matching method of any of claims 1 to 10.
CN202010201517.2A 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium Active CN111309965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010201517.2A CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010201517.2A CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111309965A true CN111309965A (en) 2020-06-19
CN111309965B CN111309965B (en) 2024-02-13

Family

ID=71160651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010201517.2A Active CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111309965B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883131A (en) * 2020-08-20 2020-11-03 腾讯科技(深圳)有限公司 Voice data processing method and device
CN112035700A (en) * 2020-08-31 2020-12-04 兰州理工大学 Voice deep hash learning method and system based on CNN
CN113051425A (en) * 2021-03-19 2021-06-29 腾讯音乐娱乐科技(深圳)有限公司 Method for acquiring audio representation extraction model and method for recommending audio
CN113763931A (en) * 2021-05-07 2021-12-07 腾讯科技(深圳)有限公司 Waveform feature extraction method and device, computer equipment and storage medium
CN114486254A (en) * 2022-02-09 2022-05-13 青岛迈金智能科技股份有限公司 Bicycle bearing detection method based on time/frequency double-domain analysis
CN114724589A (en) * 2022-04-14 2022-07-08 标贝(北京)科技有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN114898241A (en) * 2022-02-21 2022-08-12 上海科技大学 Video repetitive motion counting system based on computer vision
CN115083435A (en) * 2022-07-28 2022-09-20 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product
CN115359785A (en) * 2022-08-22 2022-11-18 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and device, computer equipment and computer-readable storage medium
CN115934997A (en) * 2021-09-30 2023-04-07 腾讯科技(深圳)有限公司 Detection method and device of audio playing equipment, readable medium and electronic equipment
CN117078977A (en) * 2022-05-06 2023-11-17 墨奇科技(北京)有限公司 Task processing methods, neural network training methods, devices, equipment and media

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160335347A1 (en) * 2015-05-11 2016-11-17 Alibaba Group Holding Limited Audiot information retrieval method and device
CN106407960A (en) * 2016-11-09 2017-02-15 浙江师范大学 Multi-feature-based classification method and system for music genres
CN106528766A (en) * 2016-11-04 2017-03-22 北京云知声信息技术有限公司 Similar song recommendation method and device
US20190213279A1 (en) * 2018-01-08 2019-07-11 Electronics And Telecommunications Research Institute Apparatus and method of analyzing and identifying song
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110399522A (en) * 2019-07-03 2019-11-01 中国传媒大学 A music humming retrieval method and device based on LSTM and hierarchical matching
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 A multi-scale audio scene recognition method based on attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUIHUI HAN et al.: "Music Recommendation Based on Feature Similarity", 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), pages 650 - 654 *
HE Qianhua; ZHANG Xueyuan; YANG Jichen; LIN Pei: "Audio feature extraction method based on a perceptual subspace decomposition model", Journal of Huazhong University of Science and Technology (Natural Science Edition), no. 03, pages 83 - 88 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883131A (en) * 2020-08-20 2020-11-03 腾讯科技(深圳)有限公司 Voice data processing method and device
CN111883131B (en) * 2020-08-20 2023-10-27 腾讯科技(深圳)有限公司 Voice data processing method and device
CN112035700A (en) * 2020-08-31 2020-12-04 兰州理工大学 Voice deep hash learning method and system based on CNN
CN113051425A (en) * 2021-03-19 2021-06-29 腾讯音乐娱乐科技(深圳)有限公司 Method for acquiring audio representation extraction model and method for recommending audio
CN113051425B (en) * 2021-03-19 2024-01-05 腾讯音乐娱乐科技(深圳)有限公司 Method for acquiring audio characterization extraction model and method for recommending audio
CN113763931A (en) * 2021-05-07 2021-12-07 腾讯科技(深圳)有限公司 Waveform feature extraction method and device, computer equipment and storage medium
CN113763931B (en) * 2021-05-07 2023-06-16 腾讯科技(深圳)有限公司 Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN115934997A (en) * 2021-09-30 2023-04-07 腾讯科技(深圳)有限公司 Detection method and device of audio playing equipment, readable medium and electronic equipment
CN114486254A (en) * 2022-02-09 2022-05-13 青岛迈金智能科技股份有限公司 Bicycle bearing detection method based on time/frequency double-domain analysis
CN114898241A (en) * 2022-02-21 2022-08-12 上海科技大学 Video repetitive motion counting system based on computer vision
CN114898241B (en) * 2022-02-21 2024-04-30 上海科技大学 Video repeated action counting system based on computer vision
CN114724589A (en) * 2022-04-14 2022-07-08 标贝(北京)科技有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN117078977A (en) * 2022-05-06 2023-11-17 墨奇科技(北京)有限公司 Task processing methods, neural network training methods, devices, equipment and media
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product
CN115273892B (en) * 2022-07-27 2024-07-26 腾讯科技(深圳)有限公司 Audio processing method, apparatus, device, storage medium and computer program product
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115083435A (en) * 2022-07-28 2022-09-20 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115359785A (en) * 2022-08-22 2022-11-18 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and device, computer equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111309965B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
CN111400543B (en) Audio fragment matching method, device, equipment and storage medium
CN111445922B (en) Audio matching method, device, computer equipment and storage medium
Casey et al. Content-based music information retrieval: Current directions and future challenges
CN111444967B (en) Training method, generating method, device, equipment and medium for generating countermeasure network
CN111309966B (en) Audio matching method, device, equipment and storage medium
CN111428074A (en) Audio sample generation method and device, computer equipment and storage medium
CN111445921A (en) Audio feature extraction method and device, computer equipment and storage medium
CN115359785B (en) Audio identification method, device, computer equipment and computer readable storage medium
Mirza et al. Residual LSTM neural network for time dependent consecutive pitch string recognition from spectrograms: a study on Turkish classical music makams
CN111460215B (en) Audio data processing method and device, computer equipment and storage medium
Sridharan et al. Similarity estimation for classical indian music
George et al. Unsupervised analysis of similarities between musicians and musical genres using spectrograms.
HK40023627A (en) Audio matching method, device, computer apparatus and storage medium
Leleuly et al. Analysis of feature correlation for music genre classification
Shirali-Shahreza et al. Fast and scalable system for automatic artist identification
Pei et al. Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering
HK40023627B (en) Audio matching method, device, computer apparatus and storage medium
HK40025589A (en) Method and apparatus for matching audio, computer device and storage medium
HK40025589B (en) Method and apparatus for matching audio, computer device and storage medium
Fernandez et al. Comparison of Deep Learning and Machine Learning in Music Genre Categorization
HK40025739A (en) Method and device for matching audio segment, apparatus and storage medium
Rizan MUS490: Advanced Feature Selection and Unsupervised Clustering for Music Categorization
Grekow Music recommendation based on emotion tracking of musical performances
HK40026288A (en) Method for training generation confrontation network, generation method, apparatus, device, and medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40023627; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant