
CN119152863A - Audio encoding and decoding method, device, equipment and storage medium based on neural network - Google Patents

Audio encoding and decoding method, device, equipment and storage medium based on neural network

Info

Publication number
CN119152863A
CN119152863A (application number CN202411666474.XA)
Authority
CN
China
Prior art keywords
audio
loss
network
codec
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411666474.XA
Other languages
Chinese (zh)
Inventor
常沛炜
李友高
许朝智
吴星辰
李荣基
戴瑶
李骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ji Hua Laboratory
Original Assignee
Ji Hua Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ji Hua Laboratory filed Critical Ji Hua Laboratory
Priority to CN202411666474.XA priority Critical patent/CN119152863A/en
Publication of CN119152863A publication Critical patent/CN119152863A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 - Quantisation or dequantisation of spectral components
    • G10L19/038 - Vector quantisation, e.g. TwinVQ audio
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract


The present invention relates to the field of audio coding and decoding technology, and discloses a neural-network-based audio coding and decoding method, device, equipment and storage medium. The method first introduces an improved residual vector quantization scheme that quantizes the residual information of the audio signal stage by stage, so that audio features are preserved more finely during compression. It then adopts a loss function that fuses reconstruction loss, perceptual loss, residual vector quantization loss and commitment loss, which significantly improves the reconstruction quality of the audio signal at different bit rates and achieves efficient audio compression while maintaining high sound quality. Finally, an entropy coding module introduced after quantization further lowers the coding bit rate and significantly reduces the bandwidth required for real-time audio transmission.

Description

Audio encoding and decoding method, device, equipment and storage medium based on neural network
Technical Field
The present invention relates to the field of audio encoding and decoding technologies, and in particular, to a method, an apparatus, a device, and a storage medium for audio encoding and decoding based on a neural network.
Background
Currently popular traditional codecs combine classic methods such as linear predictive coding, code-excited linear prediction and the modified discrete cosine transform. These coders play an important role in audio compression, but they still have drawbacks. Conventional audio codecs rely heavily on manually designed signal processing methods such as linear predictive coding, which must trade off audio quality against compression efficiency; they usually perform well at medium and high bit rates but suffer unavoidable degradation of sound quality at low bit rates, especially in complex audio environments. Conventional audio codecs are usually designed and optimized for specific audio content and struggle to generalize to a wide range of general audio content and application scenarios; their performance degrades significantly when complex background noise or echo is present, because they depend mainly on rule-driven signal processing and find it hard to remain stable in complex environments. Moreover, the design of conventional codecs is often static, meaning that they largely have to operate in the environments they were designed for and cannot be adjusted flexibly for different bit rate, sampling rate or latency requirements.
Accordingly, the prior art is still in need of improvement and development.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an audio coding and decoding method, device, equipment and storage medium based on a neural network, which aim to solve the problems of reduced tone quality, poor adaptability and lack of flexibility of the traditional coding and decoding method under a low bit rate.
The first aspect of the present invention provides a neural-network-based audio codec method. The method comprises: pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂; pre-designing a fusion loss function of the codec network, the fusion loss function comprising a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss; performing end-to-end training of the codec network with the original audio data as input, obtaining the model output through forward propagation, calculating the loss value of the fusion loss function, back-propagating the loss value to each parameter in the codec network, and updating the parameters to reduce the loss value, thereby obtaining an audio codec model; and inputting the audio to be processed into the audio codec model and outputting a reconstructed audio signal.
Optionally, in a first implementation manner of the first aspect of the present invention, the encoder network includes a first convolution layer with C channels and a convolution kernel size of 7, B convolution blocks, an LSTM for sequence modeling, and a second convolution layer with a convolution kernel size of 7 and D output channels, where each convolution block consists of a residual unit followed by a downsampling layer formed by a convolution with stride S, the kernel size K of the downsampling layer is twice the stride S, and the residual unit comprises two convolutions with kernel size 3 and one skip connection.
Optionally, in a second implementation manner of the first aspect of the present invention, the quantizer is a residual multi-stage vector quantizer; the audio latent representation z output by the encoder network is compressed in the residual multi-stage vector quantizer into discrete quantization indexes, i.e. codebook vectors, the quantization indexes representing discrete values obtained after multi-stage quantization. The residual multi-stage vector quantizer is formed by a hierarchy of Nq vector quantization layers: the unquantized input vector x is processed through the codebook of the first vector quantizer and the quantized residual is calculated to obtain a first-layer quantization result q1 and a first residual r1; the residual r1 is quantized through the codebook of the second vector quantizer to obtain a second-layer quantization result q2 and a second residual r2; the iteration continues until the predetermined number of quantization layers Nq is reached, and the quantization results of all layers are combined to obtain the final quantization result q = q1 + q2 + ... + qNq. The quantized codeword q is the result of the overall quantization process and represents the encoded signal; the final quantized feature is ẑ_q = q.
Optionally, in a third implementation manner of the first aspect of the present invention, the codec network further includes an entropy encoder located after the quantizer, and the entropy encoder is configured to arithmetically encode a probability distribution of the quantization index to generate a compressed code stream.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the entropy encoder is configured to arithmetically encode the probability distribution of the quantization indexes to generate a compressed code stream, and this includes the following steps: constructing a probability model for the quantized symbol sequence, assuming the symbol sequence is S = {s1, s2, ..., sn}, where the probability P(si) of each symbol si is derived from statistics of historical data, and the cumulative probability C(si) of each symbol is the sum of the probabilities of all symbols preceding si, i.e. C(si) = Σ_{j<i} P(sj); at the start of arithmetic coding, defining an initial interval [0, 1], and for each symbol si in the symbol sequence, gradually narrowing the current interval [low, high] to a smaller interval [low', high'] according to its probability P(si) and cumulative probability C(si), as follows: low' = low + (high − low) · C(si), high' = low + (high − low) · (C(si) + P(si)); and after all symbols are processed, selecting the midpoint of the final interval as the coding result, the value being expressed in binary as the final compressed code stream.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the fusion loss function is L = λ_rec · L_rec + λ_percept · L_percept + λ_rvq · L_rvq + λ_commit · L_commit, where L_rec represents the reconstruction loss, L_percept represents the perceptual or discriminative loss, L_rvq represents the residual vector quantization loss, L_commit represents the commitment loss, and the λ value corresponding to each loss represents the trade-off and balance between the respective terms.
Optionally, in a sixth implementation manner of the first aspect of the present invention, during an end-to-end training process of the codec network, a gradient of each loss term in the fusion loss function is dynamically adjusted by introducing a balancer, and a contribution of each loss term to the model during the optimization process is balanced.
A second aspect of the present invention provides a neural-network-based audio codec device, comprising: a network construction module for pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂; a loss function design module for pre-designing a fusion loss function of the codec network, the fusion loss function comprising a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss, performing end-to-end training of the codec network with the original audio data as input, obtaining the model output through forward propagation, calculating the loss value of the fusion loss function, back-propagating the loss value to each parameter in the codec network, and updating the parameters to reduce the loss value so as to obtain an audio codec model; and an output module for inputting the audio to be processed into the audio codec model and outputting a reconstructed audio signal.
A third aspect of the present invention provides a neural-network-based audio codec device comprising a memory and at least one processor, the memory having computer-readable instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the computer-readable instructions in the memory to cause the neural-network-based audio codec device to perform the steps of the neural-network-based audio codec method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein computer-readable instructions which, when run on a computer, cause the computer to perform the steps of the neural network-based audio codec method described above.
To address the technical problems of conventional audio codecs, such as low sound quality, poor adaptability and lack of flexibility, the present invention provides a neural-network-based audio codec method. First, by introducing an improved residual vector quantization method, the residual information of the audio signal is quantized stage by stage, so that audio features are preserved more finely during compression. Then, a loss function integrating reconstruction loss, perceptual loss, residual vector quantization loss and commitment loss is adopted, which significantly improves the reconstruction quality of the audio signal at different bit rates and achieves efficient audio compression while guaranteeing high sound quality. Finally, by introducing an entropy coding module after quantization, the coding bit rate is further reduced and the bandwidth requirement for real-time audio transmission is significantly lowered.
Drawings
Fig. 1 is a flowchart of an audio encoding and decoding method based on a neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an audio encoding and decoding method based on a neural network according to an embodiment of the present invention.
Fig. 3 is a graph of the MUSHRA scoring results for different models at individual data sets and code rates.
Fig. 4 is a schematic structural diagram of an audio encoding and decoding device based on a neural network according to the present invention.
Fig. 5 is a schematic structural diagram of an audio codec device based on a neural network according to the present invention.
Detailed Description
The embodiments of the present invention provide a method, apparatus, device and storage medium for audio encoding and decoding based on a neural network, where the terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of the present invention and the above figures are used to distinguish similar objects, and are not necessarily used to describe a specific order or precedence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a flowchart of an audio encoding and decoding method based on a neural network, as shown in the figure, which includes the steps of:
S10, pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂;
S20, pre-designing a fusion loss function of the codec network, wherein the fusion loss function comprises a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss; performing end-to-end training of the codec network with original audio data as input, obtaining the model output through forward propagation, calculating the loss value of the fusion loss function, back-propagating the loss value to each parameter in the codec network, and updating the parameters to reduce the loss value, so as to obtain an audio codec model;
s30, inputting the audio to be processed into the audio coding and decoding model, and outputting a reconstructed audio signal.
Specifically, as shown in fig. 2, the encoder network includes a one-dimensional convolution layer with C channels and a convolution kernel size of 7, followed by B convolution blocks. Each convolution block is composed of a residual unit followed by a downsampling layer consisting of a strided convolution with stride S, where the kernel size K of the downsampling layer is twice the stride S; the residual unit contains two convolutions with kernel size 3 and a skip connection, and the number of channels doubles after each downsampling operation. The convolution blocks are followed by a two-layer LSTM (Long Short-Term Memory) for sequence modeling, and finally a convolution layer with kernel size 7 and D output channels. As an example, setting C = 32, B = 4 and strides S = (2, 4, 5, 8), the time length of the input audio signal is reduced by a factor of 320 (2 × 4 × 5 × 8). For the activation function, the ELU (Exponential Linear Unit) is chosen as the nonlinearity, ELU(x) = x for x > 0 and ELU(x) = α(e^x − 1) for x ≤ 0, with α = 1. By way of example, the input to the encoder network is an audio waveform; assuming a duration t, after sampling it may be represented as a set of discrete sequence signals x ∈ R^(c_i × T_s), where c_i is the number of channels of the input audio and T_s is the number of audio samples contained in audio of duration t at a given sampling rate fs.
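The following PyTorch sketch illustrates an encoder of this shape. It is an assumption-level illustration rather than the patented implementation: the hyper-parameters C = 32, B = 4, strides (2, 4, 5, 8), kernel size 2S, ELU activation, two-layer LSTM and final D-channel projection follow the description above, while the class names, the symmetric padding, the input shape and the choice D = 128 are assumptions made for readability.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two kernel-size-3 convolutions with a skip connection, as described above."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.ELU(alpha=1.0),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ELU(alpha=1.0),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection

class Encoder(nn.Module):
    def __init__(self, in_channels=1, C=32, D=128, strides=(2, 4, 5, 8)):
        super().__init__()
        layers = [nn.Conv1d(in_channels, C, kernel_size=7, padding=3)]
        ch = C
        for S in strides:  # B = len(strides) convolution blocks
            layers += [
                ResidualUnit(ch),
                nn.ELU(alpha=1.0),
                # downsampling convolution: kernel K = 2*S, stride S
                nn.Conv1d(ch, ch * 2, kernel_size=2 * S, stride=S, padding=(S + 1) // 2),
            ]
            ch *= 2  # channel count doubles after each downsampling
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(ch, ch, num_layers=2, batch_first=True)  # two-layer LSTM
        self.proj = nn.Conv1d(ch, D, kernel_size=7, padding=3)       # D output channels

    def forward(self, x):  # x: (batch, channels, samples)
        h = self.conv(x)
        h, _ = self.lstm(h.transpose(1, 2))
        return self.proj(h.transpose(1, 2))  # latent z: (batch, D, frames)

z = Encoder()(torch.randn(1, 1, 24000))  # 1 s of 24 kHz audio
print(z.shape)                           # torch.Size([1, 128, 75]), i.e. 75 latent frames per second
```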
In this embodiment, as shown in fig. 2, the decoder network mirrors the encoder network, using transposed convolutions in place of the strided convolutions, and finally outputs the reconstructed mono or stereo audio.
The architecture of the codec network in this embodiment is streaming, in view of real-time audio transmission. On the encoder network side, the padding operation is placed only before the first time step, ensuring that the convolution kernel can correctly process the boundary portion of the audio signal while avoiding unnecessary delays during streaming. For each convolution layer, the padding equals the convolution kernel size minus the stride; all padding is applied only at the beginning of the segment, while the end of each segment retains unpadded data. In each convolution block, the stride setting ensures that a fixed number of encoding steps is generated when processing an audio segment. By way of example, for a 24 kHz audio signal, the encoder outputs a feature representation of 75 latent time steps per second. To maintain real-time performance, the encoder network generates a group of latent representations after each segment is processed and passes them to the subsequent quantization module. The decoder network likewise applies padding only at the beginning of each segment, corresponding to the encoder side; each time a segment is processed, the decoder saves the end portion of the previous segment (overlapping the beginning of the current segment) and decodes it together with the newly received latent representation, to reduce boundary effects and ensure continuity of the audio signal. Each encoder output time step corresponds to 320 input samples (the downsampling factor); with an input audio sampling rate of 24 kHz, each time step therefore corresponds to approximately 13.33 milliseconds of audio.
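A minimal sketch of the left-only padding rule described above (padding = kernel size minus stride, applied only at the start of a segment so the convolution introduces no look-ahead); the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def streaming_conv1d(x, weight, stride):
    """x: (batch, in_ch, time); weight: (out_ch, in_ch, kernel)."""
    pad_left = weight.shape[-1] - stride   # padding = kernel size - stride
    x = F.pad(x, (pad_left, 0))            # pad only the beginning of the segment
    return F.conv1d(x, weight, stride=stride)

w = torch.randn(64, 32, 8)                                  # kernel 2*S with S = 4
y = streaming_conv1d(torch.randn(1, 32, 400), w, stride=4)  # -> (1, 64, 100)
```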
In this embodiment, the quantizer is a residual multi-stage vector quantizer, and the audio latent representation z output by the encoder network is compressed in the residual multi-stage vector quantizer into discrete quantization indexes, i.e. codebook vectors, which represent discrete values obtained by multi-stage quantization; the residual multi-stage vector quantizer is composed of Nq vector quantization layers. Specifically, after the encoder network outputs the latent representation, this embodiment quantizes it using an improved residual vector quantization method. The goal of the quantizer is to compress the output of the encoder network to a target bit rate R (unit: bits per second). The codebook size of a conventional vector quantizer grows exponentially with the number of bits allocated to each audio frame, which is not only infeasible but also makes it difficult to adjust the bit rate dynamically. In ordinary vector quantization, the input vector x is mapped directly to the nearest codebook vector q, i.e. q = argmin_{c ∈ C} ‖x − c‖.
In this embodiment, a residual multi-stage vector quantizer is used, with Nq quantizer layers cascaded. First, the unquantized input vector x is processed through the first vector quantizer codebook and the quantized residual is calculated, yielding the first-layer quantization result q1 and the first residual r1, where:
q1 = Q1(x);
r1 = x − q1;
The residual r1 is then quantized by the second vector quantizer codebook:
q2 = Q2(r1);
r2 = r1 − q2;
This iterates until the predetermined number of quantization layers Nq is reached, and the quantization results of all layers are combined to obtain the final quantization result:
q = q1 + q2 + … + qNq;
The quantized codeword q is the result of the overall quantization process and represents the encoded signal; the final quantized feature is ẑ_q = q.
For example, assuming a target coding bit rate R = 6000 bps, a stride of 320 and an audio sampling rate fs = 24000 Hz, the encoder network outputs 75 latent representation frames per second, meaning 6000 bits per second are allocated equally across 75 frames, so r = 6000 / 75 = 80 bits must be allocated to each frame. This is impractical with an ordinary vector quantizer scheme, which would require a codebook of size 2^80. In the residual vector quantizer, the bit budget is distributed uniformly across the quantizers: taking Nq = 8, each quantizer receives r / Nq = 10 bits, i.e. a codebook of size 2^10 = 1024. For each target code rate, the per-frame bit budget r is obtained, and coding efficiency and computational complexity can be balanced by controlling Nq: a larger number of residual vector quantizer layers achieves higher coding efficiency and quality but significantly increases computational complexity, while a lower Nq leads to larger quantization error and poorer coding quality. More importantly, residual vector quantization enables an adaptive bit rate within the network: a higher Nq yields higher-quality audio at higher bit rates, while fewer layers reduce the bit rate.
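A sketch of this residual multi-stage vector quantization, written to match the example above (Nq = 8 stages of 10 bits each, i.e. 1024-entry codebooks). It is an illustration under those assumptions, not the patented code; a trained codec would learn its codebooks rather than sample them randomly.

```python
import torch

def rvq_quantize(x, codebooks):
    """x: (batch, dim); codebooks: list of (codebook_size, dim) tensors, one per stage."""
    residual = x
    q_total = torch.zeros_like(x)
    indices = []
    for cb in codebooks:                      # one quantization stage per codebook
        dist = torch.cdist(residual, cb)      # distances to all codewords
        idx = dist.argmin(dim=-1)             # nearest codeword index (the code to transmit)
        q_i = cb[idx]                         # stage-i quantization result q_i
        residual = residual - q_i             # r_i = r_{i-1} - q_i
        q_total = q_total + q_i               # q = q1 + q2 + ... + qNq
        indices.append(idx)
    return q_total, indices                   # quantized feature and per-stage codes

Nq, dim = 8, 128
codebooks = [torch.randn(1024, dim) for _ in range(Nq)]   # 10 bits per stage -> 2**10 entries
z_q, codes = rvq_quantize(torch.randn(4, dim), codebooks)
```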
A multi-layer quantization structure based on residual vector quantization is thus introduced into the audio encoding and decoding process. This structure progressively approximates the target audio signal over several quantization steps, so that high sound quality can still be maintained at low bit rates. Conventional vector quantization methods generally quantize the signal only once and easily cause degradation of sound quality. The multi-layer quantization structure better captures the complex details of the audio signal and reduces quantization error, thereby maintaining high sound quality even at low bit rates. By adjusting the number of quantization layers and the number of quantization bits per layer, the codec can flexibly adapt to different audio content and application scenarios and meet different coding requirements.
In this embodiment, the codec network further includes an entropy encoder located after the quantizer, configured to arithmetically encode the probability distribution of the quantization indexes to generate a compressed code stream. Specifically, after residual vector quantization, this embodiment further introduces entropy coding to compress the audio data. In residual vector quantization, the audio signal is compressed into discrete quantization indexes, i.e. codebook vectors, which represent discrete values obtained by multi-level quantization. For these discrete values, a probability distribution model of the quantization indexes is established and the probability of occurrence of each index is estimated; the efficiency of entropy coding depends on this probability distribution. As an example, a probability model is first constructed for the quantized symbol sequence: assuming the symbol sequence is S = {s1, s2, ..., sn}, the probability of each symbol si is derived from statistics of historical data and denoted P(si). The cumulative probability C(si) of each symbol is the sum of the probabilities of all symbols preceding si, i.e. C(si) = Σ_{j<i} P(sj). At the start of arithmetic coding, an initial interval [0, 1] is defined, and for each symbol si in the symbol sequence the current interval [low, high] is progressively narrowed to a smaller interval [low', high'] according to its probability P(si) and cumulative probability C(si), as follows:
low' = low + (high − low) · C(si);
high' = low + (high − low) · (C(si) + P(si));
After all symbols are processed, the encoder selects the midpoint of the final interval as the encoding result, and this value, expressed in binary, is the final compressed code stream.
This embodiment thus applies entropy coding to the encoder output after residual vector quantization, and achieves adaptive bit-rate adjustment by further compressing the quantized coding symbols. The entropy coder dynamically allocates bits according to the statistical properties of the input data, thereby further reducing the required bit rate while maintaining sound quality.
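The interval-narrowing step described above can be sketched as follows, using Python floats for clarity; a production arithmetic coder would use renormalized integer arithmetic to avoid precision loss, and the symbol statistics here are made up for the example.

```python
from collections import Counter

def arithmetic_encode(symbols, prob):
    # cumulative probability C(s): sum of the probabilities of all symbols ordered before s
    cum, running = {}, 0.0
    for s in sorted(prob):
        cum[s] = running
        running += prob[s]
    low, high = 0.0, 1.0                              # initial interval [0, 1]
    for s in symbols:
        width = high - low
        high = low + width * (cum[s] + prob[s])       # high' = low + (high - low) * (C(s) + P(s))
        low = low + width * cum[s]                    # low'  = low + (high - low) * C(s)
    return (low + high) / 2                           # midpoint of the final interval

codes = [3, 3, 1, 0, 3, 2, 3, 3]                      # example quantization indexes
prob = {s: c / len(codes) for s, c in Counter(codes).items()}
value = arithmetic_encode(codes, prob)                # the binary expansion of this value is the stream
```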
In this embodiment, a fusion loss function of the codec network is designed in advance, comprising a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss. The codec network is trained end to end with the original audio data as input: the model output is obtained through forward propagation, the loss value of the fusion loss function is calculated, the loss value is back-propagated to each parameter in the codec network, and the parameters are updated to reduce the loss value, giving an audio codec model. Specifically, the fusion loss function is L = λ_rec · L_rec + λ_percept · L_percept + λ_rvq · L_rvq + λ_commit · L_commit, where L_rec denotes the reconstruction loss, L_percept the perceptual loss, L_rvq the residual vector quantization loss, L_commit the commitment loss, and the λ value of each loss represents the trade-off and balance between the respective terms.
First, the reconstruction loss term consists of a time-domain term and a frequency-domain term. In the time domain, the L1 distance between the target audio and the compressed audio is minimized, i.e. L_t = ‖x − x̂‖_1. In the frequency domain, Mel spectrograms of the target signal and the reconstructed signal are computed at different time scales, and the L1 and L2 distances are combined with linear weights, giving L_f = Σ_{i∈e} ( ‖S_i(x) − S_i(x̂)‖_1 + α_i ‖S_i(x) − S_i(x̂)‖_2 ), where S_i(x) denotes the Mel spectrogram at the i-th scale, computed with a window size of 2^i, a hop size of 2^i / 4 and 64 frequency bins; s denotes the number of scales used, e = {5, ..., 11} is the set of scales, α denotes the set of scalar coefficients balancing the L1 and L2 distance terms, and α_i = 1 is set.
The goal of the reconstruction loss is to ensure as far as possible that the generated estimated audio approximates the original audio in both the time and frequency domains; the resulting reconstruction loss is a weighted sum of the two terms: L_rec = λ_t · L_t + λ_f · L_f.
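A sketch of this reconstruction loss using torchaudio Mel spectrograms; the window, hop and bin settings follow the description above (window 2^i, hop 2^i / 4, 64 bins, i in {5, ..., 11}), while the relative weighting of the terms is an assumption.

```python
import torch
import torchaudio

def reconstruction_loss(x, x_hat, sample_rate=24000, scales=range(5, 12), alpha=1.0):
    loss_t = (x - x_hat).abs().mean()                       # time-domain L1 term
    loss_f = 0.0
    for i in scales:                                        # multi-scale Mel-spectrogram term
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2 ** i,
            win_length=2 ** i, hop_length=2 ** i // 4, n_mels=64)
        s_x, s_hat = mel(x), mel(x_hat)
        loss_f = loss_f + (s_x - s_hat).abs().mean() \
                        + alpha * ((s_x - s_hat) ** 2).mean().sqrt()
    return loss_t + loss_f / len(scales)                    # equal weighting of L_t and L_f is an assumption

x, x_hat = torch.randn(1, 24000), torch.randn(1, 24000)
print(reconstruction_loss(x, x_hat))
```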
The perceptual loss, also called the discriminative loss, is used to optimize the generative model. The decoder network of the present invention generates waveforms from the compressed audio encoding, i.e. it acts as a generator. In contrast to the reconstruction loss, which compares sample points directly, the perceptual loss optimizes the model by measuring the similarity between the generated audio and the target audio in a high-level feature space, attempting to make the generated output perceptually closer to the target.
The perceptual loss of this embodiment is realized by a discriminator comprising a time-domain discriminator and a frequency-domain discriminator. The time-domain discriminator receives the time-domain audio signal and passes it through four one-dimensional convolution layers, each with kernel size k = 5, channel numbers 32, 64, 128 and 256, stride s = 2 and LeakyReLU activation; the stacked convolutions progressively extract local features of the signal, a global average pooling layer is applied after the convolution layers to aggregate features over the time dimension, and finally a fully connected layer maps the pooled output to a scalar. The frequency-domain discriminator is based on the short-time Fourier transform: it first receives the time-domain signal and converts it into a frequency-domain representation by the STFT, X(m, k) = Σ_n x(n) · w(n − mH) · e^(−j2πkn/N), where N = 512 is the STFT window length, H = 256 is the STFT hop size and w(n) is the window function, here a Hanning window w(n) = 0.5 · (1 − cos(2πn / (N − 1))). The resulting STFT time-frequency map passes through three two-dimensional convolution layers with kernel size 3 × 3 and channel numbers 32, 64 and 128 respectively, followed by a global average pooling layer and finally a fully connected layer mapping to a scalar. For both the time-domain and frequency-domain discriminators, a binary cross-entropy loss is used during training to measure the perceived difference between the estimated audio and the real audio, i.e. L_D = −[log D(x) + log(1 − D(x̂))], where D denotes the output (i.e. the realism score) of the time-domain or frequency-domain discriminator. Finally, the perceptual loss is a weighted sum of the time-domain and frequency-domain discriminative losses: L_percept = λ_time · L_D,time + λ_freq · L_D,freq.
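A rough sketch of the STFT-based frequency-domain discriminator described above (N = 512, hop 256, Hann window, three 3 × 3 convolutions with 32/64/128 channels, global average pooling, linear head); the time-domain branch is analogous with one-dimensional convolutions. Layer sizes follow the text; everything else (activation placement, class name) is an assumption. During training its output would be paired with the binary cross-entropy objective given above.

```python
import torch
import torch.nn as nn

class SpectralDiscriminator(nn.Module):
    def __init__(self, n_fft=512, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))  # Hann window w(n)
        chans = [1, 32, 64, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                          nn.LeakyReLU(0.2))
            for i in range(3)])
        self.fc = nn.Linear(128, 1)

    def forward(self, x):  # x: (batch, samples)
        spec = torch.stft(x, self.n_fft, self.hop, window=self.window,
                          return_complex=True).abs()      # magnitude time-frequency map
        h = self.convs(spec.unsqueeze(1))                 # (batch, 128, freq, frames)
        h = h.mean(dim=(2, 3))                            # global average pooling
        return self.fc(h)                                 # realism score (logit)

score = SpectralDiscriminator()(torch.randn(2, 24000))
```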
The RVQ loss is the residual vector quantization loss. The input signal x is passed through the encoder network to produce an initial feature representation denoted z. As described above, the residual of the first quantization step is r1 = z − Q1(z), the residual of the second step is r2 = r1 − Q2(r1), and so on, up to the Nq-th quantizer. The RVQ loss function guides the quantization process by minimizing the sum of squared residuals after each quantization level; the loss of the i-th level is L_rvq,i = ‖r_i‖_2^2, and the total RVQ loss is the sum over all quantization levels: L_rvq = Σ_{i=1}^{Nq} L_rvq,i.
The goal of both the commitment loss and the RVQ loss is to let the vector quantization process constrain the feature representation output by the encoder and to keep the encoder output from drifting away from the quantized representation. In the vector quantization process, the feature vector z produced by the encoder is ultimately mapped to discrete codebook vectors; the commitment loss is computed as the L2 norm between the encoded representation and the quantized representation, i.e. L_commit = Σ_{i=1}^{Nq} ‖r̃_i − q_i‖_2^2, where r̃_i denotes the residual vector input to the i-th layer before quantization and q_i denotes the quantized result vector of the i-th layer. The final loss function is a fusion of the reconstruction loss, the perceptual loss, the RVQ loss and the commitment loss.
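A sketch of the RVQ and commitment terms over the per-stage residuals; detaching the quantized vector in the commitment term is the usual way to route its gradient only to the encoder, and the variable names are assumptions.

```python
import torch

def rvq_and_commitment_loss(residuals_in, quantized):
    """residuals_in[i]: residual entering stage i; quantized[i]: its quantized vector q_i."""
    rvq_loss, commit_loss = 0.0, 0.0
    for r_i, q_i in zip(residuals_in, quantized):
        # squared residual left after stage i (RVQ term)
        rvq_loss = rvq_loss + ((r_i - q_i) ** 2).sum(dim=-1).mean()
        # commitment term: tie the encoder-side residual to the (detached) codeword
        commit_loss = commit_loss + ((r_i - q_i.detach()) ** 2).sum(dim=-1).mean()
    return rvq_loss, commit_loss
```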
To optimize the performance of the audio codec, this embodiment fuses four key loss functions, namely the reconstruction loss, the perceptual loss, the residual vector quantization loss and the commitment loss, and introduces a loss balancer to jointly optimize coding quality and the perceptual quality of the reconstruction. The balancer ensures that the several loss functions contribute in a balanced way during model training, stabilizing the optimization process. Balancing and fusing the losses improves the model in all dimensions of audio coding and reconstruction (time-domain fidelity, frequency-domain accuracy, perceptual quality, quantization efficiency, etc.). The dynamically adjusted weighting mechanism improves the model's adaptability to different audio content and scenarios and ensures stable performance and improved speech quality at low bit rates or in complex audio scenes.
When the network is trained, different loss functions produce gradients whose scales and magnitudes can differ greatly, which easily destabilizes the training process and affects the convergence speed and performance of the model. To address this problem, this embodiment introduces a balancer that dynamically adjusts the gradient of each loss term, balancing their contributions to the model during optimization.
Specifically, the core of the balancer is to normalize the gradient of each loss term according to its scale, so that the weights λ between different loss terms are no longer affected by their natural magnitudes. First, the gradient of each loss term with respect to the model output is defined as g_i = ∂L_i / ∂x̂. Then an exponential moving average of the L2 norm of each gradient is maintained, denoted ⟨‖g_i‖_2⟩_β. During each back-propagation pass, the gradient is normalized using a reference norm R to obtain the balanced gradient g̃_i = R · (λ_i / Σ_j λ_j) · g_i / ⟨‖g_i‖_2⟩_β, where λ_i is the weight of the i-th loss and Σ_j λ_j is the sum of all loss term weights. This balanced gradient ensures that the gradient of each loss function is scaled according to its relative importance.
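A sketch of this balancer under the stated assumptions (an EMA of each gradient norm, a reference norm R, weights λ_i); the class interface is invented for the illustration.

```python
import torch

class LossBalancer:
    def __init__(self, weights, beta=0.999, ref_norm=1.0):
        self.weights, self.beta, self.ref_norm = weights, beta, ref_norm
        self.ema = {name: 0.0 for name in weights}   # EMA of each gradient's L2 norm

    def backward(self, losses, model_output):
        total_weight = sum(self.weights.values())
        balanced = torch.zeros_like(model_output)
        for name, loss in losses.items():
            # gradient of this loss term with respect to the model output
            (g,) = torch.autograd.grad(loss, model_output, retain_graph=True)
            norm = g.norm().item()
            self.ema[name] = self.beta * self.ema[name] + (1 - self.beta) * norm
            scale = self.ref_norm * self.weights[name] / (total_weight * (self.ema[name] + 1e-8))
            balanced += scale * g                     # normalized, re-weighted gradient
        model_output.backward(gradient=balanced)      # propagate into the rest of the network
```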
In this embodiment, the audio to be processed is input into the audio codec model and a reconstructed audio signal is output. The audio codec model designed in this embodiment is intended to adapt to a variety of audio scenarios. To verify its performance on different audio content, including speech, music and general audio, three representative datasets were selected: for speech, the DNS Challenge 4 dataset, covering diverse speech scenarios and speakers; for music, the Jamendo dataset; and for general audio, the AudioSet dataset.
The audio codec of the present invention is compared with the currently mainstream Opus and EVS codecs, evaluated over the code rate range {1.5, 3, 6, 12} kbps, covering low, medium and high bit-rate scenarios. The results are shown in Table 1 and Fig. 3. At lower code rates the neural network structure of the invention exhibits stronger compression capability and perceptual quality than the conventional Opus and EVS codecs; the data in Table 1 also show that the invention performs stably across the different scenario datasets, and its MUSHRA scores for music and general audio even approach the quality of the reference audio, demonstrating wide applicability. The subjective audio quality assessment uses MUSHRA (Multiple Stimuli with Hidden Reference and Anchor), and the results are shown in Table 1:
TABLE 1 MUSHRA evaluation results of the invention and Opus, EVS Audio codec, with audio sample rates of 24kHz
It can also be seen from Fig. 3 that the coding bit rate of the invention is further reduced after entropy coding is applied.
In addition to MUSHRA as a subjective assessment, the present invention introduces objective evaluation metrics, including ViSQOL (Virtual Speech Quality Objective Listener) and SI-SNR (Scale-Invariant Signal-to-Noise Ratio). ViSQOL is an objective index of perceived audio quality: it compares the spectral similarity of audio signals and simulates the perceptual process of human hearing to obtain a sound quality score. First, a short-time Fourier transform is applied to the audio signal to generate a spectrum; then the similarity between the reference audio and the distorted audio is calculated in each time-frequency band; finally, the local similarities are weighted and summed into a global similarity, which is mapped into the MOS-LQO (Mean Opinion Score, Listening Quality Objective) range to produce the ViSQOL score. Using a logarithmic Mel-spectrogram representation, the perceptual similarity between the reference signal and the distorted signal is computed as Dmel(t) = (1/M) · Σ_m |Mref(m, t) − Mdist(m, t)|, where Dmel(t) is the Mel-spectrum distance at the t-th frame, M is the number of Mel bands (here M = 40), Mref(m, t) and Mdist(m, t) are the Mel-spectrum representations of the reference and distorted signals, and m is the Mel-band index. After the similarity is obtained, the similarity of each frame is converted into a ViSQOL score q(t) through a mapping function parameterized by a scaling factor α, here set to α = 1; the ViSQOL score typically lies between 0 and 100 and represents the perceived quality of the signal. The final ViSQOL score is a weighted average of the per-frame scores q(t).
SI-SNR is another index of signal similarity that focuses more on the structural similarity of signals. First the reference signal is aligned with the estimated signal: s_target = (⟨ŝ, s⟩ / ‖s‖²) · s, where ŝ is the reconstructed signal, s is the reference signal and ⟨·,·⟩ denotes the inner product. The error signal is then computed as e = ŝ − s_target. Finally, SI-SNR = 10 · log10(‖s_target‖² / ‖e‖²).
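The SI-SNR computation described above can be sketched as follows; the zero-mean step is the conventional preprocessing for this metric and is an assumption here.

```python
import torch

def si_snr(estimate, reference, eps=1e-8):
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)    # remove DC offset
    reference = reference - reference.mean(dim=-1, keepdim=True)
    dot = (estimate * reference).sum(dim=-1, keepdim=True)
    s_target = dot * reference / (reference.pow(2).sum(dim=-1, keepdim=True) + eps)  # projection onto the reference
    e_noise = estimate - s_target                                # error signal
    return 10 * torch.log10(s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps))

print(si_snr(torch.randn(1, 24000), torch.randn(1, 24000)))
```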
The invention was tested at a code rate of 6 kbps and compared with the conventional Opus and EVS codecs; the results are shown in Table 2. The invention is significantly superior to the conventional Opus and EVS codecs on the SI-SNR and ViSQOL indexes, indicating a clear advantage in the fidelity of the audio signals. In general, the present invention provides audio quality superior to that of conventional codecs under low bit-rate conditions, suggesting that it can be applied to scenarios with higher sound-quality requirements while still supporting audio streaming.
TABLE 2 Objective evaluation results of the invention and Opus, EVS Audio codec at a code rate of 6kbps
The above describes the audio encoding and decoding method based on the neural network in the embodiment of the present invention, and the following describes the audio encoding and decoding device based on the neural network in the embodiment of the present invention, please refer to fig. 4, and one embodiment of the audio encoding and decoding device based on the neural network in the embodiment of the present invention includes:
A network construction module 10 for pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂;
A loss function design module 20, configured to pre-design a fusion loss function of the codec network, where the fusion loss function includes a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss; to perform end-to-end training of the codec network using original audio data as input, obtain the model output through forward propagation, calculate the loss value of the fusion loss function, back-propagate the loss value to each parameter in the codec network, and update the parameters to reduce the loss value, obtaining an audio codec model;
the output module 30 is configured to input the audio to be processed into the audio codec model and output a reconstructed audio signal.
Based on the same idea as the above embodiments, the apparatus provided by the present application can implement the method of the above embodiments. For convenience of explanation, the schematic structural diagram of the apparatus embodiment shows only the portion related to the embodiment of the present application; those skilled in the art will understand that the illustrated structure does not limit the apparatus, which may include more or fewer modules than illustrated, combine certain modules, or arrange the modules differently.
Fig. 4 describes the neural-network-based audio codec device in the embodiment of the present invention in detail from the perspective of modularized functional entities; the following describes the neural-network-based audio codec device in the embodiment of the present invention in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a neural-network-based audio codec device according to an embodiment of the present invention. The neural-network-based audio codec device 100 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 11 (e.g., one or more processors), a memory 12, and one or more storage media 13 (e.g., one or more mass storage devices) storing applications 133 or data 132. The memory 12 and the storage medium 13 may be transitory or persistent storage. The program stored in the storage medium 13 may include one or more modules (not shown), each of which may include a series of instruction operations for the neural-network-based audio codec device 100. Furthermore, the processor 11 may be arranged to communicate with the storage medium 13 and execute the series of instruction operations in the storage medium 13 on the neural-network-based audio codec device 100.
The neural-network-based audio codec device 100 may also include one or more power supplies 14, one or more wired or wireless network interfaces 15, one or more input/output interfaces 16, and/or one or more operating systems 131, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the device structure shown in fig. 5 does not constitute a limitation of the neural-network-based audio codec device 100, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, having instructions stored therein, which when executed on a computer, cause the computer to perform the steps of a neural network based audio codec method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the parts contributing to the prior art, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the invention.

Claims (10)

1. An audio encoding and decoding method based on a neural network is characterized by comprising the following steps:
Pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂;
pre-designing a fusion loss function of the codec network, wherein the fusion loss function comprises a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss; performing end-to-end training of the codec network with original audio data as input, obtaining the model output through forward propagation, calculating the loss value of the fusion loss function, back-propagating the loss value to each parameter in the codec network, and updating the parameters to reduce the loss value, to obtain an audio codec model;
inputting the audio to be processed into the audio coding and decoding model, and outputting a reconstructed audio signal.
2. The audio codec method of claim 1, wherein the encoder network includes a first convolutional layer having a channel number of C and a convolution kernel size of 7, B convolutional blocks, an LSTM for sequence modeling, and a second convolutional layer having a convolution kernel size of 7 and an output channel of D, wherein each convolutional block consists of a residual unit followed by a downsampling layer consisting of a convolution with a stride of S, the downsampling layer having a kernel size of K twice the stride of S, and the residual unit comprises two convolutions with a kernel size of 3 and 1 skip connection.
3. The neural-network-based audio codec method according to claim 1, wherein the quantizer is a residual multi-stage vector quantizer; the audio latent representation z output by the encoder network is compressed in the residual multi-stage vector quantizer into discrete quantization indexes, i.e. codebook vectors, representing discrete values obtained by multi-stage quantization; the residual multi-stage vector quantizer is composed of a hierarchy of Nq vector quantization layers; an unquantized input vector x is processed through a first vector quantizer codebook and the quantized residual is calculated to obtain a first-layer quantization result q1 and a first residual r1; the residual r1 is quantized through a second vector quantizer codebook to obtain a second-layer quantization result q2 and a second residual r2; the iteration continues until a predetermined number of quantization layers Nq is reached, and the quantization results of all layers are combined to obtain the final quantization result q = q1 + q2 + ... + qNq; the quantized codeword q is the result of the overall quantization process and represents the encoded signal, and the final quantized feature is ẑ_q = q.
4. A neural network based audio codec method according to claim 3, wherein the codec network further comprises an entropy encoder located after the quantizer, the entropy encoder being configured to arithmetically encode a probability distribution of quantization indices to generate a compressed code stream.
5. The neural-network-based audio coding and decoding method according to claim 4, wherein the entropy coder is configured to arithmetically encode the probability distribution of the quantization indexes to generate a compressed code stream, comprising the steps of:
constructing a probability model for the quantized symbol sequence, assuming the symbol sequence is S = {s1, s2, ..., sn}, wherein the probability of each symbol si is derived from statistics of historical data and denoted P(si), and the cumulative probability C(si) of each symbol is the sum of the probabilities of all symbols preceding si, i.e.
C(si) = Σ_{j<i} P(sj);
at the beginning of arithmetic coding, defining an initial interval [0, 1], and for each symbol si in the symbol sequence, gradually narrowing the current interval [low, high] to a smaller interval [low', high'] according to its probability P(si) and cumulative probability C(si), as follows:
low' = low + (high − low) · C(si);
high' = low + (high − low) · (C(si) + P(si));
when all symbols are processed, selecting the midpoint of the final interval as the coding result, the value being expressed in binary as the final compressed code stream.
6. The neural-network-based audio codec method of claim 1, wherein the fusion loss function is L = λ_rec · L_rec + λ_percept · L_percept + λ_rvq · L_rvq + λ_commit · L_commit, where L_rec represents the reconstruction loss, L_percept represents the perceptual loss, L_rvq represents the residual vector quantization loss, L_commit represents the commitment loss, and the λ value corresponding to each loss represents the trade-off and balance between the respective terms.
7. The audio codec method based on the neural network according to claim 1, wherein the contribution of each loss term to the model in the optimization process is balanced by dynamically adjusting the gradient of each loss term in the fusion loss function by introducing a balancer in the end-to-end training process of the codec network.
8. An audio codec device based on a neural network, comprising:
a network construction module for pre-constructing a codec network comprising an encoder network for receiving raw audio data as input and outputting an audio latent representation z, a quantizer for compressing the audio latent representation z and outputting a compressed latent representation z_q, and a decoder network for reconstructing the compressed latent representation z_q into a time-domain signal x̂;
a loss function design module for pre-designing a fusion loss function of the codec network, wherein the fusion loss function comprises a reconstruction loss, a perceptual loss, a residual vector quantization loss and a commitment loss, performing end-to-end training of the codec network with the original audio data as input, obtaining the model output through forward propagation, calculating the loss value of the fusion loss function, back-propagating the loss value to each parameter in the codec network, and updating the parameters to reduce the loss value, thereby obtaining an audio codec model;
and the output module is used for inputting the audio to be processed into the audio coding and decoding model and outputting a reconstructed audio signal.
9. An audio codec device based on a neural network, comprising a memory and at least one processor, the memory having computer-readable instructions stored therein;
the at least one processor invokes the computer readable instructions in the memory to perform the steps of the neural network based audio codec method of any one of claims 1-7.
10. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the steps of the neural network based audio codec method of any one of claims 1-7.
CN202411666474.XA 2024-11-21 2024-11-21 Audio encoding and decoding method, device, equipment and storage medium based on neural network Pending CN119152863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411666474.XA CN119152863A (en) 2024-11-21 2024-11-21 Audio encoding and decoding method, device, equipment and storage medium based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411666474.XA CN119152863A (en) 2024-11-21 2024-11-21 Audio encoding and decoding method, device, equipment and storage medium based on neural network

Publications (1)

Publication Number Publication Date
CN119152863A 2024-12-17

Family

ID=93810724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411666474.XA Pending CN119152863A (en) 2024-11-21 2024-11-21 Audio encoding and decoding method, device, equipment and storage medium based on neural network

Country Status (1)

Country Link
CN (1) CN119152863A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120164473A (en) * 2025-05-20 2025-06-17 中国海洋大学 Voice compression method with end-to-end variable compression ratio

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023177803A1 (en) * 2022-03-18 2023-09-21 Google Llc Compressing audio waveforms using a structured latent space
CN118314911A (en) * 2024-02-26 2024-07-09 南京大学 Voice compression method with variable code rate

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023177803A1 (en) * 2022-03-18 2023-09-21 Google Llc Compressing audio waveforms using a structured latent space
CN118314911A (en) * 2024-02-26 2024-07-09 南京大学 Voice compression method with variable code rate

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXANDRE DÉFOSSEZ ET AL.: "High Fidelity Neural Audio Compression", arXiv:2210.13438v1, 24 October 2022 (2022-10-24), pages 1-19 *
NEIL ZEGHIDOUR ET AL.: "SoundStream: An End-to-End Neural Audio Codec", arXiv:2107.03312v1, 7 July 2021 (2021-07-07), pages 1-12 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120164473A (en) * 2025-05-20 2025-06-17 中国海洋大学 Voice compression method with end-to-end variable compression ratio

Similar Documents

Publication Publication Date Title
RU2696292C2 (en) Audio encoder and decoder
KR101343267B1 (en) Method and apparatus for audio coding and decoding using frequency segmentation
US8972270B2 (en) Method and an apparatus for processing an audio signal
CN112154502B (en) Supporting comfort noise generation
KR101246991B1 (en) Audio codec post-filter
JP4101957B2 (en) Joint quantization of speech parameters
KR20240022588A (en) Compress audio waveforms using neural networks and vector quantizers
JPH03211599A (en) Voice coder/decoder with 4.8 bps information transmitting speed
RU2505921C2 (en) Method and apparatus for encoding and decoding audio signals (versions)
MXPA96004161A (en) Quantification of speech signals using human auiditive models in predict encoding systems
JP2017182087A (en) Advanced quantizer
CN119152863A (en) Audio encoding and decoding method, device, equipment and storage medium based on neural network
JPWO2023278889A5 (en)
CN109427338B (en) Coding method and coding device for stereo signal
JP4359949B2 (en) Signal encoding apparatus and method, and signal decoding apparatus and method
JP3684751B2 (en) Signal encoding method and apparatus
US11270714B2 (en) Speech coding using time-varying interpolation
CN103503065B (en) For method and the demoder of the signal area of the low accuracy reconstruct that decays
US20130106626A1 (en) Encoding method, decoding method, encoding device, decoding device, program, and recording medium
CN117616498A (en) Compression of audio waveforms using neural networks and vector quantizers
EP3084761B1 (en) Audio signal encoder
RU2769429C2 (en) Audio signal encoder
JPH0774642A (en) Linear predictive coefficient interpolating device
TW202427458A (en) Error resilient tools for audio encoding/decoding
JP2022552319A (en) Method and system for waveform encoding of audio signals using generative models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination