
CN118197325A - Dual-channel to multi-channel upmixing method, device, storage medium and equipment - Google Patents


Info

Publication number
CN118197325A
CN118197325A
Authority
CN
China
Prior art keywords
channel
training
coefficient
spectrum coefficient
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410289229.5A
Other languages
Chinese (zh)
Inventor
李强
王凌志
叶东翔
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bairui Interconnection Integrated Circuit Shanghai Co ltd
Original Assignee
Bairui Interconnection Integrated Circuit Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bairui Interconnection Integrated Circuit Shanghai Co ltd filed Critical Bairui Interconnection Integrated Circuit Shanghai Co ltd
Priority to CN202410289229.5A
Publication of CN118197325A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0017 Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • G10L19/0018 Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80 Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

The application discloses a two-channel to multi-channel upmixing method, apparatus, storage medium and device, belonging to the technical field of Bluetooth audio encoding and decoding. The method comprises: obtaining a two-channel LC3 code stream and partially decoding it to obtain left and right channel spectral coefficients; adding the left channel spectral coefficients and the right channel spectral coefficients to obtain middle channel spectral coefficients; obtaining center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model; obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; low-pass filtering the middle channel spectral coefficients to obtain subwoofer channel spectral coefficients; and performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel, and subwoofer channel spectral coefficients, and outputting multi-channel audio. The application enhances the user's sense of immersion.

Description

Dual-channel to multi-channel upmixing method, device, storage medium and equipment
Technical Field
The application belongs to the technical field of Bluetooth audio encoding and decoding, and particularly relates to a dual-channel to multi-channel upmixing method, apparatus, storage medium and device.
Background
The LC3 codec has attracted increasing attention from manufacturers because of its low latency, high sound quality, good coding gain, and royalty-free status in the Bluetooth field.
At present, Bluetooth speakers are popular, and beyond enjoying stereo sound, users hope for greater immersion. 5.1-channel surround sound is a widely used audio format because it provides a better user experience. However, playing a 5.1-channel sound source through a Bluetooth speaker poses the following challenges:
(1) If the sound source is in 5.1-channel format, the data flow to the Bluetooth speaker is as follows: the Bluetooth transmitting end decodes the 5.1-channel source into 6 channels of PCM signals, performs audio encoding (such as LC3 encoding) on each channel, and transmits the encoded audio signals; the Bluetooth receiving end receives the audio signals of the 6 channels, decodes them, and plays them through the Bluetooth speaker. The drawback is the high over-the-air transmission rate: if each channel is compressed at the recommended rate of 124 kbps, the total over-the-air rate is 744 kbps. While this is feasible for LE Audio, such a high rate can not only cause audio stuttering but also interfere with other nearby devices, which is a major challenge for the Bluetooth radio-frequency front end.
(2) Few sound sources exist in 5.1-channel format; the popular sources online are mainly two-channel, but the two-channel listening experience is poor, with insufficient immersion in particular.
In existing two-channel to multi-channel upmixing techniques, PCA (Principal Component Analysis) is widely used, including in Dolby surround sound decoders. However, due to the time-varying correlation and nonlinear relationships among multi-channel signals, the surround sound generated by PCA differs greatly from the original surround sound, and the spatial sense and immersion of the upmixed audio are insufficient.
Disclosure of Invention
In view of the above technical problems in the prior art, the present application provides a two-channel to multi-channel upmixing method, apparatus, storage medium and device that use deep learning to realize two-channel to multi-channel upmixing at the Bluetooth receiving end, enhancing the user's sense of immersion and improving the user experience.
To achieve the above object, the first technical scheme adopted by the present application is to provide a two-channel to multi-channel upmixing method comprising: at the Bluetooth receiving end, acquiring a two-channel LC3 code stream and partially decoding it to obtain left channel spectral coefficients and right channel spectral coefficients; adding the left channel spectral coefficients and the right channel spectral coefficients to obtain middle channel spectral coefficients; obtaining center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model; obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; low-pass filtering the middle channel spectral coefficients to obtain subwoofer channel spectral coefficients; and performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel, and subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
The second technical scheme adopted by the present application is to provide a two-channel to multi-channel upmixing apparatus comprising: a module for acquiring, at the Bluetooth receiving end, a two-channel LC3 code stream and partially decoding it to obtain left channel spectral coefficients and right channel spectral coefficients; a module for adding the left channel spectral coefficients and the right channel spectral coefficients to obtain middle channel spectral coefficients; a module for obtaining center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model; a module for obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; a module for obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; a module for low-pass filtering the middle channel spectral coefficients to obtain subwoofer channel spectral coefficients; and a module for performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel, and subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
The third technical scheme adopted by the present application is to provide a computer-readable storage medium having stored thereon computer instructions operable to perform the two-channel to multi-channel upmixing method of the first scheme.
The fourth technical scheme adopted by the present application is to provide a computer device comprising a processor and a memory storing computer instructions, wherein the processor runs the computer instructions to perform the two-channel to multi-channel upmixing method of the first scheme.
The technical scheme of the present application has the following beneficial effects: it can be applied to both Bluetooth Low Energy and Classic Bluetooth, and it reuses the existing time-frequency transform and overlap-add so that algorithm delay does not increase; with deep learning at the Bluetooth receiving end, a 5.1-channel original sound source only needs to be transmitted at a two-channel bit rate, saving over-the-air bandwidth and reducing interference with the wireless environment; and for a two-channel original sound source, two-channel to multi-channel upmixing is realized, enhancing the user's sense of immersion.
Drawings
FIG. 1 is a flow diagram of one embodiment of a two-channel to multi-channel upmixing method of the present application;
FIG. 2 is a schematic diagram of one embodiment of the offline training and online reasoning process of the neural network model in the two-channel to multi-channel upmixing method of the present application;
Fig. 3 is a schematic diagram of an embodiment of the two-channel to multi-channel upmixing apparatus of the present application.
Detailed Description
The preferred embodiments of the present application are described in detail below with reference to the accompanying drawings, so that the advantages and features of the present application can be more easily understood by those skilled in the art and the scope of the present application is clearly defined.
The invention is illustrated with 5.1 channels as the multi-channel format, although its principles can be applied to a larger or smaller number of channels, such as 7.1 channels.
Typical 5.1 channels include a left channel (Left Channel), a center channel (Center Channel), a right channel (Right Channel), a left surround channel (Left Surround), a right surround channel (Right Surround), and a subwoofer channel (Low Frequency Effects).
The Bluetooth transmitting end generally refers to a device with a Bluetooth transmitting function, such as a Bluetooth transceiver, a mobile phone, a tablet computer, or a notebook computer.
The Bluetooth receiving end generally refers to a Bluetooth speaker, or a device containing one, such as a vehicle Bluetooth playback system.
The invention is generally applied to a device with a Bluetooth receiving function, and its overall idea is as follows:
(1) At the Bluetooth receiving end, first generate a middle channel signal from the two channel signals (the left channel signal and the right channel signal);
(2) Next, using neural networks:
a) Generate a center channel signal from the middle channel signal;
b) Generate a left surround channel signal from the left channel signal;
c) Generate a right surround channel signal from the right channel signal;
(3) Finally, low-pass filter the middle channel signal to generate a subwoofer channel signal, thereby realizing the conversion from two-channel stereo to 5.1-channel surround sound at the Bluetooth receiving end.
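The steps above operate entirely on spectral coefficients. They can be sketched as follows; the per-bin gain callables stand in for the three pre-trained neural network models, and the subwoofer bin cutoff is an illustrative assumption not specified in the text:

```python
import numpy as np

def upmix_spectra(x_left, x_right, model_center, model_left_sur, model_right_sur,
                  lfe_cutoff_bins=20):
    """Spectral-domain 2-to-5.1 upmix following steps (1)-(3) above.

    x_left, x_right: per-frame spectral coefficients from partial LC3 decoding.
    model_*: callables mapping a log-magnitude feature vector to per-bin gains
    (stand-ins for the three pre-trained neural network models).
    lfe_cutoff_bins: illustrative subwoofer cutoff (an assumption).
    """
    eps = 1e-12  # avoid log(0)
    x_mid = x_left + x_right                                           # step (1)
    x_center = x_mid * model_center(np.log(np.abs(x_mid) + eps))       # step (2a)
    x_lsur = x_left * model_left_sur(np.log(np.abs(x_left) + eps))     # step (2b)
    x_rsur = x_right * model_right_sur(np.log(np.abs(x_right) + eps))  # step (2c)
    x_lfe = np.zeros_like(x_mid)                                       # step (3):
    x_lfe[:lfe_cutoff_bins] = x_mid[:lfe_cutoff_bins]                  # crude low-pass
    return x_left, x_right, x_center, x_lsur, x_rsur, x_lfe
```

With identity gain models the center channel is simply the sum of left and right, which matches step (1) feeding step (2a).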
Fig. 1 is a flow chart of one embodiment of the two-channel to multi-channel upmixing method of the present application.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S101: at the Bluetooth receiving end, acquiring a two-channel LC3 code stream and partially decoding it to obtain left channel spectral coefficients and right channel spectral coefficients.
In this embodiment, at the Bluetooth receiving end, the two-channel LC3 code stream is acquired through the Bluetooth communication module, where the two-channel LC3 code stream comprises a left channel code stream and a right channel code stream.
In one embodiment of the present application, partially decoding the two-channel LC3 code stream comprises: decoding the LC3 audio bitstream up to and including the transform-domain noise shaping decoding step.
In this particular embodiment, partial decoding is performed on the two-channel LC3 code stream up to transform-domain noise shaping, yielding the left and right channel spectral coefficients X_Left(k), X_Right(k), k = 0…N_F−1. The partially decoded left and right channel spectral coefficients are subsequently used to generate the center channel signal, left surround channel signal, and right surround channel signal of the multi-channel output.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S102 of adding the left channel spectral coefficients and the right channel spectral coefficients to obtain the middle channel spectral coefficients.
In this particular embodiment, the middle channel spectral coefficients X_Middle(k) are computed as:
X_Middle(k) = X_Left(k) + X_Right(k), k = 0…N_F−1
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S103 of obtaining center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model.
In a specific embodiment of the present application, obtaining the center channel spectral coefficients from the middle channel spectral coefficients using the first pre-trained neural network model comprises: taking the logarithm of the middle channel spectral coefficients to obtain the magnitude spectrum feature of the middle channel signal; inputting the magnitude spectrum feature of the middle channel signal into the first pre-trained neural network model to obtain the center channel gain; and multiplying the middle channel spectral coefficients by the center channel gain to obtain the center channel spectral coefficients.
Specifically, first, the magnitude spectrum feature of the middle channel signal is extracted:
log|X_Middle(k)| = log|X_Left(k) + X_Right(k)|, k = 0…N_F−1
Next, the magnitude spectrum feature of the middle channel signal is input into the first pre-trained neural network model to obtain the gain corresponding to each valid frequency index, i.e., the center channel gain: g_nn(k), k = 0…N_F−1.
Finally, the middle channel spectral coefficients are multiplied by the center channel gain to obtain the center channel spectral coefficients X_Center(k):
X_Center(k) = X_Middle(k) * g_nn(k)
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S104 of obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model.
In one embodiment of the present application, obtaining the left surround channel spectral coefficients from the left channel spectral coefficients using the second pre-trained neural network model comprises: taking the logarithm of the left channel spectral coefficients to obtain the magnitude spectrum feature of the left channel signal; inputting the magnitude spectrum feature of the left channel signal into the second pre-trained neural network model to obtain the corresponding gain; and multiplying the left channel spectral coefficients by this gain to obtain the left surround channel spectral coefficients.
In this embodiment, the procedure for obtaining the left surround channel spectral coefficients is similar to that for the center channel spectral coefficients and is not repeated here.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S105 of obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model.
In one embodiment of the present application, obtaining the right surround channel spectral coefficients from the right channel spectral coefficients using the third pre-trained neural network model comprises: taking the logarithm of the right channel spectral coefficients to obtain the magnitude spectrum feature of the right channel signal; inputting the magnitude spectrum feature of the right channel signal into the third pre-trained neural network model to obtain the corresponding gain; and multiplying the right channel spectral coefficients by this gain to obtain the right surround channel spectral coefficients.
In this embodiment, the procedure for obtaining the right surround channel spectral coefficients is similar to that for the center channel spectral coefficients and is not repeated here.
FIG. 2 is a schematic diagram of one embodiment of the offline training and online reasoning process of the neural network model in the two-channel to multi-channel upmixing method of the present application.
As shown in fig. 2, the offline training process is above the middle dashed line, and the online reasoning process is below it. In the offline training process: first, a 5.1-channel sound source is acquired as training material; second, the 5.1-channel sound source is downmixed to obtain two-channel stereo; third, features are extracted from the 5.1-channel sound source and from its two-channel downmix, yielding 5.1-channel features and two-channel features; finally, the 5.1-channel features and the two-channel features are input into the neural network models for training, yielding the pre-trained neural network models. In the online reasoning process: first, a two-channel sound source code stream is obtained through the Bluetooth communication module; second, features are extracted from the two-channel code stream to obtain two-channel features; third, the two-channel features are input into the pre-trained neural network models, and the spectral coefficients of each output channel are obtained from the two-channel spectral coefficients; finally, the remaining decoding modules are executed on the spectral coefficients of each channel to output the multi-channel audio signal.
In a specific embodiment of the present application, the training process of the first, second, and third pre-trained neural network models comprises: acquiring multi-channel surround sound for training, and downmixing it to obtain two-channel stereo for training; extracting features from the training two-channel stereo to obtain two-channel features, including the magnitude spectrum feature of the training middle channel signal, the magnitude spectrum feature of the training left channel signal, and the magnitude spectrum feature of the training right channel signal; extracting features from the training multi-channel surround sound to obtain multi-channel features, including the magnitude spectrum feature of the training center channel signal, the magnitude spectrum feature of the training left surround channel signal, and the magnitude spectrum feature of the training right surround channel signal; inputting the magnitude spectrum features of the training middle channel signal and the training center channel signal into a neural network for training to obtain the first pre-trained neural network model; inputting the magnitude spectrum features of the training left channel signal and the training left surround channel signal into a neural network for training to obtain the second pre-trained neural network model; and inputting the magnitude spectrum features of the training right channel signal and the training right surround channel signal into a neural network for training to obtain the third pre-trained neural network model.
The training process of the first pre-trained neural network model (for the center channel) is illustrated below in conjunction with fig. 2.
The training process of the neural network model is shown above the middle dashed line in fig. 2. Specifically:
First, a certain number of 5.1-channel sound sources are acquired as training material;
Second, the 5.1-channel sound source is downmixed to obtain two-channel stereo. This process is a mature technique; one common method is briefly described as follows:
x_left(n) = FL(n)*0.4142 + FR(n)*0.0000 + FC(n)*0.2929 + LFE(n)*0.0000 + BL(n)*0.2929 + BR(n)*0.0000
x_right(n) = FL(n)*0.0000 + FR(n)*0.4142 + FC(n)*0.2929 + LFE(n)*0.0000 + BL(n)*0.0000 + BR(n)*0.2929
where FL, FR, FC, LFE, BL, BR correspond to the left channel, right channel, center channel, subwoofer channel, left surround channel, and right surround channel in the 5.1-channel definition.
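The downmix above is a fixed matrix multiply per sample. A minimal sketch, with coefficient values taken directly from the equations above (0.4142 ≈ √2−1 and 0.2929 ≈ (√2−1)/√2):

```python
import numpy as np

# Downmix coefficients from the equations above.
# Channel order: FL, FR, FC, LFE, BL, BR.
DOWNMIX = np.array([
    [0.4142, 0.0000, 0.2929, 0.0000, 0.2929, 0.0000],  # x_left
    [0.0000, 0.4142, 0.2929, 0.0000, 0.0000, 0.2929],  # x_right
])

def downmix_51_to_stereo(frames_51):
    """frames_51: array of shape (6, n_samples) in FL/FR/FC/LFE/BL/BR order.
    Returns a (2, n_samples) stereo signal per the matrix above."""
    return DOWNMIX @ frames_51
```

A center-only input, for example, appears equally in both stereo channels scaled by 0.2929.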
In a specific embodiment of the present application, extracting features from the training two-channel stereo to obtain two-channel features comprises: performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream; partially decoding the training two-channel LC3 code stream to obtain training left channel spectral coefficients and training right channel spectral coefficients; obtaining training middle channel spectral coefficients from the training left and right channel spectral coefficients; taking the logarithm of the training middle channel spectral coefficients to obtain the magnitude spectrum feature of the training middle channel signal; taking the logarithm of the training left channel spectral coefficients to obtain the magnitude spectrum feature of the training left channel signal; and taking the logarithm of the training right channel spectral coefficients to obtain the magnitude spectrum feature of the training right channel signal.
Next, feature extraction is performed on the two-channel stereo obtained by downmixing the 5.1-channel sound source, briefly as follows:
(1) First, LC3 encoding is performed on the downmixed two-channel stereo;
(2) Then, LC3 partial decoding up to transform-domain noise shaping is performed on the encoded result to obtain the left channel spectral coefficients and right channel spectral coefficients;
(3) Finally, the magnitude spectrum feature log|X_Middle(k)| of the middle channel signal is calculated, in the same way as described above, and is not detailed again here.
In a specific embodiment of the present application, extracting features from the training multi-channel surround sound to obtain multi-channel features comprises: performing LC3 encoding on the training multi-channel surround sound to obtain a training center channel code stream, a training left surround channel code stream, and a training right surround channel code stream; partially decoding these code streams to obtain training center channel spectral coefficients, training left surround channel spectral coefficients, and training right surround channel spectral coefficients; taking the logarithm of the training center channel spectral coefficients to obtain the magnitude spectrum feature of the training center channel signal; taking the logarithm of the training left surround channel spectral coefficients to obtain the magnitude spectrum feature of the training left surround channel signal; and taking the logarithm of the training right surround channel spectral coefficients to obtain the magnitude spectrum feature of the training right surround channel signal.
Next, feature extraction is performed on the 5.1-channel sound source, briefly as follows:
(1) First, LC3 encoding is performed on each of the 5.1 channels;
(2) Then, LC3 partial decoding up to transform-domain noise shaping is performed on the encoded result to obtain the center channel spectral coefficients;
(3) Finally, the magnitude spectrum feature log|X_Center(k)| of the center channel signal is calculated.
Finally, the magnitude spectrum feature log|X_Middle(k)| of the middle channel signal and the magnitude spectrum feature log|X_Center(k)| of the center channel signal are input into a neural network for training; the weights and biases of the neural network are adjusted using the back propagation algorithm, and training stops when the error falls below a preset value or a preset number of training iterations is reached, yielding the first pre-trained neural network model.
Specifically, the application does not limit the choice of neural network; considering the temporal correlation between audio frames, a recurrent neural network (RNN) or one of its variants, such as LSTM or GRU, is preferred.
In one embodiment of the application, the two-channel and multi-channel features are each input into the neural network in a predetermined number of consecutive frames for training.
In this particular embodiment, the input to the neural network is the magnitude spectrum of the signal and the output is the estimated gain g_nn(k); considering the temporal correlation of audio, each input magnitude spectrum feature spans a certain number of consecutive frames. For example, with a 10 ms frame length, typically 5 frames are used; with a 7.5 ms frame length, typically 7 frames.
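Stacking consecutive frames into one input vector can be sketched as follows; the left-padding strategy for the first frames is an assumption, since the text does not specify how the start of a stream is handled:

```python
import numpy as np

def stack_context(frames, context=5):
    """Concatenate each frame's feature vector with its (context - 1)
    predecessors, mirroring the '5 frames at 10 ms' input described above.

    frames: (n_frames, n_bins) array of per-frame log-magnitude features.
    Returns (n_frames, context * n_bins); the earliest frames are left-padded
    by repeating frame 0 (an assumption), oldest frame first in each row.
    """
    padded = np.concatenate([np.repeat(frames[:1], context - 1, axis=0), frames])
    return np.concatenate([padded[i:i + len(frames)] for i in range(context)],
                          axis=1)
```

Each output row then carries the temporal context the recurrent or feed-forward model needs for one frame's gain estimate.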
The learning target during neural network training is as follows: for all valid frequency bins, the ideal gain is calculated, e.g. as the per-bin magnitude ratio between the target channel and the input channel:
g_ideal(k) = |X_Center(k)| / |X_Middle(k)|, k = 0…N_F−1
In the neural network training process, the loss function (i.e., error) used in back propagation is defined on the gains, e.g. as the squared error between the estimated and ideal gains:
Loss = Σ_k (g_nn(k) − g_ideal(k))²
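As a concrete sketch of this training target: the original formulas are not reproduced in this text, so the magnitude-ratio gain and squared-error loss below are assumptions, chosen to be consistent with X_Center(k) = X_Middle(k) * g_nn(k):

```python
import numpy as np

def ideal_gain(x_target, x_input, eps=1e-12):
    """Per-bin training target implied by X_Center(k) = X_Middle(k) * g(k):
    the magnitude ratio of target to input spectra (an assumption; the
    patent's exact formula is not reproduced in this text)."""
    return np.abs(x_target) / (np.abs(x_input) + eps)

def gain_mse_loss(g_nn, g_ideal):
    """Mean squared error between predicted and ideal per-bin gains."""
    return float(np.mean((g_nn - g_ideal) ** 2))
```

The loss is zero exactly when the network reproduces the ideal gain in every bin, which is the stopping condition the text describes (error below a preset value).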
Since the training processes of the second and third pre-trained neural network models are similar to that of the first pre-trained neural network model, a detailed description is omitted here.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S106 of low-pass filtering the middle channel spectral coefficients to obtain the subwoofer channel spectral coefficients.
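Since the coefficients are already in the frequency domain, the low-pass filter can simply discard bins above a cutoff. A minimal sketch, assuming a hypothetical 120 Hz LFE cutoff and 48 kHz / 480-bin framing (neither is specified in the text):

```python
import numpy as np

def lfe_band_bins(cutoff_hz=120.0, sample_rate=48000, n_bins=480):
    """Number of spectral bins retained for the subwoofer channel. The 120 Hz
    cutoff is a hypothetical choice; each bin spans (sample_rate/2)/n_bins Hz."""
    bin_width = (sample_rate / 2) / n_bins
    return int(cutoff_hz / bin_width)

def spectral_lowpass(x_middle, keep_bins):
    """Zero all middle-channel spectral coefficients above the cutoff bin,
    yielding the subwoofer channel spectral coefficients."""
    x_lfe = np.zeros_like(x_middle)
    x_lfe[:keep_bins] = x_middle[:keep_bins]
    return x_lfe
```

A brick-wall cutoff like this is the simplest option; a smoother spectral roll-off could be substituted without changing the structure.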
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S107 of performing a time-frequency inverse transform based on left channel spectral coefficients, right channel spectral coefficients, center channel spectral coefficients, left surround channel spectral coefficients, right surround channel spectral coefficients, and subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
In this embodiment, after the left channel, right channel, center channel, left surround channel, right surround channel, and subwoofer channel spectral coefficients are obtained, the time-frequency inverse transform LD-IMDCT (low-delay inverse modified discrete cosine transform) is carried out, i.e., the remaining decoding modules of the online inference process shown in fig. 2. Because the spectral coefficients of the newly generated channels are produced at the decoding end and never pass through the standard encoding process, no LTPF (long-term post-filtering) related operations are performed. The left surround and right surround channel signals mainly increase the sense of immersion and the center channel signal mainly carries dialogue, so skipping LTPF has little impact on the sound quality of the resulting multi-channel PCM audio signal, which is output to an audio device (e.g., a Bluetooth speaker). This process reuses the existing time-frequency transform and overlap-add, avoiding any increase in algorithm delay.
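A minimal sketch of a windowed MDCT/IMDCT with overlap-add is given below. It uses a plain sine (Princen-Bradley) window and textbook transform formulas rather than the exact LC3 low-delay window, so it only illustrates the time-domain aliasing cancellation that makes the reused transform-plus-overlap-add path lossless:

```python
import numpy as np

def mdct(frame, win):
    """Windowed MDCT of one frame of length 2N -> N coefficients."""
    two_n = len(frame); n = two_n // 2
    ns = np.arange(two_n); ks = np.arange(n)
    cos = np.cos(np.pi / n * (ns[:, None] + 0.5 + n / 2) * (ks[None, :] + 0.5))
    return (frame * win) @ cos

def imdct(coeffs, win):
    """Inverse MDCT of N coefficients -> windowed frame of length 2N."""
    n = len(coeffs); two_n = 2 * n
    ns = np.arange(two_n); ks = np.arange(n)
    cos = np.cos(np.pi / n * (ns[:, None] + 0.5 + n / 2) * (ks[None, :] + 0.5))
    return (2.0 / n) * (cos @ coeffs) * win

N = 64
win = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # Princen-Bradley
rng = np.random.default_rng(1)
x = rng.standard_normal(10 * N)

# Analysis: 50 %-overlapped frames; synthesis: IMDCT + overlap-add.
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):
    y[start:start + 2 * N] += imdct(mdct(x[start:start + 2 * N], win), win)

# Time-domain aliasing cancels: interior samples match the input.
print(np.max(np.abs(y[N:-N] - x[N:-N])))  # vanishingly small
```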
In the above two-channel to multi-channel upmixing method, the existing time-frequency transform and overlap-add techniques are reused, so no additional algorithm delay is introduced. By adopting deep learning, an original 5.1-channel sound source only needs to be transmitted to the Bluetooth receiving end at a two-channel bit rate, which saves over-the-air bit rate and reduces interference to the wireless environment; for an original two-channel sound source, upmixing from two channels to multiple channels is realized, enhancing the user's sense of immersion.
Fig. 3 is a schematic diagram of an embodiment of the two-channel to multi-channel upmixing device of the present application.
In one embodiment shown in fig. 3, the two-channel to multi-channel upmixing apparatus of the present application includes: a module 301 for acquiring a two-channel LC3 code stream at a Bluetooth receiving end and performing partial decoding on it to obtain left channel spectral coefficients and right channel spectral coefficients; a module 302 for adding the left channel spectral coefficients and the right channel spectral coefficients to obtain middle channel spectral coefficients; a module 303 for obtaining center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model; a module 304 for obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; a module 305 for obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; a module 306 for performing low-pass filtering on the middle channel spectral coefficients to obtain subwoofer channel spectral coefficients; and a module 307 for performing an inverse time-frequency transform based on the left channel, right channel, center channel, left surround channel, right surround channel, and subwoofer channel spectral coefficients to output a multi-channel PCM audio signal.
In one embodiment of the present application, performing partial decoding on the two-channel LC3 code stream comprises: decoding the LC3 audio bitstream until transform-domain noise shaping decoding is completed.
In a specific embodiment of the present application, obtaining the center channel spectral coefficients from the middle channel spectral coefficients using a first pre-trained neural network model comprises: taking the logarithm of the middle channel spectral coefficients to obtain the amplitude spectrum features of the middle channel signal; inputting the amplitude spectrum features of the middle channel signal into the first pre-trained neural network model to obtain the gain of the center channel; and multiplying the middle channel spectral coefficients by the gain of the center channel to obtain the center channel spectral coefficients.
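The three-step gain application just described (log-magnitude feature, model inference, multiplication) can be sketched as follows; the callable gain_model stands in for the first pre-trained network and the halving lambda below is a dummy placeholder, not the application's model:

```python
import numpy as np

def center_from_middle(mid_coeffs, gain_model, eps=1e-9):
    """Sketch: log-magnitude feature -> model -> per-bin gain ->
    center-channel spectral coefficients."""
    feat = np.log(np.abs(mid_coeffs) + eps)   # amplitude-spectrum feature
    gain = gain_model(feat)                   # g_nn(k), one gain per bin
    return mid_coeffs * gain                  # center spectral coefficients

# Usage with a dummy stand-in model that halves every bin:
mid = np.array([1.0, -2.0, 4.0])
center = center_from_middle(mid, lambda f: np.full_like(f, 0.5))
print(center)  # 0.5, -1.0, 2.0
```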
In one embodiment of the present application, obtaining left surround channel spectral coefficients from left channel spectral coefficients using a second pre-trained neural network model, comprises: taking logarithm of the left channel spectrum coefficient to obtain amplitude spectrum characteristics of the left channel signal; inputting the amplitude spectrum characteristics of the left channel signals into a second pre-training neural network model to obtain the gain of the left channel; the left channel spectral coefficient is multiplied by the gain of the left channel to obtain a left surround channel spectral coefficient.
In one embodiment of the present application, obtaining the right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model comprises: taking the logarithm of the right channel spectral coefficients to obtain the amplitude spectrum features of the right channel signal; inputting the amplitude spectrum features of the right channel signal into the third pre-trained neural network model to obtain the gain of the right channel; and multiplying the right channel spectral coefficients by the gain of the right channel to obtain the right surround channel spectral coefficients.
In a specific embodiment of the present application, the training process of the first, second and third pre-trained neural network models comprises: acquiring multi-channel surround sound for training, and down-mixing it to obtain two-channel stereo for training; performing feature extraction on the training two-channel stereo to obtain two-channel features, wherein the two-channel features comprise the amplitude spectrum features of the training middle channel signal, the training left channel signal and the training right channel signal; performing feature extraction on the training multi-channel surround sound to obtain multi-channel features, wherein the multi-channel features comprise the amplitude spectrum features of the training center channel signal, the training left surround channel signal and the training right surround channel signal; inputting the amplitude spectrum features of the training middle channel signal and of the training center channel signal into a neural network for training to obtain the first pre-trained neural network model; inputting the amplitude spectrum features of the training left channel signal and of the training left surround channel signal into a neural network for training to obtain the second pre-trained neural network model; and inputting the amplitude spectrum features of the training right channel signal and of the training right surround channel signal into a neural network for training to obtain the third pre-trained neural network model.
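The down-mix used to generate training stereo from the multi-channel surround source can be sketched as follows. The application does not specify the down-mix equations, so common ITU-R BS.775-style coefficients (alpha = 1/sqrt(2)) are assumed, and the LFE channel is left out of the down-mix as is typical practice:

```python
import numpy as np

def downmix_5_1_to_stereo(fl, fr, c, lfe, ls, rs, alpha=np.sqrt(0.5)):
    """Down-mix 5.1 surround (front L/R, center, LFE, surround L/R)
    to two-channel stereo for training-data generation."""
    left = fl + alpha * c + alpha * ls
    right = fr + alpha * c + alpha * rs
    return left, right

rng = np.random.default_rng(2)
fl, fr, c, lfe, ls, rs = rng.standard_normal((6, 4))  # toy 4-sample channels
left, right = downmix_5_1_to_stereo(fl, fr, c, lfe, ls, rs)
```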
In a specific embodiment of the present application, performing feature extraction on the training two-channel stereo to obtain two-channel features comprises: performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream; performing partial decoding on the training two-channel LC3 code stream to obtain training left channel spectral coefficients and training right channel spectral coefficients; obtaining training middle channel spectral coefficients from the training left channel spectral coefficients and the training right channel spectral coefficients; taking the logarithm of the training middle channel spectral coefficients to obtain the amplitude spectrum features of the training middle channel signal; taking the logarithm of the training left channel spectral coefficients to obtain the amplitude spectrum features of the training left channel signal; and taking the logarithm of the training right channel spectral coefficients to obtain the amplitude spectrum features of the training right channel signal.
In a specific embodiment of the present application, performing feature extraction on the training multi-channel surround sound to obtain multi-channel features comprises: performing LC3 encoding on the training multi-channel surround sound to obtain a training center channel code stream, a training left surround channel code stream and a training right surround channel code stream; performing partial decoding on these code streams to obtain training center channel spectral coefficients, training left surround channel spectral coefficients and training right surround channel spectral coefficients; taking the logarithm of the training center channel spectral coefficients to obtain the amplitude spectrum features of the training center channel signal; taking the logarithm of the training left surround channel spectral coefficients to obtain the amplitude spectrum features of the training left surround channel signal; and taking the logarithm of the training right surround channel spectral coefficients to obtain the amplitude spectrum features of the training right surround channel signal.
In a specific embodiment of the application, the two-channel features and the multi-channel features are each input into the neural network for training in groups of a predetermined number of consecutive frames.
In the above two-channel to multi-channel upmixing device, the existing time-frequency transform and overlap-add techniques are reused, so no additional algorithm delay is introduced. By adopting deep learning, an original 5.1-channel sound source only needs to be transmitted to the Bluetooth receiving end at a two-channel bit rate, which saves over-the-air bit rate and reduces interference to the wireless environment; for an original two-channel sound source, upmixing from two channels to multiple channels is realized, enhancing the user's sense of immersion.
In one embodiment of the application, a computer readable storage medium stores computer instructions operable to perform the two-channel to multi-channel upmixing method described in any of the embodiments. Wherein the storage medium may be directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the application, a computer device includes a processor and a memory storing computer instructions, wherein the processor operates the computer instructions to perform the two-channel to multi-channel upmixing method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing description is only illustrative of the present application and is not intended to limit the scope of the application, and all equivalent structural changes made by the present application and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the scope of the present application.

Claims (10)

1. A two-channel to multi-channel upmixing method, comprising:
at a Bluetooth receiving end, acquiring a two-channel LC3 code stream, and performing partial decoding on the two-channel LC3 code stream to obtain a left channel spectral coefficient and a right channel spectral coefficient;
adding the left channel spectral coefficient and the right channel spectral coefficient to obtain a middle channel spectral coefficient;
obtaining a center channel spectral coefficient from the middle channel spectral coefficient by using a first pre-trained neural network model;
obtaining a left surround channel spectral coefficient from the left channel spectral coefficient by using a second pre-trained neural network model;
obtaining a right surround channel spectral coefficient from the right channel spectral coefficient by using a third pre-trained neural network model;
performing low-pass filtering on the middle channel spectral coefficient to obtain a subwoofer channel spectral coefficient; and
performing an inverse time-frequency transform according to the left channel spectral coefficient, the right channel spectral coefficient, the center channel spectral coefficient, the left surround channel spectral coefficient, the right surround channel spectral coefficient and the subwoofer channel spectral coefficient, and outputting a multi-channel PCM audio signal.
2. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining the center channel spectral coefficient from the middle channel spectral coefficient using a first pre-trained neural network model comprises:
taking the logarithm of the middle channel spectral coefficient to obtain amplitude spectrum features of the middle channel signal;
inputting the amplitude spectrum features of the middle channel signal into the first pre-trained neural network model to obtain a gain of the center channel; and
multiplying the middle channel spectral coefficient by the gain of the center channel to obtain the center channel spectral coefficient.
3. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model comprises:
taking logarithm of the left channel spectrum coefficient to obtain amplitude spectrum characteristics of a left channel signal;
Inputting the amplitude spectrum characteristics of the left channel signal into the second pre-training neural network model to obtain the gain of the left channel;
and multiplying the left channel spectral coefficient by the gain of the left channel to obtain the left surround channel spectral coefficient.
4. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining the right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model comprises:
taking logarithm of the right channel spectrum coefficient to obtain amplitude spectrum characteristics of a right channel signal;
inputting the amplitude spectrum characteristics of the right channel signals into the third pre-training neural network model to obtain the gain of the right channel;
And multiplying the right channel spectrum coefficient with the gain of the right channel to obtain the right surround channel spectrum coefficient.
5. The two-channel to multi-channel upmixing method of claim 1, wherein the training process of the first pre-trained neural network model, the second pre-trained neural network model, and the third pre-trained neural network model comprises:
acquiring multi-channel surround sound for training, and down-mixing the training multi-channel surround sound to obtain two-channel stereo for training;
performing feature extraction on the training two-channel stereo to obtain two-channel features, wherein the two-channel features comprise amplitude spectrum features of a training middle channel signal, amplitude spectrum features of a training left channel signal and amplitude spectrum features of a training right channel signal;
performing feature extraction on the training multi-channel surround sound to obtain multi-channel features, wherein the multi-channel features comprise amplitude spectrum features of a training center channel signal, amplitude spectrum features of a training left surround channel signal and amplitude spectrum features of a training right surround channel signal;
inputting the amplitude spectrum features of the training middle channel signal and the amplitude spectrum features of the training center channel signal into a neural network for training to obtain the first pre-trained neural network model;
inputting the amplitude spectrum features of the training left channel signal and the amplitude spectrum features of the training left surround channel signal into a neural network for training to obtain the second pre-trained neural network model; and
inputting the amplitude spectrum features of the training right channel signal and the amplitude spectrum features of the training right surround channel signal into a neural network for training to obtain the third pre-trained neural network model.
6. The two-channel to multi-channel upmixing method of claim 5, wherein the performing feature extraction on the training two-channel stereo to obtain two-channel features comprises:
performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream;
performing partial decoding on the training two-channel LC3 code stream to obtain a training left channel spectral coefficient and a training right channel spectral coefficient;
Obtaining a training middle channel spectrum coefficient according to the training left channel spectrum coefficient and the training right channel spectrum coefficient;
taking the logarithm of the training middle channel spectral coefficient to obtain the amplitude spectrum features of the training middle channel signal;
taking the logarithm of the training left channel spectrum coefficient to obtain the amplitude spectrum characteristic of the training left channel signal;
and taking the logarithm of the training right channel spectrum coefficient to obtain the amplitude spectrum characteristic of the training right channel signal.
7. The two-channel to multi-channel upmixing method of claim 5, wherein the feature extraction of the training multi-channel surround sound to obtain multi-channel features comprises:
Performing LC3 coding on the training multichannel surround sound to obtain a training center channel code stream, a training left surround channel code stream and a training right surround channel code stream;
performing partial decoding on the training center channel code stream, the training left surround channel code stream and the training right surround channel code stream to obtain training center channel spectrum coefficients, training left surround channel spectrum coefficients and training right surround channel spectrum coefficients;
taking the logarithm of the training center channel spectral coefficient to obtain the amplitude spectrum features of the training center channel signal;
Taking logarithm of the training left surround channel spectrum coefficient to obtain amplitude spectrum characteristics of the training left surround channel signal;
And taking the logarithm of the training right surround channel spectrum coefficient to obtain the amplitude spectrum characteristic of the training right surround channel signal.
8. A two-channel to multi-channel upmixing apparatus comprising:
a module for acquiring a two-channel LC3 code stream at a Bluetooth receiving end, and performing partial decoding on the two-channel LC3 code stream to obtain a left channel spectral coefficient and a right channel spectral coefficient;
a module for adding the left channel spectral coefficient and the right channel spectral coefficient to obtain a middle channel spectral coefficient;
a module for obtaining a center channel spectral coefficient from the middle channel spectral coefficient by using a first pre-trained neural network model;
a module for obtaining a left surround channel spectral coefficient from the left channel spectral coefficient by using a second pre-trained neural network model;
a module for obtaining a right surround channel spectral coefficient from the right channel spectral coefficient by using a third pre-trained neural network model;
a module for performing low-pass filtering on the middle channel spectral coefficient to obtain a subwoofer channel spectral coefficient; and
a module for performing an inverse time-frequency transform according to the left channel spectral coefficient, the right channel spectral coefficient, the center channel spectral coefficient, the left surround channel spectral coefficient, the right surround channel spectral coefficient and the subwoofer channel spectral coefficient, and outputting a multi-channel PCM audio signal.
9. A computer readable storage medium storing computer instructions, wherein the computer instructions are operative to perform a two-channel to multi-channel upmixing method of any one of claims 1-7.
10. A computer device comprising a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform a two-channel to multi-channel upmixing method of any of claims 1-7.
Publications (1)

Publication Number Publication Date
CN118197325A true CN118197325A (en) 2024-06-14


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination