CN118197325A - Dual-channel to multi-channel upmixing method, device, storage medium and equipment - Google Patents
- Publication number
- CN118197325A (application number CN202410289229.5A)
- Authority
- CN
- China
- Prior art keywords
- channel
- training
- coefficient
- spectrum coefficient
- spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L—Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
- G10L17/04—Training, enrolment or model building (speaker identification or verification techniques)
- G10L19/0017—Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- H04W4/80—Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
Abstract
The application discloses a two-channel to multi-channel upmixing method, device, storage medium and equipment, belonging to the technical field of Bluetooth audio encoding and decoding. The method comprises: obtaining a two-channel LC3 code stream and partially decoding it to obtain left channel and right channel spectral coefficients; adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients; obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model; obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; low-pass filtering the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel and subwoofer channel spectral coefficients and outputting multi-channel audio. The application enhances the listener's sense of immersion.
Description
Technical Field
The application belongs to the technical field of Bluetooth audio encoding and decoding, and particularly relates to a two-channel to multi-channel upmixing method, device, storage medium and equipment.
Background
In the Bluetooth field, the LC3 codec is receiving increasing attention from manufacturers because of its low delay, high sound quality, high coding gain and royalty-free licensing.
Bluetooth speakers are now widespread, and beyond ordinary stereo playback users increasingly expect a stronger sense of immersion. 5.1-channel surround sound is a widely used audio format because it provides a better user experience. However, playing a 5.1-channel sound source over a Bluetooth speaker faces the following challenges:
(1) If the sound source is in 5.1-channel format, the data flow to the Bluetooth speaker is as follows: the Bluetooth transmitting end decodes the 5.1-channel source into 6 channels of PCM signals, applies audio coding (such as LC3 coding) to each channel, and transmits the coded audio signals; the Bluetooth receiving end receives the audio signals of the 6 channels, decodes them, and plays them through the Bluetooth speaker. The drawback is the high over-the-air bit rate: if each channel is compressed at the recommended rate of 124 kbps, the total over-the-air rate is 744 kbps. This is not a problem for LE Audio in principle, but such a high rate can cause audio stuttering and interfere with other nearby devices, which is a major challenge for the Bluetooth radio-frequency front end.
(2) Sound sources in 5.1-channel format are relatively scarce; the popular sound sources available online are mainly two-channel. The two-channel listening experience, however, is poor, and in particular the sense of immersion is insufficient.
Among existing two-channel to multi-channel upmixing techniques, PCA (Principal Component Analysis) is widely used, for example in Dolby surround decoders. However, because of the time-varying correlation and nonlinear relationships among multi-channel signals, the surround channels generated by PCA differ greatly from the original surround channels, so the spatial impression and immersion of the upmixed audio are insufficient.
Disclosure of Invention
In view of the above technical problems in the prior art, the application provides a two-channel to multi-channel upmixing method, device, storage medium and equipment, which use deep learning to realize two-channel to multi-channel upmixing at the Bluetooth receiving end, thereby enhancing the listener's sense of immersion and improving the user experience.
In order to achieve the above object, the first technical scheme adopted by the present application is to provide a two-channel to multi-channel upmixing method comprising: at a Bluetooth receiving end, acquiring a two-channel LC3 code stream and partially decoding the two-channel LC3 code stream to obtain left channel spectral coefficients and right channel spectral coefficients; adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients; obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model; obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; low-pass filtering the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel and subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
The second technical scheme adopted by the application is to provide a two-channel to multi-channel upmixing apparatus comprising: a module for acquiring a two-channel LC3 code stream at a Bluetooth receiving end and partially decoding the two-channel LC3 code stream to obtain left channel spectral coefficients and right channel spectral coefficients; a module for adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients; a module for obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model; a module for obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; a module for obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; a module for low-pass filtering the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and a module for performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel and subwoofer channel spectral coefficients and outputting a multi-channel PCM audio signal.
The third technical scheme adopted by the application is to provide a computer-readable storage medium storing computer instructions operable to perform the two-channel to multi-channel upmixing method of the first scheme.
The fourth technical scheme adopted by the application is to provide a computer device comprising a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform the two-channel to multi-channel upmixing method of the first scheme.
The technical scheme of the application has the following beneficial effects: it can be applied to both Bluetooth Low Energy and classic Bluetooth, and it reuses the existing time-frequency transform and overlap-add, so no additional algorithmic delay is introduced; with deep learning at the Bluetooth receiving end, an original 5.1-channel sound source only needs to be transmitted at a two-channel bit rate, which saves over-the-air bit rate and reduces interference to the wireless environment; and for an original two-channel sound source, two-channel to multi-channel upmixing is realized, enhancing the listener's sense of immersion.
Drawings
FIG. 1 is a flow diagram of one embodiment of a two-channel to multi-channel upmixing method of the present application;
FIG. 2 is a schematic diagram of one embodiment of the offline training and online inference process of the neural network model in the two-channel to multi-channel upmixing method of the present application;
Fig. 3 is a schematic diagram of an embodiment of the two-channel to multi-channel upmixing device of the present application.
Detailed Description
The preferred embodiments of the present application are described in detail below with reference to the accompanying drawings, so that the advantages and features of the present application can be more easily understood by those skilled in the art and the scope of protection of the present application is thereby clearly defined.
The invention is illustrated with 5.1 channels as the multi-channel example, although its principles can also be used with a larger or smaller number of channels, such as 7.1 channels.
A typical 5.1-channel layout includes a left channel (Left Channel), a center channel (Center Channel), a right channel (Right Channel), a left surround channel (Left Surround), a right surround channel (Right Surround) and a subwoofer channel (Low Frequency Effects, LFE).
The Bluetooth transmitting end generally refers to a device with a Bluetooth transmitting function, such as a Bluetooth transceiver, a mobile phone, a tablet computer, a notebook computer, and the like.
The Bluetooth receiving end generally refers to a Bluetooth speaker, or a device with a Bluetooth speaker, such as an in-vehicle Bluetooth playback system.
The invention is generally applied to a device with a Bluetooth receiving function. The overall idea of the invention is as follows:
(1) At the Bluetooth receiving end, an intermediate channel signal is first generated based on the two channel signals (the left channel signal and the right channel signal);
(2) Secondly, a neural network is used to:
a) generate a center channel signal based on the intermediate channel signal;
b) generate a left surround channel signal based on the left channel signal;
c) generate a right surround channel signal based on the right channel signal;
(3) Finally, the intermediate channel signal is low-pass filtered to generate a subwoofer channel signal, so that the conversion from two-channel stereo to 5.1-channel surround sound is realized at the Bluetooth receiving end.
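For illustration only, the following sketch summarizes this flow in the spectral domain. It assumes the left and right spectral coefficients have already been obtained by partial LC3 decoding; the gain arrays g_center, g_ls and g_rs (outputs of the three pre-trained models) and the cutoff index lfe_cut_bin are hypothetical names introduced here, not terms from the application.

```python
import numpy as np

def upmix_spectra(X_left, X_right, g_center, g_ls, g_rs, lfe_cut_bin):
    """Combine partially decoded L/R spectra into 5.1-channel spectra."""
    X_middle = X_left + X_right          # (1) intermediate channel
    X_center = X_middle * g_center       # (2a) center channel
    X_ls = X_left * g_ls                 # (2b) left surround channel
    X_rs = X_right * g_rs                # (2c) right surround channel
    X_lfe = X_middle.copy()              # (3) subwoofer channel:
    X_lfe[lfe_cut_bin:] = 0.0            #     crude spectral low-pass
    return X_left, X_right, X_center, X_ls, X_rs, X_lfe

# Example with random spectra of N_F = 400 coefficients and unity gains:
N_F = 400
rng = np.random.default_rng(0)
X_l, X_r = rng.standard_normal(N_F), rng.standard_normal(N_F)
g = np.ones(N_F)
channels = upmix_spectra(X_l, X_r, g, g, g, lfe_cut_bin=20)
```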
Fig. 1 is a flow chart of one embodiment of the two-channel to multi-channel upmixing method of the present application.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S101, at a bluetooth receiving end, acquiring a two-channel LC3 code stream, and performing partial decoding on the two-channel LC3 code stream to obtain a left channel spectral coefficient and a right channel spectral coefficient.
In this embodiment, at the Bluetooth receiving end, a two-channel LC3 code stream is acquired through the Bluetooth communication module, where the two-channel LC3 code stream includes a left channel code stream and a right channel code stream.
In a specific embodiment of the present application, performing partial decoding of the two-channel LC3 code stream comprises: decoding the LC3 audio bitstream until the transform-domain noise shaping decoding is completed.
In this particular embodiment, partial decoding of the two-channel LC3 code stream is carried out up to transform-domain noise shaping, resulting in the left and right channel spectral coefficients X_Left(k), X_Right(k), k = 0 … N_F − 1. The partially decoded left and right channel spectral coefficients are subsequently used to generate the center channel signal, the left surround channel signal and the right surround channel signal of the multi-channel output.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S102 of adding left channel spectral coefficients and right channel spectral coefficients to obtain intermediate channel spectral coefficients.
In this particular embodiment, the intermediate channel spectral coefficients X_Middle(k) are computed as:
X_Middle(k) = X_Left(k) + X_Right(k)
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S103 of obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model.
In a specific embodiment of the present application, obtaining the center channel spectral coefficients from the intermediate channel spectral coefficients using the first pre-trained neural network model includes: taking the logarithm of the intermediate channel spectral coefficients to obtain the amplitude spectrum feature of the intermediate channel signal; inputting the amplitude spectrum feature of the intermediate channel signal into the first pre-trained neural network model to obtain the gain of the intermediate channel; and multiplying the intermediate channel spectral coefficients by the gain of the intermediate channel to obtain the center channel spectral coefficients.
Specifically, first, the amplitude spectrum feature of the intermediate channel signal is extracted:
log|X_Middle(k)| = log|X_Left(k) + X_Right(k)|, k = 0 … N_F − 1
Secondly, the amplitude spectrum feature of the intermediate channel signal is input into the first pre-trained neural network model to obtain the gain corresponding to each valid frequency index of the intermediate channel, i.e. the gain of the intermediate channel: g_nn(k), k = 0 … N_F − 1.
Finally, the intermediate channel spectral coefficients are multiplied by the gain of the intermediate channel to obtain the center channel spectral coefficients X_Center(k):
X_Center(k) = X_Middle(k) * g_nn(k)
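A minimal sketch of this step, assuming `model` is any callable that maps the log-magnitude feature vector to one gain per frequency bin; the small epsilon guard against log(0) is an added safeguard, not part of the described method:

```python
import numpy as np

def derive_center_channel(X_middle, model, eps=1e-12):
    """Return X_Center(k) = X_Middle(k) * g_nn(k)."""
    feature = np.log(np.abs(X_middle) + eps)  # amplitude spectrum feature log|X_Middle(k)|
    g_nn = model(feature)                     # per-bin gain from the first pre-trained model
    return X_middle * g_nn
```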
in one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S104 of obtaining left surround channel spectral coefficients according to left channel spectral coefficients using a second pre-trained neural network model.
In one embodiment of the present application, obtaining the left surround channel spectral coefficients from the left channel spectral coefficients using the second pre-trained neural network model comprises: taking the logarithm of the left channel spectral coefficients to obtain the amplitude spectrum feature of the left channel signal; inputting the amplitude spectrum feature of the left channel signal into the second pre-trained neural network model to obtain the gain of the left channel; and multiplying the left channel spectral coefficients by the gain of the left channel to obtain the left surround channel spectral coefficients.
In this embodiment, the operation procedure for obtaining the left surround channel spectrum coefficient is similar to the operation procedure for obtaining the center channel spectrum coefficient, and will not be repeated here.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S105 of obtaining right surround channel spectral coefficients according to the right channel spectral coefficients using a third pre-trained neural network model.
In one embodiment of the present application, obtaining the right surround channel spectral coefficients from the right channel spectral coefficients using the third pre-trained neural network model comprises: taking the logarithm of the right channel spectral coefficients to obtain the amplitude spectrum feature of the right channel signal; inputting the amplitude spectrum feature of the right channel signal into the third pre-trained neural network model to obtain the gain of the right channel; and multiplying the right channel spectral coefficients by the gain of the right channel to obtain the right surround channel spectral coefficients.
In this embodiment, the operation procedure for obtaining the spectrum coefficients of the right surround channel is similar to the operation procedure for obtaining the spectrum coefficients of the center channel, and will not be described again here.
FIG. 2 is a schematic diagram of one embodiment of the offline training and online inference process of the neural network model in the two-channel to multi-channel upmixing method of the present application.
As shown in fig. 2, the offline training process is above the middle dashed line and the online inference process is below it. During offline training: first, 5.1-channel sound sources are acquired as training material; second, each 5.1-channel sound source is downmixed to obtain two-channel stereo; third, features are extracted from the 5.1-channel sound source and from its two-channel downmix to obtain 5.1-channel features and two-channel features, respectively; and finally, the 5.1-channel features and the two-channel features are input into a neural network model for training, yielding the pre-trained neural network models. During online inference: first, a two-channel sound source code stream is obtained through the Bluetooth communication module; second, features are extracted from the two-channel code stream to obtain two-channel features; third, the two-channel features are input into the pre-trained neural network models, and the spectral coefficients of each of the multi-channel signals are obtained from the two-channel spectral coefficients; finally, the remaining decoding modules are executed on the spectral coefficients of the various channels to output the multi-channel audio signal.
In a specific embodiment of the present application, the training process of the first, second and third pre-trained neural network models includes: acquiring multi-channel surround sound for training and downmixing it to obtain two-channel stereo for training; extracting features from the training two-channel stereo to obtain two-channel features, where the two-channel features include the amplitude spectrum feature of the training intermediate channel signal, the amplitude spectrum feature of the training left channel signal and the amplitude spectrum feature of the training right channel signal; extracting features from the training multi-channel surround sound to obtain multi-channel features, where the multi-channel features include the amplitude spectrum feature of the training center channel signal, the amplitude spectrum feature of the training left surround channel signal and the amplitude spectrum feature of the training right surround channel signal; inputting the amplitude spectrum feature of the training intermediate channel signal and the amplitude spectrum feature of the training center channel signal into a neural network for training, to obtain the first pre-trained neural network model; inputting the amplitude spectrum feature of the training left channel signal and the amplitude spectrum feature of the training left surround channel signal into a neural network for training, to obtain the second pre-trained neural network model; and inputting the amplitude spectrum feature of the training right channel signal and the amplitude spectrum feature of the training right surround channel signal into a neural network for training, to obtain the third pre-trained neural network model.
The training process of the first pre-trained neural network model (center channel) is illustrated below in conjunction with fig. 2.
The training process of the neural network model is shown above the middle dashed line in fig. 2, specifically:
firstly, a certain number of 5.1-channel sound sources are acquired as training material;
secondly, each 5.1-channel sound source is downmixed to two-channel stereo; this is a mature technique, and one typical method is briefly given below:
Left channel: x_left(n) = FL(n)*0.4142 + FR(n)*0.0000 + FC(n)*0.2929 + LFE(n)*0.0000 + BL(n)*0.2929 + BR(n)*0.0000
Right channel: x_right(n) = FL(n)*0.0000 + FR(n)*0.4142 + FC(n)*0.2929 + LFE(n)*0.0000 + BL(n)*0.0000 + BR(n)*0.2929
where FL, FR, FC, LFE, BL and BR correspond to the left channel, right channel, center channel, subwoofer channel, left surround channel and right surround channel of the 5.1-channel definition, respectively.
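For illustration, a direct implementation of the above downmix equations; the channel ordering FL, FR, FC, LFE, BL, BR of the input array is an assumption about how the 5.1-channel training material is laid out:

```python
import numpy as np

# Rows are (x_left, x_right); columns are (FL, FR, FC, LFE, BL, BR).
DOWNMIX = np.array([
    [0.4142, 0.0000, 0.2929, 0.0000, 0.2929, 0.0000],
    [0.0000, 0.4142, 0.2929, 0.0000, 0.0000, 0.2929],
])

def downmix_5_1_to_stereo(pcm_5_1):
    """pcm_5_1: array of shape (6, num_samples) -> stereo of shape (2, num_samples)."""
    return DOWNMIX @ pcm_5_1
```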
In a specific embodiment of the present application, extracting features from the training two-channel stereo to obtain two-channel features includes: performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream; partially decoding the training two-channel LC3 code stream to obtain training left channel spectral coefficients and training right channel spectral coefficients; obtaining training intermediate channel spectral coefficients from the training left channel spectral coefficients and the training right channel spectral coefficients; taking the logarithm of the training intermediate channel spectral coefficients to obtain the amplitude spectrum feature of the training intermediate channel signal; taking the logarithm of the training left channel spectral coefficients to obtain the amplitude spectrum feature of the training left channel signal; and taking the logarithm of the training right channel spectral coefficients to obtain the amplitude spectrum feature of the training right channel signal.
Then, feature extraction is performed on the two-channel stereo obtained by downmixing the 5.1-channel sound source, briefly as follows:
(1) First, LC3 encoding is performed on the downmixed two-channel stereo;
(2) The encoded result is then partially LC3-decoded up to transform-domain noise shaping to obtain the left channel spectral coefficients and the right channel spectral coefficients;
(3) Finally, the amplitude spectrum feature log|X_Middle(k)| of the intermediate channel signal is calculated in the same way as described above, so the details are not repeated here.
In a specific embodiment of the present application, extracting features from the training multi-channel surround sound to obtain multi-channel features includes: performing LC3 encoding on the training multi-channel surround sound to obtain a training center channel code stream, a training left surround channel code stream and a training right surround channel code stream; partially decoding the training center channel code stream, the training left surround channel code stream and the training right surround channel code stream to obtain training center channel spectral coefficients, training left surround channel spectral coefficients and training right surround channel spectral coefficients; taking the logarithm of the training center channel spectral coefficients to obtain the amplitude spectrum feature of the training center channel signal; taking the logarithm of the training left surround channel spectral coefficients to obtain the amplitude spectrum feature of the training left surround channel signal; and taking the logarithm of the training right surround channel spectral coefficients to obtain the amplitude spectrum feature of the training right surround channel signal.
Then, feature extraction is performed on the 5.1-channel sound source, briefly as follows:
(1) First, each of the 5.1 channels is LC3-encoded;
(2) The encoded result is then partially LC3-decoded up to transform-domain noise shaping to obtain the center channel spectral coefficients;
(3) Finally, the amplitude spectrum feature log|X_Center(k)| of the center channel signal is calculated.
Finally, the amplitude spectrum feature log|X_Middle(k)| of the intermediate channel signal and the amplitude spectrum feature log|X_Center(k)| of the center channel signal are input into a neural network for training. The weights and biases of the neural network are adjusted with a back-propagation algorithm, and training stops when the error falls below a preset value or a preset number of training iterations is reached, yielding the first pre-trained neural network model.
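A compact training loop consistent with this stopping rule might look as follows; the Adam optimizer, learning rate and thresholds are illustrative assumptions, not values from the application:

```python
import torch

def train_gain_model(model, loader, max_epochs=100, loss_threshold=1e-4, lr=1e-3):
    """Back-propagation training with the stopping rule described above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for epoch in range(max_epochs):                 # preset number of training iterations
        total = 0.0
        for features, ideal_gain in loader:         # (log-magnitude frames, target gains)
            opt.zero_grad()
            loss = mse(model(features), ideal_gain)
            loss.backward()                         # adjust weights and biases
            opt.step()
            total += loss.item()
        if total / len(loader) < loss_threshold:    # error below preset value
            break
    return model
```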
Specifically, the application does not restrict the choice of neural network; considering the correlation between successive audio frames, a recurrent neural network (RNN) or one of its variants such as LSTM or GRU is preferably selected.
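As an illustrative sketch only, a small GRU-based gain estimator could be defined as below; the hidden size, layer count and the sigmoid output (which bounds gains to [0, 1]) are assumptions, since the application does not specify a network topology:

```python
import torch
import torch.nn as nn

class GainEstimator(nn.Module):
    """GRU-based gain estimator: log-magnitude frames in, per-bin gains out."""
    def __init__(self, num_bins, hidden=256, num_layers=2):
        super().__init__()
        self.rnn = nn.GRU(num_bins, hidden, num_layers=num_layers, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, num_bins), nn.Sigmoid())

    def forward(self, log_mag):            # log_mag: (batch, frames, num_bins)
        h, _ = self.rnn(log_mag)
        return self.head(h[:, -1, :])      # gain g_nn(k) for the most recent frame
```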
In one embodiment of the application, the two-channel features and the multi-channel features are each input into the neural network as a predetermined number of consecutive frames for training.
In this specific embodiment, the input to the neural network is the amplitude spectrum of the signal and the output is the estimated gain g_nn(k). Considering the correlation between preceding and following audio frames, each input consists of the amplitude spectrum features of a certain number of consecutive frames. For example, for a 10 ms frame length this is typically 5 frames; for a 7.5 ms frame length it is typically 7 frames.
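A sketch of this frame-stacking step, assuming the per-frame log-magnitude features are stored row-wise in a NumPy array:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stack_frames(log_mag, context=5):
    """log_mag: (num_frames, num_bins) -> (num_frames - context + 1, context, num_bins)."""
    return sliding_window_view(log_mag, context, axis=0).transpose(0, 2, 1)

# e.g. 100 frames of 400 bins with a 5-frame context -> shape (96, 5, 400)
assert stack_frames(np.zeros((100, 400))).shape == (96, 5, 400)
```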
The learning target during neural network training is as follows: for every valid frequency bin, the ideal gain is calculated, and the loss function (i.e. the error) used during back propagation measures the deviation of the network output from this ideal gain.
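The exact formulas are not reproduced in this text; under the assumption that the ideal gain is the ratio of the reference channel magnitude to the input channel magnitude and that the error is a mean squared error over the valid bins, a sketch would be:

```python
import numpy as np

def ideal_gain(X_target, X_input, eps=1e-12):
    """Per-bin ideal gain: ratio of target magnitude to input magnitude."""
    return np.abs(X_target) / (np.abs(X_input) + eps)

def gain_loss(g_pred, g_ideal):
    """Mean squared error over all valid frequency bins."""
    return float(np.mean((g_pred - g_ideal) ** 2))
```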
Since the training processes of the second and third pre-trained neural network models are similar to that of the first pre-trained neural network model, they are not described in detail here.
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S106 of performing low-pass filtering on the intermediate channel spectral coefficients to obtain the subwoofer channel spectral coefficients.
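As an illustrative sketch, a spectral-domain low-pass can be approximated by keeping only the lowest-frequency coefficients; the 48 kHz sample rate and 120 Hz cutoff are assumptions (a common LFE crossover), not values from the application:

```python
import numpy as np

def subwoofer_spectrum(X_middle, sample_rate=48000, cutoff_hz=120.0):
    """Keep only the lowest-frequency coefficients of the intermediate channel."""
    num_bins = len(X_middle)                    # N_F coefficients per frame
    bin_width = (sample_rate / 2.0) / num_bins  # approximate Hz per coefficient
    cut = max(1, int(cutoff_hz / bin_width))
    X_lfe = np.zeros_like(X_middle)
    X_lfe[:cut] = X_middle[:cut]
    return X_lfe
```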
In one embodiment shown in fig. 1, the two-channel to multi-channel upmixing method of the present application includes a process S107 of performing a time-frequency inverse transform based on left channel spectral coefficients, right channel spectral coefficients, center channel spectral coefficients, left surround channel spectral coefficients, right surround channel spectral coefficients, and subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
In this embodiment, after the left channel, right channel, center channel, left surround channel, right surround channel and subwoofer channel spectral coefficients have been obtained, the inverse time-frequency transform, the LD-IMDCT (low-delay inverse modified discrete cosine transform), is then completed; that is, the remaining decoding modules of the online inference process shown in fig. 2 are executed. Because the spectral coefficients of the newly generated channels are produced at the decoding end and have not gone through the standard encoding process, no LTPF (long-term post-filtering) related operation is performed on them. Since the main purpose of the left and right surround channel signals is to increase immersion and the center channel signal mainly carries dialogue, omitting LTPF has little impact on sound quality. The resulting multi-channel PCM audio signal is output to the audio device (such as a Bluetooth speaker). This process reuses the existing time-frequency transform and overlap-add, so no additional algorithmic delay is introduced.
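For orientation only, the following sketch shows a plain IMDCT with sine-windowed 50% overlap-add; LC3 actually uses a low-delay MDCT with its own window, so this is a generic illustration, not the codec's transform:

```python
import numpy as np

def imdct(X):
    """Plain IMDCT: N spectral coefficients -> 2N time-domain samples."""
    N = len(X)
    n = np.arange(2 * N)
    k = np.arange(N)
    phase = (np.pi / N) * np.outer(n + 0.5 + N / 2.0, k + 0.5)
    return (2.0 / N) * np.cos(phase) @ X

def synthesize(frames):
    """Windowed 50% overlap-add of per-frame spectral coefficients -> PCM."""
    blocks = [imdct(X) for X in frames]
    L = len(blocks[0])                                 # 2N samples per block
    window = np.sin(np.pi * (np.arange(L) + 0.5) / L)  # sine synthesis window
    out = np.zeros((len(blocks) + 1) * (L // 2))
    for i, b in enumerate(blocks):
        start = i * (L // 2)
        out[start:start + L] += window * b
    return out
```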
In the two-channel to multi-channel upmixing method described above, the existing time-frequency transform and overlap-add are reused, so no additional algorithmic delay is introduced; by applying deep learning at the Bluetooth receiving end, an original 5.1-channel sound source only needs to be transmitted at a two-channel bit rate, saving over-the-air bit rate and reducing interference to the wireless environment; and for an original two-channel sound source, two-channel to multi-channel upmixing is realized, enhancing the listener's sense of immersion.
Fig. 3 is a schematic diagram of an embodiment of the two-channel to multi-channel upmixing device of the present application.
In one embodiment shown in fig. 3, the two-channel to multi-channel upmixing apparatus of the present application includes: a module 301 for acquiring a two-channel LC3 code stream at the Bluetooth receiving end and partially decoding it to obtain left channel spectral coefficients and right channel spectral coefficients; a module 302 for adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients; a module 303 for obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model; a module 304 for obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model; a module 305 for obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model; a module 306 for low-pass filtering the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and a module 307 for performing an inverse time-frequency transform on the left channel, right channel, center channel, left surround channel, right surround channel and subwoofer channel spectral coefficients and outputting a multi-channel PCM audio signal.
In a specific embodiment of the present application, performing partial decoding of the two-channel LC3 code stream comprises: decoding the LC3 audio bitstream until the transform-domain noise shaping decoding is completed.
In a specific embodiment of the present application, obtaining the center channel spectral coefficients from the intermediate channel spectral coefficients using the first pre-trained neural network model includes: taking the logarithm of the intermediate channel spectral coefficients to obtain the amplitude spectrum feature of the intermediate channel signal; inputting the amplitude spectrum feature of the intermediate channel signal into the first pre-trained neural network model to obtain the gain of the intermediate channel; and multiplying the intermediate channel spectral coefficients by the gain of the intermediate channel to obtain the center channel spectral coefficients.
In one embodiment of the present application, obtaining the left surround channel spectral coefficients from the left channel spectral coefficients using the second pre-trained neural network model comprises: taking the logarithm of the left channel spectral coefficients to obtain the amplitude spectrum feature of the left channel signal; inputting the amplitude spectrum feature of the left channel signal into the second pre-trained neural network model to obtain the gain of the left channel; and multiplying the left channel spectral coefficients by the gain of the left channel to obtain the left surround channel spectral coefficients.
In one embodiment of the present application, obtaining the right surround channel spectral coefficients from the right channel spectral coefficients using the third pre-trained neural network model comprises: taking the logarithm of the right channel spectral coefficients to obtain the amplitude spectrum feature of the right channel signal; inputting the amplitude spectrum feature of the right channel signal into the third pre-trained neural network model to obtain the gain of the right channel; and multiplying the right channel spectral coefficients by the gain of the right channel to obtain the right surround channel spectral coefficients.
In a specific embodiment of the present application, the training process of the first, second and third pre-trained neural network models includes: acquiring multi-channel surround sound for training and downmixing it to obtain two-channel stereo for training; extracting features from the training two-channel stereo to obtain two-channel features, where the two-channel features include the amplitude spectrum feature of the training intermediate channel signal, the amplitude spectrum feature of the training left channel signal and the amplitude spectrum feature of the training right channel signal; extracting features from the training multi-channel surround sound to obtain multi-channel features, where the multi-channel features include the amplitude spectrum feature of the training center channel signal, the amplitude spectrum feature of the training left surround channel signal and the amplitude spectrum feature of the training right surround channel signal; inputting the amplitude spectrum feature of the training intermediate channel signal and the amplitude spectrum feature of the training center channel signal into a neural network for training, to obtain the first pre-trained neural network model; inputting the amplitude spectrum feature of the training left channel signal and the amplitude spectrum feature of the training left surround channel signal into a neural network for training, to obtain the second pre-trained neural network model; and inputting the amplitude spectrum feature of the training right channel signal and the amplitude spectrum feature of the training right surround channel signal into a neural network for training, to obtain the third pre-trained neural network model.
In a specific embodiment of the present application, extracting features from the training two-channel stereo to obtain two-channel features includes: performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream; partially decoding the training two-channel LC3 code stream to obtain training left channel spectral coefficients and training right channel spectral coefficients; obtaining training intermediate channel spectral coefficients from the training left channel spectral coefficients and the training right channel spectral coefficients; taking the logarithm of the training intermediate channel spectral coefficients to obtain the amplitude spectrum feature of the training intermediate channel signal; taking the logarithm of the training left channel spectral coefficients to obtain the amplitude spectrum feature of the training left channel signal; and taking the logarithm of the training right channel spectral coefficients to obtain the amplitude spectrum feature of the training right channel signal.
In a specific embodiment of the present application, extracting features from the training multi-channel surround sound to obtain multi-channel features includes: performing LC3 encoding on the training multi-channel surround sound to obtain a training center channel code stream, a training left surround channel code stream and a training right surround channel code stream; partially decoding the training center channel code stream, the training left surround channel code stream and the training right surround channel code stream to obtain training center channel spectral coefficients, training left surround channel spectral coefficients and training right surround channel spectral coefficients; taking the logarithm of the training center channel spectral coefficients to obtain the amplitude spectrum feature of the training center channel signal; taking the logarithm of the training left surround channel spectral coefficients to obtain the amplitude spectrum feature of the training left surround channel signal; and taking the logarithm of the training right surround channel spectral coefficients to obtain the amplitude spectrum feature of the training right surround channel signal.
In one embodiment of the application, the two-channel features and the multi-channel features are each input into the neural network as a predetermined number of consecutive frames for training.
In the two-channel to multi-channel upmixing device described above, the existing time-frequency transform and overlap-add are reused, so no additional algorithmic delay is introduced; by applying deep learning at the Bluetooth receiving end, an original 5.1-channel sound source only needs to be transmitted at a two-channel bit rate, saving over-the-air bit rate and reducing interference to the wireless environment; and for an original two-channel sound source, two-channel to multi-channel upmixing is realized, enhancing the listener's sense of immersion.
In one embodiment of the application, a computer-readable storage medium stores computer instructions operable to perform the two-channel to multi-channel upmixing method described in any of the embodiments. The storage medium may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the application, a computer device includes a processor and a memory storing computer instructions, wherein the processor operates the computer instructions to perform the two-channel to multi-channel upmixing method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing description is only illustrative of the present application and is not intended to limit its scope; all equivalent structural changes made using the contents of the specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present application.
Claims (10)
1. A two-channel to multi-channel upmixing method, comprising:
at a Bluetooth receiving end, acquiring a two-channel LC3 code stream, and partially decoding the two-channel LC3 code stream to obtain left channel spectral coefficients and right channel spectral coefficients;
adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients;
obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model;
obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model;
obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model;
performing low-pass filtering on the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and
performing an inverse time-frequency transform on the left channel spectral coefficients, the right channel spectral coefficients, the center channel spectral coefficients, the left surround channel spectral coefficients, the right surround channel spectral coefficients and the subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
2. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model comprises:
taking the logarithm of the intermediate channel spectral coefficients to obtain an amplitude spectrum feature of the intermediate channel signal;
inputting the amplitude spectrum feature of the intermediate channel signal into the first pre-trained neural network model to obtain a gain of the intermediate channel; and
multiplying the intermediate channel spectral coefficients by the gain of the intermediate channel to obtain the center channel spectral coefficients.
3. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model comprises:
taking the logarithm of the left channel spectral coefficients to obtain an amplitude spectrum feature of the left channel signal;
inputting the amplitude spectrum feature of the left channel signal into the second pre-trained neural network model to obtain a gain of the left channel; and
multiplying the left channel spectral coefficients by the gain of the left channel to obtain the left surround channel spectral coefficients.
4. The two-channel to multi-channel upmixing method of claim 1, wherein the obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model comprises:
taking the logarithm of the right channel spectral coefficients to obtain an amplitude spectrum feature of the right channel signal;
inputting the amplitude spectrum feature of the right channel signal into the third pre-trained neural network model to obtain a gain of the right channel; and
multiplying the right channel spectral coefficients by the gain of the right channel to obtain the right surround channel spectral coefficients.
5. The two-channel to multi-channel upmixing method of claim 1, wherein the training process of the first pre-trained neural network model, the second pre-trained neural network model and the third pre-trained neural network model comprises:
acquiring multi-channel surround sound for training, and downmixing the multi-channel surround sound for training to obtain two-channel stereo for training;
extracting features from the training two-channel stereo to obtain two-channel features, wherein the two-channel features comprise an amplitude spectrum feature of a training intermediate channel signal, an amplitude spectrum feature of a training left channel signal and an amplitude spectrum feature of a training right channel signal;
extracting features from the training multi-channel surround sound to obtain multi-channel features, wherein the multi-channel features comprise an amplitude spectrum feature of a training center channel signal, an amplitude spectrum feature of a training left surround channel signal and an amplitude spectrum feature of a training right surround channel signal;
inputting the amplitude spectrum feature of the training intermediate channel signal and the amplitude spectrum feature of the training center channel signal into a neural network for training, to obtain the first pre-trained neural network model;
inputting the amplitude spectrum feature of the training left channel signal and the amplitude spectrum feature of the training left surround channel signal into a neural network for training, to obtain the second pre-trained neural network model; and
inputting the amplitude spectrum feature of the training right channel signal and the amplitude spectrum feature of the training right surround channel signal into a neural network for training, to obtain the third pre-trained neural network model.
6. The two-channel to multi-channel upmixing method of claim 5, wherein the extracting features from the training two-channel stereo to obtain two-channel features comprises:
performing LC3 encoding on the training two-channel stereo to obtain a training two-channel LC3 code stream;
partially decoding the training two-channel LC3 code stream to obtain training left channel spectral coefficients and training right channel spectral coefficients;
obtaining training intermediate channel spectral coefficients from the training left channel spectral coefficients and the training right channel spectral coefficients;
taking the logarithm of the training intermediate channel spectral coefficients to obtain the amplitude spectrum feature of the training intermediate channel signal;
taking the logarithm of the training left channel spectral coefficients to obtain the amplitude spectrum feature of the training left channel signal; and
taking the logarithm of the training right channel spectral coefficients to obtain the amplitude spectrum feature of the training right channel signal.
7. The two-channel to multi-channel upmixing method of claim 5, wherein the extracting features from the training multi-channel surround sound to obtain multi-channel features comprises:
performing LC3 encoding on the training multi-channel surround sound to obtain a training center channel code stream, a training left surround channel code stream and a training right surround channel code stream;
partially decoding the training center channel code stream, the training left surround channel code stream and the training right surround channel code stream to obtain training center channel spectral coefficients, training left surround channel spectral coefficients and training right surround channel spectral coefficients;
taking the logarithm of the training center channel spectral coefficients to obtain the amplitude spectrum feature of the training center channel signal;
taking the logarithm of the training left surround channel spectral coefficients to obtain the amplitude spectrum feature of the training left surround channel signal; and
taking the logarithm of the training right surround channel spectral coefficients to obtain the amplitude spectrum feature of the training right surround channel signal.
8. A two-channel to multi-channel upmixing apparatus, comprising:
a module for acquiring a two-channel LC3 code stream at a Bluetooth receiving end, and partially decoding the two-channel LC3 code stream to obtain left channel spectral coefficients and right channel spectral coefficients;
a module for adding the left channel spectral coefficients and the right channel spectral coefficients to obtain intermediate channel spectral coefficients;
a module for obtaining center channel spectral coefficients from the intermediate channel spectral coefficients using a first pre-trained neural network model;
a module for obtaining left surround channel spectral coefficients from the left channel spectral coefficients using a second pre-trained neural network model;
a module for obtaining right surround channel spectral coefficients from the right channel spectral coefficients using a third pre-trained neural network model;
a module for performing low-pass filtering on the intermediate channel spectral coefficients to obtain subwoofer channel spectral coefficients; and
a module for performing an inverse time-frequency transform on the left channel spectral coefficients, the right channel spectral coefficients, the center channel spectral coefficients, the left surround channel spectral coefficients, the right surround channel spectral coefficients and the subwoofer channel spectral coefficients, and outputting a multi-channel PCM audio signal.
9. A computer-readable storage medium storing computer instructions, wherein the computer instructions are operable to perform the two-channel to multi-channel upmixing method of any one of claims 1-7.
10. A computer device comprising a processor and a memory, the memory storing computer instructions, wherein the processor operates the computer instructions to perform the two-channel to multi-channel upmixing method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410289229.5A CN118197325A (en) | 2024-03-14 | 2024-03-14 | Dual-channel to multi-channel upmixing method, device, storage medium and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118197325A true CN118197325A (en) | 2024-06-14 |
Family
ID=91403545
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410289229.5A Pending CN118197325A (en) | 2024-03-14 | 2024-03-14 | Dual-channel to multi-channel upmixing method, device, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118197325A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |