TW202447609A

TW202447609A - Scene audio signal encoding method and electronic device

Info

Publication number: TW202447609A
Application number: TW113119344A
Authority: TW
Inventors: 劉帥; 高原; 李佳蔚; 夏丙寅; 王喆
Original assignee: 大陸商華為技術有限公司
Priority date: 2023-05-27
Filing date: 2024-05-24
Publication date: 2024-12-01
Also published as: WO2024245018A1; CN119049483A

Abstract

The present application discloses a scene audio decoding method and an electronic device. The decoding method includes: receiving and decoding a first bitstream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal; generating, based on the attribute information of the target virtual speaker and the first audio signal, a virtual speaker signal corresponding to the target virtual speaker; performing reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; determining an attenuation factor according to a frequency band index of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

Description

Scene audio decoding method and electronic device

Embodiments of the present invention relate to the field of audio decoding, and in particular to a scene audio decoding method and an electronic device.

Three-dimensional (3D) audio technology captures, processes, transmits, renders, and plays back real-world sound events and 3D sound field information by means of computers and signal processing. 3D audio gives sound a strong sense of space, envelopment, and immersion, providing listeners with a remarkable "being there" auditory experience. Among 3D audio technologies, Higher Order Ambisonics (HOA) is independent of the speaker layout in the recording, encoding, and playback stages, and HOA-format data can be rotated at playback. HOA therefore offers greater flexibility for 3D audio playback and has received increasingly wide attention and research.

An N-order HOA signal has (N+1)² channels. As the HOA order increases, the HOA signal carries more information for recording the sound scene in detail, but the amount of data also increases. The large amount of data makes transmission and storage difficult, so HOA signals need to be encoded and decoded. However, existing techniques reconstruct HOA signals with low accuracy.
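The growth in channel count with HOA order can be illustrated with a short sketch. The function name `hoa_channel_count` is chosen here for illustration and does not come from the patent text.

```python
def hoa_channel_count(order: int) -> int:
    """Return the number of channels of an HOA signal of the given order, (N+1)^2."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


# Print the channel count for the first few HOA orders.
for n in range(1, 5):
    print(f"order {n}: {hoa_channel_count(n)} channels")
```

A first-order signal has 4 channels and a third-order signal has 16, which is why higher orders quickly become expensive to transmit.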

The present invention provides a scene audio encoding and decoding method and an electronic device.

In a first aspect, an embodiment of the present invention provides a scene audio encoding method. The method includes: obtaining a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels and C1 is a positive integer; obtaining attribute information of a target virtual speaker corresponding to the scene audio signal; obtaining a high-order energy gain of the scene audio signal; encoding the high-order energy gain to obtain a high-order energy gain coding result; and encoding a first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the high-order energy gain coding result to obtain a first bitstream. The first audio signal consists of the audio signals of K channels in the scene audio signal, where K is a positive integer less than or equal to C1.

In one possible manner, the scene audio signal is an N1-order Higher Order Ambisonics (HOA) signal. The N1-order HOA signal includes a second audio signal, which is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 equals (N1+1)². Obtaining the high-order energy gain of the scene audio signal includes: obtaining the high-order energy gain according to feature information of the second audio signal and feature information of the first audio signal.

Exemplarily, the statement that the N1-order HOA signal includes the second audio signal may be understood to mean that the N1-order HOA signal includes only the second audio signal.

Exemplarily, the statement that the N1-order HOA signal includes the second audio signal may also be understood to mean that the N1-order HOA signal includes the second audio signal and other audio signals.

Exemplarily, the first audio signal may be called the low-order part of the scene audio signal, and the second audio signal may be called the high-order part of the scene audio signal. In other words, a portion of the low-order part and the high-order part of the scene audio signal may be encoded.

It should be understood that, compared with the case in which the N1-order HOA signal includes the second audio signal, when the N1-order HOA signal includes only the first audio signal, fewer channels of the N1-order HOA signal are encoded and the corresponding bit rate is lower.

In one possible manner, obtaining the high-order energy gain according to the feature information of the second audio signal and the feature information of the first audio signal includes: obtaining an energy gain of the first audio signal and an energy gain of the second audio signal; and obtaining the high-order energy gain according to these two energy gains.

In one possible manner, obtaining the high-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal includes obtaining the high-order energy gain Gain'(i, b) as follows:

Gain'(i, b) = 10 * log10(E(i, b) / E(1, b));

where log10 is the base-10 logarithm, * denotes multiplication, E(1, b) is the channel energy of the b-th frequency band of the first audio signal, E(i, b) is the energy of the i-th channel in the b-th frequency band of the second audio signal, i is the channel index of the second audio signal, and b is the frequency band index of the second audio signal.
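As a minimal sketch of this computation, assuming the gain is the ratio of E(i, b) to E(1, b) expressed in decibels. The function name and the small `eps` guard against division by zero are illustrative additions, not part of the patent text.

```python
import math


def high_order_energy_gain(e_first_b: float, e_second_ib: float,
                           eps: float = 1e-12) -> float:
    """Return Gain'(i, b) in dB, assuming the ratio E(i, b) / E(1, b).

    e_first_b   -- E(1, b), channel energy of band b of the first audio signal
    e_second_ib -- E(i, b), energy of channel i in band b of the second audio signal
    eps         -- hypothetical guard so silent bands do not raise an error
    """
    return 10.0 * math.log10((e_second_ib + eps) / (e_first_b + eps))


# A high-order channel with one tenth of the reference energy gives about -10 dB.
gain_db = high_order_energy_gain(e_first_b=1.0, e_second_ib=0.1)
print(round(gain_db, 3))
```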

In one possible manner, encoding the high-order energy gain to obtain the high-order energy gain coding result includes: quantizing the high-order energy gain to obtain a quantized high-order energy gain; and entropy coding the quantized high-order energy gain to obtain the high-order energy gain coding result.
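The quantization step can be sketched with a uniform scalar quantizer. The patent text does not specify the quantizer, so the step size of 1.5 dB and the index range below are assumptions made only for illustration. Entropy coding of the resulting index (for example with a Huffman or arithmetic coder) is not shown.

```python
def quantize_gain(gain_db: float, step_db: float = 1.5,
                  min_index: int = -32, max_index: int = 31) -> int:
    """Map a gain in dB to an integer index (assumed uniform quantizer)."""
    index = round(gain_db / step_db)
    # Clamp to the assumed index range so the index fits a fixed alphabet.
    return max(min_index, min(max_index, index))


def dequantize_gain(index: int, step_db: float = 1.5) -> float:
    """Recover the quantized gain in dB from its index (decoder side)."""
    return index * step_db


index = quantize_gain(-10.0)
print(index, dequantize_gain(index))
```

The decoder-side steps described later (entropy decoding followed by dequantization) mirror these two operations in reverse order.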

It should be noted that the position of the target virtual speaker matches the position of a sound source in the scene audio signal. A virtual speaker signal corresponding to the target virtual speaker can be generated from the attribute information of the target virtual speaker and the first audio signal in the scene audio signal, and the scene audio signal can be reconstructed from the virtual speaker signal and the high-order energy gain coding result. The encoder therefore encodes the first audio signal, the attribute information of the target virtual speaker, and the high-order energy gain coding result together and sends them to the decoder. The decoder can then reconstruct the scene audio signal from the decoded first reconstructed signal (that is, the reconstructed signal of the first audio signal in the scene audio signal), the attribute information of the target virtual speaker, and the high-order energy gain coding result.

Compared with other prior-art methods of reconstructing a scene audio signal, reconstruction based on the virtual speaker signal yields higher audio quality. Therefore, when K equals C1, the scene audio signal reconstructed by the present invention has higher audio quality at the same bit rate.

When K is less than C1, the present invention encodes fewer audio channels than the prior art, and the attribute information of the target virtual speaker occupies far less data than one channel of audio. Therefore, the present invention achieves the same quality at a lower bit rate.

In addition, the prior art converts the scene audio signal into a virtual speaker signal and a residual signal before encoding. The encoder of the present invention instead encodes the first audio signal in the scene audio signal directly, without computing the virtual speaker signal and the residual signal, so its encoding complexity is lower.

Exemplarily, the scene audio signal in the embodiments of the present invention may be a signal that describes a sound field. The scene audio signal may include an HOA signal (which may be a three-dimensional HOA signal or a two-dimensional HOA signal, also called a planar HOA signal) and a three-dimensional audio signal, where the three-dimensional audio signal refers to any audio signal in the scene audio signal other than the HOA signal.

In one possible manner, when N1 equals 1, K may equal C1; when N1 is greater than 1, K may be less than C1. It should be understood that when N1 equals 1, K may also be less than C1.

Exemplarily, encoding the first audio signal in the scene audio signal and the attribute information of the target virtual speaker may include operations such as downmixing, transformation, quantization, and entropy coding, which are not limited in the present invention.

Exemplarily, the first bitstream may include coded data of the first audio signal in the scene audio signal and coded data of the attribute information of the target virtual speaker.

In one possible manner, the target virtual speaker may be selected from a plurality of candidate virtual speakers based on the scene audio signal, and the attribute information of the target virtual speaker is then determined. Exemplarily, virtual speakers (both candidate and target virtual speakers) are virtual, not physical speakers.

Exemplarily, the plurality of candidate virtual speakers may be evenly distributed on a sphere, and there may be one or more target virtual speakers.

In one possible manner, a preset target virtual speaker may be obtained, and the attribute information of the target virtual speaker is then determined.

It should be understood that the present invention does not limit the manner of determining the target virtual speaker.

In a second aspect, an embodiment of the present invention provides a scene audio decoding method. The method includes: receiving a first bitstream; decoding the first bitstream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal consists of the audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1; generating, based on the attribute information of the target virtual speaker and the first audio signal, a virtual speaker signal corresponding to the target virtual speaker; performing reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, where the first reconstructed scene audio signal includes audio signals of C2 channels and C2 is a positive integer; determining an attenuation factor according to a frequency band index of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

In one possible manner, the scene audio signal is an N1-order Higher Order Ambisonics (HOA) signal, the N1-order HOA signal includes a second audio signal, the second audio signal is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 equals (N1+1)²; and/or the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal includes a third audio signal, the third audio signal is the reconstructed signal in the N2-order HOA signal corresponding to each channel of the second audio signal, and C2 equals (N2+1)².

In one possible manner, adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain the reconstructed scene audio signal includes: entropy decoding the high-order energy gain coding result to obtain an entropy-decoded high-order energy gain; dequantizing the entropy-decoded high-order energy gain to obtain the high-order energy gain; adjusting the high-order energy gain according to feature information of the second audio signal and feature information of the first audio signal to obtain an adjusted decoded high-order energy gain; and adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain and the attenuation factor to obtain an adjusted third audio signal, where the adjusted third audio signal belongs to the reconstructed scene audio signal. In this scheme, once the adjusted decoded high-order energy gain is obtained, the gain of the third audio signal of the current frame can be weighted so that it decays with the frequency band index of the third audio signal and/or the order of the N2-order HOA signal. The attenuation factor is first obtained from that frequency band index and/or order. For example, the attenuation factor may decay with both the frequency band and the Ambisonic order. The adjusted decoded high-order energy gain and the attenuation factor are then applied to the high-order channels of the third audio signal reconstructed for the current frame, which makes the high-order channel energy more uniform and smooth and improves the quality of the reconstructed audio signal.

In one possible manner, adjusting the high-order energy gain according to the feature information of the second audio signal and the feature information of the first audio signal includes: obtaining the high-order energy of the second audio signal according to the channel energy of the first audio signal and the high-order energy gain; obtaining a decoded energy scale factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal; obtaining a decoded high-order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal; and adjusting the decoded high-order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded high-order energy gain. In this scheme, the decoded energy scale factor is used to adjust the decoded high-order energy gain of the third audio signal. After this adjustment, the energy of the high-order channels is more uniform and smooth, and the reconstructed audio signal has better quality.
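The first two steps above can be sketched as follows. The patent text does not give the formulas, so this sketch makes two assumptions: the high-order energy is recovered by inverting a decibel ratio relative to E(1, b), and the scale factor is the square root of the target-to-actual energy ratio (an amplitude-domain factor). All function and parameter names are hypothetical.

```python
import math


def target_high_order_energy(e_first_b: float, gain_db: float) -> float:
    """Recover the high-order energy of the second audio signal.

    Assumes the gain was formed as 10 * log10(E(i, b) / E(1, b)),
    so the energy is E(1, b) * 10^(gain / 10).
    """
    return e_first_b * 10.0 ** (gain_db / 10.0)


def decoded_energy_scale_factor(e_third_ib: float, e_target_ib: float,
                                eps: float = 1e-12) -> float:
    """Assumed amplitude scale that moves the reconstructed channel energy
    E3(i, b) toward the target energy."""
    return math.sqrt((e_target_ib + eps) / (e_third_ib + eps))


e_target = target_high_order_energy(e_first_b=2.0, gain_db=-10.0)
scale = decoded_energy_scale_factor(e_third_ib=0.8, e_target_ib=e_target)
print(e_target, scale)
```

In this example the reconstructed channel carries four times the target energy, so the assumed amplitude scale is about one half.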

In one possible manner, determining the attenuation factor according to the frequency band index of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal includes: obtaining the attenuation factor according to the frequency band index of the third audio signal and/or the order of the N2-order HOA signal. In this scheme, the decoder may obtain the attenuation factor from the frequency band index of the third audio signal, or from the order of the N2-order HOA signal (specifically, the Ambisonic order), or from both. An attenuation factor obtained from both may be called a dual attenuation factor.

In one possible manner, after the third audio signal in the N2-order HOA signal is adjusted according to the high-order energy gain coding result and the attenuation factor, the method further includes: obtaining the channel energy of a fourth audio signal corresponding to the adjusted third audio signal, where the third audio signal includes the audio signal of the current frame and the fourth audio signal includes the audio signal of a frame preceding the current frame; and adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal. In this scheme, the decoder uses a preceding frame of the third audio signal to adjust the current frame's adjusted third audio signal once more, which improves the quality of the reconstructed audio signal.

In one possible manner, adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal includes: obtaining the average channel energy of the fourth audio signal and the channel energy of the adjusted third audio signal; obtaining an energy average threshold according to these two values; computing a weighted average of the average channel energy of the fourth audio signal and the channel energy of the adjusted third audio signal according to the energy average threshold to obtain a target energy; obtaining an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal; and adjusting the third audio signal according to the energy smoothing factor. In this scheme, adjusting the adjusted third audio signal again with the energy smoothing factor q(i, b) further improves the decoding quality of the third audio signal.
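The smoothing steps can be sketched as below. The patent text names the quantities but not their formulas, so every formula here is an assumption: the energy average threshold is taken as the ratio of the smaller to the larger of the two energies, the target energy is a weighted average using that threshold as the weight, and q(i, b) is the square root of the target-to-current energy ratio. The function name is hypothetical.

```python
import math
from typing import Sequence


def energy_smoothing_factor(prev_energies: Sequence[float], cur_energy: float,
                            eps: float = 1e-12) -> float:
    """Return an assumed smoothing factor q(i, b) for one channel and band.

    prev_energies -- channel energies of preceding frames (fourth audio signal)
    cur_energy    -- channel energy of the adjusted third audio signal
    """
    if not prev_energies:
        raise ValueError("at least one preceding frame energy is required")
    prev_mean = sum(prev_energies) / len(prev_energies)
    # Assumed threshold in [0, 1]: close to 1 when the energies are similar.
    k = min(prev_mean, cur_energy) / (max(prev_mean, cur_energy) + eps)
    # Weighted average: dissimilar energies pull the target toward the history.
    target = k * cur_energy + (1.0 - k) * prev_mean
    # Amplitude-domain factor that moves the current energy to the target.
    return math.sqrt((target + eps) / (cur_energy + eps))


q = energy_smoothing_factor(prev_energies=[1.0, 1.0], cur_energy=4.0)
print(q)
```

With these assumptions, a frame whose energy jumps well above the recent average is scaled down, while a frame that matches the average is left almost unchanged.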

In one possible manner, obtaining the attenuation factor according to the frequency band index of the third audio signal and/or the order of the N2-order HOA signal includes obtaining the attenuation factor g'(i, b) as follows:

[The formula for g'(i, b) and several of its symbols were not preserved in this text.]

Here, i is the index of the i-th channel of the third audio signal, b is the frequency band index of the third audio signal, M is the order of the N2-order HOA signal, and γ is the order corresponding to channel index i of the third audio signal. The formula also uses a symbol for the number of target virtual speakers, a symbol for the number of mapping channels of the N2-order HOA signal, and a multiplication operator.

In the above scheme, the attenuation factor can be calculated accurately from these parameters. Adjusting the parameters makes the attenuation factor vary with three factors: the number of speakers, the number of HOA signal mapping channels, and the HOA order. When this attenuation factor and the adjusted decoded high-order energy gain are used to adjust the third audio signal, the quality of the reconstructed audio signal improves.

In one possible manner, the method further includes: when b ≤ d, updating a parameter of the attenuation factor to one value; and when b > d, updating that parameter to another value, where d is a preset first threshold. [The parameter symbol and the two values were not preserved in this text.]

In one possible manner, the method further includes updating a parameter of the attenuation factor according to an expression involving bands, where bands is the number of frequency bands of the third audio signal. [The parameter symbol and the expression were not preserved in this text.]

In the above scheme, when the frequency band index b does not exceed the first threshold, the b-th band can be regarded as a low band and the attenuation coefficient is set to a smaller value, for example 0.375. When b exceeds the threshold, the b-th band can be regarded as a high band and the attenuation coefficient is set to a larger value, for example 0.5. The values 0.375 and 0.5 are only examples: 0.375 may be replaced by 0.38 or 0.37, and 0.5 may be replaced by 0.55 or 0.6. The value of the attenuation coefficient should be chosen for the application scenario and is not limited here. Because the coefficient can be adjusted flexibly according to the value of b, the attenuation becomes more pronounced in higher bands, so that the reconstructed audio signal better matches the characteristics of human hearing.
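The band-dependent choice of coefficient can be sketched as a simple threshold test. The example values 0.375 and 0.5 come from the text. The function name and the threshold value used in the usage lines are illustrative assumptions.

```python
def band_attenuation_coefficient(b: int, d: int,
                                 low: float = 0.375, high: float = 0.5) -> float:
    """Return the attenuation coefficient for band index b.

    Bands with b <= d are treated as low bands and get the smaller value;
    bands with b > d are treated as high bands and get the larger value.
    """
    return low if b <= d else high


# With a hypothetical threshold d = 4, bands 0..4 use 0.375 and higher bands use 0.5.
print([band_attenuation_coefficient(b, d=4) for b in range(8)])
```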

In one possible manner, the method further includes: when b ≤ d, updating a parameter of the attenuation factor using w, where d is the preset first threshold and w is a preset adjustment ratio threshold. [The parameter symbol and the update expression were not preserved in this text.]

In the above scheme, when b does not exceed the first threshold, the parameter can be adjusted flexibly according to the value of b, so that the attenuation becomes more pronounced in higher bands and the reconstructed audio signal better matches the characteristics of human hearing.

In one possible manner, the method further includes updating w to w2, where w2 = w + (a parameter whose symbol was not preserved in this text) × 0.05. In this scheme, w is updated according to that parameter, which ties the weight w in the attenuation factor to it: as the parameter increases, w also increases. The attenuation therefore becomes more pronounced in higher bands, and the reconstructed audio signal better matches the characteristics of human hearing.

In one possible manner, when i is 0, 1, 2, or 3, γ is 1; when i is 4, 5, 6, 7, or 8, γ is 2; and when i is 9, 10, 11, 12, 13, 14, or 15, γ is 3. The attenuation factor is thus determined by γ, which is a piecewise function of the HOA order of the channel: γ increases with the HOA order to which channel i belongs but never exceeds the maximum HOA order. Using this attenuation factor to adjust the third audio signal improves the quality of the reconstructed audio signal.
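The piecewise mapping from channel index i to γ can be written directly from the ranges listed above. The function name is chosen for illustration, and the error for indices above 15 reflects only that the text defines no value there.

```python
def gamma_for_channel(i: int) -> int:
    """Return γ for channel index i, following the ranges given in the text."""
    if 0 <= i <= 3:
        return 1
    if 4 <= i <= 8:
        return 2
    if 9 <= i <= 15:
        return 3
    raise ValueError("the text defines γ only for channel indices 0 to 15")


# γ for all 16 channels of a third-order HOA signal.
print([gamma_for_channel(i) for i in range(16)])
```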

In one possible manner, adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor includes: entropy decoding the high-order energy gain coding result to obtain an entropy-decoded high-order energy gain; dequantizing the entropy-decoded high-order energy gain to obtain the high-order energy gain; obtaining the high-order energy of the second audio signal according to the channel energy of the first audio signal and the high-order energy gain; obtaining a decoded energy scale factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal; determining a diffuseness factor of the first reconstructed scene audio signal according to the first bitstream; linearly weighting the attenuation factor according to the diffuseness factor to obtain a weighted attenuation factor; and adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoded energy scale factor to obtain an adjusted third audio signal. In this scheme, the diffuseness factor is used to linearly weight the attenuation factor, and the weighted attenuation factor is used to adjust the third audio signal. The weighting balances the proportions of the diffuse component and the directional component in the attenuation factor. Because the diffuseness factor measures the energy share of the non-directional component in the HOA signal to be encoded, it can be used to adjust the energy of each channel of the reconstructed HOA signal so that, after adjustment, the energy of the reconstructed HOA signal is closer to that of the HOA signal to be encoded.

In one possible manner, linearly weighting the attenuation factor according to the diffusion factor to obtain the weighted attenuation factor includes obtaining the weighted attenuation factor gd(i,b) in at least one of the following ways:

gd(i,b) = w * diffusion(b) + (1 - w) * (attenuation factor of the b-th frequency band); where w is a preset adjustment ratio threshold, * denotes multiplication, diffusion(b) denotes the diffusion factor of the b-th frequency band of the third audio signal, and the attenuation factor is that of the b-th frequency band of the third audio signal;

or,

when the attenuation factor is a full-band signal, gd(i,b) = w * mean(diffusion) + (1 - w) * (attenuation factor of the b-th frequency band); where mean(diffusion) is the average of the diffusion factors of multiple frequency bands of the third audio signal, the attenuation factor is that of the b-th frequency band of the third audio signal, and w is a preset adjustment ratio threshold;

or,

gd(i,b) = w * diffusion(b) + (1 - w) * (attenuation factor of the b-th frequency band) + offset(i,b); where offset(i,b) denotes the offset constant of the b-th frequency band on the i-th channel of the third audio signal, the attenuation factor is that of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and diffusion(b) denotes the diffusion factor of the b-th frequency band of the third audio signal;

or,

gd(i,b) = w * diffusion(b) + (1 - w) * (attenuation factor of the b-th frequency band) + direction(i,b); where direction(i,b) denotes the direction parameter of the b-th frequency band on the i-th channel of the third audio signal, the attenuation factor is that of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and diffusion(b) denotes the diffusion factor of the b-th frequency band of the third audio signal.
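The weighting variants above can be illustrated with a minimal Python sketch. The function names, the argument name `attenuation_b`, and the restriction of w to the range 0 to 1 are assumptions made for illustration only; the source says only that w is a preset value. The optional `extra` term stands in for either offset(i,b) or direction(i,b).

```python
from statistics import fmean


def weighted_attenuation_factor(w, diffusion_b, attenuation_b, extra=0.0):
    """Per-band weighting: w * diffusion(b) + (1 - w) * attenuation + extra.

    `extra` is 0.0 for the basic variant, or an offset(i,b) / direction(i,b)
    value for the variants that add such a term.
    """
    if not 0.0 <= w <= 1.0:  # assumed range, not stated in the source
        raise ValueError("w is assumed to lie in [0, 1]")
    return w * diffusion_b + (1.0 - w) * attenuation_b + extra


def full_band_weighted_attenuation_factor(w, diffusion_bands, attenuation):
    """Full-band variant: the per-band diffusion factors are averaged first."""
    if not 0.0 <= w <= 1.0:  # assumed range, not stated in the source
        raise ValueError("w is assumed to lie in [0, 1]")
    return w * fmean(diffusion_bands) + (1.0 - w) * attenuation


if __name__ == "__main__":
    print(weighted_attenuation_factor(0.5, 0.2, 0.8))
    print(full_band_weighted_attenuation_factor(0.5, [0.2, 0.4], 0.8))
```

With w close to 1 the result follows the diffusion factor; with w close to 0 it follows the unweighted attenuation factor.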

In one possible manner, adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy proportional factor to obtain the adjusted third audio signal includes:

The adjusted third audio signal X'(i,b) is obtained as follows:

X'(i,b) = X(i) * g(i,b) * gd(i,b);

where gd(i,b) denotes the weighted attenuation factor, g(i,b) denotes the decoding energy proportional factor, X(i) denotes the third audio signal, and * denotes multiplication.
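As a minimal sketch of this adjustment, the following Python function scales the coefficients of one channel in one frequency band by the product of the two factors. Representing the band as a list of floats and the name `adjust_band` are assumptions for illustration; the source does not specify the signal representation.

```python
def adjust_band(x, g, gd):
    """Return X'(i,b): each coefficient of the band scaled by g(i,b) * gd(i,b).

    x  : coefficients of the third audio signal for channel i in band b
         (hypothetical list-of-floats representation)
    g  : decoding energy proportional factor g(i,b)
    gd : weighted attenuation factor gd(i,b)
    """
    scale = g * gd
    return [scale * value for value in x]


if __name__ == "__main__":
    print(adjust_band([1.0, -2.0], 0.5, 0.5))
```

Both factors act as plain gains, so the order in which they are applied does not matter.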

In the above scheme, the diffusion factor measures the proportion of energy carried by the non-directional component of the HOA signal to be encoded, and the decoding energy proportional factor measures the energy proportion of the reconstructed HOA signal. Using both factors to adjust the energy of each channel of the reconstructed HOA signal brings the energy of the adjusted reconstructed HOA signal closer to that of the HOA signal to be encoded.

In one possible manner, obtaining the energy average threshold according to the channel energy average of the fourth audio signal and the channel energy of the adjusted third audio signal includes:

通過如下方式獲取能量平均閾值k： The energy average threshold k is obtained as follows:

其中，E_mean(i，b)表示所述第四音訊信號的通道能量平均值，E’_dec(i，b)表示所述調整後的第三音訊信號的能量。Wherein, E_mean(i, b) represents the channel energy average value of the fourth audio signal, and E'_dec(i, b) represents the energy of the adjusted third audio signal.

In one possible manner, obtaining the energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal includes:

The energy smoothing factor q(i,b) is obtained as follows:

q(i,b) = sqrt(E_target(i,b)) / sqrt(E’_dec(i,b));

where E_target(i,b) denotes the target energy and E’_dec(i,b) denotes the energy of the adjusted third audio signal.
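The formula for q(i,b) can be checked numerically with a short Python sketch. Computing energy as a sum of squared coefficients and returning 1.0 when the decoded energy is zero are assumptions added here for illustration; the source defines neither.

```python
import math


def band_energy(x):
    """Energy of a band as the sum of squared coefficients (assumed definition)."""
    return sum(value * value for value in x)


def energy_smoothing_factor(e_target, e_dec):
    """q(i,b) = sqrt(E_target(i,b)) / sqrt(E'_dec(i,b))."""
    if e_dec <= 0.0:
        # Assumed fallback: leave a silent band unscaled rather than divide by zero.
        return 1.0
    return math.sqrt(e_target) / math.sqrt(e_dec)


if __name__ == "__main__":
    adjusted = [3.0, 4.0]                        # energy 25
    q = energy_smoothing_factor(100.0, band_energy(adjusted))
    smoothed = [q * value for value in adjusted]
    print(q, band_energy(smoothed))
```

Because energy scales with the square of the amplitude, multiplying the band by q moves its energy from E’_dec to E_target.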

一種可能的方式中，所述根據所述解碼能量比例因數對所述第三音訊信號的解碼高階能量增益進行調整，以得到所述調整後的解碼高階能量增益，包括：In one possible manner, the step of adjusting the decoding high-order energy gain of the third audio signal according to the decoding energy proportional factor to obtain the adjusted decoding high-order energy gain includes:

通過如下方式獲取所述調整後的解碼高階能量增益Gain_dec’(i，b)：The adjusted decoding high-order energy gain Gain_dec’(i, b) is obtained by the following method:

Gain_dec’(i，b) = w*min(g(i，b),Gain_dec(i，b)) + (1 - w) * g(i，b)；Gain_dec’(i,b) = w*min(g(i,b),Gain_dec(i,b)) + (1 - w) * g(i,b);

where g(i,b) denotes the decoding energy proportional factor, Gain_dec(i,b) denotes the decoded high-order energy gain of the third audio signal, w is a preset adjustment ratio threshold, min denotes the minimum operation, and * denotes multiplication.
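The gain adjustment above can be sketched in a few lines of Python. The function name and the assumption that w lies between 0 and 1 are illustrative and not taken from the source.

```python
def adjusted_decoded_gain(g, gain_dec, w):
    """Gain_dec'(i,b) = w * min(g, Gain_dec) + (1 - w) * g."""
    if not 0.0 <= w <= 1.0:  # assumed range, not stated in the source
        raise ValueError("w is assumed to lie in [0, 1]")
    return w * min(g, gain_dec) + (1.0 - w) * g


if __name__ == "__main__":
    print(adjusted_decoded_gain(g=2.0, gain_dec=1.0, w=0.5))
    print(adjusted_decoded_gain(g=2.0, gain_dec=3.0, w=0.5))
```

Under that assumption, the result equals g(i,b) whenever Gain_dec(i,b) is at least g(i,b), and otherwise lies between Gain_dec(i,b) and g(i,b), so the adjusted gain never exceeds the decoding energy proportional factor.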

當K小於C1時，在對場景音訊信號編碼的過程中，本發明編碼的音訊信號的通道數，小於現有技術編碼的音訊信號的通道數，且目標虛擬揚聲器的屬性資訊的資料量，遠小於一個通道的音訊信號的資料量；因此在同等碼率的前提下，本發明解碼得到重建場景音訊信號的音訊品質更高。When K is less than C1, in the process of encoding the scene audio signal, the number of channels of the audio signal encoded by the present invention is less than the number of channels of the audio signal encoded by the prior art, and the amount of data of the attribute information of the target virtual speaker is much smaller than the amount of data of the audio signal of one channel; therefore, under the premise of the same bit rate, the audio quality of the reconstructed scene audio signal obtained by decoding of the present invention is higher.

Secondly, the virtual speaker signal and residual information that the prior art encodes and transmits are derived from the original audio signal (that is, the scene audio signal to be encoded) rather than being the original audio signal itself, which introduces errors. The present invention instead encodes part of the original audio signal (that is, the audio signals of K channels in the scene audio signal to be encoded), which avoids introducing such errors and therefore improves the audio quality of the reconstructed scene audio signal obtained by decoding. It also avoids fluctuations in the reconstruction quality of the reconstructed scene audio signal, giving high stability.

此外，由於現有技術編碼以及傳輸的是虛擬揚聲器信號，而虛擬揚聲器信號的資料量較大，因此現有技術選取的目標虛擬揚聲器的數量受到頻寬限制較大。本發明編碼以及傳輸的是虛擬揚聲器的屬性資訊，屬性資訊的資料量遠小於虛擬揚聲器信號的資料量；因此本發明選取的目標虛擬揚聲器的數量受到頻寬限制較小。而選取的目標虛擬揚聲器的數量越多，基於目標虛擬揚聲器的虛擬揚聲器信號，重建出的場景音訊信號的品質也就越高。因此，相對於現有技術而言，在同等碼率的情況下，本發明可以選取數量更多的目標虛擬揚聲器，這樣，本發明解碼得到重建場景音訊信號的品質也就更高。In addition, since the existing technology encodes and transmits virtual speaker signals, and the data volume of virtual speaker signals is relatively large, the number of target virtual speakers selected by the existing technology is subject to greater bandwidth restrictions. The present invention encodes and transmits attribute information of virtual speakers, and the data volume of attribute information is much smaller than the data volume of virtual speaker signals; therefore, the number of target virtual speakers selected by the present invention is subject to less bandwidth restrictions. The more target virtual speakers selected, the higher the quality of the scene audio signal reconstructed based on the virtual speaker signals of the target virtual speakers. Therefore, compared with the prior art, under the same bit rate, the present invention can select a larger number of target virtual speakers, so that the quality of the reconstructed scene audio signal decoded by the present invention is higher.

此外，綜合編碼端和解碼端，相對於現有技術的編碼端和解碼端而言，本發明的編碼端和解碼端無需進行殘差和疊加操作，因此本發明編碼端和解碼端的綜合複雜度，低於現有技術編碼端和解碼端的綜合複雜度。In addition, compared with the encoding end and decoding end of the prior art, the encoding end and decoding end of the present invention do not need to perform residual and superposition operations, so the comprehensive complexity of the encoding end and decoding end of the present invention is lower than the comprehensive complexity of the encoding end and decoding end of the prior art.

It should be understood that when the encoding end applies lossy compression to the first audio signal in the scene audio signal, the first reconstructed signal obtained by the decoding end differs from the first audio signal encoded by the encoding end. When the encoding end applies lossless compression to the first audio signal, the first reconstructed signal obtained by the decoding end is identical to the first audio signal encoded by the encoding end.

It should be understood that when the encoding end applies lossy compression to the attribute information of the target virtual speaker, the attribute information obtained by the decoding end differs from the attribute information encoded by the encoding end. When the encoding end applies lossless compression to the attribute information of the virtual speaker, the attribute information obtained by the decoding end is identical to the attribute information encoded by the encoding end. The present invention does not use different names for the attribute information encoded by the encoding end and the attribute information obtained by the decoding end.

第二方面以及第二方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第二方面以及第二方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The second aspect and any implementation of the second aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the second aspect and any implementation of the second aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

In a third aspect, an embodiment of the present invention provides a bitstream generation method, which can generate a bitstream according to the first aspect or any implementation of the first aspect.

The third aspect and any implementation of the third aspect correspond to the first aspect and any implementation of the first aspect, respectively. For the technical effects of the third aspect and any implementation of the third aspect, refer to the technical effects of the first aspect and any implementation of the first aspect, which are not repeated here.

第四方面，本發明實施例提供一種場景音訊編碼裝置，該裝置包括：In a fourth aspect, an embodiment of the present invention provides a scene audio coding device, the device comprising:

獲取模組，用於獲取待編碼的場景音訊信號，所述場景音訊信號包括C1個通道的音訊信號，C1為正整數；An acquisition module, used for acquiring a scene audio signal to be encoded, wherein the scene audio signal includes audio signals of C1 channels, where C1 is a positive integer;

The acquisition module is further used to acquire the attribute information of the target virtual speaker corresponding to the scene audio signal;

所述獲取模組，還用於獲取所述場景音訊信號的高階能量增益；The acquisition module is also used to obtain the high-order energy gain of the scene audio signal;

編碼模組，用於對所述高階能量增益進行編碼，以得到高階能量增益編碼結果；A coding module, used for coding the high-order energy gain to obtain a high-order energy gain coding result;

所述編碼模組，還用於編碼所述場景音訊信號中第一音訊信號、所述目標虛擬揚聲器的屬性資訊和所述高階能量增益編碼結果，以得到第一碼流；其中，所述第一音訊信號為所述場景音訊信號中K個通道的音訊信號，K為小於或等於C1的正整數。The encoding module is also used to encode the first audio signal in the scene audio signal, the attribute information of the target virtual speaker and the high-order energy gain encoding result to obtain a first code stream; wherein the first audio signal is the audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.

第四方面的場景音訊編碼裝置，可以執行第一方面以及第一方面的任意一種實現方式中的步驟，在此不再贅述。The scene audio coding device of the fourth aspect can execute the steps of the first aspect and any one of the implementations of the first aspect, which will not be described in detail herein.

第四方面以及第四方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第四方面以及第四方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The fourth aspect and any implementation of the fourth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the fourth aspect and any implementation of the fourth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

第五方面，本發明實施例提供一種場景音訊解碼裝置，該裝置包括：In a fifth aspect, an embodiment of the present invention provides a scene audio decoding device, the device comprising:

碼流接收模組，用於接收第一碼流；A code stream receiving module, used for receiving a first code stream;

A decoding module, used for decoding the first bitstream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is the audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;

虛擬揚聲器信號生成模組，用於基於所述目標虛擬揚聲器的屬性資訊和所述第一音訊信號，生成所述目標虛擬揚聲器對應的虛擬揚聲器信號；a virtual speaker signal generating module, configured to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;

A scene audio signal reconstruction module, used for performing reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer;

衰減因數確定模組，用於根據所述第一重建場景音訊信號中的重建信號的頻帶序號和/或所述第一重建場景音訊信號的階數確定衰減因數；an attenuation factor determination module, configured to determine an attenuation factor according to a frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal;

場景音訊信號調整模組，用於根據所述高階能量增益編碼結果和所述衰減因數對所述第一重建場景音訊信號進行調整，以得到重建後的場景音訊信號。The scene audio signal adjustment module is used to adjust the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

第五方面的場景音訊解碼裝置，可以執行第二方面以及第二方面的任意一種實現方式中的步驟，在此不再贅述。The scene audio decoding device of the fifth aspect can execute the steps of the second aspect and any one of the implementations of the second aspect, which will not be elaborated here.

第五方面以及第五方面的任意一種實現方式分別與第二方面以及第二方面的任意一種實現方式相對應。第五方面以及第五方面的任意一種實現方式所對應的技術效果可參見上述第二方面以及第二方面的任意一種實現方式所對應的技術效果，此處不再贅述。The fifth aspect and any implementation of the fifth aspect correspond to the second aspect and any implementation of the second aspect, respectively. The technical effects corresponding to the fifth aspect and any implementation of the fifth aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.

第六方面，本發明實施例提供一種電子設備，包括：記憶體和處理器，記憶體與處理器耦合；記憶體存儲有程式指令，當程式指令由處理器執行時，使得電子設備執行第一方面或第一方面的任意可能的實現方式中的場景音訊編碼方法。In a sixth aspect, an embodiment of the present invention provides an electronic device, comprising: a memory and a processor, the memory being coupled to the processor; the memory storing program instructions, and when the program instructions are executed by the processor, the electronic device executes the scene audio coding method in the first aspect or any possible implementation of the first aspect.

第六方面以及第六方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第六方面以及第六方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The sixth aspect and any implementation of the sixth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the sixth aspect and any implementation of the sixth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

第七方面，本發明實施例提供一種電子設備，包括：記憶體和處理器，記憶體與處理器耦合；記憶體存儲有程式指令，當程式指令由處理器執行時，使得電子設備執行第二方面或第二方面的任意可能的實現方式中的場景音訊解碼方法。In the seventh aspect, an embodiment of the present invention provides an electronic device, comprising: a memory and a processor, the memory being coupled to the processor; the memory storing program instructions, and when the program instructions are executed by the processor, the electronic device executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.

第七方面以及第七方面的任意一種實現方式分別與第二方面以及第二方面的任意一種實現方式相對應。第七方面以及第七方面的任意一種實現方式所對應的技術效果可參見上述第二方面以及第二方面的任意一種實現方式所對應的技術效果，此處不再贅述。The seventh aspect and any implementation of the seventh aspect correspond to the second aspect and any implementation of the second aspect, respectively. The technical effects corresponding to the seventh aspect and any implementation of the seventh aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.

第八方面，本發明實施例提供一種晶片，包括一個或多個介面電路和一個或多個處理器；介面電路用於從電子設備的記憶體接收信號，並向處理器發送信號，信號包括記憶體中存儲的電腦指令；當處理器執行電腦指令時，使得電子設備執行第一方面或第一方面的任意可能的實現方式中的場景音訊編碼方法。In an eighth aspect, an embodiment of the present invention provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is used to receive signals from a memory of an electronic device and send signals to the processor, the signals comprising computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the scene audio coding method of the first aspect or any possible implementation of the first aspect.

第八方面以及第八方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第八方面以及第八方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The eighth aspect and any implementation of the eighth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the eighth aspect and any implementation of the eighth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

第九方面，本發明實施例提供一種晶片，包括一個或多個介面電路和一個或多個處理器；介面電路用於從電子設備的記憶體接收信號，並向處理器發送信號，信號包括記憶體中存儲的電腦指令；當處理器執行電腦指令時，使得電子設備執行第二方面或第二方面的任意可能的實現方式中的場景音訊解碼方法。In the ninth aspect, an embodiment of the present invention provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.

第九方面以及第九方面的任意一種實現方式分別與第二方面以及第二方面的任意一種實現方式相對應。第九方面以及第九方面的任意一種實現方式所對應的技術效果可參見上述第二方面以及第二方面的任意一種實現方式所對應的技術效果，此處不再贅述。The ninth aspect and any implementation of the ninth aspect correspond to the second aspect and any implementation of the second aspect, respectively. The technical effects corresponding to the ninth aspect and any implementation of the ninth aspect can refer to the technical effects corresponding to the second aspect and any implementation of the second aspect, which will not be repeated here.

第十方面，本發明實施例提供一種電腦可讀存儲介質，電腦可讀存儲介質存儲有電腦程式，當電腦程式運行在電腦或處理器上時，使得電腦或處理器執行第一方面或第一方面的任意可能的實現方式中的場景音訊編碼方法。In a tenth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program. When the computer program runs on a computer or a processor, the computer or the processor executes the scene audio encoding method in the first aspect or any possible implementation of the first aspect.

第十方面以及第十方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第十方面以及第十方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The tenth aspect and any implementation of the tenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the tenth aspect and any implementation of the tenth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

第十一方面，本發明實施例提供一種電腦可讀存儲介質，電腦可讀存儲介質存儲有電腦程式，當電腦程式運行在電腦或處理器上時，使得電腦或處理器執行第二方面或第二方面的任意可能的實現方式中的場景音訊解碼方法。In the eleventh aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program. When the computer program runs on a computer or a processor, the computer or the processor executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.

第十一方面以及第十一方面的任意一種實現方式分別與第二方面以及第二方面的任意一種實現方式相對應。第十一方面以及第十一方面的任意一種實現方式所對應的技術效果可參見上述第二方面以及第二方面的任意一種實現方式所對應的技術效果，此處不再贅述。The eleventh aspect and any implementation of the eleventh aspect correspond to the second aspect and any implementation of the second aspect, respectively. The technical effects corresponding to the eleventh aspect and any implementation of the eleventh aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.

第十二方面，本發明實施例提供一種電腦程式產品，電腦程式產品包括軟體程式，當軟體程式被電腦或處理器執行時，使得電腦或處理器執行第一方面或第一方面的任意可能的實現方式中的場景音訊編碼方法。In a twelfth aspect, an embodiment of the present invention provides a computer program product, which includes a software program. When the software program is executed by a computer or a processor, the computer or the processor executes the scene audio coding method in the first aspect or any possible implementation of the first aspect.

第十二方面以及第十二方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第十二方面以及第十二方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The twelfth aspect and any implementation of the twelfth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the twelfth aspect and any implementation of the twelfth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.

第十三方面，本發明實施例提供一種電腦程式產品，電腦程式產品包括軟體程式，當軟體程式被電腦或處理器執行時，使得電腦或處理器執行第二方面或第二方面的任意可能的實現方式中的場景音訊解碼方法。In a thirteenth aspect, an embodiment of the present invention provides a computer program product, which includes a software program. When the software program is executed by a computer or a processor, the computer or the processor executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.

第十三方面以及第十三方面的任意一種實現方式分別與第二方面以及第二方面的任意一種實現方式相對應。第十三方面以及第十三方面的任意一種實現方式所對應的技術效果可參見上述第二方面以及第二方面的任意一種實現方式所對應的技術效果，此處不再贅述。The thirteenth aspect and any implementation of the thirteenth aspect correspond to the second aspect and any implementation of the second aspect, respectively. The technical effects corresponding to the thirteenth aspect and any implementation of the thirteenth aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.

第十四方面，本發明實施例提供一種存儲碼流的裝置，該裝置包括：接收器和至少一個存儲介質，接收器用於接收碼流；至少一個存儲介質用於存儲碼流；碼流是根據第一方面以及第一方面的任意一種實現方式生成的。In a fourteenth aspect, an embodiment of the present invention provides a device for storing a code stream, the device comprising: a receiver and at least one storage medium, the receiver is used to receive the code stream; at least one storage medium is used to store the code stream; the code stream is generated according to the first aspect and any one of the implementation methods of the first aspect.

第十四方面以及第十四方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第十四方面以及第十四方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The fourteenth aspect and any implementation of the fourteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the fourteenth aspect and any implementation of the fourteenth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.

第十五方面，本發明實施例提供一種傳輸碼流的裝置，該裝置包括：發送器和至少一個存儲介質，至少一個存儲介質用於存儲碼流，碼流是根據第一方面以及第一方面的任意一種實現方式生成的；發送器用於從存儲介質中獲取碼流並將碼流通過傳輸介質發送給端側設備。In a fifteenth aspect, an embodiment of the present invention provides a device for transmitting a code stream, the device comprising: a transmitter and at least one storage medium, the at least one storage medium is used to store the code stream, the code stream is generated according to the first aspect and any one of the implementation methods of the first aspect; the transmitter is used to obtain the code stream from the storage medium and send the code stream to the end device through the transmission medium.

第十五方面以及第十五方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第十五方面以及第十五方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The fifteenth aspect and any implementation of the fifteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the fifteenth aspect and any implementation of the fifteenth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

第十六方面，本發明實施例提供一種分發碼流的系統，該系統包括：至少一個存儲介質，用於存儲至少一個碼流，至少一個碼流是根據第一方面以及第一方面的任意一種實現方式生成的，流媒體設備，用於從至少一個存儲介質中獲取目的碼流，並將目的碼流發送給端側設備，其中，流媒體設備包括內容伺服器或內容分佈伺服器。In a sixteenth aspect, an embodiment of the present invention provides a system for distributing a bit stream, the system comprising: at least one storage medium for storing at least one bit stream, the at least one bit stream being generated according to the first aspect and any one of the implementation methods of the first aspect, a streaming media device for obtaining a target bit stream from the at least one storage medium and sending the target bit stream to an end device, wherein the streaming media device comprises a content server or a content distribution server.

第十六方面以及第十六方面的任意一種實現方式分別與第一方面以及第一方面的任意一種實現方式相對應。第十六方面以及第十六方面的任意一種實現方式所對應的技術效果可參見上述第一方面以及第一方面的任意一種實現方式所對應的技術效果，此處不再贅述。The sixteenth aspect and any implementation of the sixteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. The technical effects corresponding to the sixteenth aspect and any implementation of the sixteenth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

本文中術語“和/或”，僅僅是一種描述關聯物件的關聯關係，表示可以存在三種關係，例如，A和/或B，可以表示：單獨存在A，同時存在A和B，單獨存在B這三種情況。The term "and/or" in this article is only a description of the association relationship between related objects, indicating that three types of relationships may exist. For example, A and/or B can mean: A exists alone, A and B exist at the same time, and B exists alone.

本發明實施例的說明書和申請專利範圍中的術語“第一”和“第二”等是用於區別不同的物件，而不是用於描述物件的特定順序。例如，第一目標物件和第二目標物件等是用於區別不同的目標物件，而不是用於描述目標物件的特定順序。The terms "first" and "second" in the description and patent application of the embodiments of the present invention are used to distinguish different objects rather than to describe a specific order of objects. For example, a first target object and a second target object are used to distinguish different target objects rather than to describe a specific order of target objects.

在本發明實施例中，“示例性的”或者“例如”等詞用於表示作例子、例證或說明。本發明實施例中被描述為“示例性的”或者“例如”的任何實施例或設計方案不應被解釋為比其它實施例或設計方案更優選或更具優勢。確切而言，使用“示例性的”或者“例如”等詞旨在以具體方式呈現相關概念。In the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present invention should not be interpreted as being more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.

在本發明實施例的描述中，除非另有說明，“多個”的含義是指兩個或兩個以上。例如，多個處理單元是指兩個或兩個以上的處理單元；多個系統是指兩個或兩個以上的系統。In the description of the embodiments of the present invention, unless otherwise specified, the meaning of "plurality" refers to two or more. For example, multiple processing units refer to two or more processing units; multiple systems refer to two or more systems.

為了下述各實施例的描述清楚簡潔，首先給出相關技術的簡要介紹。In order to make the description of the following embodiments clear and concise, a brief introduction to the relevant technology is first given.

聲音（sound)是由物體振動產生的一種連續的波。產生振動而發出聲波的物體稱為聲源。聲波通過介質（如：空氣、固體或液體）傳播的過程中，人或動物的聽覺器官能感知到聲音。Sound is a continuous wave generated by the vibration of an object. The object that vibrates and emits sound waves is called the sound source. When sound waves propagate through a medium (such as air, solid or liquid), the hearing organs of humans or animals can perceive the sound.

The characteristics of a sound wave include pitch, intensity, and timbre. Pitch indicates how high or low a sound is. Intensity indicates how loud a sound is and may also be called loudness or volume; its unit is the decibel (dB). Timbre is also called tone quality.

聲波的頻率決定了音調的高低。頻率越高音調越高。物體在一秒鐘之內振動的次數稱為頻率，頻率單位是赫茲（hertz，Hz）。人耳能識別的聲音的頻率在20 Hz~20000 Hz之間。The frequency of sound waves determines the pitch of the sound. The higher the frequency, the higher the pitch. The number of times an object vibrates in one second is called frequency, and the unit of frequency is Hertz (Hz). The frequency of sound that the human ear can recognize is between 20 Hz and 20,000 Hz.

聲波的幅度決定了音強的強弱。幅度越大音強越大。距離聲源越近，音強越大。The amplitude of the sound wave determines the intensity of the sound. The greater the amplitude, the greater the intensity of the sound. The closer to the sound source, the greater the intensity of the sound.

聲波的波形決定了音色。聲波的波形包括方波、鋸齒波、正弦波和脈衝波等。The waveform of the sound wave determines the timbre. The waveform of the sound wave includes square wave, sawtooth wave, sine wave and pulse wave.

根據聲波的特徵，聲音可以分為規則聲音和無規則聲音。無規則聲音是指聲源無規則地振動發出的聲音。無規則聲音例如是影響人們工作、學習和休息等的雜訊。規則聲音是指聲源規則地振動發出的聲音。規則聲音包括語音和樂音。聲音用電表示時，規則聲音是一種在時頻域上連續變化的類比信號。該類比信號可以稱為音訊信號。音訊信號是一種攜帶語音、音樂和音效的資訊載體。According to the characteristics of sound waves, sound can be divided into regular sound and irregular sound. Irregular sound refers to the sound produced by the irregular vibration of the sound source. Irregular sound is, for example, noise that affects people's work, study and rest. Regular sound refers to the sound produced by the regular vibration of the sound source. Regular sound includes speech and music. When sound is represented electrically, regular sound is an analog signal that changes continuously in the time-frequency domain. This analog signal can be called an audio signal. An audio signal is an information carrier that carries speech, music and sound effects.

由於人的聽覺具有辨別空間中聲源的位置分佈的能力，則聽音者聽到空間中的聲音時，除了能感受到聲音的音調、音強和音色外，還能感受到聲音的方位。Since human hearing has the ability to distinguish the positional distribution of sound sources in space, when listeners hear sounds in space, in addition to being able to feel the pitch, intensity and timbre of the sounds, they can also feel the direction of the sounds.

As people pay increasing attention to the listening experience and demand ever higher quality, three-dimensional audio technology has emerged to enhance the sense of depth, presence, and space of sound. With it, the listener not only perceives sound from sources in front, behind, to the left, and to the right, but also feels surrounded by the spatial sound field ("sound field" for short) generated by these sources and feels the sound spreading in all directions, which creates an "immersive" sound effect as if the listener were in a cinema or concert hall.

本發明實施例涉及的場景音訊信號，可以是指用於描述聲場的信號；其中，場景音訊信號可以包括：HOA信號（其中，HOA信號可以包括三維HOA信號和二維HOA信號（也可以稱為平面HOA信號））和三維音訊信號；三維音訊信號可以是指場景音訊信號中除HOA信號之外的其他音訊信號。以下以HOA信號為例進行說明。The scene audio signal involved in the embodiments of the present invention may refer to a signal used to describe a sound field; wherein the scene audio signal may include: an HOA signal (wherein the HOA signal may include a three-dimensional HOA signal and a two-dimensional HOA signal (also referred to as a planar HOA signal)) and a three-dimensional audio signal; the three-dimensional audio signal may refer to other audio signals in the scene audio signal except the HOA signal. The following is an explanation using the HOA signal as an example.

It is well known that a sound wave propagates in an ideal medium with wave number $k = \omega / c$ and angular frequency $\omega = 2\pi f$, where $f$ is the frequency of the sound wave and $c$ is the speed of sound. The sound pressure $p$ satisfies formula (1), where $\nabla^2$ is the Laplace operator.

$$\nabla^2 p + k^2 p = 0 \tag{1}$$

Assume that the space outside the human ear is a sphere with the listener at its center. Sound arriving from outside the sphere has a projection on the sphere, and sound beyond the sphere is filtered out. Assume further that the sound sources are distributed on this sphere, and that the sound field generated by the sources on the sphere is used to fit the sound field generated by the original sources; in other words, three-dimensional audio technology is a method of fitting a sound field. Specifically, formula (1) is solved in the spherical coordinate system. In a source-free spherical region, the solution of formula (1) is formula (2).

$$p(r,\theta,\varphi,k) = s \sum_{m=0}^{\infty} (2m+1)\, j^{m}\, j_{m}^{kr}(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} Y_{m,n}^{\sigma}(\theta_s,\varphi_s)\, Y_{m,n}^{\sigma}(\theta,\varphi) \tag{2}$$

Here, $r$ is the radius of the sphere, $\theta$ is the horizontal angle information (also called azimuth information), $\varphi$ is the pitch angle information (also called elevation information), $k$ is the wave number, $s$ is the amplitude of an ideal plane wave, and $m$ is the order index of the HOA signal. $j^{m} j_{m}^{kr}(kr)$ denotes the spherical Bessel function, also called the radial basis function, in which the first $j$ is the imaginary unit; $(2m+1)\, j^{m} j_{m}^{kr}(kr)$ does not vary with angle. $Y_{m,n}^{\sigma}(\theta,\varphi)$ is the spherical harmonic function in the direction $\theta$, $\varphi$, and $Y_{m,n}^{\sigma}(\theta_s,\varphi_s)$ is the spherical harmonic function in the direction of the sound source. The HOA signal satisfies formula (3).

$$B_{m,n}^{\sigma} = s \cdot Y_{m,n}^{\sigma}(\theta_s,\varphi_s) \tag{3}$$

Substituting formula (3) into formula (2), formula (2) can be rewritten as formula (4).

$$p(r,\theta,\varphi,k) = \sum_{m=0}^{\infty} (2m+1)\, j^{m}\, j_{m}^{kr}(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} B_{m,n}^{\sigma}\, Y_{m,n}^{\sigma}(\theta,\varphi) \tag{4}$$

Here, $m$ is truncated at the N-th term, that is, $m = N$, and $B_{m,n}^{\sigma}$ serves as an approximate description of the sound field; $B_{m,n}^{\sigma}$ may then be called the HOA coefficients (which may be used to represent an N-order HOA signal). The sound field is the region of the medium in which sound waves are present. N is an integer greater than or equal to 1.

場景音訊信號是一種攜帶聲場中聲源的空間位置資訊的資訊載體，描述了空間中聽音者的聲場。公式(4)表明聲場可以在球面上按球諧函數展開，即聲場可以分解為多個平面波的疊加。因此，可以將HOA信號描述的聲場使用多個平面波的疊加來表達，並通過HOA係數重建聲場。The scene audio signal is an information carrier that carries the spatial position information of the sound source in the sound field, and describes the sound field of the listener in space. Formula (4) shows that the sound field can be expanded on the sphere according to the spherical harmonic function, that is, the sound field can be decomposed into the superposition of multiple plane waves. Therefore, the sound field described by the HOA signal can be expressed by the superposition of multiple plane waves, and the sound field can be reconstructed by the HOA coefficient.

The HOA signal to be encoded in the embodiments of the present invention may be an N1-order HOA signal, which may be represented by HOA coefficients or Ambisonic coefficients, where N1 is an integer greater than or equal to 1 (when N1 is equal to 1, the first-order HOA signal may be referred to as an FOA (First Order Ambisonic) signal). The N1-order HOA signal includes audio signals of $(N1+1)^2$ channels.

圖1a為示例性示出的應用場景示意圖。在圖1a示出的是場景音訊信號的編解碼場景。Fig. 1a is a schematic diagram of an exemplary application scenario. Fig. 1a shows the encoding and decoding scenario of a scene audio signal.

Referring to FIG. 1a, for example, the first electronic device may include a first audio acquisition module, a first scene audio encoding module, a first channel encoding module, a first channel decoding module, a first scene audio decoding module, and a first audio replay module. It should be understood that the first electronic device may include more or fewer modules than those shown in FIG. 1a, and the present invention is not limited thereto.

Referring to FIG. 1a, for example, the second electronic device may include a second audio acquisition module, a second scene audio encoding module, a second channel encoding module, a second channel decoding module, a second scene audio decoding module, and a second audio replay module. It should be understood that the second electronic device may include more or fewer modules than those shown in FIG. 1a, and the present invention is not limited thereto.

示例性的，第一電子設備編碼並傳輸場景音訊信號至第二電子設備，由第二電子設備解碼以及音訊重播的過程可以如下：第一音訊採集模組可以進行音訊採集，輸出場景音訊信號至第一場景音訊編碼模組。接著，第一場景音訊編碼模組可以對場景音訊信號進行編碼，輸出碼流至第一通道編碼模組。之後，第一通道編碼模組可以對碼流進行通道編碼，並將通道編碼後的碼流通過無線或有線網路通信設備傳輸到第二電子設備。然後，第二電子設備的第二通道解碼模組可以對接收到的資料進行通道解碼，以得到碼流並將碼流輸出至第二場景音訊解碼模組。接著，第二場景音訊解碼模組可以對該碼流進行解碼，以得到重建場景音訊信號；然後將該重建場景音訊信號輸出至第二音訊重播模組，由第二音訊重播模組進行音訊重播。Exemplarily, the process in which a first electronic device encodes and transmits a scene audio signal to a second electronic device, and the second electronic device decodes and replays the audio may be as follows: a first audio acquisition module may perform audio acquisition and output the scene audio signal to a first scene audio encoding module. Next, the first scene audio encoding module may encode the scene audio signal and output a code stream to a first channel encoding module. Thereafter, the first channel encoding module may channel-code the code stream and transmit the channel-coded code stream to a second electronic device via a wireless or wired network communication device. Then, a second channel decoding module of the second electronic device may channel-decode the received data to obtain a code stream and output the code stream to a second scene audio decoding module. Next, the second scene audio decoding module can decode the code stream to obtain a reconstructed scene audio signal; and then output the reconstructed scene audio signal to the second audio replay module, which performs audio replay.
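By way of illustration only, the flow described above (scene audio encoding, channel encoding, transmission, channel decoding, scene audio decoding) can be sketched as follows. All function names are hypothetical. The scene audio codec is a trivial placeholder, and the channel code is a simple repetition code chosen only to show where channel coding sits in the chain; neither is the codec or channel code of the embodiments.

```python
from collections import Counter


def scene_audio_encode(samples: list[int]) -> bytes:
    """Placeholder scene audio encoder (hypothetical): one byte per sample."""
    return bytes(samples)


def scene_audio_decode(bitstream: bytes) -> list[int]:
    """Placeholder scene audio decoder matching scene_audio_encode."""
    return list(bitstream)


def channel_encode(bitstream: bytes, repeat: int = 3) -> bytes:
    """Toy channel code (assumption): transmit every byte `repeat` times."""
    return bytes(value for byte in bitstream for value in [byte] * repeat)


def channel_decode(received: bytes, repeat: int = 3) -> bytes:
    """Majority vote over each group of `repeat` received bytes."""
    groups = (received[i:i + repeat] for i in range(0, len(received), repeat))
    return bytes(Counter(group).most_common(1)[0][0] for group in groups)


# Encoding end (first electronic device).
scene_signal = [12, 200, 34, 56]
sent = channel_encode(scene_audio_encode(scene_signal))

# Simulated transmission error: one byte is corrupted on the way.
corrupted = bytearray(sent)
corrupted[4] ^= 0xFF

# Decoding end (second electronic device).
reconstructed = scene_audio_decode(channel_decode(bytes(corrupted)))
```

In this toy run the single corrupted byte is outvoted by its two intact copies, so the decoding end recovers the original samples before audio replay.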

It should be noted that the second audio replay module may post-process the reconstructed scene audio signal (for example, audio rendering (for example, converting the reconstructed scene audio signal, which contains audio signals of multiple channels, into an audio signal whose number of channels equals the number of speakers in the second electronic device), loudness normalization, user interaction, audio format conversion, or noise removal), so as to convert the reconstructed scene audio signal into an audio signal suitable for playback through the speakers of the second electronic device.

應該理解的是，第二電子設備編碼並傳輸場景音訊信號至第一電子設備，由第一電子設備解碼以及音訊重播的過程，與上述第一電子設備傳輸場景音訊信號至第二電子設備，由第二電子設備進行音訊重播的過程類似，在此不再贅述。It should be understood that the process of the second electronic device encoding and transmitting the scene audio signal to the first electronic device, and the first electronic device decoding and audio replaying is similar to the above-mentioned process of the first electronic device transmitting the scene audio signal to the second electronic device, and the second electronic device replaying the audio, and will not be repeated here.

示例性的，第一電子設備和第二電子設備均可以包括但不限於：個人電腦、電腦工作站、智慧手機、平板電腦、伺服器、智慧攝像頭、智慧汽車或其他類型蜂窩電話、媒體消費設備、可穿戴設備、機上盒、遊戲機等。Exemplarily, the first electronic device and the second electronic device may include but are not limited to: personal computers, computer workstations, smart phones, tablet computers, servers, smart cameras, smart cars or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, game consoles, etc.

示例性的，本發明具體可以應用於VR（Virtual Reality，虛擬實境）/AR（Augmented Reality，增強現實）場景。一種可能的方式中，第一電子設備為伺服器，第二電子設備為VR/AR設備。一種可能的方式中，第二電子設備為伺服器，第一電子設備為VR/AR設備。Exemplarily, the present invention can be specifically applied to VR (Virtual Reality)/AR (Augmented Reality) scenes. In one possible manner, the first electronic device is a server, and the second electronic device is a VR/AR device. In one possible manner, the second electronic device is a server, and the first electronic device is a VR/AR device.

示例性的，第一場景音訊編碼模組和第二場景音訊編碼模組，可以是場景音訊編碼器。第一場景音訊解碼模組和第二場景音訊解碼模組，可以是場景音訊解碼器。Exemplarily, the first scene audio encoding module and the second scene audio encoding module may be scene audio encoders. The first scene audio decoding module and the second scene audio decoding module may be scene audio decoders.

示例性的，當由第一電子設備編碼場景音訊信號，第二電子設備重建場景音訊信號時，第一電子設備可以稱為編碼端，第二電子設備可以稱為解碼端。當由第二電子設備編碼場景音訊信號，第一電子設備重建場景音訊信號時，第二電子設備可以稱為編碼端，第一電子設備可以稱為解碼端。For example, when the first electronic device encodes the scene audio signal and the second electronic device reconstructs the scene audio signal, the first electronic device can be called the encoding end and the second electronic device can be called the decoding end. When the second electronic device encodes the scene audio signal and the first electronic device reconstructs the scene audio signal, the second electronic device can be called the encoding end and the first electronic device can be called the decoding end.

圖1b為示例性示出的應用場景示意圖。在圖1b示出的是場景音訊信號的轉碼場景。Fig. 1b is a schematic diagram of an exemplary application scenario. Fig. 1b shows a transcoding scenario of a scene audio signal.

Referring to FIG. 1b(1), for example, the wireless or core network device may include a channel decoding module, another audio decoding module, a scene audio encoding module, and a channel encoding module. The wireless or core network device may be used for audio transcoding.

示例性的，圖1b（1）的具體應用場景可以是：在第一電子設備未設有場景音訊編碼模組，僅設有其他音訊編碼模組；而第二電子設備僅設有場景音訊解碼模組，未設有其他音訊解碼模組的情況下，為了實現第二電子設備能夠解碼並重播第一電子設備採用其他音訊編碼模組編碼場景音訊信號，可以使用無線或核心網設備進行轉碼。Exemplarily, the specific application scenario of Figure 1b (1) may be: when the first electronic device is not provided with a scene audio encoding module but only with other audio encoding modules; and the second electronic device is only provided with a scene audio decoding module but not with other audio decoding modules, in order to enable the second electronic device to decode and replay the scene audio signal encoded by the first electronic device using other audio encoding modules, wireless or core network equipment may be used for transcoding.

Specifically, the first electronic device uses the other audio encoding module to encode the scene audio signal to obtain a first code stream, channel-encodes the first code stream, and sends it to the wireless or core network device. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the first code stream obtained by channel decoding to the other audio decoding module. The other audio decoding module decodes the first code stream to obtain the scene audio signal and outputs the scene audio signal to the scene audio encoding module. The scene audio encoding module then encodes the scene audio signal to obtain a second code stream and outputs the second code stream to the channel encoding module, which channel-encodes the second code stream and sends it to the second electronic device. In this way, the second electronic device can call its scene audio decoding module to decode the second code stream obtained by channel decoding and obtain a reconstructed scene audio signal, which can subsequently be replayed.

Referring to FIG. 1b(2), for example, the wireless or core network device may include a channel decoding module, a scene audio decoding module, another audio encoding module, and a channel encoding module. The wireless or core network device may be used for audio transcoding.

示例性的，圖1b（2）的具體應用場景可以是：在第一電子設備僅設有場景音訊編碼模組，未設有其他音訊編碼模組；而第二電子設備未設有場景音訊解碼模組，僅設有其他音訊解碼模組的情況下，為了實現第二電子設備能夠解碼並重播第一電子設備採用場景音訊編碼模組編碼場景音訊信號，可以使用無線或核心網設備進行轉碼。Exemplarily, the specific application scenario of Figure 1b (2) may be: when the first electronic device is only provided with a scene audio encoding module and no other audio encoding modules are provided; and the second electronic device is not provided with a scene audio decoding module and only provided with other audio decoding modules, in order to enable the second electronic device to decode and replay the scene audio signal encoded by the scene audio encoding module of the first electronic device, a wireless or core network device may be used for transcoding.

Specifically, the first electronic device uses the scene audio encoding module to encode the scene audio signal to obtain a first code stream, channel-encodes the first code stream, and sends it to the wireless or core network device. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the first code stream obtained by channel decoding to the scene audio decoding module. The scene audio decoding module decodes the first code stream to obtain the scene audio signal and outputs the scene audio signal to the other audio encoding module. The other audio encoding module then encodes the scene audio signal to obtain a second code stream and outputs the second code stream to the channel encoding module, which channel-encodes the second code stream and sends it to the second electronic device. In this way, the second electronic device can call its other audio decoding module to decode the second code stream obtained by channel decoding and obtain a reconstructed scene audio signal, which can subsequently be replayed.

以下對場景音訊信號的編解碼過程進行說明。The following describes the encoding and decoding process of scene audio signals.

圖2a為示例性示出的編碼過程示意圖。FIG. 2a is a schematic diagram showing an exemplary encoding process.

S201，獲取待編碼的場景音訊信號，場景音訊信號包括C1個通道的音訊信號，C1為正整數。S201, obtaining a scene audio signal to be encoded, wherein the scene audio signal includes audio signals of C1 channels, where C1 is a positive integer.

For example, when the scene audio signal is an HOA signal, the HOA signal may be an N1-order HOA signal, that is, the HOA coefficients of formula (3) above truncated at the N1-th term.

For example, the N1-order HOA signal may include audio signals of C1 channels, where $C1 = (N1+1)^2$. For example, when N1=3, the N1-order HOA signal includes audio signals of 16 channels; when N1=4, the N1-order HOA signal includes audio signals of 25 channels.
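The relationship between the HOA order and the channel count used in these examples, C1 = (N1+1)^2, can be checked with a short sketch. The function names are illustrative and do not come from the embodiments.

```python
import math


def hoa_channel_count(order: int) -> int:
    """Number of channels of an order-N HOA signal: (N + 1) ** 2."""
    if order < 1:
        raise ValueError("HOA order must be an integer >= 1")
    return (order + 1) ** 2


def hoa_order_from_channels(channels: int) -> int:
    """Inverse mapping: recover the order from a valid HOA channel count."""
    root = math.isqrt(channels)
    if root * root != channels or root < 2:
        raise ValueError("not a valid HOA channel count")
    return root - 1


third_order_channels = hoa_channel_count(3)   # N1 = 3
fourth_order_channels = hoa_channel_count(4)  # N1 = 4
```

The inverse helper is convenient when only the channel count of a signal is known and its order has to be inferred.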

S202，獲取場景音訊信號對應的目標虛擬揚聲器的屬性資訊。S202, obtaining property information of a target virtual speaker corresponding to the scene audio signal.

基於場景音訊信號，從多個候選虛擬揚聲器中選取目標虛擬揚聲器，獲取目標虛擬揚聲器的屬性資訊。Based on the scene audio signal, a target virtual speaker is selected from multiple candidate virtual speakers, and property information of the target virtual speaker is obtained.

S203，獲取場景音訊信號的高階能量增益。S203, obtaining a high-order energy gain of the scene audio signal.

示例性的，從待編碼的HOA信號獲取HOA信號的特徵資訊，通過HOA信號的特徵資訊獲取高階能量增益，高階能量增益可用於指示場景音訊信號的高階通道信號的能量增益。Exemplarily, characteristic information of the HOA signal is obtained from the HOA signal to be encoded, and a high-order energy gain is obtained through the characteristic information of the HOA signal. The high-order energy gain can be used to indicate the energy gain of the high-order channel signal of the scene audio signal.

示例性的，場景音訊信號包括C1個通道的音訊信號，第一音訊信號為場景音訊信號中K個通道的音訊信號，K為小於或等於C1的正整數，對於K的取值不做限定。Exemplarily, the scene audio signal includes audio signals of C1 channels, the first audio signal is audio signals of K channels in the scene audio signal, K is a positive integer less than or equal to C1, and the value of K is not limited.

場景音訊信號為N1階HOA信號，N1階HOA信號包括第一音訊信號和第二音訊信號，第二音訊信號為N1階HOA信號中除第一音訊信號之外的音訊信號，C1等於（N1+1）的平方。The scene audio signal is an N1-order HOA signal, the N1-order HOA signal includes a first audio signal and a second audio signal, the second audio signal is an audio signal in the N1-order HOA signal except the first audio signal, and C1 is equal to the square of (N1+1).

In a possible manner, assume that N1=3 and K=10. The N1-order HOA signal includes the 1st to 16th channel audio signals, the first audio signal is the 1st to 10th channel audio signals in the N1-order HOA signal, and the second audio signal is the 11th to 16th channel audio signals in the N1-order HOA signal.

For example, N1=3 and K=9. The N1-order HOA signal includes the 1st to 16th channel audio signals, the first audio signal is the 1st to 9th channel audio signals in the N1-order HOA signal, and the second audio signal is the 10th to 16th channel audio signals in the N1-order HOA signal.

For example, N1=3 and K=8. The N1-order HOA signal includes the 1st to 16th channel audio signals, the first audio signal is the 1st to 6th, 8th, and 9th channel audio signals in the N1-order HOA signal, and the second audio signal is the 7th and the 10th to 16th channel audio signals in the N1-order HOA signal.
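The three channel partitions above can be reproduced with a minimal sketch. It assumes that channels are identified by 1-based channel numbers and that the second audio signal is simply every channel not assigned to the first audio signal; the function name is hypothetical.

```python
def split_hoa_channels(num_channels, first_channel_numbers):
    """Return (first, second) lists of 1-based channel numbers.

    `first` holds the channels of the first audio signal; `second` holds
    every remaining channel of the HOA signal.
    """
    first = sorted(set(first_channel_numbers))
    if not first or first[0] < 1 or first[-1] > num_channels:
        raise ValueError("channel numbers must lie in 1..num_channels")
    selected = set(first)
    second = [c for c in range(1, num_channels + 1) if c not in selected]
    return first, second


# Third-order HOA signal: 16 channels in total.
first_a, second_a = split_hoa_channels(16, range(1, 11))                   # 10 channels
first_b, second_b = split_hoa_channels(16, range(1, 10))                   # 9 channels
first_c, second_c = split_hoa_channels(16, list(range(1, 7)) + [8, 9])     # 8 channels
```

The last call shows that the first audio signal need not be a contiguous run of channels.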

一種可能實現方式中，獲取場景音訊信號的高階能量增益，包括：In a possible implementation, obtaining a high-order energy gain of a scene audio signal includes:

根據第二音訊信號的特徵資訊和第一音訊信號的特徵資訊獲取高階能量增益。A high-order energy gain is obtained according to the feature information of the second audio signal and the feature information of the first audio signal.

其中，場景音訊信號包括第一音訊信號和第二音訊信號，分別獲取第二音訊信號的特徵資訊和第一音訊信號的特徵資訊，場景音訊信號所對應的特徵資訊包括但不限於：增益資訊和擴散資訊。根據第二音訊信號的特徵資訊和第一音訊信號的特徵資訊可以獲取場景音訊信號的高階能量增益。The scene audio signal includes a first audio signal and a second audio signal, and feature information of the second audio signal and feature information of the first audio signal are obtained respectively. Feature information corresponding to the scene audio signal includes but is not limited to: gain information and diffusion information. The high-order energy gain of the scene audio signal can be obtained according to the feature information of the second audio signal and the feature information of the first audio signal.

For example, the gain information Gain(i, b) of the second audio signal in the scene audio signal may be calculated according to the following formula:

Gain(i, b) = E(i, b) / E(1, b).

Here, i is the number of the i-th channel contained in the second audio signal in the scene audio signal (the number may also be called the channel number), b is the frequency band index of the second audio signal, E(i, b) is the energy of the i-th channel in the b-th frequency band of the second audio signal, and E(1, b) is the channel energy of the b-th frequency band of the first audio signal; for example, the channel of the first audio signal may specifically be the 1st channel of the N1-order HOA signal.

以下步驟可以在一幀信號內執行，也可以在子幀上執行。以下步驟可以在全頻帶執行，也可以在子帶上執行。The following steps can be performed in a frame signal or on a subframe. The following steps can be performed in the full band or on a subband.

示例性的，在計算得到Gain（i，b）之後，通過如下方式計算Gain’(i，b)：For example, after Gain(i, b) is calculated, Gain'(i, b) is calculated as follows:

Gain'(i, b) = 10*log10(Gain(i, b)).
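The gain computation can be sketched as follows. The sketch rests on assumptions that are not stated in the text: the band energy is taken as the sum of squared coefficients in the band, Gain(i, b) is taken as the ratio of E(i, b) to E(1, b), and a small `eps` guards against division by zero and the logarithm of zero. All names are illustrative.

```python
import math


def band_energy(coefficients):
    """Energy of one channel in one band (assumed: sum of squared coefficients)."""
    return sum(c * c for c in coefficients)


def high_order_energy_gain_db(second_band, reference_band, eps=1e-12):
    """Return Gain'(i, b) = 10 * log10(E(i, b) / E(1, b)) in dB.

    second_band    -- coefficients of channel i (second audio signal) in band b
    reference_band -- coefficients of channel 1 (first audio signal) in band b
    eps            -- numerical guard; an assumption of this sketch
    """
    e_i = band_energy(second_band)
    e_1 = band_energy(reference_band)
    gain = e_i / max(e_1, eps)
    return 10.0 * math.log10(max(gain, eps))


# A high-order channel with four times the energy of the reference channel.
gain_db = high_order_energy_gain_db([2.0, 0.0], [1.0, 0.0])
```

As noted above, the same function could be applied per frame or per subframe, and on the full band or per subband, by changing which coefficients are passed in.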

S204，對高階能量增益進行編碼，以得到高階能量增益編碼結果。S204, encoding the high-order energy gain to obtain a high-order energy gain encoding result.

編碼端獲取到場景音訊信號的高階能量增益之後，可以對該高階能量增益進行編碼，生成高階能量增益編碼結果。高階能量增益的作用是在解碼端調節高階通道能量，使HOA信號編解碼品質更高。After the encoder obtains the high-order energy gain of the scene audio signal, it can encode the high-order energy gain to generate a high-order energy gain encoding result. The role of the high-order energy gain is to adjust the high-order channel energy at the decoder to make the HOA signal encoding and decoding quality higher.

S205: encode the first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the high-order energy gain encoding result to obtain a first bitstream, where the first audio signal consists of the audio signals of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.

For example, a virtual speaker is a simulated speaker rather than a physically existing one.

示例性的，基於上述可知，場景音訊信號可以使用多個平面波的疊加來表達，進而可以確定用於來類比場景音訊信號中聲源的目標虛擬揚聲器；這樣，後續在解碼過程中，採用目標虛擬揚聲器對應的虛擬揚聲器信號，來重建該場景音訊信號。For example, based on the above, the scene audio signal can be expressed by superposition of multiple plane waves, and then the target virtual speaker used to simulate the sound source in the scene audio signal can be determined; in this way, in the subsequent decoding process, the virtual speaker signal corresponding to the target virtual speaker is used to reconstruct the scene audio signal.

一種可能的方式中，可以在球面上設置位置不同的多個候選虛擬揚聲器；接著，可以從這多個候選虛擬揚聲器中，選取位置與場景音訊信號中聲源位置相匹配的目標虛擬揚聲器。In one possible approach, a plurality of candidate virtual speakers at different positions may be arranged on a sphere; then, a target virtual speaker whose position matches the position of a sound source in a scene audio signal may be selected from the plurality of candidate virtual speakers.

圖2b為示例性示出的候選虛擬揚聲器分佈示意圖。在圖2b中，多個候選虛擬揚聲器可以均勻的分佈在球面上，球面上一個點，代表一個候選虛擬揚聲器。Fig. 2b is a schematic diagram of candidate virtual speaker distribution. In Fig. 2b, multiple candidate virtual speakers can be evenly distributed on a sphere, and a point on the sphere represents a candidate virtual speaker.
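One common way to place points approximately uniformly on a sphere is a Fibonacci lattice. The sketch below uses it purely as an assumption to illustrate such a candidate layout; the text does not state how the candidate positions of FIG. 2b are generated, and the function name is hypothetical.

```python
import math


def fibonacci_sphere(num_points):
    """Return (azimuth, elevation) pairs in radians, roughly uniform on a sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(num_points):
        # Heights are spread evenly over (-1, 1) so each point covers equal area.
        z = 1.0 - (2.0 * i + 1.0) / num_points
        elevation = math.asin(z)
        azimuth = (i * golden_angle) % (2.0 * math.pi)
        points.append((azimuth, elevation))
    return points


candidates = fibonacci_sphere(64)  # 64 candidate virtual speaker positions
```

Each returned pair can serve as the position of one candidate virtual speaker, corresponding to one point on the sphere.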

需要說明的是，本發明對候選虛擬揚聲器的數量以及分佈不作限制，可以按照需求設置，具體在後續進行說明。It should be noted that the present invention does not limit the number and distribution of candidate virtual speakers, which can be set according to needs, as will be described in detail later.

示例性的，可以基於場景音訊信號，從這多個候選虛擬揚聲器中，選取位置與場景音訊信號中聲源位置對應的目標虛擬揚聲器；其中，目標虛擬揚聲器的數量可以是一個，也可以是多個，本發明對此不作限制。Exemplarily, based on the scene audio signal, a target virtual speaker whose position corresponds to the sound source position in the scene audio signal can be selected from the multiple candidate virtual speakers; wherein the number of target virtual speakers can be one or more, and the present invention is not limited thereto.
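A minimal sketch of one possible selection criterion follows. It is an assumption for illustration, not the selection method of the embodiments: the HOA signal is restricted to first order (channel order W, Y, Z, X with SN3D normalization), each candidate is scored by the energy of the projection of the HOA frames onto its spherical harmonic vector, and the best-scoring candidate is chosen. All names are hypothetical.

```python
import math


def sh_first_order(azimuth, elevation):
    """First-order real spherical harmonics (ACN order W, Y, Z, X; SN3D)."""
    return [
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    ]


def select_target_speaker(hoa_frames, candidates):
    """Return the index of the candidate with the largest projection energy.

    hoa_frames -- list of frames, each a list of 4 first-order HOA samples
    candidates -- list of (azimuth, elevation) pairs in radians
    """
    best_index, best_energy = -1, -1.0
    for index, (azimuth, elevation) in enumerate(candidates):
        y = sh_first_order(azimuth, elevation)
        energy = sum(
            sum(yc * bc for yc, bc in zip(y, frame)) ** 2 for frame in hoa_frames
        )
        if energy > best_energy:
            best_index, best_energy = index, energy
    return best_index


# Four candidates: front, left, back, and top.
candidates = [(0.0, 0.0), (math.pi / 2, 0.0), (math.pi, 0.0), (0.0, math.pi / 2)]

# A single plane-wave source arriving from the left.
source_sh = sh_first_order(math.pi / 2, 0.0)
hoa_frames = [[c * s for c in source_sh] for s in (0.5, -1.0, 0.25)]

target_index = select_target_speaker(hoa_frames, candidates)
```

To select several target virtual speakers, the same scoring could be repeated after removing the contribution of each selected speaker, although that extension is likewise an assumption.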

一種可能的方式中，可以預先設定目標虛擬揚聲器。In one possible approach, a target virtual speaker may be pre-set.

示例性的，一種可能的方式中，在解碼過程中，可以根據虛擬揚聲器信號來重建場景音訊信號；但是直接傳輸目標虛擬揚聲器的虛擬揚聲器信號，會增加碼率；而目標虛擬揚聲器的虛擬揚聲器信號可以基於目標虛擬揚聲器的屬性資訊和部分或全部通道的場景音訊信號來生成；因此可以獲取目標虛擬揚聲器的屬性資訊，以及獲取場景音訊信號中的K個通道的音訊信號，作為第一音訊信號；然後對第一音訊信號、目標虛擬揚聲器的屬性資訊和高階能量增益編碼結果進行編碼，以得到第一碼流。Exemplarily, in one possible manner, during the decoding process, the scene audio signal can be reconstructed based on the virtual speaker signal; however, directly transmitting the virtual speaker signal of the target virtual speaker will increase the bit rate; and the virtual speaker signal of the target virtual speaker can be generated based on the attribute information of the target virtual speaker and the scene audio signal of some or all channels; therefore, the attribute information of the target virtual speaker and the audio signals of K channels in the scene audio signal can be obtained as the first audio signal; then the first audio signal, the attribute information of the target virtual speaker and the high-order energy gain encoding result are encoded to obtain a first bit stream.

示例性的，可以對第一音訊信號和目標虛擬揚聲器的屬性資訊進行下混、變換、量化以及熵編碼等操作，以得到第一碼流，另外，還可以將高階能量增益編碼結果寫入到第一碼流中。也就是說，該第一碼流中可以包括場景音訊信號中第一音訊信號的編碼資料，以及目標虛擬揚聲器的屬性資訊的編碼資料，以及高階能量增益編碼結果。Exemplarily, the first audio signal and the property information of the target virtual speaker may be downmixed, transformed, quantized, and entropy encoded to obtain a first bitstream, and the high-order energy gain encoding result may be written into the first bitstream. In other words, the first bitstream may include the encoding data of the first audio signal in the scene audio signal, the encoding data of the property information of the target virtual speaker, and the high-order energy gain encoding result.

相對於現有技術中其他重建場景音訊信號的方法而言，基於虛擬揚聲器信號重建出的場景音訊信號的音訊品質更高；因此當K等於C1時，在同等碼率下，本發明重建出的場景音訊信號的音訊品質更高。Compared with other methods of reconstructing scene audio signals in the prior art, the scene audio signals reconstructed based on the virtual speaker signals have higher audio quality; therefore, when K is equal to C1, at the same bit rate, the scene audio signals reconstructed by the present invention have higher audio quality.

當K小於C1時，在對場景音訊信號編碼的過程中，本發明編碼的音訊信號的通道數，小於現有技術編碼的音訊信號的通道數，且目標虛擬揚聲器的屬性資訊的資料量，也遠小一個通道的音訊信號的資料量；因此在達到同等品質的前提下，本發明編碼碼率更低。When K is less than C1, in the process of encoding the scene audio signal, the number of channels of the audio signal encoded by the present invention is less than the number of channels of the audio signal encoded by the prior art, and the amount of data of the attribute information of the target virtual speaker is also much smaller than the amount of data of the audio signal of one channel; therefore, under the premise of achieving the same quality, the encoding bit rate of the present invention is lower.

此外，現有技術是將場景音訊信號轉換為虛擬揚聲器信號和殘差信號後再編碼，而本發明編碼端直接編碼場景音訊信號中部分通道的音訊信號，無需計算虛擬揚聲器信號和殘差信號，編碼端的編碼複雜度更低。In addition, the prior art converts the scene audio signal into a virtual speaker signal and a residual signal and then encodes the signal. However, the encoding end of the present invention directly encodes the audio signals of some channels in the scene audio signal without calculating the virtual speaker signal and the residual signal, and the encoding complexity of the encoding end is lower.

FIG. 3 is a schematic diagram of an exemplary decoding process. FIG. 3 shows the decoding process corresponding to the encoding process of FIG. 2a.

S301，接收第一碼流。S301, receiving a first code stream.

S302，解碼第一碼流，以得到第一重建信號和目標虛擬揚聲器的屬性資訊。S302: Decode the first bit stream to obtain a first reconstructed signal and property information of a target virtual speaker.

示例性的，可以對第一碼流包含的場景音訊信號中第一音訊信號的編碼資料進行解碼，可以得到第一重建信號；也就是說，第一重建信號是第一音訊信號的重建信號。以及可以對第一碼流包含的目標虛擬揚聲器的屬性資訊的編碼資料進行解碼，可以得到目標虛擬揚聲器的屬性資訊。Exemplarily, the coded data of the first audio signal in the scene audio signal included in the first bitstream can be decoded to obtain the first reconstructed signal; that is, the first reconstructed signal is a reconstructed signal of the first audio signal. And the coded data of the property information of the target virtual speaker included in the first bitstream can be decoded to obtain the property information of the target virtual speaker.

S303，基於目標虛擬揚聲器的屬性資訊和第一重建信號，生成目標虛擬揚聲器對應的虛擬揚聲器信號。S303: Generate a virtual speaker signal corresponding to the target virtual speaker based on the property information of the target virtual speaker and the first reconstructed signal.

S304，基於目標虛擬揚聲器的屬性資訊和虛擬揚聲器信號進行重建，以得到第一重建場景音訊信號。第一重建場景音訊信號包括C2個通道的音訊信號，C2為正整數。S304: Reconstruct based on the property information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal. The first reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer.

示例性的，可以基於虛擬揚聲器信號，來重建場景音訊信號；進而可以先基於目標虛擬揚聲器的屬性資訊和第一重建信號，生成目標虛擬揚聲器對應虛擬揚聲器信號。其中，一個目標虛擬揚聲器對應一路虛擬揚聲器信號，虛擬揚聲器信號是平面波。接著，再基於目標虛擬揚聲器的屬性資訊和虛擬揚聲器信號進行重建，生成第一重建場景音訊信號。Exemplarily, the scene audio signal can be reconstructed from virtual speaker signals. To that end, a virtual speaker signal corresponding to the target virtual speaker is first generated based on the property information of the target virtual speaker and the first reconstructed signal. Each target virtual speaker corresponds to one virtual speaker signal, and the virtual speaker signal is a plane wave. Reconstruction is then performed based on the property information of the target virtual speaker and the virtual speaker signal to generate the first reconstructed scene audio signal.

示例性的，當場景音訊信號為HOA信號時，重建得到的第一重建場景音訊信號也可以為HOA信號，該HOA信號可以是N2階HOA信號，N2為正整數。示例性的，N2階HOA信號可以包括C2個通道的音訊信號，C2 = (N2+1)^2。Exemplarily, when the scene audio signal is an HOA signal, the first reconstructed scene audio signal may also be an HOA signal, and this HOA signal may be an N2-order HOA signal, where N2 is a positive integer. Exemplarily, the N2-order HOA signal may include audio signals of C2 channels, where C2 = (N2+1)^2.

示例性的，第一重建場景音訊信號的階數N2，可以大於或等於圖2a實施例中場景音訊信號的階數N1；對應的，第一重建場景音訊信號包括的音訊信號的通道數C2，可以大於或等於圖2a實施例中場景音訊信號包括的音訊信號的通道數C1。Exemplarily, the order N2 of the first reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal in the embodiment of FIG. 2a ; correspondingly, the number of channels C2 of the audio signal included in the first reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signal included in the scene audio signal in the embodiment of FIG. 2a .

示例性的，場景音訊信號為N1階HOA信號，N1階HOA信號包括第二音訊信號，第二音訊信號為N1階HOA信號中除第一音訊信號之外的音訊信號，C1等於（N1+1）的平方；和/或，Exemplarily, the scene audio signal is an N1-order HOA signal, the N1-order HOA signal includes a second audio signal, the second audio signal is an audio signal in the N1-order HOA signal except the first audio signal, and C1 is equal to the square of (N1+1); and/or,

第一重建場景音訊信號為N2階HOA信號，N2階HOA信號包括第三音訊信號，第三音訊信號為N2階HOA信號中與第二音訊信號的各通道對應的重建信號，C2等於（N2+1）的平方。The first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal includes a third audio signal, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and C2 is equal to the square of (N2+1).
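
The relation between HOA order and channel count stated above (C1 = (N1+1)^2, C2 = (N2+1)^2) can be illustrated with a minimal sketch. The function name `hoa_channel_count` is hypothetical and is not taken from the original text.

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels in an HOA signal of the given order: (order + 1) ** 2."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2


# Example: a third-order HOA signal (N = 3) carries 16 channels.
n1_channels = hoa_channel_count(3)
```

Under this relation, N2 >= N1 implies C2 >= C1, which matches the statement that the reconstructed signal may have at least as many channels as the original scene audio signal.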

一種可能的方式中，可以直接將第一重建場景音訊信號，作為最終的解碼結果。In one possible approach, the first reconstructed scene audio signal can be directly used as the final decoding result.

S305，根據第一重建場景音訊信號中的重建信號的頻帶序號和/或第一重建場景音訊信號的階數確定衰減因數。S305: Determine an attenuation factor according to a frequency band number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal.

示例性的，解碼端可以根據第一重建場景音訊信號中的重建信號的頻帶序號獲取衰減因數，或者解碼端根據第一重建場景音訊信號的階數獲取衰減因數，該第一重建場景音訊信號的階數具體可以是N2階HOA信號的階數，例如Ambisonic階數，或者解碼端可以根據上述頻帶序號和第一重建場景音訊信號的階數獲取衰減因數，該衰減因數可以稱為雙衰減因數。衰減因數可以根據重建信號的頻帶序號和第一重建場景音訊信號的階數兩個因素中的至少一個進行衰減，該衰減因數可用於對第一重建場景音訊信號進行調整，以使得重建場景音訊信號的品質更高。Exemplarily, the decoding end may obtain the attenuation factor according to the frequency band number of a reconstructed signal in the first reconstructed scene audio signal; or according to the order of the first reconstructed scene audio signal, which may specifically be the order of the N2-order HOA signal, for example the Ambisonic order; or according to both the frequency band number and the order, in which case the attenuation factor may be called a double attenuation factor. The attenuation factor thus decays with at least one of the two factors, namely the frequency band number of the reconstructed signal and the order of the first reconstructed scene audio signal, and it is used to adjust the first reconstructed scene audio signal so that the quality of the reconstructed scene audio signal is higher.

S306，根據高階能量增益編碼結果和衰減因數對第一重建場景音訊信號進行調整，以得到重建後的場景音訊信號。S306: Adjust the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

其中，解碼端從第一碼流中獲取高階能量增益編碼結果，利用高階能量增益編碼結果和衰減因數對第一重建場景音訊信號進行能量調整。解碼端利用高階能量增益編碼結果調節第一重建場景音訊信號的高階通道能量，使場景音訊信號的解碼品質更高。The decoding end obtains the high-order energy gain coding result from the first bit stream, and uses the high-order energy gain coding result and the attenuation factor to adjust the energy of the first reconstructed scene audio signal. The decoding end uses the high-order energy gain coding result to adjust the high-order channel energy of the first reconstructed scene audio signal, so that the decoding quality of the scene audio signal is higher.

其次，由於現有技術編碼傳輸的虛擬揚聲器信號和殘差資訊是通過原始音訊信號（即待編碼的場景音訊信號）轉換而來的，並不是原始音訊信號，會引入誤差；而本發明編碼了部分原始音訊信號（即待編碼的場景音訊信號中K個通道的音訊信號），避免了誤差的引入，進而能夠提高解碼得到重建場景音訊信號的音訊品質；且還能夠避免解碼得到重建場景音訊信號的重建品質的波動，穩定性高。Secondly, since the virtual speaker signal and residual information encoded and transmitted by the existing technology are converted from the original audio signal (i.e., the scene audio signal to be encoded), and are not the original audio signal, errors will be introduced; while the present invention encodes part of the original audio signal (i.e., the audio signals of K channels in the scene audio signal to be encoded), avoiding the introduction of errors, thereby improving the audio quality of the reconstructed scene audio signal obtained by decoding; and it can also avoid the fluctuation of the reconstruction quality of the reconstructed scene audio signal obtained by decoding, and has high stability.

此外，綜合編碼端和解碼端，相對於現有技術的編碼端和解碼端而言，本發明的編碼端和解碼端無需進行殘差和疊加操作，因此本發明編碼端和解碼端的綜合複雜度，低於現有技術編碼端和解碼端的綜合複雜度。由於編碼端發送的第一碼流中包括高階能量增益編碼結果，因此高階能量增益可用於在解碼端調節高階通道能量，使場景音訊信號的編解碼品質更高。In addition, taking the encoding end and the decoding end together, neither needs to perform the residual and superposition operations required in the prior art, so the combined complexity of the encoding end and decoding end of the present invention is lower than that of the prior art. Since the first bitstream sent by the encoding end includes the high-order energy gain coding result, the high-order energy gain can be used at the decoding end to adjust the high-order channel energy, so that the encoding and decoding quality of the scene audio signal is higher.

以下對編碼過程中高階能量增益的編碼過程，以及解碼過程中高階能量增益對音訊信號的調整過程進行說明。The following describes the encoding process of the high-order energy gain in the encoding process and the adjustment process of the high-order energy gain on the audio signal in the decoding process.

圖4為示例性示出的編碼過程示意圖。FIG. 4 is a schematic diagram showing an exemplary encoding process.

S401，獲取待編碼的場景音訊信號，場景音訊信號包括C1個通道的音訊信號，C1為正整數。S401, obtaining a scene audio signal to be encoded, wherein the scene audio signal includes audio signals of C1 channels, where C1 is a positive integer.

示例性的，S401可以參照上述S201的描述，在此不再贅述。Exemplarily, S401 may refer to the description of S201 above, which will not be repeated here.

S402，獲取場景音訊信號對應的目標虛擬揚聲器的屬性資訊。S402, obtaining property information of a target virtual speaker corresponding to the scene audio signal.

一種可能的方式中，基於目標虛擬揚聲器的位置資訊，生成目標虛擬揚聲器的屬性資訊。其中，一種可能的方式中，可以將目標虛擬揚聲器的位置資訊（包括俯仰角資訊和水平角資訊），作為目標虛擬揚聲器的屬性資訊。一種可能的方式中，將目標虛擬揚聲器的位置資訊對應的位置索引（包括俯仰角索引（可以用於唯一標識俯仰角資訊）和水平角索引（可以用於唯一標識水平角資訊）），作為目標虛擬揚聲器的屬性資訊。In one possible manner, based on the position information of the target virtual speaker, the attribute information of the target virtual speaker is generated. In one possible manner, the position information of the target virtual speaker (including the pitch angle information and the horizontal angle information) can be used as the attribute information of the target virtual speaker. In one possible manner, the position index corresponding to the position information of the target virtual speaker (including the pitch angle index (which can be used to uniquely identify the pitch angle information) and the horizontal angle index (which can be used to uniquely identify the horizontal angle information)) is used as the attribute information of the target virtual speaker.

一種可能的方式中，可以將目標虛擬揚聲器的虛擬揚聲器索引（例如，虛擬揚聲器標識），作為目標虛擬揚聲器的屬性資訊。其中，虛擬揚聲器索引與位置資訊一一對應。In one possible manner, a virtual speaker index (e.g., a virtual speaker identifier) of the target virtual speaker may be used as the attribute information of the target virtual speaker, where virtual speaker indexes and position information are in one-to-one correspondence.

一種可能的方式中，可以將目標虛擬揚聲器的虛擬揚聲器係數，作為目標虛擬揚聲器的屬性資訊。示例性的，可以確定目標虛擬揚聲器的C2個虛擬揚聲器係數，將目標虛擬揚聲器的C2個虛擬揚聲器係數，作為目標虛擬揚聲器的屬性資訊；其中，目標虛擬揚聲器的C2個虛擬揚聲器係數與第一重建場景音訊信號包括的C2個通道數的音訊信號一一對應。In one possible manner, the virtual speaker coefficient of the target virtual speaker can be used as the attribute information of the target virtual speaker. Exemplarily, C2 virtual speaker coefficients of the target virtual speaker can be determined, and the C2 virtual speaker coefficients of the target virtual speaker can be used as the attribute information of the target virtual speaker; wherein the C2 virtual speaker coefficients of the target virtual speaker correspond one-to-one to the audio signal of C2 channels included in the first reconstructed scene audio signal.

需要說明的是，虛擬揚聲器係數的資料量，遠大於位置資訊、位置資訊的索引和虛擬揚聲器索引的資料量；可以根據頻寬，決策採用位置資訊、位置資訊的索引、虛擬揚聲器索引和虛擬揚聲器係數中的哪種資訊，作為目標虛擬揚聲器的屬性資訊。例如，當頻寬較大時，可以將虛擬揚聲器係數，作為目標虛擬揚聲器的屬性資訊；這樣，無需解碼端計算目標虛擬揚聲器的虛擬揚聲器係數，可以節省解碼端的算力。當頻寬較小時，可以將位置資訊、位置資訊的索引、虛擬揚聲器索引中的任一種，作為目標虛擬揚聲器的屬性資訊；這樣，可以節省碼率。應該理解的是，也可以預先設置採用位置資訊、位置資訊的索引、虛擬揚聲器索引和虛擬揚聲器係數中的哪種資訊，作為目標虛擬揚聲器的屬性資訊；本發明對此不作限制。It should be noted that the amount of data of virtual speaker coefficients is much larger than the amount of data of location information, location information index, and virtual speaker index; based on the bandwidth, it can be decided which of the location information, location information index, virtual speaker index, and virtual speaker coefficients to use as the attribute information of the target virtual speaker. For example, when the bandwidth is large, the virtual speaker coefficients can be used as the attribute information of the target virtual speaker; in this way, the decoder does not need to calculate the virtual speaker coefficients of the target virtual speaker, which can save the computing power of the decoder. When the bandwidth is small, any one of the position information, the index of the position information, and the virtual speaker index can be used as the attribute information of the target virtual speaker; in this way, the bit rate can be saved. It should be understood that it is also possible to pre-set which of the position information, the index of the position information, the virtual speaker index, and the virtual speaker coefficient is used as the attribute information of the target virtual speaker; the present invention is not limited to this.
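
The bandwidth-based choice described above can be sketched as follows. This is only an illustration: the names `AttributeKind` and `choose_attribute_kind` and the threshold of 256 kbps are hypothetical, since the text does not specify how "large" and "small" bandwidth are distinguished.

```python
from enum import Enum


class AttributeKind(Enum):
    POSITION = "position"                # pitch angle and horizontal angle
    POSITION_INDEX = "position_index"    # indexes identifying the angles
    SPEAKER_INDEX = "speaker_index"      # virtual speaker index
    COEFFICIENTS = "coefficients"        # C2 virtual speaker coefficients


def choose_attribute_kind(
    bandwidth_kbps: float,
    threshold_kbps: float = 256.0,  # hypothetical threshold, not from the text
    compact_kind: AttributeKind = AttributeKind.SPEAKER_INDEX,
) -> AttributeKind:
    """Send coefficients when bandwidth is large, a compact form otherwise."""
    if bandwidth_kbps >= threshold_kbps:
        # Larger payload, but the decoder need not compute the coefficients.
        return AttributeKind.COEFFICIENTS
    # Smaller payload saves bit rate; the decoder derives the coefficients.
    return compact_kind
```

The trade-off is the one stated in the text: sending coefficients saves decoder computation, while sending positions or indexes saves bit rate.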

S403、獲取第一音訊信號的能量增益和第二音訊信號的能量增益。S403: Obtain energy gain of the first audio signal and energy gain of the second audio signal.

場景音訊信號所對應的特徵資訊包括增益資訊，場景音訊信號包括第一音訊信號和第二音訊信號，分別計算第一音訊信號的能量增益E(1，b)和第二音訊信號的能量增益E(i，b)。The feature information corresponding to the scene audio signal includes gain information. The scene audio signal includes a first audio signal and a second audio signal. The energy gain E(1, b) of the first audio signal and the energy gain E(i, b) of the second audio signal are calculated respectively.

S404、根據第一音訊信號的能量增益和第二音訊信號的能量增益獲取高階能量增益。S404: Obtain a high-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal.

示例性的，根據第一音訊信號的能量增益和第二音訊信號的能量增益獲取高階能量增益，包括：Exemplarily, obtaining a high-order energy gain according to an energy gain of a first audio signal and an energy gain of a second audio signal comprises:

通過如下方式獲取高階能量增益Gain’(i，b)：The high-order energy gain Gain’(i, b) is obtained as follows:

Gain’(i,b) = 10*log10(E(i,b) / E(1,b));

其中，log10表示以10為底的對數函數，*表示相乘運算，E(1，b)為第一音訊信號的通道能量，E(i，b)為第二音訊信號的各通道能量，i為第二音訊信號的通道編號，b為第二音訊信號的頻帶序號。Here, log10 denotes the base-10 logarithm, * denotes multiplication, E(1, b) is the channel energy of the first audio signal, E(i, b) is the energy of each channel of the second audio signal, i is the channel number within the second audio signal, and b is the frequency band number of the second audio signal.

示例性的，第二音訊信號的特徵資訊可以為N1階HOA信號的高階能量增益，具體為第二音訊信號的各個通道與W通道（N1階HOA信號的第1通道）的能量比例，該W通道具體可以是第一音訊信號的通道。Exemplarily, the characteristic information of the second audio signal may be the high-order energy gain of the N1-order HOA signal, specifically the energy ratio of each channel of the second audio signal to the W channel (the first channel of the N1-order HOA signal), and the W channel may specifically be the channel of the first audio signal.

示例性的，可以參照如下步驟獲取第二音訊信號的特徵資訊：Exemplarily, the characteristic information of the second audio signal may be obtained by referring to the following steps:

對N1階HOA信號進行時頻變換，將時域N1階HOA信號變換得到頻域N1階HOA信號。The N1-order HOA signal is subjected to time-frequency transformation, and the time-domain N1-order HOA signal is transformed into a frequency-domain N1-order HOA signal.

計算W通道能量E(1，b)和第二音訊信號的各通道能量E(i，b)，其中，i為第二音訊信號的通道編號。Calculate W channel energies E(1, b) and each channel energy E(i, b) of the second audio signal, where i is the channel number of the second audio signal.

計算高階能量增益Gain’(i，b)可以採用以下公式：The following formula can be used to calculate the high-order energy gain Gain’(i, b):

Gain(i,b) = E(i,b) / E(1,b);

Gain’(i,b) = 10*log10(Gain(i,b)).
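
The gain computation above can be sketched in a few lines. The sketch assumes that a band energy is the sum of squared magnitudes of the frequency-domain coefficients in that band, which the text does not state, and it adds a small `eps` guard against zero energy. The function names are hypothetical.

```python
import math
from typing import Sequence


def band_energy(coeffs: Sequence[complex]) -> float:
    """Energy of one band, taken here as the sum of squared magnitudes (an assumption)."""
    return sum(abs(c) ** 2 for c in coeffs)


def high_order_energy_gain_db(e_w: float, e_i: float, eps: float = 1e-12) -> float:
    """Gain'(i,b) = 10*log10(E(i,b) / E(1,b)); eps is an assumed guard against zero energy."""
    return 10.0 * math.log10((e_i + eps) / (e_w + eps))
```

For instance, a high-order channel carrying one tenth of the W channel energy in a band yields a gain of about -10 dB.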

S405、對高階能量增益進行量化，以得到量化後的高階能量增益。S405: quantize the high-order energy gain to obtain a quantized high-order energy gain.

S406、對量化後的高階能量增益進行熵編碼，以得到所述高階能量增益編碼結果。S406: Perform entropy coding on the quantized high-order energy gain to obtain the high-order energy gain coding result.

其中，獲取場景音訊信號中第二音訊信號的特徵資訊，通過第二音訊信號的特徵資訊得到場景音訊信號的高階能量增益，對高階能量增益依次進行量化和熵編碼。The characteristic information of the second audio signal in the scene audio signal is obtained, the high-order energy gain of the scene audio signal is obtained through the characteristic information of the second audio signal, and the high-order energy gain is sequentially quantized and entropy encoded.

示例性的，可以採用標量量化對高階能量增益量化。Exemplarily, scalar quantization may be used to quantize high-order energy gains.
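
One possible scalar quantizer is a uniform one, sketched below together with its inverse as used at the decoding end. The step size of 1.5 dB, the lower bound of -45 dB and the 32 levels are hypothetical values chosen only for illustration; the text does not specify the quantizer parameters.

```python
def quantize_gain(gain_db: float, step_db: float = 1.5,
                  min_db: float = -45.0, num_levels: int = 32) -> int:
    """Uniform scalar quantization of a gain in dB to an index in [0, num_levels - 1]."""
    index = round((gain_db - min_db) / step_db)
    return max(0, min(num_levels - 1, index))


def dequantize_gain(index: int, step_db: float = 1.5, min_db: float = -45.0) -> float:
    """Inverse of quantize_gain: map an index back to a gain in dB."""
    return min_db + index * step_db
```

Within the covered range, the reconstruction error of such a quantizer is at most half a step.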

對量化後的高階能量增益進行熵編碼。熵編碼方法不做限定。The quantized high-order energy gain is entropy-coded. The entropy coding method is not limited.

示例性的，對高階能量增益進行差分編碼，然後估計熵編碼的比特數，如果估計比特數小於定長編碼所需的比特數，對高階能量增益進行變長編碼，例如哈夫曼編碼；否則對高階能量增益進行定長編碼。Exemplarily, the high-order energy gain is differentially encoded, and the number of bits required by entropy coding is then estimated. If the estimated number of bits is less than the number of bits required by fixed-length coding, the high-order energy gain is variable-length encoded, for example with Huffman coding; otherwise, it is fixed-length encoded.
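
The mode decision described above can be sketched as follows. The code-length table, the escape cost and the convention that the first index is always sent with a fixed number of bits are assumptions made for illustration; the text does not define the Huffman table or the bitstream layout.

```python
from typing import Dict, List, Sequence, Tuple


def differences(indices: Sequence[int]) -> List[int]:
    """Differences between consecutive quantized gain indices."""
    return [b - a for a, b in zip(indices, indices[1:])]


def choose_coding_mode(
    indices: Sequence[int],
    fixed_bits_per_index: int,
    code_lengths: Dict[int, int],  # hypothetical Huffman code lengths per difference
    escape_bits: int = 16,         # assumed cost of a difference missing from the table
) -> Tuple[str, int]:
    """Return ("variable", bits) if the estimate beats fixed-length coding, else ("fixed", bits)."""
    if not indices:
        return ("fixed", 0)
    fixed_bits = len(indices) * fixed_bits_per_index
    # Assumption: the first index is sent as-is, the rest as coded differences.
    variable_bits = fixed_bits_per_index + sum(
        code_lengths.get(d, escape_bits) for d in differences(indices)
    )
    if variable_bits < fixed_bits:
        return ("variable", variable_bits)
    return ("fixed", fixed_bits)
```

Slowly varying gains produce small differences with short codes, so the variable-length mode tends to win; rapidly varying gains fall back to fixed-length coding.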

在得到高階能量增益編碼結果之後，將編碼結果寫入碼流。After obtaining the high-order energy gain coding result, the coding result is written into the bitstream.

S407，編碼場景音訊信號中第一音訊信號和目標虛擬揚聲器的屬性資訊和高階能量增益編碼結果，以得到第一碼流。S407, encoding the first audio signal and the property information of the target virtual speaker and the high-order energy gain encoding result in the scene audio signal to obtain a first bit stream.

應該理解的是，第一音訊信號所包括的音訊信號的通道數，可以按照需求以及頻寬確定，本發明對此不作限制。It should be understood that the number of channels of the audio signal included in the first audio signal can be determined according to demand and bandwidth, and the present invention does not impose any limitation on this.

本發明實施例中，編碼端可以計算高階通道與W通道的能量比，從而得到高階能量增益編碼結果，然後根據子幀間差分結果的比特數預估，選擇哈夫曼編碼，或直接編碼。從而使得編碼端發送的第一碼流中包括高階能量增益編碼結果，因此高階能量增益可用於在解碼端調節高階通道能量，使場景音訊信號的編解碼品質更高。In the embodiment of the present invention, the encoder can calculate the energy ratio of the high-order channel to the W channel to obtain the high-order energy gain coding result, and then select Huffman coding or direct coding according to the bit number estimation of the difference result between subframes. As a result, the first bit stream sent by the encoder includes the high-order energy gain coding result, so the high-order energy gain can be used to adjust the high-order channel energy at the decoder, so that the encoding and decoding quality of the scene audio signal is higher.

圖5a為示例性示出的解碼過程示意圖。圖5a為與圖4編碼過程中對應的解碼過程。Fig. 5a is a schematic diagram of an exemplary decoding process. Fig. 5a is a decoding process corresponding to the encoding process in Fig. 4.

S501，接收第一碼流。S501: Receive a first bitstream.

S502，解碼第一碼流，以得到第一重建信號和目標虛擬揚聲器的屬性資訊和高階能量增益編碼結果。S502, decoding the first bit stream to obtain the first reconstructed signal and the property information of the target virtual speaker and the high-order energy gain coding result.

S503，基於目標虛擬揚聲器的屬性資訊和第一音訊信號，生成所述目標虛擬揚聲器對應的虛擬揚聲器信號；S503, generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;

S504，基於目標虛擬揚聲器的屬性資訊和所述虛擬揚聲器信號進行重建，以得到第一重建場景音訊信號；所述第一重建場景音訊信號包括C2個通道的音訊信號，C2為正整數。S504: Reconstruct based on the property information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer.

示例性的，S501~S504，可以參照S301~S304的描述，在此不再贅述。For example, S501 to S504 can refer to the description of S301 to S304, which will not be repeated here.

示例性的，上述S306可以參照S505~S508的描述。Exemplarily, the above S306 can refer to the description of S505~S508.

S505，對高階能量增益編碼結果進行熵解碼，以得到熵解碼後的高階能量增益。S505, entropy decoding is performed on the high-order energy gain encoding result to obtain the high-order energy gain after entropy decoding.

S506，對熵解碼後的高階能量增益進行反量化，以得到高階能量增益。S506, dequantizing the high-order energy gain after entropy decoding to obtain the high-order energy gain.

示例性的，從第一碼流中讀取高階能量增益編碼結果。對高階能量增益編碼結果進行熵解碼。熵解碼方法為編碼端熵編碼的逆過程。Exemplarily, a high-order energy gain coding result is read from the first bitstream. Entropy decoding is performed on the high-order energy gain coding result. The entropy decoding method is the inverse process of entropy coding at the encoding end.

示例性的，如果編碼端採用定長編碼，則解碼端使用與之對應的定長解碼，如果編碼端採用變長編碼，則解碼端使用與之對應的變長解碼，例如哈夫曼解碼。Exemplarily, if the encoding end uses fixed-length coding, the decoding end uses the corresponding fixed-length decoding; if the encoding end uses variable-length coding, the decoding end uses the corresponding variable-length decoding, for example Huffman decoding.

對熵解碼結果進行反量化，反量化方法為編碼端量化方法的逆過程。The entropy decoding result is dequantized, and the dequantization method is the inverse process of the quantization method at the encoding end.

S507，根據第二音訊信號的特徵資訊和第一音訊信號的特徵資訊對高階能量增益進行調整，以得到調整後的解碼高階能量增益。S507, adjusting the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain an adjusted decoding high-order energy gain.

其中，解碼端進行信號重建，得到第一重建場景音訊信號之後，從第一重建場景音訊信號中確定第一音訊信號和第三音訊信號，第三音訊信號為N2階HOA信號中與第二音訊信號的各通道對應的重建信號，根據第三音訊信號的特徵資訊確定第二音訊信號的特徵資訊，最後根據第二音訊信號的特徵資訊和第一音訊信號的特徵資訊對高階能量增益進行調整，以得到調整後的解碼高階能量增益，對高階能量增益進行調整，使得高階通道能量更加均勻和平滑，重建出的音訊信號的品質更優。Among them, the decoding end performs signal reconstruction, and after obtaining the first reconstructed scene audio signal, the first audio signal and the third audio signal are determined from the first reconstructed scene audio signal, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and the characteristic information of the second audio signal is determined according to the characteristic information of the third audio signal. Finally, the high-order energy gain is adjusted according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain the adjusted decoded high-order energy gain. The high-order energy gain is adjusted so that the high-order channel energy is more uniform and smooth, and the quality of the reconstructed audio signal is better.

示例性的，S507根據第二音訊信號的特徵資訊和第一音訊信號的特徵資訊對高階能量增益進行調整，包括：Exemplarily, S507 adjusts the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal, including:

S5071，根據第一音訊信號的通道能量和高階能量增益獲取第二音訊信號的高階能量；S5071, obtaining high-order energy of the second audio signal according to the channel energy and high-order energy gain of the first audio signal;

其中，解碼端從第一碼流中獲取高階能量增益編碼結果，對高階能量增益編碼結果進行熵解碼，反量化，得到高階能量增益。再根據第一音訊信號的通道能量和高階能量增益對第二音訊信號的能量進行估計，以確定第二音訊信號的高階能量。The decoding end obtains the high-order energy gain coding result from the first bit stream, performs entropy decoding on the high-order energy gain coding result, and dequantizes it to obtain the high-order energy gain. Then, the energy of the second audio signal is estimated according to the channel energy and the high-order energy gain of the first audio signal to determine the high-order energy of the second audio signal.

示例性的，第一重建場景音訊信號為N2階HOA信號，對N2階HOA信號進行時頻變換，將時域N2階HOA信號變換得到頻域N2階HOA信號。Exemplarily, the first reconstructed scene audio signal is an N2-order HOA signal, and the N2-order HOA signal is subjected to a time-frequency transformation, and the time-domain N2-order HOA signal is transformed into a frequency-domain N2-order HOA signal.

計算第二音訊信號的高階能量E_Ref(i，b)，可以採用以下公式：The high-order energy E_Ref(i, b) of the second audio signal can be calculated using the following formula:

E_Ref(i,b) = E_dec(1,b) * 10^(Gain’(i,b)/10)

其中，E_dec(1，b)為N2階HOA信號中第一音訊信號的第b個頻帶的通道能量，i為第二音訊信號對應的通道編號，Gain’(i，b)為高階能量增益，b為第一音訊信號的頻帶序號。Wherein, E_dec(1, b) is the channel energy of the bth frequency band of the first audio signal in the N2-order HOA signal, i is the channel number corresponding to the second audio signal, Gain’(i, b) is the high-order energy gain, and b is the frequency band number of the first audio signal.
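
The estimate E_Ref(i,b) = E_dec(1,b) * 10^(Gain’(i,b)/10) simply inverts the decibel conversion applied at the encoding end. A minimal sketch, with a hypothetical function name:

```python
def reference_energy(e_dec_w: float, gain_db: float) -> float:
    """E_Ref(i,b) = E_dec(1,b) * 10^(Gain'(i,b)/10): convert the dB gain back to a ratio."""
    return e_dec_w * 10.0 ** (gain_db / 10.0)
```

A decoded gain of -10 dB with a W channel energy of 2.0 therefore gives a reference energy of about 0.2 for that channel and band.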

S5072，根據第三音訊信號的通道能量和第二音訊信號的高階能量獲取解碼能量比例因數。S5072: Obtain a decoding energy proportional factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal.

具體的，第三音訊信號為N2階HOA信號中與第二音訊信號的各通道對應的重建信號，通過第三音訊信號和第二音訊信號進行能量比例計算，得到解碼能量比例因數。Specifically, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and the decoding energy ratio factor is obtained by calculating the energy ratio of the third audio signal and the second audio signal.

示例性的，計算解碼能量比例因數g(i，b)，可以採用以下公式：Exemplarily, the decoding energy proportional factor g(i, b) may be calculated using the following formula:

g(i,b) = sqrt(E_Ref(i,b)) / sqrt(E_dec(i,b))

其中，sqrt()為開方運算，E_dec(i，b)為第三音訊信號的第b個頻帶的通道能量，i為第三音訊信號對應的通道編號，E_Ref(i，b)為第二音訊信號的第b個頻帶的高階能量。Wherein, sqrt() is a square root operation, E_dec(i, b) is the channel energy of the b-th frequency band of the third audio signal, i is the channel number corresponding to the third audio signal, and E_Ref(i, b) is the high-order energy of the b-th frequency band of the second audio signal.
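
The decoding energy proportional factor is an amplitude ratio, which is why square roots of the energies are taken. A minimal sketch follows; the `eps` term in the denominator is an assumed guard against a zero decoded energy and is not part of the formula in the text.

```python
import math


def decoding_energy_scale(e_ref: float, e_dec: float, eps: float = 1e-12) -> float:
    """g(i,b) = sqrt(E_Ref(i,b)) / sqrt(E_dec(i,b)), with eps as an assumed zero guard."""
    return math.sqrt(e_ref) / math.sqrt(e_dec + eps)
```

If the reference energy is four times the decoded energy, the factor is about 2, meaning the decoded amplitude would need to double to match the reference.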

S5073，根據第三音訊信號的通道能量和第一音訊信號的通道能量獲取第三音訊信號的解碼高階能量增益。S5073: Obtain a decoded high-order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal.

其中，以第一音訊信號的通道能量為基準，對第三音訊信號的通道能量進行增益計算，以得到第三音訊信號的解碼高階能量增益。The channel energy of the first audio signal is used as a reference to calculate the gain of the channel energy of the third audio signal to obtain the decoded high-order energy gain of the third audio signal.

示例性的，計算解碼高階能量增益Gain_dec(i，b)，可以採用以下公式：Exemplarily, the following formula may be used to calculate the decoding high-order energy gain Gain_dec(i, b):

Gain_dec(i,b) = E_dec(i,b) / E_dec(1,b)

其中，E_dec(1，b)為N2階HOA信號中第一音訊信號的第b個頻帶的通道能量，E_dec(i，b)為第三音訊信號的第i個通道的第b個頻帶的通道能量。Here, E_dec(1, b) is the channel energy of the b-th frequency band of the first audio signal in the N2-order HOA signal, and E_dec(i, b) is the channel energy of the b-th frequency band of the i-th channel of the third audio signal.
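
The decoded gain can be computed for all high-order channels and bands at once, as sketched below. The layout of `e_dec` (row 0 for the W channel, which the text numbers as channel 1) and the zero-energy guard are assumptions made for illustration.

```python
from typing import List, Sequence


def decoded_high_order_gains(e_dec: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return Gain_dec(i,b) = E_dec(i,b) / E_dec(1,b) for every high-order channel.

    e_dec[0] holds the per-band energies of the first (W) channel; the
    remaining rows hold the channels of the third audio signal.
    A zero W energy yields a gain of 0.0 (an assumed guard, not from the text).
    """
    w = e_dec[0]
    return [
        [e / w_b if w_b > 0.0 else 0.0 for e, w_b in zip(row, w)]
        for row in e_dec[1:]
    ]
```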

S5074，根據解碼能量比例因數對第三音訊信號的解碼高階能量增益進行調整，以得到調整後的解碼高階能量增益。S5074, adjusting the decoding high-order energy gain of the third audio signal according to the decoding energy proportional factor to obtain an adjusted decoding high-order energy gain.

具體的，為使得高階通道的能量更加均勻和平滑，使用解碼能量比例因數對第三音訊信號的解碼高階能量增益進行調整，確定調整後的解碼高階能量增益。使用解碼能量比例因數調整之後，高階通道的能量更加均勻和平滑，重建出的音訊信號的品質更優。Specifically, in order to make the energy of the high-order channel more uniform and smooth, the decoding high-order energy gain of the third audio signal is adjusted using the decoding energy proportional factor, and the adjusted decoding high-order energy gain is determined. After the decoding energy proportional factor is adjusted, the energy of the high-order channel is more uniform and smooth, and the quality of the reconstructed audio signal is better.

示例性的，根據解碼能量比例因數對第三音訊信號的解碼高階能量增益進行調整，以得到調整後的解碼高階能量增益，包括：Exemplarily, adjusting the decoding high-order energy gain of the third audio signal according to the decoding energy proportional factor to obtain the adjusted decoding high-order energy gain includes:

通過如下方式獲取調整後的解碼高階能量增益Gain_dec’(i，b)：The adjusted decoding high-order energy gain Gain_dec’(i, b) is obtained as follows:

其中，g(i，b)表示解碼能量比例因數， Gain_dec(i，b)表示第三音訊信號的第b個頻帶的解碼高階能量增益，w為預設的調節比例閾值，min表示取最小值運算，*表示相乘運算。Here, g(i, b) denotes the decoding energy proportional factor, Gain_dec(i, b) denotes the decoded high-order energy gain of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, min denotes the minimum operation, and * denotes multiplication.

示例性的，min(a,b)為取得a和b中的最小值，w為調節比例閾值，w的取值方式有多種，例如w的取值為0.25。Exemplarily, min(a,b) returns the smaller of a and b, and w is the adjustment ratio threshold. The value of w can be chosen in various ways; for example, w may be 0.25.

S508，根據調整後的解碼高階能量增益對N2階HOA信號中的第三音訊信號進行調整，以得到調整後的第三音訊信號。S508: Adjust the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain to obtain an adjusted third audio signal.

其中，解碼端從第一碼流中獲取高階能量增益編碼結果，利用高階能量增益編碼結果對N2階HOA信號中的第三音訊信號進行能量調整。解碼端利用高階能量增益編碼結果調節第三音訊信號的高階通道能量，使第三音訊信號的解碼品質更高。The decoding end obtains the high-order energy gain coding result from the first bit stream, and uses the high-order energy gain coding result to adjust the energy of the third audio signal in the N2-order HOA signal. The decoding end uses the high-order energy gain coding result to adjust the high-order channel energy of the third audio signal, so that the decoding quality of the third audio signal is higher.

其中，第三音訊信號為N2階HOA信號中與第二音訊信號各通道對應的通道音訊信號。The third audio signal is a channel audio signal corresponding to each channel of the second audio signal in the N2-order HOA signal.

示例性的，可以基於N1階HOA信號中的第二音訊信號所對應的特徵資訊，對第三音訊信號進行調整，以提升N2階HOA信號的品質。Exemplarily, the third audio signal may be adjusted based on the feature information corresponding to the second audio signal in the N1-order HOA signal to improve the quality of the N2-order HOA signal.

示例性的，S508根據調整後的解碼高階能量增益對N2階HOA信號中的第三音訊信號進行調整，包括：Exemplarily, S508 adjusts the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain, including:

S5081，根據第三音訊信號所在的頻帶序號和/或N2階HOA信號的階數獲取衰減因數。S5081: Obtain an attenuation factor according to the frequency band number of the third audio signal and/or the order of the N2-order HOA signal.

示例的，解碼端可以根據第三音訊信號所在的頻帶序號獲取衰減因數，或者解碼端根據N2階HOA信號的階數獲取衰減因數，該N2階HOA信號的階數具體可以是Ambisonic階數，或者解碼端可以根據上述頻帶序號和N2階HOA信號的階數獲取衰減因數，該衰減因數可以稱為雙衰減因數。For example, the decoding end can obtain the attenuation factor according to the frequency band number where the third audio signal is located, or the decoding end can obtain the attenuation factor according to the order of the N2-order HOA signal, and the order of the N2-order HOA signal can specifically be the Ambisonic order, or the decoding end can obtain the attenuation factor according to the above-mentioned frequency band number and the order of the N2-order HOA signal, and the attenuation factor can be called a double attenuation factor.

S5082，根據調整後的解碼高階能量增益和衰減因數對第三音訊信號進行調整，得到調整後的第三音訊信號，調整後的第三音訊信號屬於所述重建後的場景音訊信號。S5082, adjusting the third audio signal according to the adjusted decoded high-order energy gain and attenuation factor to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal.

其中，獲取到調整後的解碼高階能量增益之後，可以對當前幀的第三音訊信號的增益進行加權處理，增益隨著第三音訊信號所在的頻帶序號和/或N2階HOA信號的階數進行衰減，可以先根據第三音訊信號所在的頻帶序號和/或N2階HOA信號的階數獲取衰減因數。例如，該衰減因數可以隨頻帶和Ambisonic階數兩個因素衰減，然後將調整後的解碼高階能量增益和獲取到的衰減因數作用於當前幀重建的第三音訊信號的高階通道，使得高階通道能量更加均勻和平滑，提高重建的音訊信號的品質。Among them, after obtaining the adjusted decoded high-order energy gain, the gain of the third audio signal of the current frame can be weighted, and the gain can be attenuated along with the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal. The attenuation factor can be first obtained according to the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal. For example, the attenuation factor can be attenuated along with the two factors of frequency band and Ambisonic order, and then the adjusted decoded high-order energy gain and the obtained attenuation factor are applied to the high-order channel of the third audio signal reconstructed in the current frame, so that the energy of the high-order channel is more uniform and smooth, and the quality of the reconstructed audio signal is improved.

示例性的，使用調整後的解碼高階能量增益Gain_dec’(i，b)和衰減因數g’(i，b)對第三音訊信號進行調整。Exemplarily, the third audio signal is adjusted using the adjusted decoded high-order energy gain Gain_dec’(i, b) and attenuation factor g’(i, b).

示例性的，可以參照如下公式進行調整：For example, the following formula may be used for adjustment:

X’(i,b) = X(i,b) * Gain_dec’(i,b) * g’(i,b);

其中，X(i，b)為調整前的第三音訊信號，X’(i，b)為調整後的第三音訊信號。Wherein, X(i, b) is the third audio signal before adjustment, and X'(i, b) is the third audio signal after adjustment.
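
Applying the adjustment amounts to scaling every frequency-domain coefficient of a given channel and band by the product of the two factors. A minimal sketch, assuming the coefficients of one band are held in a list and using a hypothetical function name:

```python
from typing import List, Sequence


def adjust_band(coeffs: Sequence[complex], adjusted_gain: float,
                attenuation: float) -> List[complex]:
    """X'(i,b) = X(i,b) * Gain_dec'(i,b) * g'(i,b), applied to each coefficient of band b."""
    scale = adjusted_gain * attenuation
    return [c * scale for c in coeffs]
```

Because both factors are real scalars per channel and band, the adjustment changes only the magnitude of the coefficients and leaves their phase untouched.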

一種可能的方式中，S5081根據第三音訊信號所在的頻帶序號和N2階HOA信號的階數獲取衰減因數，包括：In one possible manner, S5081 obtains the attenuation factor according to the frequency band number of the third audio signal and the order of the N2-order HOA signal, including:

示例性的，b為第三音訊信號的頻帶序號，頻帶序號又可以稱為子帶序號，b=0, 1, 2, …, 11。Exemplarily, b is the frequency band number of the third audio signal, and the frequency band number can also be called the sub-band number, b=0, 1, 2, …, 11.

通過上述衰減因數g’(i，b)的計算方式，b為第三音訊信號的頻帶序號，表示目標虛擬揚聲器的數量，表示N2階HOA信號的映射通道數量，，M為N2階HOA信號的階數，γ表示第三音訊信號的通道號i對應的階數，通過上述參數可以準確計算出衰減因數，通過參數的調節使得衰減因數隨著揚聲器數量、HOA信號映射通道數量、HOA階數三層因素而改變，使得該衰減因數和調整後的解碼高階能量增益用於調整第三音訊信號時，提高重建音訊信號的品質。 By calculating the attenuation factor g'(i, b) as above, b is the frequency band number of the third audio signal. Indicates the number of target virtual speakers. Indicates the number of mapping channels of N2-order HOA signals. , M is the order of the N2-order HOA signal, γ represents the order corresponding to the channel number i of the third audio signal. The attenuation factor can be accurately calculated through the above parameters. The attenuation factor changes with the number of speakers, the number of HOA signal mapping channels, and the HOA order through parameter adjustment, so that when the attenuation factor and the adjusted decoded high-order energy gain are used to adjust the third audio signal, the quality of the reconstructed audio signal is improved.

一種可能的方式中，本發明實施例提供的場景音訊信號解碼方法還包括：In one possible manner, the scene audio signal decoding method provided by the embodiment of the present invention further includes:

When b ≤ d, the corresponding parameter of the attenuation factor is updated to a first value, where d is a preset first threshold;

When b > d, the same parameter is updated to a second value.

Exemplarily, d is a preset first threshold whose value is not limited. The parameter takes one value when b ≤ d and is reduced to another value when b > d. When the band number b does not exceed the first threshold, the b-th band can be regarded as a low band and the attenuation coefficient is set to a smaller value, for example 0.375. When b exceeds the threshold, the b-th band can be regarded as a high band and the attenuation coefficient is set to a larger value, for example 0.5. The values 0.375 and 0.5 are only one possible implementation: 0.375 may be replaced by 0.38 or 0.37, and 0.5 may be replaced by 0.55 or 0.6. The value of the attenuation coefficient is determined according to the application scenario and is not limited here. Because the parameter is adjusted according to the band number b, the attenuation becomes more pronounced as the band gets higher, so that the reconstructed audio signal better matches the characteristics of human hearing.
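The threshold rule described above can be sketched as below. The default values 0.375 and 0.5 are the example values from the text; the function name and the choice of d in the usage lines are assumptions for illustration only, since the value of the first threshold is not limited.

```python
def attenuation_coefficient(b, d, low=0.375, high=0.5):
    """Pick the attenuation coefficient for band number b.

    Bands with b <= d are treated as low bands and get the smaller value;
    bands with b > d are treated as high bands and get the larger value.
    """
    if b <= d:
        return low
    return high


# Hypothetical threshold d = 5 for a 12-band layout (b = 0..11).
coefficients = [attenuation_coefficient(b, d=5) for b in range(12)]
```

Keeping `low` and `high` as parameters reflects that the text allows other values, such as 0.38 or 0.55, depending on the application scenario.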

示例性的，表示目標虛擬揚聲器的數量，取值為0,1,2,3…N，例如 2，3，4。可以對進一步調整：當子帶b≤第一閾值時，，否則。 For example, Indicates the number of target virtual speakers, with a value of 0, 1, 2, 3…N. For example 2, 3, 4. Further adjustment: When subband b ≤ the first threshold, , otherwise .

將所述更新為，； will be described Updated to , ;

其中，bands表示所述第三音訊信號的頻帶數量，即bands表示子帶數量，例如bands的取值可配置為12。此處只是舉例說明，不作為對本發明實施例的限定。Bands represents the number of frequency bands of the third audio signal, that is, bands represents the number of sub-bands, and for example, the value of bands can be configured as 12. This is only an example for illustration and is not intended to limit the embodiments of the present invention.

本發明實施例中，將的取值更新為，，即通過對的取值擴大，使得衰減因數的取值與所在第i個頻帶相關，達到低頻帶和高頻帶衰減效率不同的結果，使得該衰減因數用於調整第三音訊信號時，提高重建音訊信號的品質。 In the embodiment of the present invention, The value of is updated to , , that is, by The value of is expanded so that the value of the attenuation factor is related to the i-th frequency band, achieving different attenuation efficiencies for the low-band and the high-band, so that when the attenuation factor is used to adjust the third audio signal, the quality of the reconstructed audio signal is improved.

當b≤d時，將更新為，；d為預設的第一閾值，w為預設的調節比例閾值。 When b≤d, Updated to , ; d is the default first threshold, w is the default adjustment ratio threshold.

示例性的，表示HOA信號的映射通道數量，取值為，M為HOA信號階數，取值為0,1,2,3…N，當子帶b≤第一閾值時，。當b≤第一閾值時，上述方案中根據頻帶b的取值大小，可以靈活調整的取值，達到隨著頻帶越高衰減因數衰減效果越顯著的效果，從而使得重建出的音訊信號更符合人耳聽覺特性。 For example, Indicates the number of mapping channels of HOA signals, and its value is , M is the HOA signal order, The value is 0, 1, 2, 3...N. When sub-band b ≤ the first threshold, When b ≤ the first threshold, the above scheme can be flexibly adjusted according to the value of the frequency band b. The value of is set so that the attenuation effect becomes more significant as the frequency band becomes higher, so that the reconstructed audio signal is more in line with the hearing characteristics of the human ear.

一種可能的方式中，本發明實施例提供的場景音訊信號解碼方法還包括：將w更新為w2，w2=w+ ×0.05。 In one possible manner, the scene audio signal decoding method provided by the embodiment of the present invention further includes: updating w to w2, where w2=w+ ×0.05.

示例性的，w取值方法如下，w初始值為0，依次遍歷取值，計算得到最後w的調整值。例如將w更新為w2，w2 = w + × 0.05。可以根據對w的取值進行更新，使得衰減因數中的權重w與參數建立關係，隨著的增加w也增加，達到隨著頻帶越高衰減因數衰減效果越顯著的效果，從而使得重建出的音訊信號更符合人耳聽覺特性。 For example, the method of obtaining the value of w is as follows. The initial value of w is 0, and the values are traversed in sequence. Take the value and calculate the final adjustment value of w. For example, update w to w2, w2 = w + × 0.05. Update the value of w so that the weight w in the attenuation factor is consistent with the parameter Build relationships, As w increases, the attenuation effect becomes more significant as the frequency band increases, making the reconstructed audio signal more in line with the auditory characteristics of the human ear.

一種可能的方式中，當i的取值為0、1、2、或3時，γ的取值為1；In one possible approach, when the value of i is 0, 1, 2, or 3, the value of γ is 1;

當i的取值為4、5、6、7、或8時，γ的取值為2；When the value of i is 4, 5, 6, 7, or 8, the value of γ is 2;

當i的取值為9、10、11、12、13、14、或15時，γ的取值為3；When the value of i is 9, 10, 11, 12, 13, 14, or 15, the value of γ is 3;

其中，i為第三音訊信號的第i個通道的編號。Wherein, i is the number of the i-th channel of the third audio signal.

Exemplarily, γ denotes the order of the HOA signal to which the i-th channel belongs. Table 1 gives the correspondence between γ and i:

Table 1

i     γ
0     1
1     1
2     1
3     1
4     2
5     2
6     2
7     2
8     2
9     3
10    3
11    3
12    3
13    3
14    3
15    3

通過上述γ的取值可知，通過γ的取值確定衰減因數，γ取值為與通道所在HOA階數相關的分段函數，隨著i所在的HOA階數的增加，γ取值增加，但不會超過最大HOA階數，該衰減因數用於調整第三音訊信號時，提高重建音訊信號的品質。It can be seen from the above-mentioned value of γ that the attenuation factor is determined by the value of γ. The value of γ is a piecewise function related to the HOA order of the channel. As the HOA order of i increases, the value of γ increases, but will not exceed the maximum HOA order. This attenuation factor is used to improve the quality of the reconstructed audio signal when adjusting the third audio signal.
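The piecewise mapping from channel number i to γ can be written as a small lookup. This sketch covers only the 16 channels listed in the text; the function name is hypothetical.

```python
def gamma_for_channel(i):
    """Return gamma for channel number i, following the mapping in the text."""
    if 0 <= i <= 3:
        return 1
    if 4 <= i <= 8:
        return 2
    if 9 <= i <= 15:
        return 3
    # Channels outside 0..15 are not covered by the mapping given in the text.
    raise ValueError(f"channel number {i} is outside the range 0..15")


gammas = [gamma_for_channel(i) for i in range(16)]
```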

示例性的，S5082根據調整後的解碼高階能量增益和衰減因數對N2階HOA信號中的第三音訊信號進行調整之後，所述方法還包括：Exemplarily, after S5082 adjusts the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain and attenuation factor, the method further includes:

S5083，獲取調整後的第三音訊信號對應的第四音訊信號的通道能量，第三音訊信號包括當前幀的音訊信號，第四音訊信號包括當前幀的在先幀的音訊信號；S5083, obtaining channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal includes an audio signal of a current frame, and the fourth audio signal includes an audio signal of a previous frame of the current frame;

S5084，根據第四音訊信號的通道能量對調整後的第三音訊信號再次進行調整。S5084: adjust the adjusted third audio signal again according to the channel energy of the fourth audio signal.

其中，解碼端還可以利用第三音訊信號的在先幀對當前幀的調整後的第三音訊信號再次進行調整，以使得重建的音訊信號的品質提高。第三音訊信號包括當前幀的音訊信號，第四音訊信號包括當前幀的在先幀的音訊信號，例如在先幀是可以與當前幀相鄰的之前幀的音訊信號，或者在先幀也可以是不與當前幀相鄰的之前幀的音訊信號，該第四音訊信號的通道能量可用於調整第三音訊信號。例如，解碼端將第三音訊信號的當前幀的高階通道和前2幀的高階通道對應子帶做線性加權，得到能量平滑後的當前幀的高階通道。Among them, the decoding end can also use the previous frame of the third audio signal to adjust the third audio signal after the adjustment of the current frame again, so as to improve the quality of the reconstructed audio signal. The third audio signal includes the audio signal of the current frame, and the fourth audio signal includes the audio signal of the previous frame of the current frame. For example, the previous frame can be an audio signal of a previous frame adjacent to the current frame, or the previous frame can also be an audio signal of a previous frame not adjacent to the current frame. The channel energy of the fourth audio signal can be used to adjust the third audio signal. For example, the decoding end linearly weights the high-order channel of the current frame of the third audio signal and the high-order channel corresponding subbands of the previous two frames to obtain the high-order channel of the current frame after energy smoothing.

示例性的，S5084根據第四音訊信號的通道能量對第三音訊信號進行調整，包括：Exemplarily, S5084 adjusts the third audio signal according to the channel energy of the fourth audio signal, including:

S50841，獲取第四音訊信號的通道能量平均值和調整後的第三音訊信號的通道能量；S50841, obtaining a channel energy average value of the fourth audio signal and an adjusted channel energy of the third audio signal;

其中，第四音訊信號的通道能量平均值可以是第四音訊信號的所有通道能量的平均值。The average value of the channel energy of the fourth audio signal may be an average value of all channel energies of the fourth audio signal.

S50842，根據第四音訊信號的通道能量平均值和第三音訊信號的通道能量獲取能量平均閾值。S50842: Obtain an energy average threshold according to the channel energy average of the fourth audio signal and the channel energy of the third audio signal.

能量平均閾值是對第三音訊信號和第四音訊信號各自的通道能量進行計算得到的閾值。The energy average threshold is a threshold obtained by calculating the channel energies of the third audio signal and the fourth audio signal.

示例性的，根據第四音訊信號的通道能量平均值和第三音訊信號的通道能量獲取能量平均閾值，包括：Exemplarily, obtaining the energy average threshold according to the channel energy average of the fourth audio signal and the channel energy of the third audio signal includes:

通過如下方式獲取能量平均閾值k：The energy average threshold k is obtained as follows:

可以採用以下公式： The following formula can be used:

其中，E_mean(i)表示所述第四音訊信號的通道能量平均值，E’_dec(i)表示所述調整後的第三音訊信號的能量。Wherein, E_mean(i) represents the channel energy average value of the fourth audio signal, and E’_dec(i) represents the energy of the adjusted third audio signal.

S50843，根據能量平均閾值對第四音訊信號的通道能量平均值和調整後的第三音訊信號的通道能量進行加權平均計算，以得到目標能量；S50843, performing weighted average calculation on the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal according to the energy average threshold to obtain a target energy;

計算目標能量E_target(i，b)，可以採用以下公式：To calculate the target energy E_target(i, b), the following formula can be used:

E_target(i，b) = k * E_mean(i，b) + (1-k)*E’_dec(i，b)；E_target(i,b) = k * E_mean(i,b) + (1-k)*E’_dec(i,b);

其中，E_mean(i，b)為在先幀能量的平均值，E’_dec(i，b)為調整後的第三音訊信號的能量。Among them, E_mean(i, b) is the average value of the energy in the previous frame, and E’_dec(i, b) is the energy of the adjusted third audio signal.
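The weighted average for the target energy can be sketched as follows. The function name is hypothetical, and the value of k in the usage line is made up, because the formula for the energy average threshold k is not reproduced here.

```python
def target_energy(k, e_mean, e_dec_adj):
    """E_target(i, b) = k * E_mean(i, b) + (1 - k) * E'_dec(i, b)."""
    return k * e_mean + (1.0 - k) * e_dec_adj


# With k = 0.25 the result leans toward the current-frame energy.
e_target = target_energy(0.25, e_mean=8.0, e_dec_adj=4.0)
```

A larger k pulls the target toward the average energy of the previous frames, which smooths frame-to-frame energy changes more strongly.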

S50844，根據目標能量和調整後的第三音訊信號的通道能量獲取能量平滑因數；S50844, obtaining an energy smoothing factor according to the target energy and the adjusted channel energy of the third audio signal;

能量平滑因數可以用於對第三音訊信號的調整，使得第三音訊信號的解碼品質更高。The energy smoothing factor can be used to adjust the third audio signal so that the decoding quality of the third audio signal is higher.

示例性的，根據目標能量和調整後的第三音訊信號的通道能量獲取能量平滑因數，包括：Exemplarily, obtaining an energy smoothing factor according to the target energy and the adjusted channel energy of the third audio signal includes:

其中，E_target(i，b)表示目標能量，E’_dec(i，b)表示第三音訊信號的能量。Wherein, E_target(i, b) represents the target energy, and E’_dec(i, b) represents the energy of the third audio signal.

S50845、根據能量平滑因數對第三音訊信號進行調整。S50845. Adjust the third audio signal according to the energy smoothing factor.

通過使用能量平滑因數q(i，b)對調整後的第三音訊信號再次調整，進一步提高第三音訊信號的解碼品質。The adjusted third audio signal is further adjusted by using the energy smoothing factor q(i,b) to further improve the decoding quality of the third audio signal.

示例性的，可以參照如下公式對第三音訊信號進行調整：Exemplarily, the third audio signal may be adjusted according to the following formula:

X’’(i，b) = X’(i，b) * q(i，b)；X’’(i, b) = X’(i, b) * q(i, b);

示例性的，在得到調整後的第三音訊信號之後，還可以用調整後的第三音訊信號的能量更新在先幀能量的平均值。Exemplarily, after obtaining the adjusted third audio signal, the energy of the adjusted third audio signal may be used to update the average value of the energy of the previous frame.
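The smoothing step and the update of the previous-frame average can be sketched as below. The formula for q(i, b) is not reproduced in this text, so the sketch assumes q = sqrt(E_target / E'_dec), which scales the band so that its energy equals the target energy. That formula, the function names, the `eps` guard and the two-frame history length are all assumptions for illustration.

```python
import math


def smooth_band(x_adj, e_target, e_dec_adj, eps=1e-12):
    """Apply X''(i, b) = X'(i, b) * q(i, b) to the coefficients of one band.

    Assumption: q = sqrt(E_target / E'_dec). eps avoids division by zero.
    """
    q = math.sqrt(e_target / max(e_dec_adj, eps))
    return [c * q for c in x_adj], q


def update_history(history, new_energy, max_frames=2):
    """Append the newest energy and return the trimmed history and its mean.

    Assumption: the average is taken over the last max_frames frames.
    """
    history = (history + [new_energy])[-max_frames:]
    return history, sum(history) / len(history)


smoothed, q = smooth_band([1.0, 2.0], e_target=4.0, e_dec_adj=1.0)
history, e_mean = update_history([1.0, 3.0], 5.0)
```

Because amplitudes are scaled by q, the band energy is scaled by q squared, which is why the square root of the energy ratio is used in this sketch.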

圖5b為示例性示出的解碼過程示意圖。圖5b為與圖4編碼過程中對應的解碼過程。Fig. 5b is a schematic diagram of an exemplary decoding process. Fig. 5b is a decoding process corresponding to the encoding process in Fig. 4.

S501，接收第一碼流。S501, receiving a first code stream.

示例性的，從第一碼流中讀取高階能量增益編碼結果。對高階能量增益編碼結果進行熵解碼。熵解碼方法為編碼端熵編碼的逆過程。Exemplarily, a high-order energy gain coding result is read from the first bitstream. Entropy decoding is performed on the high-order energy gain coding result. The entropy decoding method is the reverse process of the entropy coding at the encoding end.

S509，根據第一碼流確定第一重建場景音訊信號的彌散度（diffusion）因數。S509, determining a diffusion factor of the first reconstructed scene audio signal according to the first bitstream.

由於重建HOA信號是由虛擬揚聲器信號計算得到，因此僅包含了待編碼HOA信號中有明確方位的聲源成分，也叫方向性成分，缺少待編碼HOA信號中的環境成分，也叫非方向性成分。因此通過音訊增益對重建HOA信號各個通道進行能量調整，只能做到令方向性成分的能量與待編碼HOA信號的能量更接近，而對非方向性成分的調整不穩定。Since the reconstructed HOA signal is calculated from the virtual speaker signal, it only contains the sound source component with a clear direction in the HOA signal to be encoded, also called the directional component, and lacks the environmental component in the HOA signal to be encoded, also called the non-directional component. Therefore, by adjusting the energy of each channel of the reconstructed HOA signal through the audio gain, it can only make the energy of the directional component closer to the energy of the HOA signal to be encoded, while the adjustment of the non-directional component is unstable.

The dispersion is a parameter that describes the degree of diffusion of an HOA signal and is used in HOA sound field reconstruction and sound field replay. In the embodiment of the present invention, the dispersion of the HOA signal is used to measure the energy proportion of the non-directional components in the HOA signal to be encoded and to adjust the energy of each channel of the reconstructed HOA signal, so that the energy of the adjusted reconstructed HOA signal is closer to that of the HOA signal to be encoded. Because the dispersion itself describes the directionality of the sound source, using it as an energy adjustment parameter allows the non-directional components to be compensated adaptively while the directional components are left uncompensated. The dispersion can therefore make up for the shortcomings of the gain adjustment method based on the signal energy ratio without introducing additional error.

In the embodiment of the present invention, the dispersion is used to adjust the energy gain so that the energy of the reconstructed HOA signal is more accurate.

一種可能的實現方式中，S509根據所述第一碼流確定所述第一重建場景音訊信號的彌散度因數，包括：In a possible implementation, S509 determines the dispersion factor of the first reconstructed scene audio signal according to the first bitstream, including:

從所述第一碼流中解碼得到所述彌散度因數，其中，所述第一碼流中包括所述彌散度因數。The dispersion factor is obtained by decoding the first code stream, wherein the first code stream includes the dispersion factor.

具體的，在編碼端根據待編碼HOA信號計算彌散度，將彌散度量化編碼至第一碼流，解碼端解碼該第一碼流，然後通過逆量化得到彌散度，然後用於對重建HOA信號中未編碼通道的增益調整。Specifically, the dispersion is calculated according to the HOA signal to be encoded at the encoding end, the dispersion is quantized and encoded into the first bit stream, the decoding end decodes the first bit stream, and then the dispersion is obtained by inverse quantization, which is then used to adjust the gain of the uncoded channel in the reconstructed HOA signal.

根據從所述第一碼流中解碼得到的所述第一重建信號獲取所述彌散度因數。The dispersion factor is obtained according to the first reconstructed signal decoded from the first code stream.

其中，彌散度計算方法有多種，在解碼端根據解碼傳輸通道中的低階HOA信號計算彌散度，然後用於對重建HOA信號中未編碼通道的增益調整。There are many methods for calculating the dispersion. At the decoding end, the dispersion is calculated based on the low-order HOA signal in the decoded transmission channel, and then used to adjust the gain of the uncoded channel in the reconstructed HOA signal.

一種可能的實現方式中，S509根據從所述第一碼流中解碼得到的所述第一重建信號獲取所述彌散度因數，包括：In a possible implementation, S509 obtains the dispersion factor according to the first reconstructed signal obtained by decoding the first bit stream, including:

S5091，獲取所述第一重建信號中的每個頻帶的聲場強度；S5091, obtaining the sound field intensity of each frequency band in the first reconstructed signal;

S5092, obtaining the energy of each frequency band;

S5093, determining the dispersion factor according to the sound field intensity of each frequency band and the energy of each frequency band.

示例性的，S5091獲取所述第一重建信號中的每個頻帶的聲場強度，包括：Exemplarily, S5091 obtains the sound field intensity of each frequency band in the first reconstructed signal, including:

The sound field intensity I of the b-th frequency band is obtained as follows:

Wherein, Re denotes the real-part operation, and the remaining symbols denote the four channel signals of the b-th frequency band in the first reconstructed signal.

示例性的，S5092中每個頻帶的能量可以通過計算每個頻帶的實部的平方加虛部的平方之和得到。Exemplarily, the energy of each frequency band in S5092 can be obtained by calculating the sum of the square of the real part and the square of the imaginary part of each frequency band.
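The band energy described above can be sketched as a sum over the complex spectral coefficients of one band. Treating a band as a list of complex bins is an assumption about the data layout, and the function name is hypothetical.

```python
def band_energy(bins):
    """Sum of (real part squared + imaginary part squared) over a band's bins."""
    return sum(c.real ** 2 + c.imag ** 2 for c in bins)


energy = band_energy([3 + 4j, 1j])
```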

S5093根據所述每個頻帶的聲場強度和和所述每個頻帶的能量確定所述彌散度因數，包括：S5093 determines the dispersion factor according to the sound field intensity of each frequency band and the energy of each frequency band, including:

The dispersion factor is obtained as follows:

Wherein, E denotes the expectation operation, ‖I‖ denotes the two-norm of the sound field intensity I, and the remaining term denotes the energy of the b-th frequency band.

It can be understood that the calculation of the dispersion in S5091 to S5093 above is only an example implementation and is not intended to limit the embodiments of the present invention.
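Steps S5091 to S5093 can be illustrated with a commonly used DirAC-style estimate. The exact formulas are not reproduced in this text, so everything below is an assumption rather than the formula of the embodiment: the intensity is taken as Re{conj(W) · [X, Y, Z]} for the four first-order channels W, X, Y, Z, the energy is half the sum of their squared magnitudes, the expectation is a plain average over frames, and the dispersion is 1 − ‖E{I}‖ / E{energy}. All function names are hypothetical.

```python
import math


def intensity(w, x, y, z):
    """Assumed intensity vector for one band: Re{conj(W) * [X, Y, Z]}."""
    cw = w.conjugate()
    return ((cw * x).real, (cw * y).real, (cw * z).real)


def four_channel_energy(w, x, y, z):
    """Assumed normalisation: half the summed squared magnitudes."""
    return 0.5 * (abs(w) ** 2 + abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2)


def dispersion(frames):
    """Estimate the dispersion of one band from a list of (W, X, Y, Z) frames.

    The expectation E{.} is approximated by the mean over the given frames.
    """
    n = len(frames)
    mean_i = [0.0, 0.0, 0.0]
    mean_e = 0.0
    for w, x, y, z in frames:
        vec = intensity(w, x, y, z)
        for axis in range(3):
            mean_i[axis] += vec[axis] / n
        mean_e += four_channel_energy(w, x, y, z) / n
    if mean_e <= 0.0:
        return 0.0  # silent band: arbitrary choice made for this sketch
    norm = math.sqrt(sum(c * c for c in mean_i))
    # Clamp to [0, 1] to absorb rounding error.
    return min(1.0, max(0.0, 1.0 - norm / mean_e))


# A steady source from one direction versus a source flipping direction.
directional = dispersion([(1 + 0j, 1 + 0j, 0j, 0j)] * 4)
diffuse = dispersion([(1 + 0j, 1 + 0j, 0j, 0j), (1 + 0j, -1 + 0j, 0j, 0j)])
```

Under these assumptions a single stable direction gives a value near 0, and intensity vectors that cancel over time give a value near 1, which matches the role of the dispersion as a measure of non-directional energy.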

S510，根據第一重建場景音訊信號中的重建信號的頻帶序號和/或第一重建場景音訊信號的階數確定衰減因數。根據彌散度因數對衰減因數進行線性加權，以得到加權後的衰減因數。S510: Determine an attenuation factor according to a frequency band sequence number of a reconstructed signal in a first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal, and linearly weight the attenuation factor according to a dispersion factor to obtain a weighted attenuation factor.

具體的，彌散度因數還可以用於對衰減因數的線性加權，加權後的衰減因數用於對第三音訊信號進行調整。使用加權方法來平衡衰減因數中彌散成分和方向性成分之間的占比，由於彌散度因數可用於衡量待編碼HOA信號中非方向性成分的能量比例，並用來調整重建HOA信號各個通道的能量，使能量調整後的重建HOA信號與待編碼HOA信號能量更接近。Specifically, the dispersion factor can also be used to linearly weight the attenuation factor, and the weighted attenuation factor is used to adjust the third audio signal. The weighting method is used to balance the proportion between the dispersion component and the directional component in the attenuation factor. Since the dispersion factor can be used to measure the energy ratio of the non-directional component in the HOA signal to be encoded, and is used to adjust the energy of each channel of the reconstructed HOA signal, the reconstructed HOA signal after energy adjustment is closer to the energy of the HOA signal to be encoded.

對於彌散度因數對衰減因數進行線性加權的具體實現方式不做限定，舉例說明如下：There is no limitation on the specific implementation method of linearly weighting the attenuation factor by the dispersion factor, and an example is given as follows:

一種可能的實現方式中，根據彌散度因數對衰減因數進行線性加權，得到加權後的衰減因數，包括：In one possible implementation, the attenuation factor is linearly weighted according to the dispersion factor to obtain a weighted attenuation factor, including:

通過如下至少一種方式獲取加權後的衰減因數gd(i,b)：The weighted attenuation factor gd(i,b) is obtained by at least one of the following methods:

gd(i,b) = w × diffusion(b) + (1 − w) × g’(i,b), where w is a preset adjustment ratio threshold, × denotes multiplication, diffusion(b) is the dispersion factor of the b-th frequency band of the third audio signal, and g’(i,b) is the attenuation factor of the b-th frequency band of the third audio signal;

or,

when the attenuation factor is for a full-band signal, gd(i,b) = w × mean(diffusion) + (1 − w) × g’(i,b), where mean(diffusion) is the average of the dispersion factors of the multiple frequency bands of the third audio signal, g’(i,b) is the attenuation factor of the b-th frequency band of the third audio signal, and w is a preset adjustment ratio threshold;

or,

gd(i,b) = w × diffusion(b) + (1 − w) × g’(i,b) + offset(i,b), where offset(i,b) is the offset constant of the b-th frequency band on the i-th channel of the third audio signal, g’(i,b) is the attenuation factor of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and diffusion(b) is the dispersion factor of the b-th frequency band of the third audio signal;

or,

gd(i,b) = w × diffusion(b) + (1 − w) × g’(i,b) + direction(i,b), where direction(i,b) is the direction parameter of the b-th frequency band on the i-th channel of the third audio signal, g’(i,b) is the attenuation factor of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and diffusion(b) is the dispersion factor of the b-th frequency band of the third audio signal.

可以理解的是，上述舉例中對衰減因數的加權計算只是一種舉例實現方式，此處不做為對本發明實施例的限定。It is understandable that the weighted calculation of the attenuation factor in the above example is only an example implementation method, and is not intended to limit the embodiments of the present invention.
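The linear weighting variants can be sketched as below. The sketch assumes that the second term of each variant is the unweighted attenuation factor multiplied by (1 − w); the function names and the single `extra` argument, which stands in for either offset(i,b) or direction(i,b), are hypothetical.

```python
def weighted_attenuation(w, diffusion_b, atten, extra=0.0):
    """gd = w * diffusion(b) + (1 - w) * atten + extra.

    extra is 0 for the basic variant, or the offset or direction term
    for the variants that add one.
    """
    return w * diffusion_b + (1.0 - w) * atten + extra


def weighted_attenuation_fullband(w, diffusion, atten):
    """Full-band variant: uses the mean dispersion over all bands."""
    mean_diffusion = sum(diffusion) / len(diffusion)
    return w * mean_diffusion + (1.0 - w) * atten


gd_basic = weighted_attenuation(0.5, 0.2, 0.6)
gd_offset = weighted_attenuation(0.5, 0.2, 0.6, extra=0.1)
gd_full = weighted_attenuation_fullband(0.5, [0.2, 0.4], 0.5)
```

With w close to 1 the result follows the dispersion factor, and with w close to 0 it follows the original attenuation factor, which is how the weighting balances the two.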

S511，根據第一音訊信號的通道能量和高階能量增益獲取第二音訊信號的高階能量；根據第三音訊信號的通道能量和第二音訊信號的高階能量獲取解碼能量比例因數。S511, obtaining high-order energy of the second audio signal according to the channel energy and high-order energy gain of the first audio signal; obtaining a decoding energy proportional factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal.

g(i, b) = sqrt(E_Ref(i, b)) / sqrt(E_dec(i, b));
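The decoding energy proportional factor can be sketched as follows. The function name and the `eps` guard against a zero denominator are assumptions added for this illustration.

```python
import math


def decoding_energy_scale(e_ref, e_dec, eps=1e-12):
    """g(i, b) = sqrt(E_Ref(i, b)) / sqrt(E_dec(i, b)).

    eps is an assumed safeguard for bands whose decoded energy is zero.
    """
    return math.sqrt(e_ref) / math.sqrt(max(e_dec, eps))


g = decoding_energy_scale(4.0, 1.0)
```

Since g is a ratio of square roots of energies, multiplying the band's coefficients by g brings the band energy from E_dec to E_Ref.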

S512，根據加權後的衰減因數和解碼能量比例因數對N2階HOA信號中的第三音訊信號進行調整，以得到調整後的第三音訊信號。S512: Adjust the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy proportional factor to obtain an adjusted third audio signal.

其中，解碼端根據加權後的衰減因數和解碼能量比例因數對N2階HOA信號中的第三音訊信號進行調整。解碼端利用加權後的衰減因數和解碼能量比例因數調節第三音訊信號的高階通道能量，使第三音訊信號的解碼品質更高。The decoder adjusts the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy proportional factor. The decoder uses the weighted attenuation factor and the decoding energy proportional factor to adjust the high-order channel energy of the third audio signal, so that the decoding quality of the third audio signal is higher.

一種可能的實現方式中，根據加權後的衰減因數和解碼能量比例因數對N2階HOA信號中的第三音訊信號進行調整，以得到調整後的第三音訊信號：In one possible implementation, the third audio signal in the N2-order HOA signal is adjusted according to the weighted attenuation factor and the decoding energy proportional factor to obtain an adjusted third audio signal:

通過如下方式獲取調整後的第三音訊信號X’(i,b)：The adjusted third audio signal X’(i,b) is obtained by the following method:

X’(i,b) = X(i,b) × g(i,b) × gd(i,b);

Wherein, gd(i,b) is the weighted attenuation factor, g(i,b) is the decoding energy proportional factor, and X(i,b) is the third audio signal before adjustment.

可以理解的是，上述舉例中對第三音訊信號的調整計算只是一種舉例實現方式，此處不做為對本發明實施例的限定。It is understandable that the adjustment calculation of the third audio signal in the above example is only an example implementation method, and is not intended to limit the embodiments of the present invention.

As an example, the input signal at the encoding end is a 3rd-order HOA signal comprising audio signals of 16 channels. The first audio signal consists of channels 1 to 5, channel 7, and channels 9 to 10; the second audio signal consists of channel 6, channel 8, and channels 11 to 16. The bitstream produced by the encoding end has the following two implementations: 1. The encoded bitstream does not include the high-order energy gain coding result. 2. The encoded bitstream includes the high-order energy gain coding result. The scene audio decoding method executed by the decoding end has the following three implementations: 1. When the received bitstream does not include the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the bitstream. 2. When the received bitstream includes the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the bitstream and adjusts the reconstructed scene audio signal according to the high-order energy gain coding result to obtain the reconstructed scene audio signal. 3. In the embodiment of the present invention, when the received bitstream includes the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the bitstream and adjusts the reconstructed scene audio signal according to both the high-order energy gain coding result and the attenuation factor to obtain the reconstructed scene audio signal.

通過對重建場景音訊信號的信號品質進行分析可知，沒有攜帶高階能量增益編碼結果的解碼HOA信號，品質很差。有攜帶高階能量增益編碼結果的解碼HOA信號，但沒有通過衰減因數進行重建場景音訊信號的調整，品質中等。有攜帶高階能量增益編碼結果的解碼HOA信號，有通過衰減因數進行重建場景音訊信號的調整，品質最優。By analyzing the signal quality of the reconstructed scene audio signal, it can be seen that the decoded HOA signal without the high-order energy gain coding result has very poor quality. The decoded HOA signal with the high-order energy gain coding result, but without adjusting the reconstructed scene audio signal through the attenuation factor, has medium quality. The decoded HOA signal with the high-order energy gain coding result and with the reconstructed scene audio signal adjusted through the attenuation factor has the best quality.

通過上述分析可知，本發明實施例中解碼端可以根據高階能量增益編碼結果和衰減因數對重建場景音訊信號進行調整，從而得到重建後的場景音訊信號的高階通道能量更加均勻和平滑，重建後的場景音訊信號的品質更優。例如，該衰減因數可以隨重建場景音訊信號的頻帶和Ambisonic階數兩個因素衰減，有效提升了HOA信號的編解碼品質。Through the above analysis, it can be seen that in the embodiment of the present invention, the decoder can adjust the reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor, so that the high-order channel energy of the reconstructed scene audio signal is more uniform and smooth, and the quality of the reconstructed scene audio signal is better. For example, the attenuation factor can be attenuated according to the two factors of the frequency band and the Ambisonic order of the reconstructed scene audio signal, which effectively improves the encoding and decoding quality of the HOA signal.

圖6a為示例性示出的編碼端的結構示意圖。FIG. 6a is a schematic diagram showing the structure of an encoding end.

Referring to FIG. 6a, exemplarily, the encoding end may include a configuration unit, a virtual speaker generation unit, a target speaker generation unit, and a core encoder. It should be understood that FIG. 6a is only an example of the present invention, and the encoding end of the present invention may include more or fewer modules than those shown in FIG. 6a, which will not be described again here.

示例性的，配置單元，可以用於確定候選虛擬揚聲器的配置資訊。Exemplarily, the configuration unit may be used to determine configuration information of a candidate virtual speaker.

示例性的，虛擬揚聲器生成單元，可以用於根據候選虛擬揚聲器的配置資訊，生成多個候選虛擬揚聲器以及確定各候選虛擬揚聲器對應的虛擬揚聲器係數。Exemplarily, the virtual speaker generation unit may be used to generate a plurality of candidate virtual speakers according to the configuration information of the candidate virtual speakers and determine the virtual speaker coefficient corresponding to each candidate virtual speaker.

示例性的，目標揚聲器生成單元，可以用於根據基於場景音訊信號和多組虛擬揚聲器係數，從多個候選虛擬揚聲器中選取目標虛擬揚聲器以及確定目標虛擬揚聲器的屬性資訊。Exemplarily, the target speaker generation unit may be configured to select a target virtual speaker from a plurality of candidate virtual speakers and determine property information of the target virtual speaker based on a scene audio signal and a plurality of sets of virtual speaker coefficients.

示例性的，核心編碼器，可以用於獲取場景音訊信號的高階能量增益，以及獲取高階能量增益編碼結果；對場景音訊信號中第一音訊信號、目標虛擬揚聲器的屬性資訊和高階能量增益編碼結果進行編碼。Exemplarily, the core encoder can be used to obtain the high-order energy gain of the scene audio signal and obtain the high-order energy gain encoding result; encode the first audio signal in the scene audio signal, the attribute information of the target virtual speaker and the high-order energy gain encoding result.

示例性的，上述圖1a和圖1b中的場景音訊編碼模組可以包括圖6a的配置單元、虛擬揚聲器生成單元、目標揚聲器生成單元、核心編碼器；或者，僅包括核心編碼器。Exemplarily, the scene audio coding module in FIG. 1a and FIG. 1b may include the configuration unit, the virtual speaker generation unit, the target speaker generation unit, and the core encoder of FIG. 6a; or, may only include the core encoder.

FIG. 6b is a schematic diagram showing an exemplary structure of a decoding end.

Referring to FIG. 6b, exemplarily, the decoding end may include a core decoder, a virtual speaker coefficient generation unit, a virtual speaker signal generation unit, a reconstruction unit, and a signal adjustment unit. It should be understood that FIG. 6b is only an example of the present invention, and the decoding end of the present invention may include more or fewer modules than those shown in FIG. 6b, which will not be described again here.

示例性的，核心解碼器，可以用於解碼第一碼流，以得到第一重建信號、目標虛擬揚聲器的屬性資訊和高階能量增益編碼結果。Exemplarily, the core decoder can be used to decode the first code stream to obtain a first reconstructed signal, property information of a target virtual speaker, and a high-order energy gain coding result.

示例性的，虛擬揚聲器係數生成單元，可以用於基於目標虛擬揚聲器的屬性資訊，確定虛擬揚聲器係數。Exemplarily, the virtual speaker coefficient generation unit may be configured to determine the virtual speaker coefficient based on the property information of the target virtual speaker.

示例性的，虛擬揚聲器信號生成單元，可以用於基於第一重建信號和虛擬揚聲器係數，生成虛擬揚聲器信號。Exemplarily, the virtual speaker signal generating unit may be configured to generate a virtual speaker signal based on the first reconstructed signal and the virtual speaker coefficient.

示例性的，重建單元，可以用於基於虛擬揚聲器信號和屬性資訊進行重建，以得到第一重建場景音訊信號。Exemplarily, the reconstruction unit may be configured to perform reconstruction based on the virtual speaker signal and the attribute information to obtain a first reconstructed scene audio signal.

示例性的，信號調整單元，可以用於根據第一重建場景音訊信號中的重建信號的頻帶序號和/或第一重建場景音訊信號的階數確定衰減因數；根據高階能量增益編碼結果和衰減因數對第一重建場景音訊信號進行調整，以得到重建後的場景音訊信號。Exemplarily, the signal adjustment unit can be used to determine the attenuation factor according to the frequency band number of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal; adjust the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain the reconstructed scene audio signal.

示例性的，上述圖1a和圖1b中的場景音訊解碼模組可以包括圖6b的核心解碼器、虛擬揚聲器係數生成單元、虛擬揚聲器信號生成單元、重建單元和信號調整單元；或者，僅包括核心解碼器。Exemplarily, the scene audio decoding module in FIG. 1a and FIG. 1b may include the core decoder, the virtual speaker coefficient generation unit, the virtual speaker signal generation unit, the reconstruction unit and the signal adjustment unit of FIG. 6b; or, may only include the core decoder.
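The data flow between the five decoder-side units of FIG. 6b can be sketched as follows. This is only an illustrative sketch: the class name `DecoderPipeline`, its field names, and the callable signatures are hypothetical and do not come from the application. Each unit is passed in as a plain function, so no particular algorithm is implied for any of them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class DecoderPipeline:
    """Hypothetical wiring of the FIG. 6b units; every field is a stand-in callable."""
    core_decode: Callable[[bytes], Tuple[Any, Any, Any]]
    make_coefficients: Callable[[Any], Any]
    make_speaker_signal: Callable[[Any, Any], Any]
    reconstruct: Callable[[Any, Any], Any]
    adjust: Callable[[Any, Any], Any]

    def decode(self, first_code_stream: bytes) -> Any:
        # Core decoder: first reconstructed signal, attribute information,
        # and the high-order energy gain coding result.
        first_rec, attributes, gain_code = self.core_decode(first_code_stream)
        # Virtual speaker coefficient generation unit.
        coeffs = self.make_coefficients(attributes)
        # Virtual speaker signal generation unit.
        speaker_signal = self.make_speaker_signal(first_rec, coeffs)
        # Reconstruction unit: yields the first reconstructed scene audio signal.
        first_scene = self.reconstruct(speaker_signal, attributes)
        # Signal adjustment unit: applies the attenuation factor and the gain result.
        return self.adjust(first_scene, gain_code)
```

Keeping the units as injected callables mirrors the statement that the decoding end may contain more or fewer modules than shown: a variant that includes only the core decoder would simply stop after the first step.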

FIG. 7 is a schematic structural diagram of an exemplary scene audio encoding device. The scene audio encoding device in FIG. 7 may be used to perform the encoding method of the foregoing embodiments; for the beneficial effects it can achieve, refer to those of the corresponding method provided above, which are not repeated here. The scene audio encoding device may include:

an acquisition module 701, used to acquire a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels, and C1 is a positive integer;

an encoding module 702, used to encode the high-order energy gain to obtain a high-order energy gain coding result.

FIG. 8 is a schematic structural diagram of an exemplary scene audio decoding device. The scene audio decoding device in FIG. 8 may be used to perform the decoding method of the foregoing embodiments; for the beneficial effects it can achieve, refer to those of the corresponding method provided above, which are not repeated here. The scene audio decoding device may include:

a code stream receiving module 801, used to receive a first code stream;

a decoding module 802, used to decode the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is the audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;

a virtual speaker signal generation module 803, used to generate, based on the attribute information and the first audio signal, a virtual speaker signal corresponding to the target virtual speaker;

a scene audio signal reconstruction module 804, used to perform reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, where the first reconstructed scene audio signal includes audio signals of C2 channels, and C2 is a positive integer;

an attenuation factor determination module 805, used to determine an attenuation factor according to the frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal; and

a scene audio signal adjustment module 806, used to adjust the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

In one example, FIG. 9 shows a schematic block diagram of a device 900 according to an embodiment of the present invention. The device 900 may include a processor 901 and a transceiver/transceiver pin 902, and optionally a memory 903.

The components of the device 900 are coupled together through a bus 904, where the bus 904 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity, however, all of these buses are referred to as the bus 904 in the figure.

Optionally, the memory 903 may be used to store the instructions of the foregoing method embodiments. The processor 901 may be used to execute the instructions in the memory 903, to control the receiving pin to receive signals, and to control the sending pin to send signals.

The device 900 may be the electronic device, or a chip of the electronic device, in the foregoing method embodiments.

All relevant content of the steps in the foregoing method embodiments may be cited in the functional description of the corresponding functional modules, and is not repeated here.

This embodiment further provides a chip that includes one or more interface circuits and one or more processors. The interface circuit is used to receive signals from the memory of an electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the method in the foregoing embodiments. The interface circuit may be the transceiver 902 in FIG. 9.

This embodiment further provides a computer-readable storage medium that stores computer instructions. When the computer instructions run on an electronic device, the electronic device performs the foregoing related method steps to implement the scene audio encoding and decoding method in the foregoing embodiments.

This embodiment further provides a computer program product. When the computer program product runs on a computer, the computer performs the foregoing related steps to implement the scene audio encoding and decoding method in the foregoing embodiments.

This embodiment further provides a device for storing a code stream. The device includes a receiver and at least one storage medium. The receiver is used to receive the code stream, and the at least one storage medium is used to store the code stream. The code stream is generated according to the scene audio encoding method in the foregoing embodiments.

An embodiment of the present invention provides a device for transmitting a code stream. The device includes a transmitter and at least one storage medium. The at least one storage medium is used to store the code stream, which is generated according to the scene audio encoding method in the foregoing embodiments. The transmitter is used to obtain the code stream from the storage medium and send it to an end-side device through a transmission medium.

An embodiment of the present invention provides a system for distributing code streams. The system includes at least one storage medium, used to store at least one code stream generated according to the scene audio encoding method in the foregoing embodiments, and a streaming media device, used to obtain a target code stream from the at least one storage medium and send the target code stream to an end-side device, where the streaming media device includes a content server or a content distribution server.

In addition, an embodiment of the present invention further provides a device, which may specifically be a chip, a component, or a module. The device may include a processor and a memory that are connected to each other. The memory is used to store computer-executable instructions. When the device runs, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the scene audio encoding and decoding method in the foregoing method embodiments.

The electronic device, computer-readable storage medium, computer program product, and chip provided in this embodiment are all used to perform the corresponding methods provided above. For the beneficial effects they can achieve, refer to those of the corresponding methods provided above, which are not repeated here.

From the description of the foregoing implementations, a person skilled in the art will understand that, for convenience and brevity of description, only the foregoing division into functional modules is used as an example. In practice, the foregoing functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or some of the functions described above.

In the several embodiments provided by the present invention, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

Units described as separate components may or may not be physically separate, and components shown as units may be one physical unit or multiple physical units; that is, they may be located in one place or distributed across multiple places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Any content of the embodiments of the present invention, as well as any content within the same embodiment, may be freely combined. Any combination of the foregoing content falls within the scope of the present invention.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or some of the steps of the methods of the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described, which are merely illustrative and not restrictive. Inspired by the present invention, a person of ordinary skill in the art may devise many other forms without departing from the purpose of the present invention and the scope protected by the scope of the patent application, all of which fall within the protection of the present invention.

The steps of the methods or algorithms described in connection with the disclosure of the embodiments of the present invention may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC.

A person skilled in the art should appreciate that, in one or more of the foregoing examples, the functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer-readable storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described, which are merely illustrative and not restrictive. Inspired by the present invention, a person of ordinary skill in the art may devise many other forms without departing from the purpose of the present invention and the scope protected by the claims, all of which fall within the protection of the present invention. The foregoing describes only preferred embodiments of the present invention, and all equivalent changes and modifications made according to the scope of the patent application of the present invention shall fall within the scope of the present invention.

701: acquisition module
702: encoding module
801: code stream receiving module
802: decoding module
803: virtual speaker signal generation module
804: scene audio signal reconstruction module
805: attenuation factor determination module
806: scene audio signal adjustment module
900: device
901: processor
902: transceiver/transceiver pin
903: memory
904: bus
S201 to S205, S301 to S306, S401 to S407, S501 to S512: steps

FIG. 1a is a schematic diagram of an exemplary application scenario of the present invention.
FIG. 1b is a schematic diagram of an exemplary application scenario of the present invention.
FIG. 2a is a schematic diagram of an exemplary encoding process of the present invention.
FIG. 2b is a schematic diagram of an exemplary distribution of candidate virtual speakers of the present invention.
FIG. 3 is a schematic diagram of an exemplary decoding process of the present invention.
FIG. 4 is a schematic diagram of an exemplary encoding process of the present invention.
FIG. 5a is a schematic diagram of an exemplary decoding process of the present invention.
FIG. 5b is a schematic diagram of another exemplary decoding process of the present invention.
FIG. 6a is a schematic structural diagram of an exemplary encoding end of the present invention.
FIG. 6b is a schematic structural diagram of an exemplary decoding end of the present invention.
FIG. 7 is a schematic structural diagram of an exemplary scene audio encoding device of the present invention.
FIG. 8 is a schematic structural diagram of an exemplary scene audio decoding device of the present invention.
FIG. 9 is a schematic structural diagram of an exemplary device of the present invention.


Claims

A scene audio decoding method, characterized in that the method comprises: receiving a first code stream; decoding the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is the audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1; generating, based on the attribute information of the target virtual speaker and the first audio signal, a virtual speaker signal corresponding to the target virtual speaker; performing reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, wherein the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer; determining an attenuation factor according to a frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

The method as claimed in claim 1, characterized in that the scene audio signal is an N1-order higher order ambisonics (HOA) signal, the N1-order HOA signal includes a second audio signal, the second audio signal is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 is equal to the square of (N1+1); and/or the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal includes a third audio signal, the third audio signal is the reconstructed signal in the N2-order HOA signal corresponding to each channel of the second audio signal, and C2 is equal to the square of (N2+1).
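As a brief illustration of the channel-count relation recited above, the following sketch computes the number of channels of an HOA signal from its order. The function name `hoa_channel_count` is hypothetical; only the relation C = (N+1)² comes from the text.

```python
def hoa_channel_count(order: int) -> int:
    """Return the number of channels of an HOA signal of the given order.

    Implements the relation C = (N + 1) ** 2 stated for C1/N1 and C2/N2.
    """
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2
```

For instance, a third-order HOA signal has `hoa_channel_count(3)` = 16 channels, and a first-order signal has 4.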

The method as claimed in claim 2, characterized in that the adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain the reconstructed scene audio signal comprises: performing entropy decoding on the high-order energy gain coding result to obtain an entropy-decoded high-order energy gain; performing inverse quantization on the entropy-decoded high-order energy gain to obtain the high-order energy gain; adjusting the high-order energy gain according to characteristic information of the second audio signal and characteristic information of the first audio signal to obtain an adjusted decoded high-order energy gain; and adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain and the attenuation factor to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal.

The method as claimed in claim 3, characterized in that the adjusting the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal comprises: obtaining the high-order energy of the second audio signal according to the channel energy of the first audio signal and the high-order energy gain; obtaining a decoding energy proportional factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal; obtaining a decoded high-order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal; and adjusting the decoded high-order energy gain of the third audio signal according to the decoding energy proportional factor to obtain the adjusted decoded high-order energy gain.

The method as claimed in any one of claims 3 to 4, characterized in that the determining the attenuation factor according to the frequency band sequence number of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal comprises: obtaining the attenuation factor according to the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal.

The method as claimed in any one of claims 3 to 5, characterized in that, after the adjusting the third audio signal in the N2-order HOA signal according to the high-order energy gain coding result and the attenuation factor, the method further comprises: obtaining the channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal includes the audio signal of a current frame, and the fourth audio signal includes the audio signal of the frame preceding the current frame; and adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal.

The method as claimed in claim 6, characterized in that the adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal comprises: obtaining the channel energy average value of the fourth audio signal and the channel energy of the adjusted third audio signal; obtaining an energy average threshold according to the channel energy average value of the fourth audio signal and the channel energy of the adjusted third audio signal; calculating a weighted average of the channel energy average value of the fourth audio signal and the channel energy of the adjusted third audio signal according to the energy average threshold to obtain a target energy; obtaining an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal; and adjusting the third audio signal according to the energy smoothing factor.
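The inter-frame smoothing steps recited above can be sketched as follows. The text does not give the formulas, so several points here are assumptions made only for illustration: channel energy is taken as the sum of squared samples, the energy average threshold is supplied by a caller-provided rule `threshold_rule` returning a weight in [0, 1], and the smoothing factor is the square root of the energy ratio so that it can be applied to sample amplitudes. All names are hypothetical.

```python
import math
from typing import Callable, List


def smooth_channel(
    samples: List[float],
    prev_mean_energy: float,
    threshold_rule: Callable[[float, float], float],
) -> List[float]:
    """Pull the current frame's channel energy toward the previous frames' mean."""
    # Channel energy of the adjusted third audio signal (assumed: sum of squares).
    cur_energy = sum(x * x for x in samples)
    if cur_energy == 0.0:
        return list(samples)  # nothing to scale in a silent frame
    # Energy average threshold (assumed: a weight k in [0, 1] chosen by the caller).
    k = threshold_rule(prev_mean_energy, cur_energy)
    # Target energy: weighted average of previous mean energy and current energy.
    target = k * prev_mean_energy + (1.0 - k) * cur_energy
    # Energy smoothing factor (assumed: amplitude gain = sqrt of the energy ratio).
    factor = math.sqrt(target / cur_energy)
    return [x * factor for x in samples]
```

With a constant rule such as `lambda prev, cur: 0.5`, a frame whose energy is 25 and whose predecessor mean is 100 is scaled so that its energy becomes 62.5.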

The method of claim 7, characterized in that the step of obtaining the attenuation factor according to the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal comprises: obtaining the attenuation factor g'(i, b) by: [formula missing]; wherein i is the number of the i-th channel of the third audio signal, b is the frequency band sequence number of the third audio signal, [symbol missing] represents the number of the target virtual speakers, [symbol missing] represents the number of mapping channels of the N2-order HOA signal, [formula missing], M is the order of the N2-order HOA signal, γ represents the order corresponding to the channel number i of the third audio signal, and [symbol missing] represents a multiplication operation.

The method as claimed in claim 8, characterized in that the method further comprises: when b ≤ d, [symbol missing] is updated to [symbol missing], [formula missing], where d is a preset first threshold; and when b > d, [symbol missing] is updated to [symbol missing], [formula missing].

The method as claimed in claim 8, characterized in that the method further comprises: [symbol missing] is updated to [symbol missing], [formula missing], wherein bands represents the number of frequency bands of the third audio signal.

The method as claimed in any one of claims 8 to 10, characterized in that the method further comprises: when b ≤ d, [symbol missing] is updated to [symbol missing], [formula missing], where d is a preset first threshold and w is a preset adjustment ratio threshold.

The method as claimed in any one of claims 8 to 11, characterized in that the method further comprises: updating w to w2, where w2 = w + [symbol missing] × 0.05.

The method as claimed in any one of claims 8 to 12, characterized in that, when the value of i is 0, 1, 2, or 3, the value of γ is 1; when the value of i is 4, 5, 6, 7, or 8, the value of γ is 2; and when the value of i is 9, 10, 11, 12, 13, 14, or 15, the value of γ is 3.
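The mapping from channel number i to γ recited above amounts to a small lookup, sketched below. The function name `gamma_for_channel` is hypothetical, and rejecting values of i outside 0 to 15 is an assumption of this sketch, since the text does not say what happens for other channel numbers.

```python
def gamma_for_channel(i: int) -> int:
    """Return γ for channel number i, following the three ranges recited above."""
    if 0 <= i <= 3:
        return 1
    if 4 <= i <= 8:
        return 2
    if 9 <= i <= 15:
        return 3
    # Assumption: channel numbers outside 0..15 are not covered by the text.
    raise ValueError(f"no γ defined for channel number {i}")
```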

The method as claimed in claim 2, characterized in that the adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor comprises: performing entropy decoding on the high-order energy gain coding result to obtain an entropy-decoded high-order energy gain; performing inverse quantization on the entropy-decoded high-order energy gain to obtain the high-order energy gain; obtaining the high-order energy of the second audio signal according to the channel energy of the first audio signal and the high-order energy gain; obtaining a decoding energy proportional factor according to the channel energy of the third audio signal and the high-order energy of the second audio signal; determining a diffusion factor of the first reconstructed scene audio signal according to the first code stream; linearly weighting the attenuation factor according to the diffusion factor to obtain a weighted attenuation factor; and adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy proportional factor to obtain an adjusted third audio signal.

The method as claimed in claim 14, characterized in that the linearly weighting the attenuation factor according to the diffusion factor to obtain the weighted attenuation factor comprises: obtaining the weighted attenuation factor gd(i,b) in at least one of the following ways: gd(i,b) = w × diffusion(b) + (1−w) × g'(i,b), wherein w is a preset adjustment ratio threshold, × represents a multiplication operation, diffusion(b) represents the diffusion factor of the b-th frequency band of the third audio signal, and g'(i,b) represents the attenuation factor of the b-th frequency band of the third audio signal; or, when the attenuation factor is a full-band signal, gd(i,b) = w × mean(diffusion) + (1−w) × g'(i,b), wherein mean(diffusion) is the average value of the diffusion factors of multiple frequency bands of the third audio signal, g'(i,b) represents the attenuation factor of the b-th frequency band of the third audio signal, and w is the preset adjustment ratio threshold; or, gd(i,b) = w × diffusion(b) + (1−w) × g'(i,b) + offset(i,b), wherein offset(i,b) represents the offset constant of the b-th frequency band on the i-th channel of the third audio signal, g'(i,b) represents the attenuation factor of the b-th frequency band of the third audio signal, w is the preset adjustment ratio threshold, and diffusion(b) represents the diffusion factor of the b-th frequency band of the third audio signal; or, gd(i,b) = w × diffusion(b) + (1−w) × g'(i,b) + direction(i,b), wherein direction(i,b) represents the direction parameter of the b-th frequency band on the i-th channel of the third audio signal, g'(i,b) represents the attenuation factor of the b-th frequency band of the third audio signal, w is the preset adjustment ratio threshold, and diffusion(b) represents the diffusion factor of the b-th frequency band of the third audio signal.
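The linear weighting recited above can be sketched as follows, assuming that the term weighted by (1−w) is the attenuation factor of the b-th frequency band (written `g_prime` here) and that the offset or direction term, when present, is simply added. The function and parameter names are hypothetical.

```python
from typing import Sequence


def weighted_attenuation(g_prime: float, diffusion: float, w: float,
                         extra: float = 0.0) -> float:
    """gd = w * diffusion + (1 - w) * g_prime + extra.

    `extra` stands in for the optional offset(i, b) or direction(i, b) term.
    """
    return w * diffusion + (1.0 - w) * g_prime + extra


def weighted_attenuation_fullband(g_prime: float,
                                  diffusion_per_band: Sequence[float],
                                  w: float) -> float:
    """Full-band variant: weight against the mean diffusion over all bands."""
    mean_diffusion = sum(diffusion_per_band) / len(diffusion_per_band)
    return weighted_attenuation(g_prime, mean_diffusion, w)
```

Setting w = 0 leaves the attenuation factor unchanged, while w = 1 replaces it entirely with the diffusion factor, which is one way to read the role of the adjustment ratio threshold.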

The method as claimed in any one of claims 14 to 15, characterized in that the adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy proportional factor to obtain the adjusted third audio signal comprises: obtaining the adjusted third audio signal X'(i,b) in the following manner: X'(i,b) = X(i) × g(i,b) × gd(i,b), wherein gd(i,b) represents the weighted attenuation factor, g(i,b) represents the decoding energy proportional factor, and X(i) represents the third audio signal.
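The adjustment recited above scales the third audio signal by the product of the two factors. A minimal sketch, assuming the samples of one channel within one frequency band are held in a list and that both factors are scalars for that channel and band (the function name `adjust_band` is hypothetical):

```python
from typing import List


def adjust_band(x: List[float], g: float, gd: float) -> List[float]:
    """Return X'(i, b): each sample of X(i) in band b multiplied by g * gd."""
    scale = g * gd  # decoding energy proportional factor times weighted attenuation factor
    return [sample * scale for sample in x]
```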

A scene audio decoding device, characterized in that the device comprises: a code stream receiving module, used to receive a first code stream; a decoding module, used to decode the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is the audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1; a virtual speaker signal generation module, used to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal; a scene audio signal reconstruction module, used to perform reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, wherein the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer; an attenuation factor determination module, used to determine an attenuation factor according to the frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal; and a scene audio signal adjustment module, used to adjust the first reconstructed scene audio signal according to the high-order energy gain coding result and the attenuation factor to obtain a reconstructed scene audio signal.

An electronic device, characterized in that it comprises a memory and a processor, wherein the memory is coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device performs the scene audio decoding method as claimed in any one of claims 1 to 16.

A chip, characterized in that it comprises one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of an electronic device and send the signals to the processor, wherein the signals include computer instructions stored in the memory; and when the processor executes the computer instructions, the electronic device performs the scene audio decoding method as claimed in any one of claims 1 to 16.

A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program runs on a computer or a processor, the computer or the processor performs the scene audio decoding method as claimed in any one of claims 1 to 16.

A computer program product, characterized in that the computer program product includes a software program, and when the software program is executed by a computer or a processor, the steps of the method as claimed in any one of claims 1 to 16 are performed.