
CN114128312B - Audio rendering for low frequency effects - Google Patents


Info

Publication number: CN114128312B (application number CN202080051077.5A)
Authority: CN (China)
Prior art keywords: audio, audio data, low frequency, sound field, frequency effect
Legal status: Active
Other versions: CN114128312A (Chinese, zh)
Inventors: J.菲洛斯, A.谢弗西夫, G.B.戴维斯
Current and original assignee: Qualcomm Inc
Priority claimed from: U.S. Application No. 16/714,468 (US11122386B2)
Application filed by: Qualcomm Inc
Publication of application CN114128312A; application granted; publication of grant CN114128312B

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 3/00: Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008: Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H04S 7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30: Control circuits for electronic adaptation of the sound field


Abstract


In general, aspects of the technology are directed to audio rendering for low frequency effects. A device including a memory and a processor can be configured to perform the technology. The memory can store audio data representing a sound field. The processor can analyze the audio data to identify spatial characteristics of a low frequency effects component of the sound field, and process the audio data based on the spatial characteristics to render a low frequency effects speaker feed. The processor can also output the low frequency effects speaker feed to a speaker with low frequency effects capabilities.

Description

Audio rendering for low frequency effects
The present application claims priority to U.S. Application Serial No. 16/714,468, filed December 13, 2019, which claims the benefit of Greek Application No. 20190100269, filed June 20, 2019, each of which is incorporated by reference in its entirety.
Technical Field
The present disclosure relates to processing of media data, such as audio data.
Background
Audio rendering refers to a process of generating speaker feeds that configure one or more speakers (e.g., headphones, loudspeakers, other transducers including bone conduction speakers, etc.) to reproduce a sound field represented by audio data. The audio data may conform to one or more formats, including scene-based audio formats (such as those specified in the Moving Picture Experts Group (MPEG)-H 3D audio coding standard), object-based audio formats, and/or channel-based audio formats.
The audio playback device may apply an audio renderer to the audio data in order to generate or otherwise obtain speaker feeds. In some examples, the audio playback device may process the audio data to obtain one or more speaker feeds dedicated to reproducing low frequency effects (LFEs, which may also be referred to as bass below a threshold such as 120 or 150 hertz) that are potentially output to LFE-capable speakers such as subwoofers.
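To make the low-pass step concrete, here is a minimal sketch of deriving a dedicated LFE feed from a mono signal. The one-pole filter and the 120 Hz cutoff are illustrative assumptions, not taken from the disclosure; a production renderer would typically use a much steeper crossover before the subwoofer.

```python
import math

def lfe_feed(samples, sample_rate=48000.0, cutoff_hz=120.0):
    """Derive a crude LFE feed by one-pole low-pass filtering a mono signal.

    Illustrative only: a real system would typically apply a steeper
    crossover (e.g., Linkwitz-Riley) before driving an LFE-capable speaker.
    """
    # One-pole coefficient from the RC analogy: alpha = dt / (RC + dt).
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out

# A DC (0 Hz) input passes essentially unchanged, while content well above
# the cutoff is strongly attenuated.
dc = lfe_feed([1.0] * 4800)
print(round(dc[-1], 3))  # -> 1.0
```

The one-pole design keeps the example stdlib-only; its 6 dB/octave slope is far gentler than real bass-management filters.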
Disclosure of Invention
The present disclosure relates generally to techniques for audio rendering for low frequency effects (LFEs). Various aspects of the techniques may enable spatial rendering of LFEs to potentially improve reproduction of low frequency components of a sound field (e.g., components below a threshold frequency such as 200 Hz, 150 Hz, 120 Hz, or 100 Hz). Rather than processing all aspects of the audio data equally to obtain an LFE speaker feed, aspects of the techniques may analyze the audio data to identify spatial characteristics associated with the LFE components, and process (e.g., render) the audio data in various ways based on the spatial characteristics to potentially more accurately spatialize the LFE components within the sound field.
Thus, various aspects of the techniques may improve operation of the audio playback device, as potentially more accurate spatialization of LFE components within the sound field may improve immersion and thereby the overall listening experience. Further, various aspects of the techniques may address the case in which a dedicated LFE channel is corrupted or otherwise incorrectly coded in the audio data, where the audio playback device may be configured to reconstruct the LFE components of the sound field using LFEs embedded in the mid-frequency (commonly referred to as the "mids") or high-frequency components of the audio data, as described in more detail throughout this disclosure. By potentially more accurately reconstructing (in terms of spatialization) the LFE components from the mid- or high-frequency components of the audio data, aspects of the techniques may improve audio rendering for LFEs.
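The disclosure does not spell out the spatial analysis at this point, but one plausible sketch is to estimate the arrival direction of low-frequency energy from the time-averaged acoustic intensity vector of a first-order (B-format) signal. The channel naming (W, X, Y) and the use of the intensity vector here are assumptions for illustration, not the claimed method.

```python
import math

def lfe_direction(w, x, y):
    """Estimate the azimuth of low-frequency sound energy from first-order
    ambisonic channels (W = omnidirectional, X = front-back, Y = left-right).

    The time-averaged products W*X and W*Y approximate the active acoustic
    intensity vector, whose direction points toward the dominant source.
    A sketch of one possible spatial analysis, not the patented method;
    in practice the channels would be low-pass filtered first.
    """
    ix = sum(wi * xi for wi, xi in zip(w, x)) / len(w)
    iy = sum(wi * yi for wi, yi in zip(w, y)) / len(w)
    return math.degrees(math.atan2(iy, ix))

# A plane wave arriving from 90 degrees (the left): X ~ cos(90) = 0, Y ~ 1.
n = 512
sig = [math.sin(2 * math.pi * 60 * t / 48000.0) for t in range(n)]
w = sig
x = [0.0 * s for s in sig]
y = [1.0 * s for s in sig]
print(round(lfe_direction(w, x, y)))  # -> 90
```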
In one example, the techniques are directed to a device comprising: a memory configured to store audio data representing a sound field; and one or more processors configured to: analyze the audio data to identify spatial characteristics of a low frequency effects component of the sound field; process the audio data based on the spatial characteristics to render a low frequency effects speaker feed; and output the low frequency effects speaker feed to a speaker having low frequency effects capabilities.
In another example, the techniques are directed to a method comprising: analyzing audio data representing a sound field to identify spatial characteristics of a low frequency effects component of the sound field; processing the audio data based on the spatial characteristics to render a low frequency effects speaker feed; and outputting the low frequency effects speaker feed to a speaker having low frequency effects capabilities.
In another example, the techniques are directed to a device comprising: means for analyzing audio data representing a sound field to identify spatial characteristics of a low frequency effects component of the sound field; means for processing the audio data based on the spatial characteristics to render a low frequency effects speaker feed; and means for outputting the low frequency effects speaker feed to a speaker having low frequency effects capabilities.
In another example, the techniques are directed to a non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors of a device to: analyze audio data representing a sound field to identify spatial characteristics of a low frequency effects component of the sound field; process the audio data based on the spatial characteristics to render a low frequency effects speaker feed; and output the low frequency effects speaker feed to a speaker having low frequency effects capabilities.
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the various aspects of the technology will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a block diagram illustrating an example system that may perform aspects of the techniques described in this disclosure.
Fig. 2 is a block diagram illustrating in more detail the LFE renderer unit shown in the example of fig. 1.
Fig. 3 is a block diagram illustrating another example of the LFE renderer unit shown in fig. 1 in more detail.
Fig. 4 is a flowchart illustrating exemplary operations of the LFE renderer unit shown in fig. 1-3 in performing aspects of the low frequency effect rendering technique.
Fig. 5 is a block diagram illustrating example components of the content consumer device 14 shown in the example of fig. 1.
Detailed Description
There are various "surround sound" channel-based formats in the market. They range, for example, from the 5.1 home theater system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai, the Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has released a standard allowing for a sound field to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonics (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether in locations defined by various standards or in non-uniform locations.
MPEG released the standard as the MPEG-H 3D Audio standard, formally entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29 with document identifier ISO/IEC DIS 23008-3, dated July 25, 2014. MPEG also released a second edition of the 3D Audio standard, entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29 with document identifier ISO/IEC 23008-3:201x(E), dated October 12, 2016. Reference to the "3D Audio standard" in this disclosure may refer to one or both of the above standards.
As described above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a sound field using SHC:

$$ p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty}\left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t}. $$

The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the sound field, at time $t$, can be represented uniquely by the SHC $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (~343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order $n$ and sub-order $m$. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$), which can be approximated through various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
The SHC $A_n^m(k)$ can be physically acquired (e.g., recorded) by various microphone array configurations, or, alternatively, they can be derived from channel-based or object-based descriptions of the sound field. The SHC (which may also be referred to as Higher-Order Ambisonics (HOA) coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving $(1+4)^2$ (25, and hence fourth order) coefficients may be used.
As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics," J. Audio Eng. Soc., Vol. 53, No. 11, November 2005, pp. 1004-1025.
To illustrate how the SHC may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the sound field corresponding to an individual audio object may be expressed as:

$$ A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s), $$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, multiple PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
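For the special case of order n = 0, the spherical Hankel function of the second kind has the closed form h_0^(2)(x) = i e^(-ix)/x, so the coefficient equation above can be checked numerically without any special-function library. The numeric values below (source energy, frequency, source radius) are purely illustrative.

```python
import cmath
import math

def h0_second_kind(x):
    """Spherical Hankel function of the second kind, order 0:
    h0^(2)(x) = j0(x) - i*y0(x) = sin(x)/x + i*cos(x)/x = i*e^(-ix)/x."""
    return 1j * cmath.exp(-1j * x) / x

def a00_for_point_source(g_omega, k, r_s):
    """Zero-order SHC A_0^0(k) of a point source with source energy g(omega)
    at radius r_s, following the equation above. Y_0^0 = 1/sqrt(4*pi), which
    is real, so conjugation is a no-op at order zero."""
    y00_conj = 1.0 / math.sqrt(4.0 * math.pi)
    return g_omega * (-4j * math.pi * k) * h0_second_kind(k * r_s) * y00_conj

# Example: unit source energy, k = (2*pi*100 Hz)/(343 m/s), source at 2 m.
k = 2.0 * math.pi * 100.0 / 343.0
coeff = a00_for_point_source(1.0, k, 2.0)
print(round(abs(coeff), 4))  # magnitude sqrt(pi) for these illustrative values
```

Algebraically, the k factors cancel at order zero, so the magnitude reduces to 4*pi/r_s times Y_0^0, independent of frequency.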
A scene-based audio format, such as the SHC noted above (which may also be referred to as Higher-Order Ambisonics coefficients or "HOA coefficients"), represents one way to represent a sound field. Other possible formats include channel-based audio formats and object-based audio formats. Channel-based audio formats refer to formats such as the 5.1 surround sound format, the 7.1 surround sound format, the 22.2 surround sound format, or any other channel-based format that localizes audio channels to particular locations around the listener in order to recreate the sound field.
An object-based audio format may refer to a format in which audio objects, which are typically encoded using Pulse Code Modulation (PCM) and are referred to as PCM audio objects, are specified in order to represent a sound field. Such audio objects may include metadata that identifies the position of the audio object relative to a listener or other reference point in the sound field such that the audio object may be rendered to one or more speaker channels for playback in an effort to reconstruct the sound field. The techniques described in this disclosure may be applied to any of the foregoing formats, including scene-based audio formats, channel-based audio formats, object-based audio formats, or any combination thereof.
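As a hedged illustration of what an object-based representation might carry, the following sketch pairs a PCM buffer with positional metadata of the kind described above. The type and field names are invented for this example and do not come from any particular standard.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AudioObject:
    """A PCM audio object plus the positional metadata an object-based
    format carries so that a renderer can place the object in the sound
    field relative to the listener. Field names are illustrative."""
    name: str
    pcm: List[float]              # PCM samples for this object
    azimuth_deg: float = 0.0      # horizontal angle relative to the listener
    elevation_deg: float = 0.0    # vertical angle relative to the listener
    distance_m: float = 1.0       # distance from the reference point

rumble = AudioObject("engine_rumble", pcm=[0.0] * 1024,
                     azimuth_deg=-30.0, distance_m=3.0)
print(rumble.azimuth_deg, rumble.distance_m)  # -> -30.0 3.0
```

A renderer would read the metadata to derive per-speaker gains for each object, then mix the scaled PCM into the corresponding speaker feeds.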
FIG. 1 is a block diagram illustrating an example system that may perform aspects of the techniques described in this disclosure. As shown in the example of fig. 1, system 10 includes a source device 12 and a content consumer device 14. Although described in the context of source device 12 and content consumer device 14, the techniques may be implemented in any context in which audio data is used to reproduce a sound field. Furthermore, source device 12 may represent any form of computing device capable of generating a representation of a sound field and is generally described herein in the context of being a content creator device. Likewise, the content consumer device 14 may represent any form of computing device capable of implementing the audio rendering techniques described in this disclosure as well as audio playback, and is generally described herein in the context of being an audio/video (a/V) receiver.
Source device 12 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by an operator of a content consumer device, such as content consumer device 14. In some cases, source device 12 may generate audio content in conjunction with video content, although such a situation is not depicted in the example of fig. 1 for ease of illustration. The source device 12 comprises a content capture device 300, a content editing device 304 and a sound field representation generator 302. The content capture device 300 may be configured to engage with or otherwise communicate with the microphone 5.
The microphone 5 may represent a 3D audio microphone capable of capturing the sound field and representing it as audio data 11, which may refer to one or more of the above-noted scene-based audio data (such as HOA coefficients), object-based audio data, and channel-based audio data. Although described as a 3D audio microphone, the microphone 5 may also represent other types of microphones (e.g., omnidirectional microphones, spot microphones, unidirectional microphones, etc.) configured to capture the audio data 11.
In some examples, the content capture device 300 may include an integrated microphone 5 that is integrated into the housing of the content capture device 300. The content capture device 300 may interface wirelessly or via a wired connection with the microphone 5. Rather than capturing the audio data 11 via the microphone 5, or in conjunction with such capture, the content capture device 300 may process the audio data 11 after the audio data 11 is input via some type of removable storage, wirelessly, and/or via wired input processes. As such, various combinations of the content capture device 300 and the microphone 5 are possible in accordance with this disclosure.
The content capture device 300 may also be configured to interface with or otherwise communicate with the content editing device 304. In some examples, the content capture device 300 may include a content editing device 304 (which may represent software or a combination of software and hardware in some examples, including software executed by the content capture device 300 to configure the content capture device 300 to perform a particular form of content editing). The content editing device 304 may represent a unit configured to edit or otherwise change the content 301 including the audio data 11 received from the content capturing device 300. The content editing device 304 may output the edited content 303 and/or associated metadata 305 to the sound field representation generator 302.
The sound field representation generator 302 may comprise any type of hardware device capable of interfacing with the content editing device 304 (or the content capturing device 300). Although not shown in the example of fig. 1, the sound field representation generator 302 may generate one or more bitstreams 21 using edited content 303 including audio data 11 and/or metadata 305 provided by a content editing device 304. In the example of fig. 1 focusing on audio data 11, sound field representation generator 302 may generate one or more representations of the same sound field represented by audio data 11 to obtain bitstream 21 including a representation of the sound field and/or audio metadata 305.
For example, to generate a different representation of the sound field from the HOA coefficients (which again is one example of the audio data 11), the sound field representation generator 302 may use a coding scheme for ambisonic representations of the sound field, referred to as Mixed-Order Ambisonics (MOA), as discussed in more detail in U.S. Application Serial No. 15/672,058, entitled "MIXED-ORDER AMBISONICS (MOA) AUDIO DATA FOR COMPUTER-MEDIATED REALITY SYSTEMS," filed August 8, 2017, and published as U.S. Patent Publication No. 2019/0007781 on January 3, 2019.
To generate a particular MOA representation of the sound field, the sound field representation generator 302 may generate a partial subset of the full set of HOA coefficients. For instance, each MOA representation generated by the sound field representation generator 302 may provide precision with respect to some areas of the sound field, but less precision in other areas. In one example, an MOA representation of the sound field may include eight (8) of the uncompressed HOA coefficients, while a third-order HOA representation of the same sound field may include sixteen (16) of the uncompressed HOA coefficients. As such, each MOA representation of the sound field that is generated as a partial subset of the HOA coefficients may be less storage-intensive and less bandwidth-intensive (if and when transmitted as part of the bitstream 21 over the illustrated transmission channel) than the corresponding third-order HOA representation of the same sound field generated from the HOA coefficients.
Although described with respect to MOA representations, the techniques of this disclosure may also be performed with respect to full-order ambisonics (FOA) representations, in which all of the HOA coefficients for a given order N are used to represent the sound field. In other words, rather than representing the sound field using a partial, non-zero subset of the HOA coefficients, the sound field representation generator 302 may represent the sound field using all of the HOA coefficients for a given order N, resulting in a total number of HOA coefficients equal to (N+1)^2.
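The coefficient counts mentioned above (eight for the example MOA representation, sixteen for a third-order representation, and (N+1)^2 in general) can be reproduced in a couple of lines. The MOA counting formula below is an assumption chosen to match the 8-versus-16 example in the text, not a definition from the disclosure.

```python
def foa_coefficient_count(order):
    """A full-order (FOA) ambisonic representation of order N carries
    (N + 1)**2 coefficients, as stated above."""
    return (order + 1) ** 2

def moa_coefficient_count(horizontal_order, full3d_order):
    """One common mixed-order scheme keeps the full 3D expansion up to a
    low order plus the extra horizontal-only coefficients of a higher
    order. Assumed formula, chosen to match the 8-vs-16 example."""
    return (full3d_order + 1) ** 2 + 2 * (horizontal_order - full3d_order)

print(foa_coefficient_count(3))     # third-order FOA: 16 coefficients
print(moa_coefficient_count(3, 1))  # "3H1P"-style MOA: 8 coefficients
print(foa_coefficient_count(4))     # fourth-order: 25 coefficients
```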
In this respect, the higher-order ambisonic audio data (which is another way to refer to HOA coefficients in either an MOA representation or an FOA representation) may include higher-order ambisonic coefficients associated with spherical basis functions having an order of one or less (which may be referred to as "1st-order ambisonic audio data"), higher-order ambisonic coefficients associated with spherical basis functions having mixed orders and sub-orders (which may be referred to as the "MOA representation" discussed above), or higher-order ambisonic coefficients associated with spherical basis functions having an order greater than one (which is referred to above as the "FOA representation").
In some examples, the content capture device 300 or the content editing device 304 may be configured to communicate wirelessly with the sound field representation generator 302. In some examples, the content capture device 300 or the content editing device 304 may be in communication with the sound field representation generator 302 via one or both of a wireless connection or a wired connection. Via a connection between the content capture device 300 and the sound field representation generator 302, the content capture device 300 may provide various forms of content, which for discussion purposes is described herein as being part of the audio data 11.
In some examples, the content capture device 300 may leverage aspects of the sound field representation generator 302 (in terms of the hardware or software capabilities of the sound field representation generator 302). For example, the sound field representation generator 302 may include dedicated hardware configured to (or dedicated software that, when executed, causes one or more processors to) perform psychoacoustic audio encoding, such as a unified speech and audio coder denoted as "USAC" set forth by the Moving Picture Experts Group (MPEG), or the MPEG-H 3D audio coding standard. The content capture device 300 may not include the psychoacoustic-audio-encoder dedicated hardware or dedicated software, and may instead provide the audio aspects of the content 301 in a non-psychoacoustic-audio-coded form. The sound field representation generator 302 may assist in the capture of the content 301 by, at least in part, performing psychoacoustic audio encoding with respect to the audio aspects of the content 301.
The soundfield representation generator 302 may also facilitate content capture and transmission by generating one or more bitstreams 21 based at least in part on audio content (e.g., MOA representations and/or third order HOA representations) generated from the audio data 11 (where the audio data 11 includes scene-based audio data). The bitstream 21 may represent a compressed version of the audio data 11 and any other different types of content 301 (e.g., compressed versions of spherical video data, image data, or text data).
As one example, the sound field representation generator 302 may generate the bitstream 21 for transmission over a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the audio data 11, and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. In some instances, the bitstream 21 representing a compressed version of the audio data 11 (which again may represent scene-based audio data, object-based audio data, channel-based audio data, or combinations thereof) may conform to bitstreams produced in accordance with the MPEG-H 3D audio coding standard.
The content consumer device 14 may be operated by a person and may represent an a/V receiver client device. Although described with respect to an a/V receiver client device (which may also be referred to as an "a/V receiver," "AV receiver," or "AV receiver client device"), content consumer device 14 may represent other types of devices, such as a Virtual Reality (VR) client device, an Augmented Reality (AR) client device, a Mixed Reality (MR) client device, a laptop computer, a desktop computer, a workstation, a cellular phone or a cell phone (including so-called "smartphones"), a television, a dedicated gaming system, a handheld gaming system, smart speakers, a head unit (such as an infotainment system or entertainment system for an automobile or other vehicle), or any other device capable of performing audio rendering with respect to audio data 15. As shown in the example of fig. 1, the content consumer device 14 includes an audio playback system 16, which may refer to any form of audio playback system capable of rendering audio data 15 for playback as multi-channel audio content.
Although shown in the example of fig. 1 as being sent directly to the content consumer device 14, the source device 12 may output the bitstream 21 to an intermediate device located between the source device 12 and the content consumer device 14. The intermediate device may store the bitstream 21 for later delivery to the content consumer device 14 that may request the bitstream. The intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 21 for later retrieval by an audio decoder. The intermediate device may reside in a content delivery network that is capable of streaming (and possibly in combination with transmitting a corresponding video data bit stream) the bit stream 21 to a subscriber, such as the content consumer device 14, requesting the bit stream 21.
Alternatively, the source device 12 may store the bit stream 21 to a storage medium such as an optical disc, digital video disc, high definition video disc, or other storage medium, most of which is readable by a computer, and thus may be referred to as a computer readable storage medium or non-transitory computer readable storage medium. In this context, a transmission channel may refer to a channel (and may include retail stores and other store-based delivery mechanisms) that is used to transmit content stored to a medium (e.g., in the form of one or more bitstreams 21). In any event, the techniques of this disclosure should therefore not be limited in this regard to the example of fig. 1.
As described above, the content consumer device 14 includes the audio playback system 16. Audio playback system 16 may represent any system capable of playing back multi-channel audio data. The audio playback system 16 may include a plurality of different renderers 22. The renderers 22 may each provide different forms of rendering, where the different forms of rendering may include one or more of various ways of performing Vector Base Amplitude Panning (VBAP) and/or one or more of various ways of performing sound field synthesis. As used herein, "a and/or B" refers to "a or B", or both "a and B".
The audio playback system 16 may also include an audio decoding device 24. The audio decoding device 24 may represent a device configured to decode the bitstream 21 to output the audio data 15. Again, the audio data 15 may include scene-based audio data that, in some examples, may form a full second or higher order HOA representation, or a subset thereof that forms an MOA representation of the same sound field, or decompositions thereof (such as the predominant audio signal, the ambient HOA coefficients, and the vector-based signal described in the MPEG-H 3D audio coding standard), or other forms of scene-based audio data. As such, the audio data 15 may be similar to a full set or a partial subset of the audio data 11, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel.
The audio data 15 may comprise channel-based audio data as an alternative to or in combination with scene-based audio data. The audio data 15 may comprise object-based audio data as an alternative to or in combination with scene-based audio data. As such, the audio data 15 may include any combination of scene-based audio data, object-based audio data, and channel-based audio data.
After the audio decoding device 24 has decoded the bitstream 21 to obtain the audio data 15, the audio renderers 22 of the audio playback system 16 may render the audio data 15 to output the speaker feeds 25. The speaker feeds 25 may drive one or more speakers (which are not shown in the example of fig. 1 for ease of illustration). Various audio representations of a sound field, including scene-based audio data (and possibly channel-based audio data and/or object-based audio data), may be normalized in a number of ways, including N3D, SN3D, FuMa, N2D, or SN2D.
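As a brief illustration of two of the normalization conventions named above, N3D and SN3D coefficients of ambisonic order n differ by a factor of sqrt(2n + 1). This relation is standard ambisonics practice rather than something defined in this disclosure; which convention a stream uses must be known or signaled for rendering to be correct.

```python
import math

def n3d_to_sn3d(coeff, order_n):
    """Convert a single ambisonic coefficient from N3D ("full 3D")
    normalization to SN3D ("Schmidt semi-normalized"); the conventions
    differ by sqrt(2n + 1) at order n."""
    return coeff / math.sqrt(2 * order_n + 1)

def sn3d_to_n3d(coeff, order_n):
    """Inverse conversion: SN3D back to N3D."""
    return coeff * math.sqrt(2 * order_n + 1)

# Order 0 is identical under both conventions; order 1 differs by sqrt(3).
print(n3d_to_sn3d(1.0, 0))            # -> 1.0
print(round(sn3d_to_n3d(1.0, 1), 4))  # -> 1.7321
```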
To select or, in some examples, generate an appropriate renderer, audio playback system 16 may obtain speaker information 13 indicating the number of speakers (e.g., loudspeakers or headset speakers) and/or the spatial geometry of the speakers. In some examples, audio playback system 16 may obtain speaker information 13 using a reference microphone and drive the speaker in a manner that dynamically determines speaker information 13. In other examples, or in conjunction with dynamic determination of speaker information 13, audio playback system 16 may prompt a user to engage with audio playback system 16 and enter speaker information 13.
Audio playback system 16 may select one of audio renderers 22 based on speaker information 13. In some examples, when none of the audio renderers 22 is within a certain threshold similarity measure (in terms of speaker geometry) to the speaker geometry specified in the speaker information 13, the audio playback system 16 may generate one of the audio renderers 22 based on the speaker information 13. In some examples, audio playback system 16 may generate one of audio renderers 22 based on speaker information 13 without first attempting to select an existing one of audio renderers 22.
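A minimal sketch of the renderer-selection logic described above, assuming each layout is summarized by its speaker azimuths and that "similarity" is the maximum angular mismatch between matched speakers. The preset names and the 10-degree threshold are invented for illustration; falling through to None corresponds to the case where a new renderer would be generated from the speaker information.

```python
def pick_renderer(presets, layout, max_mismatch_deg=10.0):
    """Pick the preset renderer whose speaker azimuths best match the
    reported layout; return None when even the best match exceeds the
    threshold (in which case a renderer would be generated instead)."""
    def mismatch(a, b):
        if len(a) != len(b):        # different speaker counts never match
            return float("inf")
        return max(abs(p - q) for p, q in zip(sorted(a), sorted(b)))

    best_name, best_layout = min(presets.items(),
                                 key=lambda kv: mismatch(kv[1], layout))
    return best_name if mismatch(best_layout, layout) <= max_mismatch_deg else None

presets = {
    "stereo": [-30.0, 30.0],
    "5.0":    [-110.0, -30.0, 0.0, 30.0, 110.0],
}
print(pick_renderer(presets, [-28.0, 31.0]))  # -> stereo
print(pick_renderer(presets, [-90.0, 90.0]))  # -> None
```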
When outputting speaker feeds 25 to headphones, audio playback system 16 may utilize one of the renderers 22 that provides binaural rendering using head-related transfer functions (HRTFs) or other functions that can be rendered to left and right speaker feeds 25 for headphone speaker playback, such as a Binaural Room Impulse Response (BRIR) renderer. The term "speaker" or "transducer" may generally refer to any speaker, including a loudspeaker, a headset speaker, a bone conduction speaker, an ear bud speaker, a wireless headset speaker, and the like. One or more speakers may then play back the rendered speaker feeds 25.
Although described as rendering the speaker feeds 25 from the audio data 15, reference to rendering of the speaker feeds 25 may refer to other types of rendering, such as rendering incorporated directly into the decoding of the audio data 15 from the bitstream 21. An example of the alternative rendering can be found in Annex G of the MPEG-H 3D audio standard, where rendering occurs during the predominant signal formation and the background signal formation prior to sound field composition. As such, reference to rendering of the audio data 15 should be understood to refer to rendering of the actual audio data 15, or of decompositions or representations thereof (e.g., the above-noted predominant audio signal, the ambient HOA coefficients, and/or the vector-based signal, which may also be referred to as a V-vector).
As described above, the audio data 11 may represent a sound field that includes what is referred to as a low frequency effect (LFE) component, which may also be referred to as bass, below a certain threshold frequency, such as 200 Hz, 150 Hz, 120 Hz, or 100 Hz. Audio data conforming to some audio formats, such as channel-based audio formats, may include a dedicated LFE channel (commonly denoted as "point one," as in "X.1," meaning a single dedicated LFE channel alongside X main channels, such as the center, front left, front right, rear left, and rear right main channels when X equals five, with "X.2" referring to two dedicated LFE channels, and so on).
Audio data conforming to an object-based audio format may define one or more audio objects in the sound field and the location of each audio object, which are then transformed into channels mapped to individual speakers, including any subwoofer if there are sufficient LFE components (e.g., below about 200 Hz) in the sound field. Audio playback system 16 may process each audio object, performing distance measurements to identify the distances from which LFE components originate, low-pass filtering to extract any LFE components below a threshold (e.g., 200 Hz), bass activity detection to identify active LFE components, and so forth. Audio playback system 16 may then render one or more LFE speaker feeds, processing the LFE speaker feeds to perform dynamic range control, the output of which forms adjusted LFE speaker feeds.
Audio data conforming to a scene-based audio format may define the sound field as one or more higher-order ambisonic (HOA) coefficients associated with spherical basis functions having orders and sub-orders greater than or equal to zero. The audio playback system 16 may render the HOA coefficients to speaker feeds at positions spaced equally about a sphere centered on the sweet spot (the so-called Fliege-Maier points), where the sweet spot is another way of referring to the intended listening position. Audio playback system 16 may process each rendered speaker feed in a manner similar to that described above with respect to audio data conforming to the object-based format, forming adjusted LFE speaker feeds.
In each instance, audio playback system 16 may process each of the channels (provided in the case of channel-based audio data or rendered in the case of scene-based audio data) and/or audio objects equally to obtain the adjusted LFE speaker feeds. Each of the channels and/or audio objects is treated equally because the human auditory system is generally considered insensitive to the directionality and shape of the LFE component of the sound field, which is generally perceived (as vibrations) rather than distinctly heard, in contrast to the higher frequency components of the sound field, which the human auditory system can clearly localize.
However, as audio playback systems have evolved to feature more and more speakers with LFE capabilities (which may refer to full-range speakers, such as a large center speaker, a large front right speaker, a large front left speaker, etc. in addition to one or more subwoofers, where two or more subwoofers are more and more prevalent, particularly in movie theatres and other dedicated viewing and/or listening areas, such as home theatres or listening rooms), the human auditory system may perceive a lack of spatialization of the LFE component. In this way, a viewer and/or listener may notice a drop in immersion when the LFE component is not properly spatialized when rendered, wherein such drop may be detected when the associated scene being viewed does not properly match the rendering of the LFE component.
This drop may be further emphasized when the LFE channel is corrupted (for channel-based audio data) or when no LFE channel is provided (as may be the case for object-based audio data and/or scene-based audio data). Reconstruction of the LFE channel may include mixing all of the higher frequency channels together (after rendering audio objects and/or HOA coefficients to channels, when applicable) and outputting the mixed channel to the LFE-capable speakers, which may not be full-band (in terms of frequency); given that the high frequency components of the mixed channel may be distorted or otherwise rendered inaccurately, the result may be an inaccurate reproduction of the LFE components. In some examples, additional processing may be performed to reproduce the LFE speaker feeds, but such processing ignores the spatialization aspect and outputs the same LFE speaker feed to each of the LFE-capable speakers, which again may be sensed as inaccurate by the human auditory system.
In accordance with the techniques described in this disclosure, audio playback system 16 may perform spatialized rendering of LFE components to potentially improve reproduction of the LFE components of the sound field (e.g., below a threshold frequency of 200 Hz, 150 Hz, 120 Hz, or 100 Hz). Rather than processing all aspects of the audio data equally to obtain an LFE speaker feed, audio playback system 16 may analyze audio data 15 to identify spatial characteristics associated with the LFE components and process (e.g., render) the audio data in various ways based on the spatial characteristics to potentially more accurately spatialize the LFE components within the sound field.
As shown in the example of fig. 1, the audio playback system 16 may include an LFE renderer unit 26, which may represent a unit configured to spatialize LFE components of the audio data 15 in accordance with various aspects of the techniques described in this disclosure. In operation, LFE renderer unit 26 may analyze audio data 15 to identify spatial characteristics of LFE components of the sound field.
To identify the spatial characteristics, LFE renderer unit 26 may generate a spherical heat map (which may also be referred to as an "energy map") based on audio data 15 that reflects acoustic energy levels within the sound field for one or more frequency ranges (e.g., from 0 Hz to 200 Hz, 150 Hz, or 120 Hz). LFE renderer unit 26 may then identify the spatial characteristics of the LFE components of the sound field based on the spherical heat map. For example, LFE renderer unit 26 may identify the direction and shape of the LFE components based on the locations in the sound field where higher energy LFE components exist relative to other locations within the sound field. LFE renderer unit 26 may then process audio data 15 based on the identified directions, shapes, and/or other spatial characteristics to render LFE speaker feed 27.
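As a concrete illustration of the analysis step above, the following sketch estimates a dominant LFE direction from per-channel low-band energy for channel-based audio in which each channel resides at a known azimuth. The channel layout, function names, and the max-energy selection rule are illustrative assumptions standing in for the spherical heat map, not the patent's implementation.

```python
# Assumed 5.0 loudspeaker azimuths in degrees (center, front L/R, rear L/R).
CHANNEL_AZIMUTHS = {"C": 0.0, "L": 30.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0}

def band_energy(samples):
    """Mean-square energy of a (pre-filtered) low-band frame."""
    return sum(s * s for s in samples) / max(len(samples), 1)

def dominant_lfe_direction(lowband_frames):
    """Return the azimuth (degrees) of the channel carrying the most low-band energy.

    lowband_frames maps channel name -> list of low-pass filtered samples.
    """
    energies = {ch: band_energy(frames) for ch, frames in lowband_frames.items()}
    loudest = max(energies, key=energies.get)
    return CHANNEL_AZIMUTHS[loudest]
```

A full energy map would evaluate many directions over a sphere; this picks only among the fixed channel positions, which suffices for the channel-based case described above.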
LFE renderer unit 26 may then output LFE speaker feed 27 to a speaker with LFE capabilities (not shown in the example of fig. 1 for ease of illustration). In some examples, audio playback device 16 may mix LFE speaker feed 27 with one or more speaker feeds 25 to obtain a mixed speaker feed, which is then output to one or more LFE capable speakers.
In this way, aspects of the techniques may improve the operation of audio playback device 16, as potentially more accurate spatialization of LFE components within the sound field may improve immersion and thus the overall listening experience. Furthermore, aspects of the techniques may address instances in which audio playback device 16 is configured to reconstruct the LFE component of the sound field using LFE content embedded in the mid-frequency (commonly referred to as "mids") or high-frequency components of audio data 15 when a dedicated LFE channel is corrupted or otherwise incorrectly encoded. By potentially reconstructing the LFE component more accurately (in terms of spatialization), aspects of the techniques may improve rendering of the LFE audio embedded in the mid-frequency or high-frequency components of audio data 15.
Fig. 2 is a block diagram illustrating in more detail the LFE renderer unit shown in the example of fig. 1. As shown in the example of fig. 2, LFE renderer unit 26A represents one example of LFE renderer unit 26 shown in the example of fig. 1, where LFE renderer unit 26A includes a spatialization LFE analyzer 110, a distance measurement unit 112, a low pass filter 114, a bass activity detection unit 116, a rendering unit 118, and a Dynamic Range Control (DRC) unit 120.
The spatialization LFE analyzer 110 may represent a unit configured to identify spatial characteristics ("SC") 111 of the LFE components of the sound field represented by the audio data 15. That is, the spatialization LFE analyzer 110 may obtain the audio data 15 and analyze the audio data 15 to identify the SC 111. The spatialization LFE analyzer 110 may analyze the full frequency audio data 15 to produce a spherical heat map representing directional sound energy (which may also be referred to as level or gain) around the sweet spot. The spatialization LFE analyzer 110 may then identify the SC 111 of the LFE components of the sound field based on the spherical heat map. As described above, the SC 111 of the LFE components may include one or more directions (e.g., directions of arrival), one or more associated shapes, and the like.
The spatialization LFE analyzer 110 may generate the spherical heat map in a number of different ways depending on the format of the audio data 15. In an example of channel-based audio data, the spatialization LFE analyzer 110 may generate a spherical heat map directly from channels, where each channel is defined to reside at a different location in space (e.g., as part of a 5.1 audio format). For object-based audio data, LFE analyzer 110 may forgo generation of the spherical heat map because the object metadata may directly define the location where the associated object resides. LFE analyzer 110 may process all objects to identify which objects contribute to the LFE component of the sound field and identify SC 111 based on object metadata associated with the identified objects.
Instead of, or in combination with, the above-described metadata-based identification of SC 111, the spatialization LFE analyzer 110 may transform the object audio data 15 from the spatial domain to the spherical harmonic domain, producing HOA coefficients representing each object. The spatialization LFE analyzer 110 may then mix the HOA coefficients from each object together in their entirety and transform the HOA coefficients from the spherical harmonic domain back into the spatial domain, producing the channel (or in other words, rendering the HOA coefficients into the channel). The rendered channels may be evenly spaced around a sphere surrounding the listener. The rendered channels may form the basis of a spherical heat map. The spatialization LFE analyzer 110 may perform operations similar to those described above in the instance of scene-based audio data (with reference to rendering channels from HOA coefficients, which are then used to generate a spherical heat map, which may also be referred to as an energy map).
The spatialization LFE analyzer 110 may output the SC 111 to one or more of the distance measurement unit 112, the low pass filter 114, the bass activity detection unit 116, the rendering unit 118, and/or the dynamic range control unit 120. The distance measurement unit 112 may determine the distance between the location from which the LFE component originated (as indicated by SC 111 or derived therefrom) and each LFE-capable speaker. The distance measurement unit 112 may then select the one of the LFE-capable speakers having the smallest determined distance. When there is only a single LFE-capable speaker, LFE renderer unit 26A may not invoke distance measurement unit 112 to calculate or otherwise determine the distances.
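The nearest-speaker selection attributed to distance measurement unit 112 can be sketched as follows; the coordinate convention, names, and Euclidean metric are assumptions for illustration.

```python
import math

def nearest_lfe_speaker(origin, speakers):
    """Select the LFE-capable speaker closest to the LFE component's origin.

    origin: (x, y, z) point from which the LFE component originates.
    speakers: dict mapping speaker name -> (x, y, z) position.
    Returns the name of the speaker with the smallest Euclidean distance.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(speakers, key=lambda name: dist(origin, speakers[name]))
```

With a single entry in `speakers`, the selection is trivial, matching the note above that the distance computation may be skipped in that case.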
The low-pass filter 114 may represent a unit configured to perform low-pass filtering on the audio data 15 to obtain LFE components of the audio data 15. To save processing cycles and thereby facilitate more efficient operation (with the associated benefits of lower power consumption, bandwidth utilization including memory bandwidth, etc.), the low-pass filter 114 may select only those channels (for channel-based audio data) from the directions identified by the SC 111. However, in some examples, the low pass filter 114 may apply a low pass filter to the entire audio data 15 to obtain the LFE component. The low pass filter 114 may output the LFE component to the bass activity detection unit 116.
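A minimal one-pole filter can stand in for low-pass filter 114 as described above. The filter topology, the 200 Hz cutoff, and the 48 kHz sample rate are assumptions for illustration; the text only requires that components below a threshold frequency be extracted.

```python
import math

def one_pole_lowpass(samples, cutoff_hz=200.0, sample_rate=48000.0):
    """Return roughly the portion of `samples` below cutoff_hz.

    Implements y[n] = y[n-1] + alpha * (x[n] - y[n-1]), a one-pole IIR
    low-pass with alpha derived from the RC time constant of the cutoff.
    """
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out
```

A DC (0 Hz) input passes essentially unchanged once the filter settles, while content far above the cutoff is strongly attenuated, which is the behavior the LFE extraction step relies on.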
The bass activity detection unit 116 may represent a unit configured to detect whether bass is present in a given frame of the LFE component. The bass activity detection unit 116 may apply a noise floor threshold (e.g., 20 dB) to each frame of the LFE component. Although described with respect to a static threshold, the bass activity detection unit 116 may use histograms to set a dynamic (time-varying) noise floor threshold.
When the gain (defined in dB) of the LFE component exceeds or equals the noise floor threshold, the bass activity detection unit 116 may indicate that the LFE component is active for the current frame and is to be rendered. When the gain of the LFE component is below the noise floor threshold, the bass activity detection unit 116 may indicate that the LFE component is not active for the current frame and will not be rendered. The bass activity detection unit 116 may output the indication to the rendering unit 118.
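The noise-floor comparison described above can be sketched as follows. The RMS-based level measure and the -20 dBFS floor are illustrative assumptions; the text specifies only a dB comparison against a noise floor threshold.

```python
import math

def frame_level_db(frame):
    """RMS level of a frame in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / max(len(frame), 1))
    return -math.inf if rms == 0.0 else 20.0 * math.log10(rms)

def bass_active(frame, noise_floor_db=-20.0):
    """True when the LFE frame meets the floor and should be rendered."""
    return frame_level_db(frame) >= noise_floor_db
```

A dynamic threshold (per the histogram variant mentioned above) would simply replace the constant `noise_floor_db` with a value tracked over recent frames.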
When the indication indicates that the LFE component is active for the current frame, rendering unit 118 may render LFE-capable speaker feed 27 based on SC 111 and speaker information 13. That is, for channel-based audio data, rendering unit 118 may weight the channels according to SC 111 to potentially emphasize the direction from which the LFE components originated in the sound field. In this way, the rendering unit 118 may apply a first weight to a first audio channel of the plurality of audio channels based on the SC 111 to obtain a first weighted audio channel, the first weight being different from a second weight applied to a second audio channel of the plurality of audio channels. The rendering unit 118 may then mix the first weighted audio channel with a second weighted audio channel obtained by applying a second weight to the second audio channel to obtain a mixed audio channel. Rendering unit 118 may then obtain one or more LFE-capable speaker feeds 27 based on the mixed audio channels.
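The weighted-mix step just described (weight each channel, sum the weighted channels, use the mix as the LFE feed) can be sketched as below; the dictionary-based interface and the specific weights are illustrative assumptions.

```python
def render_lfe_feed(channels, weights):
    """Mix weighted audio channels into a single LFE speaker feed.

    channels: dict mapping channel name -> list of samples (equal lengths).
    weights: dict mapping channel name -> scalar weight derived from the
             spatial characteristics (missing channels get weight 0.0).
    """
    names = list(channels)
    n = len(channels[names[0]])
    mix = [0.0] * n
    for name in names:
        w = weights.get(name, 0.0)
        for i, s in enumerate(channels[name]):
            mix[i] += w * s
    return mix
```

Emphasizing the direction of origin then amounts to assigning a larger weight to the channel(s) nearest that direction, as in the first/second weighted audio channel description above.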
For object-based audio data, rendering unit 118 may adjust the object rendering matrix using SC 111 as the direction of arrival to consider the direction of arrival of the LFE component. For scene-based audio data, rendering unit 118 may again adjust a similar HOA rendering matrix using SC 111 as the direction of arrival to take into account the direction of arrival of the LFE components. Regardless of the type of audio data, rendering unit 118 may utilize speaker information 13 to determine aspects of the rendering weights/matrices (as well as any delays, intersections, etc.) to account for differences between the specified location of the speaker (e.g., in 5.1 format) and the actual location of the LFE-capable speaker.
The rendering unit 118 may perform various types of rendering, such as object-based rendering types, including vector-based amplitude panning (VBAP) and distance-based amplitude panning (DBAP), and/or ambisonics-based rendering types. In instances where there is more than one LFE-capable speaker, rendering unit 118 may perform VBAP, DBAP, and/or ambisonics-based rendering in order to create the audible appearance of a virtual speaker positioned at the direction of arrival defined by SC 111. That is, when the audio playback device 16 is coupled to a plurality of speakers having low frequency effect capabilities, the rendering unit 118 may be configured to process the audio data based on SC 111 to render a first low frequency effect speaker feed and a second low frequency effect speaker feed, the first low frequency effect speaker feed being different from the second low frequency effect speaker feed. In rendering the different low frequency effect speaker feeds, rendering unit 118 may perform VBAP to localize the direction of arrival of the low frequency effect component.
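A minimal two-speaker, two-dimensional VBAP sketch of the kind of panning rendering unit 118 might perform to place a phantom LFE source between two LFE-capable speakers is given below. The azimuth convention (degrees, counterclockwise from front) and the power normalization are assumptions for illustration.

```python
import math

def vbap_2d_gains(source_az_deg, spk1_az_deg, spk2_az_deg):
    """Return (g1, g2) panning gains, power-normalized so g1^2 + g2^2 == 1.

    Solves p = g1*l1 + g2*l2, where p is the source's unit direction vector
    and l1, l2 are the speakers' unit vectors, by inverting the 2x2 base.
    """
    def unit(az_deg):
        a = math.radians(az_deg)
        return (math.cos(a), math.sin(a))
    (l11, l12), (l21, l22) = unit(spk1_az_deg), unit(spk2_az_deg)
    p1, p2 = unit(source_az_deg)
    det = l11 * l22 - l12 * l21
    g1 = (p1 * l22 - p2 * l21) / det
    g2 = (p2 * l11 - p1 * l12) / det
    norm = math.sqrt(g1 * g1 + g2 * g2)
    return g1 / norm, g2 / norm
```

A source midway between the speakers yields equal gains; a source on one speaker collapses all gain onto that speaker, which is the phantom-source behavior the panning step relies on.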
When the indication indicates that the LFE component is not active for the current frame, rendering unit 118 may refrain from rendering the current frame. In any event, when the LFE component is indicated as being active, rendering unit 118 may output LFE-capable speaker feed 27 to Dynamic Range Control (DRC) unit 120.
The dynamic range control unit 120 may ensure that the dynamic range of the LFE-capable speaker feed 27 is kept within a maximum gain to avoid damaging the LFE-capable speakers. Because tolerances may differ on a per-speaker basis, dynamic range control unit 120 may ensure that LFE-capable speaker feed 27 remains below a maximum gain defined for each LFE-capable speaker (or identified automatically by dynamic range control unit 120 or other components within audio playback system 16). Dynamic range control unit 120 may output the adjusted LFE-capable speaker feed 27 to the LFE-capable speakers.
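The per-speaker ceiling enforced by dynamic range control unit 120 can be sketched as a simple peak limiter: if a frame's peak would exceed the speaker's maximum allowed gain, the whole frame is scaled down. The linear scaling is an assumption; the text only requires that the feed stay within a per-speaker maximum.

```python
def apply_drc(frame, max_gain=1.0):
    """Return the frame scaled so its peak magnitude does not exceed max_gain.

    frame: list of samples for one LFE-capable speaker feed.
    max_gain: maximum allowed magnitude for that speaker.
    """
    peak = max((abs(s) for s in frame), default=0.0)
    if peak <= max_gain:
        return list(frame)
    scale = max_gain / peak
    return [s * scale for s in frame]
```

Calling this once per speaker with that speaker's own `max_gain` mirrors the per-speaker tolerance handling described above.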
Fig. 3 is a block diagram illustrating another example of the LFE renderer unit shown in fig. 1 in more detail. As shown in the example of fig. 3, LFE renderer unit 26B represents one example of LFE renderer unit 26 shown in the example of fig. 1, where LFE renderer unit 26B includes the same spatialization LFE analyzer 110, distance measurement unit 112, low pass filter 114, bass activity detection unit 116, rendering unit 118, and Dynamic Range Control (DRC) unit 120 as discussed above with respect to LFE renderer unit 26A. However, LFE renderer unit 26B differs from LFE renderer unit 26A in that bass activity detection unit 116 processes audio data 15 first, potentially improving processing efficiency given that frames without bass activity are skipped, avoiding processing by the spatialization LFE analyzer 110, the distance measurement unit 112, and the low pass filter 114.
Fig. 4 is a flowchart illustrating example operations of the LFE renderer unit shown in fig. 1-3 in performing aspects of the low frequency effect rendering technique. LFE renderer unit 26 may analyze audio data 15 representing the sound field to identify SC 111 of the low frequency effects component of the sound field (200). To perform this analysis, LFE renderer unit 26 may generate a spherical heat map based on audio data 15 that represents energy surrounding a listener located in the middle of the sphere (in the sweet spot). LFE renderer unit 26 may choose to locate the direction of maximum energy, as described in more detail above.
LFE renderer unit 26 may then process the audio data based on SC 111 to render one or more low frequency effects speaker feeds (202). As discussed above with respect to the example of fig. 2, LFE renderer unit 26 may adapt rendering unit 118 to differently weight each channel (for channel-based audio data), object (for object-based audio data), and/or various HOA coefficients (for scene-based audio data) based on SC 111.
For example, if the direction of arrival defined by SC 111 indicates that the LFE component arrives primarily from the right of the listener, LFE renderer unit 26 may configure rendering unit 118 to weight the right channel higher than the left channel (or discard the left channel entirely, because it may have little or no LFE component). In the object domain, for the same example direction as the channel case described above, LFE renderer unit 26 may configure rendering unit 118 to weight objects responsible for most of the energy (and whose metadata indicates that they reside on the right) over objects to the listener's left (or discard the objects to the listener's left). In the context of scene-based audio data, and for the same example direction as discussed above, LFE renderer unit 26 may configure rendering unit 118 to weight the right channel rendered from the HOA coefficients over the left channel rendered from the HOA coefficients.
LFE renderer unit 26 can output low frequency effects speaker feed 27 to a speaker with low frequency effects capability (204). Although described above as generating the low frequency effects speaker feed 27 from a single type of audio data 15 (e.g., scene-based audio data), these techniques may be performed with respect to mixed format audio data in which two or more of channel-based audio data, object-based audio data, or scene-based audio data are present for the same time frame.
Fig. 5 is a block diagram illustrating example components of the content consumer device 14 shown in the example of fig. 1. In the example of fig. 5, the content consumer device 14 includes a processor 412, a Graphics Processing Unit (GPU) 414, a system memory 416, a display processor 418, one or more integrated speakers 105, a display 103, a user interface 420, and a transceiver module 422. In the example where the content consumer device 14 is a mobile device, the display processor 418 is a Mobile Display Processor (MDP). In some examples, such as examples where content consumer device 14 is a mobile device, processor 412, GPU414, and display processor 418 may be formed as an Integrated Circuit (IC).
For example, an IC may be considered a processing chip within a chip package, and may be a system on a chip (SoC). In some examples, two of processor 412, GPU 414, and display processor 418 may be housed together in the same IC, and the other housed in a different integrated circuit (i.e., a different chip package), or all three may be housed in a different IC or on the same IC. However, in examples where content consumer device 14 is a mobile device, processor 412, GPU 414, and display processor 418 may all be housed in different integrated circuits.
Examples of processor 412, GPU 414, and display processor 418 include, but are not limited to, fixed function and/or programmable processing circuits such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Processor 412 may be a Central Processing Unit (CPU) of content consumer device 14. In some examples, GPU 414 may be dedicated hardware including integrated and/or discrete logic circuitry that provides GPU 414 with a large amount of parallel processing capability suitable for graphics processing. In some examples, GPU 414 may also include general purpose processing capabilities, and may be referred to as a General Purpose GPU (GPGPU) when implementing general purpose processing tasks (i.e., non-graphics related tasks). The display processor 418 may also be specialized integrated circuit hardware designed to retrieve image content from the system memory 416, compose the image content into image frames, and output the image frames to the display 103.
The processor 412 may execute various types of applications 20. Examples of applications 20 include web browsers, email applications, spreadsheets, video games, other applications that generate visual objects for display, or any of the application types listed in more detail above. The system memory 416 may store instructions for executing the application 20. Executing one of the applications 20 on the processor 412 causes the processor 412 to generate graphics data of the image content to be displayed and audio data 21 to be played (possibly via the integrated speaker 105). Processor 412 may send graphics data of the image content to GPU 414 for further processing based on instructions or commands sent by processor 412 to GPU 414.
The processor 412 may communicate with the GPU 414 according to a particular Application Programming Interface (API). Examples of such APIs include the DirectX API, the OpenGL API of the Khronos Group, and OpenCL; however, aspects of the present disclosure are not limited to the DirectX, OpenGL, or OpenCL APIs, and may be extended to other types of APIs. Furthermore, the techniques described in this disclosure need not function according to an API, and the processor 412 and GPU 414 may communicate using any technique.
The system memory 416 may be memory of the content consumer device 14. The system memory 416 may include one or more computer-readable storage media. Examples of system memory 416 include, but are not limited to, random Access Memory (RAM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other medium that can be used to carry or store desired program code in the form of instructions and/or data structures and that can be accessed by a computer or processor.
In some examples, system memory 416 may include instructions that cause processor 412, GPU414, and/or display processor 418 to perform the functions attributed to processor 412, GPU414, and/or display processor 418 in this disclosure. Accordingly, system memory 416 may be a computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors (e.g., processor 412, GPU414, and/or display processor 418) to perform various functions.
The system memory 416 may include non-transitory storage media. The term "non-transitory" indicates that the storage medium is not embodied in a carrier wave or propagated signal. However, the term "non-transitory" should not be construed to mean that the system memory 416 is not removable or that its contents are static. As one example, the system memory 416 may be removed from the content consumer device 14 and moved to another device. As another example, memory substantially similar to system memory 416 may be inserted into content consumer device 14. In some examples, a non-transitory storage medium may store data (e.g., in RAM) that may change over time.
The user interface 420 may represent one or more hardware or virtual (meaning a combination of hardware and software) user interfaces through which a user may interface with the content consumer device 14. The user interface 420 may include physical buttons, switches, triggers, lights, or virtual versions thereof. The user interface 420 may also include a physical or virtual keyboard, such as a touch interface of a touch screen, haptic feedback, or the like.
Processor 412 may include one or more hardware units (including so-called "processing cores") configured to perform all or some of the operations discussed above with respect to LFE renderer unit 26 of fig. 1. Transceiver module 422 may represent one or more receivers and one or more transmitters capable of wireless communication according to one or more wireless communication protocols.
It may be appreciated that certain acts or events of any of the techniques described herein can be performed in a different order, may be added, combined, or omitted altogether, depending on the example (e.g., not all of the described acts or events are necessary for the practice of the technique). Further, in some examples, an action or event may be performed concurrently, e.g., by multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.
In some examples, an a/V device (or AV and/or streaming device) may transmit an exchange message to an external device using a network interface coupled to a memory of the AV/streaming device, where the exchange message is associated with a plurality of available representations of the sound field. In some examples, an a/V device may receive a wireless signal including data packets, audio packets, video packets, or transmission protocol data associated with a plurality of available representations of a sound field using an antenna coupled to a network interface. In some examples, one or more microphone arrays may capture a sound field.
In some examples, the plurality of available representations of the soundfield stored to the memory device may include a plurality of object-based representations of the soundfield, a higher order ambisonic representation of the soundfield, a mixed order ambisonic representation of the soundfield, a combination of an object-based representation of the soundfield with a higher order ambisonic representation of the soundfield, a combination of an object-based representation of the soundfield with a mixed order ambisonic representation of the soundfield, or a combination of a mixed order ambisonic representation of the soundfield with a higher order ambisonic representation of the soundfield.
In some examples, one or more of the representations of the sound field in the plurality of available representations of the sound field may include at least one high resolution region and at least one lower resolution region, and wherein the representation selected based on the steering angle provides greater spatial precision with respect to the at least one high resolution region and lesser spatial precision with respect to the lower resolution region.
In one or more examples, the described functionality may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a communication medium including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for use in implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Further, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used in the present application, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor" as used herein may refer to any one of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described herein may be provided in dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as noted above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units, including one or more processors as described above, in combination with appropriate software and/or firmware.
Various examples have been described. These and other examples are within the scope of the following claims.

Claims (28)

1. An apparatus for processing audio data, comprising:
a memory configured to store audio data representing a sound field; and
one or more processors configured to:
analyze the audio data to identify spatial characteristics of low frequency effect components of the sound field, wherein the spatial characteristics include one or more directions within the sound field from which the low frequency effect components originate and a shape of the low frequency effect components within the sound field;
process the audio data based on the spatial characteristics to render a low frequency effect speaker feed; and
output the low frequency effect speaker feed to a low frequency effect capable speaker.
2. The device of claim 1, wherein the device is coupled to the low frequency effects capable speaker, the low frequency effects capable speaker configured to reproduce low frequency effects components of the sound field based on the low frequency effects speaker feed.
3. The device of claim 1, wherein the one or more processors are further configured to:
generate a spherical heat map reflecting acoustic energy levels within the sound field based on the audio data; and
identify the spatial characteristics of the low frequency effect components of the sound field based on the spherical heat map.
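[Illustrative note, not part of the claims] One way to realize the spherical heat map of claim 3 is to steer a beam over a grid of directions and record the energy observed in each; a direction-of-maximum then serves as a spatial characteristic. The sketch below assumes first-order ambisonic input (W, Y, Z, X in ACN/SN3D order); the function name, grid resolution, and cardioid beam pattern are all hypothetical choices, not anything the claim prescribes.

```python
import numpy as np

def spherical_heat_map(foa, n_az=24, n_el=12):
    """Build a coarse spherical heat map of acoustic energy from
    first-order ambisonic signals (W, Y, Z, X) and return the map
    plus the hottest (azimuth, elevation) direction, in radians."""
    w, y, z, x = foa
    az = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)
    heat = np.zeros((n_el, n_az))
    for i, e in enumerate(el):
        for j, a in enumerate(az):
            # Cardioid beam steered toward (azimuth a, elevation e).
            beam = 0.5 * (w
                          + np.cos(e) * np.cos(a) * x
                          + np.cos(e) * np.sin(a) * y
                          + np.sin(e) * z)
            heat[i, j] = np.mean(beam ** 2)
    i, j = np.unravel_index(np.argmax(heat), heat.shape)
    return heat, (az[j], el[i])
```

In practice the analysis would be run on a low-passed copy of the signals so that the hottest cell reflects where the low frequency energy, rather than the full-band energy, originates.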
4. The apparatus according to claim 1,
wherein the audio data comprises channel-based audio data having data of a plurality of audio channels,
wherein each audio channel of the plurality of audio channels is associated with a different location within the sound field, and
wherein the one or more processors are configured to:
apply, based on the spatial characteristics, a first weight to a first audio channel of the plurality of audio channels to obtain a first weighted audio channel, the first weight being different from a second weight applied to a second audio channel of the plurality of audio channels;
mix the first weighted audio channel with a second weighted audio channel, obtained by applying the second weight to the second audio channel, to obtain a mixed audio channel; and
determine the low frequency effect speaker feed based on the mixed audio channel.
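[Illustrative note, not part of the claims] The weighting-and-mixing of claim 4 can be sketched as follows. The claim does not prescribe how the weights derive from the spatial characteristics; the alignment-based rule below, the function name, and the normalization are all hypothetical assumptions for illustration only.

```python
import numpy as np

def lfe_feed_from_channels(channels, positions, lfe_direction):
    """Weight each channel by how well its loudspeaker position aligns
    with an estimated low-frequency-effect direction, mix the weighted
    channels, and use the mix as the LFE speaker feed.

    channels      -- array (num_channels, num_samples)
    positions     -- unit direction vectors, shape (num_channels, 3)
    lfe_direction -- unit vector, shape (3,)
    """
    # Weight in [0, 1]: 1 when the channel points along the LFE
    # direction, 0 when it points the opposite way.
    align = positions @ lfe_direction
    weights = np.clip(0.5 * (1.0 + align), 0.0, 1.0)

    weighted = weights[:, None] * channels    # weighted audio channels
    mixed = weighted.sum(axis=0)              # mixed audio channel
    return mixed / max(weights.sum(), 1e-12)  # normalized LFE feed
```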
5. The apparatus according to claim 1,
wherein the audio data comprises object-based audio data comprising audio objects and metadata indicating a location in the sound field from which the audio objects originate, and
wherein the one or more processors are configured to:
extract the metadata from the object-based audio data; and
identify the spatial characteristics based on the metadata.
6. The apparatus according to claim 1,
wherein the audio data comprises object-based audio data defining a plurality of audio objects, and
wherein the one or more processors are configured to:
transform each of the plurality of audio objects from a spatial domain to a spherical harmonic domain to obtain a corresponding set of higher-order ambisonic coefficients;
mix each of the corresponding sets of higher-order ambisonic coefficients into a single set of higher-order ambisonic coefficients; and
analyze the single set of higher-order ambisonic coefficients to identify the spatial characteristics.
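[Illustrative note, not part of the claims] The transform-and-mix of claim 6 amounts to encoding each object's signal, at its direction, into spherical harmonic coefficients and summing the encodings. The sketch below stops at first order for brevity and assumes ACN channel order (W, Y, Z, X) with SN3D normalization; the function name and input layout are hypothetical.

```python
import numpy as np

def objects_to_foa(objects):
    """Encode audio objects into the spherical harmonic domain and
    mix them into a single (first-order) ambisonic coefficient set.

    objects -- list of (signal, azimuth, elevation) tuples,
               angles in radians, signals of equal length
    """
    n = len(objects[0][0])
    foa = np.zeros((4, n))
    for sig, az, el in objects:
        sig = np.asarray(sig, dtype=float)
        foa[0] += sig                              # W (omnidirectional)
        foa[1] += np.cos(el) * np.sin(az) * sig    # Y
        foa[2] += np.sin(el) * sig                 # Z
        foa[3] += np.cos(el) * np.cos(az) * sig    # X
    return foa
```

A full higher-order encoder would use the same pattern with the complete set of spherical harmonics up to the chosen order.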
7. The apparatus according to claim 1,
wherein the audio data comprises scene-based audio data comprising higher-order ambisonic coefficients, and
wherein the one or more processors are configured to:
render the scene-based audio data to one or more audio channels; and
analyze the one or more audio channels to identify the spatial characteristics.
8. The apparatus of claim 7, wherein the one or more audio channels are evenly distributed around a sphere representing the sound field.
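[Illustrative note, not part of the claims] Claims 7 and 8 (rendering scene-based audio to channels evenly distributed around a sphere, then analyzing those channels) can be sketched as below. The Fibonacci-spiral grid, the cardioid decoder, and the energy-based analysis are illustrative assumptions, and the input is again taken to be first-order ambisonic (W, Y, Z, X) for brevity.

```python
import numpy as np

def fibonacci_directions(n):
    """Near-uniform unit vectors on the sphere via the Fibonacci
    spiral (one way to get 'evenly distributed' channel directions)."""
    i = np.arange(n)
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    phi = np.pi * (1.0 + np.sqrt(5.0)) * i
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def render_and_analyze(foa, n_channels=16):
    """Render first-order ambisonic audio to evenly distributed
    channels and return the direction of the most energetic channel."""
    dirs = fibonacci_directions(n_channels)
    w, y, z, x = foa
    # Cardioid decode toward each direction d = (dx, dy, dz).
    feeds = 0.5 * (w + dirs[:, 0:1] * x + dirs[:, 1:2] * y + dirs[:, 2:3] * z)
    energy = np.mean(feeds ** 2, axis=1)
    return dirs[np.argmax(energy)]
```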
9. The apparatus according to claim 1,
wherein the device is coupled to a plurality of low frequency effect capable speakers,
wherein the low frequency effect speaker feed is a first low frequency effect speaker feed, and
wherein the one or more processors are configured to process the audio data to render the first low frequency effect speaker feed and a second low frequency effect speaker feed based on the spatial characteristics, the first low frequency effect speaker feed being different from the second low frequency effect speaker feed.
10. A method of processing audio data, comprising:
analyzing audio data representing a sound field to identify spatial characteristics of low frequency effect components of the sound field, wherein the spatial characteristics include one or more directions within the sound field from which the low frequency effect components originate and a shape of the low frequency effect components within the sound field;
processing the audio data based on the spatial characteristics to render a low frequency effect speaker feed; and
outputting the low frequency effect speaker feed to a low frequency effect capable speaker.
11. The method of claim 10, further comprising reproducing low frequency effect components of the sound field based on the low frequency effect speaker feed.
12. The method of claim 10, wherein analyzing the audio data comprises:
generating a spherical heat map reflecting acoustic energy levels within the sound field based on the audio data; and
identifying the spatial characteristics of the low frequency effect components of the sound field based on the spherical heat map.
13. The method according to claim 10,
wherein the audio data comprises channel-based audio data having data of a plurality of audio channels,
wherein each audio channel of the plurality of audio channels is associated with a different location within the sound field, and
wherein processing the audio data comprises:
applying, based on the spatial characteristics, a first weight to a first audio channel of the plurality of audio channels to obtain a first weighted audio channel, the first weight being different from a second weight applied to a second audio channel of the plurality of audio channels;
mixing the first weighted audio channel with a second weighted audio channel, obtained by applying the second weight to the second audio channel, to obtain a mixed audio channel; and
determining the low frequency effect speaker feed based on the mixed audio channel.
14. The method according to claim 10,
wherein the audio data comprises object-based audio data comprising audio objects and metadata indicating a location in the sound field from which the audio objects originate, and
wherein analyzing the audio data comprises:
extracting the metadata from the object-based audio data; and
identifying the spatial characteristics based on the metadata.
15. The method according to claim 10,
wherein the audio data comprises object-based audio data defining a plurality of audio objects, and
wherein analyzing the audio data comprises:
transforming each of the plurality of audio objects from a spatial domain to a spherical harmonic domain to obtain a corresponding set of higher-order ambisonic coefficients;
mixing each of the corresponding sets of higher-order ambisonic coefficients into a single set of higher-order ambisonic coefficients; and
analyzing the single set of higher-order ambisonic coefficients to identify the spatial characteristics.
16. The method according to claim 10,
wherein the audio data comprises scene-based audio data comprising higher-order ambisonic coefficients, and
wherein analyzing the audio data comprises:
rendering the scene-based audio data to one or more audio channels; and
analyzing the one or more audio channels to identify the spatial characteristics.
17. The method of claim 16, wherein the one or more audio channels are evenly distributed around a sphere representing the sound field.
18. The method according to claim 10,
wherein the low frequency effect speaker feed is a first low frequency effect speaker feed, and
wherein processing the audio data includes processing the audio data to render the first low frequency effect speaker feed and a second low frequency effect speaker feed based on the spatial characteristics, the first low frequency effect speaker feed being different from the second low frequency effect speaker feed.
19. An apparatus for processing audio data, comprising:
means for analyzing audio data representing a sound field to identify spatial characteristics of low frequency effect components of the sound field, wherein the spatial characteristics include one or more directions within the sound field from which the low frequency effect components originate and a shape of the low frequency effect components within the sound field;
means for processing the audio data based on the spatial characteristics to render a low frequency effect speaker feed; and
means for outputting the low frequency effect speaker feed to a low frequency effect capable speaker.
20. The apparatus of claim 19, further comprising means for reproducing a low frequency effect component of the sound field based on the low frequency effect speaker feed.
21. The apparatus of claim 19, wherein means for analyzing the audio data comprises:
means for generating a spherical heat map reflecting acoustic energy levels within the sound field based on the audio data; and
means for identifying the spatial characteristics of the low frequency effect components of the sound field based on the spherical heat map.
22. The apparatus according to claim 19,
wherein the audio data comprises channel-based audio data having data of a plurality of audio channels,
wherein each audio channel of the plurality of audio channels is associated with a different location within the sound field, and
wherein the means for processing the audio data comprises:
means for applying, based on the spatial characteristics, a first weight to a first audio channel of the plurality of audio channels to obtain a first weighted audio channel, the first weight being different from a second weight applied to a second audio channel of the plurality of audio channels;
means for mixing the first weighted audio channel with a second weighted audio channel, obtained by applying the second weight to the second audio channel, to obtain a mixed audio channel; and
means for determining the low frequency effect speaker feed based on the mixed audio channel.
23. The apparatus according to claim 19,
wherein the audio data comprises object-based audio data comprising audio objects and metadata indicating a location in the sound field from which the audio objects originate, and
wherein the means for analyzing the audio data comprises:
means for extracting the metadata from the object-based audio data; and
means for identifying the spatial characteristics based on the metadata.
24. The apparatus according to claim 19,
wherein the audio data comprises object-based audio data defining a plurality of audio objects, and
wherein the means for analyzing the audio data comprises:
means for transforming each of the plurality of audio objects from a spatial domain to a spherical harmonic domain to obtain a corresponding set of higher-order ambisonic coefficients;
means for mixing each of the corresponding sets of higher-order ambisonic coefficients into a single set of higher-order ambisonic coefficients; and
means for analyzing the single set of higher-order ambisonic coefficients to identify the spatial characteristics.
25. The apparatus according to claim 19,
wherein the audio data comprises scene-based audio data comprising higher-order ambisonic coefficients, and
wherein the means for analyzing the audio data comprises:
means for rendering the scene-based audio data to one or more audio channels; and
means for analyzing the one or more audio channels to identify the spatial characteristics.
26. The apparatus of claim 25, wherein the one or more audio channels are evenly distributed around a sphere representing the sound field.
27. The apparatus according to claim 19,
wherein the low frequency effect speaker feed is a first low frequency effect speaker feed, and
wherein the means for processing the audio data comprises means for processing the audio data based on the spatial characteristics to render the first low frequency effect speaker feed and a second low frequency effect speaker feed, the first low frequency effect speaker feed being different from the second low frequency effect speaker feed.
28. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors of an apparatus to:
analyze audio data representing a sound field to identify spatial characteristics of low frequency effect components of the sound field, wherein the spatial characteristics include one or more directions within the sound field from which the low frequency effect components originate and a shape of the low frequency effect components within the sound field;
process the audio data based on the spatial characteristics to render a low frequency effect speaker feed; and
output the low frequency effect speaker feed to a low frequency effect capable speaker.
CN202080051077.5A 2019-06-20 2020-06-16 Audio rendering for low frequency effects Active CN114128312B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
GR20190100269 2019-06-20
GR20190100269 2019-06-20
US16/714,468 US11122386B2 (en) 2019-06-20 2019-12-13 Audio rendering for low frequency effects
US16/714,468 2019-12-13
PCT/US2020/037926 WO2020257193A1 (en) 2019-06-20 2020-06-16 Audio rendering for low frequency effects

Publications (2)

Publication Number Publication Date
CN114128312A (en) 2022-03-01
CN114128312B (en) 2024-05-28

Family

ID=71465428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080051077.5A Active CN114128312B (en) 2019-06-20 2020-06-16 Audio rendering for low frequency effects

Country Status (3)

Country Link
EP (1) EP3987824B1 (en)
CN (1) CN114128312B (en)
WO (1) WO2020257193A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025150960A1 * 2024-01-12 2025-07-17 Samsung Electronics Co., Ltd. Method and apparatus for audio processing for classifying multi-channel audio signals

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150264483A1 (en) * 2014-03-14 2015-09-17 Qualcomm Incorporated Low frequency rendering of higher-order ambisonic audio data
WO2015147434A1 * 2014-03-25 2015-10-01 Intellectual Discovery Co., Ltd. Apparatus and method for processing audio signal
US10405126B2 (en) 2017-06-30 2019-09-03 Qualcomm Incorporated Mixed-order ambisonics (MOA) audio data for computer-mediated reality systems

Also Published As

Publication number Publication date
EP3987824C0 (en) 2024-07-10
CN114128312A (en) 2022-03-01
WO2020257193A1 (en) 2020-12-24
EP3987824B1 (en) 2024-07-10
EP3987824A1 (en) 2022-04-27

Similar Documents

Publication Publication Date Title
US10952009B2 (en) Audio parallax for virtual reality, augmented reality, and mixed reality
JP6284955B2 (en) Mapping virtual speakers to physical speakers
CN114072792B (en) Password-based authorization for audio rendering
US20150264483A1 (en) Low frequency rendering of higher-order ambisonic audio data
US20240129681A1 (en) Scaling audio sources in extended reality systems
US11122386B2 (en) Audio rendering for low frequency effects
CN114128312B (en) Audio rendering for low frequency effects
US20250013425A1 (en) Scaling audio sources in extended reality systems within tolerances
US20250330766A1 (en) Rescaling audio sources in extended reality systems based on movement
US20250301277A1 (en) Offset for scaling audio sources in extended reality systems within tolerances
US20240274141A1 (en) Signaling for rendering tools
WO2025199240A1 (en) Offset for scaling audio source positions in extended reality systems within tolerances

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant