The present application claims priority to U.S. application Ser. No. 16/714,468, filed on December 13, 2019, which claims the benefit of Greek application Ser. No. 20190100269, filed on June 20, 2019, each of which is incorporated by reference in its entirety.
Detailed Description
There are various "surround sound" channel-based formats on the market. They range, for example, from the 5.1 home theater system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by the Japanese broadcaster NHK (Nippon Hoso Kyokai). A content creator (e.g., a Hollywood studio) would like to produce the soundtrack for a movie once, and not expend effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has promulgated standards that allow a sound field to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonic (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether in locations defined by various standards or in non-uniform locations.
One such MPEG standard is the MPEG-H 3D Audio standard, formally entitled "Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29 on July 25, 2014, with document identifier ISO/IEC DIS 23008-3. MPEG also issued a second edition of the 3D Audio standard, likewise entitled "Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29 on October 12, 2016, with document identifier ISO/IEC 23008-3:201x (E). References to the "3D Audio standard" in this disclosure may refer to one or both of the above standards.
As described above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a sound field using SHC:

$$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty} \left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t},$$

The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the sound field, at time $t$, can be represented uniquely by the SHC $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (~343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order $n$ and suborder $m$. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$) which can be approximated through various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
The SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the sound field. The SHC (which may also be referred to as higher-order ambisonic (HOA) coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving (1+4)^2 (25, and hence fourth order) coefficients may be used.
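As a quick illustration of how the coefficient count grows with order, the following sketch (the helper name is hypothetical, not from the disclosure) computes the number of SHC for a full representation of a given order:

```python
def num_shc(order: int) -> int:
    """Number of SHC for a full ambisonic representation of the given
    order: one coefficient per spherical basis function Y_n^m with
    0 <= n <= order and -n <= m <= n, i.e. (order + 1)**2."""
    return (order + 1) ** 2

# A fourth-order representation uses (1 + 4)**2 = 25 coefficients.
assert num_shc(4) == 25
```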
As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics," J. Audio Eng. Soc., Vol. 53, No. 11, November 2005, pp. 1004-1025.
To illustrate how the SHC may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the sound field corresponding to an individual audio object may be expressed as:

$$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),$$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, multiple PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
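The equation above can be sketched numerically as follows. This is a minimal illustration, assuming SciPy's spherical-harmonic convention (`sph_harm(m, n, azimuth, polar)`); all function and argument names are hypothetical:

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn

def spherical_hankel2(n, x):
    # Spherical Hankel function of the second kind:
    # h_n^(2)(x) = j_n(x) - i * y_n(x).
    return spherical_jn(n, x) - 1j * spherical_yn(n, x)

def point_source_shc(order, g, k, r_s, theta_s, phi_s):
    """Evaluate A_n^m(k) = g * (-4*pi*i*k) * h_n^(2)(k*r_s) * conj(Y_n^m)
    for all (n, m) up to the given order, returning the (order + 1)**2
    coefficients in the usual (n, m) ordering."""
    coeffs = []
    for n in range(order + 1):
        h = spherical_hankel2(n, k * r_s)
        for m in range(-n, n + 1):
            y = sph_harm(m, n, phi_s, theta_s)
            coeffs.append(g * (-4j * np.pi * k) * h * np.conj(y))
    return np.array(coeffs)
```

Because the decomposition is linear, summing `point_source_shc` results over several objects yields the SHC of the combined sound field.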
Scene-based audio formats, such as the SHC referred to above (which may also be referred to as higher-order ambisonic coefficients or "HOA coefficients"), represent one way of representing a sound field. Other possible formats include channel-based audio formats and object-based audio formats. Channel-based audio formats refer to the 5.1 surround sound format, the 7.1 surround sound format, the 22.2 surround sound format, or any other channel-based format that localizes audio channels to particular locations around the listener in order to recreate the sound field.
An object-based audio format may refer to a format in which audio objects, which are typically encoded using Pulse Code Modulation (PCM) and are referred to as PCM audio objects, are specified in order to represent a sound field. Such audio objects may include metadata that identifies the position of the audio object relative to a listener or other reference point in the sound field such that the audio object may be rendered to one or more speaker channels for playback in an effort to reconstruct the sound field. The techniques described in this disclosure may be applied to any of the foregoing formats, including scene-based audio formats, channel-based audio formats, object-based audio formats, or any combination thereof.
FIG. 1 is a block diagram illustrating an example system that may perform aspects of the techniques described in this disclosure. As shown in the example of fig. 1, system 10 includes a source device 12 and a content consumer device 14. Although described in the context of source device 12 and content consumer device 14, the techniques may be implemented in any context in which audio data is used to reproduce a sound field. Furthermore, source device 12 may represent any form of computing device capable of generating a representation of a sound field and is generally described herein in the context of being a content creator device. Likewise, the content consumer device 14 may represent any form of computing device capable of implementing the audio rendering techniques described in this disclosure as well as audio playback, and is generally described herein in the context of being an audio/video (a/V) receiver.
Source device 12 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by an operator of a content consumer device, such as content consumer device 14. In some cases, source device 12 may generate audio content in conjunction with video content, although such a situation is not depicted in the example of fig. 1 for ease of illustration. The source device 12 comprises a content capture device 300, a content editing device 304 and a sound field representation generator 302. The content capture device 300 may be configured to engage with or otherwise communicate with the microphone 5.
The microphone 5 may represent a microphone capable of capturing a sound field and representing that sound field as audio data 11, or other types of 3D audio microphones, where the audio data 11 may refer to one or more of the above-noted scene-based audio data (such as HOA coefficients), object-based audio data, and channel-based audio data. Although described as a 3D audio microphone, the microphone 5 may also represent other types of microphones (such as omni-directional microphones, spot microphones, unidirectional microphones, etc.) configured to capture the audio data 11.
In some examples, the content capture device 300 may include an integrated microphone 5 that is integrated into the housing of the content capture device 300. The content capture device 300 may interface wirelessly or via a wired connection with the microphone 5. Rather than capturing, or in conjunction with capturing, the audio data 11 via the microphone 5, the content capture device 300 may process the audio data 11 after the audio data 11 is input via some type of removable storage, wirelessly, and/or via wired input processes. As such, various combinations of the content capture device 300 and the microphone 5 are possible in accordance with this disclosure.
The content capture device 300 may also be configured to interface with or otherwise communicate with the content editing device 304. In some examples, the content capture device 300 may include a content editing device 304 (which may represent software or a combination of software and hardware in some examples, including software executed by the content capture device 300 to configure the content capture device 300 to perform a particular form of content editing). The content editing device 304 may represent a unit configured to edit or otherwise change the content 301 including the audio data 11 received from the content capturing device 300. The content editing device 304 may output the edited content 303 and/or associated metadata 305 to the sound field representation generator 302.
The sound field representation generator 302 may comprise any type of hardware device capable of interfacing with the content editing device 304 (or the content capturing device 300). Although not shown in the example of fig. 1, the sound field representation generator 302 may generate one or more bitstreams 21 using edited content 303 including audio data 11 and/or metadata 305 provided by a content editing device 304. In the example of fig. 1 focusing on audio data 11, sound field representation generator 302 may generate one or more representations of the same sound field represented by audio data 11 to obtain bitstream 21 including a representation of the sound field and/or audio metadata 305.
For example, to generate different representations of the sound field using HOA coefficients (which again is one example of the audio data 11), the sound field representation generator 302 may use a coding scheme for ambisonic representations of the sound field, referred to as mixed-order ambisonics (MOA), as discussed in more detail in U.S. application Ser. No. 15/672,058, entitled "Mixed-Order Ambisonics (MOA) Audio Data for Computer-Mediated Reality Systems," filed in August 2017, and published as U.S. patent publication No. 2019/0007781 on January 3, 2019.
To generate a particular MOA representation of the sound field, the sound field representation generator 302 may generate a partial subset of the full set of HOA coefficients. For instance, each MOA representation generated by the sound field representation generator 302 may provide increased precision with respect to some areas of the sound field, but less precision in other areas. In one example, an MOA representation of the sound field may include eight (8) uncompressed HOA coefficients, while a third-order HOA representation of the same sound field may include sixteen (16) uncompressed HOA coefficients. As such, each MOA representation of the sound field that is generated as a partial subset of the HOA coefficients may be less storage-intensive and less bandwidth-intensive (if and when transmitted as part of the bitstream 21 over the illustrated transmission channel) than the corresponding third-order HOA representation of the same sound field generated from the HOA coefficients.
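One way such a partial subset could be selected is sketched below. The "third-order horizontal / first-order full" mix is assumed here purely for illustration (it yields the 8-of-16 split from the example above); the disclosure does not mandate a particular selection:

```python
def moa_index_subset(horizontal_order: int, full_order: int):
    """Indices, in the usual (n, m) ordering, of a mixed-order subset:
    every coefficient with order n <= full_order, plus only the
    horizontal coefficients (|m| == n) up to horizontal_order."""
    indices, i = [], 0
    for n in range(horizontal_order + 1):
        for m in range(-n, n + 1):
            if n <= full_order or abs(m) == n:
                indices.append(i)
            i += 1
    return indices

# Third-order horizontal / first-order full keeps 8 of the 16
# third-order HOA coefficients.
assert len(moa_index_subset(3, 1)) == 8
```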
Although described with respect to an MOA representation, the techniques of this disclosure may also be performed with respect to a full-order ambisonic (FOA) representation, in which all of the HOA coefficients for a given order N are used to represent the sound field. In other words, rather than representing the sound field using a partial, non-zero subset of the HOA coefficients, the sound field representation generator 302 may represent the sound field using all of the HOA coefficients for a given order N, resulting in a total number of HOA coefficients equal to (N+1)^2.
In this respect, the higher-order ambisonic audio data (which is another way of referring to HOA coefficients in either an MOA representation or a FOA representation) may include higher-order ambisonic coefficients associated with spherical basis functions having an order of one or less (which may be referred to as "1st-order ambisonic audio data"), higher-order ambisonic coefficients associated with spherical basis functions having mixed orders and suborders (which may be referred to as the "MOA representation" discussed above), or higher-order ambisonic coefficients associated with spherical basis functions having an order greater than one (which is referred to above as the "FOA representation").
In some examples, the content capture device 300 or the content editing device 304 may be configured to communicate wirelessly with the sound field representation generator 302. In some examples, the content capture device 300 or the content editing device 304 may be in communication with the sound field representation generator 302 via one or both of a wireless connection or a wired connection. Via a connection between the content capture device 300 and the sound field representation generator 302, the content capture device 300 may provide various forms of content, which for discussion purposes is described herein as being part of the audio data 11.
In some examples, the content capture device 300 may utilize aspects of the sound field representation generator 302 (in terms of the hardware or software capabilities of the sound field representation generator 302). For example, the sound field representation generator 302 may include dedicated hardware configured to (or specialized software that, when executed, causes one or more processors to) perform psychoacoustic audio encoding, such as a unified speech and audio coder denoted as "USAC" set forth by the Moving Picture Experts Group (MPEG), or the MPEG-H 3D Audio coding standard. The content capture device 300 may not include the psychoacoustic-audio-encoder-specific hardware or specialized software, and instead may provide the audio aspects of the content 301 in a non-psychoacoustic-audio-coded form. The sound field representation generator 302 may assist in the capture of the content 301 at least in part by performing psychoacoustic audio encoding with respect to the audio aspects of the content 301.
The soundfield representation generator 302 may also facilitate content capture and transmission by generating one or more bitstreams 21 based at least in part on audio content (e.g., MOA representations and/or third order HOA representations) generated from the audio data 11 (where the audio data 11 includes scene-based audio data). The bitstream 21 may represent a compressed version of the audio data 11 and any other different types of content 301 (e.g., compressed versions of spherical video data, image data, or text data).
As one example, the sound field representation generator 302 may generate the bitstream 21 for transmission over a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the audio data 11 and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. In some instances, the bitstream 21 representing the compressed version of the audio data 11 (which again may represent scene-based audio data, object-based audio data, channel-based audio data, or combinations thereof) may conform to bitstreams produced in accordance with the MPEG-H 3D Audio coding standard.
The content consumer device 14 may be operated by an individual and may represent an A/V receiver client device. Although described with respect to an A/V receiver client device (which may also be referred to as an "A/V receiver," "AV receiver," or "AV receiver client device"), the content consumer device 14 may represent other types of devices, such as a virtual reality (VR) client device, an augmented reality (AR) client device, a mixed reality (MR) client device, a laptop computer, a desktop computer, a workstation, a cellular phone (including so-called "smartphones"), a television, a dedicated gaming system, a handheld gaming system, a smart speaker, a head unit (such as an infotainment or entertainment system for an automobile or other vehicle), or any other device capable of performing audio rendering with respect to the audio data 15. As shown in the example of fig. 1, the content consumer device 14 includes an audio playback system 16, which may refer to any form of audio playback system capable of rendering the audio data 15 for playback as multi-channel audio content.
Although shown in the example of fig. 1 as being sent directly to the content consumer device 14, the source device 12 may output the bitstream 21 to an intermediate device located between the source device 12 and the content consumer device 14. The intermediate device may store the bitstream 21 for later delivery to the content consumer device 14 that may request the bitstream. The intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 21 for later retrieval by an audio decoder. The intermediate device may reside in a content delivery network that is capable of streaming (and possibly in combination with transmitting a corresponding video data bit stream) the bit stream 21 to a subscriber, such as the content consumer device 14, requesting the bit stream 21.
Alternatively, the source device 12 may store the bitstream 21 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc, or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media. In this context, the transmission channel may refer to the channels by which content stored to the media is transmitted (and may include retail stores and other store-based delivery mechanisms). In any event, the techniques of this disclosure should not therefore be limited in this respect to the example of fig. 1.
As described above, the content consumer device 14 includes the audio playback system 16. Audio playback system 16 may represent any system capable of playing back multi-channel audio data. The audio playback system 16 may include a plurality of different renderers 22. The renderers 22 may each provide different forms of rendering, where the different forms of rendering may include one or more of various ways of performing Vector Base Amplitude Panning (VBAP) and/or one or more of various ways of performing sound field synthesis. As used herein, "a and/or B" refers to "a or B", or both "a and B".
The audio playback system 16 may also include an audio decoding device 24. The audio decoding device 24 may represent a device configured to decode the bitstream 21 to output the audio data 15. Again, the audio data 15 may include: scene-based audio data that, in some examples, may form a full second-order or higher-order HOA representation, or a subset thereof that forms an MOA representation of the same sound field; decompositions thereof, such as the predominant audio signal, ambient HOA coefficients, and the vector-based signal described in the MPEG-H 3D Audio coding standard; or other forms of scene-based audio data. As such, the audio data 15 may be similar to a full set or a partial subset of the audio data 11, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel.
The audio data 15 may comprise channel-based audio data as an alternative to or in combination with scene-based audio data. The audio data 15 may comprise object-based audio data as an alternative to or in combination with scene-based audio data. As such, the audio data 15 may include any combination of scene-based audio data, object-based audio data, and channel-based audio data.
After the audio decoding device 24 has decoded the bitstream 21 to obtain the audio data 15, the audio renderers 22 of the audio playback system 16 may render the audio data 15 to output the speaker feeds 25. The speaker feeds 25 may drive one or more speakers (which are not shown in the example of fig. 1 for ease of illustration purposes). The various audio representations of a sound field, including scene-based audio data (and possibly channel-based audio data and/or object-based audio data), may be normalized in a number of ways, including N3D, SN3D, FuMa, N2D, or SN2D.
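These normalization conventions differ only in per-order scale factors; for instance, SN3D and N3D coefficients differ by a factor of sqrt(2n + 1) at order n. A minimal sketch of that rescaling (the helper name is assumed for illustration):

```python
import numpy as np

def sn3d_to_n3d(coeffs):
    """Rescale SN3D-normalized HOA coefficients (in the usual (n, m)
    ordering) to N3D by multiplying each order-n coefficient by
    sqrt(2n + 1). The length of `coeffs` must be a perfect square."""
    out = np.asarray(coeffs, dtype=float).copy()
    order = int(round(len(out) ** 0.5)) - 1
    i = 0
    for n in range(order + 1):
        for _ in range(2 * n + 1):
            out[i] *= np.sqrt(2 * n + 1)
            i += 1
    return out
```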
To select, or in some instances generate, an appropriate renderer, the audio playback system 16 may obtain speaker information 13 indicative of a number of speakers (e.g., loudspeakers or headphone speakers) and/or a spatial geometry of the speakers. In some instances, the audio playback system 16 may obtain the speaker information 13 by using a reference microphone and driving the speakers in a manner that dynamically determines the speaker information 13. In other instances, or in conjunction with the dynamic determination of the speaker information 13, the audio playback system 16 may prompt a user to interface with the audio playback system 16 and input the speaker information 13.
Audio playback system 16 may select one of audio renderers 22 based on speaker information 13. In some examples, when none of the audio renderers 22 is within a certain threshold similarity measure (in terms of speaker geometry) to the speaker geometry specified in the speaker information 13, the audio playback system 16 may generate one of the audio renderers 22 based on the speaker information 13. In some examples, audio playback system 16 may generate one of audio renderers 22 based on speaker information 13 without first attempting to select an existing one of audio renderers 22.
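A sketch of that selection logic follows; the geometry representation (unit direction vectors per speaker), the function name, and the 10-degree threshold are all assumptions made for illustration:

```python
import numpy as np

def select_renderer(renderers, speaker_dirs, threshold_deg=10.0):
    """Pick the stored renderer whose assumed speaker geometry best
    matches the measured one; return None (meaning: generate a new
    renderer from the speaker information) when no stored geometry is
    within the angular threshold. `renderers` maps a renderer name to
    an (N, 3) array of unit direction vectors, one per speaker."""
    best_name, best_err = None, float("inf")
    for name, dirs in renderers.items():
        if dirs.shape != speaker_dirs.shape:
            continue  # differing speaker counts cannot match
        # Mean angular error between corresponding speaker directions.
        cos = np.clip(np.sum(dirs * speaker_dirs, axis=1), -1.0, 1.0)
        err = np.degrees(np.arccos(cos)).mean()
        if err < best_err:
            best_name, best_err = name, err
    return best_name if best_err <= threshold_deg else None
```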
When outputting speaker feeds 25 to headphones, audio playback system 16 may utilize one of the renderers 22 that provides binaural rendering using head-related transfer functions (HRTFs) or other functions that can be rendered to left and right speaker feeds 25 for headphone speaker playback, such as a Binaural Room Impulse Response (BRIR) renderer. The term "speaker" or "transducer" may generally refer to any speaker, including a loudspeaker, a headset speaker, a bone conduction speaker, an ear bud speaker, a wireless headset speaker, and the like. One or more speakers may then play back the rendered speaker feeds 25.
Although described as rendering the speaker feeds 25 from the audio data 15, reference to rendering of the speaker feeds 25 may refer to other types of rendering, such as rendering incorporated directly into decoding of the audio data 15 from the bitstream 21. An example of alternative rendering can be found in Annex G of the MPEG-H 3D Audio standard, where rendering occurs during the predominant signal formation and the background signal formation, prior to sound field synthesis. As such, reference to rendering of the audio data 15 should be understood to refer to rendering of either the actual audio data 15 or decompositions or representations thereof (such as the above-noted predominant audio signal, the ambient HOA coefficients, and/or the vector-based signal, which may also be referred to as a V-vector).
As described above, the audio data 11 may represent a sound field that includes what are referred to as low frequency effects (LFE) components, which may also be referred to as bass, below a certain threshold frequency, such as 200 Hz, 150 Hz, 120 Hz, or 100 Hz. Audio data conforming to some audio formats, such as the channel-based audio formats, may include a dedicated LFE channel (which is often denoted by the "point one" — "X.1" — meaning a single dedicated LFE channel alongside X primary channels, such as the center, front left, front right, back left, and back right primary channels when X equals five, with "X.2" referring to two dedicated LFE channels, and so on).
Audio data conforming to the object-based audio formats may define one or more audio objects in the sound field along with a location for each audio object, which are then transformed into channels mapped to individual speakers, including any subwoofers when sufficient LFE components (e.g., below approximately 200 Hz) are present in the sound field. The audio playback system 16 may process each audio object, performing distance measurements to identify a distance from which the LFE components originate, low-pass filtering to extract any LFE components below the threshold (e.g., 200 Hz), bass activity detection to identify the LFE components, and so on. The audio playback system 16 may then render one or more LFE speaker feeds and process the LFE speaker feeds to perform dynamic range control, the outputs of which form adjusted LFE speaker feeds.
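The low-pass filtering step could be sketched as follows. A fourth-order Butterworth design and a 200 Hz cutoff are assumptions; the text only requires extracting components below the threshold:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def extract_lfe(signal, sample_rate, cutoff_hz=200.0, order=4):
    """Isolate the LFE component of a channel or rendered object by
    low-pass filtering below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="low", fs=sample_rate,
                 output="sos")
    return sosfilt(sos, signal)

fs = 8000
t = np.arange(fs) / fs
bass = extract_lfe(np.sin(2 * np.pi * 60 * t), fs)    # largely passes
mids = extract_lfe(np.sin(2 * np.pi * 1000 * t), fs)  # strongly cut
```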
Audio data conforming to the scene-based audio formats may define the sound field as one or more higher-order ambisonic (HOA) coefficients associated with spherical basis functions having orders and suborders greater than or equal to zero. The audio playback system 16 may render the HOA coefficients to speaker feeds positioned equidistantly about a sphere centered on the sweet spot (which is another way of referring to the intended listening position), i.e., at the so-called Fliege-Maier points. The audio playback system 16 may process each rendered speaker feed in a manner similar to that described above with respect to audio data conforming to the object-based formats, forming the adjusted LFE speaker feeds.
In each instance, the audio playback system 16 may process each of the channels (provided, in the case of channel-based audio data, or rendered, in the case of scene-based audio data) and/or audio objects equally to obtain the adjusted LFE speaker feeds. Each of the channels and/or audio objects is treated equally because the human auditory system is generally considered insensitive to the directionality and shape of the LFE components of the sound field, which are typically felt (as vibrations) rather than distinctly heard, in contrast to the higher-frequency components of the sound field, which the human auditory system can clearly localize.
However, as audio playback systems have evolved to feature increasing numbers of LFE-capable speakers (which may refer to full-range speakers, such as a large center speaker, a large front-right speaker, a large front-left speaker, etc., in addition to one or more subwoofers, where two or more subwoofers are increasingly prevalent, particularly in movie theaters and other dedicated viewing and/or listening areas, such as home theaters or listening rooms), the human auditory system may perceive a lack of spatialization of the LFE components. As such, a viewer and/or listener may notice a decrease in immersion when the LFE components are not properly spatialized during rendering, where such a decrease may be detected when the associated scene being viewed does not properly match the reproduction of the LFE components.
This decrease may be further emphasized when the LFE channel is corrupted (in the case of channel-based audio data) or when no LFE channel is provided (as may be the case for object-based audio data and/or scene-based audio data). Reconstruction of the LFE channel may involve mixing all of the higher-frequency channels together (after rendering the audio objects and/or HOA coefficients to channels, when applicable) and outputting the mixed channel to the LFE-capable speakers, which may not be full band (in terms of frequency), given that the high-frequency components of the mixed channel may be muddled or otherwise reproduced inaccurately, thereby resulting in an inaccurate reproduction of the LFE components. In some examples, additional processing may be performed to reproduce the LFE speaker feeds, but such processing disregards the spatialization aspects and outputs the same LFE speaker feed to each of the LFE-capable speakers, which again may be sensed by the human auditory system as inaccurate.
In accordance with the techniques described in this disclosure, the audio playback system 16 may perform spatialized rendering of the LFE components to potentially improve reproduction of the LFE components of the sound field (e.g., below a threshold frequency of 200 Hz, 150 Hz, 120 Hz, or 100 Hz). Rather than processing all aspects of the audio data equally to obtain the LFE speaker feeds, the audio playback system 16 may analyze the audio data 15 to identify spatial characteristics associated with the LFE components and process (e.g., render) the audio data in various ways based on the spatial characteristics to potentially more accurately spatialize the LFE components within the sound field.
As shown in the example of fig. 1, the audio playback system 16 may include an LFE renderer unit 26, which may represent a unit configured to spatialize the LFE components of the audio data 15 in accordance with various aspects of the techniques described in this disclosure. In operation, the LFE renderer unit 26 may analyze the audio data 15 to identify the spatial characteristics of the LFE components of the sound field.
To identify the spatial characteristics, the LFE renderer unit 26 may generate, based on the audio data 15, a spherical heat map (which may also be referred to as an "energy map") reflecting levels of acoustic energy within the sound field for one or more frequency ranges (e.g., from 0 Hz to 200 Hz, 150 Hz, or 120 Hz). The LFE renderer unit 26 may then identify, based on the spherical heat map, the spatial characteristics of the LFE components of the sound field. For example, the LFE renderer unit 26 may identify a direction and a shape of the LFE components based on the locations in the sound field at which higher-energy LFE components are present relative to other locations within the sound field. The LFE renderer unit 26 may then process the audio data 15 based on the identified directions, shapes, and/or other spatial characteristics to render the LFE speaker feed 27.
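A minimal sketch of deriving a direction from such an energy map, assuming the LFE-band content has already been rendered to (or provided at) channels with known directions; all names and the energy-weighted-mean heuristic are illustrative:

```python
import numpy as np

def lfe_spatial_characteristics(lfe_signals, channel_dirs):
    """Build a per-direction energy map from low-pass-filtered channel
    signals (shape (C, T)) at known unit directions (shape (C, 3)),
    and return (energy_map, dominant_direction). The energy map is a
    coarse stand-in for the spherical heat map; the dominant direction
    is the energy-weighted mean of the channel directions."""
    energy = np.sum(lfe_signals ** 2, axis=1)
    total = energy.sum()
    if total == 0.0:
        return energy, None  # silent sound field: no direction
    direction = energy @ channel_dirs / total
    norm = np.linalg.norm(direction)
    return energy, (direction / norm if norm > 0 else None)
```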
LFE renderer unit 26 may then output LFE speaker feed 27 to a speaker with LFE capabilities (not shown in the example of fig. 1 for ease of illustration). In some examples, audio playback device 16 may mix LFE speaker feed 27 with one or more speaker feeds 25 to obtain a mixed speaker feed, which is then output to one or more LFE capable speakers.
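The mixing step could look like the following; the boolean capability mask and the gain parameter are assumptions about how the system flags LFE-capable speakers:

```python
import numpy as np

def mix_lfe_into_feeds(speaker_feeds, lfe_feed, lfe_capable, gain=1.0):
    """Add the spatialized LFE feed into every LFE-capable speaker
    feed; non-capable feeds pass through unchanged. speaker_feeds has
    shape (S, T), lfe_feed shape (T,), lfe_capable shape (S,)."""
    mixed = speaker_feeds.copy()
    mixed[np.asarray(lfe_capable)] += gain * lfe_feed
    return mixed
```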
In this way, various aspects of the techniques may improve operation of the audio playback system 16 itself, as potentially enabling more accurate spatialization of the LFE components within the sound field may improve immersion and thereby the overall listening experience. Furthermore, various aspects of the techniques may address instances in which the audio playback system 16 may be configured to reconstruct the LFE components of the sound field using LFE components embedded in the mid-frequency (often referred to as "mids") or high-frequency components of the audio data 15, such as when a dedicated LFE channel is corrupted or otherwise incorrectly coded. By potentially providing a more accurate reconstruction (in terms of spatialization), various aspects of the techniques may improve rendering of the LFE audio present in the mid-frequency or high-frequency components of the audio data 15.
Fig. 2 is a block diagram illustrating in more detail the LFE renderer unit shown in the example of fig. 1. As shown in the example of fig. 2, LFE renderer unit 26A represents one example of LFE renderer unit 26 shown in the example of fig. 1, where LFE renderer unit 26A includes a spatialization LFE analyzer 110, a distance measurement unit 112, a low pass filter 114, a bass activity detection unit 116, a rendering unit 118, and a Dynamic Range Control (DRC) unit 120.
The spatialization LFE analyzer 110 may represent a unit configured to identify spatial characteristics ("SC") 111 of LFE components of the sound field represented by the audio data 15. That is, the spatialization LFE analyzer 110 may obtain the audio data 15 and analyze the audio data 15 to identify the SC 111. The spatialization LFE analyzer 110 may analyze the full-frequency audio data 15 to produce a spherical heat map representing directional sound energy (which may also be referred to as level or gain) around the sweet spot. The spatialization LFE analyzer 110 may then identify the SC 111 of the LFE component of the sound field based on the spherical heat map. As described above, the SC 111 of the LFE component may include one or more directions (e.g., directions of arrival), one or more associated shapes, and the like.
The spatialization LFE analyzer 110 may generate the spherical heat map in a number of different ways depending on the format of the audio data 15. In an example of channel-based audio data, the spatialization LFE analyzer 110 may generate a spherical heat map directly from channels, where each channel is defined to reside at a different location in space (e.g., as part of a 5.1 audio format). For object-based audio data, LFE analyzer 110 may forgo generation of the spherical heat map because the object metadata may directly define the location where the associated object resides. LFE analyzer 110 may process all objects to identify which objects contribute to the LFE component of the sound field and identify SC 111 based on object metadata associated with the identified objects.
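A minimal sketch of the channel-based case, assuming a hypothetical 5.1-style layout with conventional main-channel azimuths (the layout, names, and test signals are illustrative, not taken from the disclosure):

```python
import numpy as np

# Hypothetical 5.1 main-channel azimuths in degrees (C, L, R, Ls, Rs).
AZIMUTH = {"C": 0.0, "L": 30.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0}

def lfe_heat_map(channels, sr, f_hi=120.0):
    """Per-channel low-band energy: a one-dimensional 'heat map' over the
    layout azimuths, since each channel resides at a defined location."""
    heat = {}
    for name, sig in channels.items():
        spec = np.fft.rfft(sig)
        freqs = np.fft.rfftfreq(len(sig), 1.0 / sr)
        heat[name] = float(np.sum(np.abs(spec[freqs <= f_hi]) ** 2))
    return heat

def lfe_direction(heat):
    """Azimuth of the channel carrying the most low-frequency energy."""
    return AZIMUTH[max(heat, key=heat.get)]

sr = 8000
t = np.arange(sr) / sr
chans = {
    "L": 0.1 * np.sin(2 * np.pi * 80 * t),   # weak bass on the left
    "R": 1.0 * np.sin(2 * np.pi * 80 * t),   # strong bass on the right
    "C": np.zeros(sr),
    "Ls": np.zeros(sr),
    "Rs": np.zeros(sr),
}
direction = lfe_direction(lfe_heat_map(chans, sr))
```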
Instead of, or in combination with, the above-described metadata-based identification of SC 111, the spatialization LFE analyzer 110 may transform the object audio data 15 from the spatial domain to the spherical harmonic domain, producing HOA coefficients representing each object. The spatialization LFE analyzer 110 may then mix the HOA coefficients from each object together in their entirety and transform the HOA coefficients from the spherical harmonic domain back into the spatial domain, producing the channel (or in other words, rendering the HOA coefficients into the channel). The rendered channels may be evenly spaced around a sphere surrounding the listener. The rendered channels may form the basis of a spherical heat map. The spatialization LFE analyzer 110 may perform operations similar to those described above in the instance of scene-based audio data (with reference to rendering channels from HOA coefficients, which are then used to generate a spherical heat map, which may also be referred to as an energy map).
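The object-to-spherical-harmonic round trip described above can be illustrated with first-order ambisonics (a deliberate simplification of the higher-order encoding the disclosure contemplates); the B-format convention and basic projection decode below are assumptions of the sketch:

```python
import numpy as np

def foa_encode(signal, az, el=0.0):
    """Encode a mono object at (az, el) radians into first-order ambisonics
    (B-format W/X/Y/Z) -- a simplified stand-in for HOA encoding."""
    w = signal * 1.0
    x = signal * np.cos(az) * np.cos(el)
    y = signal * np.sin(az) * np.cos(el)
    z = signal * np.sin(el)
    return np.stack([w, x, y, z])

def foa_decode(bformat, speaker_azimuths):
    """Basic projection decode of horizontal B-format onto loudspeakers
    evenly spaced on a circle; rows of the result are rendered channels."""
    w, x, y, _ = bformat
    return np.stack([0.5 * (w + x * np.cos(a) + y * np.sin(a))
                     for a in speaker_azimuths])

# Mix two objects, render to 8 evenly spaced channels, and read the
# 'heat map': the loudest rendered channel faces the stronger object.
sig = np.ones(64)
b = foa_encode(1.0 * sig, az=np.pi / 2) + foa_encode(0.2 * sig, az=-np.pi / 2)
azs = np.linspace(0, 2 * np.pi, 8, endpoint=False)
energy = np.sum(foa_decode(b, azs) ** 2, axis=1)
peak_az = azs[np.argmax(energy)]
```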
The spatialization LFE analyzer 110 may output the SC 111 to one or more of the distance measurement unit 112, the low pass filter 114, the bass activity detection unit 116, the rendering unit 118, and/or the dynamic range control unit 120. The distance measurement unit 112 may determine the distance between the location from which the LFE component originated (as indicated by the SC 111 or derived therefrom) and each LFE-capable speaker. The distance measurement unit 112 may then select the LFE-capable speaker having the smallest determined distance. When there is only a single LFE-capable speaker, LFE renderer unit 26A may not invoke the distance measurement unit 112 to calculate or otherwise determine the distances.
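The selection performed by a unit such as distance measurement unit 112 reduces to a Euclidean argmin over the speaker positions; the coordinates below are hypothetical:

```python
import numpy as np

def nearest_lfe_speaker(lfe_origin, speaker_positions):
    """Index of the LFE-capable speaker closest (Euclidean distance) to the
    point from which the LFE component originates. Positions are assumed
    (x, y, z) coordinates in meters relative to the sweet spot."""
    origin = np.asarray(lfe_origin, dtype=float)
    dists = [np.linalg.norm(origin - np.asarray(p, dtype=float))
             for p in speaker_positions]
    return int(np.argmin(dists))

speakers = [(-2.0, 1.0, 0.0), (2.0, 1.0, 0.0)]        # left and right subs
idx = nearest_lfe_speaker((1.5, 0.5, 0.0), speakers)  # right sub is closer
```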
The low-pass filter 114 may represent a unit configured to perform low-pass filtering on the audio data 15 to obtain LFE components of the audio data 15. To save processing cycles and thereby facilitate more efficient operation (with the associated benefits of lower power consumption, bandwidth utilization including memory bandwidth, etc.), the low-pass filter 114 may select only those channels (for channel-based audio data) from the directions identified by the SC 111. However, in some examples, the low pass filter 114 may apply a low pass filter to the entire audio data 15 to obtain the LFE component. The low pass filter 114 may output the LFE component to the bass activity detection unit 116.
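One way to realize such a low-pass stage is a windowed-sinc FIR filter; the design, tap count, and cutoff below are illustrative choices (120 Hz is one of the example frequencies the text mentions), not taken from the disclosure:

```python
import numpy as np

def lowpass_fir(signal, sr, cutoff=120.0, num_taps=501):
    """Isolate the LFE band with a Hamming-windowed-sinc FIR low-pass filter."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff / sr * n) * np.hamming(num_taps)
    h /= np.sum(h)                        # unity gain at DC
    return np.convolve(signal, h, mode="same")

sr = 8000
t = np.arange(sr) / sr
mixed = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 2000 * t)
lfe = lowpass_fir(mixed, sr)   # the 2 kHz component is strongly attenuated
```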
The bass activity detection unit 116 may represent a unit configured to detect whether bass is present in a given frame of the LFE component. The bass activity detection unit 116 may apply a noise floor threshold (e.g., 20 dB) to each frame of the LFE component. Although described with respect to a static threshold, the bass activity detection unit 116 may use a (time-varying) histogram to set a dynamic noise floor threshold.
When the gain (defined in dB) of the LFE component exceeds or equals the noise floor threshold, the bass activity detection unit 116 may indicate that the LFE component is active for the current frame and is to be rendered. When the gain of the LFE component is below the noise floor threshold, the bass activity detection unit 116 may indicate that the LFE component is not active for the current frame and will not be rendered. The bass activity detection unit 116 may output the indication to the rendering unit 118.
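The frame-wise activity test may be sketched as a simple RMS-level comparison against a noise floor; the threshold value below is hypothetical (the text names 20 dB only as an example figure):

```python
import numpy as np

def bass_active(lfe_frame, noise_floor_db=-60.0):
    """True when the frame's RMS level (in dB) meets or exceeds the noise
    floor threshold; inactive frames are not rendered."""
    rms = np.sqrt(np.mean(np.square(lfe_frame)))
    level_db = 20.0 * np.log10(rms + 1e-12)
    return level_db >= noise_floor_db

quiet = 1e-5 * np.ones(256)   # well below the floor: frame skipped
loud = 0.5 * np.ones(256)     # above the floor: frame rendered
```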
When the indication indicates that the LFE component is active for the current frame, rendering unit 118 may render LFE-capable speaker feed 27 based on SC 111 and speaker information 13. That is, for channel-based audio data, rendering unit 118 may weight the channels according to SC 111 to potentially emphasize the direction from which the LFE components originated in the sound field. In this way, the rendering unit 118 may apply a first weight to a first audio channel of the plurality of audio channels based on the SC 111 to obtain a first weighted audio channel, the first weight being different from a second weight applied to a second audio channel of the plurality of audio channels. The rendering unit 118 may then mix the first weighted audio channel with a second weighted audio channel obtained by applying a second weight to the second audio channel to obtain a mixed audio channel. Rendering unit 118 may then obtain one or more LFE-capable speaker feeds 27 based on the mixed audio channels.
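The weighted-mix behavior described for channel-based audio can be sketched as follows, with hypothetical weights standing in for those derived from SC 111:

```python
import numpy as np

def mix_weighted_channels(channels, weights):
    """Apply a per-channel weight and mix the weighted channels into a
    single feed; higher weights emphasize the LFE direction of arrival."""
    assert len(channels) == len(weights)
    return sum(w * ch for w, ch in zip(weights, channels))

left = np.ones(4)
right = 2.0 * np.ones(4)
# Emphasize the right channel (e.g., the LFE component arrives from the right).
feed = mix_weighted_channels([left, right], [0.2, 0.8])
```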
For object-based audio data, rendering unit 118 may adjust the object rendering matrix using SC 111 as the direction of arrival to consider the direction of arrival of the LFE component. For scene-based audio data, rendering unit 118 may again adjust a similar HOA rendering matrix using SC 111 as the direction of arrival to take into account the direction of arrival of the LFE components. Regardless of the type of audio data, rendering unit 118 may utilize speaker information 13 to determine aspects of the rendering weights/matrices (as well as any delays, intersections, etc.) to account for differences between the specified location of the speaker (e.g., in 5.1 format) and the actual location of the LFE-capable speaker.
The rendering unit 118 may perform various types of rendering, such as object-based rendering types, including vector-based amplitude panning (VBAP) and/or distance-based amplitude panning (DBAP), and/or a full ambient ambisonic-based rendering type. In instances where there is more than one LFE-capable speaker, rendering unit 118 may perform the VBAP, DBAP, and/or full ambient ambisonic-based rendering types in order to create the audible appearance of a virtual speaker positioned at the direction of arrival defined by the SC 111. That is, when the audio playback device 16 is coupled to a plurality of speakers having low frequency effect capabilities, the rendering unit 118 may be configured to process the audio data based on the SC 111 to render a first low frequency effect speaker feed and a second low frequency effect speaker feed, the first low frequency effect speaker feed being different from the second low frequency effect speaker feed. Alternatively, instead of rendering different low frequency effect speaker feeds, rendering unit 118 may perform VBAP to localize the direction of arrival of the low frequency effect component.
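A minimal sketch of three-dimensional VBAP gain computation, assuming unit speaker vectors forming a triplet (the triplet below is chosen for illustration; a real layout would use measured speaker directions):

```python
import numpy as np

def vbap_gains(direction, triplet):
    """Solve L^T g = p for the loudspeaker triplet's gains, where rows of
    L are speaker unit vectors and p is the panning direction, then
    normalize the gains to unit L2 norm."""
    p = np.asarray(direction, dtype=float)
    L = np.asarray(triplet, dtype=float)
    g = np.linalg.solve(L.T, p)
    return g / np.linalg.norm(g)

# Panning straight at speaker 0 puts all the gain on speaker 0.
triplet = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
g = vbap_gains([1.0, 0.0, 0.0], triplet)
```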
When the indication indicates that the LFE component is not active for the current frame, rendering unit 118 may refrain from rendering the current frame. In any event, when the LFE component is indicated as being active, rendering unit 118 may output LFE-capable speaker feed 27 to Dynamic Range Control (DRC) unit 120.
The dynamic range control unit 120 may ensure that the dynamic range of the LFE-capable speaker feed 27 is kept within a maximum gain to avoid damaging the LFE-capable speakers. Since tolerances may differ on a per-speaker basis, dynamic range control unit 120 may ensure that LFE-capable speaker feed 27 remains below a maximum gain defined for each LFE-capable speaker (or automatically identified by dynamic range control unit 120 or another component within audio playback system 16). Dynamic range control unit 120 may output the adjusted LFE-capable speaker feed 27 to an LFE-capable speaker.
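The gain-limiting behavior of such a DRC stage can be sketched as a simple peak limiter; the maximum-gain parameter is a hypothetical per-speaker value:

```python
import numpy as np

def limit_gain(feed, max_gain_db=0.0):
    """Scale the feed down when its peak exceeds the speaker's maximum
    allowed level (a simple peak limiter standing in for a DRC unit)."""
    peak = np.max(np.abs(feed))
    max_lin = 10.0 ** (max_gain_db / 20.0)
    if peak <= max_lin or peak == 0.0:
        return feed
    return feed * (max_lin / peak)

hot = 2.0 * np.ones(8)    # 6 dB over a 0 dB ceiling
safe = limit_gain(hot)    # scaled so the peak sits exactly at the ceiling
```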
Fig. 3 is a block diagram illustrating another example of the LFE renderer unit shown in fig. 1 in more detail. As shown in the example of fig. 3, LFE renderer unit 26B represents one example of LFE renderer unit 26 shown in the example of fig. 1, where LFE renderer unit 26B includes the same spatialization LFE analyzer 110, distance measurement unit 112, low pass filter 114, bass activity detection unit 116, rendering unit 118, and Dynamic Range Control (DRC) unit 120 discussed above with respect to LFE renderer unit 26A. However, LFE renderer unit 26B differs from LFE renderer unit 26A in that the bass activity detection unit 116 processes audio data 15 first, potentially improving processing efficiency given that frames without bass activity are skipped, avoiding processing by the spatialization LFE analyzer 110, the distance measurement unit 112, and the low pass filter 114.
Fig. 4 is a flowchart illustrating example operations of the LFE renderer unit shown in fig. 1-3 in performing aspects of the low frequency effect rendering technique. LFE renderer unit 26 may analyze audio data 15 representing the sound field to identify SC 111 of the low frequency effects component of the sound field (200). To perform this analysis, LFE renderer unit 26 may generate a spherical heat map based on audio data 15 that represents energy surrounding a listener located in the middle of the sphere (in the sweet spot). LFE renderer unit 26 may choose to locate the direction of maximum energy, as described in more detail above.
LFE renderer unit 26 may then process the audio data based on SC 111 to render one or more low frequency effects speaker feeds (202). As discussed above with respect to the example of fig. 2, LFE renderer unit 26 may adapt rendering unit 118 to differently weight each channel (for channel-based audio data), object (for object-based audio data), and/or various HOA coefficients (for scene-based audio data) based on SC 111.
For example, if the direction of arrival defined by SC 111 indicates that the LFE component arrives primarily from the right of the listener, LFE renderer unit 26 may configure rendering unit 118 to weight the right channel higher than the left channel (or discard the left channel entirely, because it may have little or no LFE component). In the object domain, for the same example direction as in the channel case described above, LFE renderer unit 26 may configure rendering unit 118 to weight the object responsible for most of the energy (and whose metadata indicates that the object resides on the right) over objects to the listener's left (or discard the objects to the listener's left). In the context of scene-based audio data, and for the same example direction as discussed above, LFE renderer unit 26 may configure rendering unit 118 to weight the right channel rendered from the HOA coefficients over the left channel rendered from the HOA coefficients.
LFE renderer unit 26 can output low frequency effects speaker feed 27 to a speaker with low frequency effects capability (204). Although described above as generating the low frequency effects speaker feed 27 from a single type of audio data 15 (e.g., scene-based audio data), these techniques may be performed with respect to mixed format audio data in which two or more of channel-based audio data, object-based audio data, or scene-based audio data are present for the same time frame.
Fig. 5 is a block diagram illustrating example components of the content consumer device 14 shown in the example of fig. 1. In the example of fig. 5, the content consumer device 14 includes a processor 412, a Graphics Processing Unit (GPU) 414, a system memory 416, a display processor 418, one or more integrated speakers 105, a display 103, a user interface 420, and a transceiver module 422. In the example where the content consumer device 14 is a mobile device, the display processor 418 is a Mobile Display Processor (MDP). In some examples, such as examples where content consumer device 14 is a mobile device, processor 412, GPU 414, and display processor 418 may be formed as an Integrated Circuit (IC).
For example, an IC may be considered a processing chip within a chip package, and may be a system on a chip (SoC). In some examples, two of processor 412, GPU 414, and display processor 418 may be housed together in the same IC, and the other housed in a different integrated circuit (i.e., a different chip package), or all three may be housed in a different IC or on the same IC. However, in examples where content consumer device 14 is a mobile device, processor 412, GPU 414, and display processor 418 may all be housed in different integrated circuits.
Examples of processor 412, GPU 414, and display processor 418 include, but are not limited to, fixed function and/or programmable processing circuits such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Processor 412 may be a Central Processing Unit (CPU) of content consumer device 14. In some examples, GPU 414 may be dedicated hardware including integrated and/or discrete logic circuitry that provides GPU 414 with a large amount of parallel processing capability suitable for graphics processing. In some examples, GPU 414 may also include general purpose processing capabilities, and may be referred to as a General Purpose GPU (GPGPU) when implementing general purpose processing tasks (i.e., non-graphics related tasks). The display processor 418 may also be specialized integrated circuit hardware designed to retrieve image content from the system memory 416, compose the image content into image frames, and output the image frames to the display 103.
The processor 412 may execute various types of applications 20. Examples of applications 20 include web browsers, email applications, spreadsheets, video games, other applications that generate visual objects for display, or any of the application types listed in more detail above. The system memory 416 may store instructions for executing the application 20. Executing one of the applications 20 on the processor 412 causes the processor 412 to generate graphics data of the image content to be displayed and audio data 21 to be played (possibly via the integrated speaker 105). Processor 412 may send graphics data of the image content to GPU 414 for further processing based on instructions or commands sent by processor 412 to GPU 414.
The processor 412 may communicate with the GPU 414 according to a particular Application Programming Interface (API). Examples of such APIs include the DirectX API and the Khronos Group's OpenGL or OpenCL APIs; however, aspects of the present disclosure are not limited to the DirectX, OpenGL, or OpenCL APIs, and may be extended to other types of APIs. Furthermore, the techniques described in this disclosure need not function according to an API, and the processor 412 and GPU 414 may communicate using any technique.
The system memory 416 may be memory of the content consumer device 14. The system memory 416 may include one or more computer-readable storage media. Examples of system memory 416 include, but are not limited to, Random Access Memory (RAM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other medium that can be used to carry or store desired program code in the form of instructions and/or data structures and that can be accessed by a computer or processor.
In some examples, system memory 416 may include instructions that cause processor 412, GPU 414, and/or display processor 418 to perform the functions attributed to processor 412, GPU 414, and/or display processor 418 in this disclosure. Accordingly, system memory 416 may be a computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors (e.g., processor 412, GPU 414, and/or display processor 418) to perform various functions.
The system memory 416 may include non-transitory storage media. The term "non-transitory" indicates that the storage medium is not embodied in a carrier wave or propagated signal. However, the term "non-transitory" should not be construed to mean that the system memory 416 is not removable or that its contents are static. As one example, the system memory 416 may be removed from the content consumer device 14 and moved to another device. As another example, memory substantially similar to system memory 416 may be inserted into content consumer device 14. In some examples, a non-transitory storage medium may store data (e.g., in RAM) that may change over time.
The user interface 420 may represent one or more hardware or virtual (meaning a combination of hardware and software) user interfaces through which a user may interface with the content consumer device 14. The user interface 420 may include physical buttons, switches, triggers, lights, or virtual versions thereof. The user interface 420 may also include a physical or virtual keyboard, such as a touch interface of a touch screen, haptic feedback, or the like.
Processor 412 may include one or more hardware units (including so-called "processing cores") configured to perform all or some of the operations discussed above with respect to LFE renderer unit 26 of fig. 1. Transceiver module 422 may represent one or more receivers and one or more transmitters capable of wireless communication according to one or more wireless communication protocols.
It may be appreciated that certain acts or events of any of the techniques described herein can be performed in a different order, may be added, combined, or omitted altogether, depending on the example (e.g., not all of the described acts or events are necessary for the practice of the technique). Further, in some examples, an action or event may be performed concurrently, e.g., by multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.
In some examples, an a/V device (or AV and/or streaming device) may transmit an exchange message to an external device using a network interface coupled to a memory of the AV/streaming device, where the exchange message is associated with a plurality of available representations of the sound field. In some examples, an a/V device may receive a wireless signal including data packets, audio packets, video packets, or transmission protocol data associated with a plurality of available representations of a sound field using an antenna coupled to a network interface. In some examples, one or more microphone arrays may capture a sound field.
In some examples, the plurality of available representations of the soundfield stored to the memory device may include a plurality of object-based representations of the soundfield, a higher order full ambient ambisonic representation of the soundfield, a mixed order full ambient ambisonic representation of the soundfield, a combination of the object-based representations of the soundfield and the higher order full ambient ambisonic representation of the soundfield, a combination of the object-based representations of the soundfield and the mixed order full ambient ambisonic representation of the soundfield, or a combination of the mixed order representation of the soundfield and the higher order full ambient ambisonic representation of the soundfield.
In some examples, one or more of the representations of the sound field in the plurality of available representations of the sound field may include at least one high resolution region and at least one lower resolution region, and the representation selected based on the steering angle may provide greater spatial precision with respect to the at least one high resolution region and lesser spatial precision with respect to the lower resolution region.
In one or more examples, the described functionality may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a communication medium including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for use in implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Further, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used in the present application, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor" as used herein may refer to any one of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described herein may be provided in dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques but do not necessarily require realization by different hardware units. Rather, as noted above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units including one or more processors as described above, in combination with appropriate software and/or firmware.
Various examples have been described. These and other examples are within the scope of the following claims.