WO2018026963A1 - Head-trackable spatial audio for headphones and system and method for head-trackable spatial audio for headphones - Google Patents
Head-trackable spatial audio for headphones and system and method for head-trackable spatial audio for headphones
- Publication number
- WO2018026963A1 (PCT/US2017/045176)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- channel
- value
- tracking
- channel audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H04S7/304—For headphones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/03—Application of parametric coding in stereophonic audio systems
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
Systems and methods to create and stream head-trackable spatial audio for 360 video and virtual reality applications for playback over headphones. A plurality of HRTF filters, a plurality of convolution engines for instantiating said HRTF filters, an audio bussing matrix for routing one or more audio signals to the convolution engines, a post convolution engine rendered multi-channel audio output, a multi-channel audio stream, a multi-channel decoder post audio stream, and a post audio decode multi-channel audio head-tracking renderer comprise components of the system. A plurality of audio outputs are summed into multiple multi-perspective stereo audio files which are bitstream encoded and streamed to an end user device. At the end user device, the audio bitstream is decoded and the head-tracking renderer renders the audio files for head-tracking sound reproduction to the user based on positional data provided by the end user device or headset.
Description
HEAD-TRACKABLE SPATIAL AUDIO FOR HEADPHONES AND SYSTEM AND METHOD FOR HEAD-TRACKABLE SPATIAL AUDIO FOR HEADPHONES
Field of the Invention
The present invention relates generally to creating and streaming head-trackable spatial audio for headphones.
Background
Currently there are very few solutions for
streaming spatial audio for 360 video and virtual reality applications. Some of these solutions attempt to create and stream a stereo spatial representation of a target environment, but these solutions fail to meet the needs of the industry because they are unable to create multiple perspectives that can be head-tracked by the end user. Other solutions attempt to create and stream a multi-channel audio stream to be spatialized at the other end of the stream, but these solutions are similarly unable to meet the needs of the industry because of audio file synchronization issues at the decode stage. Still other solutions attempt to create and stream a multi-channel audio stream to be spatialized at the other end of the
stream, but these solutions also fail to meet industry needs because of the high mobile application
processing power required to process the audio
spatially at the end of the stream. It would be
desirable to have a spatial audio creation
architecture with associated software for a simplified means to both create and stream spatial audio, and to also render that spatial audio after the stream to provide the correct creator-intended spatial
perspective to a user based on a user's head position. Furthermore, it would also be desirable to have a system and software that combines all of the steps in the spatial audio creation phase including the
preparation for streaming that content. Still further, it would be desirable to have a system and software that allows for the post stream audio decode and rendering of that audio for head-tracking based on the user's head position. Therefore, there currently exists a need in the industry for a system that provides a complete solution for both the creation of multi-perspective head-trackable spatial audio, and the delivery, decode, and rendering of that audio based on an end user's positional data.
Summary
In accordance with the disclosure, the present invention advantageously fills the aforementioned deficiencies by providing a device and method for creating and streaming head-trackable spatial audio for headphones. The present invention includes a software system together with an associated computer process. The system is made up of the following components: an audio bussing matrix, convolution engines, HRTF (head-related transfer function) filters, an audio file output summing matrix, a multi-channel audio file encoder for streaming, a multi-channel audio file decoder, and a multi-channel audio file head-tracking renderer. These components are connected as follows: the audio bussing matrix is connected to the convolution engines, the HRTF filters are connected to the convolution engines, the convolution engines are connected to the audio file output summing matrix, and the audio file output summing matrix is connected to the multi-channel audio file encoder for streaming. Multi-channel audio is streamed by itself, or along with video content, to the decoder. The multi-channel audio file decoder is connected to the multi-channel audio file head-tracking renderer. The associated computer process is made up of the following executable steps: audio content is sent to a bussing matrix that delivers said content to a block of convolution engines that convolve the audio based on a set of HRTF filters loaded into the engines. The outputs of the convolution engines are sent through an audio output summing matrix which delivers multiple multi-perspective stereo audio files, which are then encoded into a multi-channel audio stream. That multi-channel audio stream is then fed (sometimes interleaved with a matching video component) into a streaming server and streamed over a network. An app or a browser on a computer or mobile electronics device then receives the broadcast stream, at which point the multi-channel audio stream is decoded and rendered for head-tracking based on positional data provided by the computer or mobile electronics device, or a 360 video or virtual reality headset, to represent the user's dynamic head position, which can be static, continuously moving, or any combination of the two.
The present invention system may also have one or more of the following optional software components on the creation side for more accurate pre-monitoring of the created content before it is streamed over a network to the end user: an external 360 video, equirectangular, or multi-view video player which synchronizes and connects to the convolution engines' multi-perspective output bus summing matrix, and allows for real-time head-tracking on the content creation side from user-directed positional input to the video player via manual mouse-style navigation, or from a remote virtual, augmented, or mixed reality headset or mobile electronics device that provides azimuth and/or elevation data based on sensors in or attached to the device. Positional data may also be referred to as X, Y, Z data, or Yaw, Pitch, and Roll.
The present invention's software system is unique when compared with other known systems and solutions in that it provides a highly efficient means to process high quality multi-perspective spatial audio on the content creation side that can be streamed over a network and rendered for head-tracking on the other end of a network stream using a highly efficient rendering engine requiring very little CPU usage.
The present invention is unique in that the overall architecture of the system is different from other known systems. More specifically, the present invention system is unique due to the presence of: (1) content creation side multi-perspective spatialization; (2) highly efficient user-side rendering for head-tracking; and (3) no need for HRTF rendering in the end-of-stream user application.
Among other things, it is an advantage of the present invention to provide software for creating and streaming head-trackable spatial audio for headphones that does not suffer from any of the problems or deficiencies associated with prior solutions.
The present invention now will be described more fully hereinafter with reference to the accompanying drawings, which are intended to be read in conjunction with both this summary, the detailed description and any preferred and/or particular embodiments
specifically discussed or otherwise disclosed. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these
embodiments are provided by way of illustration only and so that this disclosure will be thorough, complete and will fully convey the full scope of the invention to those skilled in the art.
Brief Description of the Drawings
Fig. 1 is an overview of the spatial audio system for the content creation process; and
Fig. 2 is an overview of the streaming process and end user decoder and spatial audio rendering system.
Detailed Description
The present invention is directed to a software and/or hardware system and architecture for creating and streaming head-trackable spatial audio for headphones.
Fig. 1 is an overview of the spatial audio system for the content creation process, starting with any number of (in this case four) original unprocessed, unaltered, non-spatial audio sources 3, which may have original left, right, front, and rear sound data. These are routed to the convolution engine routing bus 4 and are then distributed into the left front engine 6, the right front engine 7, the left rear engine 8, and the right rear engine 9 of the front perspective convolution engine block 5. The original audio sources are also distributed from the convolution engine routing bus 4 into the right front engine 13, the right rear engine 15, the left front engine 12, and the left rear engine 14 of the left perspective convolution engine block 11. The original audio sources are also distributed from the convolution engine routing bus 4 into the right rear engine 21, the left rear engine 20, the right front engine 19, and the left front engine 18 of the rear perspective convolution engine block 17. The original audio sources are also distributed from the convolution engine routing bus 4 into the left rear engine 26, the left front engine 24, the right rear engine 27, and the right front engine 25 of the right perspective convolution engine block 23. The stereo audio summing bus outputs 10, 16, 22, and 28 of convolution engine blocks 5, 11, 17, and 23 are then routed into and merged into a multi-channel audio output 29.
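By way of illustration, the Fig. 1 signal flow can be sketched in a few lines of Python. This is a minimal sketch, assuming NumPy/SciPy, equal-length sources, and equal-length filters; the function name, dictionary keys, and channel ordering are assumptions for illustration, not details specified in the patent.

```python
import numpy as np
from scipy.signal import fftconvolve

PERSPECTIVES = ["front", "left", "rear", "right"]   # blocks 5, 11, 17, 23
POSITIONS = ["LF", "RF", "LR", "RR"]                # engines within each block

def render_multichannel(sources, hrtfs):
    """sources: dict position -> mono signal (1-D arrays of equal length).
    hrtfs: dict (perspective, position) -> stereo impulse response,
    shape (taps, 2), all of equal length.
    Returns the merged multi-channel output 29 as a (samples, 8) array:
    the four stereo perspective mixes laid out as consecutive pairs."""
    mixes = []
    for persp in PERSPECTIVES:
        bus = None                                  # stereo summing bus (10/16/22/28)
        for pos in POSITIONS:
            ir = hrtfs[(persp, pos)]
            # Convolve the source with the left- and right-ear filters.
            out = np.stack([fftconvolve(sources[pos], ir[:, ch])
                            for ch in (0, 1)], axis=-1)
            bus = out if bus is None else bus + out
        mixes.append(bus)
    return np.concatenate(mixes, axis=1)            # multi-channel output 29
```

Each perspective block applies its own set of four binaural filters to the same four sources, so a later head rotation reduces to choosing or blending among the four pre-rendered stereo pairs.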
Fig. 2 is an overview of the streaming process and end user decoder and spatial audio rendering software, where the multi-channel audio output 29 is sent into a streaming encoder 30 and streamed over a network 31, received by an end user application 32 where the audio stream is decoded by the multi-channel audio decoder 33 and then sent into a head-tracking renderer 34 which is controlled by an end user viewing device or headset that provides xyz positional data 35, so that the appropriate audio perspective and sound field can be outputted from the head-tracking renderer 34 to the user's headphones 36.
The number of unprocessed source audio channels and the number of audio channels in the spatialized multichannel audio stream disclosed in these drawings are an example of a use case scenario. This could be as few as 4 channels of source audio to create each perspective, with an 8-channel multichannel spatialized output as the accompanying drawings illustrate, or as many or as few source channels and/or rendered multichannel audio perspectives as the bandwidth of the computer, mobile device, network, or the like permits. For example, the rendered, spatialized, multichannel audio output could contain 10, 12, 16, 24, 32, or any number of channels, delivering any number of perspectives, or perspective derivatives, including height perspectives and perspectives that are below the listener. Unprocessed source audio is not limited to being routed to a left front, right front, left rear, and right rear position as illustrated in the drawings, but could also be routed to any number of possibilities, including HRTF engines designated to reproduce height information or spatial information below a listener, or 5.1 (six-channel surround sound), 7.1 (eight-channel surround sound), 11.1 (an extension of the 5.1 surround sound format incorporating height and overhead channels to allow for placement and panning of sound on the horizontal and vertical axes), 12.1, or other surround sound configurations, and also spherical or Cartesian spatial layouts. Band-limited Low-Frequency Effects (LFE) channels can also be incorporated.
In a most complete version, the software system of the present invention is made up of the following components: a plurality of HRTF filters, a plurality of convolution engines for instantiating said HRTF filters, an audio bussing matrix for routing audio signals to the convolution engines, a post convolution engine rendered multi-channel audio output summing bus, a multi-channel audio stream, a multi-channel decoder post audio stream, and a post audio decode multi-channel audio head-tracking renderer. These components are combined to create an architecture for the system that has the following characteristics: typically, audio content creation would take place inside of a DAW (Digital Audio Workstation), and this invention would allow for additional workflow options to be added to a DAW to create multi-perspective, streamable, and head-trackable spatial audio mixes. The software components of this invention would fit both inside of a DAW, other stand-alone, or virtual (cloud) application, and also inside of a secondary remote application (a user application that receives the created content stream). The following components from this invention would fit inside the architecture of a DAW: an input bussing matrix, convolution engines, HRTF filters, an output summing bus matrix, and a multi-channel audio output. Those components are connected as follows: the input bussing matrix routes audio into blocks of convolution engines, each block representing a different head perspective, which spatialize and process the audio via the HRTF filters; the processed audio is then fed into an output bus matrix which sums the multiple convolution engine outputs from each convolution block into one or more multi-perspective multi-channel spatial audio outputs. The final multi-channel spatial audio output or outputs are then sent out of the DAW into a system where they are either directly streamed or combined with video and then streamed over a network. The second group of components in this invention live on the other side of the network stream, in a remote user-based application that can live on a computer, mobile phone, or any other mobile or non-mobile electronics device. With respect to the user-side network stream receiving application, the remaining components of this invention are listed and connected as follows: a multi-channel audio decoder separates and decodes multi-channel audio from a network stream and then sends it to a multi-channel head-tracking rendering component that renders the multi-channel audio based on user-inputted head position azimuth and/or elevation data provided by sensors in the host device, or a virtual reality headset, or manually via a mouse or touch-screen navigation input. While azimuth is employed in the illustrated example, elevation and roll data (yaw, pitch, and roll) or (X, Y, Z) data can be employed alone or in combination.
With reference again to Fig. 2, blocks 32, 33, 34, and 35 can exist in the mobile phone, with the positional data being provided by the phone's sensors and headphones 36 connected to the mobile phone; or blocks 32, 33, and 34 can exist in the computer or mobile phone/remote device, and the positional data from block 35 can be provided separately (in connection with headphones 36 in the case of a VR headset, or separately from another positional sensor or joystick/other position control device).
By way of a practical explanation, the use of one embodiment of the device to achieve HRTF filter application via fast convolution follows. In a home theatre entertainment room with a 7.1 surround sound setup, when a person watches a movie that plays back in 7.1 surround sound, the person hears sound all around them: in front, to the side, and also behind them. In order to simulate this effect over headphones, we need to first model the 7.1 speaker setup in that room. This is done by using a dummy head that simulates the shape and function of the human head and the way that people hear sound. The ears on a dummy head contain omni-directional microphones that simulate the eardrums of a person. We can then capture a binaural impulse response from each of the speaker positions by using the dummy head to record the impulse responses. After capturing the impulse responses of all of the speakers in the surround setup, we would have HRTF filters that simulate surround sound over headphones.
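The patent does not prescribe how these impulse responses are measured. One common capture approach, assumed here purely for illustration, is to play a known excitation from each speaker, record it at each ear microphone of the dummy head, and recover the impulse response by regularized frequency-domain deconvolution:

```python
import numpy as np

def estimate_ir(recorded, excitation, eps=1e-10):
    """Estimate an impulse response given recorded ~ excitation (convolved
    with) ir, solved in the frequency domain. eps is a small regularizer
    that avoids dividing by near-zero spectral bins."""
    n = len(recorded) + len(excitation) - 1
    spectrum = np.fft.rfft(recorded, n) / (np.fft.rfft(excitation, n) + eps)
    return np.fft.irfft(spectrum, n)
```

Running this once per ear for each speaker position yields the stereo (two-ear) filter for that position.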
In the case of this invention, we are creating a production workflow to allow for the creation of spatial audio (simulated headphone surround) in multiple perspectives. This can be done by creating multiple sets of the surround sound HRTF room model sets, and then creating different input routings to each set to represent a different position for each set. For example, let's say that we want to represent a front orientation, a left orientation, a rear orientation, and a right orientation. Imagine a person sitting in a surround sound home theatre listening while looking at the TV screen; the person then turns their head 90 degrees to the right and hears the dialog coming from the left because it is hitting the left ear first. Or if the person turns to look at the back wall, they hear the dialog (center and front channels) coming from behind them (because the TV is now behind them).
We can also simulate this same experience in real time over headphones by using HRTF room models and creating multiple bus routing scenarios. For instance, if we create 4 sets of the above-mentioned 7.1 HRTF surround sound room model, and then create 4 sets of 7.1 busses, each bus set representing a different position (front, left, rear, and right), we can then mix audio in one perspective but output 4 perspectives at the same time. In the simplest terms, take the center channel audio for a front perspective: through a multiple 7.1 bus matrix, that center channel is also fed to a separate set of 7.1 HRTF room models, but instead of always being routed to the actual center channel, it is routed to the left side channel to represent a simulation of the center channel arriving at the person's left ear first when they are looking 90 degrees to the right. Each set of 7.1 HRTF room models can be summed to stereo and retain all of the filters' inherent spatial attributes. In this example we are talking about 4 sets of 7.1 HRTF filters, each representing a different perspective (front, left, rear, and right), so at the end stage we could sum each 7.1 HRTF set into stereo, giving us 4 stereo audio outputs. Relating to this invention, we can then interleave these 4 stereo audio outputs into a single output or file, stream that interleaved output or file over a network, decode that on the other side of the network in a receiving application, and then render the appropriate stereo file perspective to be played in full, or in any mixed ratio combined with the other stereo audio perspectives, to create a real-time, dynamic, head-trackable spatial audio experience over headphones based on user-generated head positional data from the user's device sensors, a virtual reality headset, or any other device or input form that allows for the expression of azimuth and/or elevation information.
As one example of how to combine the 4 audio perspectives, the following computation can be used:
Where m is the magnitude of the sum output, f is the magnitude of the front perspective, r is the magnitude of the right perspective, l is the magnitude of the left perspective, b is the magnitude of the rear perspective, and a is the azimuth:

m = f·cos(a) + r·sin(a),    for 0 ≤ a ≤ π/2
m = −b·cos(a) + r·sin(a),   for π/2 ≤ a ≤ π
m = f·cos(a) − l·sin(a),    for −π/2 ≤ a ≤ 0
m = −b·cos(a) − l·sin(a),   for −π ≤ a ≤ −π/2
The azimuth is suitably provided from the
gyroscope of the VR player device, which may be a phone or VR headset or the like.
Which perspective is front and which is right is somewhat arbitrary. The above formula details the process for combining any two signals that are adjacent: front and right, right and rear, rear and left, and left and front. The variables r and b would then stand for the magnitude of one of the two perspectives in any one of those combinations. Signals that are on opposite sides from each other, for example front and rear, are not combined.
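A direct implementation of the piecewise combination, as reconstructed above, might look as follows; this is a sketch, with the gains applied per ear to the four decoded stereo perspective signals:

```python
import numpy as np

def combine_perspectives(f, r, l, b, a):
    """f, r, l, b: stereo perspective signals, shape (samples, 2).
    a: azimuth in radians, -pi <= a <= pi (0 = front, positive to the right)."""
    if 0 <= a <= np.pi / 2:
        return f * np.cos(a) + r * np.sin(a)
    if a > np.pi / 2:                        # pi/2 < a <= pi
        return -b * np.cos(a) + r * np.sin(a)
    if a >= -np.pi / 2:                      # -pi/2 <= a < 0
        return f * np.cos(a) - l * np.sin(a)
    return -b * np.cos(a) - l * np.sin(a)    # -pi <= a < -pi/2
```

At each boundary the mix collapses to a single perspective: a = 0 yields the front mix alone, and a = π/2 yields the right mix alone.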
Alternatively, a linear crossfade can be used to combine the audio perspectives.
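A hypothetical equal-gain version of that crossfade between two adjacent perspectives (the patent names the technique but not its coefficients):

```python
import numpy as np

def crossfade(near, far, a, span=np.pi / 2):
    """Equal-gain linear crossfade between two adjacent perspectives as the
    azimuth a sweeps from 0 to `span` radians."""
    g = float(np.clip(a / span, 0.0, 1.0))
    return (1.0 - g) * near + g * far
```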
Other perspectives can be incorporated, including those above and below the user, and a corresponding combination can be provided to crossfade between above and below positions if they are employed.
While the example shown uses 4 channel points (front left, front right, rear left, and rear right surround sound routing), more channel points can be employed depending on the desired effect, up to and including spherical representations (e.g., 64 channel points on a sphere, or fewer, or considerably more). Also, in the 4-channel-point case, the arrangement can be configured in a way other than front/rear/left/right, for example in a plus-sign configuration.
In accordance with the disclosure, a system and method for providing head-trackable spatial audio for use with headphones is provided, where any set of headphones can be employed to give the user an audio experience in which the audio tracks the user's head movement, unlike the prior art where audio spatialization is lost when using headphones.
While the present invention has been described above in terms of specific embodiments, it is to be understood that the invention is not limited to these disclosed embodiments. Many modifications and other embodiments of the invention will come to mind of those skilled in the art to which this invention pertains, and which are intended to be and are covered by both this disclosure and the appended claims. It is indeed intended that the scope of the invention should be determined by proper interpretation and
construction of the appended claims and their legal equivalents, as understood by those of skill in the art relying upon the disclosure in this specification and the attached drawings.
Claims
1. A method for streaming spatially adapted audio comprising:
providing multi-channel audio in a stream;
decoding the multi-channel audio stream and providing it to a headphone with spatial recombination based on an azimuth value.
2. The method according to claim 1, wherein the azimuth value is based on tracking of a position of a head of a person wearing the headphone.
3. The method according to claim 1, wherein the azimuth value is based on a positional control.
4. The method according to claim 1, wherein the azimuth value is based on positional data from a mobile device.
5. The method according to claim 1, wherein the multi-channel audio relates to multi-perspective sound values.
6. The method according to claim 5, wherein the multi-perspectives comprise left front, right front, left rear and right rear.
7. The method according to claim 5, wherein the multi-perspectives comprise 5.1 channel based surround sound positions.
8. The method according to claim 5, wherein the
multi-perspectives comprise 7.1 channel based surround sound positions.
9. The method according to claim 5, wherein the multi-perspectives comprise left front, center, right front, left side, right side, left rear, right rear, left front height, right front height, left rear height, right rear height, and LFE channel (11.1).
10. The method according to claim 5, wherein the multi-perspectives comprise a spherical channel based layout.
11. The method according to claim 1, wherein spatial recombination is based on the azimuth and an elevation value (Yaw and Pitch).
12. The method according to claim 1, wherein spatial recombination is based on the azimuth, an elevation, and a roll value (Yaw, Pitch, and Roll).
13. A method for streaming spatially adapted audio comprising:
providing multi-perspective audio inputs to plural convolution engines to apply head-related transfer function audio filters thereto;
combining the output of the plural convolution engines to provide a multi-channel audio signal;
streaming an encoded version of the multi-channel audio signal;
receiving the encoded version of the streamed multi-channel audio signal;
decoding the multi-channel audio signal to provide plural audio output signals; and
rendering the multi-channel audio signals based on a position tracking value.
14. The method according to claim 13, wherein the multi-channel audio signals are rendered to a
headphone and the position tracking value is based on a head position of a wearer of the headphone.
15. The method according to claim 13, wherein the multi-channel audio relates to multi-perspective sound values.
16. The method according to claim 15, wherein the multi-perspectives comprise left front, right front, left rear and right rear.
17. The method according to claim 15, wherein the multi-perspectives comprise 5.1 channel based surround sound positions.
18. The method according to claim 15, wherein the multi-perspectives comprise 7.1 channel based surround sound positions.
19. The method according to claim 15, wherein the multi-perspectives comprise left front, center, right front, left side, right side, left rear, right rear, left front height, right front height, left rear height, right rear height, and LFE channel (11.1).
20. The method according to claim 15, wherein the multi-perspectives comprise a spherical channel based layout.
21. The method according to claim 15, wherein the position tracking value is based on the azimuth of a user.
22. The method according to claim 15, wherein the position tracking value is based on the azimuth and an elevation value (Yaw and Pitch) of a user.
23. The method according to claim 15, wherein spatial recombination is based on the azimuth, an elevation, and a roll value (Yaw, Pitch, and Roll) of a user.
24. A system for streaming spatially adapted audio comprising:
a plurality of head-related transfer function filters;
a plurality of convolution engines for
instantiating said head-related transfer function filters;
an audio bussing matrix for routing audio signals to the convolution engines;
a post convolution engine rendered multi-channel audio output summing bus;
a multi-channel audio stream;
a multi-channel decoder post audio stream; and a post audio decode multi-channel audio position-tracking renderer.
25. The system according to claim 24, wherein the multi-channel audio signals are rendered to a headphone and the position-tracking renderer is based on tracking of a position of a head of a person wearing a headphone.
26. The system according to claim 24, wherein the multi-channel audio signals are rendered to a headphone and the position-tracking renderer is based on input from a positional control.
27. The system according to claim 24, wherein the multi-channel audio signals are rendered to a headphone and the position-tracking renderer is based on positional data from a mobile device.
28. The system according to claim 24, wherein the position-tracking renderer is based on the azimuth of a user.
29. The system according to claim 24, wherein the position-tracking renderer is based on the azimuth and an elevation value (Yaw and Pitch) of a user.
30. The system according to claim 24, wherein the position-tracking renderer is based on the azimuth, an elevation, and a roll value (Yaw, Pitch, and Roll) of a user.
31. The system according to claim 24, wherein the multi-channel audio relates to multi-perspective sound values.
32. The system according to claim 31, wherein the multi-perspectives comprise left front, right front, left rear and right rear.
33. The system according to claim 31, wherein the multi-perspectives comprise 5.1 channel based surround sound positions.
34. The system according to claim 31, wherein the multi-perspectives comprise 7.1 channel based surround sound positions.
35. The system according to claim 31, wherein the multi-perspectives comprise left front, center, right front, left side, right side, left rear, right rear, left front height, right front height, left rear height, right rear height, and LFE channel (11.1).
36. The system according to claim 31, wherein the multi-perspectives comprise a spherical channel based layout.
37. Apparatus for streaming spatially adapted audio comprising:
multi-channel audio in a stream;
a decoder receiving the multi-channel audio stream, said decoder providing the multi-channel audio to a headphone with spatial recombination based on an azimuth value.
38. The apparatus according to claim 37, further comprising a tracker monitoring a position of a head of a person wearing the headphone to provide the azimuth value.
39. The apparatus according to claim 37, further comprising a positional control to provide the azimuth value.
40. The apparatus according to claim 37, further comprising a mobile device to provide the azimuth value.
41. The apparatus according to claim 37, wherein the multi-channel audio relates to multi-perspective sound values.
42. The apparatus according to claim 41, wherein the multi-perspective sound values comprise left front, right front, left rear and right rear sound values.
43. The apparatus according to claim 41, wherein the multi-perspectives comprise 5.1 channel based surround sound positions.
44. The apparatus according to claim 41, wherein the multi-perspectives comprise 7.1 channel based surround sound positions.
45. The apparatus according to claim 41, wherein the multi-perspectives comprise left front, center, right front, left side, right side, left rear, right rear, left front height, right front height, left rear height, right rear height, and LFE channel (11.1).
46. The apparatus according to claim 41, wherein the multi-perspectives comprise a spherical channel based layout.
47. The apparatus according to claim 37, wherein the spatial recombination is further based on an elevation value (Yaw and Pitch) of a user.
48. The apparatus according to claim 37, wherein the spatial recombination is further based on an elevation and a roll value (Yaw, Pitch, and Roll) of a user.
49. Apparatus for streaming spatially adapted audio comprising:
plural convolution engines receiving multi- perspective audio inputs to apply head-related
transfer function audio filters to the audio inputs; a combiner combining the output of the plural convolution engines to provide a multi-channel audio signal;
a streaming system for streaming an encoded version of the multi-channel audio signal;
a receiver receiving the encoded version of the streamed multi-channel audio signal;
a decoder decoding the multi-channel audio signal to provide plural audio output signals; and
a renderer for rendering the multi-channel audio signals based on a position tracking value.
50. The apparatus according to claim 49, wherein the multi-channel audio signals are rendered to a headphone and the position tracking value is based on a head position of a wearer of the headphone.
51. The apparatus according to claim 49, wherein the multi-channel audio signals are rendered to a headphone and the position tracking value for the position-tracking renderer is based on input from a positional control.
52. The apparatus according to claim 49, wherein the multi-channel audio signals are rendered to a headphone and the position tracking value for the position-tracking renderer is based on positional data from a mobile device.
53. The apparatus according to claim 49, wherein the position data is based on an azimuth value of a user.
54. The apparatus according to claim 49, wherein the position data is based on an azimuth and an elevation value (Yaw and Pitch) of a user.
55. The apparatus according to claim 49, wherein the position data is based on an azimuth, an elevation, and a roll value (Yaw, Pitch, and Roll) of a user.
56. The apparatus according to claim 49, wherein the multi-channel audio relates to multi-perspective sound values.
57. The apparatus according to claim 56, wherein the multi-perspective sound values comprise left front, right front, left rear and right rear sound values.
58. The apparatus according to claim 56, wherein the multi-perspectives comprise 5.1 channel based surround sound positions.
59. The apparatus according to claim 56, wherein the multi-perspectives comprise 7.1 channel based surround sound positions.
60. The apparatus according to claim 56, wherein
the multi-perspectives comprise left front, center, right front, left side, right side, left rear, right rear, left front height, right front height, left rear height, right rear height, and LFE channel (11.1).
61. The apparatus according to claim 56, wherein the multi-perspectives comprise a spherical channel based layout.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201662370358P | 2016-08-03 | 2016-08-03 | |
| US62/370,358 | 2016-08-03 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018026963A1 (en) | 2018-02-08 |
Family
ID=61074012
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2017/045176 Ceased WO2018026963A1 (en) | 2016-08-03 | 2017-08-02 | Head-trackable spatial audio for headphones and system and method for head-trackable spatial audio for headphones |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2018026963A1 (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6021206A (en) * | 1996-10-02 | 2000-02-01 | Lake Dsp Pty Ltd | Methods and apparatus for processing spatialised audio |
| US20080187156A1 (en) * | 2006-09-22 | 2008-08-07 | Sony Corporation | Sound reproducing system and sound reproducing method |
| WO2009056956A1 (en) * | 2007-11-01 | 2009-05-07 | Nokia Corporation | Focusing on a portion of an audio scene for an audio signal |
| US20120057710A1 (en) * | 2008-08-13 | 2012-03-08 | Sascha Disch | Apparatus for determining a spatial output multi-channel audio signal |
| US20130329922A1 (en) * | 2012-05-31 | 2013-12-12 | Dts Llc | Object-based audio system using vector base amplitude panning |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10523171B2 (en) | 2018-02-06 | 2019-12-31 | Sony Interactive Entertainment Inc. | Method for dynamic sound equalization |
| US10652686B2 (en) | 2018-02-06 | 2020-05-12 | Sony Interactive Entertainment Inc. | Method of improving localization of surround sound |
| WO2020014506A1 (en) * | 2018-07-12 | 2020-01-16 | Sony Interactive Entertainment Inc. | Method for acoustically rendering the size of a sound source |
| US10887717B2 (en) | 2018-07-12 | 2021-01-05 | Sony Interactive Entertainment Inc. | Method for acoustically rendering the size of sound a source |
| US11388540B2 (en) | 2018-07-12 | 2022-07-12 | Sony Interactive Entertainment Inc. | Method for acoustically rendering the size of a sound source |
| US11304021B2 (en) | 2018-11-29 | 2022-04-12 | Sony Interactive Entertainment Inc. | Deferred audio rendering |
| CN109637550A (en) * | 2018-12-27 | 2019-04-16 | 中国科学院声学研究所 | A kind of sound source elevation angle control method and system |
| CN109637550B (en) * | 2018-12-27 | 2020-11-24 | 中国科学院声学研究所 | A sound source height angle control method and system |
| US11546715B2 (en) | 2021-05-04 | 2023-01-03 | Google Llc | Systems and methods for generating video-adapted surround-sound |
| US12108235B2 (en) | 2021-11-18 | 2024-10-01 | Surround Sync Pty Ltd | Virtual reality headset audio synchronization system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10674262B2 (en) | Merging audio signals with spatial metadata | |
| WO2018026963A1 (en) | Head-trackable spatial audio for headphones and system and method for head-trackable spatial audio for headphones | |
| US11122384B2 (en) | Devices and methods for binaural spatial processing and projection of audio signals | |
| US10251012B2 (en) | System and method for realistic rotation of stereo or binaural audio | |
| TWI517028B (en) | Audio spatialization and environment simulation | |
| JP6820613B2 (en) | Signal synthesis for immersive audio playback | |
| CN116471520A (en) | Audio device and audio processing method | |
| KR20170106063A (en) | A method and an apparatus for processing an audio signal | |
| US11032660B2 (en) | System and method for realistic rotation of stereo or binaural audio | |
| EP3506080B1 (en) | Audio scene processing | |
| JP2018110366A (en) | 3D sound image sound equipment | |
| Llorach et al. | Towards realistic immersive audiovisual simulations for hearing research: Capture, virtual scenes and reproduction | |
| KR20160061315A (en) | Method for processing of sound signals | |
| US10321252B2 (en) | Transaural synthesis method for sound spatialization | |
| US12200467B2 (en) | System and method for improved processing of stereo or binaural audio | |
| CN105682000B (en) | A kind of audio-frequency processing method and system | |
| Cuevas-Rodriguez et al. | An open-source audio renderer for 3D audio with hearing loss and hearing aid simulations | |
| Enzner et al. | Advanced system options for binaural rendering of ambisonic format | |
| CN113347530A (en) | Panoramic audio processing method for panoramic camera | |
| Suzuki et al. | 3D spatial sound systems compatible with human's active listening to realize rich high-level kansei information | |
| Pfanzagl-Cardone | HOA—higher order ambisonics (eigenmike®) | |
| KR102559015B1 (en) | Actual Feeling sound processing system to improve immersion in performances and videos | |
| Gölles et al. | Cat3DA-Camera-Tracked 3D Audio Player | |
| Howie et al. | Comparing immersive sound capture techniques optimized for acoustic music recording through binaural reproduction | |
| Paterson et al. | Producing 3-D audio |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17837634; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17837634; Country of ref document: EP; Kind code of ref document: A1 |