EP4475122A1 - Adapting spatial audio parameters for jitter buffer management - Google Patents
- Publication number: EP4475122A1
- Application number: EP23177532.1A
- Authority: EP (European Patent Office)
- Prior art keywords
- subframe
- audio signal
- signal frame
- time slots
- slot
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
Definitions
- the present application relates to apparatus and methods for adapting spatial audio metadata for the provision of jitter buffer management in immersive and spatial audio codecs.
- Parametric spatial audio capture from inputs is a typical and effective choice to estimate from the input (microphone array signals) a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
- the directions and direct-to-total and diffuse-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
- a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata (which may also include other parameters such as surround coherence, spread coherence, number of directions, distance etc) for an audio codec.
- these parameters can be estimated from microphone-array captured audio signals, and, for example, a stereo or mono signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
- Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency.
- An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR).
- This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
- the codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
- a decoder can decode the audio signals into PCM (Pulse code modulation) signals.
- the decoder can also process the sound in frequency bands (using the spatial metadata) to obtain the spatial output.
- the aforementioned immersive audio codecs are particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays).
- an encoder can have other input types, for example, loudspeaker signals, audio object signals, or Ambisonic signals.
- the decoder can output the audio in supported formats.
- the IVAS decoder is also expected to handle the encoded audio streams and accompanying spatial audio metadata as RTP packets which may arrive with varying degrees of delay as a result of network jitter conditions in a packet-based network.
- immersive audio codecs, such as 3GPP IVAS, are being planned which support a multitude of operating points ranging from a low bit rate operation to transparency. Such a codec is expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
- the example codec is configured to be able to receive multiple input formats.
- the codec is configured to obtain or receive a multi audio signal (for example, received from a microphone array, or as a multichannel audio format input, or an Ambisonics format input).
- the codec is configured to handle more than one input format at a time.
- Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
- spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction (or directional value) a direct-to-total energy ratio, spread coherence, distance, etc.) per time-frequency (TF) tile.
- the spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene.
- a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe, and, associated with each direction, direct-to-total ratios, spread coherence, distance values, etc.
- the MASA analyser 101 is configured to receive the input audio signal(s) 100 and analyse the input audio signals to generate transport audio signal(s) 102 and spatial metadata 104.
- the transport audio signal(s) 102 can be encoded, for example, using an IVAS audio core codec, or with an AAC (Advanced Audio Coding) or EVS (Enhanced Voice Services) encoder.
- MASA spatial metadata is presented in the following table. These values are available for each time-frequency tile.
- a frame is subdivided into 24 frequency bands and 4 temporal sub-frames. In other implementations other divisions of frequency and time can be employed.
- a frame size (for example, as implemented in IVAS) is 20 ms (and thus the temporal sub-frame is 5 ms).
- the MASA analyser is configured to determine 1 or 2 directions for each time-frequency tile (i.e., there are 1 or 2 sets of direction index, direct-to-total energy ratio, and spread coherence parameters for each time-frequency tile).
- the analyser is configured to generate more than 2 directions for a time-frequency tile.
- Field: Direction index; Bits: 16; Description: Direction of arrival of the sound at a time-frequency parameter interval.
- Field: Remainder-to-total energy ratio; Bits: 8; Description: Energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1. Calculated as energy of remainder sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of the number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
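- To illustrate how such a per-tile parameter set might be held in code and how a [0.0, 1.0] ratio could be stored as an 8-bit unsigned integer with uniform spacing, a minimal sketch (field names are illustrative, not taken from the IVAS specification) is:

```python
from dataclasses import dataclass

@dataclass
class MasaTileParams:
    """Illustrative per time-frequency-tile MASA parameter set (names are assumptions)."""
    direction_index: int             # 16-bit direction-of-arrival index
    direct_to_total_ratio: float     # [0.0, 1.0]
    spread_coherence: float          # [0.0, 1.0]
    remainder_to_total_ratio: float  # [0.0, 1.0]

def quantize_ratio_u8(ratio: float) -> int:
    """Store a [0.0, 1.0] ratio as an 8-bit unsigned integer with uniform spacing."""
    return round(min(max(ratio, 0.0), 1.0) * 255)

def dequantize_ratio_u8(value: int) -> float:
    """Recover the (quantized) ratio from its 8-bit representation."""
    return value / 255.0
```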
- the frame size in IVAS is 20 ms.
- An example of the frame structure is shown in Figure 2 where the metadata frame 201 comprises four temporal sub-frames which are 5 ms long, metadata sub-frame 1 202, metadata sub-frame 2 204, metadata sub-frame 3 206, and metadata sub-frame 4 208.
- the IVAS frame can also be formed as a TF-representation of a complex-valued low delay filter band (CLDFB) where each subframe comprises 4 TF-slots and in the case of the above example this equates to each slot corresponding to 1.25 ms.
- An example of the IVAS frame structure 300 corresponding to the TF-representation of the CLDFB is shown in Figure 3 .
- the MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1 and 7.1+4) or binaural signals.
- An example rendering system that can be used for MASA is described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). "Optimized covariance domain framework for time-frequency processing of spatial audio". Journal of the Audio Engineering Society, 61(6), 403-411. Broadly speaking, the rendering method determines a target covariance matrix based on the spatial metadata and the energies of the TF-tiles of the transport audio signal(s).
- the determined target covariance matrix may contain the channel energies of all channels and the inter-channel relationships between all channel pairs, in particular the cross-correlation and the inter-channel phase differences. These features are known to convey the perceptually relevant spatial features of a multichannel sound in various playback situations, such as binaural and multichannel loudspeaker audio.
- the rendering process modifies the transport audio signals in the form of TF-tiles so that the resulting signals have a covariance matrix that resembles the target covariance matrix.
- the rendered spatial audio signals (e.g., binaural signals) thus have the spatial properties as captured by the spatial metadata.
- as both the transport audio signal(s) and the spatial metadata vary in time, it is desirable that they remain in synchrony with each other. Failure to maintain synchrony may result in the production of unwanted artefacts in the output signals.
- the following scenario depicts the unwanted effects of a failure to maintain synchrony.
- the spatial metadata is mostly pointing towards the transient source (right side), because it typically has more energy than the constant noise source (left side).
- when the audio transport signal(s) and spatial audio metadata are in mutual synchrony, the transients are correctly rendered from the right, while the noise is rendered from the left within the same passage of time.
- if synchrony is lost, the transients may be rendered from the left, and the noise may be rendered from the right over a slightly skewed passage of time. This may result in the rendered sound containing strong artefacts with a consequential decrease in perceived audio quality.
- network jitter and packet loss conditions can cause degradation in quality, for example, in conversational speech services in packet networks, such as the IP networks, and mobile networks such as fourth generation (4G LTE) and fifth generation (5G) networks.
- the nature of the packet switched communications can introduce variations in the transmission times of the packets (containing frames), known as jitter, which can be seen by the receiver as packets arriving at irregular intervals.
- an audio playback device requires a constant input with no interruptions in order to maintain good audio quality.
- the decoder may have to consider those frames as lost and perform error concealment.
- a jitter buffer can be utilised to manage network jitter by storing incoming frames for a predetermined amount of time (specified, e.g., upon reception of the first packet of a stream) in order to hide the irregular arrival times and provide constant input to the decoder and playback components.
- a jitter buffer management scheme can be employed in order to dynamically control the balance between a short enough delay and a low enough number of delayed frames.
- an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
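- As a rough illustration of this control principle (not the actual IVAS/EVS jitter buffer management algorithm), a target buffering delay could be derived from recently observed packet delays along the following lines, where the margin and bounds are assumptions:

```python
def update_target_buffering_delay(delay_samples_ms: list[float],
                                  min_delay_ms: float = 20.0,
                                  max_delay_ms: float = 500.0,
                                  margin: float = 1.5) -> float:
    """Pick a buffering delay covering most of the recently observed delay jitter.

    delay_samples_ms: recent relative packet delays in milliseconds.
    Returns a target buffering delay clamped to [min_delay_ms, max_delay_ms].
    """
    if not delay_samples_ms:
        return min_delay_ms
    base = min(delay_samples_ms)           # shortest observed delay
    jitter = max(delay_samples_ms) - base  # observed jitter span
    target = jitter * margin               # keep a safety margin over the jitter
    return max(min_delay_ms, min(max_delay_ms, target))
```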
- FIG. 4 shows how the IVAS decoder may be connected to a jitter buffer management system.
- the receiver modem 401 can receive packets through a network socket such as an IP (Internet protocol) network socket which may be part of an ongoing Real-time Transport Protocol (RTP) session.
- the received packets may be pushed to a RTP depacketizer module 403, which may be configured to extract the encoded audio stream frames (payload) from the RTP packet.
- the RTP payload may then be pushed to a jitter buffer manager (JBM) 405 where various housekeeping tasks may be performed such as frame receive statistics are updated.
- the jitter buffer manager 405 may also be arranged to store the received frames.
- the jitter buffer manager 405 may be configured to pass the received frames to an IVAS decoder & renderer 407 for decoding. Accordingly, the IVAS decoder & renderer 407 passes the decoded frames back to the jitter buffer manager 405 in the form of digital samples (PCM samples). Also depicted in Figure 4 is an Acoustic player 409 which may be viewed as the module performing the playing out (or playback) of the decoded audio streams. The function performed by the Acoustic player 409 may be regarded as a pull operation in which it pulls the necessary PCM samples from the JBM buffer to provide uninterrupted audio playback of the audio streams.
- FIG 5 is a system 500 depicting the general workings and interactions of a jitter buffer manager 405 with an IVAS decoder & renderer 407.
- the jitter buffer manager 405 may comprise a jitter buffer 501, a network analyzer 502, an adaptation control logic 503 and an adaptation unit 504.
- Jitter buffer 501 is configured to temporarily store one or more audio stream frames (such as an IVAS bitstream), which are received via a (wired or wireless) network, for instance, in the form of packets 506.
- These packets 506 may for instance be RTP packets, which are unpacked by buffer 501 to obtain the audio stream frames.
- Buffer status information 508 such as, for instance, information on a number of frames contained in buffer 501, or information on a time span covered by a number of frames contained in the buffer, or a buffering time of a specific frame (such as an onset frame), is transferred between buffer 501 and adaptation control logic 503.
- Network analyzer 502 monitors the incoming packets 506 from the RTP depacketizer 403, for instance, to collect reception statistics (e.g., jitter, packet loss). Corresponding network analyzer information 507 is passed from network analyzer 502 to adaptation control logic 503.
- Adaptation control logic 503 controls buffer 501.
- This control comprises determining buffering times for one or more frames received by buffer 501 and is performed based on network analyzer information 507 and/or buffer status information 508.
- the buffering delay of buffer 501 may, for instance, be controlled during comfort noise periods, during active signal periods or in-between.
- a buffering time of an onset signal frame may be determined by adaptation control logic 503, and IVAS decoder & renderer 407 may (for instance, via adaptation unit 504, signals 509, and the signal 510 to control the IVAS decoder) then be triggered to extract this onset signal frame from buffer 501 when this determined buffering time has elapsed.
- the IVAS decoder & renderer 407 can be arranged to pass the decoded audio samples to the adaptation unit 504 via the connection 511.
- Adaptation unit 504 if necessary, shortens or extends the output audio signal according to requests given by adaptation control logic 503 to enable buffer delay adjustment in a transparent manner.
- the JBM system 500 may be required to perform time stretching/shortening in order to achieve continuous audio playback without the introduction of audio artifacts as a result of network jitter.
- time stretching/shortening can be performed after rendering to the output audio signals in a manner similarly deployed by existing audio codecs such as EVS.
- time stretching/shortening may contain pitch shifting as a part of the stretching process (especially for larger modifications to the output audio signal(s)). This may cause problems with binaural signals as the monaural cues (that allow the human hearing system to determine elevation) may become skewed, leading to erroneous perception of elevation.
- time stretching/shortening may alter inter-channel relationships (such as altering phase and/or level relationships), which again may have a detrimental effect on the perception of direction.
- the process of time stretching/shortening over many audio channels may be computationally complex.
- performing time-stretching after rendering requires the renderer to be run for multiple frames to produce one frame of output.
- performing time stretching after rendering may cause a varying motion-to-sound latency for head-tracked binaural rendering, resulting in a degradation of the perceived quality.
- the time stretching/shortening process may be performed over the spatial audio metadata and transport audio signal(s).
- this invention proceeds on the basis that the time stretching/shortening over the transport audio signal(s) has already been performed and focusses on the issues of time adapting the accompanying spatial audio metadata in order to maintain synchrony with the transport audio signal(s).
- Figure 6 is a more detailed depiction of the jitter buffer management system 500 for IVAS.
- RTP packets 601 are depicted as being received and passed to the RTP de-packer 602 which may be arranged to extract the IVAS frames 607.
- the RTP de-packer 602 may also be arranged to obtain from the RTP streams so-called RTP metadata 603 which can be used to update IP network metrics such as frame receive statistics which in turn may be used to estimate network jitter.
- RTP metadata 603 may be passed to the Network jitter analysis and target delay estimator 604 where the RTP metadata 603, comprising packet timestamp and sequence information, may be analysed to provide a target playout delay parameter 605 for use in the adaption control logic processor 606.
- the IVAS frames 607 as obtained by the de-packing process of the RTP de-packer 602 are depicted in Figure 6 as being passed to the de-jitter buffer 608.
- the de-jitter buffer 608 is arranged to store the IVAS frames 607 in a frame buffer ready for decoding by IVAS audio & metadata decoder 610.
- the de-jitter buffer 608 can also be arranged to perform frame-based delay adjustment on the stream of IVAS frames when instructed to by the adaption control logic processor 606, and also reorder the IVAS frames into a correct decoding order should they not arrive in the proper sequential order (for decoding).
- the output from the de-jitter buffer 608, in other words IVAS frames for decoding 609, may be passed to the IVAS audio & metadata decoder 610.
- the IVAS audio & metadata decoder 610 is arranged to decode the IVAS frames 609 into a decoded multichannel transport audio signal stream 611 (also referred to as the transport audio signal(s)) and a MASA spatial audio metadata stream 613 (also referred to as spatial audio metadata).
- the MASA spatial audio metadata and decoded multichannel transport audio signal streams 613 and 611 may be passed on to subsequent processing blocks so that any time sequence adjustments to the respective signals may be performed.
- in the case of the MASA spatial audio metadata, the respective stream 613 is passed to the metadata adaptor 612, and in the case of the decoded multichannel transport audio signal, the respective stream 611 is passed to the multi-channel time scale modifier (MC TSM) 614.
- the MC TSM 614 is configured to time scale modify frames of the transport audio signals 611 under the direction of the adaption control logic processor 606. Basically, the MC TSM 614 performs the time stretching or time shortening of the transport audio signal(s) in the time domain in response to a time adjustment instruction provided by adaption control logic processor 606. The time adjustment instruction may be received by the MC TSM 614 along the control line 615 from the adaption control logic processor 606.
- the output from the MC TSM 614 i.e., frames of the time-adjusted transport audio signals 621, may be passed to the renderer 616, a processing block termed the EXT output constructor 618 and the metadata adaptor 612.
- the time-adjusted transport audio signal(s) 621 is used to assist in the adaption of the spatial audio metadata so that synchrony is better maintained.
- the metadata adaptor 612 is essentially arranged to receive the spatial audio metadata parameters 613 corresponding to the frames of the transport audio signal(s) 611 that are delivered to the MC TSM 614, adapt these spatial audio metadata parameters 613 in accordance with the time adjustment instructions as provided by the adaption control logic processor 606, and maintain time synchrony with the time-adjusted transport audio signal(s) 621.
- the metadata adaptor 612 is configured to receive the time-adjusted transport audio signal(s) 621 and the time adjustment instructions from the adaption control logic processor 606 along the signal line 617.
- the metadata adaptor 612 may then be arranged to produce time-adapted spatial audio metadata which has time synchrony with the time-adjusted transport audio signals 621.
- the time adapted spatial audio metadata is depicted as the signal 623 in Figure 6 and is shown as being passed to both the renderer 616 and EXT output constructor 618.
- the renderer 616 can receive the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and render said signals into a multichannel spatial audio output signal.
- the rendering may be performed in accordance with the rendering parameters 625.
- the renderer 616 is also shown as receiving a signal from the adaption control logic processor 606 along the signal line 619.
- FIG. 6 also shows a further output processing function in the form of the EXT output constructor 618.
- This processing function simply takes the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and "packages" the signals into a single frame format suitable for outputting from a device, in other words a "spatial audio format" suitable for storage as a file type and the like.
- the purpose of the EXT output constructor 618 is to output the spatial audio format as it was decoded with minimal changes to conform to a spatial audio format specification. It can be then stored, re-encoded, mixed, or rendered with an external renderer.
- the jitter buffer management system for IVAS also comprises the adaption control logic processor 606.
- the adaption control logic processor 606 is responsible for providing the time adjustment instructions to other processing blocks in the system. This may be realised by the adaption control logic processor 606 receiving a target delay estimation/parameter 605 from the network jitter analysis and target delay estimator 604 and the current playout delay from the playout delay estimator 620 and using this information to choose the method for playout delay adjustment to reach the target playout delay. This may be provided to the various processing blocks in the form of time adjustment instructions. The various processing blocks may then each individually utilise the received time adjustment instructions to perform appropriate actions so that the audio output from the renderer 616 is played out with the correct time.
- the following functions may be configured to receive time adjustment instructions from the adaption control logic processor 606: the de-jitter buffer 608, the metadata adaptor 612, the MC TSM 614, and the renderer 616.
- the playout delay estimator 620 provides an estimate of the current playout delay to the adaption control logic processor 606 based on the information received from the de-jitter buffer 608 and the MC TSM 614.
- the metadata adaptor 612 is arranged to adjust the spatial audio metadata 613 in accordance with the time adjustment instructions (playout delay time) whilst maintaining synchrony with the time-adjusted transport audio signals 621.
- Figure 7 shows the metadata adaptor 612 according to embodiments in further detail.
- the metadata adaptor 612 takes as input the time adjustment instructions 617 from the adaption control logic processor 606. This input may then be passed to the slot to subframe mapper 702.
- the time adjustment instructions 617 may contain information pertaining to the number of subframes and hence audio time slots that are to be rendered. For the sake of brevity this information can be referred to as "slot adaptation info".
- the original IVAS frame can be divided into a number of subframes with each subframe being divided into a further number of audio slots.
- One such example comprises a 20 ms frame divided into 4 equal length subframes, with each subframe being evenly divided into 4 audio slots giving a total of 16 audio slots at 1.25 ms each.
- the "slot adaptation info" may contain a parameter giving the number of audio slots N slots present in the time-adjusted transport audio signals 621, which in turn provides the number of subframes in the signal and consequently the frame size of the time-adjusted transport audio signals 621. This information may then be used to adapt the spatial audio parameters sets which are currently time aligned with the subframes of the original IVAS frame to being time aligned with the subframes of the time-adjusted transport audio signals 621.
- original IVAS frame refers to the size of the IVAS frame before any time shortening/lengthening has taken place. So, it refers to the frames of the transport audio signal(s).
- the parameter N_slots may be different from the default number of slots in an original IVAS frame, N_slots_default, with the default number of slots being the number of slots in an original IVAS frame before the time stretching/shortening process.
- N_slots_default is 16 audio slots.
- the slot to subframe mapper 702 can be arranged to map the "original" default number of slots N_slots_default of the original IVAS frame to a different number of slots N_slots distributed across the same number of subframes as the original IVAS frame. This has the outcome of mapping the slots/subframes of the adapted IVAS frame to the standard IVAS frame. This results in a pattern of mapped slots where some of the subframes (of the original IVAS frame) have either more or fewer mapped slots depending on whether the adapted slot number N_slots is greater or less than the original number of slots N_slots_default. For instance, if N_slots < N_slots_default then the process is a waveform shortening or output play speeding-up operation, and if N_slots > N_slots_default then the process is a waveform lengthening or output play slowing-down operation.
- the slot to subframe mapper 702 may be arranged to map the N slots time slots of the adapted IVAS frame to the subframes of the original IVAS frame to produce a map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame.
- mapping of each slot associated with the time adapted transport audio signal(s) 621 (adapted IVAS frame) to a subframe of the original IVAS frame is performed on the premise that the assigned subframe (in the original IVAS frame) best matches the temporal position of the slot in the time adapted transport audio signal(s) 621 (adapted IVAS frame).
- each subframe comprises a set of spatial audio parameters.
- each group of four audio slots in the original IVAS frame is associated with the spatial audio parameter set of one of the subframes. Therefore, the consequence of slot to subframe mapping process may be viewed as associating different groups of slots with the spatial audio parameter sets of the original IVAS frame.
- Figure 8 shows an example subframe to slot mapping process when the adapted slot number N_slots is 12 and the original number of slots N_slots_default is 16.
- this Figure 8 depicts an example of waveform shortening (decreasing the playing out time).
- the relationship between slots to subframes for the original IVAS frame, where every 4 slots is mapped to a subframe is shown as 801 in Figure 8 , i.e., slots s1 to s4 are mapped to subframe 1, slots s5 to s8 are mapped to subframe 2, slots s9 to s12 are mapped to subframe 3 and slots s13 to s16 are mapped to subframe 4.
- the result of the mapping process where 12 slots are mapped to the 4 subframes of the original IVAS frame may be shown as 802 in Figure 8 .
- the slot to subframe mapping process has resulted in the first three slots (s1, s2, s3) being mapped to the first subframe.
- the fourth and fifth slots (s4, s5) have been mapped to the second subframe.
- Slots s6, s7, s8 are mapped to subframe 3.
- slots s9, s10, s11 and s12 are now mapped to subframe 4.
- the above subframe to slot mapping process can be performed by initially dividing the number of adapted slots N slots into two contiguous regions.
- the second region is made up of the remaining slots s1 to s8, i.e., the run of slots starting from the beginning of the frame and going up to slot number (N_slots − N_slots_end).
- the subframe to slot mapping process takes the N_slots_end highest-ordered slots of the adapted IVAS frame and matches each of them on an ordered one-to-one basis to the N_slots_end highest-ordered slots of the original IVAS frame, and consequently to the subframes associated with these slots.
- This processing step may be illustrated by referring to the example of Figure 8 , where the slots of the adapted IVAS frame s9, s10, s11 and s12 are mapped on a one-to-one basis to the subframe having the 4 highest slots of the original IVAS frame s13, s14, s15 and s16, i.e., subframe 4.
- M_slot-sf(m)_first is the subframe to slot mapping function which gives the subframe to slot map for the first region. This function returns the mapped subframe number (with respect to the original IVAS frame) for each slot m of the first region of slots of the adapted IVAS frame.
- N_slots_remdefault is the number of original slots remaining after the N_slots_end slots have been removed:
- N_slots_remdefault = N_slots_default − N_slots_end
- Figure 9 shows an example subframe to slot mapping process when the adapted slot number N_slots is 20 and again the original number of slots N_slots_default is 16.
- Figure 9 depicts an example of waveform extending (increasing the playing out time).
- the distribution of slots in the standard IVAS frame is shown as 901 in Figure 9 and the result of the slot to subframe mapping process where 20 slots (and hence 5 subframes of the time adapted transport audio signal(s) 621 frame (adapted IVAS frame) are mapped to the 4 subframes of the standard IVAS frame is shown as 902 in Figure 9 .
- in this example N_slots_end is 12.
- shown as 903 is the relationship between the subframes of the time adapted transport audio signals 621 frame and the N_slots slots.
- the second region of slots is therefore given by the lowest ordered contiguous run of slots from the first slot, s1, to the highest slot with the slot number N_slots − N_slots_end.
- the slot to subframe mapping process then maps this contiguous run of lowest ordered slots by distributing them across a number of the subframes, starting from the lowest numbered subframe. For instance, when N_slots > N_slots_default, i.e., a period of waveform lengthening, the second region of slots (of the adapted IVAS frame) may be distributed across the first subframe and subsequent subframes up to and including the subframe to which the lowest ordered slot from the first region is mapped. For instance, when N_slots < N_slots_default, i.e., a period of waveform shortening, the second region of slots (of the adapted IVAS frame) may be distributed across all subframes of the IVAS frame.
- M_slot-sf(m)_second is the subframe to slot mapping function which gives the subframe (of the original IVAS frame) to slot map for the second region. This function returns the mapped subframe number (of the original IVAS frame) for each slot m of the second region of slots of the adapted IVAS frame.
- the output from the slot to subframe mapper 702 is the combined slot to subframe map for both the first and second regions and may be referred to as M_slot-sf(m).
- This output is depicted as 701 in Figure 7 .
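- A minimal Python sketch of this two-region mapping, restricted to the waveform-shortening case of Figure 8 and assuming a nearest-temporal-position rounding for the second region (the exact rounding of the embodiment is not reproduced here), could look like:

```python
def slot_to_subframe_map_shortening(n_slots: int,
                                    n_slots_end: int,
                                    n_slots_default: int = 16,
                                    slots_per_subframe: int = 4) -> list[int]:
    """Map each adapted-frame slot to a subframe of the original frame (0-based).

    First region: the n_slots_end highest-ordered adapted slots map one-to-one to the
    n_slots_end highest-ordered original slots, and hence to those slots' subframes.
    Second region: the remaining adapted slots are spread over the remaining original
    slots by nearest relative temporal position (an assumed rounding rule).
    """
    mapping = [0] * n_slots
    # First region: one-to-one mapping against the end of the original frame.
    for i in range(n_slots_end):
        original_slot = n_slots_default - n_slots_end + i
        mapping[n_slots - n_slots_end + i] = original_slot // slots_per_subframe
    # Second region: distribute the leading adapted slots over the remaining
    # original slots by the relative position of each slot centre.
    n_second = n_slots - n_slots_end
    n_remdefault = n_slots_default - n_slots_end
    for m in range(n_second):
        centre = (m + 0.5) / n_second               # relative position of slot centre
        original_slot = int(centre * n_remdefault)  # nearest original slot
        mapping[m] = original_slot // slots_per_subframe
    return mapping

# Reproduces the Figure 8 example:
# slot_to_subframe_map_shortening(12, 4) -> [0,0,0, 1,1, 2,2,2, 3,3,3,3]
```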
- the slot to subframe mapper 702 may be arranged to distribute the N slots of the adapted IVAS frame across the subframes of the original IVAS frame in a different manner to the above embodiments. In these embodiments, there may be no mapping between the N slots slots of the adapted IVAS frame and the N slots_default slots of the original IVAS frame. Instead, the N slots slots of the adapted IVAS frame may be mapped directly to the subframes of the original IVAS frame using the following routine.
- N_slots_sf is equivalent to L_sf in the above embodiment, which is 4 for the standard IVAS frame.
- L_map may be the same for all subframes of the original IVAS frame.
- the N slots of the adapted IVAS frame may be distributed evenly across the subframes of the original IVAS frame.
- N.B. for the standard IVAS frame size, i_sf will take the values 1 to 4, corresponding to subframes 1 to 4 of the original IVAS frame.
- Figure 10 shows an example subframe to slot mapping process according to these embodiments where the adapted slot number N_slots is 12 and the original number of slots N_slots_default is 16.
- the relationship between slots and subframes for the original IVAS frame, where every 4 slots is mapped to a subframe, is shown as 1001 in Figure 10 .
- the result of the mapping process where 12 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1002 in Figure 10. It can be seen that the 12 slots of the adapted IVAS frame have been evenly distributed across the 4 subframes of the original IVAS frame, i.e., 3 slots per subframe.
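- The even distribution of the adapted slots over the four subframes can be sketched as follows; how the extra slots are placed when N_slots is not divisible by 4 is an assumption here (earliest subframes first):

```python
def even_slot_to_subframe_map(n_slots: int, n_subframes: int = 4) -> list[int]:
    """Distribute n_slots adapted slots as evenly as possible over n_subframes subframes.

    Returns, for each adapted slot in order, the 0-based original subframe it maps to.
    """
    base, extra = divmod(n_slots, n_subframes)
    mapping = []
    for subframe in range(n_subframes):
        count = base + (1 if subframe < extra else 0)  # extra slots to earliest subframes (assumption)
        mapping.extend([subframe] * count)
    return mapping

# 12 slots -> 3 slots per subframe, as in the Figure 10 example:
# even_slot_to_subframe_map(12) -> [0,0,0, 1,1,1, 2,2,2, 3,3,3]
# For N_slots not divisible by 4 (e.g., 13 or 14), the extra slots here go to the
# earliest subframes; the exact placement used by the embodiment is not reproduced.
```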
- Figure 11 depicts further examples of the subframe to slot mapping process according to these embodiments where the adapted slot number N_slots is 13 or 14 and the original number of slots N_slots_default is 16.
- the result of the mapping process where 13 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1101 in Figure 11 .
- the output 701 in Figure 7 from the slot to subframe mapper 702 is the above slot to subframe map M_slot-sf(m).
- the energy determiner 704 is shown as receiving the time adjustment instructions 617 and the time-adjusted transport audio signal(s) 621 frame (adapted IVAS frame).
- the function of the energy determiner 704 is to determine the energy of the adapted IVAS frame on a slot-by-slot basis according to the number of slots N slots .
- the energy determiner 704 takes in a frame of length N_slots × slot width (1.25 ms) of the time-adjusted transport audio signals 621, i.e., an adapted IVAS frame, and effectively divides the frame into N_slots time slots and then determines the energy across all the audio signals of the adapted IVAS frame for each slot.
- q is the number of time shifted transport audio signals/channels in the signal 621.
- the output from the energy determiner 704 is the energy E for each time slot m for an adapted IVAS frame. This is shown as the output 703 in Figure 7 .
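- Assuming the adapted frame is available as an array of shape (channels, samples), the per-slot energy computation described above can be sketched as:

```python
import numpy as np

def slot_energies(frame: np.ndarray, n_slots: int) -> np.ndarray:
    """Energy E(m) of each of the n_slots time slots, summed over all q channels.

    frame: array of shape (q, n_samples) holding one adapted (time-adjusted) frame.
    The frame is split into n_slots equal-length slots (1.25 ms each in the example).
    """
    q, n_samples = frame.shape
    slot_len = n_samples // n_slots
    energies = np.empty(n_slots)
    for m in range(n_slots):
        segment = frame[:, m * slot_len:(m + 1) * slot_len]
        energies[m] = np.sum(segment.astype(np.float64) ** 2)
    return energies
```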
- the subframe-to-subframe map determiner 706 is depicted as receiving the energy E for each time slot m of the adapted IVAS frame 703 and the slot to subframe map M_slot-sf(m) 701.
- the function of the subframe-to-subframe map determiner 706 is to determine, for each subframe of the adapted IVAS frame, a subframe from the original IVAS frame whose associated spatial audio parameters most closely align with the audio signal of the subframe of the adapted IVAS frame.
- This may be performed in order to provide a map whereby a subframe of the adapted IVAS frame is mapped to a subframe of the original IVAS frame.
- the subframe-to-subframe mapping determiner 706 may be arranged to use the map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame to produce a map for mapping a subframe of the adapted IVAS frame to a subframe of the original IVAS frame.
- this function may be performed by the subframe-to-subframe map determiner 706 being arranged to use the slot to subframe map 701 and the energy E for each time slot m 703 to determine an energy to subframe map for each subframe of the original IVAS frame.
- the subframe-to-subframe map determiner 706 determines for each subframe of the original IVAS frame the energy of the adapted IVAS frame slots which were mapped to that subframe.
- M_E-sf(n) is the energy of the slots mapped to a subframe n of the original IVAS frame, where the adapted IVAS frame slots mapped to the subframe n are given by the slot to subframe mapping M_slot-sf(m), and where m_nA is the list of slots mapped to subframe n (of the original IVAS frame), m_nA(0) represents the first slot mapped to subframe n and m_nA(N − 1) the last slot mapped to subframe n, where N represents the number of slots in subframe n_A.
- the understanding of the above equation may be enhanced by returning to the example of Figure 8 .
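- Using the Figure 8 mapping as an example, the energy-to-subframe values M_E-sf(n) can be accumulated from the slot to subframe map and the slot energies with a short sketch such as:

```python
def energy_to_subframe_map(slot_to_subframe: list[int],
                           slot_energy: list[float],
                           n_subframes: int = 4) -> list[float]:
    """M_E-sf(n): total energy of the adapted-frame slots mapped to each original subframe n."""
    totals = [0.0] * n_subframes
    for m, subframe in enumerate(slot_to_subframe):
        totals[subframe] += slot_energy[m]
    return totals

# With the Figure 8 map [0,0,0, 1,1, 2,2,2, 3,3,3,3], M_E-sf(0) is E(0)+E(1)+E(2),
# M_E-sf(1) is E(3)+E(4), and so on.
```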
- the next step performed by the subframe-to-subframe map determiner 706 is to determine, for a subframe n_A of the adapted IVAS frame, the subframe n_max (of the original IVAS frame) which gives the maximum energy-to-subframe value of all the M_E-sf(n) values associated with the slots of the subframe n_A of the adapted IVAS frame. This may be performed for all subframes of the adapted IVAS frame.
- the pseudo code for this step may have the following form:
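- The original pseudo code listing is not reproduced here; a minimal Python sketch of the step described above, assuming the adapted-frame subframes group consecutive runs of slots, might be:

```python
def max_energy_subframe_map(slot_to_subframe: list[int],
                            energy_to_subframe: list[float],
                            slots_per_subframe: int = 4) -> list[int]:
    """For each adapted subframe n_A, pick the original subframe n_max whose M_E-sf(n)
    value is largest among the original subframes that the slots of n_A map to.

    slot_to_subframe: M_slot-sf(m), 0-based original subframe per adapted slot.
    energy_to_subframe: M_E-sf(n), total mapped slot energy per original subframe.
    """
    n_adapted_subframes = len(slot_to_subframe) // slots_per_subframe
    subframe_map = []
    for n_a in range(n_adapted_subframes):
        slots = range(n_a * slots_per_subframe, (n_a + 1) * slots_per_subframe)
        candidates = {slot_to_subframe[m] for m in slots}  # original subframes hit by n_A
        n_max = max(candidates, key=lambda n: energy_to_subframe[n])
        subframe_map.append(n_max)
    return subframe_map
```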
- the mapped subframe may more often be chosen from the beginning of the slot to subframe map section M_slot-sf(m), where m ∈ [m_nA(0), m_nA(N − 1)].
- subframe-to-subframe mapping function may be performed according to the flow chart presented in Figure 12 .
- the number of subframes of the adapted IVAS frame may be determined or communicated to the subframe-to-subframe map determiner 706.
- the number of subframes in the adapted IVAS frame N A may be based on the premise that each subframe comprises the same number of slots as each subframe of the original IVAS frame.
- the step of determining/acquiring the number of subframes in an adapted IVAS frame is shown in Figure 12 as the processing step 1201.
- the subframe index n_A runs from 0 to N_A − 1.
- the subframe level processing loop comprises the steps 1207 to 1211.
- the total energy for the first subframe of the adapted IVAS signal will comprise the sum of the slot energies E(0) to E(3) for slots s1 to s4 (i.e., m_1(0) to m_1(3)).
- the total energy for the second subframe of the adapted IVAS signal will comprise the sum of the slot energies E(4) to E(7) for slots s5 to s8 (i.e., m_2(0) to m_2(3)).
- the total energy for the third subframe will comprise the sum of the slot energies E(8) to E(11) for slots s9 to s12 (i.e., m_3(0) to m_3(3)).
- the final slot s13 may either be processed as a non-full subframe, or it may be buffered for the next decoded IVAS frame.
- the step of determining the total energy for the slots of the subframe of the adapted IVAS frame is shown as processing step 1207 in Figure 12 .
- the next step of the subframe processing loop initialises an accumulative energy factor E cum for the subframe of the adapted IVAS frame. This is shown as processing step 1209 in Figure 12 .
- This is shown as the processing step 1211 and comprises the steps 1213 to 1217.
- the first step of the slot level processing loop adds the energy of a current slot E ( k ) to the accumulative energy E cum . This is shown as step 1213 in Figure 12 .
- the slot level processing loop then checks whether the accumulative energy E cum is greater than E tot /2 for the subframe. This is shown as the processing step 1215.
- if it was determined at step 1215 that the above criterion had been met, the slot level processing loop progresses to the processing step 1217.
- the index k (which has led to the above criterion being met) is used to map a subframe of the original IVAS frame to the subframe of the adapted IVAS frame. This may be performed by taking the subframe of the original IVAS frame which houses the index k and assigning this subframe [of the original IVAS frame] to the subframe n_A [of the adapted IVAS frame]. As mentioned above, the mapping (or relationship) between slots of the adapted IVAS frame and subframes of the original IVAS frame is given by the mapping function M_slot-sf(m).
- if the criterion is not met at step 1215, i.e. E_cum is not greater than E_tot/2 for the subframe n_A, then the process selects the next slot of the subframe of the adapted IVAS frame and proceeds to steps 1213 and 1215.
- the result of the processing steps of Figure 12 is the subframe-to-subframe map/table M_sf-sf with an entry for all subframes n_A of the adapted IVAS frame.
- This subframe-to-subframe map M_sf-sf may then be used to obtain the one-to-one mapping from each subframe of the adapted IVAS frame to the optimum subframe of the original IVAS frame.
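- A compact Python sketch of the Figure 12 flow, again assuming that each adapted subframe groups a consecutive run of slots, might be:

```python
def half_energy_subframe_map(slot_to_subframe: list[int],
                             slot_energy: list[float],
                             slots_per_subframe: int = 4) -> list[int]:
    """For each adapted subframe n_A, walk its slots accumulating energy; the slot k at
    which the cumulative energy exceeds half of the subframe total selects, via the
    slot to subframe map, the original subframe assigned to n_A (M_sf-sf)."""
    n_adapted_subframes = len(slot_to_subframe) // slots_per_subframe
    subframe_map = []
    for n_a in range(n_adapted_subframes):
        slots = list(range(n_a * slots_per_subframe, (n_a + 1) * slots_per_subframe))
        e_tot = sum(slot_energy[k] for k in slots)   # step 1207: total subframe energy
        e_cum = 0.0                                  # step 1209: initialise accumulator
        chosen = slots[-1]                           # fallback if the criterion is never met
        for k in slots:                              # steps 1213-1215
            e_cum += slot_energy[k]
            if e_cum > e_tot / 2.0:
                chosen = k                           # step 1217: criterion met at slot k
                break
        subframe_map.append(slot_to_subframe[chosen])
    return subframe_map
```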
- the subframe-to-subframe map M sf-sf may form the output 705 of the subframe-to-subframe map determiner 706.
- the spatial audio metadata adaptor 708 can be arranged to receive the spatial audio metadata 613 and the subframe-to-subframe map M_sf-sf 705 and produce as output the time adapted spatial audio metadata 623.
- the spatial audio metadata adaptor 708 is arranged to assign a spatial audio parameter set of the original IVAS frame to each subframe of the adapted IVAS frame n A by using the subframe-to-subframe map M sf-sf 705. For each entry n A of the subframe-to-subframe map M sf-sf 705 there is a corresponding original IVAS subframe index n. The index n may then be used to assign the spatial audio parameter set of subframe n of the original IVAS frame to subframe n A of the adapted IVAS frame or in other words subframe n A of the time-adjusted transport audio signal(s) 621 frame.
- this mechanism can be repeated for the other spatial parameters in the MASA spatial audio parameter set to give the adapted MASA spatial audio parameter set for subframe n A of the time-adjusted transport audio signal(s) 621 frame.
- the time adapted spatial audio metadata 623 output therefore may comprise a spatial audio parameter set for each subframe n_A of the time-adjusted transport audio signal(s) 621 frame.
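- The final assignment step can be sketched as a simple lookup through the subframe-to-subframe map; the function below is illustrative only:

```python
def adapt_spatial_metadata(original_metadata: list, subframe_map: list[int]) -> list:
    """Assign to each adapted subframe n_A the spatial audio parameter set of the
    original subframe given by M_sf-sf(n_A); parameter sets are copied, not modified."""
    return [original_metadata[n] for n in subframe_map]

# e.g. adapted_metadata = adapt_spatial_metadata(frame_metadata_per_subframe, m_sf_sf)
```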
- the audio signal and the metadata may be out of synchrony after decoding, and the synchronization step is performed after the JBM process and the output of the audio and metadata.
- a delay may be needed to allow use of correct slot energy in the weighting process. This may be achieved by simply delaying the audio signal or the original metadata as necessary.
- a ring buffer may be used for such a purpose.
- the process of selecting metadata and calculating energies may be done in time-frequency domain.
- the metadata selection may be done for each subframe & frequency band combination separately using time slots and frequency bands.
- the process of forming the subframe-to-subframe map M sf-sf 705 may use signal energy only for one of the cases, waveform extending (increasing the playing out time) or waveform shortening (decreasing the playing out time).
- the audio and metadata format may be some other format than the MASA format, or the audio and metadata format may be derived from some other format during encoding and decoding in the codec.
- the energy of some slots may be missing or unobtainable, e.g., due to asynchrony.
- the energy of these slots can be approximated from the other slots in current frame and in history that have obtainable energy value.
- An example of such approximation is the average energy value of the other slots with obtainable energy value which may be assigned as the energy value of any slot with missing energy.
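- A simple sketch of this approximation, replacing missing slot energies with the average of the obtainable ones, is:

```python
from typing import List, Optional

def fill_missing_slot_energies(slot_energy: List[Optional[float]]) -> List[float]:
    """Replace missing (None) slot energies with the average of the obtainable ones."""
    available = [e for e in slot_energy if e is not None]
    fallback = sum(available) / len(available) if available else 0.0
    return [e if e is not None else fallback for e in slot_energy]
```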
- FIG. 13 shows an example system within which some embodiments can be implemented.
- the transport audio signals 102 and the spatial metadata 104 are passed to an encoder 1301 which generates an encoded bitstream 1302.
- the encoded bitstream 1302 is received by the decoder 1303 which is configured to generate a spatial audio output 1304.
- the transport audio signals 102 and the spatial metadata 104 can be obtained in the form of a MASA stream.
- the MASA stream can, for example, originate from a mobile device (containing a microphone array), or as an alternative example, it may have been created by an audio server that has potentially processed a MASA stream in some way.
- the encoder 1301 can furthermore, in some embodiments, be an IVAS encoder.
- the decoder 1303, in some embodiments, can be configured to directly output the spatial audio output 1304 to be rendered by an external renderer, or edited/processed by an audio server.
- the decoder 1303 comprises a suitable renderer, which is configured to render the output in a suitable form, such as binaural audio signals or multichannel loudspeaker signals (such as 5.1 or 7.1+4 channel format), which are also examples of spatial audio output 1304.
- the device may be any suitable electronics device or apparatus.
- the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
- the device may for example be configured to implement the encoder and/or decoder or any functional block as described above.
- the device 1400 comprises at least one processor or central processing unit 1407.
- the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
- the device 1400 comprises at least one memory 1411.
- the at least one processor 1407 is coupled to the memory 1411.
- the memory 1411 can be any suitable storage means.
- the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
- the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
- the device 1400 comprises a user interface 1405.
- the user interface 1405 can be coupled in some embodiments to the processor 1407.
- the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
- the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
- the user interface 1405 can enable the user to obtain information from the device 1400.
- the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
- the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
- the user interface 1405 may be the user interface for communicating.
- the device 1400 comprises an input/output port 1409.
- the input/output port 1409 in some embodiments comprises a transceiver.
- the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
- the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
- the transceiver can communicate with further apparatus by any suitable known communications protocol.
- the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth ® , personal communications services (PCS), ZigBee ® , wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
- the transceiver input/output port 1409 may be configured to receive the signals.
- the device 1400 may be employed as at least part of the synthesis device.
- the input/output port 1409 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar and to loudspeakers.
- the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
- some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
- While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
- the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
- any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
- the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
- the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
- the data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
- Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
- the design of integrated circuits is by and large a highly automated process.
- Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
- Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
- the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
- circuitry may refer to one or more or all of the following:
- circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
- circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device, or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
- non-transitory is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Stereophonic System (AREA)
Abstract
An apparatus for spatial audio decoding, configured to: receive a first audio signal frame comprising a number of subframes; receive a parameter indicating a total number of time slots of a second audio signal frame; map the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame to produce a map for mapping a time slot of the second audio signal frame to a subframe of the first audio signal frame; use the map for mapping a time slot of the second audio signal frame to a subframe of the first audio signal frame to produce a map for mapping a subframe of the second audio signal frame to a subframe of the first audio signal frame; and use the map for mapping a subframe of the second audio signal frame to a subframe of the first audio signal frame to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.
Description
- The present application relates to apparatus and methods for adapting spatial audio metadata for the provision of jitter buffer management in immersive and spatial audio codecs.
- Parametric spatial audio capture from inputs, such as microphone arrays and other sources, is a typical and effective choice for estimating from the input (microphone array signals) a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can accordingly be utilized in the synthesis of the spatial sound, binaurally for headphones, for loudspeakers, or for other formats, such as Ambisonics.
- The directions and direct-to-total and diffuse-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
- A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata (which may also include other parameters such as surround coherence, spread coherence, number of directions, distance etc) for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and, for example, a stereo or mono signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
- Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. A decoder can decode the audio signals into PCM (Pulse code modulation) signals. The decoder can also process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example, a binaural output.
- The aforementioned immersive audio codecs are particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays). However, such an encoder can have other input types, for example, loudspeaker signals, audio object signals, or Ambisonic signals. Similarly, it is expected that the decoder can output the audio in supported formats. In this regard the IVAS decoder is also expected to handle the encoded audio streams and accompanying spatial audio metadata as RTP packets which may arrive with varying degrees of delay as a result of network jitter conditions in a packet-based network.
- For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
-
Figure 1 shows schematically an apparatus for MASA metadata extraction; -
Figure 2 shows schematically an example MASA metadata frame sub-frame structure; -
Figure 3 shows schematically an example MASA frame structure divided into audio slots; -
Figure 4 shows schematically a receiver system deploying a jitter buffer management scheme according to embodiments; -
Figure 5 shows an example schematic block diagram of a jitter buffer manager; -
Figure 6 shows schematically an example deployment of a jitter buffer manager in an IVAS decoder; -
Figure 7 shows schematically a metadata adaptor suitable for employing as part of the decoder as shown inFigure 6 , configured to time adapt spatial audio metadata according to embodiments; -
Figure 8 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to some embodiments; -
Figure 9 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is lengthened in time according to some embodiments; -
Figure 10 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to other embodiments; -
Figure 11 schematically shows further relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to other embodiments; -
Figure 12 shows a flow diagram of an operation of the subframe-to-subframe map determiner as shown inFigure 7 according to some embodiments; -
Figure 13 shows schematically an example system of apparatus suitable for implementing some embodiments; and -
Figure 14 shows an example device suitable for implementing the apparatus shown in previous figures. - The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio signals comprising transport audio signals and spatial metadata. As indicated above immersive audio codecs (such as 3GPP IVAS) are being planned which support a multitude of operating points ranging from a low bit rate operation to transparency. It is expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
- In the following the example codec is configured to be able to receive multiple input formats. In particular, the codec is configured to obtain or receive a multi audio signal (for example, received from a microphone array, or as a multichannel audio format input, or an Ambisonics format input). Furthermore, in some situations the codec is configured to handle more than one input format at a time. Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
- It can be considered an audio representation consisting of 'N channels + spatial metadata'. It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound source directions and, e.g., energy ratios. Sound energy that is not defined (described) by the directions, is described as diffuse (coming from all directions).
- As discussed above spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction (or directional value) a direct-to-total energy ratio, spread coherence, distance, etc.) per time-frequency (TF) tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene. For example, a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe (and associated with each direction direct-to-total ratios, spread coherence, distance values etc) are determined.
- With respect to
Figure 1 is shown an example MASA analyser 101. The MASA analyser 101 is configured to receive the input audio signal(s) 100 and analyse the input audio signals to generate transport audio signal(s) 102 and spatial metadata 104. - The transport audio signal(s) 102 can be encoded, for example, using an IVAS audio core codec, or with an AAC (Advanced Audio Coding) or EVS (Enhanced Voice Services) encoder.
- Examples of MASA spatial metadata are presented in the following table. These values are available for each time-frequency tile. In some implementations a frame is subdivided into 24 frequency bands and 4 temporal sub-frames. In other implementations other divisions of frequency and time can be employed. Furthermore, in some implementations a frame size (for example, as implemented in IVAS) is 20 ms (and thus the temporal sub-frame is 5 ms). Similarly, other frame lengths can be employed in other embodiments. In some embodiments the MASA analyser is configured to determine 1 or 2 directions for each time-frequency tile (i.e., there are 1 or 2 direction index, direct-to-total energy ratio, and spread coherence parameters for each time-frequency tile). However, in some embodiments the analyser is configured to generate more than 2 directions for a time-frequency tile.
Direction index (16 bits): Direction of arrival of the sound at a time-frequency parameter interval. Spherical representation at about 1-degree accuracy. Range of values: "covers all directions at about 1° accuracy". Values stored as 16-bit unsigned integers.
Direct-to-total energy ratio (8 bits): Energy ratio for the direction index (i.e., time-frequency subframe). Calculated as energy in direction / total energy. Range of values: [0.0, 1.0]. Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
Spread coherence (8 bits): Spread of energy for the direction index (i.e., time-frequency subframe). Defines the direction to be reproduced as a point source or coherently around the direction. Range of values: [0.0, 1.0]. Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
Diffuse-to-total energy ratio (8 bits): Energy ratio of non-directional sound over surrounding directions. Calculated as energy of non-directional sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
Surround coherence (8 bits): Coherence of the non-directional sound over the surrounding directions. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
Remainder-to-total energy ratio (8 bits): Energy ratio of the remainder (such as microphone noise) sound energy, to fulfil the requirement that the sum of energy ratios is 1. Calculated as energy of remainder sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
- As discussed above the frame size in IVAS is 20 ms. An example of the frame structure is shown in
Figure 2 where the metadata frame 201 comprises four temporal sub-frames which are 5 ms long: metadata sub-frame 1 202, metadata sub-frame 2 204, metadata sub-frame 3 206, and metadata sub-frame 4 208. The IVAS frame can also be formed as a TF-representation of a complex-valued low delay filter bank (CLDFB) where each subframe comprises 4 TF-slots, which in the case of the above example equates to each slot corresponding to 1.25 ms. An example of the IVAS frame structure 300 corresponding to the TF-representation of the CLDFB is shown in Figure 3. There are 16 audio slots in total (audio slot 16, 316) and each metadata subframe is divided into 4 audio slots, with the first metadata subframe having audio slots 1 to 4 (shown in Figure 3). - The MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1 and 7.1+4) or binaural signals. An example rendering system that can be used for MASA is described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). "Optimized covariance domain framework for time-frequency processing of spatial audio". Journal of the Audio Engineering Society, 61(6), 403-411. Broadly speaking, the rendering method determines a target covariance matrix based on the spatial metadata and the energies of the TF-tiles of the transport audio signal(s). The determined target covariance matrix may contain the channel energies of all channels and the inter-channel relationships between all channel pairs, in particular the cross-correlation and the inter-channel phase differences. These features are known to convey the perceptually relevant spatial features of a multichannel sound in various playback situations, such as binaural and multichannel loudspeaker audio.
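- By way of a non-limiting illustration only (the type and field names below are illustrative assumptions and not part of any codec specification), the frame, subframe and slot relationships and the per-tile parameter set described above could be sketched in Python as follows:

    from dataclasses import dataclass

    # Illustrative constants for the 20 ms frame example described above
    FRAME_MS = 20.0
    N_SUBFRAMES = 4                # 5 ms metadata sub-frames
    SLOTS_PER_SUBFRAME = 4         # CLDFB time slots per sub-frame
    N_SLOTS_DEFAULT = N_SUBFRAMES * SLOTS_PER_SUBFRAME   # 16 slots of 1.25 ms each
    N_BANDS = 24                   # one example division into frequency bands

    @dataclass
    class MasaTileParams:
        """Spatial metadata for one time-frequency tile (one sub-frame, one band)."""
        direction_index: int       # quantised direction of arrival (16 bits)
        direct_to_total: float     # [0.0, 1.0]
        spread_coherence: float    # [0.0, 1.0]
        diffuse_to_total: float    # [0.0, 1.0]
        surround_coherence: float  # [0.0, 1.0]
        remainder_to_total: float  # [0.0, 1.0]

    # One parameter set per sub-frame and per frequency band of a frame
    frame_metadata = [[MasaTileParams(0, 1.0, 0.0, 0.0, 0.0, 0.0)
                       for _ in range(N_BANDS)]
                      for _ in range(N_SUBFRAMES)]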
- The rendering process modifies the transport audio signals in the form of TF-tiles so that the resulting signals have a covariance matrix that resembles the target covariance matrix. As a result, the rendered spatial audio signals (e.g., binaural signals) are perceived to have the spatial properties as captured by the spatial metadata.
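- As a hedged illustration of the covariance-domain idea only (a heavily simplified sketch, not the method of the cited paper or of any codec: the coherence parameters are ignored and simple amplitude-panning gains are assumed), a target covariance matrix for a single TF-tile could be formed as follows:

    import numpy as np

    def target_covariance(tile_energy, pan_gains, direct_to_total):
        """Simplified target covariance for one TF-tile.

        tile_energy     : total energy of the tile in the transport signal(s)
        pan_gains       : assumed per-output-channel gains for the metadata direction
        direct_to_total : direct-to-total energy ratio from the spatial metadata
        """
        g = np.asarray(pan_gains, dtype=float)
        g = g / np.linalg.norm(g)                 # unit-energy panning vector
        n_ch = g.shape[0]
        direct = np.outer(g, g)                   # fully correlated point source
        diffuse = np.eye(n_ch) / n_ch             # uncorrelated, evenly spread energy
        return tile_energy * (direct_to_total * direct
                              + (1.0 - direct_to_total) * diffuse)

    # Example: a tile of unit energy panned mostly to the left of a stereo pair
    C = target_covariance(1.0, pan_gains=[0.9, 0.2], direct_to_total=0.8)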
- However, since both the transport audio signal(s) and the spatial metadata vary in time it is desirable that they remain in synchrony with each other. Failure to maintain synchrony may result in the production of unwanted artefacts in the output signals. By way of illustration the following scenario depicts the unwanted effects of a failure to maintain synchrony. Consider a sound scene containing an ambient noise source on left, and a transient-like source on right (such as a drum roll). For simplicity, let us also assume MASA capture using a single direction per time-frequency tile. When the transient source (on right side) is silent, the spatial metadata is mostly pointing towards the noise source (on left side). On the other hand, when there is a transient event such as a drum hit, the spatial metadata is mostly pointing towards the transient source (right side), because it typically has more energy than the constant noise source (left side). When the audio transport signal(s) and spatial audio metadata are in mutual synchrony, the transients are correctly rendered from the right, while the noise is rendered from the left within the same passage of time. However, if there is some asynchrony (or loss of synchrony) between the transport audio signal(s) and spatial audio metadata, the transients may be rendered from the left, and the noise may be rendered from the right over a slightly skewed passage of time. This may result in the rendered sound containing strong artefacts with the consequential decrease in perceived audio quality.
- In general, network jitter and packet loss conditions can cause degradation in quality, for example, in conversational speech services in packet networks, such as the IP networks, and mobile networks such as fourth generation (4G LTE) and fifth generation (5G) networks. The nature of the packet switched communications can introduce variations in transmission of times of the packets (containing frames), known as jitter, which can be seen by the receiver as packets arriving at irregular intervals. However, an audio playback device requires a constant input with no interruptions in order to maintain good audio quality. Thus, if some packets/frames arrive at the receiver after they are required for playback, the decoder may have to consider those frames as lost and perform error concealment.
- Typically, a jitter buffer can be utilised to manage network jitter by storing incoming frames for a predetermined amount of time (specified, e.g., upon reception of the first packet of a stream) in order to hide the irregular arrival times and provide constant input to the decoder and playback components.
- Nowadays, most audio or speech decoding systems deploy an adaptive jitter buffer management scheme in order to dynamically control the balance between short enough delay and low enough numbers of delayed frames. In this approach, an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
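- As a toy illustration of this adaptation principle only (this is not the EVS or IVAS jitter buffer algorithm; the window length and percentile are arbitrary assumptions), a target buffering delay could be tracked from recently observed packet delays as follows:

    from collections import deque

    class SimpleDelayTarget:
        """Toy jitter-buffer target delay estimator (illustrative sketch only)."""

        def __init__(self, window=100, percentile=0.94):
            self.delays = deque(maxlen=window)    # recent per-packet network delays (ms)
            self.percentile = percentile

        def update(self, packet_delay_ms):
            self.delays.append(packet_delay_ms)

        def target_delay_ms(self):
            if not self.delays:
                return 0.0
            ordered = sorted(self.delays)
            idx = min(int(self.percentile * len(ordered)), len(ordered) - 1)
            # buffer just enough to cover most of the observed jitter
            return ordered[idx]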
-
Figure 4 shows how the IVAS decoder may be connected to a jitter buffer management system. Thereceiver modem 401 can receive packets through a network socket such as an IP (Internet protocol) network socket which may be part of an ongoing Real-time Transport Protocol (RTP) session. The received packets may be pushed to aRTP depacketizer module 403, which may be configured to extract the encoded audio stream frames (payload) from the RTP packet. The RTP payload may then be pushed to a jitter buffer manager (JBM) 405 where various housekeeping tasks may be performed such as frame receive statistics are updated. Thejitter buffer manager 405 may also be arranged to store the received frames. Thejitter buffer manager 405 may be configured to pass the received frames to an IVAS decoder &renderer 407 for decoding. Accordingly, the IVAS decoder &renderer 407 passes the decoded frames back to thejitter buffer manager 405 in the form of digital samples (PCM samples). Also depicted inFigure 4 is anAcoustic player 409 which may be viewed as the module performing the playing out (or playback) of the decoded audio streams. The function performed by theAcoustic player 409 may be regarded as a pull operation in which it pulls the necessary PCM samples from the JBM buffer to provide uninterrupted audio playback of the audio streams. -
Figure 5 is asystem 500 depicting the general workings and interactions of ajitter buffer manager 405 with an IVAS decoder &renderer 407. Thejitter buffer manager 405 may comprise ajitter buffer 501, anetwork analyzer 502, anadaptation control logic 503 and anadaptation unit 504. -
Jitter buffer 501 is configured to temporarily store one or more audio stream frames (such as an IVAS bitstream), which are received via a (wired or wireless) network, for instance, in the form ofpackets 506. Thesepackets 506 may for instance be RTP packets, which are unpacked bybuffer 501 to obtain the audio stream frames. -
Buffer status information 508, such as, for instance, information on a number of frames contained inbuffer 501, or information on a time span covered by a number of frames contained in the buffer, or a buffering time of a specific frame (such as an onset frame), is transferred betweenbuffer 501 andadaptation control logic 503. -
Network analyzer 502 monitors theincoming packets 506 from theRTP depacketizer 403, for instance, to collect reception statistics (e.g., jitter, packet loss). Correspondingnetwork analyzer information 507 is passed fromnetwork analyzer 502 toadaptation control logic 503. -
Adaptation control logic 503, inter alia, controlsbuffer 501. This control comprises determining buffering times for one or more frames received bybuffer 501 and is performed based onnetwork analyzer information 507 and/orbuffer status information 508. The buffering delay ofbuffer 501 may, for instance, be controlled during comfort noise periods, during active signal periods or in-between. For instance, a buffering time of an onset signal frame may be determined byadaptation control logic 503, and IVAS decoder &renderer 407 may (for instance, viaadaptation unit 504, signals 509, and thesignal 510 to control the IVAS decoder) then be triggered to extract this onset signal frame frombuffer 501 when this determined buffering time has elapsed. TheIVAS decoder 507 can be arranged to pass the decoded audio samples to theadaption unit 504 via theconnection 511. -
Adaptation unit 504, if necessary, shortens or extends the output audio signal according to requests given byadaptation control logic 503 to enable buffer delay adjustment in a transparent manner. - In short, therefore the
JBM system 500 may be required to perform time stretching/shortening in order to achieve continuous audio playback without the introduction of audio artifacts as a result of network jitter. - In a spatial audio system such as IVAS the time stretching/shortening can be performed after rendering to the output audio signals in a manner similarly deployed by existing audio codecs such as EVS. However, taking this approach can result in some disadvantages. Firstly, time stretching/shortening may contain pitch shifting as a part of the stretching process (especially for larger modifications to the output audio signal(s). This may cause problems with binaural signals as the monaural cues (that allow the human hearing system to determine elevation) may become skewed, leading into erroneous perception of elevation. Secondly, time stretching/shortening may alter inter-channel relationships (such as altering phase and/or level relationships), which again may have a detrimental effect with the perception of direction. Thirdly, the process of time stretching/shortening over many audio channels may be computationally complex. Fourthly, performing time-stretching after rendering requires the renderer to be run for multiple frames to produce one frame of output. Fifthly, performing time stretching after rendering may cause a varying motion-to-sound latency for head-tracked binaural rendering, resulting in a degradation of the perceived quality.
- Alternatively, in order to avoid the above issues, it is possible to perform the process of time stretching/shortening before rendering which would avoid the above disadvantages. In other words, the time stretching/shortening process may be performed over the spatial audio metadata and transport audio signal(s). However, as discussed previously in order to preserve audio quality it is desirable to maintain synchrony between the spatial audio metadata and the transport audio signal(s). To that end this invention proceeds on the basis that the time stretching/shortening over the transport audio signal(s) has already been performed and focusses on the issues of time adapting the accompanying spatial audio metadata in order to maintain synchrony with the transport audio signal(s).
Figure 6 is a more detailed depiction of the jitterbuffer management system 500 for IVAS.RTP packets 601 are depicted as being received and passed to the RTP de-packer 602 which may be arranged to extract the IVAS frames 607. The RTP de-packer 602 may also be arranged to obtain from the RTP streams so calledRTP metadata 603 which can be used to update IP network metrics such as frame receive statistics which in turn may be used to estimate network jitter. Accordingly,RTP metadata 603 may be passed to the Network jitter analysis andtarget delay estimator 604 where theRTP metadata 603, comprising packet timestamp and sequence information, may be analysed to provide a targetplayout delay parameter 605 for use in the adaptioncontrol logic processor 606. - The IVAS frames 607 as obtained by the de-packing process of the RTP de-packer 602 are depicted in
Figure 6 as being passed to thede-jitter buffer 608. Thede-jitter buffer 608 is arranged to store the IVAS frames 607 in a frame buffer ready for decoding by IVAS audio &metadata decoder 610. Thede-jitter buffer 608 can also be arranged to perform frame-based delay adjustment on the stream of IVAS frames when instructed to by the adaptioncontrol logic processor 606, and also reorder the IVAS frames into a correct decoding order should they not arrive in the proper sequential order (for decoding.) The output from thede-jitter buffer 608, in other words IVAS frames for decoding 609, may be passed to the IVAS audio &metadata decoder 610. - The IVAS audio decoder &
metadata decoder 610 is arranged to decode the IVAS frames 609 into a decoded multichannel transport audio signal stream 611 (also referred to as the transport audio signal(s) and a MASA spatial audio metadata stream 613 (also referred to as spatial audio metadata). The MASA spatial audio metadata and decoded multichannel transport audio signal streams 613 and 611 may be passed on to subsequent processing blocks so that any time sequence adjustments to the respective signals may be performed. In the case of the MASA spatial audio metadata therespective stream 613 is passed to themetadata adaptor 612 and in the case of the decoded multichannel transport audio signal therespective stream 611 is passed to the multi-channel time scale modifier (MC TSM) 614. - The
MC TSM 614 is configured to time scale modify frames of the transport audio signals 611 under the direction of the adaptioncontrol logic processor 606. Basically, theMC TSM 614 performs the time stretching or time shortening of the transport audio signal(s) in the time domain in response to a time adjustment instruction provided by adaptioncontrol logic processor 606. The time adjustment instruction may be received by theMC TSM 614 along thecontrol line 615 from the adaptioncontrol logic processor 606. The output from theMC TSM 614, i.e., frames of the time-adjusted transport audio signals 621, may be passed to therenderer 616, a processing block termed theEXT output constructor 618 and themetadata adaptor 612. In the case of themetadata adaptor 612, the time-adjusted transport audio signal(s) 621 is used to assist in the adaption of the spatial audio metadata so that synchrony is better maintained. - The
metadata adaptor 612 is essentially arranged to receive the spatialaudio metadata parameters 613 corresponding to the frames of the transport audio signal(s) 611 that are delivered to theMC TSM 614, adapt these spatialaudio metadata parameters 613 in accordance with the time adjustment instructions as provided by the adaptioncontrol logic processor 606, and maintain time synchrony with the time-adjusted transport audio signal(s) 621. In order to perform these functions, themetadata adaptor 612 is configured to receive the time-adjusted transport audio signal(s) 621 and the time adjustment instructions from the adaptioncontrol logic processor 606 along thesignal line 617. Themetadata adaptor 612 may then be arranged to produce time-adapted spatial audio metadata which has time synchrony with the time-adjusted transport audio signals 621. The time adapted spatial audio metadata is depicted as thesignal 623 inFigure 6 and is shown as being passed to both therenderer 616 andEXT output constructor 618. - Also shown in
Figure 6 is therenderer 616 which can receive the time-adjusted transport audio signals 621 and the time adapted spatialaudio metadata 623 and render said signals into a multichannel spatial audio output signal. The rendering may be performed in accordance with therendering parameters 625. Therenderer 616 is also shown as receiving a signal from the adaptioncontrol logic processor 606 along thesignal line 619. -
Figure 6 also shows a further output processing function in the form of theEXT output constructor 618. This processing function simply takes the time-adjusted transport audio signals 621 and the time adapted spatialaudio metadata 623 and "packages" the signals into a single frame format suitable for outputting from a device, in other words a "spatial audio format" suitable for storage as a file type and the like. The purpose of theEXT output constructor 618 is to output the spatial audio format as it was decoded with minimal changes to conform to a spatial audio format specification. It can be then stored, re-encoded, mixed, or rendered with an external renderer. - As mentioned above the jitter buffer management system for IVAS also comprises the adaption
control logic processor 606. The adaptioncontrol logic processor 606 is responsible for providing the time adjustment instructions to other processing blocks in the system. This may be realised by the adaptioncontrol logic processor 606 receiving a target delay estimation/parameter 605 from the network jitter analysis andtarget delay estimator 604 and the current playout delay from theplayout delay estimator 620 and using this information to choose the method for playout delay adjustment to reach the target playout delay. This may be provided to the various processing blocks in the form of time adjustment instructions. The various processing blocks may then each individually utilise the received time adjustment instructions to perform appropriate actions so that the audio output from therenderer 616 is played out with the correct time. In order to realise the audio output playout delay time the following functions may be configured to receive time adjustment instructions from the adaptioncontrol logic processor 606;de-jitter buffer 608,metadata adaptor 612,MC TSM 614, and therenderer 616. Theplayout delay estimator 620 provides estimate of the current playout delay to 606 based on the information received from 608 and 614. - As mentioned previously, the
metadata adaptor 612 is arranged to adjust the spatial audio metadata 613 in accordance with the time adjustment instructions (playout delay time) whilst maintaining synchrony with the time-adjusted transport audio signals 621. In this regard, Figure 7 shows the metadata adaptor 612 according to embodiments in further detail. - The
metadata adaptor 612 takes as input thetime adjustment instructions 617 from the adaptioncontrol logic processor 606. This input may then be passed to the slot tosubframe mapper 702. Thetime adjustment instructions 617 may contain information pertaining to the number of subframes and hence audio time slots that are to be rendered. For the sake of brevity this information can be referred to as "slot adaptation info." - As may be recalled from
Figures 2 and3 , the original IVAS frame can be divided into a number of subframes with each subframe being divided into a further number of audio slots. One such example comprises a 20 ms frame divided into 4 equal length subframes, with each subframe being evenly divided into 4 audio slots giving a total of 16 audio slots at 1.25 ms each. - In embodiments the "slot adaptation info" may contain a parameter giving the number of audio slots Nslots present in the time-adjusted transport audio signals 621, which in turn provides the number of subframes in the signal and consequently the frame size of the time-adjusted transport audio signals 621. This information may then be used to adapt the spatial audio parameters sets which are currently time aligned with the subframes of the original IVAS frame to being time aligned with the subframes of the time-adjusted transport audio signals 621.
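- For illustration only (assuming the 20 ms / 4 sub-frame / 16 slot example above), the slot duration and the default slot-to-subframe association reduce to simple integer arithmetic:

    SLOTS_PER_SUBFRAME = 4
    SLOT_MS = 20.0 / 16                      # 1.25 ms per audio slot

    def default_subframe_of_slot(slot):
        """Sub-frame (1-4) that slot (1-16) of the original IVAS frame belongs to."""
        return (slot - 1) // SLOTS_PER_SUBFRAME + 1

    assert [default_subframe_of_slot(s) for s in (1, 4, 5, 16)] == [1, 1, 2, 4]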
- It is to be noted that the term "original IVAS frame" refers to the size of the IVAS frame before any time shortening/lengthening has taken place. So, it refers to the frames of the transport audio signal(s).
- It is also to be noted in the forthcoming we have referred to the time adapted transport audio signal(s) 621 frame, as an "adapted IVAS frame" for the sake of brevity.
- The parameter Nslots may be different from the default number of slots in an original IVAS frame N slots_default , with the default number of slots being the number of slots in an original IVAS frame before the time stretching/shortening process. In the above example Nslots_default is 16 audio slots.
- At this point it may be helpful to note that the process of stretching and shortening the waveform takes place in the
MC TSM processor 614 which may involve changing the number of slots in the frame of the transport audio signal(s) 611. - In embodiments the
slot subframe mapper 702 can be arranged to map the "original" default number of slots N slots_default of the original IVAS frame, to a different number of slots Nslots distributed across the same number of subframes as the original IVAS frame. This has the outcome of mapping the slots/subframes of the adapted IVAS frame to the standard IVAS frame. This results in a pattern of mapped slots where some of the subframes (of the original IVAS frame) have either more or less mapped slots depending on whether the adapted slot number Nslots is greater or less than the original number of slots Nslots-default. For instance, if Nslots < Nslots_default then the process is a waveform shortening or output play speeding up operation, and if Nslots > N slots_default then the process is a waveform lengthening or output play slowing down operation. - In essence the slot to
subframe mapper 702 may be arranged to map the Nslots time slots of the adapted IVAS frame to the subframes of the original IVAS frame to produce a map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame. - The mapping of each slot associated with the time adapted transport audio signal(s) 621 (adapted IVAS frame) to a subframe of the original IVAS frame is performed on the premise that the assigned subframe (in the original IVAS frame) best matches the temporal position of the slot in the time adapted transport audio signal(s) 621 (adapted IVAS frame).
- As may be recalled from above, each subframe comprises a set of spatial audio parameters. During normal operation, when there is no compensation for network jitter, each group of four audio slots in the original IVAS frame is associated with the spatial audio parameter set of one of the subframes. Therefore, the consequence of slot to subframe mapping process may be viewed as associating different groups of slots with the spatial audio parameter sets of the original IVAS frame.
-
Figure 8 shows an example subframe to slot mapping process when the adapted slot number Nslots is 12 and the original number of slots N slots_default is 16. In other words, thisFigure 8 depicts an example of waveform shortening (decreasing the playing out time). The relationship between slots to subframes for the original IVAS frame, where every 4 slots is mapped to a subframe is shown as 801 inFigure 8 , i.e., slots s1 to s4 are mapped tosubframe 1, slots s5 to s8 are mapped tosubframe 2, slots s9 to s12 are mapped tosubframe 3 and slots s13 to s16 are mapped tosubframe 4. The result of the mapping process where 12 slots are mapped to the 4 subframes of the original IVAS frame may be shown as 802 inFigure 8 . In this example, the slot to mapping process has resulted in the first three slots (s1, s2, s3) being mapped to first subframe. The fourth and fifth slots (s4, s5) have been mapped to the second subframe. Slots s6, s7, s8 are mapped tosubframe 3. Finally, slots s9, s10, s11 and s12 are now mapped tosubframe 4. Also shown inFigure 8 is 803, this depicts the relationship between the subframes of the adapted IVAS frame (time-adapted transport audio signal(s) 621 frame) and Nslots, where the Nslots slots are evenly distributed across the three subframes. - In embodiments the above subframe to slot mapping process can be performed by initially dividing the number of adapted slots Nslots into two contiguous regions. The first region comprising the higher numbered slots is made up of a contiguous run of the Nslots_end highest ordered slots. For instance, in the example above the first region was determined to have 4 slots ( Nslots_end = 4) comprising the slots s12, s11, s10, s9. The second region is made up of the remaining slots s1 to s8, i.e., the run of slots starting from the beginning of the frame, and going up to slot number (Nslots-N slots_end ).
-
- With respect to the first region, the subframe to slot mapping process take the Nslots_end highest ordered slots of the adapted IVAS frame and matches each them on an ordered one-to-one bases to the Nslots_end highest ordered slots of the original IVAS frame subframes and consequently the subframes associated with these slots.
- This processing step may be illustrated by referring to the example of
Figure 8 , where the slots of the adapted IVAS frame s9, s10, s11 and s12 are mapped on a one-to-one basis to the subframe having the 4 highest slots of the original IVAS frame s13, s14, s15 and s16, i.e.,subframe 4. - In embodiments the above subframe to slot mapping process for the first region may be formulated as
-
Figure 9 shows an example subframe to slot mapping process when the adapted slot number Nslots is 20 and again the original number of slots N slots_default is 16. In other words,Figure 9 depicts an example of waveform extending (increasing the playing out time). The distribution of slots in the standard IVAS frame is shown as 901 inFigure 9 and the result of the slot to subframe mapping process where 20 slots (and hence 5 subframes of the time adapted transport audio signal(s) 621 frame (adapted IVAS frame) are mapped to the 4 subframes of the standard IVAS frame is shown as 902 inFigure 9 . It is to be noted that in this example Nslots,end = 12, resulting in the first region of the 12 highest ordered slots (s9 to s20) being mapped to the same subframes as the 12 highest ordered slots (s5 to s16) of the original IVAS frame. Also shown inFigure 9 , is 903 the relationship between the subframes of the time adapted transport audio signals 621 frame and the Nslots slots. - In embodiments, the second region of slots is therefore given by the lowest ordered contiguous run of slots from the first slot, s1 to the highest slot with the slot number Nslots - Nslots,end. The slot to subframe mapping process then maps this contiguous run of lowest ordered slots by distributing them across a number of the subframes, starting from the lowest numbered subframe. For instances, when Nslots > Nslots_default , i.e., a period of waveform lengthening, the second region of slots (of the adapted IVAS frame) may be distributed across the first subframe and subsequent subframes up to and including the subframe to which the lowest ordered slot from the first region is mapped. For instances, when Nslots < Nslots_default , i.e., a period of waveform shortening, the second region of slots (of the adapted IVAS frame) may be distributed across all subframes of the IVAS frame.
-
- Mslot-sf (m) second is the subframe to slot mapping function which gives the subframe (of the original IVAS frame) to slot map for the second region. This function returns the mapped subframe number (of the original IVAS frame) for each slot m of the second region of slots of the adapted IVAS frame.
- In these embodiments the output from the slot to
subframe mapper 702 is the combined slot to subframe map for both the first and second regions and may be referred to as Mslot-sf (m) . This output is depicted as 701 inFigure 7 . - In other embodiments the slot to
subframe mapper 702 may be arranged to distribute the Nslots of the adapted IVAS frame across the subframes of the original IVAS frame in a different manner to the above embodiments. In these embodiments, there may be no mapping between the Nslots slots of the adapted IVAS frame and the Nslots_default slots of the original IVAS frame. Instead, the Nslots slots of the adapted IVAS frame may be mapped directly to the subframes of the original IVAS frame using the following routine. - For the case when the Nslots slots of the adapted IVAS frame is a factor of the number of the slots for each subframe of the original IVAS frame, in other words when mod(Nslots, Nslots,sf ) == 0, where Nslots,sf is the number of slots in each subframe of the original IVAS frame (N.B. Nslots,sf is equivalent to Lsf in the above embodiment, which is 4 for the standard IVAS frame) a mapping length Lmap (isf ) for a subframe isf of the original IVAS may be given as
values 1 to 4, corresponding tosubframes 1 to 4 of the original IVAS frame. -
-
-
Figure 10 shows an example subframe to slot mapping process according to these embodiments where the adapted slot number Nslots is 12 and the original number of slots N slots_default is 16. The relationship between slots and subframes for the original IVAS frame, where every 4 slots is mapped to a subframe, is shown as 1001 inFigure 10 . The result of the mapping process where 12 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame are shown as 1002 inFigure 10 . It can be seen that the 12 slots of the adapted IVAS frame have been evenly distributed across the 4 subframes of the original IVAS frame, i.e. slots s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s6 (m=3 to m=5) are mapped to subframe 2 (isf = 2), slots s7 to s9 (m=6 to m=8) are mapped to subframe 3 (isf = 3) and slots s10 to s12 are mapped to subframe 4 (isf = 4). - For the case when the Nslots slots of the adapted IVAS frame are not a factor of the number of slots (for each subframe of the original IVAS frame,) in other words when mod(Nslots , N Slots,sf ) ≠ 0 the mapping length Lmap (isf ) for a subframe isf of the original IVAS may be given as
-
-
Figure 11 , depicts further examples of the subframe to slot mapping process according these embodiments where the adapted slot number Nslots are 13 and 14 and the original number of slots Nslots_default is 16. The result of the mapping process where 13 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1101 inFigure 11 . It can be seen that the 13 slots of the adapted IVAS frame have been distributed across the 4 subframes of the original IVAS frame according to the following pattern of slots; s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s6 (m=3 to m=5) are mapped to subframe 2 (isf = 2), slots s7 to s9 (m=6 to m=8) are mapped to subframe 3 (isf = 3) and slots s10 to s13 (m = 9 to m=12) are mapped to subframe 4 isf = 4. The result of the mapping process where 14 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1102 inFigure 11 . It can be seen that the 14 slots of the adapted IVAS frame have been distributed across the 4 subframes of the original IVAS frame according to the following pattern slots; s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s7 (m=3 to m=6) are mapped to subframe 2 (isf = 2), slots s8 to s10 (m=7 to m=9) are mapped to subframe 3 (isf = 3) and slots s11 to s14 (m=10 to m13) are mapped to subframe 4 (isf = 4). - In these embodiments the
output 701 inFigure 7 from the slot tosubframe mapper 702 is the above slot to subframe map Mslot_sf (m). - Also, shown in
Figure 7 is theenergy determiner 704 which is shown as receiving thetime adjustment instructions 617 and the time-adjusted transport audio signal(s) 621 frame (adapted IVAS frame). The function of theenergy determiner 704 is to determine the energy of the adapted IVAS frame on a slot-by-slot basis according to the number of slots Nslots. To that end, theenergy determiner 704 takes in frame length of Nslots *slot width (1.25 ms) of the time-adjusted transport audio signals 621, i.e., an adapted IVAS frame and effectively divides the frame into Nslots time slots and then determines the energy across all the audio signals of the adapted IVAS frame for each slot. The energy E for each time slot m may be expressed as - Where q is the number of time shifted transport audio signals/channels in the
signal 621. - The output from the
energy determiner 704 is the energy E for each time slot m for an adapted IVAS frame. This is shown as theoutput 703 inFigure 7 . - Also, shown in
Figure 7 is the subframe-to-subframe map determiner 706 which is depicted as receiving the energy E for each time slot m of the adaptedIVAS frame 703 and the slot to subframe map Mslot-sf (m) 701. - The function of the subframe-to-
subframe map determiner 706 is to determine, for each subframe of the adapted IVAS frame, a subframe from the original IVAS frame whose associated spatial audio parameters most closely align with the audio signal of the subframe of the adapted IVAS frame. - This may be performed in order to provide a map whereby a subframe of the adapted IVAS frame is mapped to a subframe of the original IVAS frame.
- In essence the subframe-to-
subframe mapping determiner 706 may be arranged to use the map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame to produce a map for mapping a subframe of the adapted IVAS frame to a subframe of the original IVAS frame. - In embodiments this function may be performed by the subframe-to-
subframe map determiner 703 being arranged to use the slot to subframemaps 701 and the energy E for eachtime slot m 703 to determine an energy to subframe map for each subframe of the original IVAS frame. In essence, the energy tosubframe map determiner 706 determines for each subframe of the original IVAS frame the energy of the adapted IVAS frame slots which were mapped to the subframe. -
- ME-sf (n) is the energy of slots mapped to a subframe n of the original IVAS frame, where the adapted IVAS frame slots mapped to the subframe n are given by the slot to subframe mapping Mslot-sf (m) and where mn
A is the list of slots mapped to subframe n (of the original IVAS frame), mnA (0) represents the first slot mapped to subframe n and mnA (N - 1) the last slot mapped to subframe n, where N represents the number of slots in subframe nA . The understanding of the above equation may be enhanced by returning to the example ofFigure 8 . In this example it may be seen that the first subframe of the original IVAS frame has mapped slots s1, s2 and s3, m 1 (0) = s1, m 1 (1) = s2, m 1 (2) = s3, the second subframe has mapped slots s4 and s5, m 2 (0) = s4, m 2 (1) = s5, the third subframe has mapped slots s6 to s8, m 3 (0) = s6 to m 3 (2) = s8, and the fourth subframe comprises has mapped s9 to s12, m 4 (0) = s9 to m 4 (3) = s12. The ME-sf (n) for each subframe may be given as the sum of the energies for each of the adapted IVAS frame slots mapped to the original IVAS frame subframe. For example, if the energy of each adapted IVAS frame slot s1=1, s2 =1, s3=1, s4=8, s5=5, s6=3, s7=1, s8 =1, s9=2, s10=1, s11=1 and s12=1, then the energy to subframe map may be given as ME-sf (1) = 1 + 1 + 1 = 3, ME-sf (2) = 8 + 5 = 13, ME-sf (3) = 3 + 1 + 1 = 5 and ME-sf (4) = 2 + 1 + 1 + 1 = 5. - The next step performed by the subframe-to-
subframe mapper 702 is to determine for a subframe nA of the adapted IVAS frame the subframe nmax (of the original IVAS frame) which gives the maximum energy to subframe value of all the ME-sf (n) values which comprise the slots of the subframe nA of the adapted IVAS frame. This may be performed for all subframes of the adapted IVAS frame. - The pseudo code for this step may have the following form:
- for each nA :
find ME-sf (n), for all values of n which are comprised in the slot range [mnA (0) , mnA (N - 1)] of the subframe nA ; - End for:
Where argmaxn returns the index n that maximises the value of the function ME-sf (n). - Returning to the above example of
Figure 8 . The first subframe of the adapted IVAS frame (nA = 1) has slot range [m 1 A (0) , m 1 A (N - 1)] = s1 to s4. This slot range comprises the original IVAS subframes with indexes n = 1, 2. The ME-sf (1) = 3 and the ME-sf (2) = 13. This first subframe of the adapted IVAS frame (nA = 1) has nmax = 2 because ME-sf (2) has a greater summed energy than ME-sf (1). This process can be repeated for all subframes of the adapted IVAS frame where the second subframe (nA = 2) has nmax = 2, and the third subframe (nA = 3) has nmax = 3. - In other embodiments, the subframe nmax can also be determined by
A (0) ,mnA (N - 1)]. If adj > 1, the subframe newm may more often chosen from the beginning of the slot to subframe map section Mslot-sf (m), where m ∈ [mnA (0) ,mnA (N - 1)]. - The value of nmax for each subframe nA of the adapted IVAS frame may collated in the form of a subframe-to-subframe mapping table/function Msf_sf (nA ) = nmax (nA ) for all nA .
- In other embodiments the subframe-to-subframe mapping function may be performed according to the flow chart presented in
Figure 12 . - Initially the number of subframes of the adapted IVAS frame may be determined or communicated to the subframe-to-
subframe map determiner 706. - In these other embodiments, the number of subframes in the adapted IVAS frame NA may be based on the premise that each subframe comprises the same number of slots as each subframe of the original IVAS frame. For example, in the case of the standard IVAS frame size, the adapted IVAS frame may have NA subframes each being 4 slots wide, giving NA 5ms subframes. Therefore, using the above nomenclature the number of subframes in the adapted IVAS frame can be given by,
- It is to be noted that in instances when Nslots is not a factor of Nslots,sf then any slots above NA ∗ Nslots,sf (i.e. remainder slots will fall outside of the highest subframe) may not be included in calculations involving the subframes of the adapted IVAS frame. This may be illustrated with reference to the example of 1101 in
Figure 11 where the number of slots Nslots is 13 and the number of subframes in the adapted IVAS frame NA = 3. It can be seen in this example thatslot 13 is not included within the subframes of the adapted IVAS frame. - The step of determining/acquiring the number of subframes in an adapted IVAS frame is shown in
Figure 12 as theprocessing step 1201. - The process then moves into a subframe level processing loop for each subframe of the adapted IVAS frame, nA = 0:NA - 1.
- This is shown as the
processing step 1205 inFigure 12 , and the subframe level processing loop comprises thesteps 1207 to 1211. - The first step of the subframe level processing loop calculates the total energy of the slots of the subframe nA of the adapted IVAS frame. This may be determined as
A (0) is the first slot number of subframe nA of the adapted IVAS frame and mnA (N - 1) is the last slot number of subframe nA , and E(k) is the slot energy for slot k of the adapted IVAS signal as provided by theenergy determiner 704. - This may be illustrated by reference to the example of 1101 in
Figure 11 . The total energy for the first subframe of the adapted IVAS signal will comprise the sum of the slot energies E(0) to E(3) for slots s1 to s4, (i.e. m 1(0)to m 1(3)). The total energy for the second subframe of the adapted IVAS signal will comprise the sum of the slot energies E (4) to E (7) for slots s5 to s8, (i.e. m 2(0)to m 2(3)). The total energy for the third subframe will comprise the totals for E (8) to E (11) for slots s9 to s12, (i.e. m 3(0)to m 3(3)). The final slot s13 may either be processed as a non-full subframe, or it may be buffered for the next decoded IVAS frame. - The step of determining the total energy for the slots of the subframe of the adapted IVAS frame is shown as
processing step 1207 inFigure 12 . - The next step of the subframe processing loop initialises an accumulative energy factor Ecum for the subframe of the adapted IVAS frame. This is shown as processing step 1209 in
Figure 12 . - The process then moves into a slot level processing loop for the subframe of the adapted IVAS frame, k= mn
A (0): mnA (N - 1), where k is a slot index for the subframe nA , and is used to index the slots mnA (0): mnA (N - 1), of subframe nA . This is shown as theprocessing step 1211 and comprises thesteps 1213 to 1217. - The first step of the slot level processing loop adds the energy of a current slot E (k) to the accumulative energy Ecum . This is shown as
step 1213 inFigure 12 . - The slot level processing loop then checks whether the accumulative energy Ecum is greater than Etot /2 for the subframe. This is shown as the
processing step 1215. - If it was determined at
step 1215 that the above criterion had been met, the slot level processing loop progresses to theprocessing step 1217. - At
step 1217, the index k (which has led to the above criterion being met) is used to map a subframe of the original IVAS frame to the subframe of the adapted IVAS frame. This may be performed by taking the subframe of the original IVAS frame which houses the index k and assigns this subframe [of the original IVAS frame] to the subframe nA [of the adapted IVAS frame]. As mentioned above the mapping (or relationship) between slots of the adapted IVAS frame and subframes of the original IVAS frame is given by the mapping function Mslot-sf (m) . The output from this mapping function for the input of the index value associated with k produces the subframe of the original IVAS frame which is assigned to the subframe nA (of the adapted IVAS frame) in the form of a subframe-to-subframe mapping table/function Msf_sf (nA ) =Mslot_sf (k).Whenstep 1217 has been executed, the process returns to 1205 to repeat theprocessing steps 1207 to 1217 for the next subframe nA = nA + 1. - Returning to step 1215, if the criterium is not met, i.e. Ecum is not greater than Etot /2 for the subframe nA then the process selects the next slot of the subframe of the adapted IVAS frame and proceeds to
steps - The result of the processing steps of
Figure 12 is the subframe-to-subframe map/table Msf-sf with an entry for all subframes nA of the of the adapted IVAS frame. - This subframe-to-subframe map Msf-sf may then be used to obtain the one-to-one mapping between the optimum subframe of the original IVAS frame for each subframe of the adapted IVAS frame. The subframe-to-subframe map Msf-sf may form the
output 705 of the subframe-to-subframe map determiner 706. - Also shown in
Figure 7 is the spatialaudio metadata adaptor 708 which can be arranged to receive thespatial audio metadata 613 and the subframe-to-subframe map M sf-sf 705 and produce as output the time adapted spatialaudio metadata 623. - In embodiments the spatial
audio metadata adaptor 708 is arranged to assign a spatial audio parameter set of the original IVAS frame to each subframe of the adapted IVAS frame nA by using the subframe-to-subframe map M sf-sf 705. For each entry nA of the subframe-to-subframe map M sf-sf 705 there is a corresponding original IVAS subframe index n. The index n may then be used to assign the spatial audio parameter set of subframe n of the original IVAS frame to subframe nA of the adapted IVAS frame or in other words subframe nA of the time-adjusted transport audio signal(s) 621 frame. - For example, if we just consider one spatial audio parameter of the MASA spatial audio parameter set for a subframe with index n, the output azimuth angle θA associated with subframe nA of the time-adjusted transport audio signal(s) 621 frame may be given as θA (nA ) = θ(Msf-sf (nA )). Obviously, this mechanism can be repeated for the other spatial parameters in the MASA spatial audio parameter set to give the adapted MASA spatial audio parameter set for subframe nA of the time-adjusted transport audio signal(s) 621 frame.
- The time adapted spatial
audio metadata 623 output therefore may comprise a spatial audio parameter set for each subframe nA of the time-adjusted transport audio signal(s) 621 frame.
- It is to be understood that other embodiments may deploy other frame/subframe and slot sizes.
- In some embodiments, the audio signal and the metadata may be asynchronous after decoding, and the synchronization step is performed after the JBM process and the output of the audio and metadata. In this case, a delay may be needed to allow use of the correct slot energy in the weighting process. This may be achieved by simply delaying the audio signal or the original metadata as necessary. A ring buffer may be used for such a purpose.
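- The delay mentioned above may be realised, for example, with a small ring buffer. The following Python sketch is an illustrative assumption only (the class name, the one-frame delay and the metadata structure are not taken from the IVAS specification); it delays the original metadata so that it lines up with audio that has already passed through the JBM path. The audio signal could equally be delayed in the same way instead.

from collections import deque


class MetadataDelayLine:
    def __init__(self, delay_frames: int):
        # Pre-fill with None so the first reads return "no metadata yet".
        self._buf = deque([None] * delay_frames, maxlen=delay_frames + 1)

    def push(self, metadata_frame):
        """Insert the newest metadata frame and return the delayed one."""
        self._buf.append(metadata_frame)
        return self._buf.popleft()


# Usage: delay the original metadata by one frame so that it lines up with the
# audio that has already passed through the jitter buffer management (JBM) path.
delay = MetadataDelayLine(delay_frames=1)
for frame_idx in range(3):
    delayed = delay.push({"frame": frame_idx, "azimuth": [0.0] * 4})
    # 'delayed' is None for the first call, then the metadata of frame_idx - 1.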
- In some embodiments, the process of selecting metadata and calculating energies may be done in the time-frequency domain. In such embodiments the metadata selection may be done for each subframe and frequency band combination separately, using time slots and frequency bands.
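- A sketch of this time-frequency variant is given below, again as an illustrative assumption rather than a normative procedure: the same accumulated-energy selection is simply repeated independently for every frequency band using per-band slot energies, so that each subframe and band combination of the adapted frame receives its own original-subframe index. The band count, layout and names are invented for the example.

from typing import List, Sequence


def per_band_subframe_map(slot_band_energy: Sequence[Sequence[float]],
                          slot_to_sf: Sequence[int],
                          adapted_sf_slots: Sequence[Sequence[int]],
                          num_bands: int) -> List[List[int]]:
    """Returns msf_sf[nA][band]: one original-subframe index per adapted subframe and band."""
    result = []
    for slots in adapted_sf_slots:
        per_band = []
        for b in range(num_bands):
            e_tot = sum(slot_band_energy[k][b] for k in slots)
            e_cum, chosen = 0.0, slot_to_sf[slots[-1]]
            for k in slots:
                e_cum += slot_band_energy[k][b]
                if e_cum > e_tot / 2.0:
                    chosen = slot_to_sf[k]
                    break
            per_band.append(chosen)
        result.append(per_band)
    return result


# Example with 12 slots and 2 assumed frequency bands.
slot_band_energy = [[0.1, 0.4], [0.2, 0.1], [0.9, 0.2], [0.3, 0.5], [0.1, 0.6], [0.1, 0.1],
                    [0.4, 0.2], [0.5, 0.3], [0.6, 0.9], [0.2, 0.1], [0.8, 0.7], [0.1, 0.2]]
maps = per_band_subframe_map(slot_band_energy,
                             [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3],
                             [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]],
                             num_bands=2)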
- In some embodiments, the process of forming the subframe-to-
subframe map Msf-sf 705 may use signal energy only for one of the cases, waveform extending (increasing the playing out time) or waveform shortening (decreasing the playing out time).
- It is to be understood that in some embodiments, the audio and metadata format may be some other format than the MASA format, or the audio and metadata format may be derived from some other format during encoding and decoding in the codec.
- In some embodiments, the energy of some slots may be missing or unobtainable, e.g., due to asynchrony. In such embodiments, the energy of these slots can be approximated from the other slots in the current frame and in history that have an obtainable energy value. An example of such an approximation is the average energy value of the other slots with an obtainable energy value, which may be assigned as the energy value of any slot with missing energy.
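- A minimal sketch of this fallback is shown below, assuming (for illustration only) that missing energies are marked as None and that only the current frame is used; a history buffer could be included in the average in the same way.

from typing import List, Optional


def fill_missing_slot_energies(slot_energy: List[Optional[float]]) -> List[float]:
    """Replace any unobtainable slot energy with the average of the obtainable ones."""
    known = [e for e in slot_energy if e is not None]
    fallback = sum(known) / len(known) if known else 0.0
    return [e if e is not None else fallback for e in slot_energy]


# Example: the energy of slots 2 and 5 is unobtainable.
energies = [0.4, 0.2, None, 0.6, 0.8, None]
print(fill_missing_slot_energies(energies))  # [0.4, 0.2, 0.5, 0.6, 0.8, 0.5]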
- With respect to
Figure 13 is shown an example system within which some embodiments can be implemented. As an input are the transport audio signals 102 and the spatial metadata 104. The transport audio signals 102 and the spatial metadata 104 are passed to an encoder 1301 which generates an encoded bitstream 1302. The encoded bitstream 1302 is received by the decoder 1303 which is configured to generate a spatial audio output 1304.
- As discussed above, the input to the system, the transport audio signals 102 and the
spatial metadata 104 can be obtained in the form of a MASA stream. The MASA stream can, for example, originate from a mobile device (containing a microphone array), or as an alternative example, it may have been created by an audio server that has potentially processed a MASA stream in some way. - The
encoder 1301 can furthermore, in some embodiments, be an IVAS encoder. - The
decoder 1303, in some embodiments, can be configured to directly output the spatial audio output 1304 to be rendered by an external renderer, or edited/processed by an audio server. In some embodiments, the decoder 1303 comprises a suitable renderer, which is configured to render the output in a suitable form, such as binaural audio signals or multichannel loudspeaker signals (such as 5.1 or 7.1+4 channel format), which are also examples of spatial audio output 1304.
- With respect to
Figure 14 is shown an example electronic device which may be used as any of the apparatus parts of the system as described above. The device may be any suitable electronic device or apparatus. For example, in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. The device may, for example, be configured to implement the encoder and/or decoder or any functional block as described above.
- In some embodiments the
device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.
- In some embodiments the
device 1400 comprises at least one memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore, in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
- In some embodiments the
device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example, the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating.
- In some embodiments the
device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
- The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
- The transceiver input/
output port 1409 may be configured to receive the signals. - In some embodiments the
device 1400 may be employed as at least part of the synthesis device. The input/output port 1409 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar, and loudspeakers.
- In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
- The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further, in this regard it should be noted that any blocks of the logic flow as shown in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
- The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
- Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
- Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
- As used in this application, the term "circuitry" may refer to one or more or all of the following:
- (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
- (b) combinations of hardware circuits and software, such as (as applicable):
- (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
- (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and
- This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
- The term "non-transitory," as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
- As used herein, "at least one of the following: <a list of two or more elements>" and "at least one of <a list of two or more elements>" and similar wording, where the list of two or more elements are joined by "and" or "or", mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.
- The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Claims (15)
- An apparatus for spatial audio decoding, configured to:
receive a first audio signal frame comprising a number of subframes, wherein each of the number of subframes is divided into a number of time slots;
receive a parameter indicating a total number of time slots of a second audio signal frame;
determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame, wherein the slot to subframe map maps a time slot of the second audio signal frame to a subframe of the first audio signal frame;
determine an energy value for each time slot of the total number of time slots of the second audio signal frame;
determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame, wherein the subframe to subframe map maps a subframe of the second audio signal frame to a subframe of the first audio signal frame; and
use the subframe to subframe map to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.
- The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame is configured to:
determine a mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by a value indicating the number of time slots for a subframe of the first audio signal frame;
map a first set of time slots comprising the mapping length number of time slots of the second audio signal frame to a first subframe of the first audio signal frame;
update an accumulative mapping length number by adding the mapping length number to the accumulative mapping length number; and
map a second set of time slots of the second audio signal to a second subframe of the first audio signal frame, wherein the second set of time slots comprises the mapping length number of time slots of the second audio signal following the accumulative mapping length number of previous time slots of the second audio signal.
- The apparatus as claimed in Claim 2, wherein the result of the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is not an integer, wherein the apparatus configured to determine the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is configured to:
determine a remainder from the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame;
accumulate the remainder value according to a subframe index of the first audio signal frame;
determine an increment value according to the subframe index of the first audio signal frame, wherein the increment value is given by a floor function being applied to the accumulated remainder value; and
determine the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame and adding the increment value to the result of the division.
- The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame is configured to:
determine a total energy value for a subframe of the second audio signal frame, wherein the subframe of the second audio signal frame is divided into a number of time slots and wherein the total energy value is determined by summing the energy value for each time slot of the number of time slots of the subframe of the second audio signal frame; and
perform the following steps for a time slot of the subframe of the second audio signal frame:
determine an accumulative energy value for the time slot;
determine whether the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame;
when the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame, use an index associated with the time slot to obtain a corresponding subframe of the first audio signal frame from the slot to subframe map and assign an entry of the subframe to subframe map mapping the corresponding subframe of the first audio signal frame to the subframe of the second audio signal frame; and
when the accumulative energy value for the time slot is not greater than half the total energy for the subframe of the second audio signal frame, proceed to the next time slot of the subframe of the second audio signal frame.
- The apparatus as claimed in Claim 4, wherein the apparatus configured to determine an accumulative energy value for the time slot is configured to:
add an energy value of the time slot of the subframe of the second audio signal frame to a running total energy comprising a sum of energy values for previous time slots of the subframe of the second audio signal frame.
- The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame is configured to:
divide the parameter indicating the total number of time slots of the second audio signal frame into a first region of contiguous time slots and a second region of contiguous time slots;
map time slots of the first region of contiguous time slots of the second audio signal frame on a one to one basis with higher indexed time slots of the first audio signal frame;
assign the time slots of the first region of contiguous time slots of the second audio signal frame to subframes of the number of the subframes of the first audio signal frame which comprise the higher indexed time slots which have been mapped on a one to one basis with the time slots of the first region of contiguous time slots of the second audio signal frame; and
map time slots of the second region of contiguous time slots of the second audio signal frame to at least other subframes of the first audio signal frame.
- The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame is configured to:
identify the time slots of the subframe of the second audio signal frame;
determine, from the slot to subframe map, at least two subframes of the first audio signal frame which are mapped to the time slots of the subframe of the second audio signal frame;
for each of the at least two subframes of the first audio signal frame which are mapped to the time slots of the subframe of the second audio signal frame:
determine an energy by summing the energy values for the time slots of the second audio signal frame mapped to the subframe of the first audio signal frame;
from the energy determined for each of the at least two subframes of the first audio signal frame, determine a subframe of the at least two subframes of the first audio signal frame which has a maximum energy; and
assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy.
- The apparatus as claimed in Claim 7, wherein the apparatus configured to assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy is configured to:
assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy, when the maximum energy multiplied by an adjustment factor is less than the energy determined for the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy; and
assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy, when the maximum energy multiplied by an adjustment factor is greater than or equal to the energy determined for the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy.
- The apparatus as claimed in Claims 1 to 8, wherein the total number of time slots of the second audio signal frame is either greater than or less than a total number of time slots of the first audio signal frame.
- The apparatus as claimed in Claims 1 to 9, wherein the first audio signal frame is either extended in time or shortened in time to give the second audio signal frame.
- A method for spatial audio decoding, comprising:
receiving a first audio signal frame comprising a number of subframes, wherein each of the number of subframes is divided into a number of time slots;
receiving a parameter indicating a total number of time slots of a second audio signal frame;
determining a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame, wherein the slot to subframe map maps a time slot of the second audio signal frame to a subframe of the first audio signal frame;
determining an energy value for each time slot of the total number of time slots of the second audio signal frame;
determining a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame, wherein the subframe to subframe map maps a subframe of the second audio signal frame to a subframe of the first audio signal frame; and
using the subframe to subframe map to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.
- The method for spatial audio decoding as claimed in Claim 11, wherein determining a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame comprises:
determining a mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by a value indicating the number of time slots for a subframe of the first audio signal frame;
mapping a first set of time slots comprising the mapping length number of time slots of the second audio signal frame to a first subframe of the first audio signal frame;
updating an accumulative mapping length number by adding the mapping length number to the accumulative mapping length number; and
mapping a second set of time slots of the second audio signal to a second subframe of the first audio signal frame, wherein the second set of time slots comprises the mapping length number of time slots of the second audio signal following the accumulative mapping length number of previous time slots of the second audio signal.
- The method as claimed in Claim 12, wherein the result of the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is not an integer, wherein determining the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame comprises:
determining a remainder from the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame;
accumulating the remainder value according to a subframe index of the first audio signal frame;
determining an increment value according to the subframe index of the first audio signal frame, wherein the increment value is given by a floor function being applied to the accumulated remainder value; and
determining the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame and adding the increment value to the result of the division.
- The method as claimed in Claim 11, wherein determining a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame comprises:
determining a total energy value for a subframe of the second audio signal frame, wherein the subframe of the second audio signal frame is divided into a number of time slots and wherein the total energy value is determined by summing the energy value for each time slot of the number of time slots of the subframe of the second audio signal frame; and
performing the following steps for a time slot of the subframe of the second audio signal frame:
determining an accumulative energy value for the time slot;
determining whether the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame;
when the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame, using an index associated with the time slot to obtain a corresponding subframe of the first audio signal frame from the slot to subframe map and assigning an entry of the subframe to subframe map mapping the corresponding subframe of the first audio signal frame to the subframe of the second audio signal frame; and
when the accumulative energy value for the time slot is not greater than half the total energy for the subframe of the second audio signal frame, proceeding to the next time slot of the subframe of the second audio signal frame.
- The method as claimed in Claim 14, wherein determining an accumulative energy value for the time slot comprises:
adding an energy value of the time slot of the subframe of the second audio signal frame to a running total energy comprising a sum of energy values for previous time slots of the subframe of the second audio signal frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP23177532.1A EP4475122A1 (en) | 2023-06-06 | 2023-06-06 | Adapting spatial audio parameters for jitter buffer management |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4475122A1 (en) | 2024-12-11 |
Family
ID=86692800