
EP4475122A1 - Adapting spatial audio parameters for jitter buffer management - Google Patents


Info

Publication number
EP4475122A1
EP4475122A1 (Application EP23177532.1A)
Authority
EP
European Patent Office
Prior art keywords
subframe
audio signal
signal frame
time slots
slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP23177532.1A
Other languages
German (de)
French (fr)
Inventor
Mikko-Ville Laitinen
Tapani PIHLAJAKUJA
Lauros PAJUNEN
Jouni Kristian PAULUS
Lasse Juhani Laaksonen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP23177532.1A
Publication of EP4475122A1
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm

Definitions

  • the present application relates to apparatus and methods for adapting spatial audio metadata for the provision of jitter buffer management in immersive and spatial audio codecs.
  • Parametric spatial audio capture from inputs is a typical and an effective choice to estimate from the input (microphone array signals) a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
  • the directions and direct-to-total and diffuse-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata (which may also include other parameters such as surround coherence, spread coherence, number of directions, distance etc) for an audio codec.
  • these parameters can be estimated from microphone-array captured audio signals, and, for example, a stereo or mono signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency.
  • An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR).
  • This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
  • the codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
  • a decoder can decode the audio signals into PCM (Pulse code modulation) signals.
  • the decoder can also process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example, a binaural output.
  • the aforementioned immersive audio codecs are particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays).
  • an encoder can have other input types, for example, loudspeaker signals, audio object signals, or Ambisonic signals.
  • the decoder can output the audio in supported formats.
  • the IVAS decoder is also expected to handle the encoded audio streams and accompanying spatial audio metadata as RTP packets which may arrive with varying degrees of delay as a result of network jitter conditions in a packet-based network.
  • immersive audio codecs, such as 3GPP IVAS, are being planned which support a multitude of operating points ranging from a low bit rate operation to transparency. They are expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
  • the example codec is configured to be able to receive multiple input formats.
  • the codec is configured to obtain or receive a multi audio signal (for example, received from a microphone array, or as a multichannel audio format input, or an Ambisonics format input).
  • the codec is configured to handle more than one input format at a time.
  • Metadata-Assisted Spatial Audio is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
  • spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction (or directional value) a direct-to-total energy ratio, spread coherence, distance, etc.) per time-frequency (TF) tile.
  • the spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene.
  • a reasonable design choice which is able to produce a good quality output is one where one or more directions (and, associated with each direction, direct-to-total ratios, spread coherence, distance values, etc.) are determined for each time-frequency subframe.
  • the MASA analyser 101 is configured to receive the input audio signal(s) 100 and analyse the input audio signals to generate transport audio signal(s) 102 and spatial metadata 104.
  • the transport audio signal(s) 102 can be encoded, for example, using an IVAS audio core codec, or with an AAC (Advanced Audio Coding) or EVS (Enhanced Voice Services) encoder.
  • MASA spatial metadata is presented in the following table. These values are available for each time-frequency tile.
  • a frame is subdivided into 24 frequency bands and 4 temporal sub-frames. In other implementations other divisions of frequency and time can be employed.
  • a frame size (for example, as implemented in IVAS) is 20 ms (and thus the temporal sub-frame is 5 ms).
  • the MASA analyser is configured to determine 1 or 2 directions for each time-frequency tile (i.e., there are 1 or 2 direction index, direct-to-total energy ratio, and spread coherence parameters for each time-frequency tile).
  • the analyser is configured to generate more than 2 directions for a time-frequency tile.
  • Field: Direction index. Bits: 16. Description: Direction of arrival of the sound at a time-frequency parameter interval.
  • Field: Remainder-to-total energy ratio. Bits: 8. Description: Energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1. Calculated as energy of remainder sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of the number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
  • the frame size in IVAS is 20 ms.
  • An example of the frame structure is shown in Figure 2 where the metadata frame 201 comprises four temporal sub-frames which are 5 ms long, metadata sub-frame 1 202, metadata sub-frame 2 204, metadata sub-frame 3 206, and metadata sub-frame 4 208.
  • the IVAS frame can also be formed as a TF-representation of a complex-valued low delay filter band (CLDFB) where each subframe comprises 4 TF-slots and in the case of the above example this equates to each slot corresponding to 1.25 ms.
  • An example of the IVAS frame structure 300 corresponding to the TF-representation of the CLDFB is shown in Figure 3 .
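  • For orientation, the frame grid described above can be summarised in a short sketch (Python, with illustrative constant names that are not taken from any IVAS code base): a 20 ms frame, 4 metadata sub-frames of 5 ms, 4 CLDFB time slots of 1.25 ms per sub-frame, and 24 MASA frequency bands per sub-frame.

```python
# Illustrative constants for the IVAS/MASA frame grid described above.
FRAME_MS = 20.0             # one IVAS frame
SUBFRAMES_PER_FRAME = 4     # metadata sub-frames per frame
SLOTS_PER_SUBFRAME = 4      # CLDFB time slots per sub-frame
FREQ_BANDS = 24             # MASA frequency bands per sub-frame

SUBFRAME_MS = FRAME_MS / SUBFRAMES_PER_FRAME                  # 5 ms
SLOT_MS = SUBFRAME_MS / SLOTS_PER_SUBFRAME                    # 1.25 ms
SLOTS_PER_FRAME = SUBFRAMES_PER_FRAME * SLOTS_PER_SUBFRAME    # 16 (the default slot count)

# One spatial metadata parameter set exists per (sub-frame, frequency band) tile:
TILES_PER_FRAME = SUBFRAMES_PER_FRAME * FREQ_BANDS            # 96 TF-tiles per frame
```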
  • the MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1 and 7.1+4) or binaural signals.
  • An example rendering system that can be used for MASA is described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). "Optimized covariance domain framework for time-frequency processing of spatial audio". Journal of the Audio Engineering Society, 61(6), 403-411. Broadly speaking, the rendering method determines a target covariance matrix based on the spatial metadata and the energies of the TF-tiles of the transport audio signal(s).
  • the determined target covariance matrix may contain the channel energies of all channels and the inter-channel relationships between all channel pairs, in particular the cross-correlation and the inter-channel phase differences. These features are known to convey the perceptually relevant spatial features of a multichannel sound in various playback situations, such as binaural and multichannel loudspeaker audio.
  • the rendering process modifies the transport audio signals in the form of TF-tiles so that the resulting signals have a covariance matrix that resembles the target covariance matrix.
  • the rendered spatial audio signals (e.g., binaural signals) thus reproduce the spatial properties as captured by the spatial metadata.
  • as both the transport audio signal(s) and the spatial metadata vary in time, it is desirable that they remain in synchrony with each other. Failure to maintain synchrony may result in the production of unwanted artefacts in the output signals.
  • the following scenario depicts the unwanted effects of a failure to maintain synchrony.
  • the spatial metadata is mostly pointing towards the transient source (right side), because it typically has more energy than the constant noise source (left side).
  • when the audio transport signal(s) and spatial audio metadata are in mutual synchrony, the transients are correctly rendered from the right, while the noise is rendered from the left within the same passage of time.
  • when synchrony is lost, the transients may be rendered from the left, and the noise may be rendered from the right over a slightly skewed passage of time. This may result in the rendered sound containing strong artefacts with a consequential decrease in perceived audio quality.
  • network jitter and packet loss conditions can cause degradation in quality, for example, in conversational speech services in packet networks, such as the IP networks, and mobile networks such as fourth generation (4G LTE) and fifth generation (5G) networks.
  • the nature of the packet switched communications can introduce variations in the transmission times of the packets (containing frames), known as jitter, which can be seen by the receiver as packets arriving at irregular intervals.
  • an audio playback device requires a constant input with no interruptions in order to maintain good audio quality.
  • the decoder may have to consider those frames as lost and perform error concealment.
  • a jitter buffer can be utilised to manage network jitter by storing incoming frames for a predetermined amount of time (specified, e.g., upon reception of the first packet of a stream) in order to hide the irregular arrival times and provide constant input to the decoder and playback components.
  • a jitter buffer management scheme may be employed in order to dynamically control the balance between a short enough delay and a low enough number of delayed frames.
  • an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
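  • As a rough illustration of such adaptation logic, the sketch below derives a target buffering delay from observed packet delays. The mean-plus-jitter-margin rule and all names are assumptions for illustration only; they are not the control logic of the IVAS JBM or of any particular implementation.

```python
from collections import deque


class SimpleDelayTargetEstimator:
    """Illustrative jitter-buffer delay adaptation (assumed rule, not from this text):
    track recent transit delays and set the target buffering delay to the mean
    delay plus a safety margin proportional to the observed jitter."""

    def __init__(self, window=100, margin_factor=2.0, min_ms=20.0, max_ms=200.0):
        self.delays_ms = deque(maxlen=window)
        self.margin_factor = margin_factor
        self.min_ms = min_ms
        self.max_ms = max_ms

    def on_packet(self, transit_delay_ms):
        """Update the statistics with a newly observed delay and return the new target."""
        self.delays_ms.append(transit_delay_ms)
        mean = sum(self.delays_ms) / len(self.delays_ms)
        jitter = (sum((d - mean) ** 2 for d in self.delays_ms) / len(self.delays_ms)) ** 0.5
        target = mean + self.margin_factor * jitter
        return min(max(target, self.min_ms), self.max_ms)
```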
  • FIG. 4 shows how the IVAS decoder may be connected to a jitter buffer management system.
  • the receiver modem 401 can receive packets through a network socket such as an IP (Internet protocol) network socket which may be part of an ongoing Real-time Transport Protocol (RTP) session.
  • the received packets may be pushed to a RTP depacketizer module 403, which may be configured to extract the encoded audio stream frames (payload) from the RTP packet.
  • the RTP payload may then be pushed to a jitter buffer manager (JBM) 405 where various housekeeping tasks may be performed, such as updating frame receive statistics.
  • the jitter buffer manager 405 may also be arranged to store the received frames.
  • the jitter buffer manager 405 may be configured to pass the received frames to an IVAS decoder & renderer 407 for decoding. Accordingly, the IVAS decoder & renderer 407 passes the decoded frames back to the jitter buffer manager 405 in the form of digital samples (PCM samples). Also depicted in Figure 4 is an Acoustic player 409 which may be viewed as the module performing the playing out (or playback) of the decoded audio streams. The function performed by the Acoustic player 409 may be regarded as a pull operation in which it pulls the necessary PCM samples from the JBM buffer to provide uninterrupted audio playback of the audio streams.
  • FIG 5 is a system 500 depicting the general workings and interactions of a jitter buffer manager 405 with an IVAS decoder & renderer 407.
  • the jitter buffer manager 405 may comprise a jitter buffer 501, a network analyzer 502, an adaptation control logic 503 and an adaptation unit 504.
  • Jitter buffer 501 is configured to temporarily store one or more audio stream frames (such as an IVAS bitstream), which are received via a (wired or wireless) network, for instance, in the form of packets 506.
  • These packets 506 may for instance be RTP packets, which are unpacked by buffer 501 to obtain the audio stream frames.
  • Buffer status information 508 such as, for instance, information on a number of frames contained in buffer 501, or information on a time span covered by a number of frames contained in the buffer, or a buffering time of a specific frame (such as an onset frame), is transferred between buffer 501 and adaptation control logic 503.
  • Network analyzer 502 monitors the incoming packets 506 from the RTP depacketizer 403, for instance, to collect reception statistics (e.g., jitter, packet loss). Corresponding network analyzer information 507 is passed from network analyzer 502 to adaptation control logic 503.
  • Adaptation control logic 503 controls buffer 501.
  • This control comprises determining buffering times for one or more frames received by buffer 501 and is performed based on network analyzer information 507 and/or buffer status information 508.
  • the buffering delay of buffer 501 may, for instance, be controlled during comfort noise periods, during active signal periods or in-between.
  • a buffering time of an onset signal frame may be determined by adaptation control logic 503, and IVAS decoder & renderer 407 may (for instance, via adaptation unit 504, signals 509, and the signal 510 to control the IVAS decoder) then be triggered to extract this onset signal frame from buffer 501 when this determined buffering time has elapsed.
  • the IVAS decoder & renderer 407 can be arranged to pass the decoded audio samples to the adaption unit 504 via the connection 511.
  • Adaptation unit 504, if necessary, shortens or extends the output audio signal according to requests given by adaptation control logic 503 to enable buffer delay adjustment in a transparent manner.
  • the JBM system 500 may be required to perform time stretching/shortening in order to achieve continuous audio playback without the introduction of audio artifacts as a result of network jitter.
  • time stretching/shortening can be performed after rendering to the output audio signals in a manner similarly deployed by existing audio codecs such as EVS.
  • time stretching/shortening may contain pitch shifting as a part of the stretching process (especially for larger modifications to the output audio signal(s)). This may cause problems with binaural signals as the monaural cues (that allow the human hearing system to determine elevation) may become skewed, leading to erroneous perception of elevation.
  • time stretching/shortening may alter inter-channel relationships (such as altering phase and/or level relationships), which again may have a detrimental effect on the perception of direction.
  • the process of time stretching/shortening over many audio channels may be computationally complex.
  • performing time-stretching after rendering requires the renderer to be run for multiple frames to produce one frame of output.
  • performing time stretching after rendering may cause a varying motion-to-sound latency for head-tracked binaural rendering, resulting in a degradation of the perceived quality.
  • the time stretching/shortening process may be performed over the spatial audio metadata and transport audio signal(s).
  • this invention proceeds on the basis that the time stretching/shortening over the transport audio signal(s) has already been performed and focusses on the issues of time adapting the accompanying spatial audio metadata in order to maintain synchrony with the transport audio signal(s).
  • Figure 6 is a more detailed depiction of the jitter buffer management system 500 for IVAS.
  • RTP packets 601 are depicted as being received and passed to the RTP de-packer 602 which may be arranged to extract the IVAS frames 607.
  • the RTP de-packer 602 may also be arranged to obtain from the RTP streams so called RTP metadata 603 which can be used to update IP network metrics such as frame receive statistics which in turn may be used to estimate network jitter.
  • RTP metadata 603 may be passed to the Network jitter analysis and target delay estimator 604 where the RTP metadata 603, comprising packet timestamp and sequence information, may be analysed to provide a target playout delay parameter 605 for use in the adaption control logic processor 606.
  • the IVAS frames 607 as obtained by the de-packing process of the RTP de-packer 602 are depicted in Figure 6 as being passed to the de-jitter buffer 608.
  • the de-jitter buffer 608 is arranged to store the IVAS frames 607 in a frame buffer ready for decoding by IVAS audio & metadata decoder 610.
  • the de-jitter buffer 608 can also be arranged to perform frame-based delay adjustment on the stream of IVAS frames when instructed to by the adaption control logic processor 606, and also reorder the IVAS frames into a correct decoding order should they not arrive in the proper sequential order for decoding.
  • the output from the de-jitter buffer 608, in other words IVAS frames for decoding 609, may be passed to the IVAS audio & metadata decoder 610.
  • the IVAS audio decoder & metadata decoder 610 is arranged to decode the IVAS frames 609 into a decoded multichannel transport audio signal stream 611 (also referred to as the transport audio signal(s)) and a MASA spatial audio metadata stream 613 (also referred to as spatial audio metadata).
  • the MASA spatial audio metadata and decoded multichannel transport audio signal streams 613 and 611 may be passed on to subsequent processing blocks so that any time sequence adjustments to the respective signals may be performed.
  • in the case of the MASA spatial audio metadata the respective stream 613 is passed to the metadata adaptor 612, and in the case of the decoded multichannel transport audio signal the respective stream 611 is passed to the multi-channel time scale modifier (MC TSM) 614.
  • the MC TSM 614 is configured to time scale modify frames of the transport audio signals 611 under the direction of the adaption control logic processor 606. Basically, the MC TSM 614 performs the time stretching or time shortening of the transport audio signal(s) in the time domain in response to a time adjustment instruction provided by adaption control logic processor 606. The time adjustment instruction may be received by the MC TSM 614 along the control line 615 from the adaption control logic processor 606.
  • the output from the MC TSM 614 i.e., frames of the time-adjusted transport audio signals 621, may be passed to the renderer 616, a processing block termed the EXT output constructor 618 and the metadata adaptor 612.
  • the time-adjusted transport audio signal(s) 621 is used to assist in the adaption of the spatial audio metadata so that synchrony is better maintained.
  • the metadata adaptor 612 is essentially arranged to receive the spatial audio metadata parameters 613 corresponding to the frames of the transport audio signal(s) 611 that are delivered to the MC TSM 614, adapt these spatial audio metadata parameters 613 in accordance with the time adjustment instructions as provided by the adaption control logic processor 606, and maintain time synchrony with the time-adjusted transport audio signal(s) 621.
  • the metadata adaptor 612 is configured to receive the time-adjusted transport audio signal(s) 621 and the time adjustment instructions from the adaption control logic processor 606 along the signal line 617.
  • the metadata adaptor 612 may then be arranged to produce time-adapted spatial audio metadata which has time synchrony with the time-adjusted transport audio signals 621.
  • the time adapted spatial audio metadata is depicted as the signal 623 in Figure 6 and is shown as being passed to both the renderer 616 and EXT output constructor 618.
  • the renderer 616 can receive the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and render said signals into a multichannel spatial audio output signal.
  • the rendering may be performed in accordance with the rendering parameters 625.
  • the renderer 616 is also shown as receiving a signal from the adaption control logic processor 606 along the signal line 619.
  • FIG. 6 also shows a further output processing function in the form of the EXT output constructor 618.
  • This processing function simply takes the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and "packages" the signals into a single frame format suitable for outputting from a device, in other words a "spatial audio format" suitable for storage as a file type and the like.
  • the purpose of the EXT output constructor 618 is to output the spatial audio format as it was decoded with minimal changes to conform to a spatial audio format specification. It can be then stored, re-encoded, mixed, or rendered with an external renderer.
  • the jitter buffer management system for IVAS also comprises the adaption control logic processor 606.
  • the adaption control logic processor 606 is responsible for providing the time adjustment instructions to other processing blocks in the system. This may be realised by the adaption control logic processor 606 receiving a target delay estimation/parameter 605 from the network jitter analysis and target delay estimator 604 and the current playout delay from the playout delay estimator 620 and using this information to choose the method for playout delay adjustment to reach the target playout delay. This may be provided to the various processing blocks in the form of time adjustment instructions. The various processing blocks may then each individually utilise the received time adjustment instructions to perform appropriate actions so that the audio output from the renderer 616 is played out with the correct time.
  • the following functions may be configured to receive time adjustment instructions from the adaption control logic processor 606; de-jitter buffer 608, metadata adaptor 612, MC TSM 614, and the renderer 616.
  • the playout delay estimator 620 provides an estimate of the current playout delay to the adaption control logic processor 606 based on the information received from the de-jitter buffer 608 and the MC TSM 614.
  • the metadata adaptor 612 is arranged to adjust the spatial audio metadata 613 in accordance with the time adjustment instructions (playout delay time) whilst maintaining synchrony with the time-adjusted transport audio signals 621.
  • Figure 7 shows the metadata adaptor 612 according to embodiments in further detail.
  • the metadata adaptor 612 takes as input the time adjustment instructions 617 from the adaption control logic processor 606. This input may then be passed to the slot to subframe mapper 702.
  • the time adjustment instructions 617 may contain information pertaining to the number of subframes and hence audio time slots that are to be rendered. For the sake of brevity this information can be referred to as "slot adaptation info".
  • the original IVAS frame can be divided into a number of subframes with each subframe being divided into a further number of audio slots.
  • One such example comprises a 20 ms frame divided into 4 equal length subframes, with each subframe being evenly divided into 4 audio slots giving a total of 16 audio slots at 1.25 ms each.
  • the "slot adaptation info" may contain a parameter giving the number of audio slots N slots present in the time-adjusted transport audio signals 621, which in turn provides the number of subframes in the signal and consequently the frame size of the time-adjusted transport audio signals 621. This information may then be used to adapt the spatial audio parameters sets which are currently time aligned with the subframes of the original IVAS frame to being time aligned with the subframes of the time-adjusted transport audio signals 621.
  • original IVAS frame refers to the size of the IVAS frame before any time shortening/lengthening has taken place. So, it refers to the frames of the transport audio signal(s).
  • the parameter N slots may be different from the default number of slots in an original IVAS frame N slots _ default , with the default number of slots being the number of slots in an original IVAS frame before the time stretching/shortening process.
  • N slots_default is 16 audio slots.
  • the slot to subframe mapper 702 can be arranged to map the "original" default number of slots N slots_default of the original IVAS frame, to a different number of slots N slots distributed across the same number of subframes as the original IVAS frame. This has the outcome of mapping the slots/subframes of the adapted IVAS frame to the standard IVAS frame. This results in a pattern of mapped slots where some of the subframes (of the original IVAS frame) have either more or fewer mapped slots depending on whether the adapted slot number N slots is greater or less than the original number of slots N slots_default. For instance, if N slots < N slots_default then the process is a waveform shortening or output play speeding up operation, and if N slots > N slots_default then the process is a waveform lengthening or output play slowing down operation.
  • the slot to subframe mapper 702 may be arranged to map the N slots time slots of the adapted IVAS frame to the subframes of the original IVAS frame to produce a map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame.
  • mapping of each slot associated with the time adapted transport audio signal(s) 621 (adapted IVAS frame) to a subframe of the original IVAS frame is performed on the premise that the assigned subframe (in the original IVAS frame) best matches the temporal position of the slot in the time adapted transport audio signal(s) 621 (adapted IVAS frame).
  • each subframe comprises a set of spatial audio parameters.
  • each group of four audio slots in the original IVAS frame is associated with the spatial audio parameter set of one of the subframes. Therefore, the consequence of slot to subframe mapping process may be viewed as associating different groups of slots with the spatial audio parameter sets of the original IVAS frame.
  • Figure 8 shows an example subframe to slot mapping process when the adapted slot number N slots is 12 and the original number of slots N slots _ default is 16.
  • this Figure 8 depicts an example of waveform shortening (decreasing the playing out time).
  • the relationship between slots to subframes for the original IVAS frame, where every 4 slots is mapped to a subframe is shown as 801 in Figure 8 , i.e., slots s1 to s4 are mapped to subframe 1, slots s5 to s8 are mapped to subframe 2, slots s9 to s12 are mapped to subframe 3 and slots s13 to s16 are mapped to subframe 4.
  • the result of the mapping process where 12 slots are mapped to the 4 subframes of the original IVAS frame may be shown as 802 in Figure 8 .
  • the slot to subframe mapping process has resulted in the first three slots (s1, s2, s3) being mapped to the first subframe.
  • the fourth and fifth slots (s4, s5) have been mapped to the second subframe.
  • Slots s6, s7, s8 are mapped to subframe 3.
  • slots s9, s10, s11 and s12 are now mapped to subframe 4.
  • the above subframe to slot mapping process can be performed by initially dividing the number of adapted slots N slots into two contiguous regions.
  • the second region is made up of the remaining slots s1 to s8, i.e., the run of slots starting from the beginning of the frame, and going up to slot number ( N slots -N slots _ end ) .
  • the subframe to slot mapping process takes the N slots_end highest ordered slots of the adapted IVAS frame and matches each of them on an ordered one-to-one basis to the N slots_end highest ordered slots of the original IVAS frame, and consequently to the subframes associated with these slots.
  • This processing step may be illustrated by referring to the example of Figure 8, where the slots of the adapted IVAS frame s9, s10, s11 and s12 are mapped on a one-to-one basis to the 4 highest ordered slots of the original IVAS frame, s13, s14, s15 and s16, i.e., to subframe 4.
  • M slot_sf ( m ) first is the subframe to slot mapping function which gives the subframe to slot map for the first region. This function returns the mapped subframe number (with respect to the original IVAS frame) for each slot m of the first region of slots of the adapted IVAS frame.
  • N slots,remdefault is the number of original slots remaining after the N slots,end slots have been removed, i.e., N slots,remdefault = N slots,default - N slots,end.
  • Figure 9 shows an example subframe to slot mapping process when the adapted slot number N slots is 20 and again the original number of slots N slots _ default is 16.
  • Figure 9 depicts an example of waveform extending (increasing the playing out time).
  • the distribution of slots in the standard IVAS frame is shown as 901 in Figure 9 and the result of the slot to subframe mapping process where 20 slots (and hence 5 subframes of the time adapted transport audio signal(s) 621 frame (adapted IVAS frame) are mapped to the 4 subframes of the standard IVAS frame is shown as 902 in Figure 9 .
  • in this example N slots,end is 12.
  • 903 shows the relationship between the subframes of the time adapted transport audio signals 621 frame and the N slots slots.
  • the second region of slots is therefore given by the lowest ordered contiguous run of slots from the first slot, s1 to the highest slot with the slot number N slots - N slots,end .
  • the slot to subframe mapping process then maps this contiguous run of lowest ordered slots by distributing them across a number of the subframes, starting from the lowest numbered subframe. For instance, when N slots > N slots_default, i.e., a period of waveform lengthening, the second region of slots (of the adapted IVAS frame) may be distributed across the first subframe and subsequent subframes up to and including the subframe to which the lowest ordered slot from the first region is mapped. For instance, when N slots < N slots_default, i.e., a period of waveform shortening, the second region of slots (of the adapted IVAS frame) may be distributed across all subframes of the IVAS frame.
  • M slot-sf ( m ) second is the subframe to slot mapping function which gives the subframe (of the original IVAS frame) to slot map for the second region. This function returns the mapped subframe number (of the original IVAS frame) for each slot m of the second region of slots of the adapted IVAS frame.
  • the output from the slot to subframe mapper 702 is the combined slot to subframe map for both the first and second regions and may be referred to as M slot-sf ( m ) .
  • This output is depicted as 701 in Figure 7 .
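  • A hedged sketch of this two-region mapping is given below. The one-to-one mapping of the tail region follows the description above, but the exact formulas M slot_sf ( m ) for distributing the remaining (second-region) slots are not reproduced in this text, so a simple proportional spread with an assumed rounding rule is used; the resulting per-subframe slot counts may therefore differ slightly from those drawn in Figures 8 and 9.

```python
def two_region_slot_to_subframe_map(n_slots, n_slots_end,
                                    n_slots_default=16, slots_per_subframe=4):
    """Illustrative sketch of the two-region slot-to-subframe mapping.

    Returns a list of length n_slots giving, for each adapted-frame slot
    (0-based), the original-frame subframe index (0-based, i.e. subframe 1 of
    the figures is index 0). The proportional spread used for the second
    region is an assumption.
    """
    mapping = [0] * n_slots
    # Region 1: the n_slots_end highest-ordered adapted slots are matched
    # one-to-one to the n_slots_end highest-ordered original slots.
    for i in range(n_slots_end):
        orig_slot = n_slots_default - n_slots_end + i
        mapping[n_slots - n_slots_end + i] = orig_slot // slots_per_subframe
    # Region 2: the remaining lowest-ordered adapted slots are spread over the
    # remaining original slots (N slots,remdefault = N slots,default - N slots,end).
    n_second = n_slots - n_slots_end
    n_rem_default = n_slots_default - n_slots_end
    for m in range(n_second):
        orig_slot = (m * n_rem_default) // n_second   # assumed rounding rule
        mapping[m] = orig_slot // slots_per_subframe
    return mapping


# Example in the spirit of Figure 8 (shortening, 12 slots, tail region of 4 slots):
# two_region_slot_to_subframe_map(12, 4) -> [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]
```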
  • the slot to subframe mapper 702 may be arranged to distribute the N slots of the adapted IVAS frame across the subframes of the original IVAS frame in a different manner to the above embodiments. In these embodiments, there may be no mapping between the N slots slots of the adapted IVAS frame and the N slots_default slots of the original IVAS frame. Instead, the N slots slots of the adapted IVAS frame may be mapped directly to the subframes of the original IVAS frame using the following routine.
  • N slots,sf is equivalent to L sf in the above embodiment, which is 4 for the standard IVAS frame
  • L map may be the same for all subframes of the original IVAS frame.
  • the N slots of the adapted IVAS frame may be distributed evenly across the subframes of the original IVAS frame.
  • N.B. for the standard IVAS frame size i sf will take the values 1 to 4, corresponding to subframes 1 to 4 of the original IVAS frame.
  • Figure 10 shows an example subframe to slot mapping process according to these embodiments where the adapted slot number N slots is 12 and the original number of slots N slots _ default is 16.
  • the relationship between slots and subframes for the original IVAS frame, where every 4 slots is mapped to a subframe, is shown as 1001 in Figure 10 .
  • the result of the mapping process where 12 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1002 in Figure 10. It can be seen that the 12 slots of the adapted IVAS frame have been evenly distributed across the 4 subframes of the original IVAS frame, i.e., 3 slots are mapped to each subframe.
  • Figure 11 depicts further examples of the subframe to slot mapping process according to these embodiments where the adapted slot numbers N slots are 13 and 14 and the original number of slots N slots_default is 16.
  • the result of the mapping process where 13 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1101 in Figure 11 .
  • the output 701 in Figure 7 from the slot to subframe mapper 702 is the above slot to subframe map M slot_sf ( m ).
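  • A minimal sketch of this even-distribution variant follows (illustrative Python; the rule that any extra slots go to the earliest subframes is an assumption about how "evenly" is realised). For N slots = 12 it reproduces the 3-3-3-3 split of Figure 10; for 13 and 14 slots it yields 4-3-3-3 and 4-4-3-3 splits, which is one plausible reading of Figure 11.

```python
def even_slot_to_subframe_map(n_slots, n_subframes=4):
    """Distribute n_slots adapted-frame slots as evenly as possible over the
    n_subframes subframes of the original frame (0-based subframe indices).
    Giving the extra slots to the earliest subframes is an assumption."""
    base, extra = divmod(n_slots, n_subframes)
    counts = [base + (1 if i < extra else 0) for i in range(n_subframes)]
    mapping = []
    for subframe, count in enumerate(counts):
        mapping.extend([subframe] * count)
    return mapping


# Examples:
# even_slot_to_subframe_map(12) -> [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]   (Figure 10)
# even_slot_to_subframe_map(13) -> 4, 3, 3, 3 slots per subframe
# even_slot_to_subframe_map(14) -> 4, 4, 3, 3 slots per subframe
```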
  • Figure 7 also shows the energy determiner 704, which is shown as receiving the time adjustment instructions 617 and the time-adjusted transport audio signal(s) 621 frame (adapted IVAS frame).
  • the function of the energy determiner 704 is to determine the energy of the adapted IVAS frame on a slot-by-slot basis according to the number of slots N slots .
  • the energy determiner 704 takes in a frame of length N slots × slot width (1.25 ms) of the time-adjusted transport audio signals 621, i.e., an adapted IVAS frame, and effectively divides the frame into N slots time slots and then determines the energy across all the audio signals of the adapted IVAS frame for each slot.
  • q is the number of time shifted transport audio signals/channels in the signal 621.
  • the output from the energy determiner 704 is the energy E for each time slot m for an adapted IVAS frame. This is shown as the output 703 in Figure 7 .
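  • In essence, E ( m ) is a measure of signal energy over all q transport channels within slot m. The sketch below assumes a simple sum-of-squares definition and illustrative names; the exact energy measure used by the energy determiner 704 may differ (for example, it may be computed per frequency band in the time-frequency domain, as noted later).

```python
def slot_energies(transport, n_slots):
    """Illustrative per-slot energy E(m) of an adapted IVAS frame.

    transport: list of q channels, each a sequence of time-domain samples for
               one adapted frame (all channels the same length).
    n_slots:   number of time slots N slots in the adapted frame.
    Returns a list of n_slots values, each the sum of squared samples over all
    channels within that slot (an assumed, simple energy definition).
    """
    frame_len = len(transport[0])
    slot_len = frame_len // n_slots
    energies = []
    for m in range(n_slots):
        start, stop = m * slot_len, (m + 1) * slot_len
        energies.append(sum(x * x for ch in transport for x in ch[start:stop]))
    return energies
```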
  • Figure 7 further shows the subframe-to-subframe map determiner 706, which is depicted as receiving the energy E for each time slot m of the adapted IVAS frame 703 and the slot to subframe map M slot-sf ( m ) 701.
  • the function of the subframe-to-subframe map determiner 706 is to determine, for each subframe of the adapted IVAS frame, a subframe from the original IVAS frame whose associated spatial audio parameters most closely align with the audio signal of the subframe of the adapted IVAS frame.
  • This may be performed in order to provide a map whereby a subframe of the adapted IVAS frame is mapped to a subframe of the original IVAS frame.
  • the subframe-to-subframe mapping determiner 706 may be arranged to use the map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame to produce a map for mapping a subframe of the adapted IVAS frame to a subframe of the original IVAS frame.
  • this function may be performed by the subframe-to-subframe map determiner 706 being arranged to use the slot to subframe map 701 and the energy E for each time slot m 703 to determine an energy to subframe map for each subframe of the original IVAS frame.
  • the energy to subframe map determiner 706 determines for each subframe of the original IVAS frame the energy of the adapted IVAS frame slots which were mapped to the subframe.
  • M E-sf ( n ) is the energy of the slots mapped to a subframe n of the original IVAS frame, i.e., the sum of the slot energies E ( m ) over the adapted IVAS frame slots mapped to subframe n by the slot to subframe mapping M slot-sf ( m ), where m n A is the list of adapted IVAS frame slots mapped to subframe n (of the original IVAS frame), m n A (0) represents the first slot mapped to subframe n and m n A ( N - 1) the last slot mapped to subframe n, where N represents the number of slots in the list m n A.
  • the understanding of the above equation may be enhanced by returning to the example of Figure 8 .
  • the next step performed by the subframe-to-subframe map determiner 706 is to determine, for a subframe n A of the adapted IVAS frame, the subframe n max (of the original IVAS frame) which gives the maximum energy to subframe value M E-sf ( n ) among the subframes to which the slots of the subframe n A are mapped. This may be performed for all subframes of the adapted IVAS frame.
  • the pseudo code for this step may have the following form:
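  • The pseudo code itself is not reproduced in this text; as a stand-in, the following Python sketch (with illustrative names) shows one way to realise the max-energy selection described above, reusing the slot to subframe map and the per-slot energies.

```python
def subframe_map_max_energy(slot_to_subframe, slot_energy, slots_per_subframe=4):
    """Illustrative sketch: for each adapted-frame subframe n_A, pick the original
    subframe whose total mapped-slot energy M_E-sf(n) is largest among the
    original subframes that the slots of n_A are mapped to (0-based indices)."""
    n_slots = len(slot_to_subframe)
    # Energy-to-subframe map: total adapted-slot energy mapped to each original subframe.
    energy_per_orig_sf = {}
    for m, n in enumerate(slot_to_subframe):
        energy_per_orig_sf[n] = energy_per_orig_sf.get(n, 0.0) + slot_energy[m]
    # For each adapted subframe, choose the candidate original subframe with maximum energy.
    sf_map = []
    for n_a in range(n_slots // slots_per_subframe):
        slots = range(n_a * slots_per_subframe, (n_a + 1) * slots_per_subframe)
        candidates = {slot_to_subframe[m] for m in slots}
        sf_map.append(max(candidates, key=lambda n: energy_per_orig_sf[n]))
    return sf_map
```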
  • the subframe n max may more often be chosen from the beginning of the slot to subframe map section M slot-sf ( m ), where m ∈ [ m n A (0), m n A ( N - 1)].
  • subframe-to-subframe mapping function may be performed according to the flow chart presented in Figure 12 .
  • the number of subframes of the adapted IVAS frame may be determined or communicated to the subframe-to-subframe map determiner 706.
  • the number of subframes in the adapted IVAS frame N A may be based on the premise that each subframe comprises the same number of slots as each subframe of the original IVAS frame.
  • the step of determining/acquiring the number of subframes in an adapted IVAS frame is shown in Figure 12 as the processing step 1201.
  • a processing loop is then performed over the subframes of the adapted IVAS frame, n A = 0 : N A - 1.
  • the subframe level processing loop comprises the steps 1207 to 1211.
  • the total energy for the first subframe of the adapted IVAS signal will comprise the sum of the slot energies E (0) to E (3) for slots s1 to s4, (i.e. m 1 (0) to m 1 (3)).
  • the total energy for the second subframe of the adapted IVAS signal will comprise the sum of the slot energies E (4) to E (7) for slots s5 to s8, (i.e. m 2 (0) to m 2 (3)).
  • the total energy for the third subframe will comprise the totals for E (8) to E (11) for slots s9 to s12, (i.e. m 3 (0) to m 3 (3)).
  • the final slot s13 may either be processed as a non-full subframe, or it may be buffered for the next decoded IVAS frame.
  • the step of determining the total energy for the slots of the subframe of the adapted IVAS frame is shown as processing step 1207 in Figure 12 .
  • the next step of the subframe processing loop initialises an accumulative energy factor E cum for the subframe of the adapted IVAS frame. This is shown as processing step 1209 in Figure 12 .
  • This is shown as the processing step 1211 and comprises the steps 1213 to 1217.
  • the first step of the slot level processing loop adds the energy of a current slot E ( k ) to the accumulative energy E cum . This is shown as step 1213 in Figure 12 .
  • the slot level processing loop then checks whether the accumulative energy E cum is greater than E tot /2 for the subframe. This is shown as the processing step 1215.
  • if it was determined at step 1215 that the above criterion had been met, the slot level processing loop progresses to the processing step 1217.
  • the index k (which has led to the above criterion being met) is used to map a subframe of the original IVAS frame to the subframe of the adapted IVAS frame. This may be performed by taking the subframe of the original IVAS frame which houses the index k and assigning this subframe [of the original IVAS frame] to the subframe n A [of the adapted IVAS frame]. As mentioned above the mapping (or relationship) between slots of the adapted IVAS frame and subframes of the original IVAS frame is given by the mapping function M slot-sf ( m ).
  • at step 1215, if the criterion is not met, i.e. E cum is not greater than E tot /2 for the subframe n A, then the process selects the next slot of the subframe of the adapted IVAS frame and repeats steps 1213 and 1215.
  • the result of the processing steps of Figure 12 is the subframe-to-subframe map/table M sf-sf with an entry for all subframes n A of the adapted IVAS frame.
  • This subframe-to-subframe map M sf-sf may then be used to obtain the one-to-one mapping from each subframe of the adapted IVAS frame to the optimum subframe of the original IVAS frame.
  • the subframe-to-subframe map M sf-sf may form the output 705 of the subframe-to-subframe map determiner 706.
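  • The flow of Figure 12 can be sketched as follows (illustrative Python, assuming full subframes of slots_per_subframe slots and 0-based indices): for each adapted subframe the original subframe is taken from the slot at which the accumulated slot energy first exceeds half of the subframe's total energy.

```python
def subframe_map_cumulative_energy(slot_to_subframe, slot_energy, slots_per_subframe=4):
    """Illustrative sketch of the Figure 12 flow: map each adapted-frame subframe n_A
    to an original-frame subframe via the slot at which the cumulative energy E_cum
    first exceeds E_tot / 2 (0-based indices; full subframes only)."""
    n_slots = len(slot_to_subframe)
    n_subframes_adapted = n_slots // slots_per_subframe          # step 1201
    sf_map = []
    for n_a in range(n_subframes_adapted):                       # loop over subframes n_A
        slots = list(range(n_a * slots_per_subframe, (n_a + 1) * slots_per_subframe))
        e_tot = sum(slot_energy[k] for k in slots)               # step 1207
        e_cum = 0.0                                               # step 1209
        chosen = slots[-1]                                        # assumed fallback (e.g. all-zero energy)
        for k in slots:                                           # slot-level loop 1211
            e_cum += slot_energy[k]                               # step 1213
            if e_cum > e_tot / 2:                                 # step 1215
                chosen = k                                        # step 1217
                break
        sf_map.append(slot_to_subframe[chosen])                   # entry of M sf-sf for n_A
    return sf_map
```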
  • Figure 7 also shows the spatial audio metadata adaptor 708, which can be arranged to receive the spatial audio metadata 613 and the subframe-to-subframe map M sf-sf 705 and produce as output the time adapted spatial audio metadata 623.
  • the spatial audio metadata adaptor 708 is arranged to assign a spatial audio parameter set of the original IVAS frame to each subframe of the adapted IVAS frame n A by using the subframe-to-subframe map M sf-sf 705. For each entry n A of the subframe-to-subframe map M sf-sf 705 there is a corresponding original IVAS subframe index n. The index n may then be used to assign the spatial audio parameter set of subframe n of the original IVAS frame to subframe n A of the adapted IVAS frame or in other words subframe n A of the time-adjusted transport audio signal(s) 621 frame.
  • this mechanism can be repeated for the other spatial parameters in the MASA spatial audio parameter set to give the adapted MASA spatial audio parameter set for subframe n A of the time-adjusted transport audio signal(s) 621 frame.
  • the time adapted spatial audio metadata 623 output therefore may comprise a spatial audio parameter set for each subframe n A of the time-adjusted transport audio signal(s) 621 frame.
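  • Putting the pieces together, the assignment itself is then a simple lookup; the sketch below assumes the parameter sets are held per subframe (names and data structures are illustrative).

```python
def adapt_spatial_metadata(original_param_sets, sf_to_sf_map):
    """Copy, for each adapted-frame subframe n_A, the spatial audio parameter set of
    the original-frame subframe given by the subframe-to-subframe map M sf-sf.

    original_param_sets: per-subframe parameter sets of the original frame (e.g. dicts
                         holding direction index, energy ratios, coherences, ...).
    sf_to_sf_map:        list whose entry n_A is the mapped original subframe index n.
    """
    return [original_param_sets[n] for n in sf_to_sf_map]


# Example use, chaining the illustrative sketches above:
# slot_map = even_slot_to_subframe_map(12)
# energies = slot_energies(transport_channels, 12)
# sf_map = subframe_map_cumulative_energy(slot_map, energies)
# adapted_metadata = adapt_spatial_metadata(original_metadata_per_subframe, sf_map)
```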
  • the audio signal and the metadata may be out of synchrony after decoding, and the synchronization step is performed after the JBM process and the output of the audio and metadata.
  • a delay may be needed to allow use of correct slot energy in the weighting process. This may be achieved by simply delaying the audio signal or the original metadata as necessary.
  • a ring buffer may be used for such a purpose.
  • the process of selecting metadata and calculating energies may be done in time-frequency domain.
  • the metadata selection may be done for each subframe & frequency band combination separately using time slots and frequency bands.
  • the process of forming the subframe-to-subframe map M sf-sf 705 may use signal energy only for one of the cases, waveform extending (increasing the playing out time) or waveform shortening (decreasing the playing out time).
  • the audio and metadata format may be some other format than MASA format, or the audio and metadata format is derived from some other format during encoding and decoding in codec.
  • the energy of some slots may be missing or unobtainable, e.g., due to asynchrony.
  • the energy of these slots can be approximated from the other slots in the current frame, and in history, that have an obtainable energy value.
  • An example of such an approximation is the average energy value of the other slots with an obtainable energy value, which may be assigned as the energy value of any slot with missing energy.
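  • A minimal sketch of that fallback, assuming missing energies are flagged as None:

```python
def fill_missing_slot_energies(energies):
    """Replace missing slot energies (None) with the average of the obtainable ones,
    as in the approximation described above. If no energy is obtainable at all,
    zeros are used as an assumed fallback."""
    known = [e for e in energies if e is not None]
    fill = sum(known) / len(known) if known else 0.0
    return [fill if e is None else e for e in energies]
```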
  • FIG. 13 shows an example system within which some embodiments can be implemented.
  • the transport audio signals 102 and the spatial metadata 104 are passed to an encoder 1301 which generates an encoded bitstream 1302.
  • the encoded bitstream 1302 is received by the decoder 1303 which is configured to generate a spatial audio output 1304.
  • the transport audio signals 102 and the spatial metadata 104 can be obtained in the form of a MASA stream.
  • the MASA stream can, for example, originate from a mobile device (containing a microphone array), or as an alternative example, it may have been created by an audio server that has potentially processed a MASA stream in some way.
  • the encoder 1301 can furthermore, in some embodiments, be an IVAS encoder.
  • the decoder 1303, in some embodiments, can be configured to directly output the spatial audio output 1304 to be rendered by an external renderer, or edited/processed by an audio server.
  • the decoder 1303 comprises a suitable renderer, which is configured to render the output in a suitable form, such as binaural audio signals or multichannel loudspeaker signals (such as 5.1 or 7.1+4 channel format), which are also examples of spatial audio output 1304.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device may for example be configured to implement the encoder and/or decoder or any functional block as described above.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises at least one memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth ® , personal communications services (PCS), ZigBee ® , wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
  • the transceiver input/output port 1409 may be configured to receive the signals.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar, and to loudspeakers.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
  • circuitry may refer to one or more or all of the following:
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.
  • non-transitory is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

An apparatus for spatial audio decoding, configured to: receive a first audio signal frame comprising a number of subframes; receive a parameter indicating a total number of time slots of a second audio signal frame; map the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame to produce a map for mapping a time slot of the second audio signal frame to a subframe of the first audio signal frame; use this map to produce a map for mapping a subframe of the second audio signal frame to a subframe of the first audio signal frame; and use the map for mapping a subframe of the second audio signal frame to a subframe of the first audio signal frame to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.

Description

    Field
  • The present application relates to apparatus and methods for adapting spatial audio metadata for the provision of jitter buffer management in immersive and spatial audio codecs.
  • Background
  • Parametric spatial audio capture from inputs, such as microphone arrays and other sources, is a typical and an effective choice to estimate from the input (microphone array signals) a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
  • The directions and direct-to-total and diffuse-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata (which may also include other parameters such as surround coherence, spread coherence, number of directions, distance etc) for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and, for example, a stereo or mono signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. A decoder can decode the audio signals into PCM (Pulse code modulation) signals. The decoder can also process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example, a binaural output.
  • The aforementioned immersive audio codecs are particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays). However, such an encoder can have other input types, for example, loudspeaker signals, audio object signals, or Ambisonic signals. Similarly, it is expected that the decoder can output the audio in supported formats. In this regard the IVAS decoder is also expected to handle the encoded audio streams and accompanying spatial audio metadata as RTP packets which may arrive with varying degrees of delay as a result of network jitter conditions in a packet-based network.
  • Summary of the Figures
  • For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
    • Figure 1 shows schematically an apparatus for MASA metadata extraction;
    • Figure 2 shows schematically an example MASA metadata frame sub-frame structure;
    • Figure 3 shows schematically an example MASA frame structure divided into audio slots;
    • Figure 4 shows schematically a receiver system deploying a jitter buffer management scheme according to embodiments;
    • Figure 5 shows an example schematic block diagram of a jitter buffer manager;
    • Figure 6 shows schematically an example deployment of a jitter buffer manager in an IVAS decoder;
    • Figure 7 shows schematically a metadata adaptor suitable for employing as part of the decoder as shown in Figure 6, configured to time adapt spatial audio metadata according to embodiments;
    • Figure 8 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to some embodiments;
    • Figure 9 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is lengthened in time according to some embodiments;
    • Figure 10 schematically shows the relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to other embodiments;
    • Figure 11 schematically shows further relationships between the distribution of time slots and subframes for an original length IVAS frame and distribution of time slots and subframes for an IVAS frame which is shortened in time according to other embodiments;
    • Figure 12 shows a flow diagram of an operation of the subframe-to-subframe map determiner as shown in Figure 7 according to some embodiments;
    • Figure 13 shows schematically an example system of apparatus suitable for implementing some embodiments; and
    • Figure 14 shows an example device suitable for implementing the apparatus shown in previous figures.
    Embodiments of the Application
  • The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio signals comprising transport audio signals and spatial metadata. As indicated above immersive audio codecs (such as 3GPP IVAS) are being planned which support a multitude of operating points ranging from a low bit rate operation to transparency. It is expected to support channel-based audio, object-based audio, and scene-based audio inputs including spatial information about the sound field and sound sources.
  • In the following the example codec is configured to be able to receive multiple input formats. In particular, the codec is configured to obtain or receive multiple input audio signals (for example, from a microphone array, a multichannel audio format input, or an Ambisonics format input). Furthermore, in some situations the codec is configured to handle more than one input format at a time. Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
  • It can be considered an audio representation consisting of 'N channels + spatial metadata'. It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound source directions and, e.g., energy ratios. Sound energy that is not defined (described) by the directions is described as diffuse (coming from all directions).
  • As discussed above spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction (or directional value) a direct-to-total energy ratio, spread coherence, distance, etc.) per time-frequency (TF) tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene. For example, a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe, together with the associated direct-to-total ratios, spread coherence, distance values etc determined for each direction.
  • With respect to Figure 1 is shown an example MASA analyser 101. The MASA analyser 101 is configured to receive the input audio signal(s) 100 and analyse the input audio signals to generate transport audio signal(s) 102 and spatial metadata 104.
  • The transport audio signal(s) 102 can be encoded, for example, using an IVAS audio core codec, or with an AAC (Advanced Audio Coding) or EVS (Enhanced Voice Services) encoder.
  • Examples of MASA spatial metadata are presented in the following table. These values are available for each time-frequency tile. In some implementations a frame is subdivided into 24 frequency bands and 4 temporal sub-frames. In other implementations other divisions of frequency and time can be employed. Furthermore, in some implementations a frame size (for example, as implemented in IVAS) is 20 ms (and thus the temporal sub-frame is 5 ms). However, similarly, other frame lengths can be employed in other embodiments. In some embodiments the MASA analyser is configured to determine 1 or 2 directions for each time-frequency tile (i.e., there are 1 or 2 direction index, direct-to-total energy ratio, and spread coherence parameters for each time-frequency tile). However, in some embodiments the analyser is configured to generate more than 2 directions for a time-frequency tile.
    Field (number of bits): Description
    Direction index (16): Direction of arrival of the sound at a time-frequency parameter interval. Spherical representation at about 1-degree accuracy. Range of values: "covers all directions at about 1° accuracy". Values stored as 16-bit unsigned integers.
    Direct-to-total energy ratio (8): Energy ratio for the direction index (i.e., time-frequency subframe). Calculated as energy in direction / total energy. Range of values: [0.0, 1.0]. Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
    Spread coherence (8): Spread of energy for the direction index (i.e., time-frequency subframe). Defines the direction to be reproduced as a point source or coherently around the direction. Range of values: [0.0, 1.0]. Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
    Diffuse-to-total energy ratio (8): Energy ratio of non-directional sound over surrounding directions. Calculated as energy of non-directional sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
    Surround coherence (8): Coherence of the non-directional sound over the surrounding directions. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
    Remainder-to-total energy ratio (8): Energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of energy ratios is 1. Calculated as energy of remainder sound / total energy. Range of values: [0.0, 1.0]. (Parameter is independent of number of directions provided.) Values stored as 8-bit unsigned integers with uniform spacing of mapped values.
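  • By way of illustration only, the per-tile parameter set listed above could be held in a structure such as the following Python sketch (hypothetical field names; the quantised on-the-wire values are the 16-bit and 8-bit unsigned integers described in the table, shown here de-quantised to floating point):

    from dataclasses import dataclass

    @dataclass
    class MasaTileMetadata:
        """Illustrative container for the MASA metadata of one time-frequency tile."""
        azimuth_deg: float               # decoded from the 16-bit direction index
        elevation_deg: float             # decoded from the 16-bit direction index
        direct_to_total_ratio: float     # range [0.0, 1.0]
        spread_coherence: float          # range [0.0, 1.0]
        diffuse_to_total_ratio: float    # range [0.0, 1.0]
        surround_coherence: float        # range [0.0, 1.0]
        remainder_to_total_ratio: float  # range [0.0, 1.0]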
  • As discussed above the frame size in IVAS is 20 ms. An example of the frame structure is shown in Figure 2 where the metadata frame 201 comprises four temporal sub-frames which are 5 ms long, metadata sub-frame 1 202, metadata sub-frame 2 204, metadata sub-frame 3 206, and metadata sub-frame 4 208. The IVAS frame can also be formed as a TF-representation of a complex-valued low delay filter bank (CLDFB) where each subframe comprises 4 TF-slots and in the case of the above example this equates to each slot corresponding to 1.25 ms. An example of the IVAS frame structure 300 corresponding to the TF-representation of the CLDFB is shown in Figure 3. There are 16 audio slots in total (audio slot 16, 316) and each metadata subframe is divided into 4 audio slots with the first metadata subframe having audio slots 1, 2, 3, and 4 depicted in Figure 3 as slots 301, 302, 303, and 304 respectively.
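  • As a simple illustration of this default layout (a Python sketch, assuming 0-based indices rather than the 1-based numbering used in Figure 3), the slot-to-subframe relation of an unmodified frame is just an integer division:

    # Default IVAS frame layout: 20 ms frame, 4 subframes of 5 ms,
    # 4 CLDFB slots of 1.25 ms per subframe, 16 slots in total.
    N_SUBFRAMES = 4
    N_SLOTS_PER_SUBFRAME = 4
    N_SLOTS_DEFAULT = N_SUBFRAMES * N_SLOTS_PER_SUBFRAME  # 16

    def default_slot_to_subframe(slot: int) -> int:
        """Subframe (0-based) that a slot (0-based) of an unmodified frame belongs to."""
        return slot // N_SLOTS_PER_SUBFRAME

    # Slots 0..3 -> subframe 0, slots 4..7 -> subframe 1, slots 8..11 -> subframe 2, ...
    assert [default_slot_to_subframe(m) for m in range(N_SLOTS_DEFAULT)] == \
           [0]*4 + [1]*4 + [2]*4 + [3]*4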
  • The MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1 and 7.1+4) or binaural signals. An example rendering system that can be used for MASA is described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). "Optimized covariance domain framework for time-frequency processing of spatial audio". Journal of the Audio Engineering Society, 61(6), 403-411. Broadly speaking, the rendering method determines a target covariance matrix based on the spatial metadata and the energies of the TF-tiles of the transport audio signal(s). The determined target covariance matrix may contain the channel energies of all channels and the inter-channel relationships between all channel pairs, in particular the cross-correlation and the inter-channel phase differences. These features are known to convey the perceptually relevant spatial features of a multichannel sound in various playback situations, such as binaural and multichannel loudspeaker audio.
  • The rendering process modifies the transport audio signals in the form of TF-tiles so that the resulting signals have a covariance matrix that resembles the target covariance matrix. As a result, the rendered spatial audio signals (e.g., binaural signals) are perceived to have the spatial properties as captured by the spatial metadata.
  • However, since both the transport audio signal(s) and the spatial metadata vary in time it is desirable that they remain in synchrony with each other. Failure to maintain synchrony may result in the production of unwanted artefacts in the output signals. By way of illustration the following scenario depicts the unwanted effects of a failure to maintain synchrony. Consider a sound scene containing an ambient noise source on left, and a transient-like source on right (such as a drum roll). For simplicity, let us also assume MASA capture using a single direction per time-frequency tile. When the transient source (on right side) is silent, the spatial metadata is mostly pointing towards the noise source (on left side). On the other hand, when there is a transient event such as a drum hit, the spatial metadata is mostly pointing towards the transient source (right side), because it typically has more energy than the constant noise source (left side). When the audio transport signal(s) and spatial audio metadata are in mutual synchrony, the transients are correctly rendered from the right, while the noise is rendered from the left within the same passage of time. However, if there is some asynchrony (or loss of synchrony) between the transport audio signal(s) and spatial audio metadata, the transients may be rendered from the left, and the noise may be rendered from the right over a slightly skewed passage of time. This may result in the rendered sound containing strong artefacts with the consequential decrease in perceived audio quality.
  • In general, network jitter and packet loss conditions can cause degradation in quality, for example, in conversational speech services in packet networks, such as the IP networks, and mobile networks such as fourth generation (4G LTE) and fifth generation (5G) networks. The nature of the packet switched communications can introduce variations in the transmission times of the packets (containing frames), known as jitter, which can be seen by the receiver as packets arriving at irregular intervals. However, an audio playback device requires a constant input with no interruptions in order to maintain good audio quality. Thus, if some packets/frames arrive at the receiver after they are required for playback, the decoder may have to consider those frames as lost and perform error concealment.
  • Typically, a jitter buffer can be utilised to manage network jitter by storing incoming frames for a predetermined amount of time (specified, e.g., upon reception of the first packet of a stream) in order to hide the irregular arrival times and provide constant input to the decoder and playback components.
  • Nowadays, most audio or speech decoding systems deploy an adaptive jitter buffer management scheme in order to dynamically control the balance between short enough delay and low enough numbers of delayed frames. In this approach, an entity controlling the jitter buffer constantly monitors the incoming packet stream and adjusts the buffering delay (or buffering time, these terms are used interchangeably) according to observed changes in the network delay behaviour. If the transmission delay seems to increase or the jitter becomes worse, the buffering delay may need to be increased to meet the network conditions. In the opposite situation, where the transmission delay seems to decrease, the buffering delay can be reduced, and hence, the overall end-to-end delay can be minimised.
  • Figure 4 shows how the IVAS decoder may be connected to a jitter buffer management system. The receiver modem 401 can receive packets through a network socket such as an IP (Internet protocol) network socket which may be part of an ongoing Real-time Transport Protocol (RTP) session. The received packets may be pushed to an RTP depacketizer module 403, which may be configured to extract the encoded audio stream frames (payload) from the RTP packet. The RTP payload may then be pushed to a jitter buffer manager (JBM) 405 where various housekeeping tasks may be performed, such as updating frame receive statistics. The jitter buffer manager 405 may also be arranged to store the received frames. The jitter buffer manager 405 may be configured to pass the received frames to an IVAS decoder & renderer 407 for decoding. Accordingly, the IVAS decoder & renderer 407 passes the decoded frames back to the jitter buffer manager 405 in the form of digital samples (PCM samples). Also depicted in Figure 4 is an Acoustic player 409 which may be viewed as the module performing the playing out (or playback) of the decoded audio streams. The function performed by the Acoustic player 409 may be regarded as a pull operation in which it pulls the necessary PCM samples from the JBM buffer to provide uninterrupted audio playback of the audio streams.
  • Figure 5 is a system 500 depicting the general workings and interactions of a jitter buffer manager 405 with an IVAS decoder & renderer 407. The jitter buffer manager 405 may comprise a jitter buffer 501, a network analyzer 502, an adaptation control logic 503 and an adaptation unit 504.
  • Jitter buffer 501 is configured to temporarily store one or more audio stream frames (such as an IVAS bitstream), which are received via a (wired or wireless) network, for instance, in the form of packets 506. These packets 506 may for instance be RTP packets, which are unpacked by buffer 501 to obtain the audio stream frames.
  • Buffer status information 508, such as, for instance, information on a number of frames contained in buffer 501, or information on a time span covered by a number of frames contained in the buffer, or a buffering time of a specific frame (such as an onset frame), is transferred between buffer 501 and adaptation control logic 503.
  • Network analyzer 502 monitors the incoming packets 506 from the RTP depacketizer 403, for instance, to collect reception statistics (e.g., jitter, packet loss). Corresponding network analyzer information 507 is passed from network analyzer 502 to adaptation control logic 503.
  • Adaptation control logic 503, inter alia, controls buffer 501. This control comprises determining buffering times for one or more frames received by buffer 501 and is performed based on network analyzer information 507 and/or buffer status information 508. The buffering delay of buffer 501 may, for instance, be controlled during comfort noise periods, during active signal periods or in-between. For instance, a buffering time of an onset signal frame may be determined by adaptation control logic 503, and IVAS decoder & renderer 407 may (for instance, via adaptation unit 504, signals 509, and the signal 510 to control the IVAS decoder) then be triggered to extract this onset signal frame from buffer 501 when this determined buffering time has elapsed. The IVAS decoder & renderer 407 can be arranged to pass the decoded audio samples to the adaptation unit 504 via the connection 511.
  • Adaptation unit 504, if necessary, shortens or extends the output audio signal according to requests given by adaptation control logic 503 to enable buffer delay adjustment in a transparent manner.
  • In short, therefore the JBM system 500 may be required to perform time stretching/shortening in order to achieve continuous audio playback without the introduction of audio artifacts as a result of network jitter.
  • In a spatial audio system such as IVAS the time stretching/shortening can be performed after rendering to the output audio signals in a manner similarly deployed by existing audio codecs such as EVS. However, taking this approach can result in some disadvantages. Firstly, time stretching/shortening may contain pitch shifting as a part of the stretching process (especially for larger modifications to the output audio signal(s)). This may cause problems with binaural signals as the monaural cues (that allow the human hearing system to determine elevation) may become skewed, leading to erroneous perception of elevation. Secondly, time stretching/shortening may alter inter-channel relationships (such as altering phase and/or level relationships), which again may have a detrimental effect on the perception of direction. Thirdly, the process of time stretching/shortening over many audio channels may be computationally complex. Fourthly, performing time-stretching after rendering requires the renderer to be run for multiple frames to produce one frame of output. Fifthly, performing time stretching after rendering may cause a varying motion-to-sound latency for head-tracked binaural rendering, resulting in a degradation of the perceived quality.
  • Alternatively, in order to avoid the above issues, it is possible to perform the process of time stretching/shortening before rendering which would avoid the above disadvantages. In other words, the time stretching/shortening process may be performed over the spatial audio metadata and transport audio signal(s). However, as discussed previously in order to preserve audio quality it is desirable to maintain synchrony between the spatial audio metadata and the transport audio signal(s). To that end this invention proceeds on the basis that the time stretching/shortening over the transport audio signal(s) has already been performed and focusses on the issues of time adapting the accompanying spatial audio metadata in order to maintain synchrony with the transport audio signal(s). Figure 6 is a more detailed depiction of the jitter buffer management system 500 for IVAS. RTP packets 601 are depicted as being received and passed to the RTP de-packer 602 which may be arranged to extract the IVAS frames 607. The RTP de-packer 602 may also be arranged to obtain from the RTP streams so called RTP metadata 603 which can be used to update IP network metrics such as frame receive statistics which in turn may be used to estimate network jitter. Accordingly, RTP metadata 603 may be passed to the Network jitter analysis and target delay estimator 604 where the RTP metadata 603, comprising packet timestamp and sequence information, may be analysed to provide a target playout delay parameter 605 for use in the adaption control logic processor 606.
  • The IVAS frames 607 as obtained by the de-packing process of the RTP de-packer 602 are depicted in Figure 6 as being passed to the de-jitter buffer 608. The de-jitter buffer 608 is arranged to store the IVAS frames 607 in a frame buffer ready for decoding by the IVAS audio & metadata decoder 610. The de-jitter buffer 608 can also be arranged to perform frame-based delay adjustment on the stream of IVAS frames when instructed to by the adaption control logic processor 606, and also reorder the IVAS frames into a correct decoding order should they not arrive in the proper sequential order (for decoding). The output from the de-jitter buffer 608, in other words IVAS frames for decoding 609, may be passed to the IVAS audio & metadata decoder 610.
  • The IVAS audio & metadata decoder 610 is arranged to decode the IVAS frames 609 into a decoded multichannel transport audio signal stream 611 (also referred to as the transport audio signal(s)) and a MASA spatial audio metadata stream 613 (also referred to as the spatial audio metadata). The MASA spatial audio metadata and decoded multichannel transport audio signal streams 613 and 611 may be passed on to subsequent processing blocks so that any time sequence adjustments to the respective signals may be performed. In the case of the MASA spatial audio metadata the respective stream 613 is passed to the metadata adaptor 612 and in the case of the decoded multichannel transport audio signal the respective stream 611 is passed to the multi-channel time scale modifier (MC TSM) 614.
  • The MC TSM 614 is configured to time scale modify frames of the transport audio signals 611 under the direction of the adaption control logic processor 606. Basically, the MC TSM 614 performs the time stretching or time shortening of the transport audio signal(s) in the time domain in response to a time adjustment instruction provided by adaption control logic processor 606. The time adjustment instruction may be received by the MC TSM 614 along the control line 615 from the adaption control logic processor 606. The output from the MC TSM 614, i.e., frames of the time-adjusted transport audio signals 621, may be passed to the renderer 616, a processing block termed the EXT output constructor 618 and the metadata adaptor 612. In the case of the metadata adaptor 612, the time-adjusted transport audio signal(s) 621 is used to assist in the adaption of the spatial audio metadata so that synchrony is better maintained.
  • The metadata adaptor 612 is essentially arranged to receive the spatial audio metadata parameters 613 corresponding to the frames of the transport audio signal(s) 611 that are delivered to the MC TSM 614, adapt these spatial audio metadata parameters 613 in accordance with the time adjustment instructions as provided by the adaption control logic processor 606, and maintain time synchrony with the time-adjusted transport audio signal(s) 621. In order to perform these functions, the metadata adaptor 612 is configured to receive the time-adjusted transport audio signal(s) 621 and the time adjustment instructions from the adaption control logic processor 606 along the signal line 617. The metadata adaptor 612 may then be arranged to produce time-adapted spatial audio metadata which has time synchrony with the time-adjusted transport audio signals 621. The time adapted spatial audio metadata is depicted as the signal 623 in Figure 6 and is shown as being passed to both the renderer 616 and EXT output constructor 618.
  • Also shown in Figure 6 is the renderer 616 which can receive the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and render said signals into a multichannel spatial audio output signal. The rendering may be performed in accordance with the rendering parameters 625. The renderer 616 is also shown as receiving a signal from the adaption control logic processor 606 along the signal line 619.
  • Figure 6 also shows a further output processing function in the form of the EXT output constructor 618. This processing function simply takes the time-adjusted transport audio signals 621 and the time adapted spatial audio metadata 623 and "packages" the signals into a single frame format suitable for outputting from a device, in other words a "spatial audio format" suitable for storage as a file type and the like. The purpose of the EXT output constructor 618 is to output the spatial audio format as it was decoded with minimal changes to conform to a spatial audio format specification. It can be then stored, re-encoded, mixed, or rendered with an external renderer.
  • As mentioned above the jitter buffer management system for IVAS also comprises the adaption control logic processor 606. The adaption control logic processor 606 is responsible for providing the time adjustment instructions to other processing blocks in the system. This may be realised by the adaption control logic processor 606 receiving a target delay estimation/parameter 605 from the network jitter analysis and target delay estimator 604 and the current playout delay from the playout delay estimator 620 and using this information to choose the method for playout delay adjustment to reach the target playout delay. This may be provided to the various processing blocks in the form of time adjustment instructions. The various processing blocks may then each individually utilise the received time adjustment instructions to perform appropriate actions so that the audio output from the renderer 616 is played out at the correct time. In order to realise the audio output playout delay time the following functions may be configured to receive time adjustment instructions from the adaption control logic processor 606: the de-jitter buffer 608, the metadata adaptor 612, the MC TSM 614, and the renderer 616. The playout delay estimator 620 provides an estimate of the current playout delay to the adaption control logic processor 606 based on the information received from the de-jitter buffer 608 and the MC TSM 614.
  • As mentioned previously, the metadata adaptor 612 is arranged to adjust the spatial audio metadata 613 in accordance with the time adjustment instructions (playout delay time) whilst maintaining synchrony with the time-adjusted transport audio signals 621. In this regard, Figure 7 shows the metadata adaptor 612 according to embodiments in further detail.
  • The metadata adaptor 612 takes as input the time adjustment instructions 617 from the adaption control logic processor 606. This input may then be passed to the slot to subframe mapper 702. The time adjustment instructions 617 may contain information pertaining to the number of subframes and hence audio time slots that are to be rendered. For the sake of brevity this information can be referred to as "slot adaptation info."
  • As may be recalled from Figures 2 and 3, the original IVAS frame can be divided into a number of subframes with each subframe being divided into a further number of audio slots. One such example comprises a 20 ms frame divided into 4 equal length subframes, with each subframe being evenly divided into 4 audio slots giving a total of 16 audio slots at 1.25 ms each.
  • In embodiments the "slot adaptation info" may contain a parameter giving the number of audio slots Nslots present in the time-adjusted transport audio signals 621, which in turn provides the number of subframes in the signal and consequently the frame size of the time-adjusted transport audio signals 621. This information may then be used to adapt the spatial audio parameters sets which are currently time aligned with the subframes of the original IVAS frame to being time aligned with the subframes of the time-adjusted transport audio signals 621.
  • It is to be noted that the term "original IVAS frame" refers to the size of the IVAS frame before any time shortening/lengthening has taken place. So, it refers to the frames of the transport audio signal(s).
  • It is also to be noted that in the forthcoming the frame of the time adapted transport audio signal(s) 621 is referred to as an "adapted IVAS frame" for the sake of brevity.
  • The parameter Nslots may be different from the default number of slots in an original IVAS frame, Nslots_default, with the default number of slots being the number of slots in an original IVAS frame before the time stretching/shortening process. In the above example Nslots_default is 16 audio slots.
  • At this point it may be helpful to note that the process of stretching and shortening the waveform takes place in the MC TSM processor 614 which may involve changing the number of slots in the frame of the transport audio signal(s) 611.
  • In embodiments the slot to subframe mapper 702 can be arranged to map the "original" default number of slots Nslots_default of the original IVAS frame to a different number of slots Nslots distributed across the same number of subframes as the original IVAS frame. This has the outcome of mapping the slots/subframes of the adapted IVAS frame to the standard IVAS frame. This results in a pattern of mapped slots where some of the subframes (of the original IVAS frame) have either more or fewer mapped slots depending on whether the adapted slot number Nslots is greater or less than the original number of slots Nslots_default. For instance, if Nslots < Nslots_default then the process is a waveform shortening or output play speeding up operation, and if Nslots > Nslots_default then the process is a waveform lengthening or output play slowing down operation.
  • In essence the slot to subframe mapper 702 may be arranged to map the Nslots time slots of the adapted IVAS frame to the subframes of the original IVAS frame to produce a map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame.
  • The mapping of each slot associated with the time adapted transport audio signal(s) 621 (adapted IVAS frame) to a subframe of the original IVAS frame is performed on the premise that the assigned subframe (in the original IVAS frame) best matches the temporal position of the slot in the time adapted transport audio signal(s) 621 (adapted IVAS frame).
  • As may be recalled from above, each subframe comprises a set of spatial audio parameters. During normal operation, when there is no compensation for network jitter, each group of four audio slots in the original IVAS frame is associated with the spatial audio parameter set of one of the subframes. Therefore, the consequence of slot to subframe mapping process may be viewed as associating different groups of slots with the spatial audio parameter sets of the original IVAS frame.
  • Figure 8 shows an example subframe to slot mapping process when the adapted slot number Nslots is 12 and the original number of slots Nslots_default is 16. In other words, this Figure 8 depicts an example of waveform shortening (decreasing the playing out time). The relationship between slots to subframes for the original IVAS frame, where every 4 slots is mapped to a subframe is shown as 801 in Figure 8, i.e., slots s1 to s4 are mapped to subframe 1, slots s5 to s8 are mapped to subframe 2, slots s9 to s12 are mapped to subframe 3 and slots s13 to s16 are mapped to subframe 4. The result of the mapping process where 12 slots are mapped to the 4 subframes of the original IVAS frame may be shown as 802 in Figure 8. In this example, the slot to subframe mapping process has resulted in the first three slots (s1, s2, s3) being mapped to the first subframe. The fourth and fifth slots (s4, s5) have been mapped to the second subframe. Slots s6, s7, s8 are mapped to subframe 3. Finally, slots s9, s10, s11 and s12 are now mapped to subframe 4. Also shown in Figure 8 is 803, this depicts the relationship between the subframes of the adapted IVAS frame (time-adapted transport audio signal(s) 621 frame) and Nslots, where the Nslots slots are evenly distributed across the three subframes.
  • In embodiments the above subframe to slot mapping process can be performed by initially dividing the number of adapted slots Nslots into two contiguous regions. The first region comprising the higher numbered slots is made up of a contiguous run of the Nslots_end highest ordered slots. For instance, in the example above the first region was determined to have 4 slots ( Nslots_end = 4) comprising the slots s12, s11, s10, s9. The second region is made up of the remaining slots s1 to s8, i.e., the run of slots starting from the beginning of the frame, and going up to slot number (Nslots-N slots_end ).
  • In embodiments the Nslots_end can be determined using the following expression
    N_{slots,end} = \frac{N_{slots,default}}{2} - \left(N_{slots,default} - N_{slots}\right)
    where Nslots_default is the number of slots in the original IVAS frame, in the above case it can take the value 16.
  • With respect to the first region, the subframe to slot mapping process takes the Nslots_end highest ordered slots of the adapted IVAS frame and matches each of them on an ordered one-to-one basis to the Nslots_end highest ordered slots of the original IVAS frame and consequently to the subframes associated with these slots.
  • This processing step may be illustrated by referring to the example of Figure 8, where the slots of the adapted IVAS frame s9, s10, s11 and s12 are mapped on a one-to-one basis to the subframe having the 4 highest slots of the original IVAS frame s13, s14, s15 and s16, i.e., subframe 4.
  • In embodiments the above subframe to slot mapping process for the first region may be formulated as
    M_{slot\text{-}sf}^{first}(m) = \left\lfloor \frac{s}{L_{sf}} \right\rfloor, \quad m \in [N_{slots} - N_{slots,end},\; N_{slots} - 1],\; s \in [N_{slots,remdefault},\; N_{slots,default} - 1]
    where each slot m of the first region is paired in order with the corresponding highest ordered slot s of the original IVAS frame, and Lsf is the default subframe length in slots (4 for the standard IVAS frame). Mslot-sf (m) first is the subframe to slot mapping function which gives the subframe to slot map for the first region. This function returns the mapped subframe number (with respect to the original IVAS frame) for each slot m of the first region of slots of the adapted IVAS frame. Nslots,remdefault is the number of original slots remaining after the Nslots,end slots have been removed:
    N_{slots,remdefault} = N_{slots,default} - N_{slots,end}
  • Figure 9 shows an example subframe to slot mapping process when the adapted slot number Nslots is 20 and again the original number of slots Nslots_default is 16. In other words, Figure 9 depicts an example of waveform extending (increasing the playing out time). The distribution of slots in the standard IVAS frame is shown as 901 in Figure 9 and the result of the slot to subframe mapping process, where 20 slots (and hence 5 subframes) of the time adapted transport audio signal(s) 621 frame (adapted IVAS frame) are mapped to the 4 subframes of the standard IVAS frame, is shown as 902 in Figure 9. It is to be noted that in this example Nslots,end = 12, resulting in the first region of the 12 highest ordered slots (s9 to s20) being mapped to the same subframes as the 12 highest ordered slots (s5 to s16) of the original IVAS frame. Also shown in Figure 9, as 903, is the relationship between the subframes of the time adapted transport audio signals 621 frame and the Nslots slots.
  • In embodiments, the second region of slots is therefore given by the lowest ordered contiguous run of slots from the first slot, s1 to the highest slot with the slot number Nslots - Nslots,end. The slot to subframe mapping process then maps this contiguous run of lowest ordered slots by distributing them across a number of the subframes, starting from the lowest numbered subframe. For instance, when Nslots > Nslots_default , i.e., a period of waveform lengthening, the second region of slots (of the adapted IVAS frame) may be distributed across the first subframe and subsequent subframes up to and including the subframe to which the lowest ordered slot from the first region is mapped. For instance, when Nslots < Nslots_default , i.e., a period of waveform shortening, the second region of slots (of the adapted IVAS frame) may be distributed across all subframes of the IVAS frame.
  • In embodiments the slot to subframe mapping of the second region may be formulated as
    M_{slot\text{-}sf}^{second}(m) = \left\lfloor \frac{\operatorname{round}(s_f(m))}{L_{sf}} \right\rfloor, \quad m \in [0,\; N_{slots} - N_{slots,end} - 1]
    where
    s_f(m) = s - \left(N_{slots} - N_{slots,end} - m\right)\frac{s}{L_{JBM}}, \quad s = N_{slots,remdefault}
    and where
    L_{JBM} = \frac{N_{slots,default}}{2}
    is the JBM segment length and Lsf is the default subframe length in slots (4 for the standard IVAS frame).
  • Mslot-sf (m) second is the subframe to slot mapping function which gives the subframe (of the original IVAS frame) to slot map for the second region. This function returns the mapped subframe number (of the original IVAS frame) for each slot m of the second region of slots of the adapted IVAS frame.
  • In these embodiments the output from the slot to subframe mapper 702 is the combined slot to subframe map for both the first and second regions and may be referred to as Mslot-sf (m) . This output is depicted as 701 in Figure 7.
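  • As an illustration, the following Python sketch implements the first/second-region mapping as reconstructed above (the formulas and the rounding convention are assumptions inferred from the worked examples; Lsf = 4 and LJBM = 8 for the standard 16-slot frame). It reproduces the Figure 8 case (Nslots = 12) and the Figure 9 case (Nslots = 20):

    N_SLOTS_DEFAULT = 16            # slots in the original IVAS frame
    L_SF = 4                        # default subframe length in slots
    L_JBM = N_SLOTS_DEFAULT // 2    # JBM segment length

    def slot_to_subframe_map(n_slots):
        """Return, for each slot m of the adapted frame, the 0-based original
        subframe it is mapped to (first/second-region embodiment)."""
        n_slots_end = N_SLOTS_DEFAULT // 2 - (N_SLOTS_DEFAULT - n_slots)
        n_slots_remdefault = N_SLOTS_DEFAULT - n_slots_end
        mapping = [0] * n_slots

        # First region: the n_slots_end highest adapted slots are matched
        # one-to-one to the n_slots_end highest original slots.
        for k in range(n_slots_end):
            m = n_slots - 1 - k             # adapted-frame slot
            s = N_SLOTS_DEFAULT - 1 - k     # corresponding original-frame slot
            mapping[m] = s // L_SF

        # Second region: the remaining low-ordered slots are spread over the
        # leading subframes (the rounding convention is an assumption).
        s = n_slots_remdefault
        for m in range(n_slots - n_slots_end):
            s_f = s - (n_slots - n_slots_end - m) * s / L_JBM
            mapping[m] = int(round(s_f)) // L_SF
        return mapping

    # Shortening (Figure 8): subframes 1,1,1,2,2,3,3,3,4,4,4,4 (1-based)
    print([sf + 1 for sf in slot_to_subframe_map(12)])
    # Lengthening (Figure 9): slots s9..s20 follow original subframes 2..4 as in the text
    print([sf + 1 for sf in slot_to_subframe_map(20)])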
  • In other embodiments the slot to subframe mapper 702 may be arranged to distribute the Nslots of the adapted IVAS frame across the subframes of the original IVAS frame in a different manner to the above embodiments. In these embodiments, there may be no mapping between the Nslots slots of the adapted IVAS frame and the Nslots_default slots of the original IVAS frame. Instead, the Nslots slots of the adapted IVAS frame may be mapped directly to the subframes of the original IVAS frame using the following routine.
  • For the case when the Nslots slots of the adapted IVAS frame is a multiple of the number of slots in each subframe of the original IVAS frame, in other words when mod(Nslots, Nslots,sf) == 0, where Nslots,sf is the number of slots in each subframe of the original IVAS frame (N.B. Nslots,sf is equivalent to Lsf in the above embodiment, which is 4 for the standard IVAS frame), a mapping length Lmap(isf) for a subframe isf of the original IVAS frame may be given as
    L_{map}(i_{sf}) = \frac{N_{slots}}{N_{slots,sf}}, \quad \text{if } \operatorname{mod}(N_{slots}, N_{slots,sf}) == 0, \text{ for all } i_{sf}
    where isf is the subframe index of the original IVAS frame. Lmap may be the same for all subframes of the original IVAS frame. In other words, when Nslots is a multiple of Nslots,sf, the Nslots slots of the adapted IVAS frame may be distributed evenly across the subframes of the original IVAS frame. N.B. for the standard IVAS frame size, isf will take the values 1 to 4, corresponding to subframes 1 to 4 of the original IVAS frame.
  • The slot to subframe map Mslot-sf(m), where m is the slot index of the adapted IVAS frame, may be determined based on the above mapping length Lmap(isf) and may be given as
    M_{slot\text{-}sf}(m) = i_{sf}, \quad m \in [L_{sum}(i_{sf}),\; L_{sum}(i_{sf}) + L_{map}(i_{sf}) - 1]
  • where Lsum(isf) is the sum of the mapping lengths Lmap for subframes before the subframe isf, and Lsum(isf) for subframe isf can be given as
    L_{sum}(i_{sf}) = \sum_{n=0}^{i_{sf}-1} L_{map}(n)
  • Figure 10 shows an example subframe to slot mapping process according to these embodiments where the adapted slot number Nslots is 12 and the original number of slots N slots_default is 16. The relationship between slots and subframes for the original IVAS frame, where every 4 slots is mapped to a subframe, is shown as 1001 in Figure 10. The result of the mapping process where 12 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame are shown as 1002 in Figure 10. It can be seen that the 12 slots of the adapted IVAS frame have been evenly distributed across the 4 subframes of the original IVAS frame, i.e. slots s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s6 (m=3 to m=5) are mapped to subframe 2 (isf = 2), slots s7 to s9 (m=6 to m=8) are mapped to subframe 3 (isf = 3) and slots s10 to s12 are mapped to subframe 4 (isf = 4).
  • For the case when the Nslots slots of the adapted IVAS frame are not a multiple of the number of slots in each subframe of the original IVAS frame, in other words when mod(Nslots, Nslots,sf) ≠ 0, the mapping length Lmap(isf) for a subframe isf of the original IVAS frame may be given as
    L_{map}(i_{sf}) = \left\lfloor \frac{N_{slots}}{N_{slots,sf}} \right\rfloor + increment(i_{sf})
    where a residual increment for a subframe isf is determined by
    increment(i_{sf}) = \lfloor decSum(i_{sf}) + \varepsilon \rfloor
    and where ε is a small value to counter rounding errors. The decimal sum decSum(isf) may be determined by summing the value of dec up to the subframe with index isf
    decSum(i_{sf}) = \sum_{n=0}^{i_{sf}} dec
    where the decimal dec may be given as
    dec = \operatorname{remainder}\!\left(\frac{N_{slots}}{N_{slots,sf}}\right)
  • The decimal sum can be further modified for subframe isf with
    decSum(i_{sf} + 1) = decSum(i_{sf}) - 1, \quad \text{if } increment(i_{sf}) > 0
  • Figure 11 depicts further examples of the subframe to slot mapping process according to these embodiments where the adapted slot numbers Nslots are 13 and 14 and the original number of slots Nslots_default is 16. The result of the mapping process where 13 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1101 in Figure 11. It can be seen that the 13 slots of the adapted IVAS frame have been distributed across the 4 subframes of the original IVAS frame according to the following pattern of slots: s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s6 (m=3 to m=5) are mapped to subframe 2 (isf = 2), slots s7 to s9 (m=6 to m=8) are mapped to subframe 3 (isf = 3) and slots s10 to s13 (m=9 to m=12) are mapped to subframe 4 (isf = 4). The result of the mapping process where 14 slots of the adapted IVAS frame are mapped to the 4 subframes of the original IVAS frame is shown as 1102 in Figure 11. It can be seen that the 14 slots of the adapted IVAS frame have been distributed across the 4 subframes of the original IVAS frame according to the following pattern of slots: s1 to s3 (or m=0 to m=2) are mapped to subframe 1 (isf = 1), slots s4 to s7 (m=3 to m=6) are mapped to subframe 2 (isf = 2), slots s8 to s10 (m=7 to m=9) are mapped to subframe 3 (isf = 3) and slots s11 to s14 (m=10 to m=13) are mapped to subframe 4 (isf = 4).
  • In these embodiments the output 701 in Figure 7 from the slot to subframe mapper 702 is the above slot to subframe map Mslot_sf (m).
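  • A corresponding Python sketch of this even-distribution mapping (again an illustrative reading of the formulas above, assuming Nslots,sf = 4 and 4 original subframes) reproduces the Figure 10 and Figure 11 examples:

    EPS = 1e-9        # small value to counter rounding errors
    N_SLOTS_SF = 4    # slots per subframe of the original IVAS frame
    N_SUBFRAMES = 4   # subframes in the original IVAS frame

    def slot_to_subframe_map_even(n_slots):
        """Distribute the adapted-frame slots as evenly as possible over the
        original subframes using the mapping lengths L_map(i_sf)."""
        base = n_slots // N_SLOTS_SF
        dec = n_slots / N_SLOTS_SF - base      # fractional remainder
        l_map = []
        dec_sum = 0.0
        for _ in range(N_SUBFRAMES):
            dec_sum += dec
            increment = int(dec_sum + EPS)     # floor(decSum + eps)
            if increment > 0:
                dec_sum -= 1.0                 # carry correction after an increment
            l_map.append(base + increment)

        mapping = []
        for i_sf, length in enumerate(l_map):
            mapping.extend([i_sf] * length)
        return mapping

    print([sf + 1 for sf in slot_to_subframe_map_even(12)])  # Figure 10: 3 slots per subframe
    print([sf + 1 for sf in slot_to_subframe_map_even(13)])  # Figure 11 (1101): lengths 3,3,3,4
    print([sf + 1 for sf in slot_to_subframe_map_even(14)])  # Figure 11 (1102): lengths 3,4,3,4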
  • Also shown in Figure 7 is the energy determiner 704 which is shown as receiving the time adjustment instructions 617 and the time-adjusted transport audio signal(s) 621 frame (adapted IVAS frame). The function of the energy determiner 704 is to determine the energy of the adapted IVAS frame on a slot-by-slot basis according to the number of slots Nslots. To that end, the energy determiner 704 takes in a frame length of Nslots * slot width (1.25 ms) of the time-adjusted transport audio signals 621, i.e., an adapted IVAS frame, effectively divides the frame into Nslots time slots and then determines the energy across all the audio signals of the adapted IVAS frame for each slot. The energy E for each time slot m may be expressed as
    E(m) = \sum_{t=t_1(m)}^{t_2(m)} \sum_{q=0}^{Q-1} s_q^2(t), \quad \text{for } m = 0 \text{ to } N_{slots} - 1
    where t1(m) and t2(m) are the first and last sample indices of time slot m and sq(t) is sample t of transport audio channel q.
  • Here Q is the number of time-adjusted transport audio signals/channels in the signal 621 and q is the channel index.
  • The output from the energy determiner 704 is the energy E for each time slot m for an adapted IVAS frame. This is shown as the output 703 in Figure 7.
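  • A minimal sketch of this slot-wise energy computation (illustrative Python with numpy, assuming the adapted frame is available as a Q-by-samples PCM array whose length is a multiple of Nslots):

    import numpy as np

    def slot_energies(pcm, n_slots):
        """E(m): sum of squared samples over all Q transport channels for each
        of the n_slots equal-length time slots of the adapted frame.

        pcm: array of shape (Q, frame_length) holding the time-adjusted
             transport audio signals.
        """
        _, frame_len = pcm.shape
        slot_len = frame_len // n_slots
        energies = np.empty(n_slots)
        for m in range(n_slots):
            segment = pcm[:, m * slot_len:(m + 1) * slot_len].astype(np.float64)
            energies[m] = np.sum(segment ** 2)
        return energies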
  • Also shown in Figure 7 is the subframe-to-subframe map determiner 706 which is depicted as receiving the energy E for each time slot m of the adapted IVAS frame 703 and the slot to subframe map Mslot-sf (m) 701.
  • The function of the subframe-to-subframe map determiner 706 is to determine, for each subframe of the adapted IVAS frame, a subframe from the original IVAS frame whose associated spatial audio parameters most closely align with the audio signal of the subframe of the adapted IVAS frame.
  • This may be performed in order to provide a map whereby a subframe of the adapted IVAS frame is mapped to a subframe of the original IVAS frame.
  • In essence the subframe-to-subframe mapping determiner 706 may be arranged to use the map for mapping a time slot of the adapted IVAS frame to a subframe of the original IVAS frame to produce a map for mapping a subframe of the adapted IVAS frame to a subframe of the original IVAS frame.
  • In embodiments this function may be performed by the subframe-to-subframe map determiner 706 being arranged to use the slot to subframe map 701 and the energy E for each time slot m 703 to determine an energy to subframe map for each subframe of the original IVAS frame. In essence, the subframe-to-subframe map determiner 706 determines for each subframe of the original IVAS frame the energy of the adapted IVAS frame slots which were mapped to that subframe.
  • The energy to subframe mapping function for each subframe of the original IVAS frame may be expressed as
    M_{E\text{-}sf}(n) = \sum_{\substack{m \in [m_{n_A}(0),\, m_{n_A}(N-1)] \\ n = M_{slot\text{-}sf}(m)}} E(m)
  • ME-sf (n) is the energy of slots mapped to a subframe n of the original IVAS frame, where the adapted IVAS frame slots mapped to the subframe n are given by the slot to subframe mapping Mslot-sf (m) and where mnA is the list of slots mapped to subframe n (of the original IVAS frame), mnA(0) represents the first slot mapped to subframe n and mnA(N − 1) the last slot mapped to subframe n, where N represents the number of slots in subframe nA. The understanding of the above equation may be enhanced by returning to the example of Figure 8. In this example it may be seen that the first subframe of the original IVAS frame has mapped slots s1, s2 and s3, m1(0) = s1, m1(1) = s2, m1(2) = s3, the second subframe has mapped slots s4 and s5, m2(0) = s4, m2(1) = s5, the third subframe has mapped slots s6 to s8, m3(0) = s6 to m3(2) = s8, and the fourth subframe has mapped slots s9 to s12, m4(0) = s9 to m4(3) = s12. The ME-sf (n) for each subframe may be given as the sum of the energies for each of the adapted IVAS frame slots mapped to the original IVAS frame subframe. For example, if the energy of each adapted IVAS frame slot is s1=1, s2=1, s3=1, s4=8, s5=5, s6=3, s7=1, s8=1, s9=2, s10=1, s11=1 and s12=1, then the energy to subframe map may be given as ME-sf (1) = 1 + 1 + 1 = 3, ME-sf (2) = 8 + 5 = 13, ME-sf (3) = 3 + 1 + 1 = 5 and ME-sf (4) = 2 + 1 + 1 + 1 = 5.
  • The next step performed by the subframe-to-subframe map determiner 706 is to determine for a subframe nA of the adapted IVAS frame the subframe nmax (of the original IVAS frame) which gives the maximum energy to subframe value of all the ME-sf (n) values which comprise the slots of the subframe nA of the adapted IVAS frame. This may be performed for all subframes of the adapted IVAS frame.
  • The pseudo code for this step may have the following form:
    • for each nA:
        find ME-sf (n), for all values of n which are comprised in the slot range [mnA(0), mnA(N − 1)] of the subframe nA;
        n_max(nA) = argmax_n ME-sf (n);
    • end for
      where argmax_n returns the index n that maximises the value of the function ME-sf (n).
  • Returning to the above example of Figure 8. The first subframe of the adapted IVAS frame (nA = 1) has slot range [m1A(0), m1A(N − 1)] = s1 to s4. This slot range comprises the original IVAS subframes with indexes n = 1, 2. The ME-sf (1) = 3 and the ME-sf (2) = 13. This first subframe of the adapted IVAS frame (nA = 1) therefore has nmax = 2 because ME-sf (2) has a greater summed energy than ME-sf (1). This process can be repeated for all subframes of the adapted IVAS frame, where the second subframe (nA = 2) has nmax = 2, and the third subframe (nA = 3), whose slots are all mapped to original subframe 4, has nmax = 4.
  • In other embodiments, the subframe nmax can also be determined by initialising
    n_{max} = 0
    and then updating, for n ∈ [1, N − 1],
    n_{max} = \begin{cases} n, & \text{if } M_{E\text{-}sf}(n) > M_{E\text{-}sf}(n_{max}) \cdot adj \\ n_{max}, & \text{otherwise} \end{cases}
    where N represents the number of slots in subframe nA and adj is an adjustment multiplier. If adj ∈ [0, 1), the subframe nmax may more often be chosen from the end of the slot to subframe map section Mslot-sf (m), where m ∈ [mnA(0), mnA(N − 1)]. If adj > 1, the subframe nmax may more often be chosen from the beginning of the slot to subframe map section Mslot-sf (m), where m ∈ [mnA(0), mnA(N − 1)].
  • The value of nmax for each subframe nA of the adapted IVAS frame may be collated in the form of a subframe-to-subframe mapping table/function Msf_sf (nA) = nmax(nA) for all nA.
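  • A Python sketch of this energy-based selection (under the same illustrative assumptions as the earlier sketches, with 4 slots per adapted subframe) is shown below; applied to the Figure 8 example with the slot energies quoted above it yields nmax = 2, 2 and 4 for the three adapted subframes:

    def subframe_to_subframe_map(slot_to_sf, energies, n_slots_sf=4, adj=1.0):
        """For each adapted subframe n_A, choose the original subframe whose
        mapped-slot energy M_E-sf(n) is largest among the original subframes
        touched by the slots of n_A."""
        # Energy-to-subframe map over the whole adapted frame.
        e_sf = {}
        for m, n in enumerate(slot_to_sf):
            e_sf[n] = e_sf.get(n, 0.0) + energies[m]

        msf_sf = []
        n_adapted_sf = len(slot_to_sf) // n_slots_sf
        for n_a in range(n_adapted_sf):
            slots = range(n_a * n_slots_sf, (n_a + 1) * n_slots_sf)
            touched = sorted({slot_to_sf[m] for m in slots})
            # Running maximum; adj biases the choice towards earlier (adj > 1)
            # or later (adj in [0, 1)) subframes, as in the alternative rule.
            n_max = touched[0]
            for n in touched[1:]:
                if e_sf[n] > e_sf[n_max] * adj:
                    n_max = n
            msf_sf.append(n_max)
        return msf_sf

    # Figure 8 example (0-based subframes in the map, 1-based in the printout):
    slot_map = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
    e = [1, 1, 1, 8, 5, 3, 1, 1, 2, 1, 1, 1]
    print([n + 1 for n in subframe_to_subframe_map(slot_map, e)])  # [2, 2, 4]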
  • In other embodiments the subframe-to-subframe mapping function may be performed according to the flow chart presented in Figure 12.
  • Initially the number of subframes of the adapted IVAS frame may be determined or communicated to the subframe-to-subframe map determiner 706.
  • In these other embodiments, the number of subframes in the adapted IVAS frame NA may be based on the premise that each subframe comprises the same number of slots as each subframe of the original IVAS frame. For example, in the case of the standard IVAS frame size, the adapted IVAS frame may have NA subframes each being 4 slots wide, giving NA 5 ms subframes. Therefore, using the above nomenclature the number of subframes in the adapted IVAS frame can be given by
    N_A = \left\lfloor \frac{N_{slots}}{N_{slots,sf}} \right\rfloor
    where Nslots,sf = 4 for a standard IVAS frame, i.e. 4 slots per subframe.
  • It is to be noted that in instances when Nslots is not a multiple of Nslots,sf then any slots above NA·Nslots,sf (i.e. remainder slots that fall outside of the highest subframe) may not be included in calculations involving the subframes of the adapted IVAS frame. This may be illustrated with reference to the example of 1101 in Figure 11 where the number of slots Nslots is 13 and the number of subframes in the adapted IVAS frame NA = 3. It can be seen in this example that slot 13 is not included within the subframes of the adapted IVAS frame.
  • The step of determining/acquiring the number of subframes in an adapted IVAS frame is shown in Figure 12 as the processing step 1201.
  • The process then moves into a subframe level processing loop for each subframe of the adapted IVAS frame, nA = 0:NA - 1.
  • This is shown as the processing step 1205 in Figure 12, and the subframe level processing loop comprises the steps 1207 to 1211.
  • The first step of the subframe level processing loop calculates the total energy of the slots of the subframe nA of the adapted IVAS frame. This may be determined as
    E_{tot} = \sum_{k=m_{n_A}(0)}^{m_{n_A}(N-1)} E(k)
    where mnA(0) is the first slot number of subframe nA of the adapted IVAS frame and mnA(N − 1) is the last slot number of subframe nA, and E(k) is the slot energy for slot k of the adapted IVAS signal as provided by the energy determiner 704.
  • This may be illustrated by reference to the example of 1101 in Figure 11. The total energy for the first subframe of the adapted IVAS signal will comprise the sum of the slot energies E(0) to E(3) for slots s1 to s4, (i.e. m 1(0)to m 1(3)). The total energy for the second subframe of the adapted IVAS signal will comprise the sum of the slot energies E (4) to E (7) for slots s5 to s8, (i.e. m 2(0)to m 2(3)). The total energy for the third subframe will comprise the totals for E (8) to E (11) for slots s9 to s12, (i.e. m 3(0)to m 3(3)). The final slot s13 may either be processed as a non-full subframe, or it may be buffered for the next decoded IVAS frame.
  • The step of determining the total energy for the slots of the subframe of the adapted IVAS frame is shown as processing step 1207 in Figure 12.
  • The next step of the subframe processing loop initialises an accumulative energy factor Ecum for the subframe of the adapted IVAS frame. This is shown as processing step 1209 in Figure 12.
  • The process then moves into a slot level processing loop for the subframe of the adapted IVAS frame, k= mnA (0): mnA (N - 1), where k is a slot index for the subframe nA , and is used to index the slots mnA (0): mnA (N - 1), of subframe nA . This is shown as the processing step 1211 and comprises the steps 1213 to 1217.
  • The first step of the slot level processing loop adds the energy of a current slot E (k) to the accumulative energy Ecum . This is shown as step 1213 in Figure 12.
  • The slot level processing loop then checks whether the accumulative energy Ecum is greater than Etot /2 for the subframe. This is shown as the processing step 1215.
  • If it was determined at step 1215 that the above criterion had been met, the slot level processing loop progresses to the processing step 1217.
  • At step 1217, the index k (which has led to the above criterion being met) is used to map a subframe of the original IVAS frame to the subframe of the adapted IVAS frame. This may be performed by taking the subframe of the original IVAS frame which houses the index k and assigning this subframe [of the original IVAS frame] to the subframe nA [of the adapted IVAS frame]. As mentioned above the mapping (or relationship) between slots of the adapted IVAS frame and subframes of the original IVAS frame is given by the mapping function Mslot-sf (m). The output from this mapping function for the input of the index value associated with k produces the subframe of the original IVAS frame which is assigned to the subframe nA (of the adapted IVAS frame) in the form of a subframe-to-subframe mapping table/function Msf_sf (nA) = Mslot_sf (k). When step 1217 has been executed, the process returns to 1205 to repeat the processing steps 1207 to 1217 for the next subframe nA = nA + 1.
  • Returning to step 1215, if the criterion is not met, i.e. Ecum is not greater than Etot /2 for the subframe nA, then the process selects the next slot of the subframe of the adapted IVAS frame and proceeds to steps 1213 and 1215.
  • The result of the processing steps of Figure 12 is the subframe-to-subframe map/table Msf-sf with an entry for all subframes nA of the adapted IVAS frame.
  • This subframe-to-subframe map Msf-sf may then be used to obtain, for each subframe of the adapted IVAS frame, the one-to-one mapping to the optimum subframe of the original IVAS frame. The subframe-to-subframe map Msf-sf may form the output 705 of the subframe-to-subframe map determiner 706.
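  • To make the above loop concrete, the following is a minimal Python sketch of the Figure 12 procedure. It assumes a list of per-slot energies for the adapted frame, the slot-to-subframe map Mslot-sf provided as a list, and N = 4 slots per subframe; all function and variable names are illustrative and are not taken from any codec source code.

```python
# Illustrative sketch of the Figure 12 procedure; names are hypothetical
# and do not come from the IVAS codec sources.

def determine_subframe_to_subframe_map(slot_energy, slot_to_subframe_map,
                                        slots_per_subframe=4):
    """For each full subframe of the adapted frame, pick the slot at which
    the accumulated energy first exceeds half of the subframe total, and map
    the adapted subframe to the original subframe housing that slot."""
    num_full_subframes = len(slot_energy) // slots_per_subframe
    subframe_to_subframe_map = []

    for n_a in range(num_full_subframes):
        first_slot = n_a * slots_per_subframe
        last_slot = first_slot + slots_per_subframe            # exclusive
        e_tot = sum(slot_energy[first_slot:last_slot])         # step 1207
        e_cum = 0.0                                            # step 1209
        chosen_slot = last_slot - 1                            # fallback (e.g. silent subframe)
        for k in range(first_slot, last_slot):                 # steps 1211-1215
            e_cum += slot_energy[k]                            # step 1213
            if e_cum > e_tot / 2.0:                            # step 1215
                chosen_slot = k                                # step 1217
                break
        # M_sf_sf(n_A) = M_slot_sf(k)
        subframe_to_subframe_map.append(slot_to_subframe_map[chosen_slot])

    return subframe_to_subframe_map
```

  In the 13-slot example of 1101 in Figure 11 this sketch would produce three entries (for slots s1 to s4, s5 to s8 and s9 to s12); the final slot s13 is left to be handled as a non-full subframe or buffered for the next decoded IVAS frame, as described above.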
  • Also shown in Figure 7 is the spatial audio metadata adaptor 708 which can be arranged to receive the spatial audio metadata 613 and the subframe-to-subframe map Msf-sf 705 and produce as output the time adapted spatial audio metadata 623.
  • In embodiments the spatial audio metadata adaptor 708 is arranged to assign a spatial audio parameter set of the original IVAS frame to each subframe nA of the adapted IVAS frame by using the subframe-to-subframe map Msf-sf 705. For each entry nA of the subframe-to-subframe map Msf-sf 705 there is a corresponding original IVAS subframe index n. The index n may then be used to assign the spatial audio parameter set of subframe n of the original IVAS frame to subframe nA of the adapted IVAS frame, or in other words to subframe nA of the time-adjusted transport audio signal(s) 621 frame.
  • For example, if we just consider one spatial audio parameter of the MASA spatial audio parameter set for a subframe with index n, the output azimuth angle θA associated with subframe nA of the time-adjusted transport audio signal(s) 621 frame may be given as θA (nA ) = θ(Msf-sf (nA )). Obviously, this mechanism can be repeated for the other spatial parameters in the MASA spatial audio parameter set to give the adapted MASA spatial audio parameter set for subframe nA of the time-adjusted transport audio signal(s) 621 frame.
  • The time adapted spatial audio metadata 623 output therefore may comprise a spatial audio parameter set for each subframe nA of the time-adjusted transport audio signal(s) 621 frame.
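  • As a simple illustration of this assignment, the following sketch (continuing the hypothetical names used above) copies one parameter set per adapted subframe from the original frame using the subframe-to-subframe map; an actual metadata set would carry the full MASA parameters per subframe and frequency band.

```python
# Illustrative only: per-subframe parameter sets copied via M_sf_sf.

def adapt_spatial_metadata(original_metadata, subframe_to_subframe_map):
    """original_metadata: one parameter set (e.g. a dict with 'azimuth',
    'elevation', 'direct_to_total_ratio', ...) per original subframe n.
    Returns one parameter set per adapted subframe n_A, i.e.
    theta_A(n_A) = theta(M_sf_sf(n_A)) repeated for every parameter."""
    return [dict(original_metadata[n]) for n in subframe_to_subframe_map]
```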
  • It is to be understood that other embodiments may deploy other frame/subframe and slot sizes.
  • In some embodiments, the audio signal and the metadata may not be synchronized after decoding, and the synchronization step is performed after the JBM process and the output of the audio and metadata. In this case, a delay may be needed to allow the correct slot energy to be used in the weighting process. This may be achieved by simply delaying the audio signal or the original metadata as necessary. A ring buffer may be used for such a purpose, as sketched below.
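  • A minimal sketch of such a delay, assuming the required delay is a whole number of frames and using a ring-buffer-like structure from the Python standard library; whether the audio or the metadata is delayed, and by how much, depends on the embodiment.

```python
from collections import deque

# Minimal fixed-length delay line (assumption: the delay is an integer
# number of frames; a real implementation would derive it from the actual
# audio/metadata offset).
class FrameDelay:
    def __init__(self, delay_frames):
        self.buffer = deque([None] * delay_frames)

    def push(self, frame):
        """Insert the newest frame and return the frame delayed by delay_frames."""
        self.buffer.append(frame)
        return self.buffer.popleft()
```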
  • In some embodiments, the process of selecting metadata and calculating energies may be performed in the time-frequency domain. In such embodiments the metadata selection may be done for each subframe and frequency band combination separately, using time slots and frequency bands.
  • In some embodiments, the process of forming the subframe-to-subframe map Msf-sf 705 may use signal energy only for one of the cases, waveform extension (increasing the play-out time) or waveform shortening (decreasing the play-out time).
  • It is to be understood that in some embodiments, the audio and metadata format may be a format other than the MASA format, or the audio and metadata format may be derived from some other format during encoding and decoding in the codec.
  • In some embodiments, the energy of some slots may be missing or unobtainable, e.g., due to asynchrony. In such embodiments, the energy of these slots can be approximated from the other slots in the current frame, and from slots in the history, that have an obtainable energy value. An example of such an approximation is to assign the average energy value of the other slots with an obtainable energy value as the energy value of any slot with missing energy, as in the sketch below.
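  • For example, a minimal sketch of such an approximation, with missing energies marked as None and replaced by the average of the available ones; slots from previous frames could be included in the average in the same way.

```python
# Illustrative: fill in missing slot energies with the average of the
# slots whose energy is available.

def fill_missing_slot_energies(slot_energy):
    available = [e for e in slot_energy if e is not None]
    fallback = sum(available) / len(available) if available else 0.0
    return [e if e is not None else fallback for e in slot_energy]
```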
  • With respect to Figure 13 is shown an example system within which some embodiments can be implemented. The inputs are the transport audio signals 102 and the spatial metadata 104. The transport audio signals 102 and the spatial metadata 104 are passed to an encoder 1301 which generates an encoded bitstream 1302. The encoded bitstream 1302 is received by the decoder 1303 which is configured to generate a spatial audio output 1304.
  • As discussed above the input to the system, the transport audio signals 102 and the spatial metadata 104 can be obtained in the form of a MASA stream. The MASA stream can, for example, originate from a mobile device (containing a microphone array), or as an alternative example, it may have been created by an audio server that has potentially processed a MASA stream in some way.
  • The encoder 1301 can furthermore, in some embodiments, be an IVAS encoder.
  • The decoder 1303, in some embodiments, can be configured to directly output the spatial audio output 1304 to be rendered by an external renderer, or edited/processed by an audio server. In some embodiments, the decoder 1303 comprises a suitable renderer, which is configured to render the output in a suitable form, such as binaural audio signals or multichannel loudspeaker signals (such as 5.1 or 7.1+4 channel format), which are also examples of spatial audio output 1304.
  • With respect to Figure 14, an example electronic device is shown which may be used as any of the apparatus parts of the system as described above. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. The device may, for example, be configured to implement the encoder and/or decoder or any functional block as described above.
  • In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • In some embodiments the device 1400 comprises at least one memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore, in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example, the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating.
  • In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • The transceiver can communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
  • The transceiver input/output port 1409 may be configured to receive the signals.
  • In some embodiments the device 1400 may be employed as at least part of the synthesis device. The input/output port 1409 may be coupled to headphones (which may be head-tracked or non-tracked headphones) or similar, and to loudspeakers.
  • In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
  • As used in this application, the term "circuitry" may refer to one or more or all of the following:
    (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
    (b) combinations of hardware circuits and software, such as (as applicable):
      (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
      (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions, and
    (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.
  • The term "non-transitory," as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
  • As used herein, "at least one of the following: <a list of two or more elements>" and "at least one of <a list of two or more elements>" and similar wording, where the list of two or more elements are joined by "and" or "or", mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.
  • The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims (15)

  1. An apparatus for spatial audio decoding, configured to:
    receive a first audio signal frame comprising a number of subframes, wherein each of the number of subframes is divided into a number of time slots;
    receive a parameter indicating a total number of time slots of a second audio signal frame;
    determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame, wherein the slot to subframe map maps a time slot of the second audio signal frame to a subframe of the first audio signal frame;
    determine an energy value for each time slot of the total number of time slots of the second audio signal frame;
    determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame, wherein the subframe to subframe map maps a subframe of the second audio signal frame to a subframe of the first audio signal frame; and
    use the subframe to subframe map to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.
  2. The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame is configured to:
    determine a mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by a value indicating the number of time slots for a subframe of the first audio signal frame;
    map a first set of time slots comprising the mapping length number of time slots of the second audio signal frame to a first subframe of the first audio signal frame;
    update an accumulative mapping length number by adding the mapping length number to the accumulative mapping length number; and
    map a second set of time slots of the second audio signal to a second subframe of the first audio signal frame, wherein the second set of time slots comprises the mapping length number of time slots of the second audio signal following the accumulative mapping length number of previous time slots of the second audio signal.
  3. The apparatus as claimed in Claim 2, wherein the result of the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is not an integer, wherein the apparatus configured to determine the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is configured to:
    determine a remainder from the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame;
    accumulate the remainder value according to a subframe index of the first audio signal frame;
    determine an increment value according to the subframe index of the first audio signal frame, wherein the increment value is given by a floor function being applied to the accumulated remainder value; and
    determine the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame and adding the increment value to the result of the division.
  4. The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame is configured to:
    determine a total energy value for a subframe of the second audio signal frame, wherein the subframe of the second audio signal frame is divided into a number of time slots and wherein the total energy value is determined by summing the energy value for each time slot of the number of time slots of the subframe of the second audio signal frame; and
    perform the following steps for a time slot of the subframe of the second audio signal frame:
    determine an accumulative energy value for the time slot;
    determine whether the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame;
    when the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame use an index associated with the time slot to obtain a corresponding subframe of the first audio signal frame from the slot to subframe map and assign an entry of the subframe to subframe map mapping the corresponding subframe of the first audio signal frame to the subframe of the second audio signal frame; and
    when the accumulative energy value for the time slot is not greater than half the total energy for the subframe of the second audio signal frame proceed to the next time slot of the subframe of the second audio signal frame.
  5. The apparatus as claimed in Claim 4, wherein the apparatus configured to determine an accumulative energy value for the time slot is configured to:
    add an energy value of the time slot of the subframe of the second audio signal frame to a running total energy comprising a sum of energy values for previous time slots of the subframe of the second audio signal frame.
  6. The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame is configured to:
    divide the parameter indicating the total number of time slots of the second audio signal frame into a first region of contiguous time slots and a second region of contiguous time slots;
    map time slots of the first region of contiguous time slots of the second audio signal frame on a one to one basis with higher indexed time slots of the first audio signal frame;
    assign the time slots of the first region of contiguous time slots of the second audio signal frame to subframes of the number of the subframes of the first audio signal frame which comprise the higher indexed time slots which have been mapped on a one to one basis with the time slots of the first region of contiguous time slots of the second audio signal frame; and
    map time slots of the second region of contiguous time slots of the second audio signal frame to at least other subframes of the first audio signal frame.
  7. The apparatus as claimed in Claim 1, wherein the apparatus configured to determine a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame is configured to:
    identify the time slots of the subframe of the second audio signal frame;
    determine, from the slot to subframe map, at least two subframes of the first audio signal frame which are mapped to the time slots of the subframe of the second audio signal frame;
    for each of the at least two subframes of the first audio signal frame which are mapped to the time slots of the subframe of the second audio signal frame:
    determine an energy by summing the energy values for the time slots of the second audio signal frame mapped to the subframe of the first audio signal frame;
    from the energy determined for each of the at least two subframes of the first audio signal frame determine a subframe of the at least two subframes of the first audio signal frame which has a maximum energy; and
    assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy.
  8. The apparatus as claimed in Claim 7, wherein the apparatus configured to assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy is configured to:
    assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy, when the maximum energy multiplied by an adjustment factor is less than the energy determined for the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy; and
    assign the subframe of the second audio signal frame to the subframe of the at least two subframes of the first audio signal frame which has the maximum energy, when the maximum energy multiplied by an adjustment factor is greater than or equal to the energy determined for the subframe of the at least two subframes of the first audio signal frame which does not have the maximum energy.
  9. The apparatus as claimed in Claims 1 to 8, wherein the total number of time slots of the second audio signal frame is either greater than or less than a total number of time slots of the first audio signal frame.
  10. The apparatus as claimed in Claims 1 to 9, wherein the first audio signal frame is either extended in time or shortened in time to give the second audio signal frame.
  11. A method for spatial audio decoding, comprising:
    receiving a first audio signal frame comprising a number of subframes, wherein each of the number of subframes is divided into a number of time slots;
    receiving a parameter indicating a total number of time slots of a second audio signal frame;
    determining a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame, wherein the slot to subframe map maps a time slot of the second audio signal frame to a subframe of the first audio signal frame;
    determining an energy value for each time slot of the total number of time slots of the second audio signal frame;
    determining a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame, wherein the subframe to subframe map maps a subframe of the second audio signal frame to a subframe of the first audio signal frame; and
    using the subframe to subframe map to assign at least one spatial audio parameter of a subframe of the first audio signal frame to a subframe of the second audio signal frame.
  12. The method for spatial audio decoding as claimed in Claim 11, wherein determining a slot to subframe map by mapping the total number of time slots of the second audio signal frame to the number of subframes of the first audio signal frame comprises:
    determining a mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by a value indicating the number of time slots for a subframe of the first audio signal frame;
    mapping a first set of time slots comprising the mapping length number of time slots of the second audio signal frame to a first subframe of the first audio signal frame;
    updating an accumulative mapping length number by adding the mapping length number to the accumulative mapping length number; and
    mapping a second set of time slots of the second audio signal to a second subframe of the first audio signal frame, wherein the second set of time slots comprises the mapping length number of time slots of the second audio signal following the accumulative mapping length number of previous time slots of the second audio signal.
  13. The method as claimed in Claim 12, wherein the result of the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame is not an integer, wherein determining the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame comprises:
    determining a remainder from the division of the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame;
    accumulating the remainder value according to a subframe index of the first audio signal frame;
    determining an increment value according to the subframe index of the first audio signal frame, wherein the increment value is given by a floor function being applied to the accumulated remainder value; and
    determining the mapping length number by dividing the parameter indicating the total number of time slots of the second audio signal frame by the value indicating the number of time slots for a subframe of the first audio signal frame and adding the increment value to the result of the division.
  14. The method as claimed in Claim 11, wherein determining a subframe to subframe map based on the slot to subframe map and the energy value for each time slot of the total number of time slots of the second audio signal frame comprises:
    determining a total energy value for a subframe of the second audio signal frame, wherein the subframe of the second audio signal frame is divided into a number of time slots and wherein the total energy value is determined by summing the energy value for each time slot of the number of time slots of the subframe of the second audio signal frame; and
    performing the following steps for a time slot of the subframe of the second audio signal frame:
    determining an accumulative energy value for the time slot;
    determining whether the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame;
    when the accumulative energy value for the time slot is greater than half the total energy value for the subframe of the second audio signal frame using an index associated with the time slot to obtain a corresponding subframe of the first audio signal frame from the slot to subframe map and assign an entry of the subframe to subframe map mapping the corresponding subframe of the first audio signal frame to the subframe of the second audio signal frame; and
    when the accumulative energy value for the time slot is not greater than half the total energy for the subframe of the second audio signal frame proceeding to the next time slot of the subframe of the second audio signal frame.
  15. The method as claimed in Claim 14, wherein determining an accumulative energy value for the time slot comprises:
    adding an energy value of the time slot of the subframe of the second audio signal frame to a running total energy comprising a sum of energy values for previous time slots of the subframe of the second audio signal frame.

Similar Documents

Publication Publication Date Title
US11096002B2 (en) Energy-ratio signalling and synthesis
WO2015131063A1 (en) Object-based audio loudness management
US20230199417A1 (en) Spatial Audio Representation and Rendering
EP3039675A1 (en) Hybrid waveform-coded and parametric-coded speech enhancement
US12165658B2 (en) Spatial audio parameter encoding and associated decoding
WO2021130405A1 (en) Combining of spatial audio parameters
US20240357304A1 (en) Sound Field Related Rendering
US20210250717A1 (en) Spatial audio Capture, Transmission and Reproduction
CN119360865A (en) Spatial Audio Parameters
EP4475122A1 (en) Adapting spatial audio parameters for jitter buffer management
WO2024115045A1 (en) Binaural audio rendering of spatial audio
CN116547749B (en) Quantization of audio parameters
WO2021255327A1 (en) Managing network jitter for multiple audio streams
WO2022223133A1 (en) Spatial audio parameter encoding and associated decoding
GB2627482A (en) Diffuse-preserving merging of MASA and ISM metadata
US20240236601A9 (en) Generating Parametric Spatial Audio Representations
US20240274137A1 (en) Parametric spatial audio rendering
WO2024199802A1 (en) Coding of frame-level out-of-sync metadata
WO2024175320A1 (en) Priority values for parametric spatial audio encoding
WO2024115051A1 (en) Parametric spatial audio encoding
WO2024165271A1 (en) Audio rendering of spatial audio
EP4479966A1 (en) Parametric spatial audio rendering
WO2024115052A1 (en) Parametric spatial audio encoding
WO2024199873A1 (en) Decoding of frame-level out-of-sync metadata
GB2628410A (en) Low coding rate parametric spatial audio encoding
