WO2009081002A1

WO2009081002A1 - Processing of a 3d audio stream as a function of a level of presence of spatial components

Info

Publication number: WO2009081002A1
Application number: PCT/FR2008/052285
Authority: WO
Inventors: Jérôme DANIEL
Original assignee: France Telecom
Priority date: 2007-12-21
Filing date: 2008-12-11
Publication date: 2009-07-02

Abstract

The present invention relates to a method of processing a 3D audio stream comprising a plurality of spatial components. The method comprises the steps of obtaining (E41) information representative of the level of presence of the spatial components of the audio stream as a function of frequency, selecting (E42) a processing operation per frequency or frequency band as a function of the information obtained, and applying (E44) the selected processing operations to the 3D audio stream. The invention also relates to a device (350) implementing the method described. It applies in particular to spatial decoding processing before sound reproduction of the 3D audio stream, or to spatial separation and/or noise reduction applications.

Description

Processing a 3D audio stream according to a level of presence of spatial components

The present invention relates to the processing of digital signals. These signals may be, for example, audio signals, video signals or more generally multimedia signals.

The invention relates more particularly to 3D audio streams comprising a plurality of spatial components, the spatial components being associated with directivity functions.

The invention applies to systems for coding/decoding 3D sound scenes, and more particularly to spatial decoding before rendering on loudspeakers or headphones. It applies similarly to beamforming for spatial separation and/or noise reduction applications.

An example of a 3D audio stream is an ambisonic stream, more precisely one in HOA ("Higher Order Ambisonics") format. This type of audio stream can be obtained, for example, by recording sound with a spherical array of microphones. For more information on this type of sound recording, refer to the following document: "3D Sound Field Recording with Higher Order Ambisonics - Objective Measurements and Validation of a 4th Order Spherical Microphone", S. Moreau, J. Daniel, S. Bertet, 120th AES Convention, Paris (2006).

The audio stream with its spatial components can also be obtained by applying spatialization processing to N channels corresponding to monophonic signals. This spatialization processing can be of the ambisonic type. Ambisonic encoding of order M gives a compact spatial representation of a 3D sound scene by projecting the sound field onto the associated spherical or cylindrical harmonic functions.

For more information on ambisonic transformations, refer to the following document: "Acoustic field representation, application to the transmission and reproduction of complex sound scenes in a multimedia context", doctoral thesis, Université Paris 6, Jérôme DANIEL, 2001.

In the context of higher order ambisonic (HOA) spatialization, the spatial components are ambisonic components $B_{mn}^{\sigma}$ related to a sound pressure field p by the Fourier-Bessel series. The contribution of a far-field sound source, that is to say a plane wave of incidence $(\theta_S, \delta_S)$ carrying a signal S, is given by the spatial encoding equation:

$$B_{mn}^{\sigma} = S \cdot Y_{mn}^{\sigma}(\theta_S, \delta_S) \qquad (1)$$

where the spherical harmonic functions $Y_{mn}^{\sigma}(\theta, \delta)$ form an orthonormal basis:

$$Y_{mn}^{\sigma}(\theta, \delta) = \sqrt{(2m+1)\,(2 - \delta_{0,n})\,\frac{(m-n)!}{(m+n)!}}\; P_{mn}(\sin\delta) \times \begin{cases} \cos n\theta & \text{if } \sigma = +1 \\ \sin n\theta & \text{if } \sigma = -1 \text{ (ignored if } n = 0) \end{cases}$$

The $P_{mn}(\sin\delta)$ are the associated Legendre functions.
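By way of illustration only, the encoding equation (1) can be sketched for order 1 as follows. The sketch assumes the orthonormal normalization described above, for which the first-order harmonics reduce to 1, √3·cos θ·cos δ, √3·sin θ·cos δ and √3·sin δ. The function name `encode_first_order` and the W, X, Y, Z ordering are choices made for this example and are not part of the invention.

```python
import math

def encode_first_order(s, azimuth, elevation):
    """Encode one sample s of a plane wave into first-order components.

    Returns [W, X, Y, Z] = s * [Y00, Y11(+1), Y11(-1), Y10], angles in radians.
    The sqrt(3) factor follows from the assumed orthonormal normalization.
    """
    c = math.sqrt(3.0)
    return [
        s,                                                # omnidirectional
        s * c * math.cos(azimuth) * math.cos(elevation),  # front-back
        s * c * math.sin(azimuth) * math.cos(elevation),  # left-right
        s * c * math.sin(elevation),                      # up-down
    ]

# A source 30 degrees to the side and 10 degrees above the horizontal plane.
components = encode_first_order(1.0, math.radians(30), math.radians(10))
```

With this normalization the summed squared gains of the four components equal 4 for every direction, which is a convenient sanity check on an encoder.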

FIG. 1a represents the spherical coordinate system used for these equations, in which a direction is represented by the azimuth angle θ and the elevation angle δ.

The spherical harmonic functions are represented in FIG. 1b. It shows the omnidirectional component $Y_{00}^{+1}$ (W), the bidirectional components $Y_{10}^{+1}$ (Z), $Y_{11}^{+1}$ (X) and $Y_{11}^{-1}$ (Y), and the higher-order components.

A three-dimensional ("3D") representation "of order M" comprises $K = (M+1)^2$ components whose index triplets $(m, n, \sigma)$ are such that $0 \le m \le M$, $0 \le n \le m$, $\sigma = \pm 1$. A two-dimensional ("2D") representation of order M includes a subset of these components, retaining only the indices m = n, i.e. $K = 2M + 1$ components.
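The component counts above can be checked with a short enumeration. This is a minimal sketch: the function name `hoa_indices` is hypothetical, and it relies on the sine term being ignored for n = 0, so that only σ = +1 is kept in that case.

```python
def hoa_indices(order, two_d=False):
    """List the (m, n, sigma) triplets of a representation of the given order.

    For n == 0 only sigma = +1 is kept (the sine term vanishes).
    With two_d=True only the components with n == m are retained.
    """
    triplets = []
    for m in range(order + 1):
        for n in range(m + 1):
            if two_d and n != m:
                continue
            sigmas = (+1,) if n == 0 else (+1, -1)
            triplets.extend((m, n, s) for s in sigmas)
    return triplets

count_3d = len(hoa_indices(4))              # (4 + 1) ** 2 components
count_2d = len(hoa_indices(4, two_d=True))  # 2 * 4 + 1 components
```

Each order m contributes 2m + 1 components in 3D and at most 2 components in 2D, which is why the effective resolution can be read off from how many components are significantly present.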

Thus, the spatial components are ordered along an additional dimension (other than frequency or time) that reflects the associated angular frequencies. The notion of spatial resolution, or encoding order, is then defined by the maximum angular frequency represented, and is thus related to the number of spatial components that are significantly present. We are interested here in the processing of HOA-type 3D audio content for spatialized reproduction on loudspeakers or headphones, or for beamforming for spatial separation. This processing is referred to here, in general terms, as spatial decoding.

This processing is generally linear and consists, for example, of matrixing operations, filtering, or a combination of both.

In the frequency domain, this processing can be formulated by the expression $S = D \cdot B$, where B and S are the vectors of the signals to be processed (B) and of the resulting signals (S), and where D is the processing matrix. This matrix is composed of amplitude gains in simpler implementations, or of transfer functions in more elaborate ones.
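As a minimal sketch of this matrixing at a single frequency, the product of an N × K matrix D with a vector B of K components can be written as below. The function name `apply_matrix` and the numerical values are purely illustrative assumptions; in a filter-based implementation the entries of D would be complex transfer-function values, which the same code handles.

```python
def apply_matrix(d, b):
    """Return S = D . B for one frequency (d: N rows of K gains, b: K values)."""
    return [sum(d_nk * b_k for d_nk, b_k in zip(row, b)) for row in d]

# Hypothetical example: K = 3 components decoded to N = 2 output signals.
D = [[0.5, 0.5, 0.0],
     [0.5, -0.5, 0.0]]
B = [1.0, 0.2, 0.7]
S = apply_matrix(D, B)
```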

Existing processing methods assume that the spatial resolution is homogeneous over the entire frequency band of the audio stream and constant over time.

Thus, for a sound field produced by a source, the spatially encoded signal S is found within each component with a gain that is identical for all frequencies and depends only on the direction of incidence $(\theta_S, \delta_S)$: $B_{mn}^{\sigma} = S \cdot Y_{mn}^{\sigma}(\theta_S, \delta_S)$. This is what we will call an "ideal spatial encoding".

However, this assumption of ideal spatial encoding does not hold in a number of practical cases.

For example, in the case of HOA content resulting from a 3D recording by a spherical array of microphones, the spatial resolution is in practice not homogeneous over the entire frequency band. Because of the dimensions of the microphone array, the spatial resolution is lower at low frequencies, that is to say the higher-order components have a weak or insignificant signal level (power spectral density).

FIG. 2 schematically represents, for example, the actual presence of the spatial components $B_{mn}^{\sigma}$ as a function of the frequency f and of their spatial order m (related to spatial resolution), in a particular example of sound pickup by a spherical array of microphones.

Thus, we can speak of a stepwise spatial resolution. For a 3D microphone of order 4, the effective resolution is for example of order 1 up to $f_2$ = 1000 Hz, then of order [...], then of order 4 up to the spatial aliasing frequency (e.g. $f_{aliasing}$ = 10 kHz). Spatial aliasing is an encoding artifact related to the ambiguity of the wave incidence direction information, occurring when the wavelength is no longer large enough relative to the differences in acoustic path between sensors.

When we speak of an effective resolution of order m at a given frequency, this means that only the spatial components characterized by an angular frequency less than or equal to m are significantly present at this frequency (in the particular case of a 2D HOA representation, these would be the first 2m + 1 signals).

Consequently, in this case, a processing designed for a so-called ideal spatial encoding of order m = 4 would be suboptimal in the relatively low-frequency range where the effective resolution is, for example, of order 1. The accuracy of the sound scenes rendered by spatial decoding would therefore be degraded in this frequency range.

There is therefore a need to take into account the effective spatial resolution to perform optimal spatial decoding of the audio streams.

The present invention improves this situation.

For this purpose, the invention proposes a method for processing an encoded 3D audio stream comprising a plurality of spatial components. During the decoding of the audio stream, the method comprises the following steps: obtaining information representative of the level of presence of the spatial components of the audio stream as a function of frequency; selecting, per frequency or frequency band, a spatial decoding processing operation compatible with the information obtained; applying the selected processing operations to the 3D audio stream. Thus, the processing applied to the audio stream takes into account, frequency by frequency, the presence characteristics of the spatial components and therefore the spatial resolution, so as to best adapt the processing of the stream over the entire frequency band.

In a particular embodiment, the method comprises a step of obtaining, from the selected processing operations, a global processing operation to be applied over the entire frequency band of the audio stream.

A single processing operation is therefore applied over the entire frequency band of the audio stream, which simplifies the implementation.

Obtaining the global processing operation may comprise a step of aggregating the selected processing operations and integrating a smoothing function between them.

Thus, audible artifacts at the boundaries between the processing operations of the different frequency bands are attenuated.

The invention is advantageously applied in the case where the global processing operation is a bank of filters adapted to perform a spatial decoding of the audio stream before sound reproduction.

In one embodiment, the information representative of the level of presence of the spatial components comes from characteristics of the devices used to generate the audio stream and is obtained by reading data associated with the audio stream.

This information is received directly at the same time as the audio stream. It comes from the characteristics of the devices used to generate the audio stream, for example the characteristics of the microphones.

In another embodiment, the information representative of the level of presence of the spatial components is obtained by analyzing the audio stream, the analysis comprising a step of estimating the presence level of the components by comparing the energy levels of the components as a function of frequency. The information can therefore be obtained at different times, in case the level of presence changes over time.

In addition, the analysis step may include a step of estimating a noise level and / or a quality index.

This additional information can be used, for example, to make a better-informed choice of the processing operation to be applied.

In a particular embodiment, the selected processing operations are listed in a processing database.

This database may include matrix coefficients and / or processing filters, and / or rules and parameters for constructing a processing function.

This processing data can be updated or modified at any time.

According to one particular embodiment, the selection takes into account other criteria, such as a noise level resulting from the application of said processing, and/or a quality level of said processing, and/or a level of spatial performance of the audio stream processed by said processing, and/or characteristics of the processing operations selected in neighboring frequency bands.

The selection is thus optimized to better adapt the processing operations to the stream and to improve the quality of the processing.

It is also possible for the selection of a processing operation per frequency or frequency band to comprise a step of compensating for the level of presence of the spatial components, to be applied to said processing.

This is advantageously implemented for components that have a low level of presence.

The invention also relates to a processing device for decoding a coded 3D audio stream comprising a plurality of spatial components. This device comprises: a module for obtaining information representative of the presence level of the spatial components of the audio stream as a function of frequency; a selection module adapted to select, per frequency or frequency band, a spatial decoding processing operation compatible with the information obtained; a processing module adapted to apply the selected processing operations to the 3D audio stream.

The invention also relates to a digital audio decoder comprising such a device.

Finally, the invention is directed to a computer program comprising code instructions for implementing the steps of the method according to the invention, when these instructions are executed by a processor.

It also relates to a storage medium readable by a computer system storing a set of instructions executable by said system to implement the steps of the method according to the invention.

Other features and advantages of the invention will appear more clearly on reading the following description, given solely by way of nonlimiting example, and with reference to the appended drawings, in which:

FIG. 1a, previously described, illustrates the direction of propagation of a plane wave in space;

FIG. 1b, previously described, illustrates the spherical harmonic components in the case of an ambisonic spatial representation of order 3;

FIG. 2 previously described illustrates a representation of spatial components in the case of sound recording by a spherical array of microphones;

FIG. 3 represents a digital audio coding / decoding system comprising a processing device according to the invention;

FIG. 4 illustrates in flowchart form the main steps of a processing method according to the invention;

FIG. 5 illustrates a representation of the presence of spatial components as a function of frequency;

FIG. 6 illustrates an application of the selected processing operations according to a first embodiment of the invention;

FIG. 7 illustrates a determination of the processing operation to be applied from the selected processing operations, according to a second embodiment of the invention;

FIG. 8 illustrates an embodiment of the step of selecting and obtaining a global processing operation according to the invention; and

FIG. 9 illustrates a processing device according to the invention.

With reference to FIG. 3, a coding / decoding system according to the invention is now described.

3D audio content is generated by a 3D content generation module 330, which can for example be a module for sound pickup by a microphone array, a 3D virtual scene composition module, or a post-production chain integrating, among others, tools of this type. This 3D content can also come from a 3D recording stored on a medium.

This 3D content or 3D audio stream comprises spatial components $B_{mn}^{\sigma}$ whose index triplets $(m, n, \sigma)$ are such that $0 \le m \le M$, $0 \le n \le m$, $\sigma = \pm 1$, as previously defined. Naturally, the invention applies to variant representations, in particular 2D as described above.

For the sake of simplicity, the following description identifies a spatial component and its associated variables by a single index k ($1 \le k \le K$) rather than by the index triplet. Thus, at the output of the generation module, the audio stream comprises spatial components $B_k(t)$, which are optionally transmitted to an audio coder 300. In the absence of an audio coder, the 3D audio stream is transmitted directly to the processing module 322 of the processing device 350.

In one embodiment of the invention, the 3D audio stream is accompanied by data D describing the sound recording, comprising information on the actual presence of the spatial components per frequency or frequency band. This data can take the form of a table of values as a function of frequency. These values can be updated over time. They typically derive from the characteristics of the 3D microphone system used to produce the content to be processed. This data is transmitted to the processing device 350, in particular to the module 355 for receiving or obtaining this information, and is then passed to the processing selection module 353.

The audio coder 300 may comprise a module 301 applying a time/frequency transformation, for example of the MDCT ("Modified Discrete Cosine Transform") type, to the 3D audio stream. At the output of this module, spatial components $B_k(f)$ in the frequency domain are obtained. The coder may also include a quantization module 302 that quantizes the audio stream into a bit stream T. This bit stream is then transmitted, recorded or transported.

On receiving this bit stream, an audio decoder 320 dequantizes it, if necessary, using an inverse quantization module 321. The stream $B'_k(f)$ obtained is processed by a processing module 322 of the processing device 350. In a variant embodiment, the stream $B'_k(f)$ first undergoes a frequency/time transformation by the transformation module 323 before being processed by the module 322.

The processing carried out by this processing module is a spatial decoding process for a reproduction by the reproduction module 340 on loudspeakers or on headphones.

The processing module is controlled by the processing device 350. This device comprises a module 353 for selecting a processing operation to be applied to the audio stream for a given frequency or frequency band. The selection of the processing operation adapted to a frequency band is made according to the information D received on the actual presence of the spatial components in the band concerned. Thus, a processing solution is selected for a frequency band if it corresponds to the highest resolution compatible with the effective level of presence of the spatial components. A threshold decision criterion is then applied. Several processing solutions can also be selected according to the match between the contribution level required for the output signals and the actual presence level of each component.

We will see later, with reference to FIG. 8, that the selection of the processing operations per frequency band can also take additional criteria into account.

The selectable processing operations are listed in a processing database 352. In concrete terms, for processing K spatial components and producing N signals, this database comprises, for example, matrices of dimensions N × K, possibly associated with filters or filter matrices of dimensions N × K.

The database may also include, non-exhaustively, parameters used to calculate the corresponding processing matrices, filter coefficients or transfer functions (numeric values tabulated by frequency), filter design parameters (transition frequencies, response level per frequency subband), or data specially adapted for application in the transformed domain (finite impulse response (FIR) or infinite impulse response (IIR) filters, sub-filters filtered and decimated by subband).

The database may include not only data but also application rules or processing algorithms.

These are rules or algorithms that calculate operational processing data.

For example, for HOA spatial decoding on loudspeakers, these can be rules or algorithms for calculating decoding matrices optimized according to "psychoacoustic" localization criteria, such as those introduced by M. Gerzon in the following documents: GERZON, M.A., "General Metatheory of Auditory Localization", AES 92nd Convention, preprint 3306, Vienna, Austria, March 23-29, 1992, or GERZON, M.A., "Psychoacoustic Decoders for Multispeaker Stereo and Surround Sound", AES 93rd Convention, preprint 3406, San Francisco, USA, October 1992. Rules and formulas for calculating parameters or decoding matrices that are optimal according to such criteria can be found in J. Daniel's thesis cited above.

Decoding matrices can also be chosen according to other higher-level criteria (e.g. centered listener, extended audience).

For binaural decoding, it may be a program for calculating and optimizing decoding filters, based on a database of HRTFs ("Head Related Transfer Functions") and using high-level settings. Such processing is described in particular in patent application WO2007101958.

The database 352 therefore consists of pre-calculated processing data and/or rules for calculating them, intended to fulfil the desired function (optimal spatial decoding, transformation, etc.) as a function of parameters or combinations of parameters, for example the geometric configuration of the loudspeakers, the spatial resolution of the processed HOA stream, or the frequency band considered.

In one embodiment of the processing database, the data can be prepared, for example, in a form specially adapted to the processing mode (e.g. frequency domain) and/or selectable according to user-set parameters (e.g. the HRTF database in the case of binaural decoding).

Within the database, each selectable processing operation can be described as a transfer matrix D(f) of N rows and K columns, whose element at row n and column k is the transfer function $d_{nk}(f)$.

A selectable processing operation thus described in the database may concern only a subset E of the set of represented signals $E_K = \{k = 1, \ldots, K\}$. When this processing operation is selected by the module 353, the latter pads the processing matrix with zeros to form a matrix D(f) of dimensions N × K, inserting null columns at the indices k of $E_K$ that are not in E.

Thus, based on the presence information of the spatial components for a given frequency band, the module 353 selects the adapted processing operation in the database 352.
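The zero-padding of a processing matrix defined on a subset of components can be sketched as follows. The function name `expand_to_full` and the 1-based `subset` list are assumptions made for this illustration; the patent does not prescribe any particular data layout.

```python
def expand_to_full(d_sub, subset, k_total):
    """Pad a matrix defined on a subset of components with null columns.

    d_sub:   N rows, one column per index listed in `subset`
    subset:  1-based component indices k covered by d_sub (the set E)
    k_total: total number of components K
    Returns an N x K matrix whose columns outside `subset` are zero.
    """
    full = [[0.0] * k_total for _ in d_sub]
    for row_full, row_sub in zip(full, d_sub):
        for col, k in enumerate(subset):
            row_full[k - 1] = row_sub[col]
    return full

# Hypothetical 2 x 2 matrix acting on components 1 and 3 out of K = 4.
padded = expand_to_full([[1.0, 2.0], [3.0, 4.0]], subset=[1, 3], k_total=4)
```

Padding in this way lets every selected processing operation be applied to the full K-component stream with the same matrix product, whatever its native order.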

The module thus obtains one adapted processing operation per frequency band, i.e. a plurality of processing operations covering the entire frequency band of the audio stream to be processed.

In a particular embodiment, the processing device comprises a module 354 for determining a global processing to be applied over the entire frequency band of the audio stream. This module makes it possible to compile the processes selected by the selection module 353 and put them in an operational form for processing over the entire frequency band.

Thus, the processing operations or processing data retained for the different frequency bands are subjected to an aggregation procedure in the module 354. This procedure may, for example, consist of assembling pieces of transfer functions so as to recompose each required transfer function and derive from it a FIR filter by inverse Fourier transform.

In the case where the overall processing sought is in the form of filters, it is possible to define associated transfer functions by smoothing or frequency interpolation from the data retained for each frequency band (or the different target frequencies), rather than by simple juxtaposition. The criteria of smoothing or frequency interpolation are defined so as to better condition the filter (size, regularity ...) and to reduce the audible artifacts.
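As one possible illustration of frequency interpolation, the gain of a transfer function can be cross-faded linearly around a transition frequency instead of switching abruptly. The linear law, the function name `smooth_gain` and the transition width are assumptions of this sketch; the text above does not specify a particular smoothing criterion.

```python
def smooth_gain(freq, f_transition, width, g_low, g_high):
    """Linearly cross-fade from g_low to g_high around f_transition.

    The fade spans [f_transition - width / 2, f_transition + width / 2];
    outside that interval the band's own gain is returned unchanged.
    """
    lo = f_transition - width / 2.0
    hi = f_transition + width / 2.0
    if freq <= lo:
        return g_low
    if freq >= hi:
        return g_high
    t = (freq - lo) / (hi - lo)
    return (1.0 - t) * g_low + t * g_high

# Hypothetical gain of one matrix coefficient: 0 below 1 kHz, 1 above.
curve = [smooth_gain(f, 1000.0, 200.0, 0.0, 1.0) for f in (800.0, 1000.0, 1200.0)]
```

A wider transition gives a smoother transfer function and hence a shorter, better-conditioned filter, at the cost of a less sharp separation between the band-specific processing operations.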

Moreover, where the effective resolution, i.e. the actual presence of the components, varies over time, the adapted processing must also vary over time, and a temporal smoothing method can be implemented to avoid undesirable audible artifacts caused by overly abrupt variations.
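A simple way to picture such temporal smoothing is a first-order recursive average that moves the applied gains a fraction of the way toward the newly selected gains at each frame. This is only a sketch under assumptions: the recursive form, the function name `smooth_over_time` and the coefficient `alpha` are illustrative and are not taken from the patent.

```python
def smooth_over_time(previous, target, alpha=0.9):
    """One smoothing step: move each gain from `previous` toward `target`.

    alpha close to 1 gives slow, smooth changes; alpha = 0 switches at once.
    """
    return [alpha * p + (1.0 - alpha) * t for p, t in zip(previous, target)]

# The third gain is switched on; the applied value ramps up frame by frame.
gains = [1.0, 1.0, 0.0]
target = [1.0, 1.0, 1.0]
for _ in range(50):
    gains = smooth_over_time(gains, target)
```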

The resulting overall processing TG is then transmitted to the processing module T 322.

This processing module therefore applies the processing received from the module 353 by frequency band or the overall processing received from the module 354 for the entire frequency band of the audio signal. For example, the processing carried out by the module 322 may correspond to the processes described later with reference to FIG. 6 or a global processing determined by the module 354 and described with reference to FIG. 7 later.

The processing is applied to signals of the time domain or of the frequency domain depending on whether the audio stream is received directly from the audio stream generation module or that a coded transformed audio stream is received or that the processing is implemented before or after the transformation module 323.

In the case where the data D comprising presence information of the spatial components per frequency band is not provided with the 3D audio stream, an analysis module 351 may be provided.

This module implements a step of analysis of the 3D audio stream to estimate the level of presence of the spatial components by frequency band.

This estimation step is carried out here with the assumption that the level of presence at a given frequency is substantially the same for components of the same order m.

This level of presence can be defined as a scale factor in the sense of an attenuation of the level with respect to a so-called ideal spatial encoding as described initially.

Thus, for HOA components, this level of presence can be defined as a frequency-dependent gain $\gamma_{mn}^{\sigma}(f)$ (also denoted $\gamma_k(f)$). This gain is such that equation (1) defining the ideal encoding is replaced by the following equation (3):

$$B_{mn}^{\sigma} = \gamma_{mn}^{\sigma}(f)\, Y_{mn}^{\sigma}(\theta_S, \delta_S)\, S \qquad (3)$$

or else $B_k = \gamma_k(f)\, Y_k(\theta_S, \delta_S)\, S$.

Thus, under the assumption that $\gamma_{mn}^{\sigma}(f) = \gamma_m(f)$, the step of estimating the level of presence of the spatial components can be carried out by cross-correlation between the components $B_{mn}^{\sigma}$. It is possible to detect whether, at a given moment, the acoustic field is diffuse (perfectly uncorrelated components) or, conversely, is probably produced by a single sound source (perfectly correlated components).

Assuming an orthonormal spherical harmonic basis, in the first case an ideal encoding should yield components of equal energy. In the second case, the ideal encoding should be such that the average energy of the components of a given order m is the same for all orders m.

The level of presence $\gamma_{mn}^{\sigma}(f) = \gamma_m(f)$ can therefore be estimated by comparing the energy levels of the components as a function of frequency, for example by the ratio of the energy spectra or, alternatively, of the power spectral densities (PSD), between the components of a given order m and the component of order 0, according to the following expressions:

$$\gamma_m(f) = \frac{\sum_{0 \le n \le m,\ \sigma = \pm 1} |B_{mn}^{\sigma}(f)|^2}{(2m+1)\,|B_{00}^{+1}(f)|^2} \qquad (4)$$

or, in the variant where the PSDs are used:

$$\gamma_m(f) = \frac{\sum_{0 \le n \le m,\ \sigma = \pm 1} \mathrm{PSD}(B_{mn}^{\sigma}, f)}{(2m+1)\,\mathrm{PSD}(B_{00}^{+1}, f)} \qquad (5)$$
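The comparison of energy levels between orders can be sketched as below for a single frequency. This is a hedged illustration: the function name `presence_by_order` is hypothetical, and dividing by the number of components of each order (so that an ideal encoding yields a ratio of 1) is an assumption of this sketch rather than a normalization confirmed by the text.

```python
def presence_by_order(psd, order_of):
    """Estimate a presence ratio per order m at one frequency.

    psd:      power (or PSD) value of each component at that frequency
    order_of: order m of each component, in the same sequence as `psd`
    Returns {m: mean power of order-m components / power of order 0}.
    """
    ref = psd[order_of.index(0)]          # order-0 (omnidirectional) power
    sums, counts = {}, {}
    for p, m in zip(psd, order_of):
        sums[m] = sums.get(m, 0.0) + p
        counts[m] = counts.get(m, 0) + 1
    return {m: sums[m] / (counts[m] * ref) for m in sums}

# Hypothetical 2D stream of order 2 in which order 2 is barely present.
ratios = presence_by_order([1.0, 0.5, 0.5, 0.01, 0.03], [0, 1, 1, 2, 2])
```

In practice the power values would be averaged over time and possibly over neighboring frequencies before the ratio is taken, as the description goes on to recommend.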

For greater reliability, it is preferable to observe the signal in the time domain in the medium and long term and to perform temporal and / or frequency smoothing.

In cases where the actual resolution is assumed to be invariant in time, the estimate may be made in advance over part or all of the content or over time and adaptively with a convergence objective of the estimate.

In cases where the indices are by nature variable in time as well as in frequency, the estimate is updated over time (for example frame by frame). In the case of content encoded in the "transformed" domain, the level of presence can be estimated by observing the scale factors (in the classical signal-coding sense) and the quantization rate (bit allocation) of each "time-frequency-space" block, supplemented by the estimation methods mentioned above.

In addition to estimating the presence level of the spatial components per frequency band, the analysis module 351 can also measure other characteristics of the signal.

Thus, a noise level can be estimated. This noise may be related, for example, to the background noise of the recording microphones and/or to quantization noise in the case of audio coding.

Other information, such as an index of quality or reliability of the spatial encoding, can be determined. This index is for example represented by a modeling error of the spatial information $\varepsilon_k(f)$, due for example to an encoding error that can occur in the presence of spatial aliasing or following an imperfect calibration of the microphone system.

This additional information (noise level index, quality index) can also be part of the data D associated with the audio stream, and be determined by the characteristics of the sound recording.

This additional information is such that the equation (1) defining the ideal encoding is replaced by the following equation (6):

$$B_k = \gamma_k(f)\, Y_k(\theta_S, \delta_S)\, S + \nu_k(f) + \varepsilon_k(f) \qquad (6)$$

where $\nu_k(f)$ denotes an acquisition noise.

This information can be used when selecting the processing adapted to the actual presence of the components per frequency band, in one embodiment described with reference to FIG. 8.

The processing device 350 as described with reference to FIG. 3 thus implements a processing method which will now be described with reference to FIG. 4 which illustrates, in the form of an algorithm, the main steps of the general processing method. Thus, step E41 is a step of receiving the 3D audio stream as well as obtaining data D of information on the level of presence of the spatial components of the 3D audio stream as a function of frequency. These data are obtained as mentioned above, either directly from the characteristics of the sound recording or after analysis of the audio stream.

This data may further include information on the noise level or the quality level of a spatial encoding.

In step E42, the processing operations to be applied per frequency band are selected according to the level of presence of the components obtained in step E41. This selection can also take into account other criteria such as the noise level or the quality level. The selectable processing operations come from a processing database DB.

This yields one adapted processing operation per frequency band.

These processing operations are then applied in step E44 to the audio stream $B_k$ in the different frequency bands to provide signals $S_n$, which are then reproduced on loudspeakers or headphones.

In an optional step E43, the processing operations of the different frequency bands are concatenated or reformulated to generate a global processing operation to be applied over the entire frequency band. This global processing operation is then applied in step E44 to the audio stream.
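The sequence of steps E41 to E44 can be summarized by the control-flow sketch below. It is deliberately abstract: the function name `run_method` and its callable parameters (`analyze`, `select`, `aggregate`, `apply`) are placeholders introduced for this illustration, standing in for the modules described in the text.

```python
def run_method(stream, metadata, analyze, select, aggregate, apply,
               use_global=False):
    """Chain the main steps of the method on one 3D audio stream.

    metadata: presence data received with the stream, or None if absent
    analyze, select, aggregate, apply: callables standing in for the modules
    """
    # E41: presence information, read from the data or estimated by analysis
    presence = metadata if metadata is not None else analyze(stream)
    # E42: one processing operation selected per frequency band
    per_band = select(presence)
    # E43 (optional): merge the band-wise operations into a global one
    processing = aggregate(per_band) if use_global else per_band
    # E44: apply the band-wise or global processing to the stream
    return apply(stream, processing)
```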

With reference to FIGS. 5 and 6, a first embodiment of the application of processing operations to the audio stream is now described.

FIG. 5 represents the presence information of the spatial components, received either directly with the 3D audio stream or from an analysis of the stream. This figure shows that for a frequency between 0 and f1 the effective spatial resolution is 1, between f1 and f2 it is 2, between f2 and f3 it is 3, and above f3 it is 4.

In this embodiment, the selection module 353, as a function of the presence information of the spatial components, considers representative frequencies $f_i$ and defines the effective spatial resolution $m_{effective}(f_i)$ as the maximum order such that

$$\gamma_{mn}^{\sigma}(f_i) > \gamma_{thres} \quad \forall\, m \le m_{effective}$$

$\gamma_{thres}$ being an acceptability threshold (e.g. 3 dB).

This module thus retains as transition frequencies the frequencies at which $m_{effective}(f)$ exhibits a discontinuity.

For each frequency $f_i$ (or between the transition frequencies), the selection module selects from the database DB the processing operation best suited to the effective resolution $m_{effective}(f_i)$; for example, decoding matrices $D_i$ are selected.
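The threshold rule and the detection of transition frequencies can be sketched as follows. The function names, the linear presence values and the threshold of 0.5 are hypothetical choices for this example, and order 0 is assumed to be always present.

```python
def effective_order(presence, threshold):
    """Highest order m such that every order up to m exceeds the threshold.

    presence[m] is the presence level of order m at one frequency.
    Order 0 is assumed present, so the result is at least 0.
    """
    order = 0
    for m, level in enumerate(presence):
        if level <= threshold:
            break
        order = m
    return order

def transition_frequencies(freqs, orders):
    """Frequencies at which the effective order differs from the previous one."""
    return [f for f, prev, cur in zip(freqs[1:], orders, orders[1:])
            if cur != prev]

# Hypothetical presence table: one row of per-order levels per frequency (Hz).
freqs = [100, 500, 1000, 2000, 4000]
table = [[1.0, 0.9, 0.1, 0.0, 0.0],
         [1.0, 0.9, 0.2, 0.0, 0.0],
         [1.0, 1.0, 0.8, 0.1, 0.0],
         [1.0, 1.0, 1.0, 0.9, 0.2],
         [1.0, 1.0, 1.0, 1.0, 0.9]]
orders = [effective_order(row, 0.5) for row in table]
transitions = transition_frequencies(freqs, orders)
```

The resulting list of orders plays the role of the stepwise curve of FIG. 5, and each order would then serve as the key for looking up a decoding matrix in the database.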

In a particular embodiment, particularly in the case of decoding for a reproduction device whose loudspeakers are equidistributed on a circle, the processing matrices factorize as the product of a base matrix $D_{base}$ common to all decoding solutions and a diagonal matrix whose coefficients $g_m^{(i)}$ are specific to each decoding variant. For example, a matrix identified by the index i is written:

$$D_i = D_{base} \cdot \mathrm{Diag}\big(g^{(i)}\big)$$

Typically, this amounts to weighting the processed spatial components $B_{mn}^{\sigma}$ by said coefficients $g_m^{(i)}$ (generally associated with the order m) before matrixing.
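The factorization into a common base matrix and per-order gains can be sketched for a 2D layout as below. Both the form of the base matrix (sampled circular harmonics 1, √2·cos mθ, √2·sin mθ divided by the number of loudspeakers) and the gain values are assumptions made for this illustration; they are not the values of the tables referred to in this description.

```python
import math

def base_matrix_2d(order, n_speakers):
    """Assumed N x (2*order + 1) base matrix for speakers evenly spaced on a circle."""
    rows = []
    for n in range(n_speakers):
        theta = 2.0 * math.pi * n / n_speakers
        row = [1.0]
        for m in range(1, order + 1):
            row += [math.sqrt(2.0) * math.cos(m * theta),
                    math.sqrt(2.0) * math.sin(m * theta)]
        rows.append([v / n_speakers for v in row])
    return rows

def weighted_matrix(d_base, gains_by_order):
    """Return D_base . Diag(g): each column is scaled by the gain of its order."""
    col_gain = [gains_by_order[0]]
    for g in gains_by_order[1:]:
        col_gain += [g, g]                # two components per order m >= 1
    return [[d * g for d, g in zip(row, col_gain)] for row in d_base]

# K = 9 components (2D, order 4) on N = 12 loudspeakers, used at order 1 only.
d_base = base_matrix_2d(4, 12)
d_order1 = weighted_matrix(d_base, [1.0, 1.0, 0.0, 0.0, 0.0])
```

Because only the diagonal gains change between variants, switching the decoding order per frequency band reduces to switching a short gain vector while the base matrix stays fixed.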

Thus, the optimal gains selected typically vary stepwise with frequency.

Table 1 below shows an example of values that these gains $g_m^{(i)}$ can take for decodings of respective orders $M_{proc}$ = 1, 2, 3, 4 on 12 loudspeakers:

Table 1

An example of a base matrix (for decoding K = 9 components of a 2D representation of order 4 on N = 12 loudspeakers equally distributed on a circle) is given in Table 2 below:

Table 2

In general, as represented in FIG. 6, a decoding matrix D1 of order 1 is chosen for the frequency band from 0 to f1, a decoding matrix D2 of order 2 for the band from f1 to f2, a decoding matrix D3 of order 3 for the band from f2 to f3, and a decoding matrix D4 of order 4 for the band above f3.

A bank of filters whose cutoff frequencies are the transition frequencies determined above is generated.

In practice, this filter bank does not need to be very selective, so it can be inexpensive.

These filters have low-pass, high-pass and band-pass responses respectively. They may be finite impulse response (FIR) or infinite impulse response (IIR) filters, with relatively few coefficients. It is important, however, that they have substantially identical (and preferably linear) phase responses.

The application of the processing by the module 322 is represented in FIG. 6. It is carried out by sub-band filtering Fi (F1, F2, F3 and F4 in the figure) of the signals $B_k$ of the K components, using the filter bank thus determined, to derive sub-band versions $B_k^{(i)}$ ($B_k^{(1)}$, $B_k^{(2)}$, $B_k^{(3)}$ and $B_k^{(4)}$ in the figure).

For each sub-band i, the filtered signals $B_k^{(i)}$ are matrixed by the corresponding matrix $D_i$, supplying band-limited signals $S_n^{(i)}$ ($S_n^{(1)}$, $S_n^{(2)}$, $S_n^{(3)}$ and $S_n^{(4)}$ in the figure).

The signals corresponding to the different sub-bands are then summed to obtain the signals $S_n = \sum_i S_n^{(i)}$.

In this embodiment, a processing Di is applied to each subband, the processing Di being associated with the effective resolution of the stream in this subband.
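The chain of sub-band filtering, matrixing and summation can be sketched as follows. This is a simplified illustration, not the embodiment itself: the filters are replaced by ideal (brick-wall) band selection applied to frequency-domain values, whereas the description uses FIR or IIR filters, and all names and values are hypothetical.

```python
def matvec(matrix, vector):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(d * b for d, b in zip(row, vector)) for row in matrix]


def band_index(freq, transitions):
    """Index of the band containing freq, given sorted transition frequencies."""
    return sum(1 for f_t in transitions if freq >= f_t)


def decode_by_subband(spectra, freqs, transitions, matrices):
    """spectra[j] holds the K component values at frequency freqs[j];
    matrices[i] is the decoding matrix chosen for band i."""
    n_out = len(matrices[0])
    outputs = []
    for b_vec, freq in zip(spectra, freqs):
        total = [0.0] * n_out
        for i, d_i in enumerate(matrices):
            # Ideal filter F_i: keep the value only if it lies in band i.
            in_band = 1.0 if band_index(freq, transitions) == i else 0.0
            filtered = [in_band * b for b in b_vec]       # sub-band version of B_k
            s_i = matvec(d_i, filtered)                   # band-limited S_n for band i
            total = [t + s for t, s in zip(total, s_i)]   # summation over sub-bands
        outputs.append(total)
    return outputs


# Illustrative two-band example with K = 2 components and N = 2 outputs.
d1 = [[1.0, 0.0], [1.0, 0.0]]
d2 = [[1.0, 1.0], [1.0, -1.0]]
out = decode_by_subband(
    spectra=[[2.0, 3.0], [2.0, 3.0]],
    freqs=[500.0, 2000.0],
    transitions=[1000.0],
    matrices=[d1, d2],
)
```

With realistic filters the bands overlap around each transition frequency, which is why the description asks for substantially identical phase responses: the sub-band outputs must add coherently in the overlap regions.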

In a second embodiment, shown in FIG. 7, the step of selecting the processing operations $D_i$ is the same as that carried out for the first embodiment.

In this embodiment, the module 354 for generating a global processing to be applied over the entire frequency band of the audio stream is implemented.

This module constructs a new, single transfer matrix $D_{op}$ as the sum of the matrices $D_i$ selected for each sub-band $[f_i, f_{i+1}]$, frequency-weighted by functions $W_i(f)$:

$$D_{op}(f) = \sum_i W_i(f)\, D_i$$

The functions $W_i(f)$ typically have low-pass, band-pass and high-pass shapes, with the $f_i$ as transition frequencies.

This generation of a global matrix is illustrated in FIG. 7 for an example of 4 frequency bands. The processing matrices D1 to D4 are weighted by respective functions $W_1(f)$ to $W_4(f)$ and combined to obtain a matrix $D_{op}$ of dimension K × N. The processing carried out in the module 322 is here advantageously performed in the frequency domain. For each time block of the processed multichannel stream and for each frequency band of the transformed representation, it consists of a matrix product between the matrix $B_k$ of the coefficients representing the stream in said frequency band and the coefficients of the operational transfer matrix $D_{op}$ for this frequency band. Naturally, an implementation is adopted that guarantees identical frequency sampling for the matrices B and $D_{op}$.
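The construction of a frequency-dependent global matrix from the selected matrices can be sketched as follows. The weighting functions used here (linear ramps of half-width `width` around each transition frequency) are one possible choice assumed for illustration; the description only requires low-pass, band-pass and high-pass shapes. Function names and values are hypothetical.

```python
def crossfade_weights(freq, transitions, width):
    """Weights W_i(freq) for len(transitions) + 1 bands.

    Each transition is a linear ramp from 0 to 1 over
    [f_t - width, f_t + width]; the weights are non-negative and
    sum to 1 at every frequency.
    """
    def step(f_t):
        return min(1.0, max(0.0, (freq - (f_t - width)) / (2.0 * width)))

    steps = [1.0] + [step(f_t) for f_t in transitions] + [0.0]
    return [steps[i] - steps[i + 1] for i in range(len(transitions) + 1)]


def global_matrix(freq, transitions, width, matrices):
    """Weighted sum of the per-band matrices at the given frequency."""
    weights = crossfade_weights(freq, transitions, width)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * d[r][c] for w, d in zip(weights, matrices)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Illustrative 1 x 1 matrices and a single transition at 1000 Hz.
low = crossfade_weights(500.0, [1000.0], 100.0)
mid = crossfade_weights(1000.0, [1000.0], 100.0)
d_mid = global_matrix(1000.0, [1000.0], 100.0, [[[2.0]], [[4.0]]])
```

In a frequency-domain implementation, the matrix obtained for each frequency bin is simply multiplied by the vector of component coefficients of that bin, provided both use the same frequency sampling.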

FIG. 8, now described, represents an exemplary embodiment of the processing selection step implemented by the selection module 353. This embodiment applies where the data D obtained, either directly or by analysis of the stream, include not only information on the presence of the spatial components per frequency band $\gamma_k(f)$, but also information on the noise level $\nu_k(f)$ and/or information on the encoding uncertainty $\varepsilon_k(f)$.

Thus, the selection of the processing operations is also carried out according to their compatibility with the encoding quality of the components processed, namely not only the level or presence factor but also the noise level, or even an encoding reliability index linked, for example, to the encoding uncertainty.

In this embodiment, the presence level $\gamma_k(f)$ of the components to be processed is compensated, within a certain limit, when it is deficient.

Thus, as represented in FIG. 8, step E80 is a step of preselecting the processing per frequency band as a function of the information $\gamma_k(f)$ on the actual presence of the spatial components. Elements $d_{nk}(f)$ are thus obtained and constitute the global processing matrix D(f) in step E81 for the entire frequency band.

In step E82, it is examined whether, for certain frequencies, the effective presence of the components is low, for example if $\gamma_k(f) < 1$. For these frequencies, the corresponding processing elements of the global matrix are then replaced by the compensated processing elements $d_{nk}(f)/\gamma_k(f)$.

A new global processing matrix D'(f) is thus obtained in step E83.
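Steps E82 and E83 can be sketched as follows for a single frequency. The cap `max_gain` stands for the "certain limit" within which the compensation is made; its value, like the function name, is a hypothetical choice made for illustration.

```python
def compensate(d_matrix, gamma, max_gain=4.0):
    """Return the compensated matrix D' for one frequency.

    Column k of d_matrix is divided by gamma[k] when gamma[k] < 1,
    with the resulting boost capped at max_gain (assumed limit).
    Columns whose presence factor is 1 or more are left unchanged.
    """
    boosts = []
    for g in gamma:
        if g < 1.0:
            boosts.append(min(1.0 / g, max_gain) if g > 0.0 else max_gain)
        else:
            boosts.append(1.0)
    return [[d * c for d, c in zip(row, boosts)] for row in d_matrix]


# Illustrative values: the third component is too weak to be fully compensated.
d_prime = compensate([[1.0, 1.0, 1.0]], gamma=[1.0, 0.5, 0.1], max_gain=4.0)
```

Capping the boost keeps a nearly absent component from being amplified without bound, which would mostly amplify its noise and encoding error.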

Recall that, for an ideally encoded representation, the expected processing produces signals according to the following equation:

$$S_n(f) = \sum_k d_{nk}(f)\, B_k(f) \qquad (8)$$

where $B_k(f)$ represents the components of a stream after an ideal encoding and $S_n(f)$ those obtained after the corresponding spatial decoding.

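To compensate for potentially deficient presence factors $\gamma_k(f)$, it is proposed to adapt the processing using the matrix:

$$D'(f) = D(f) \cdot \mathrm{diag}(\gamma)^{-1}, \quad \text{where } \gamma = [\gamma_1 \ldots \gamma_K] \qquad (9)$$

from which the resulting signals are:

$$\hat{S}_n(f) = \sum_k \frac{d_{nk}(f)}{\gamma_k(f)}\, \tilde{B}_k(f) \qquad (10)$$

By formalizing the non-ideally encoded components by the following expression:

$$\tilde{B}_k(f) = \gamma_k(f)\, B_k(f) + \nu_k(f) + \varepsilon_k(f) \qquad (11)$$

corresponding to the expression (6) described above, and applying the compensation mentioned above, we obtain the following expression:

$$\hat{S}_n(f) = \sum_k \left( d_{nk}(f)\, B_k(f) + \frac{d_{nk}(f)}{\gamma_k(f)}\, \nu_k(f) + \frac{d_{nk}(f)}{\gamma_k(f)}\, \varepsilon_k(f) \right) \qquad (12)$$

Ignoring, at first, the term $\varepsilon_k(f)$, we get:

$$\hat{S}_n(f) = \sum_k \left( d_{nk}(f)\, B_k(f) + \frac{d_{nk}(f)}{\gamma_k(f)}\, \nu_k(f) \right) \qquad (13)$$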

This expression thus shows that, supposing the noises $\nu_k$ to be pairwise decorrelated, the noise level at the output of the processing calculated above is:
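As a sketch of this result under the decorrelation assumption just stated: the cross terms between different noises $\nu_k$ vanish on average, so the noise powers of the individual components add at each output. The symbol $\sigma_n^2(f)$, denoting the noise power on output n, is introduced here for illustration only and is not taken from the original text.

```latex
\sigma_n^2(f) \;=\; \sum_{k} \left| \frac{d_{nk}(f)}{\gamma_k(f)} \right|^2 \left| \nu_k(f) \right|^2
```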

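In step E84, a global index representative of the noise associated with the candidate D(f), as a function of frequency, is calculated. With $\hat{S}_n(f)$ denoting the output obtained with the compensated processing and $S_n(f)$ the ideal output of expression (8), it is defined either as the maximum of the output noises:

$$\nu(D, f) = \max_n \left| \hat{S}_n(f) - S_n(f) \right|$$

or as their root mean square:

$$\nu(D, f) = \sqrt{\frac{1}{N} \sum_n \left| \hat{S}_n(f) - S_n(f) \right|^2}$$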
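The two variants of the noise index can be sketched as follows for one frequency, assuming pairwise decorrelated component noises so that the noise powers add at each output. The function names and the numerical values are hypothetical.

```python
import math


def output_noise_levels(d_matrix, gamma, noise):
    """Noise level on each output n after compensated processing.

    d_matrix[n][k] is the processing element, gamma[k] the presence
    factor and noise[k] the noise level of component k. Decorrelated
    noises are assumed, so squared contributions are summed.
    """
    return [
        math.sqrt(sum((d / g * v) ** 2 for d, g, v in zip(row, gamma, noise)))
        for row in d_matrix
    ]


def noise_index(levels, mode="rms"):
    """Global index: maximum of the output noises, or their root mean square."""
    if mode == "max":
        return max(levels)
    return math.sqrt(sum(lv * lv for lv in levels) / len(levels))


levels = output_noise_levels(
    d_matrix=[[3.0, 0.0], [0.0, 2.0]],
    gamma=[1.0, 0.5],
    noise=[1.0, 1.0],
)
worst = noise_index(levels, mode="max")
average = noise_index(levels, mode="rms")
```

The maximum is the more conservative choice, since a single noisy loudspeaker feed is enough to penalize a candidate; the root mean square reflects the overall noise power across the loudspeakers.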

In the absence of knowledge or an estimate of the background noise $\nu_k(f)$, it can be assumed that the signals are "of identical quality", that is to say, affected by an acquisition noise of the same level $|\nu(f)|$. In this case, the increase in the noise level (as a quadratic mean) can be calculated by the sum:

For simplicity, it can be considered that the noise present is at a level deemed "acceptable" by the content producer, but that its increase at the output of the processing must not exceed a certain value. The noise level $\nu(D, f)$ for the processing D must therefore not be greater than the noise level $\nu_k(f)$ received for this frequency band.

Assuming that the processing D globally preserves the signal level, it is the degradation of the signal to noise ratio that we seek to limit.

It will be noted that the selection according to the invention advantageously exploits the fact that it is possible, with certain decoding solutions and for certain frequency ranges, not to degrade the signal-to-noise ratio while compensating for presence factors $\gamma_k(f) < 1$. It is observed that some decoding matrices contain elements $d_{nk}$ whose values decrease for increasing values of k, for which the scale factors $\gamma_k(f)$ themselves happen to decrease (typically at low frequencies).

In a variant of the method described, it is proposed to exploit, in addition, the degree of uncertainty on the encoding, which corresponds to the term $\varepsilon_k(f)$, by returning to the expression (12):

In step E84, a check of the influence of the error term is performed. Indeed, the compensation of the scale factor $\gamma_k(f)$ must not raise the error term $\varepsilon_k(f)$ to a level that is non-negligible compared with $B_k(f)$, in order to avoid degrading the spatial performance.
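One possible form of this check is sketched below. The description does not specify what "non-negligible" means numerically; the ratio `max_ratio` and the function name are therefore assumptions made for illustration.

```python
def compensation_allowed(gamma_k, error_level, signal_level, max_ratio=0.1):
    """Return True if the error term, once divided by gamma_k, remains
    below max_ratio times the signal level (assumed criterion)."""
    if gamma_k <= 0.0:
        return False  # an absent component cannot be compensated
    return (error_level / gamma_k) <= max_ratio * signal_level


# Illustrative values: a moderate deficit is acceptable, a severe one is not.
ok_moderate = compensation_allowed(gamma_k=0.5, error_level=0.01, signal_level=1.0)
ok_severe = compensation_allowed(gamma_k=0.05, error_level=0.01, signal_level=1.0)
```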

Thus, a compromise is sought between the compensation of the presence factors and the induced noise level or the error term produced. This compromise determines, in step E86, the processing to be applied per frequency band. A weighting function $W_i(f)$ taking these criteria into account is then calculated.

It is also possible to associate with a candidate processing solution one or more spatial performance indices calculated in step E85 to obtain another selection criterion.

In the context of surround sound spatialization on loudspeakers, objective spatialization performance is usually characterized by the velocity and energy vectors introduced by Gerzon. Consider, for example, a configuration of N loudspeakers equidistant from a reference point, which is the preferred listening point, placed in directions given by unit vectors $\vec{u}_n$.

For the characterization of spatial performance, we consider a set of virtual source directions represented by unit vectors $\vec{v}_q$ or by azimuth and elevation angles $(\theta_q, \varphi_q)$, representative of an acoustic field: for example, a substantially regular sampling of the circle or of the unit sphere, depending on whether rendering is intended for a horizontal or a three-dimensional loudspeaker device. For each direction considered, the gains $G_n(\vec{v}_q)$ which relate the loudspeaker signals $S_n$ to the encoded signal S are calculated, taking into account the encoding operation, assumed ideal, $B_k = Y_k(\theta_q, \varphi_q)$, and the decoding operation using the candidate D such that S = D·B, where S and B represent the vectors of the signals $S_n$ and $B_k$, respectively. The vector of gains $G_n$ is written $G = D \cdot Y(\vec{v}_q)$, where $Y(\vec{v}_q)$ is the vector of functions $Y_k(\vec{v}_q) = Y_k(\theta_q, \varphi_q)$. Finally, the energy vector is defined as follows:

$$\vec{E} = r_E\, \hat{u}_E = \frac{\sum_n G_n^2\, \vec{u}_n}{\sum_n G_n^2}$$

$r_E$ being its modulus and $\hat{u}_E$ the unit vector which describes its direction.

Naturally, to reduce the calculation cost, the invention advantageously takes into account the fact that indices such as those related to the energy vector can be pre-calculated, or calculated from simple formulas, without having to derive them from a large sampling of virtual source directions.

The conventional decoding solutions, which are for example listed in the database BD, in principle satisfy fairly well the directional conformity criterion $\hat{u}_E = \vec{v}_q$ for all virtual source directions.

The spatial performance is then described by the modulus $r_E$, which in a sense predicts the blur of the sound image produced, through the angle $\alpha_E = \arccos r_E$. This index is described, for example, in the AES article by Moreau, Daniel and Bertet cited above.
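A minimal sketch of these indices for a horizontal (2D) loudspeaker layout is given below. It uses the usual Gerzon definition of the energy vector, namely the sum of the loudspeaker direction vectors weighted by the squared gains and normalized by the total energy. The function name, the restriction to 2D and the example values are assumptions made for illustration.

```python
import math


def energy_vector(gains, speaker_azimuths):
    """Return (r_E, direction, blur angle), angles in radians.

    gains[n] is the gain of loudspeaker n for one virtual source
    direction; speaker_azimuths[n] is its azimuth in radians.
    """
    energy = sum(g * g for g in gains)
    if energy == 0.0:
        raise ValueError("all gains are zero")
    ex = sum(g * g * math.cos(a) for g, a in zip(gains, speaker_azimuths)) / energy
    ey = sum(g * g * math.sin(a) for g, a in zip(gains, speaker_azimuths)) / energy
    r_e = math.hypot(ex, ey)
    direction = math.atan2(ey, ex)
    blur = math.acos(min(1.0, r_e))  # clamp guards against rounding above 1
    return r_e, direction, blur


# Four loudspeakers at 0, 90, 180 and 270 degrees; two of them active.
azimuths = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
r_e, direction, blur = energy_vector([1.0, 1.0, 0.0, 0.0], azimuths)
```

A single active loudspeaker gives a modulus of 1 and a blur angle of 0; spreading the energy over several loudspeakers lowers the modulus and widens the predicted image.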

If this index varies according to the encoding direction v _q , we will retain for example an average, possibly weighted according to the encoding direction to favor certain regions of space.

Table 3 and Table 4 below show an example of values of both the modulus $r_E$ (ideal reference value = 1) and its arccosine $\alpha_E$ (reference value = 0°) for each effective resolution Meff = 1 to 4.

Depending on the desired performance values, the processing operations of resolution Mproc = 1 to 4 are chosen.

Table 3

Table 4

Thus, in step E85, an index σ (D, f) of spatial performance associated with a particular treatment and for a frequency is obtained.

This spatial performance index can be advantageously supplemented by acoustic reconstruction quality information enabled by the decoding solution that can be calculated from the acoustic reconstruction error for a given frequency and listening area.

In the context of the invention, it is preferable that all of these performance indices be pre-calculated and associated with each candidate solution, but provision is made for them to be (re)calculated at the time of selection, according to specific criteria or options defined by the user (e.g. size of the listening area, etc.).

More generally, the invention applies to any other form of characterization of spatial performance. In particular, it covers the angular distortion (angle difference between $\hat{u}_E$ and $\vec{v}_q$) that can result from the use of a decoding solution poorly suited to the effective resolution. Indeed, in the case of non-regular devices, the use of an optimal decoding solution of order M for a stream whose effective resolution is of order Meff < M can lead to angular distortions (of the energy vector, for example). It also applies to the characterization of audio rendering properties other than strictly spatial ones (such as coloration effects), whose quality nevertheless depends on properly taking the effective spatial resolution into account.

Thus, each candidate solution is associated with one or more spatial performance indices and this information is used for their selection in step E86.

Indeed, at this step E86, a preference rating P(D, f) is calculated such that it is an increasing function of the spatial performance $\sigma(D, f)$ calculated in step E85 and a decreasing function of the increase in the noise level $\nu(D, f)$ calculated in step E84.

According to a first option, one solution is chosen per frequency band, namely the one obtaining the best preference rating P(D, f). A weighting function $W_i(f)$ is then defined. This function is, for example, equal to 1 where the solution no. i is the best at the frequency f and 0 elsewhere. Preferably, $W_i(f)$ is defined so that it changes continuously from 0 to 1 over a frequency range around each transition frequency.
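The first option can be sketched as follows. The description only requires the rating to increase with spatial performance and decrease with the noise increase; the linear form used here, the parameter `noise_weight` and the function names are hypothetical choices made for illustration.

```python
def preference(spatial_perf, noise_increase, noise_weight=1.0):
    """Assumed rating: increases with spatial performance and
    decreases with the noise increase."""
    return spatial_perf - noise_weight * noise_increase


def select_per_band(perf, noise, noise_weight=1.0):
    """perf[i][j] and noise[i][j] are the indices of candidate i in band j.
    Returns, for each band, the index of the best-rated candidate."""
    n_candidates, n_bands = len(perf), len(perf[0])
    best = []
    for j in range(n_bands):
        ratings = [
            preference(perf[i][j], noise[i][j], noise_weight)
            for i in range(n_candidates)
        ]
        best.append(max(range(n_candidates), key=ratings.__getitem__))
    return best


def indicator_weights(best, n_candidates):
    """W_i per band: 1 where candidate i is the best, 0 elsewhere."""
    return [
        [1.0 if best[j] == i else 0.0 for j in range(len(best))]
        for i in range(n_candidates)
    ]


# Illustrative values: candidate 1 performs better spatially but is noisy
# in the lowest band, so candidate 0 is preferred there.
perf = [[0.6, 0.6, 0.6], [0.8, 0.8, 0.8]]
noise = [[0.0, 0.0, 0.0], [0.5, 0.1, 0.0]]
best = select_per_band(perf, noise)
weights = indicator_weights(best, n_candidates=2)
```

These indicator weights switch abruptly between bands; the preferred variant replaces each jump by a continuous transition from 0 to 1 around the transition frequency.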

The optimal processing data are then calculated in step E87 as a weighting of the candidate solutions as a function of frequency:

$$D_{op}(f) = \sum_i W_i(f)\, D_i(f) \qquad (17)$$

This definition is advantageously suitable for processing in the frequency domain, as illustrated in FIG. 7.

Advantageously, the calculation of the preference rating can be modified to reflect the ease of interpolation between candidate solutions on adjacent frequency bands.

Similarly, the weighting functions can be defined to optimize the interpolation between adjacent band solutions.

FIG. 9 describes a particular embodiment of the processing device 350 according to the invention. In hardware terms, this device 350 typically comprises a processor μP cooperating with a memory block BM including a storage and/or working memory, as well as the aforementioned database BD listing the possible processing operations according to the level of presence of the spatial components. The memory block may advantageously comprise a computer program comprising code instructions for implementing the steps of the method within the meaning of the invention when these instructions are executed by a processor μP of the device 350, and in particular a first step of obtaining information representative of the presence level of the spatial components of the audio stream as a function of frequency, a second step of selecting a processing per frequency or frequency band according to the information obtained, and a third step of applying the selected processing operations to the 3D audio stream.

Typically, FIG. 4 can illustrate a flowchart representing the algorithm of such a computer program.

The computer program may also be stored on a memory medium readable by a reader of the device or downloadable in the memory space of the device 350.

This device 350 according to the invention can be independent or integrated into a digital audio signal decoder as described with reference to FIG.

Claims

1. A method for processing a coded 3D audio stream comprising a plurality of spatial components, characterized in that it comprises, during the decoding of the audio stream, the following steps: obtaining (E41) information representative of the presence level of the spatial components of the audio stream as a function of frequency; selecting (E42), per frequency or frequency band, a spatial decoding processing compatible with the information obtained; applying (E44) the selected treatments to the 3D audio stream.

2. Method according to claim 1, characterized in that it comprises a step of obtaining (E43) a global processing to be applied over the entire frequency band of the audio stream, from the selected treatments.

3. Method according to claim 2, characterized in that obtaining a global treatment comprises a step of aggregating the selected treatments and integrating a smoothing function between the different treatments.

4. Method according to claim 2, characterized in that the overall processing is a filter bank adapted to perform a spatial decoding of the audio stream before sound reproduction.

5. Method according to claim 1, characterized in that the information representative of the level of presence of the spatial components comes from characteristics of the devices for generating the audio stream and is obtained by reading data related to the audio stream.

6. Method according to claim 1, characterized in that the information representative of the presence level of the spatial components is obtained by analysis of the audio stream, the analysis comprising a step of estimating the level of presence of the components by comparing the energy levels of the components as a function of frequency.

7. Method according to claim 6, characterized in that it further comprises a step of estimating a noise level and / or a quality index.

8. Method according to claim 1, characterized in that the selected treatments are listed in a processing database.

9. Method according to claim 8, characterized in that the processing database comprises matrix coefficients and / or processing filters, and / or rules and parameters for constructing a processing function.

10. Method according to claim 1, characterized in that the selection of a processing per frequency or frequency band is also performed according to a noise level resulting from the application of said treatment and / or a quality level of said processing and / or a spatial performance level of the audio stream processed by said processing and / or characteristics of the processing selected in neighboring frequency bands.

11. The method of claim 1, characterized in that the selection of a frequency or frequency band processing comprises a step of compensating for the level of presence of spatial component to be applied to said processing.

12. Processing device (350) for decoding a coded 3D audio stream comprising a plurality of spatial components, characterized in that it comprises: a module (355) for obtaining information representative of the presence level of the spatial components of the audio stream as a function of frequency; a selection module (353) capable of selecting by frequency or frequency band a spatial decoding processing compatible with the information obtained; a processing module (322) adapted to apply the selected treatments to the 3D audio stream.

13. Digital audio decoder characterized in that it comprises a device according to claim 12.

14. Computer program comprising code instructions for implementing the steps of the method according to one of claims 1 to 11, when these instructions are executed by a processor.