WO1999027523A1

WO1999027523A1 - Method for reconstructing sound signals after noise abatement

Info

Publication number: WO1999027523A1
Application number: PCT/FR1998/002491
Authority: WO
Inventors: Dominique Pastor; Gérard REYNAUD
Original assignee: Sextant Avionique
Priority date: 1997-11-21
Filing date: 1998-11-20
Publication date: 1999-06-03
Also published as: FR2771543B1; FR2771543A1

Abstract

The invention concerns a method applied to a process for denoising a sound signal, which begins with digitisation followed by splitting (4a, 4b) of the noisy signals (u(t)) into two frame sequences. The frames (Tei) of the first sequence all have the same length. The first frame of the second sequence has a length equal to a half-frame, the other frames (T'ei) being of the same length as those of the first sequence (Tei), so as to obtain a half-frame shift. The frames of each sequence (Tei, T'ei) are denoised by applying a Fourier transform (50a, 50b), Wiener filtering (51a, 51b) and an inverse Fourier transform (52a, 52b). The method consists in weighting the frames of the two sequences (60a, 60b) with a cosine window and summing them (61) so as to reconstruct the noise-free signals (s'(t)).

Description

METHOD FOR RECONSTRUCTION, AFTER DENOISING, OF SOUND SIGNALS

The present invention relates to a method of reconstruction, after denoising, of sound signals.

It lies more particularly in the context of the denoising of sound signals containing speech picked up in noisy environments. Its main, although not exclusive, application is in telephone or radiotelephone communications, voice recognition, sound pick-up on board civil or military aircraft and, more generally, all noisy vehicles, on-board intercoms, etc.

By way of nonlimiting example, in the case of an aircraft, the noise comes from the engines, the air conditioning, the ventilation of the on-board equipment or aerodynamic sources. All these noises are picked up, at least partially, by the microphone into which the pilot or another member of the crew speaks. In addition, for this type of application in particular, one characteristic of the noise is that it varies greatly over time. Indeed, it depends strongly on the engine operating regime (take-off phase, stabilized regime, etc.). The useful signals, that is to say the signals representing the conversations, also have peculiarities: they are most often of short duration.

Finally, whatever the application envisaged, if one is interested in "voicing", certain particularities can be highlighted. As is known, voicing concerns elementary characteristics of pieces of speech, and more precisely relates to vowels as well as some of the consonants: "b", "d", "g", "j", etc. These sounds are characterized by an audio signal of pseudo-periodic structure. In speech processing, it is usual to consider that stationary regimes, in particular the aforementioned voicing, are established over durations of between 10 and 20 ms. This time interval is characteristic of the elementary phenomena of speech production and will be referred to as a frame below.

Also, it is usual for denoising methods to take into account this important characteristic of sound signals comprising speech.

These methods generally include the following main steps: splitting the audio signal to be denoised into frames; processing these frames by a Fourier transform operation (or a similar transform) to pass into the frequency domain, including appropriate windowing such as Hanning windowing; the denoising processing proper, by digital filtering; and a processing that is the dual of the first, by an inverse Fourier transform, to return to the time domain and reconstruct the denoised signal.

In practice, digital techniques are used. The frame signals are therefore not continuously evolving signals, but discrete signals obtained by sampling. It is assumed that the signals are sampled with a period Te before digital processing. It is then common to consider 2^p samples per signal frame, p being chosen so that the value 2^p × Te is of the order of magnitude of the duration D of a frame. By way of example, for a sampling frequency of 10 kHz, frames of 12.8 ms are often chosen, so that 128 points, a power of two, are available for each frame. The number of samples corresponding to a frame will be denoted LGtrame below. The following relation is therefore satisfied: D = LGtrame × Te.

Figure 1, appended to this description, illustrates the steps of a process for denoising a noisy signal u(t) = s(t) + x(t) according to the known art, in accordance with what has just been recalled. s(t) is the useful signal (a speech signal, for example) and x(t) is the noise signal. The process includes three phases:

- a phase 1 of division into frames, itself comprising two steps: the digitization of the signal u(t) and its storage in buffer memory (step 10), and the division of the original signal into frames of length LGtrame and the reading of these frames (step 11);

- a denoising phase 2, itself comprising two steps: the application of a Fourier transform or an equivalent transform to pass into the frequency domain (step 20) on a series of input frames Tei (i varying from 1 to N, the maximum number of frames in the sequence), and digital filtering, which performs the actual denoising (step 21); and

- a signal reconstruction phase 3, by applying an inverse Fourier transform or, more generally, a transform dual to the first (step 22), which generates a series of output frames Tsi.

The useful signal s'(t) is recovered at the end of phase 3. This signal is denoted s'(t) and not s(t), since it is an "estimated" signal and not the exact useful signal s(t) that would ideally be extracted from the noisy signal u(t). It contains errors with respect to the exact value of the signal s(t), and the error level fluctuates over time.
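The frame-length arithmetic and the division into frames of phase 1 can be sketched as follows. This is a minimal illustration in Python and not the implementation of the method: the function names `frame_length` and `split_frames` are hypothetical, and dropping an incomplete final frame is an assumption made for brevity.

```python
def frame_length(sampling_hz, target_duration_s):
    """Return the power-of-two sample count whose duration is closest to the target."""
    best = 1
    for p in range(1, 16):
        candidate = 2 ** p
        if abs(candidate / sampling_hz - target_duration_s) < abs(best / sampling_hz - target_duration_s):
            best = candidate
    return best


def split_frames(samples, lg):
    """Cut a list of samples into consecutive full frames of lg samples (tail dropped)."""
    count = len(samples) // lg
    return [samples[i * lg:(i + 1) * lg] for i in range(count)]


# Example from the description: 10 kHz sampling, frames of about 12.8 ms.
lg = frame_length(10_000, 0.0128)
duration_ms = 1000 * lg / 10_000          # D = LGtrame x Te, in milliseconds
frames = split_frames(list(range(300)), lg)
print(lg, duration_ms, len(frames))
```

With the values quoted in the description, the sketch selects 128 samples per frame, which corresponds to the stated duration of 12.8 ms.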

The actual denoising operation (step 21) is advantageously carried out using an optimal Wiener filter. This filter has the advantage of treating each frame, a priori, differently from the previous frame and from the following frame.

If we call:

- U(n) the discrete Fourier transform of the observed random process, i.e. the noisy signal;

- S(n) the discrete Fourier transform of the "desired" process, to be estimated by linear filtering of U(n);

- X(n) the discrete Fourier transform of the additive noise polluting the useful signal;

- W(n) the estimation filter expressed in the frequency domain;

- γs(n) the spectral density of the useful signal; and

- γx(n) the spectral density of the parasitic noise,

the equation describing the Wiener filter is given by the following relation:

$$W(n) = 1 - \frac{\gamma_x(n)}{\gamma_u(n)} \quad (1)$$

relation in which:

$$\gamma_u(n) = \gamma_x(n) + \gamma_s(n) \quad (2)$$

By way of nonlimiting examples, Wiener filters are described in the following books, to which one can profitably refer:

- Yves THOMAS: "Signals and linear systems", MASSON editions (1994); and

- François MICHAUT: "Adaptive methods for the signal", HERMES edition (1992).

Examination of equation (1) shows that the parameters of the Wiener filter vary from one frame to another: while the numerator of the second term is fixed for a finite number of frames, the denominator is variable.

At the output of the Wiener filter, the frames are therefore denoised one by one, with filter coefficients adapted to each frame, which constitutes an important advantage.
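As a brief check, and assuming that equation (1) takes the usual Wiener form W(n) = 1 − γx(n)/γu(n) (this form is an assumption here), substituting relation (2) gives an equivalent expression for the filter:

```latex
W(n) = 1 - \frac{\gamma_x(n)}{\gamma_u(n)}
     = \frac{\gamma_u(n) - \gamma_x(n)}{\gamma_u(n)}
     = \frac{\gamma_s(n)}{\gamma_s(n) + \gamma_x(n)}
```

Since the spectral densities are non-negative, this gain lies between 0 and 1 for each frequency index n: it approaches 1 where the useful signal dominates and 0 where the noise dominates.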

The frames can be connected simply by "joining" the denoised frames one after the other, the reconstruction phase then being limited to the application of the inverse Fourier transform. However, there are edge effects, due in particular to the various Fourier transforms, which are not completely reduced by the windowing (Hanning, for example) applied before the direct Fourier transform. In addition, the filters used to denoise each of the frames are different, as just indicated. There can therefore be no "continuity" of the denoised signal. The use of a Wiener filter, which has definite advantages (differentiated processing of the frames), is therefore not free from drawbacks either.

FIG. 2 is a diagram illustrating the parasitic consequences of the edge effects. To fix ideas, consider a simple noisy signal in the form of a sum of two sinusoidal functions; the edge effects appear as energy peaks at the ends of the two frames of this test signal. The two sinusoids and the test conditions are as follows:

- for the useful signal: s(t) = sin(2π·10·t);

- for the noise: x(t) = 0.5 × sin(2π·50·t);

- signal-to-noise ratio: SNR ≈ 6 dB; and

- number of samples per frame: 128.
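The test signal defined above can be reproduced with a short sketch, which also checks the quoted signal-to-noise ratio. The sampling frequency of 10 kHz is an assumption borrowed from the earlier framing example, since it is not stated for this test, and the helper names are hypothetical.

```python
import math

FS_HZ = 10_000  # assumed sampling frequency (not stated for this test signal)


def useful(t):
    """Useful signal s(t): unit-amplitude sine at 10 Hz."""
    return math.sin(2 * math.pi * 10 * t)


def noise(t):
    """Noise x(t): sine of amplitude 0.5 at 50 Hz."""
    return 0.5 * math.sin(2 * math.pi * 50 * t)


def mean_power(signal, n_samples, fs_hz):
    """Average of the squared samples over n_samples points."""
    return sum(signal(k / fs_hz) ** 2 for k in range(n_samples)) / n_samples


# One second covers a whole number of periods of both sines.
p_s = mean_power(useful, FS_HZ, FS_HZ)
p_x = mean_power(noise, FS_HZ, FS_HZ)
snr_db = 10 * math.log10(p_s / p_x)

# Two frames of 128 samples of the noisy signal u(t) = s(t) + x(t).
noisy_frame_pair = [useful(k / FS_HZ) + noise(k / FS_HZ) for k in range(256)]
print(round(snr_db, 2))
```

An amplitude ratio of 2 between the two sines corresponds to a power ratio of 4, that is about 6.02 dB, consistent with the figure of 6 dB given in the list.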

The vertical axis of the diagram in FIG. 2 gives the amplitude of the error present in the output signal, and the horizontal axis gives the duration of the signal in number of samples. Two frames, called "Frame 1" and "Frame 2", are shown, making a total of 256 samples. For a maximum amplitude equal to unity, the curve shows, in this particular example, mean amplitude fluctuations of the order of ± 15% around zero, and large amplitude peaks, greater than ± 50% of the useful signal. These peaks are due to edge effects in the "connection" areas between frames.

To combat this edge effect in the areas of connection between frames, it is known to carry out a double cutting of the sound signal to be denoised into frames, so as to obtain two series of frames offset by a fraction of a frame length, to subject the two series of frames, independently of one another, to a denoising treatment similar to that illustrated in FIG. 1, and then to sum the frames of the two sequences after denoising, taking their offset into account. This process, although effective, leaves residual edge effects, so that the denoised sound signal is still affected by an annoying processing noise.

The invention aims, while retaining the advantages of the methods according to the known art, to overcome their disadvantages, and in particular to avoid the abovementioned edge effects. More generally, it makes it possible to minimize the residual error remaining between the denoised signal generated, that is to say the "estimated" signal, and the real noise-free signal.

It relates to a method of reconstructing a sound signal after a double cutting of the noisy sound signal into frames, so as to obtain two sets of frames offset by a fraction of a frame length, and denoising of each of the two sets of frames, the method consisting in performing a windowing operation on each of the two sequences of frames after they have been denoised and before they are summed to provide the final denoised sound signal.

According to an important characteristic, the windowing produced during the reconstruction operation on each of the two series of frames after denoising is such that the summation of windowed frames gives a result always equal to unity, regardless of the rank of the frame in one or the other of the two sequences.

In a preferred embodiment, the weighting window used on the two denoised frame sequences is of the “cosine” function type g(k) and obeys the following relationship:

$$g(k) = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi k}{\mathrm{LGtrame}}\right)\right], \quad k \in [1;\ \mathrm{LGtrame}]$$

A denoising processing of noisy sound signals consisting of so-called useful sound signals mixed with noise signals, implementing the reconstruction method according to the invention comprises the following different steps:

- splitting of said noisy sound signals into two sequences of consecutive time frames, the first sequence consisting of identical frames of a determined length and the second sequence consisting of a first frame whose length is a predetermined fraction of said determined length, followed by identical frames of length equal to said determined length, so as to create a time offset between the frames of said first and second sequences, the magnitude of which is equal to the length of said first frame of the second sequence;

- successive readings of all the frames of said first sequence and of all the frames of said second sequence with the exception of the first, so as to keep said offset;

- denoising, frame by frame, of the frames successively read from said first sequence and of the frames successively read from said second sequence, so as to extract denoised frames from each of said sequences;

- weighting of the denoised frames of said first and second sequences by multiplying each of these frames by a weighting window representing a determined function; and

- summation of the weighted frames of said first sequence with the weighted frames of said second sequence, these frames having an overlap equal to said predetermined fraction of length.

The invention also relates to the application of this reconstruction method to speech processing.

The invention will be better understood and other characteristics and advantages will appear on reading the description which follows with reference to the appended figures, among which:

- Figure 1 is a block diagram illustrating the main phases and steps of an example of a method for denoising a noisy signal according to the known art;

- Figure 2 is a diagram illustrating the error remaining on a particular noisy signal, an error due to parasitic effects generated by the method of Figure 1;

- Figure 3 is a block diagram illustrating the main phases and steps of an example of denoising processing of a noisy signal implementing the reconstruction method according to the invention;

- Figure 4 illustrates the double cutting into frames used in the denoising treatment illustrated in the previous figure;

- Figure 5 illustrates a cosine function used as a weighting window in a preferred embodiment of the reconstruction method according to the invention;

- Figure 6 schematically illustrates the final step of summing the two series of frames used in the reconstruction method according to the invention;

- Figure 7 is a diagram showing the effect of the steps of weighting by windowing and of summing the two series of frames used in the reconstruction method according to the invention; and

- Figure 8 is a diagram illustrating the error remaining on the particular noisy signal of Figure 2, processed by the reconstruction method according to the invention.

An example of a process according to the invention will now be described with reference to the diagram in FIG. 3.

The signal u(t) to be processed, that is to say to be denoised, is first digitized and stored in a buffer memory. The processing chain is then split into two parallel paths, associated in FIG. 3 with the indices "a" and "b", respectively. Each of the paths takes up most of the phases and steps of the denoising processing according to the prior art: cutting into frames, denoising and reconstruction of the signal. Consequently, the basic processing operations will only be described again where necessary. More specifically, the left path (in Figure 3), arbitrarily associated with the index "a", is strictly identical to the processing chain shown in Figure 1. The right path (in Figure 3), arbitrarily associated with the index "b", comprises an additional step which will be described below.

A final step of reconstruction of the denoised signal makes it possible to recombine the signals obtained following the two series of denoising processing carried out in parallel.

The first phase of the denoising treatment consists of a double cutting into frames (blocks 4a and 4b). A first step (40a and 40b) consists in storing the digital samples obtained in two buffer memories, 40a and 40b, of the "FIFO" ("First In, First Out") type. As before, the second step (42a and 42b) of this phase consists in cutting the original signal into frames of length LGtrame and in reading these frames.

The two series of frames have the following characteristics:

The first series of frames represents the division of the original signal u(t) into frames of length LGtrame: this is the series of frames Tei, called input frames for the denoising phase carried out by block 5a.

The formation of the second series of frames begins with the reading of an initial frame of length less than LGtrame (step 41b): let Δ be its duration. This frame is not useful, in the sense that it is not taken into account in the subsequent operations, but it is decisive for obtaining a "shift" between the two sequences of frames. Reading then continues with frames of length LGtrame again: this is the series of frames T'ei, called input frames for the denoising phase carried out by block 5b.

FIG. 4 is a diagram illustrating the double cutting just described. The upper part of FIG. 4 shows the first five frames of the first series: Te1 to Te5, all of identical length equal to LGtrame (first cutting). The lower part of the same figure shows the first five frames of the second sequence: T', T'e1 to T'e4. The frame T' is particular. It is obtained in step 41b by reading a frame of length Δ, i.e. a fraction of the frame length LGtrame equal to [LGtrame / x]. The other frames, T'e1 to T'e4, are again frames of length LGtrame (second cutting).

The frame T' is eliminated from the subsequent processing. It can be seen that the frames of the same rank in the two sequences overlap, with an "offset" equal to Δ, i.e. [LGtrame / x]. It should be clear, however, that the notion of "offset" does not mean that the second sequence is a time-shifted copy of the frames of the first sequence. The start of the frame T'e1 corresponds to the amplitude of the signal u(t) at the instant t0 + Δ (t0 being an arbitrary initial instant), while the start of the frame Te1 corresponds to the amplitude of the signal u(t) at the instant t0, in both cases after digitization.
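The double cutting described above can be sketched as follows. This is only an illustration: the function name `double_cut` is hypothetical, the signal is a list of sample indices so that frame boundaries are easy to read, and incomplete trailing frames are dropped by assumption.

```python
def double_cut(samples, lg, x=2):
    """Return (first_sequence, second_sequence, delta) for frames of lg samples."""
    delta = int(lg / x)  # length of the initial frame T', integer part of lg / x
    # First cutting: frames Te1, Te2, ... starting at the initial instant.
    first = [samples[i:i + lg] for i in range(0, len(samples) - lg + 1, lg)]
    # Second cutting: the initial frame samples[:delta] is read and discarded,
    # then full-length frames T'e1, T'e2, ... follow.
    second = [samples[i:i + lg] for i in range(delta, len(samples) - lg + 1, lg)]
    return first, second, delta


signal = list(range(512))
first, second, delta = double_cut(signal, lg=128)
print(delta, first[0][0], second[0][0], len(first), len(second))
```

Both sequences are read from the same digitized signal, so the frame of rank 1 in the second sequence starts delta samples later than the frame of rank 1 in the first sequence.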

We then proceed to a denoising phase for the two sequences of input frames Tei and T'ei (Figure 3). The two blocks 5a and 5b can be identical to that described in FIG. 1.

This denoising phase is based on a modification of the frequency components of the signal to be processed. It generally involves a passage into the frequency domain by means of a fast Fourier transform or the like (step 50a or 50b), an actual denoising operation (step 51a or 51b) by digital filtering, advantageously of the Wiener type, and a return to the time domain by an inverse Fourier transform or the like (step 52a or 52b).
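The three steps of this denoising phase can be sketched for a single frame as follows. Several points are assumptions made for illustration only: the names `dft` and `denoise_frame` are hypothetical, a naive discrete Fourier transform stands in for the fast Fourier transform, the spectral density of the noisy frame is estimated by a simple periodogram, the gain is taken in the usual Wiener form 1 − γx/γu, and it is clamped at zero.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def denoise_frame(frame, gamma_x):
    """Transform one frame, apply a per-bin gain, and transform back."""
    n = len(frame)
    spectrum = dft(frame)
    filtered = []
    for u_n, gx in zip(spectrum, gamma_x):
        gamma_u = abs(u_n) ** 2 / n  # periodogram estimate (assumption)
        # Assumed Wiener-type gain, clamped so that it never becomes negative.
        gain = max(0.0, 1.0 - gx / gamma_u) if gamma_u > 0 else 0.0
        filtered.append(gain * u_n)
    return [value.real for value in dft(filtered, inverse=True)]


frame = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
unchanged = denoise_frame(frame, [0.0] * len(frame))  # zero noise estimate: gain 1
silenced = denoise_frame(frame, [1e9] * len(frame))   # huge noise estimate: gain 0
```

Because the gain is recomputed from the spectrum of each frame, two consecutive frames are generally filtered differently, which is the source of the discontinuities that the reconstruction step addresses.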

Here, the denoising processing is performed not only on the original signal, that is to say on the sequence of frames Tei, but also on the "shifted" signal, that is to say on the sequence of frames T'ei. The offset between these two signals is variable and can in particular be set as a function of the desired response time. This time offset Δ, induced by the reading of an initial frame during the creation of the second series of frames, can advantageously be expressed as the integer part of a fraction of LGtrame:

Δ = E(LGtrame / x), a relation in which E denotes the integer part. Two sequences of processed frames, Tsi and T'si, called output frames, are thus obtained. Like the input frames Tei and T'ei, these output frames Tsi and T'si are temporally offset by the value Δ.

The last phase of the process consists in reconstructing the noise-free signal s'(t). The first step, 60a or 60b, consists of windowing, which is carried out independently on each frame of the two sequences. The weighting window used has specific characteristics which will be specified below.

We now turn to a preferred embodiment in which the value of x is equal to 2. In other words, the initial frame T' is exactly half a frame long, i.e. [LGtrame / 2]. The frames of the two sequences therefore overlap by half.

Still in a preferred embodiment, the function describing the window used for weighting is a cosine function, as shown in the diagram in FIG. 5. In the context of the example described, a frame comprises 128 samples, and the horizontal axis of the diagram is graduated in number of samples.

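The mathematical equation describing this function is given by the following relation:

$$g(k) = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi k}{\mathrm{LGtrame}}\right)\right], \quad k \in [1;\ \mathrm{LGtrame}] \quad (3)$$

Within a frame, this function has a maximum amplitude A equal to unity at [LGtrame / 2], i.e. n = 64 samples, and passes through zero at n = 0 and n = 128.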
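The stated properties of the window can be checked numerically with a short sketch. The raised-cosine expression used below is an assumption: it is the cosine form that reaches unity at mid-frame and zero at the frame ends, as described above. The function name `cosine_window` is hypothetical.

```python
import math


def cosine_window(lg):
    """Assumed raised-cosine weights g(k) = 0.5 * (1 - cos(2*pi*k/lg)), k = 1..lg."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / lg)) for k in range(1, lg + 1)]


g = cosine_window(128)
# Position of the maximum, converted back to the 1-based index k.
peak_index = max(range(len(g)), key=g.__getitem__) + 1
print(peak_index, g[peak_index - 1], g[-1])
```

For a frame of 128 samples, the maximum is found at k = 64 and the weight at k = 128 is zero to within rounding.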

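If we call:

- $g_1(k)$ the part of $g(k)$ for $k \in \left[1;\ \frac{\mathrm{LGtrame}}{2}\right]$ (4), and

- $g_2(k)$ the part of $g(k)$ for $k \in \left[\frac{\mathrm{LGtrame}}{2} + 1;\ \mathrm{LGtrame}\right]$ (5),

there is a complementary property between $g_1(k)$ and $g_2(k)$, expressed by the following relation:

$$g_1\left(k + \frac{\mathrm{LGtrame}}{2}\right) + g_2(k) = 1, \quad k \in \left[1;\ \frac{\mathrm{LGtrame}}{2}\right] \quad (6)$$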
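The reason why two window values taken half a frame apart sum to unity can be made explicit. The short derivation below assumes the raised-cosine form g(k) = ½[1 − cos(2πk/LGtrame)] for the window and writes L for LGtrame; it is a sketch under that assumption.

```latex
\begin{aligned}
g\!\left(k + \tfrac{L}{2}\right)
  &= \tfrac{1}{2}\left[1 - \cos\!\left(\tfrac{2\pi k}{L} + \pi\right)\right]
   = \tfrac{1}{2}\left[1 + \cos\!\left(\tfrac{2\pi k}{L}\right)\right], \\
g(k) + g\!\left(k + \tfrac{L}{2}\right)
  &= \tfrac{1}{2}\left[1 - \cos\!\left(\tfrac{2\pi k}{L}\right)\right]
   + \tfrac{1}{2}\left[1 + \cos\!\left(\tfrac{2\pi k}{L}\right)\right]
   = 1 .
\end{aligned}
```

The cosine terms cancel for every k, so the sum does not depend on the position within the half-frame.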

After weighting (step 60a or 60b of phase 6 of signal reconstruction), the denoised signal segments are added (step 61), as shown diagrammatically in FIG. 6. Thus, a half-frame weighted by the first part (g1(k)) of the cosine window is added to a half-frame weighted by the second part (g2(k)) of this same window. The denoised frames at the outputs of the weighting and summation steps are referenced TD1i, TD2i and TDi for the first sequence (cutting 1), the second sequence (cutting 2) and the result of the summation, respectively. Given the property expressed by equation (6), the signal resulting from the summation constitutes the sought-after denoised signal s'(t).

In FIG. 6, the following frames are shown:

- TD1_m and TD1_{m+1}: two consecutive denoised frames of ranks m and m+1, from cutting 1;

- TD2_{m-1}, TD2_m and TD2_{m+1}: three consecutive denoised frames of ranks m-1, m and m+1, from cutting 2; and

- TD_m and TD_{m+1}: two consecutive denoised frames of ranks m and m+1, after summation (step 61).

The summation process is therefore as follows: the first half-frame of TD_m is equal to the sum of the first half-frame of TD1_m and the second half-frame of TD2_{m-1}; the second half-frame of TD_m is equal to the sum of the second half-frame of TD1_m and the first half-frame of TD2_m; and so on, until all the frames have been processed if the signal u(t) is of finite length. Otherwise, the process is continuous.
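The weighting and summation steps can be sketched as follows. To isolate the reconstruction from the filtering, the "denoised" frames in this example are simply taken unchanged from a clean signal, which is an assumption made for the demonstration. The raised-cosine window and the names `cosine_window` and `reconstruct` are likewise assumptions.

```python
import math


def cosine_window(lg):
    """Assumed raised-cosine weights g(k), k = 1..lg."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / lg)) for k in range(1, lg + 1)]


def reconstruct(first, second, lg):
    """Weight every denoised frame, then sum the two sequences half a frame apart."""
    half = lg // 2
    window = cosine_window(lg)
    out = [0.0] * max(len(first) * lg, half + len(second) * lg)
    for offset, sequence in ((0, first), (half, second)):
        for m, frame in enumerate(sequence):
            for j, value in enumerate(frame):
                out[offset + m * lg + j] += window[j] * value
    return out


# Frames taken unchanged from a clean signal, so that any reconstruction
# error comes from the weighting and summation alone.
LG = 128
signal = [math.sin(0.05 * i) for i in range(4 * LG)]
first = [signal[i:i + LG] for i in range(0, len(signal) - LG + 1, LG)]
second = [signal[i:i + LG] for i in range(LG // 2, len(signal) - LG + 1, LG)]
rebuilt = reconstruct(first, second, LG)
worst = max(abs(rebuilt[i] - signal[i])
            for i in range(LG // 2, len(signal) - LG // 2))
print(worst)
```

Wherever both sequences contribute, the two weights sum to unity and the signal is recovered to within rounding. In this finite example, the first and last half-frames are covered by one sequence only and therefore remain attenuated.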

FIG. 7 is a diagram illustrating the result of the summation of the two weighting windows g1(k) and g2(k), in accordance with the remarkable property that links them (equation (6)). The two curves and the result g(k) = g1(k) + g2(k) are plotted on this diagram over the length of a half-frame. It can be seen that, at all times, the curve g(k) = g1(k) + g2(k) is a line of zero slope at ordinate 1. This result remains true whatever the pair of half-frames considered.

The cosine weighting window (equation (3)) thus makes it possible to cancel the edge effects at the ends of each frame.

The signal is therefore correctly denoised, while eliminating the parasitic effects observed in the processes of the known art.

By way of comparison, if we again consider the example that led to the error curve in FIG. 2, that is to say a simple noisy signal in the form of a sum of two sinusoidal functions whose parameters have been given previously, we obtain the error curve (over two frames, that is to say 256 samples as before) given by the diagram of FIG. 8.

It can be seen that there are no more edge effects and that the error curve is reduced to a very low ripple, with residuals lying in the range ± 0.05. Since the maximum signal amplitude is ± 1 (sinusoidal functions), this value is to be compared with the peaks due to edge effects, which are greater than ± 50% of the maximum amplitude of the useful signal (Figure 2), that is to say ± 0.5. The maximum error is therefore reduced by a factor of ten. Similarly, the residual ripple in amplitude of the error function is three times smaller (± 0.05 instead of ± 0.15).

On reading the above, it is easy to see that the invention achieves the goals it has set for itself.

It should be clear, however, that the invention is not limited only to the examples of embodiments explicitly described, in particular in relation to FIGS. 3 to 8.

Although it is particularly advantageous to adopt an offset value:

$$\Delta = E\left(\frac{\mathrm{LGtrame}}{2}\right)$$

which corresponds to the preferred embodiment, other values can more generally be adopted, such as:

$$\Delta = E\left(\frac{\mathrm{LGtrame}}{x}\right)$$

with x > 1.

It is only necessary that the function associated with the weighting window be such that the relation (6) is verified at all times.

Likewise, the weighting function is not limited to cosine-type functions alone, although such a function has the advantage of gentle variations. One can, in fact, use sawtooth functions of the triangle or trapezoid type. However, these functions are likely to induce parasitic effects, because they exhibit abrupt variations at the changes of slope.
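As an illustration of such an alternative, the sketch below checks that a triangular window also sums to unity when two copies are taken half a frame apart. The exact triangular shape used here (linear rise to 1 at mid-frame, linear fall to 0 at the end of the frame) and the function names are assumptions.

```python
def triangle_window(lg):
    """Triangular weights rising linearly to 1 at mid-frame, for k = 1..lg."""
    return [1.0 - abs(2.0 * k / lg - 1.0) for k in range(1, lg + 1)]


def max_deviation_from_unity(window):
    """Largest |g(k) + g(k + lg/2) - 1| over the overlapping half-frame."""
    half = len(window) // 2
    return max(abs(window[i] + window[i + half] - 1.0) for i in range(half))


deviation = max_deviation_from_unity(triangle_window(128))
print(deviation)
```

The sum is constant here as well, but the slope of the triangle changes abruptly at mid-frame and at the frame ends, which is the drawback mentioned above.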

Claims

1. Method for reconstructing a sound signal, after double cutting the signal into frames so as to obtain two sets of frames offset by a fraction of frame length and denoising of each of the sets of frames, characterized in that it consists in carrying out a windowing operation on each of the two series of frames after they have been denoised and before they are summed to provide the final denoised sound signal.

2. Method according to claim 1, characterized in that said windowing operation consists in applying to each denoised signal frame a weighting function g(k) contained in a window of length equal to that of a frame and obeying the following relationship:

$$g_1\left(k + \frac{\mathrm{LGtrame}}{2}\right) + g_2(k) = 1, \quad k \in \left[1;\ \frac{\mathrm{LGtrame}}{2}\right]$$

with:

LGtrame equal to said determined frame length,

$g_1(k)$ the part of $g(k)$ for $k \in \left[1;\ \frac{\mathrm{LGtrame}}{2}\right]$, and

$g_2(k)$ the part of $g(k)$ for $k \in \left[\frac{\mathrm{LGtrame}}{2} + 1;\ \mathrm{LGtrame}\right]$.

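3. Method according to claim 2, characterized in that said weighting function is a cosine function in a window of length equal to said determined frame length (LGtrame) and obeying the following relation:

$$g(k) = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi k}{\mathrm{LGtrame}}\right)\right], \quad k \in [1;\ \mathrm{LGtrame}]$$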

4. Denoising treatment of a sound signal implementing the reconstruction method according to claim 1, characterized in that it comprises a preliminary step consisting in digitizing said noisy signals (u(t)) by sampling before said cutting into two series of frames (Tei, T'ei), and a step (40a, 40b) of storing the digitized frames of the two sequences in two circulating memories of the "first in - first out" type.

5. Treatment according to claim 4, characterized in that it comprises a denoising step (5a, 5b) consisting, independently for each of said two sequences, in applying to each frame (Tei or T'ei) a fast Fourier transform (50a, 50b), digital filtering (51a, 51b), differentiated from one frame to another, followed by an inverse Fourier transform (52a, 52b).

6. Treatment according to claim 5, characterized in that said digital filtering (51a, 51b) is carried out using a Wiener filter.

7. Application of the processing according to any one of claims 4 to 6 to the denoising of noisy speech signals (u (t)).

8. Method according to claim 7, characterized in that the duration of said frames (Tei, T'ei) is in the range 10 to 20 ms.