CN104142492A

CN104142492A - SRP-PHAT multi-source spatial positioning method

Info

Publication number: CN104142492A
Application number: CN201410366922.4A
Authority: CN
Inventors: 孙明
Original assignee: Foshan University
Current assignee: Foshan University
Priority date: 2014-07-29
Filing date: 2014-07-29
Publication date: 2014-11-12
Anticipated expiration: 2034-07-29
Also published as: CN104142492B

Abstract

The invention describes an SRP-PHAT multi-source spatial positioning method. The method first assumes that the number and spatial positions of all microphones in a uniform circular microphone array remain unchanged during data acquisition, and that the isotropic microphones are evenly distributed on a circle of radius r in the x-y plane. Polar coordinates represent the direction of arrival of the plane wave s, and the origin of the coordinate system is at the center of the circular array. The multi-source signal is divided into non-overlapping sets of time-frequency points so that each time-frequency window contains only one active source signal, which satisfies the weak W-disjoint orthogonality condition. A Hamming window is selected, the steered response power function is calculated, and an objective function is obtained with the SRP-PHAT algorithm. The beam is steered over all possible receiving directions, and the direction with the largest beam output power is taken as the direction of the sound source. As a result, DOA estimation of multiple sound sources has good separation performance in acoustic environments with strong noise and moderate reverberation, the true peaks stand out clearly, and the positioning accuracy is high.

Description

An SRP-PHAT multi-source spatial positioning method

Technical field

The present invention relates to a spatial positioning method, and specifically to an SRP-PHAT multi-source spatial positioning method, which is applicable to systems such as video conferencing, speech enhancement, hearing aids, hands-free telephones, and intelligent robots.

Background art

Sound source localization has a wide range of applications in systems such as video conferencing, speech enhancement, hearing aids, hands-free telephones, and intelligent robots, and has received increasing attention in recent years.

The steered response power with phase transform weighting (SRP-PHAT: Steered Response Power-Phase Transform) sound source localization algorithm has become a mainstream algorithm. It combines the advantages of steered beamforming and GCC-PHAT and is robust at low signal-to-noise ratios. It performs well for a single sound source, but its main drawback is its heavy computational load, which limits its use in real-time systems.

Many researchers have tried to reduce the computation of its core steered-response-power search. For example, the two-stage accelerated SRP-PHAT sound source localization algorithm uses perpendicularly arranged arrays to convert the two-dimensional search into one-dimensional searches and applies a hierarchical, coarse-to-fine search strategy in the one-dimensional space. Similarly, the improved joint SRP-PHAT speech localization algorithm uses orthogonal linear microphone arrays to reduce the two-dimensional search space to two one-dimensional spaces, then performs a hierarchical search in each one-dimensional space to find the SRP maximum and determine the source position.

In practice, the positions of several sound sources usually have to be estimated. The existing W-disjoint orthogonality assumption, which is based on the sparsity of speech signals, does not hold for multiple sound sources. Methods that rely on it therefore have low spatial resolution and are easily affected by reverberation; in particular, under reverberation and noise they cannot resolve two signal sources that are close in direction. The multi-source localization problem is therefore of considerable theoretical and practical importance.

Summary of the invention

The present invention overcomes the shortcomings of the prior art and provides an SRP-PHAT multi-source spatial positioning method that can resolve several signal sources that are close in direction under reverberation and noise, with good positioning performance.

To solve the above technical problem, the present invention is implemented through the following technical solution:

An SRP-PHAT multi-source spatial positioning method, characterized in that it comprises the following steps:

1) Determine the spatial coordinates under the stated assumptions. First assume that the number and spatial positions of all microphones of the uniform circular microphone array remain unchanged during data acquisition, that the distance between the sound source and the microphones meets the requirements of the sound-field model, and that all microphones have identical physical properties. The isotropic microphones are evenly distributed on a circle of radius r in the x-y plane. Polar coordinates represent the direction of arrival of the plane wave s, and the origin of the coordinate system is at the center of the circular array. The pitch angle of the signal is θ ∈ [0, π/2] and the azimuth is φ ∈ [0, 2π];

2) Divide the multi-source signal into non-overlapping sets of time-frequency points so that each time-frequency window contains only one active source signal, which satisfies the weak W-disjoint orthogonality condition. A Hamming window is chosen; W-disjoint orthogonality holds exactly when WDO_M = 1;

3) Use the SRP-PHAT algorithm to calculate the phase-transform steered response power function over all microphone pairs and obtain an objective function. The beam of the beamformer is steered over all possible receiving directions, and the direction with the largest beam output power is taken as the direction of the sound source.

Further, step 2) comprises:

First, two important performance criteria are introduced: (1) how well the mask preserves the source of interest; (2) how well the mask suppresses the interfering sources.

The multi-source signal is divided into non-overlapping sets of time-frequency points so that each time-frequency window contains only one active source signal, which approximately satisfies

S_j(t, ω) S_k(t, ω) \approx 0, \quad \forall t, ω

The time-frequency mask is defined as

M_j(t, ω) = \begin{cases} 1, & S_j(t, ω) \neq 0 \\ 0, & \text{otherwise} \end{cases}

By estimating the time-frequency mask of each source, a given source j can then be recovered from the mixture:

S_j(t, ω) = M_j(t, ω) X(t, ω), \quad \forall t, ω

where M_j is the indicator function of the support of source j, and S_j(t, ω) and X(t, ω) are the time-frequency representations of s_j(t) and x(t), respectively.

For a given time-frequency mask M, the preserved-signal ratio PSR_M is defined as:

\mathrm{PSR}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2}{\| S_j(t, ω) \|^2}

PSR_M measures the percentage of the energy of source S_j that remains after the mask is applied.

Also define

z_j(t) = \sum_{k = 1, k \neq j}^{N} s_k(t)

where z_j(t) is the sum of the interfering sources that are active alongside source s_j.

The signal-to-interference ratio after applying the time-frequency mask M is defined as:

\mathrm{SIR}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2}{\| M(t, ω) Z_j(t, ω) \|^2}

where SIR_M measures the signal-to-interference ratio after the signal is separated with the time-frequency mask M.

The approximate W-disjoint orthogonality WDO_M can be evaluated from PSR_M and SIR_M:

\mathrm{WDO}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2 - \| M(t, ω) Z_j(t, ω) \|^2}{\| S_j(t, ω) \|^2}

where Z_j(t, ω) is the time-frequency representation of z_j(t). Because speech signals have a sparse time-frequency representation, a small part of the representation accounts for the vast majority of the total power, and the magnitude of the product of the time-frequency representations of two sources is usually small, so the weak W-disjoint orthogonality condition is satisfied. In particular, W-disjoint orthogonality holds exactly when WDO_M = 1.
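The three measures defined above are linked by a short identity that can serve as a sanity check on numerical results. The block below is a sketch that follows from the definitions alone; it writes Z_j for the time-frequency representation of the interference sum z_j and drops the arguments (t, ω) for brevity.

```latex
\mathrm{WDO}_M
  = \frac{\|M S_j\|^2}{\|S_j\|^2}
  - \frac{\|M S_j\|^2}{\|S_j\|^2} \cdot \frac{\|M Z_j\|^2}{\|M S_j\|^2}
  = \mathrm{PSR}_M - \frac{\mathrm{PSR}_M}{\mathrm{SIR}_M}
```

Read this way, WDO_M = 1 requires that the mask keeps all of the target energy (PSR_M = 1) and passes no interference energy (SIR_M unbounded).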

Further, step 3) covers the SRP-PHAT algorithm for two microphones:

For an array with only two microphones, m_i and m_j, a signal arriving from azimuth φ and pitch angle θ reaches the two microphones with a delay difference Δτ_{ij}(θ, φ). This TDOA can be estimated with the generalized cross-correlation (GCC):

Δτ_{ij}(θ, φ) = \arg\max_{τ} P(r) = \arg\max_{τ} R_{s_i s_j}(Δτ_{ij}(θ, φ))

where P(r) is the spatial likelihood function of the three-dimensional position vector r, obtained by evaluating all possible θ and φ. In the frequency domain, the generalized cross-correlation function R_{s_i s_j}(Δτ_{ij}(θ, φ)) can be written as:

R_{s_i s_j}(Δτ_{ij}(θ, φ)) = \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω Δτ_{ij}(θ, φ)} \, dω

where Ψ_{ij}(ω) is the weighting function and S_i(ω) S_j^{*}(ω) is the cross-spectral density function.

The phase transform (PHAT) is a typical weighting method. The phase weighting function is defined as:

Ψ_{ij}(ω) = \frac{1}{| S_i(ω) S_j^{*}(ω) |}

With a suitable weighting function, the delay-and-sum steered response power satisfies the SNR-optimization criterion, and the generalized cross-correlation R_{s_i s_j}(Δτ_{ij}(θ, φ)) shows a single peak within the limited range of τ, which corresponds to the TDOA between microphones m_i and m_j.

Further, step 3) covers the SRP-PHAT algorithm for a circular microphone array:

The generalized cross-correlations of all microphone pairs are summed:

P(Δτ_1, Δτ_2, \ldots, Δτ_N) = \sum_{i=1}^{N} \sum_{j=1}^{N} R_{s_i s_j}(Δτ_{ij}(θ, φ))

= \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω(Δτ_i - Δτ_j)} \, dω

where Δτ_1, Δτ_2, ..., Δτ_N are the steering delays of the N microphones, Δτ_i = τ_i - τ_0 for i = 1, ..., N, and τ_0 is the reference delay, taken as the minimum of all microphone delays.

Further, step 3) covers the multi-source SRP-PHAT algorithm for a circular microphone array:

When two or more sound sources are present at the same time, the SRP-PHAT peak of one source mixes with that of another, false peaks can appear at some points, and the local maxima are difficult to find.

Using the approximate W-disjoint orthogonality of speech signals, the relative delay of each source signal at the microphone array is estimated in the time-frequency domain, with the short-time Fourier transform serving as the approximately W-disjoint orthogonal transform.

Suppose the frequency-domain signal model of the i-th microphone is:

X_i[ω, τ] = S_n(ω, τ) e^{-jω Δτ_{n,i}} + N_i[ω, τ]

Given a window function W, the short-time Fourier transform of s_j is S_j:

S_j(t, ω) = F^{W}(s_j(\cdot))(t, ω) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} W(τ - t) s_j(τ) e^{-iωτ} \, dτ

With an appropriate window function and window size, and under the approximate W-disjoint orthogonality assumption, only one sound source is active at any time-frequency point, and the cross-spectrum is:

E[X_i[ω, τ] X_j^{*}[ω, τ]] = | S_n(ω, τ) |^2 e^{-jω(Δτ_{n,i} - Δτ_{n,j})}

The delay Δτ_{n,i} - Δτ_{n,j} between microphone i and microphone j can then be obtained from the cross-power spectrum.

Compared with the prior art, the beneficial effects of the present invention are as follows:

Theoretical analysis and simulation experiments show that, with the SRP-PHAT algorithm combined with approximate W-disjoint orthogonality on a circular array, DOA estimation of multiple sound sources has good separation performance in acoustic environments with strong noise and moderate reverberation, the true peaks stand out clearly, and the positioning accuracy is high.

1. For uniform circular arrays, existing research addresses mainly single-source localization, and multi-source localization with circular arrays has been studied relatively little. The circular array offers higher spatial resolution.

2. Under the approximate W-disjoint orthogonality assumption, the SRP-PHAT algorithm gives DOA estimation of multiple sound sources good separation performance in acoustic environments with strong noise and moderate reverberation; the true peaks stand out clearly and the positioning accuracy is high.

3. The method effectively solves the false spectral peak problem and can resolve 3 signal sources.

4. The method is suited to localization under moderate reverberation.

Brief description of the drawings

The accompanying drawings provide a further understanding of the present invention and, together with the embodiments, serve to explain the invention; they do not limit it. In the drawings:

Fig. 1 is a geometric diagram of the uniform circular array;

Fig. 2 shows the beamforming principle of the uniform circular array;

Fig. 3 shows the WDO ratio (80%) for 3 sound sources;

Fig. 4 shows the WDO ratio (90%) for 3 sound sources;

Fig. 5 is the time-frequency analysis |S_1^W(t, ω)| of sound source s_1(t);

Fig. 6 is the time-frequency analysis |S_2^W(t, ω)| of sound source s_2(t);

Fig. 7 is the time-frequency analysis |S_1^W(t, ω) S_2^W(t, ω)|;

Fig. 8 is a block diagram of the method;

Fig. 9 shows the uniform circular array;

Fig. 10 is a two-dimensional localization image of two sound sources at an SNR of 20 dB;

Fig. 11 is a two-dimensional localization image of two sound sources at an SNR of 30 dB;

Fig. 12 shows the azimuths of two sound sources measured with the circular array at an SNR of 20 dB;

Fig. 13 shows the azimuths of two sound sources measured with the circular array at an SNR of 30 dB;

Fig. 14 is a three-dimensional plot of the measured azimuths of two sound sources at an SNR of 30 dB;

Fig. 15 is a two-dimensional localization image of three sound sources at an SNR of 30 dB;

Fig. 16 shows the azimuths of three sound sources measured with the improved method for the circular array at an SNR of 30 dB;

Fig. 17 shows the azimuths of three sound sources measured with the traditional method for the circular array at an SNR of 30 dB;

Fig. 18 shows the signal waveforms received by the 8-element microphone array;

Fig. 19 shows the curves of SNR versus angle error for different T60 values.

Embodiments

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. The preferred embodiments described here serve only to describe and explain the invention and are not intended to limit it.

Step 1: localization model and uniform circular array beamforming

A uniform circular array determines a spatial coordinate system. As shown in Fig. 1, isotropic microphones are evenly distributed on a circle of radius R in the x-y plane. Polar coordinates represent the direction of arrival of the plane wave s; the origin of the coordinate system is at the center of the circular array and serves as the reference point of the system. The pitch angle of the signal is θ ∈ [0, π/2] and the azimuth is φ ∈ [0, 2π]. Here r is the distance from the sound source to the center of the circular array, and r_i is the distance from the sound source to microphone m_i.

Suppose the acoustic signal is:

s(r, t) = e^{j ω_0 t} \qquad (1)

where ω_0 is the angular frequency of the sound-source signal, C is the wave speed (C = 384 m/s), and f is the frequency of the sound source in Hz.

The signal received by the i-th microphone is

f_i(r, t) = s(t - Δτ_i) \qquad (2)

The geometry is shown in Fig. 1, where:

r_i is the distance from the i-th microphone to the source;

r is the distance from the source to the center of the circular microphone array;

R is the radius of the circular array;

θ is the pitch angle of the sound source;

φ is the azimuth of the sound source;

i = 0, 1, 2, ..., N-1 indexes the azimuth of the i-th microphone.

The delay of each microphone before summation is then given by (4), where C is the wave speed, C = 384 m/s.

As shown in Fig. 2, the delay-and-sum beamformer sums the time-shifted signals captured by all microphones. Superposing the contributions of each far-field source point gives the far-field pattern function of the circular array:

y(t) = \frac{1}{N} \sum_{i=1}^{N} s(r_i, t - Δτ_i) = \frac{1}{N} \sum_{i=1}^{N} e^{j ω_0 (t - Δτ_i)} = e^{j ω_0 t} \frac{1}{N} \sum_{i=1}^{N} e^{-j ω_0 Δτ_i} \qquad (5)

Substituting (4) into (5) gives (6), in which the unit wave-number vector of the sound source appears and the superscript T denotes vector transposition.

Δτ_i = τ_i - τ_0, where τ_0 is the reference delay, taken as the minimum of all microphone delays.
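The delay-and-sum idea in (5) can be sketched numerically. The block below is a minimal illustration under stated assumptions, not the patent's own formula: it uses the standard far-field plane-wave delay for a circular array, tau_i = -(R / C) sin(θ) cos(φ - φ_i), with the microphone azimuths assumed to be φ_i = 2πi/N and θ measured from the z-axis. The function names (`mic_azimuths`, `relative_delays`, `array_gain`) are hypothetical.

```python
import numpy as np

C = 384.0  # wave speed in m/s, the value quoted in the text


def mic_azimuths(n_mics):
    """Azimuths of n_mics microphones spaced evenly on a circle (assumed 2*pi*i/N)."""
    return 2.0 * np.pi * np.arange(n_mics) / n_mics


def relative_delays(theta, phi, radius, n_mics, c=C):
    """Far-field plane-wave delays for direction (theta, phi).

    Assumed model: tau_i = -(R/c) * sin(theta) * cos(phi - phi_i).
    The minimum delay is subtracted, so the earliest microphone has delay 0.
    """
    tau = -(radius / c) * np.sin(theta) * np.cos(phi - mic_azimuths(n_mics))
    return tau - tau.min()


def array_gain(theta_s, phi_s, theta, phi, radius, n_mics, f):
    """Magnitude of the delay-and-sum output for a tone of frequency f (Hz)
    arriving from (theta_s, phi_s) when the array is steered to (theta, phi)."""
    w0 = 2.0 * np.pi * f
    d_src = relative_delays(theta_s, phi_s, radius, n_mics)
    d_steer = relative_delays(theta, phi, radius, n_mics)
    # Steering compensates the assumed delays; residual phase errors reduce the mean.
    return np.abs(np.mean(np.exp(-1j * w0 * (d_src - d_steer))))
```

When the steering direction matches the source direction, every term in the mean has zero phase and the gain is 1; any other steering direction leaves residual phases and a smaller gain, which is what the beam scan exploits.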

Step 2: approximate W-disjoint orthogonality assumption

The masking effect of the human ear is usually divided into frequency masking and temporal masking. Time-frequency masking methods assume that the source signals are sparse and separable, that is, that they satisfy W-disjoint orthogonality.

Suppose the signal x(t) consists of N source signals and can be written as

x(t) = \sum_{j=1}^{N} s_j(t) \qquad (7)

Suppose there is a linear transform T that maps s_j to S_j and has the following properties:

(1) T is invertible, that is, T^{-1}(T s) = T(T^{-1} s) = s;

(2) for j ≠ k, Λ_j ∩ Λ_k = ∅, where Λ_j is the support of S_j, Λ_j = supp S_j := {λ : S_j(λ) ≠ 0}; that is, the intersection of the sets Λ_j and Λ_k is empty.

If conditions (1) and (2) hold, the mixed signals in the set S can all be separated effectively.

Given a window function, if

S_j(t, ω) S_k(t, ω) = 0, \quad \forall t, ω \qquad (8)

then the two sound sources S_j and S_k are said to satisfy W-disjoint orthogonality.

However, the W-disjoint orthogonality assumption does not hold for the signals studied here: the product in expression (8) is rarely exactly zero.

For this reason, two important performance criteria are introduced: (1) how well the mask preserves the source of interest; (2) how well the mask suppresses the interfering sources. The sources are then required to satisfy the condition only approximately:

S_j(t, ω) S_k(t, ω) \approx 0, \quad \forall t, ω \qquad (9)

The time-frequency mask is defined as

M_j(t, ω) = \begin{cases} 1, & S_j(t, ω) \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (10)

By estimating the time-frequency mask of each source, a given source j can be recovered from the mixture:

S_j(t, ω) = M_j(t, ω) X(t, ω), \quad \forall t, ω \qquad (11)

For a given time-frequency mask M, the preserved-signal ratio PSR_M is defined as

\mathrm{PSR}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2}{\| S_j(t, ω) \|^2} \qquad (12)

PSR_M measures the percentage of the energy of source S_j that remains after the mask is applied.

Also define

z_j(t) = \sum_{k = 1, k \neq j}^{N} s_k(t) \qquad (13)

where z_j(t) is the sum of the interfering sources that are active alongside source s_j.

The signal-to-interference ratio after applying the time-frequency mask M is defined as

\mathrm{SIR}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2}{\| M(t, ω) Z_j(t, ω) \|^2} \qquad (14)

where SIR_M measures the signal-to-interference ratio after the signal is separated with the time-frequency mask M.

The approximate W-disjoint orthogonality WDO_M can be evaluated from PSR_M and SIR_M:

\mathrm{WDO}_M = \frac{\| M(t, ω) S_j(t, ω) \|^2 - \| M(t, ω) Z_j(t, ω) \|^2}{\| S_j(t, ω) \|^2} \qquad (15)

where Z_j(t, ω) is the time-frequency representation of z_j(t). Because speech signals have a sparse time-frequency representation, a small part of the representation accounts for the vast majority of the total power, and the magnitude of the product of the time-frequency representations of two sources is usually small, so the weak W-disjoint orthogonality condition is satisfied. The higher the degree of approximate W-disjoint orthogonality, the better the separation. To obtain a good time-frequency masking effect, the choice of window type and size is critical. In particular, W-disjoint orthogonality holds exactly when WDO_M = 1.
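The measures in (12), (14), and (15) are simple energy ratios, so they can be computed directly from spectrograms. The sketch below assumes the target and the summed interference are available separately as arrays of equal shape, which is the case in simulation only. The helper names `wdo_metrics` and `dominance_mask` are hypothetical, and the dominance rule (keep a point when the target is stronger than the interference) is one illustrative mask choice, not necessarily the one used in the patent.

```python
import numpy as np


def wdo_metrics(S_target, S_interf, mask):
    """Return (PSR, SIR, WDO) for a target spectrogram, the spectrogram of the
    summed interference, and a 0/1 time-frequency mask of the same shape."""
    e_target = np.sum(np.abs(S_target) ** 2)        # ||S_j||^2
    e_kept = np.sum(np.abs(mask * S_target) ** 2)   # ||M S_j||^2
    e_leak = np.sum(np.abs(mask * S_interf) ** 2)   # ||M Z_j||^2
    psr = e_kept / e_target
    sir = e_kept / e_leak if e_leak > 0 else np.inf
    wdo = (e_kept - e_leak) / e_target
    return psr, sir, wdo


def dominance_mask(S_target, S_interf):
    """Illustrative binary mask: 1 where the target magnitude exceeds the interference."""
    return (np.abs(S_target) > np.abs(S_interf)).astype(float)
```

For two spectrograms with disjoint supports the mask keeps all target energy and no interference, so PSR = 1, SIR is infinite, and WDO = 1; any overlap lowers WDO below 1.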

According to the experiments of Scott Rickard (Scott Rickard, Radu Balan and Justinian Rosca. Real-time time-frequency based blind source separation. Proceedings ICA2001, pp. 651-656, December 2001), the WDO ratio at 0 dB for different numbers of sound sources N is as follows:

N	2	3	4	5	6	7	8	9	10
WDO (%)	93.6	88.0	83.4	79.2	75.6	72.3	69.3	66.6	64

The case of 3 sound sources was verified by simulation, as shown in Fig. 3 and Fig. 4. The horizontal axis is the WDO value and the vertical axis is the number of speech samples. With 3 sound sources, more than 80% of the signal is orthogonal.

The approximate orthogonality condition was also verified for 2 sound sources, as shown in Fig. 5, Fig. 6, and Fig. 7. Time-frequency analysis is applied to the signals s_1(t) and s_2(t) separately, giving |S_1^W(t, ω)| and |S_2^W(t, ω)|, and to their product |S_1^W(t, ω) S_2^W(t, ω)|; the horizontal axis is time and the vertical axis is frequency. The window function W(t) is a Hamming window with a length of 64 ms. Figs. 5 to 7 show that the product contains very few components shared by the two sources, which indicates that the source signals satisfy approximate W-disjoint orthogonality.

Step 3: SRP-PHAT localization of multiple sound sources with a circular array, combined with approximate W-disjoint orthogonality

The SRP-PHAT algorithm calculates the phase-transform steered response power function over all microphone pairs to obtain an objective function. An optimal beamformer is designed and its beam is steered over all possible receiving directions; the direction with the largest beam output power is taken as the direction of the sound source.

1. SRP-PHAT algorithm for two microphones

For an array with only two microphones, m_i and m_j, a signal arriving from azimuth φ and pitch angle θ reaches the two microphones with a delay difference Δτ_{ij}(θ, φ). This TDOA can be estimated with the generalized cross-correlation (GCC):

Δτ_{ij}(θ, φ) = \arg\max_{τ} P(r) = \arg\max_{τ} R_{s_i s_j}(Δτ_{ij}(θ, φ)) \qquad (16)

where P(r) is the spatial likelihood function of the three-dimensional position vector r, obtained by evaluating all possible θ and φ. In the frequency domain, the generalized cross-correlation function R_{s_i s_j}(Δτ_{ij}(θ, φ)) can be written as:

R_{s_i s_j}(Δτ_{ij}(θ, φ)) = \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω Δτ_{ij}(θ, φ)} \, dω \qquad (17)

where Ψ_{ij}(ω) is the weighting function and S_i(ω) S_j^{*}(ω) is the cross-spectral density function.

The phase transform (PHAT) is a typical weighting method. The phase weighting function is defined as:

Ψ_{ij}(ω) = \frac{1}{| S_i(ω) S_j^{*}(ω) |} \qquad (18)

With a suitable weighting function, the delay-and-sum steered response power satisfies the SNR-optimization criterion, and the generalized cross-correlation R_{s_i s_j}(Δτ_{ij}(θ, φ)) shows a single peak within the limited range of τ, which corresponds to the TDOA between microphones m_i and m_j. The algorithm offers a degree of noise immunity, reverberation resistance, and robustness in sound source localization.
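The two-microphone case can be illustrated with a discrete GCC-PHAT estimator. The block below is a minimal sketch, assuming sampled signals and an FFT in place of the integral in (17); the function name `gcc_phat` and the parameters `fs` (sampling rate) and `max_tau` (largest physically possible delay) are illustrative and not taken from the patent.

```python
import numpy as np


def gcc_phat(x_i, x_j, fs, max_tau=None):
    """Estimate the delay of x_i relative to x_j (in seconds) with GCC-PHAT.

    A positive result means x_i lags x_j.
    """
    n = len(x_i) + len(x_j)                       # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)                    # cross-spectrum S_i(w) S_j*(w)
    cross = cross / np.maximum(np.abs(cross), 1e-12)  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)                 # generalized cross-correlation vs. lag
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

Restricting the search with `max_tau` (for example, the array diameter divided by the wave speed) corresponds to the limited range of τ mentioned above and removes peaks at delays that the geometry cannot produce.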

2. Circular-array SRP-PHAT algorithm

The generalized cross-correlations of all microphone pairs are summed:

P(Δτ_1, Δτ_2, \ldots, Δτ_N) = \sum_{i=1}^{N} \sum_{j=1}^{N} R_{s_i s_j}(Δτ_{ij}(θ, φ))

= \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω(Δτ_i - Δτ_j)} \, dω \qquad (19)

As the number of microphones increases, the two-microphone SRP-PHAT method extends naturally to the circular-array SRP-PHAT method.
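The pairwise sum in (19) can be sketched as a scan over candidate azimuths. The block below is an illustration under stated assumptions rather than the patent's implementation: sources are in the far field and in the array plane (pitch angle fixed), each pair i < j is counted once, and the steering delay of microphone i for a candidate direction u is taken as -(u · p_i)/c, where p_i is the microphone position. The function name `srp_phat_azimuth` and its parameters are hypothetical.

```python
import numpy as np


def srp_phat_azimuth(frames, mic_xy, fs, c=384.0, n_az=360):
    """Scan azimuth with SRP-PHAT for sources assumed to lie in the array plane.

    frames: (n_mics, n_samples) time-domain signals.
    mic_xy: (n_mics, 2) microphone coordinates in metres.
    Returns (azimuth grid in radians, steered response power per azimuth).
    """
    n_mics, n = frames.shape
    X = np.fft.rfft(frames, axis=1)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    az = 2.0 * np.pi * np.arange(n_az) / n_az
    u = np.stack((np.cos(az), np.sin(az)), axis=1)   # unit vectors toward candidates
    tau = -(u @ mic_xy.T) / c                        # (n_az, n_mics) steering delays
    power = np.zeros(n_az)
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = X[i] * np.conj(X[j])
            cross = cross / np.maximum(np.abs(cross), 1e-12)   # PHAT weighting
            # Steering term e^{j w (tau_i - tau_j)} for every candidate azimuth.
            steer = np.exp(1j * np.outer(tau[:, i] - tau[:, j], omega))
            power += np.real(steer @ cross)
    return az, power
```

The azimuth estimate is the grid point with the largest power, `az[np.argmax(power)]`. Extending the scan to pitch angle adds a second grid dimension and a sin(θ) factor in the assumed delay model.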

3. Multi-source circular-array SRP-PHAT algorithm

When two or more sound sources are present at the same time, the SRP-PHAT peak of one source mixes with that of another, false peaks can appear at some points, and the local maxima are difficult to find.

Using the approximate W-disjoint orthogonality of speech signals described above, the relative delay of each source signal at the microphone array is estimated in the time-frequency domain.

The short-time Fourier transform serves as the approximately W-disjoint orthogonal transform. The frequency-domain signal model of the i-th microphone is:

X_i[ω, τ] = S_n(ω, τ) e^{-jω Δτ_{n,i}} + N_i[ω, τ] \qquad (20)

Given a window function W, the short-time Fourier transform of s_j is S_j:

S_j(t, ω) = F^{W}(s_j(\cdot))(t, ω) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} W(τ - t) s_j(τ) e^{-iωτ} \, dτ \qquad (21)

With an appropriate window function and window size, and under the approximate W-disjoint orthogonality assumption, only one sound source is active at any time-frequency point. The cross-spectrum is:

E[X_i[ω, τ] X_j^{*}[ω, τ]] = | S_n(ω, τ) |^2 e^{-jω(Δτ_{n,i} - Δτ_{n,j})} \qquad (22)

The delay Δτ_{n,i} - Δτ_{n,j} between microphones i and j can then be obtained from the cross-power spectrum.
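One way to read the cross-spectrum relation is that the phase of X_i X_j^* at each time-frequency point encodes the pairwise delay of whichever source is active there. The block below sketches that reading; it is an assumption-laden illustration, not the estimator specified by the patent. It ignores noise and is only valid where |ω Δτ| < π, so that the phase does not wrap. The function name `per_bin_delay` is hypothetical.

```python
import numpy as np


def per_bin_delay(X_i, X_j, omega):
    """Delay tau_i - tau_j at each time-frequency point from the cross-spectrum phase.

    X_i, X_j: (n_freq, n_frames) STFTs of two microphones.
    omega:    (n_freq,) angular frequencies in rad/s, all strictly positive.
    Valid only where |omega * delay| < pi (no phase wrapping).
    """
    cross = X_i * np.conj(X_j)                 # |S_n|^2 * exp(-j w (tau_i - tau_j))
    return -np.angle(cross) / omega[:, None]
```

With several sources that occupy different time-frequency points, the per-point delays cluster around one value per source, so a histogram of the returned array shows one mode per source for that microphone pair.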

Embodiment 1: localization of two sound sources

1. Selection of the uniform circular array localization model

The simulation experiments are carried out under different SNRs and reverberation conditions. The uniform circular array is placed in a room of 7 m × 8 m × 3.5 m, and the spatial positions of its 8 microphones are [3.25, -1.6, 1.5], [3.25, 1.1, 1.5], [1.87, 3.75, 1.5], [1.0, 3.75, 1.5], [3.25, 1.8, 1.5], [3.25, -1.0, 1.5], [2.2, -3.75, 1.5], and [0.6, -3.75, 1.5].

2. Selection of the sound sources

The sound sources are randomly generated speech signals with an SNR of 0-30 dB. The random interference is a Gaussian signal that simulates an air conditioner, an electric fan, and noise from outside the window; the noise power can reach up to 10 dB. The reverberation time is determined by the reflection coefficients of the walls, floor, and ceiling of the room.

3. Short-time Fourier transform (STFT) of the signals received by the array

Given a window function W, the short-time Fourier transform of s_j is S_j:

S_j(t, ω) = F^{W}(s_j(\cdot))(t, ω) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{\infty} W(τ - t) s_j(τ) e^{-iωτ} \, dτ \qquad (21)

To obtain a good time-frequency masking effect, the choice of window type and size is critical. Here a Hamming window with a size of 1024 points is used.
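A discrete version of this transform can be sketched with NumPy. The window type (Hamming) and length (1024 points) follow the text; the hop size of 512 samples is an assumption, since the overlap is not stated, and the function name `stft_hamming` is hypothetical. The input is assumed to be at least one window long.

```python
import numpy as np


def stft_hamming(x, win_len=1024, hop=512):
    """Short-time Fourier transform with a Hamming window.

    Returns a complex array of shape (win_len // 2 + 1, n_frames).
    hop=512 (50% overlap) is an assumed value, not one given in the text.
    """
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [x[k * hop:k * hop + win_len] * window for k in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)
```

Each column of the result is one time frame; applying the function to every microphone channel gives the time-frequency representations used in the later steps.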

4. Generalized cross-correlation with phase transform

With a suitable window function, good separation is achieved and approximate W-disjoint orthogonality is satisfied. The generalized cross-correlation can then be computed on this basis.

In the frequency domain, the generalized cross-correlation function R_{s_i s_j}(Δτ_{ij}(θ, φ)) can be written as:

R_{s_i s_j}(Δτ_{ij}(θ, φ)) = \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω Δτ_{ij}(θ, φ)} \, dω \qquad (17)

where Ψ_{ij}(ω) is the weighting function:

Ψ_{ij}(ω) = \frac{1}{| S_i(ω) S_j^{*}(ω) |} \qquad (18)

The generalized cross-correlations of all microphone pairs are summed:

P(Δτ_1, Δτ_2, \ldots, Δτ_N) = \sum_{i=1}^{N} \sum_{j=1}^{N} R_{s_i s_j}(Δτ_{ij}(θ, φ))

= \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{-π}^{π} Ψ_{ij}(ω) S_i(ω) S_j^{*}(ω) e^{jω(Δτ_i - Δτ_j)} \, dω \qquad (19)

Once the maximum of P(Δτ_1, Δτ_2, ..., Δτ_N) is found, the pitch angle θ and azimuth φ of the sound source can be determined.

5. Results of the above steps

Figs. 10 and 11 show the source wave-field images of the circular array at SNRs of 20 dB and 30 dB, respectively. In the figures, the microphone positions are marked, ○ denotes the estimated sound sources, and * denotes the interference positions.

In Fig. 10 the two sound sources are located at [0.59, 2.08, 1.5] and [0.29, -1.37, 1.5], and the SNR is 20 dB. The random interference is a Gaussian signal that simulates an air conditioner, an electric fan, and noise from outside the window; the interference sources are located at [2, -4, 1.5] and [3.5, -3.2, 1.5], and the noise power can reach up to 10 dB. The reverberation time is determined by the reflection coefficients of the walls, floor, and ceiling of the room.

In Fig. 11 the two sound sources are located at [1.5, 2.1, 1.5] and [2.1, 0.8, 1.5], and the SNR is 30 dB. The interference that simulates the air conditioner, electric fan, and outside noise is located far from the two sound sources.

The direction is estimated with the SRP-PHAT algorithm combined with approximate W-disjoint orthogonality, using a Hamming window of 1024 points. Under the same background noise, the higher the SNR of the signal, the higher the positioning accuracy.

Figs. 12 and 13 show the measured source azimuths. In Fig. 12 the azimuths are φ1 = 74° and φ2 = -78°. Although the two signals are close in azimuth and the SNR is low, the 2 sound sources can basically be resolved: spectral peaks appear at the true azimuths, no false spectral peaks appear, and the target azimuths are still estimated correctly. In Fig. 13 the measured azimuths are φ1 = 17° and φ2 = 52°. Although the two signals are fairly close in azimuth, the 2 sound sources are completely resolved because the SNR is high and the two angles differ considerably. As the SNR increases, the estimation error decreases and the estimation accuracy improves. The larger the angular difference between the two signals, the more accurate the estimate; once the difference is large enough, the estimation accuracy levels off.

Fig. 14 shows azimuth and pitch angles of (φ1 = 74°, θ1 = 46°) and (φ2 = -78°, θ2 = 0°).

2 three auditory localizations of embodiment

When sound source increases to 3, in the situation that signal to noise ratio (S/N ratio) is low, can not solve well the problem at false spectrum peak.Under high s/n ratio condition, substantially can solve the problem at false spectrum peak, many sound sources are had to good resolution characteristic.

Specific implementation step, with example 1, is omited herein.

Figure 15 shows that three auditory localization two-dimensional imaging figure, signal to noise ratio (S/N ratio) 30dB.

Shown in Figure 16, Figure 17, be respectively the angle, sound bearing that the method that proposes herein and traditional SRP-PHAT method record under the higher condition of signal to noise ratio (S/N ratio).SRP-PHAT method based on approximate W-separated quadrature can solve the problem at false spectrum peak effectively, 3 signal sources can be differentiated and opened, and traditional SRP-PHAT method there will be false spectrum peak, 3 useful signals of indistinguishable.

Figure 18 shows the sound-source signals received by the 8-element microphone array; it can be seen that the interference source has a larger effect on microphone No. 7, which is close to it. Figure 19 shows the curves of azimuth error versus signal-to-noise ratio for different reverberation times T60 (RT60), with T60 set to 300 ms, 450 ms and 600 ms. As T60 increases, the estimation error grows and the estimation accuracy falls. Under strong reverberation the target bearing is difficult to resolve, so the method is suited to localization under moderate reverberation.

The simulation results show that the SRP-PHAT algorithm with a uniform circular array has good localization performance, particularly when the SNR is high and the reverberation is moderate.

The W-separated orthogonality assumption, which rests on the sparsity of speech signals, does not hold for multiple sound sources. To address this, the present invention introduces two key properties, the signal retention rate after time-frequency masking and the signal-to-noise ratio after time-frequency masking, and derives an approximate W-separated orthogonality condition. The multi-source signal is divided into non-overlapping sets of time-frequency points, each containing the time-frequency components of a single source only, and the relative delay of each source signal at the microphone array is estimated in the time-frequency domain. A circular array, which offers higher spatial resolution, is used so that the azimuth and pitch angle of multiple source signals are estimated simultaneously with high resolution. This achieves spatial localization of the sound sources and overcomes the inability of existing sound-localization methods to localize several overlapping sources effectively in three dimensions.

Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the invention has been described in detail with reference to the embodiments, a person skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An SRP-PHAT multi-source spatial positioning method, characterized in that it comprises the following steps:

1) Establishing the spatial coordinates under the following assumptions: the number and spatial positions of all microphones in the uniform circular microphone array remain unchanged during data acquisition; the distance between the sound source and the microphones meets the requirements of the sound-field model; and all microphones have the same physical properties. The isotropic microphones are evenly distributed on a circle of radius r in the x-y plane; the arrival direction of the plane wave s is expressed in polar coordinates; the origin of the coordinate system is at the center of the circular array; the pitch angle of the signal is θ ∈ [0, π/2] and the azimuth is φ ∈ [0, 2π];

2) Dividing the multi-source signal into non-overlapping sets of time-frequency points so that each time-frequency window contains only one active source signal, which satisfies the weak W-separated orthogonality condition; a Hamming window is selected, and W-separated orthogonality is satisfied when WDO_M = 1;

3) Calculating, by the SRP-PHAT algorithm, the phase-transform controllable response power function over all microphone pairs to obtain an objective function; the steered beam of the beamformer scans all possible receiving directions, and the direction with the maximum beam output power is taken as the direction of the sound source.
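The array geometry assumed in step 1) can be illustrated with a minimal sketch. The function names, the sound speed of 343 m/s, and the convention that the pitch angle is measured upward from the x-y plane are assumptions made for illustration; the claim itself does not fix them.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def mic_positions(num_mics, radius):
    """Positions of microphones evenly spaced on a circle in the x-y plane."""
    return [
        (radius * math.cos(2 * math.pi * m / num_mics),
         radius * math.sin(2 * math.pi * m / num_mics),
         0.0)
        for m in range(num_mics)
    ]


def plane_wave_delays(positions, pitch, azimuth, c=SPEED_OF_SOUND):
    """Arrival delay (seconds) at each microphone relative to the array center.

    Assumed convention: pitch is measured from the x-y plane, azimuth from
    the x axis, both in radians.
    """
    # Unit vector pointing from the origin toward the far-field source.
    u = (math.cos(pitch) * math.cos(azimuth),
         math.cos(pitch) * math.sin(azimuth),
         math.sin(pitch))
    # A microphone displaced toward the source receives the wavefront earlier,
    # hence the negative sign.
    return [-(p[0] * u[0] + p[1] * u[1] + p[2] * u[2]) / c for p in positions]
```

Under these assumptions, a source on the x axis reaches the microphone at (r, 0, 0) earlier than the array center by r/c, and a source directly above the array reaches all microphones at the same time.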

2. The SRP-PHAT multi-source spatial positioning method according to claim 1, characterized in that said step 2) comprises:

First, two important characteristic criteria are introduced: (1) how much masking preserves the sound source of interest; (2) how much masking suppresses interfering sound sources;

The multi-source signal is divided into a set of non-overlapping time-frequency points so that each time-frequency window contains only one active source signal, approximately satisfying

S_j(t, \omega)\, S_k(t, \omega) \approx 0, \quad \forall\, t, \omega

Define the time-frequency mask as

M_j(t, \omega) = \begin{cases} 1, & S_j(t, \omega) \neq 0 \\ 0, & \text{otherwise} \end{cases}

By estimating the time-frequency mask corresponding to each source, a given source j can be recovered from the mixture of sources:

S_j(t, \omega) = M_j(t, \omega)\, X(t, \omega), \quad \forall\, t, \omega

where M_j is the indicator function of the support set of source j, and S_j(t, ω) and X(t, ω) are the time-frequency representations of s_j(t) and x(t), respectively.

For a given time-frequency mask M, the preserved-signal ratio PSR_M is defined as:

\mathrm{PSR}_M = \frac{\left\| M(t, \omega)\, S_j(t, \omega) \right\|^2}{\left\| S_j(t, \omega) \right\|^2}

PSR_M is an estimate of the fraction of the energy of source S_j that is retained after the mask is applied;

At the same time, define

z_j(t) = \sum_{\substack{k = 1 \\ k \neq j}}^{N} s_k(t)

where z_j(t) is the sum of all sources that interfere with source s_j;

Define the signal-to-interference ratio after applying the time-frequency mask M as:

\mathrm{SIR}_M = \frac{\left\| M(t, \omega)\, S_j(t, \omega) \right\|^2}{\left\| M(t, \omega)\, Z_j(t, \omega) \right\|^2}

where Z_j(t, ω) is the time-frequency representation of z_j(t), and SIR_M estimates the signal-to-interference ratio obtained when the time-frequency mask M is applied to separate the signals;

The approximate W-separated orthogonality WDO_M can be estimated from PSR_M and SIR_M:

\mathrm{WDO}_M = \frac{\left\| M(t, \omega)\, S_j(t, \omega) \right\|^2 - \left\| M(t, \omega)\, Z_j(t, \omega) \right\|^2}{\left\| S_j(t, \omega) \right\|^2} = \mathrm{PSR}_M - \frac{\mathrm{PSR}_M}{\mathrm{SIR}_M}

Since a speech signal has a sparse time-frequency representation, a small number of its time-frequency points carry a large proportion of the total power, and the magnitude of the product of the time-frequency representations of two different sources is usually small, so the weak W-separated orthogonality condition is satisfied; in particular, W-separated orthogonality is satisfied when WDO_M = 1.
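The three measures of claim 2 can be checked numerically with a small sketch. It assumes that the time-frequency points have been flattened into plain lists and that the interference values Z_j are known, which holds only in simulation; the function names are hypothetical.

```python
def energy(values):
    """Squared norm of a sequence of (possibly complex) time-frequency values."""
    return sum(abs(v) ** 2 for v in values)


def wdo_measures(mask, target, interference):
    """Return (PSR_M, SIR_M, WDO_M) for a mask applied to one target source.

    mask, target and interference are equally long sequences holding M, S_j
    and Z_j at the same time-frequency points.
    """
    masked_target = energy(m * s for m, s in zip(mask, target))
    masked_interf = energy(m * z for m, z in zip(mask, interference))
    target_energy = energy(target)
    psr = masked_target / target_energy
    # With no interference left under the mask, the ratio is unbounded.
    sir = masked_target / masked_interf if masked_interf > 0 else float("inf")
    wdo = (masked_target - masked_interf) / target_energy
    return psr, sir, wdo


# Mask chosen as the indicator of the target's support.
mask = [1, 0, 1, 0]
target = [2.0, 0.0, 1.0, 0.0]
disjoint = [0.0, 3.0, 0.0, 1.0]      # never overlaps the target
overlapping = [0.0, 3.0, 1.0, 1.0]   # overlaps the target at one point

print(wdo_measures(mask, target, disjoint))
print(wdo_measures(mask, target, overlapping))
```

In the disjoint case nothing leaks through the mask, so WDO_M equals 1; in the overlapping case some interference energy survives the mask and WDO_M falls below 1.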

3. The SRP-PHAT multi-source spatial positioning method according to claim 1, characterized in that, in said step 3), the SRP-PHAT algorithm for two microphones is as follows:

For an array with only two microphones, m_i and m_j, a signal arriving from pitch angle θ and azimuth φ reaches the two microphones with a relative delay Δτ_ij(θ, φ). This TDOA can be estimated by generalized cross-correlation (GCC), expressed as:

\Delta\tau_{ij}(\theta, \phi) = \underset{\tau}{\arg\max}\; P(r) = \underset{\tau}{\arg\max}\; R_{s_i, s_j}\big(\Delta\tau_{ij}(\theta, \phi)\big)

where P(r) is the spatial likelihood function of the three-dimensional position vector r, obtained by evaluating all possible θ and φ, and the generalized cross-correlation function R_{s_i,s_j}(Δτ_ij(θ, φ)) can be expressed in the frequency domain as:

R_{s_i, s_j}\big(\Delta\tau_{ij}(\theta, \phi)\big) = \int_{-\pi}^{\pi} \Psi_{ij}(\omega)\, S_i(\omega)\, S_j^{*}(\omega)\, e^{j\omega \Delta\tau_{ij}(\theta, \phi)}\, d\omega

where Ψ_ij(ω) is the weighting function and S_i(ω) S_j^*(ω) is the cross-power spectral density function;

The phase transformation (PHAT) method is a typical transformation method.

Define the phase weighting function as:

\Psi_{ij}(\omega) = \frac{1}{\left| S_i(\omega)\, S_j^{*}(\omega) \right|}

By selecting an appropriate weighting function, the delay-and-sum controllable response power satisfies the optimal SNR criterion, and the generalized cross-correlation R_{s_i,s_j}(Δτ_ij(θ, φ)) exhibits a peak within the limited range of τ, corresponding to the TDOA of the signal propagating to microphones m_i and m_j.
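A minimal discrete sketch of the two-microphone GCC-PHAT estimate follows. It makes several simplifying assumptions that the claim does not state: the integral is replaced by a sum over DFT bins, the lag is restricted to whole samples, the correlation is circular, and a direct O(n²) DFT is used to keep the example free of external libraries. The function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(n^2), adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_phat_delay(sig_i, sig_j, max_lag):
    """Lag in samples by which sig_i is delayed relative to sig_j."""
    n = len(sig_i)
    spec_i, spec_j = dft(sig_i), dft(sig_j)
    weighted = []
    for a, b in zip(spec_i, spec_j):
        cross = a * b.conjugate()          # cross-power spectrum S_i S_j*
        mag = abs(cross)
        # PHAT weighting 1/|S_i S_j*| keeps only the phase of each bin.
        weighted.append(cross / mag if mag > 1e-12 else 0j)
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Inverse DFT evaluated only at the candidate lag.
        val = sum(weighted[k] * cmath.exp(2j * math.pi * k * lag / n)
                  for k in range(n)).real / n
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

Because the PHAT weighting discards the magnitude of every bin, the correlation of a purely delayed copy collapses to a single sharp peak at the delay, which is the peak the claim refers to.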

4. The SRP-PHAT multi-source spatial positioning method according to claim 1, characterized in that, in said step 3), the SRP-PHAT algorithm for a sound source received by the circular microphone array is as follows:

Summing the generalized cross-correlations over all microphone pairs gives:

\begin{aligned} P(\Delta\tau_1, \Delta\tau_2, \cdots, \Delta\tau_N) &= \sum_{i=1}^{N} \sum_{j=1}^{N} R_{s_i, s_j}\big(\Delta\tau_{ij}(\theta, \phi)\big) \\ &= \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{-\pi}^{\pi} \Psi_{ij}(\omega)\, S_i(\omega)\, S_j^{*}(\omega)\, e^{j\omega (\Delta\tau_i - \Delta\tau_j)}\, d\omega \end{aligned}

where Δτ_1, Δτ_2, ..., Δτ_N are the controllable delays of the N microphones, Δτ_i = τ_i - τ_0 for i = 1, ..., N, and τ_0 is the reference delay estimate, taken as the smallest of all microphone delays.
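The pairwise sum above can be turned into a direction scan. The sketch below is illustrative only and assumes a source in the array plane (pitch 0), a discrete set of angular frequencies in place of the integral, delays referenced to the array center rather than to the smallest microphone delay (only delay differences enter the sum), and a sound speed of 343 m/s. All function names are hypothetical.

```python
import cmath
import math


def circle_positions(num_mics, radius):
    """(x, y) of microphones evenly spaced on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * m / num_mics),
             radius * math.sin(2 * math.pi * m / num_mics))
            for m in range(num_mics)]


def steering_delays(positions, azimuth, c):
    """Far-field delays for a source in the array plane (pitch assumed 0)."""
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    return [-(x * ux + y * uy) / c for x, y in positions]


def srp_phat_scan(spectra, omegas, positions, candidates, c=343.0):
    """Return the candidate azimuth with the largest steered response power.

    spectra[m][k] is the complex spectrum of microphone m at omegas[k] (rad/s).
    """
    num_mics = len(spectra)
    best_az, best_power = None, -math.inf
    for az in candidates:
        tau = steering_delays(positions, az, c)
        power = 0.0
        for i in range(num_mics):
            for j in range(num_mics):
                for k, w in enumerate(omegas):
                    cross = spectra[i][k] * spectra[j][k].conjugate()
                    mag = abs(cross)
                    if mag > 1e-12:
                        # PHAT weight 1/|S_i S_j*| times the steering phase.
                        steer = cmath.exp(1j * w * (tau[i] - tau[j]))
                        power += (cross / mag * steer).real
        if power > best_power:
            best_az, best_power = az, power
    return best_az
```

When the candidate direction matches the true direction, every pair and every frequency contributes a phase of zero, so the summed power reaches its maximum there; this is the beam-scanning step described in claim 1.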

5. The SRP-PHAT multi-source spatial positioning method according to claim 1, characterized in that, in said step 3), the multi-source SRP-PHAT algorithm for the circular microphone array is as follows:

When two or more sound sources are active at the same time, the SRP-PHAT peak of one sound source mixes with that of another, false peaks appear at some points, and the local maximum peaks are difficult to find;

Using the approximate W-separated orthogonality of speech signals, the relative delay of each sound-source signal at the microphone array is estimated in the time-frequency domain, with the short-time Fourier transform used as the approximately W-separated orthogonal transform;

Suppose the frequency-domain representation of the signal model of the i-th microphone is:

X_i[\omega, \tau] = S_n(\omega, \tau)\, e^{-j\omega \Delta\tau_{n,i}} + N_i[\omega, \tau]

Given a window function W, the short-time Fourier transform S_j of s_j is

S_j(t, \omega) = F^{W}\big(s_j(\cdot)\big)(t, \omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} W(\tau - t)\, s_j(\tau)\, e^{-i\omega\tau}\, d\tau

By choosing an appropriate window function and window size, and under the assumption that the signals are approximately W-separated orthogonal, only one sound source n is active at any time-frequency point, so the cross-spectrum is:

E\big[ X_i[\omega, \tau]\, X_j^{*}[\omega, \tau] \big] = \left| S_n(\omega, \tau) \right|^2 e^{-j\omega (\Delta\tau_{n,i} - \Delta\tau_{n,j})}

The time delay Δτ_{n,i} - Δτ_{n,j} between microphone i and microphone j can then be obtained from the phase of the cross-power spectrum.
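The last step can be illustrated for a single time-frequency point. The sketch assumes a noise-free point dominated by one source and a delay small enough that ω times the delay stays within (-π, π), so the phase does not wrap; the function name is hypothetical.

```python
import cmath
import math


def pair_delay_from_bin(x_i, x_j, omega):
    """Estimate the delay difference between microphones i and j (seconds).

    x_i and x_j are the complex values of the two microphones at one
    time-frequency point; omega is the angular frequency in rad/s.
    """
    cross = x_i * x_j.conjugate()
    # The cross-spectrum phase equals -omega * (delay_i - delay_j).
    return -cmath.phase(cross) / omega


# Synthetic point: one source, delays of 0.1 ms and -0.2 ms at the two microphones.
omega = 2 * math.pi * 500.0
source_value = 0.7 + 0.2j
x_i = source_value * cmath.exp(-1j * omega * 1.0e-4)
x_j = source_value * cmath.exp(-1j * omega * -2.0e-4)
print(pair_delay_from_bin(x_i, x_j, omega))
```

With several sources, each time-frequency point yields the delay of whichever source dominates it, so the per-point estimates would be expected to cluster around one value per source rather than merging into a single blurred peak.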