Summary of the invention
The present invention overcomes the shortcomings of the prior art by providing an SRP-PHAT multi-source spatial localization method that can resolve several signal sources lying close together in direction under reverberation and noise, with good localization performance.
To solve the above technical problem, the present invention is realized by the following technical solution:
An SRP-PHAT multi-source spatial localization method, characterized in that it comprises the following steps:
1) Establishing the spatial coordinate system under the following assumptions: the number and spatial positions of all microphones of the uniform circular microphone array remain constant during data acquisition; the distance between the sound source and the microphones meets the requirements of the sound-field model; the physical properties of all microphones are identical; and the isotropic microphones are evenly distributed on a circle of radius r lying in the x-y plane. Polar coordinates are used to represent the direction of arrival of the plane wave s, the origin of the coordinate system is at the center of the circular array, the pitch angle of the signal is θ ∈ [0, π/2], and the azimuth angle is φ ∈ [0, 2π];
2) Dividing the multi-source signal into non-overlapping sets of time-frequency points, so that each time-frequency window contains only one active source signal and the weak W-disjoint orthogonality condition is satisfied; a Hamming window is chosen, and W-disjoint orthogonality holds exactly when $\mathrm{WDO}_M = 1$;
3) Using the SRP-PHAT algorithm, computing the phase-transform-weighted steered response power function over all microphone pairs to obtain an objective function; the steered beam of the beamformer scans all possible receiving directions, and the direction in which the beam output power is maximal gives the direction of the sound source.
Further, said step 2) comprises:
First, two important performance criteria are introduced: (1) to what extent the mask retains the source of interest; (2) to what extent the mask suppresses the interfering sources;
The multi-source signal is divided into non-overlapping sets of time-frequency points, with only one active source signal in each time-frequency window, so that approximately
$S_j(t,\omega)\,S_k(t,\omega) \approx 0, \quad \forall t, \omega,\ j \neq k$
The time-frequency mask is defined as
$M_j(t,\omega) = 1$ if $S_j(t,\omega) \neq 0$, and $M_j(t,\omega) = 0$ otherwise
By estimating the time-frequency mask corresponding to each source, a given source j can thus be obtained from the mixture:
$S_j(t,\omega) = M_j(t,\omega)\,X(t,\omega)$
where $M_j$ is the indicator function of the support of source j, and $S_j(t,\omega)$ and $X(t,\omega)$ are the time-frequency representations of $s_j(t)$ and $x(t)$, respectively;
For a given time-frequency mask M, the preserved-signal ratio $\mathrm{PSR}_M$ is defined as:
$\mathrm{PSR}_M = \|M S_j\|^2 / \|S_j\|^2$
$\mathrm{PSR}_M$ measures the percentage of the energy of source $S_j$ that is retained after masking;
Also define
$z_j(t) = \sum_{k \neq j} s_k(t)$
where $z_j(t)$ is the sum of the interfering sources active together with source $s_j$;
The signal-to-interference ratio after applying the time-frequency mask M is defined as:
$\mathrm{SIR}_M = \|M S_j\|^2 / \|M Z_j\|^2$
where $\mathrm{SIR}_M$ measures the signal-to-interference ratio of the separated signal after the mask M is applied;
From $\mathrm{PSR}_M$ and $\mathrm{SIR}_M$ the approximate W-disjoint orthogonality $\mathrm{WDO}_M$ can be estimated:
$\mathrm{WDO}_M = \mathrm{PSR}_M - \mathrm{PSR}_M / \mathrm{SIR}_M$
Because a speech signal has a sparse time-frequency representation, in which the dominant time-frequency components account for the vast majority of the total power, and the magnitude of the product of the time-frequency representations of different sources is usually small, the weak W-disjoint orthogonality condition is satisfied; in particular, W-disjoint orthogonality holds exactly when $\mathrm{WDO}_M = 1$.
Further, said step 3) comprises the SRP-PHAT algorithm for two microphones:
For an array consisting of only two microphones, $m_i$ and $m_j$, the delay between the arrivals at the two microphones of a signal coming from azimuth φ and pitch angle θ is $\Delta\tau_{ij}(\theta,\varphi)$. The TDOA can be estimated by generalized cross-correlation (GCC), expressed as:
where P(r) is the spatial likelihood function of the three-dimensional position vector r, obtained by evaluating all possible θ and φ. The generalized cross-correlation function $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ can be expressed in the frequency domain as:
$R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi)) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \psi_{ij}(\omega)\, S_i(\omega) S_j^*(\omega)\, e^{j\omega\,\Delta\tau_{ij}(\theta,\varphi)}\, d\omega$
where $\psi_{ij}(\omega)$ is the weighting function and $S_i(\omega) S_j^*(\omega)$ is the cross-spectral density function;
The phase transform (PHAT) is a typical weighting of this kind.
The phase weighting function is defined as:
$\psi_{ij}(\omega) = \dfrac{1}{|S_i(\omega) S_j^*(\omega)|}$
By selecting a suitable weighting function, the delay-and-sum steered response power can be made to satisfy the optimal signal-to-noise-ratio criterion. Within a limited range of τ, the generalized cross-correlation $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ exhibits a peak, which corresponds to the TDOA of propagation to microphones $m_i$ and $m_j$.
Further, said step 3) comprises the SRP-PHAT algorithm for a circular microphone array:
Summing the generalized cross-correlation $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ over all microphone pairs gives:
$P(\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N) = \sum_{i=1}^{N}\sum_{j=1}^{N} R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$
where $\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N$ are the steering delays of the N microphones, $\Delta\tau_i = \tau_i - \tau_0$, i = 1, …, N, and $\tau_0$ is the reference delay, taken as the minimum of all microphone delays.
Further, said step 3) comprises the multi-source SRP-PHAT algorithm for a circular microphone array:
When two or more sources are present simultaneously, the SRP-PHAT peak of one source is mixed into the SRP-PHAT peak of another; false peaks can arise at some points, and the local maxima become hard to find;
Using the approximate W-disjoint orthogonality of speech signals, the relative delay with which each source signal arrives at the microphone array is estimated in the time-frequency domain, with the short-time Fourier transform serving as the approximately W-disjoint orthogonal transform,
The frequency-domain representation of the signal model of the i-th microphone is assumed to be:
Given a window function W, the short-time Fourier transform of $s_j$ is $S_j$, and
By selecting an appropriate window function and window size, and under the hypothesis that the signals are approximately W-disjoint orthogonal, only one source is active at any time-frequency point, and its cross-spectrum is:
The delay $\Delta\tau_{n,i} - \Delta\tau_{n,j}$ between microphone i and microphone j can be obtained from the cross-power spectrum.
Compared with the prior art, the beneficial effects of the present invention are as follows:
Theoretical analysis and simulation experiments show that the SRP-PHAT algorithm of the present invention, which combines approximate W-disjoint orthogonality with a circular array, gives the DOA estimation of multiple sources good separation performance in acoustic environments with strong noise and moderate reverberation, makes the true peaks stand out clearly, and achieves high localization accuracy.
1. For uniform circular arrays, research on single-source localization is common, whereas research on multi-source localization is comparatively scarce; the circular array offers higher spatial resolution.
2. On the basis of the approximate W-disjoint orthogonality hypothesis, the SRP-PHAT algorithm gives the DOA estimation of multiple sources good separation performance in acoustic environments with strong noise and moderate reverberation, makes the true peaks stand out clearly, and achieves high localization accuracy.
3. The false-spectral-peak problem is effectively solved, and 3 signal sources can be resolved.
4. The method is suitable for localization under moderate reverberation.
Embodiment
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the present invention and are not intended to limit it.
Step 1: localization model and uniform circular array beamforming.
A uniform circular array defines a spatial coordinate system, as shown in Figure 1: isotropic microphones are evenly distributed on a circle of radius R lying in the x-y plane. Polar coordinates are used to represent the direction of arrival of the plane wave s. The origin of the coordinate system is at the center of the circular array and serves as the system reference point. The pitch angle of the signal is θ ∈ [0, π/2] and the azimuth angle is φ ∈ [0, 2π]. Here r is the distance from the sound source to the center of the circular array, and $r_i$ is the distance from the sound source to microphone $m_i$.
Suppose the acoustic signal is:
where $\omega_0$ is the angular frequency of the source signal, and
C is the wave speed, C = 384 m/s;
f is the frequency (Hz) of the sound source.
The signal received by the i-th microphone is
$f_i(r, t) = s(t - \Delta\tau_i) \quad (2)$
As shown in Figure 1,
where:
$r_i$ is the distance from the i-th microphone to the source;
r is the distance from the source to the center of the circular microphone array;
R is the radius of the circular array;
θ is the pitch angle of the sound source;
φ is the azimuth angle of the sound source;
$\varphi_i = 2\pi i/N$, i = 0, 1, 2, …, N−1, is the azimuth angle of the i-th microphone.
The time delay of each microphone before summation is therefore
where C is the wave speed, C = 384 m/s.
As shown in Figure 2, the delay-and-sum beamformer sums the time-shifted signals captured by all microphones. Superposing the contributions of the sources at a far-field point yields the far-field pattern function of the ring array:
Substituting (4) into (5) gives
The expression involves the unit wave-number vector of the sound source; T denotes vector transposition.
$\Delta\tau_i = \tau_i - \tau_0$, where $\tau_0$ is the reference delay, taken as the minimum of all microphone delays.
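The relative delays Δτ_i = τ_i − τ_0 can be illustrated with a short sketch. It assumes free-field propagation with τ_i = r_i / C, microphone i placed at azimuth 2πi/N on a circle in the x-y plane, and a source given in Cartesian coordinates; the function names, the 0.1 m radius, and the source position are illustrative choices and are not taken from the method itself.

```python
import numpy as np

def circular_array_positions(n_mics, radius):
    """Microphone coordinates on a circle in the x-y plane (assumed uniform layout)."""
    phi_i = 2.0 * np.pi * np.arange(n_mics) / n_mics   # azimuth of microphone i
    return np.stack([radius * np.cos(phi_i),
                     radius * np.sin(phi_i),
                     np.zeros(n_mics)], axis=1)

def relative_delays(source_xyz, mic_xyz, wave_speed):
    """Delays relative to the earliest microphone: tau_i - tau_0 with tau_0 = min_i tau_i."""
    r_i = np.linalg.norm(mic_xyz - np.asarray(source_xyz, dtype=float), axis=1)
    tau = r_i / wave_speed        # propagation delay from the source to each microphone
    return tau - tau.min()        # the smallest delay is the reference

mics = circular_array_positions(n_mics=8, radius=0.1)              # radius is an assumption
delays = relative_delays([2.0, 0.0, 0.0], mics, wave_speed=384.0)  # C as stated in the text
```

With the source on the positive x-axis, microphone 0 is the reference (zero delay) and the microphone on the opposite side of the circle has the largest delay.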
Step 2: the approximate W-disjoint orthogonality hypothesis
The masking effect of the human ear is usually divided into frequency masking and temporal masking. The time-frequency masking method assumes that the source signals are sparse and separable, that is, that they satisfy W-disjoint orthogonality.
Suppose the signal x(t) consists of N source signals and can be expressed as
$x(t) = \sum_{j=1}^{N} s_j(t)$
Suppose there exists a linear transform T, the mapping from $s_j$ to $S_j$, denoted $S_j = T s_j$, with the following properties:
(1) The transform T is invertible, i.e. $T^{-1}(Ts) = T(T^{-1}s) = s$;
(2) $\Lambda_j \cap \Lambda_k = \emptyset$ for j ≠ k, where $\Lambda_j$ is the support of $S_j$, $\Lambda_j = \operatorname{supp} S_j := \{\lambda : S_j(\lambda) \neq 0\}$; that is, the sets $\Lambda_j$ and $\Lambda_k$ have an empty intersection.
If conditions (1) and (2) are satisfied, the mixed signals in the set S can all be separated effectively.
Given a window function W, if
$S_j(t,\omega)\,S_k(t,\omega) = 0, \quad \forall t, \omega,\ j \neq k$
holds, the two sources $S_j$ and $S_k$ are said to satisfy W-disjoint orthogonality.
However, the W-disjoint orthogonality hypothesis is not satisfied by the signals studied here: the result of expression (7) is seldom exactly zero.
For this reason, two important performance criteria are introduced: (1) to what extent the mask retains the source of interest; (2) to what extent the mask suppresses the interfering sources.
The multi-source signal is divided into non-overlapping sets of time-frequency points, with only one active source signal in each time-frequency window, so that approximately
$S_j(t,\omega)\,S_k(t,\omega) \approx 0, \quad \forall t, \omega,\ j \neq k$
The time-frequency mask is defined as
$M_j(t,\omega) = 1$ if $S_j(t,\omega) \neq 0$, and $M_j(t,\omega) = 0$ otherwise
By estimating the time-frequency mask corresponding to each source, a given source j can thus be obtained from the mixture:
$S_j(t,\omega) = M_j(t,\omega)\,X(t,\omega)$
where $M_j$ is the indicator function of the support of source j, and $S_j(t,\omega)$ and $X(t,\omega)$ are the time-frequency representations of $s_j(t)$ and $x(t)$, respectively.
For a given time-frequency mask M, the preserved-signal ratio $\mathrm{PSR}_M$ is defined as
$\mathrm{PSR}_M = \|M S_j\|^2 / \|S_j\|^2$
$\mathrm{PSR}_M$ measures the percentage of the energy of source $S_j$ that is retained after masking.
Also define
$z_j(t) = \sum_{k \neq j} s_k(t)$
where $z_j(t)$ is the sum of the interfering sources active together with source $s_j$.
The signal-to-interference ratio after applying the time-frequency mask M is defined as
$\mathrm{SIR}_M = \|M S_j\|^2 / \|M Z_j\|^2$
where $\mathrm{SIR}_M$ measures the signal-to-interference ratio of the separated signal after the mask M is applied.
From $\mathrm{PSR}_M$ and $\mathrm{SIR}_M$ the approximate W-disjoint orthogonality $\mathrm{WDO}_M$ can be estimated:
$\mathrm{WDO}_M = \mathrm{PSR}_M - \mathrm{PSR}_M / \mathrm{SIR}_M$
Because a speech signal has a sparse time-frequency representation, in which the dominant time-frequency components account for the vast majority of the total power, and the magnitude of the product of the time-frequency representations of different sources is usually small, the weak W-disjoint orthogonality condition is satisfied. The higher the approximate W-disjoint orthogonality, the better the separation. The type and size of the window function are decisive for obtaining a good time-frequency masking result. In particular, W-disjoint orthogonality holds exactly when $\mathrm{WDO}_M = 1$.
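The two mask-quality measures and the resulting WDO value can be computed directly from time-frequency arrays. The sketch below assumes the definitions commonly used in the time-frequency masking literature, PSR_M = ‖M S_j‖² / ‖S_j‖², SIR_M = ‖M S_j‖² / ‖M Z_j‖² and WDO_M = PSR_M − PSR_M / SIR_M; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def mask_quality(S_j, Z_j, M):
    """Return (PSR, SIR, WDO) for mask M, target S_j and interference Z_j (same-shape arrays)."""
    kept_target = np.sum(np.abs(M * S_j) ** 2)         # target energy passed by the mask
    kept_interf = np.sum(np.abs(M * Z_j) ** 2)         # interference energy passed by the mask
    total_target = np.sum(np.abs(S_j) ** 2)            # total target energy
    psr = kept_target / total_target
    sir = np.inf if kept_interf == 0 else kept_target / kept_interf
    wdo = (kept_target - kept_interf) / total_target   # algebraically PSR - PSR / SIR
    return psr, sir, wdo

# Toy example: two sources with disjoint time-frequency supports.
S_1 = np.array([[1.0 + 1.0j, 0.0], [2.0, 0.0]])
S_2 = np.array([[0.0, 3.0], [0.0, 1.0j]])
M_1 = (np.abs(S_1) > 0).astype(float)                  # indicator of the support of source 1
psr, sir, wdo = mask_quality(S_1, S_2, M_1)
```

In this idealized case the mask keeps all of the target and none of the interference, so the WDO value reaches 1; any overlap between the supports lowers it.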
According to the experiments of Scott Rickard (Scott Rickard, Radu Balan and Justinian Rosca. Real-time time-frequency based blind source separation. Proceedings ICA2001, pp. 651-656, December 2001.), the WDO ratios for different numbers of sources N at 0 dB are as follows:
N | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
WDO (%) | 93.6 | 88.0 | 83.4 | 79.2 | 75.6 | 72.3 | 69.3 | 66.6 | 64
As shown in Figures 3 and 4, the case of 3 sources was verified by simulation; the horizontal axis is the WDO value and the vertical axis is the number of speech signal samples. With 3 sources, more than 80% of the signal is orthogonal.
As shown in Figures 5, 6 and 7, the near-orthogonality condition was also verified for 2 sources. Time-frequency analysis was performed on the signals $s_1(t)$ and $s_2(t)$, giving $S_1(t,\omega)$ and $S_2(t,\omega)$ respectively, and the product $S_1(t,\omega)\,S_2(t,\omega)$ was analyzed as well; the horizontal axis is time and the vertical axis is frequency. The window function W(t) is a Hamming window of length 64 ms. Figures 5, 6 and 7 show that $S_1(t,\omega)\,S_2(t,\omega)$ contains very little of the content of $S_1(t,\omega)$ and $S_2(t,\omega)$, which shows that the source signals satisfy approximate W-disjoint orthogonality.
Step 3: SRP-PHAT localization of multiple sources with a circular array, combined with approximate W-disjoint orthogonality
The SRP-PHAT algorithm computes the phase-transform-weighted steered response power function over all microphone pairs to obtain an objective function. An optimal beamformer is designed and its steered beam scans all possible receiving directions; the direction in which the beam output power is maximal gives the direction of the sound source.
1. SRP-PHAT algorithm for two microphones
For an array consisting of only two microphones, $m_i$ and $m_j$, the delay between the arrivals at the two microphones of a signal coming from azimuth φ and pitch angle θ is $\Delta\tau_{ij}(\theta,\varphi)$. The TDOA can be estimated by generalized cross-correlation (GCC), expressed as:
where P(r) is the spatial likelihood function of the three-dimensional position vector r, obtained by evaluating all possible θ and φ. The generalized cross-correlation function $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ can be expressed in the frequency domain as:
$R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi)) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \psi_{ij}(\omega)\, S_i(\omega) S_j^*(\omega)\, e^{j\omega\,\Delta\tau_{ij}(\theta,\varphi)}\, d\omega$
where $\psi_{ij}(\omega)$ is the weighting function and $S_i(\omega) S_j^*(\omega)$ is the cross-spectral density function.
The phase transform (PHAT) is a typical weighting of this kind.
The phase weighting function is defined as:
$\psi_{ij}(\omega) = \dfrac{1}{|S_i(\omega) S_j^*(\omega)|}$
By selecting a suitable weighting function, the delay-and-sum steered response power can be made to satisfy the optimal signal-to-noise-ratio criterion. Within a limited range of τ, the generalized cross-correlation $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ exhibits a peak, which corresponds to the TDOA of propagation to microphones $m_i$ and $m_j$. The algorithm offers a degree of noise immunity, reverberation resistance, and robustness in sound source localization.
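For a single microphone pair, the PHAT-weighted generalized cross-correlation can be sketched as follows. The code assumes discrete-time signals sampled at a rate fs and the usual PHAT weighting 1/|S_i(ω)S_j*(ω)|; the function name, the small regularization constant, and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def gcc_phat_delay(x_i, x_j, fs):
    """Estimate the delay (in seconds) of x_i relative to x_j from the GCC-PHAT peak."""
    n = len(x_i) + len(x_j)                       # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)                    # cross-spectrum S_i(w) S_j*(w)
    cross /= np.abs(cross) + 1e-12                # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)                 # generalized cross-correlation over lag
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max_shift..max_shift
    return (int(np.argmax(cc)) - max_shift) / fs

rng = np.random.default_rng(0)
fs = 16000                                        # assumed sampling rate in Hz
s = rng.standard_normal(2048)                     # synthetic broadband source
x_j = s
x_i = np.concatenate((np.zeros(5), s[:-5]))       # x_i lags x_j by 5 samples
tau_hat = gcc_phat_delay(x_i, x_j, fs)
```

Because the PHAT weighting discards the magnitude of the cross-spectrum, the correlation collapses to a sharp peak at the true lag, which is what makes the peak easy to locate.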
2. SRP-PHAT algorithm for the circular array
Summing the generalized cross-correlation $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ over all microphone pairs gives
$P(\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N) = \sum_{i=1}^{N}\sum_{j=1}^{N} R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$
where $\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N$ are the steering delays of the N microphones, $\Delta\tau_i = \tau_i - \tau_0$, i = 1, …, N, and $\tau_0$ is the reference delay, taken as the minimum of all microphone delays.
As the number of microphones increases, the two-microphone SRP-PHAT method extends naturally to the circular-array SRP-PHAT method.
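The extension from one pair to the whole array can be sketched as a scan over candidate source positions, accumulating the PHAT-weighted cross-spectra of all microphone pairs after steering. This is a minimal illustration under stated assumptions: a single noise-free frame, pure propagation delays, a hypothetical 8-microphone circle of radius 0.1 m, and candidates placed every 30° at 2 m; none of these values come from the text except the wave speed.

```python
import numpy as np

def srp_phat(frames, mic_xyz, candidates, fs, wave_speed):
    """Steered response power with PHAT weighting for each candidate source position."""
    n_mics, n = frames.shape
    X = np.fft.rfft(frames, axis=1)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    power = np.zeros(len(candidates))
    for q, point in enumerate(candidates):
        tau = np.linalg.norm(mic_xyz - point, axis=1) / wave_speed  # steering delays
        for i in range(n_mics):
            for j in range(i + 1, n_mics):
                cross = X[i] * np.conj(X[j])
                cross = cross / (np.abs(cross) + 1e-12)              # PHAT weighting
                steer = np.exp(1j * omega * (tau[i] - tau[j]))       # undo the pair delay
                power[q] += np.real(np.sum(cross * steer))
    return power

# Synthetic check: one source, ideal delays applied in the frequency domain.
fs, c, n = 16000, 384.0, 2048                     # c as stated in the text; fs, n assumed
angles = 2.0 * np.pi * np.arange(8) / 8
mics = np.stack([0.1 * np.cos(angles), 0.1 * np.sin(angles), np.zeros(8)], axis=1)
az = np.deg2rad(np.arange(0, 360, 30))
candidates = np.stack([2.0 * np.cos(az), 2.0 * np.sin(az), np.zeros(len(az))], axis=1)
true_idx = 4                                      # source placed at 120 degrees
rng = np.random.default_rng(1)
S = np.fft.rfft(rng.standard_normal(n))
omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
tau_true = np.linalg.norm(mics - candidates[true_idx], axis=1) / c
spectra = S[None, :] * np.exp(-1j * omega[None, :] * tau_true[:, None])
frames = np.fft.irfft(spectra, n=n, axis=1)
est_idx = int(np.argmax(srp_phat(frames, mics, candidates, fs, c)))
```

The candidate whose steering delays match the true propagation delays adds all pair contributions coherently, so its steered power is the largest.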
3. Multi-source SRP-PHAT algorithm for the circular array
When two or more sources are present simultaneously, the SRP-PHAT peak of one source is mixed into the SRP-PHAT peak of another; false peaks can arise at some points, and the local maxima become hard to find.
Using the approximate W-disjoint orthogonality of speech signals described above, the relative delay with which each source signal arrives at the microphone array is estimated in the time-frequency domain.
The short-time Fourier transform serves as the approximately W-disjoint orthogonal transform.
The frequency-domain representation of the signal model of the i-th microphone is assumed to be:
Given a window function W, the short-time Fourier transform of $s_j$ is $S_j$, and
By selecting an appropriate window function and window size, and under the hypothesis that the signals are approximately W-disjoint orthogonal, only one source is active at any time-frequency point. Its cross-spectrum is:
The delay $\Delta\tau_{n,i} - \Delta\tau_{n,j}$ between microphones i and j can be obtained from the cross-power spectrum.
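Under approximate W-disjoint orthogonality each time-frequency point is dominated by one source, so the phase of the cross-spectrum at that point carries the inter-microphone delay of that source. The sketch below assumes a pure-delay model, X_i = S·e^{−jωΔτ}, and delays small enough that ω·|Δτ| < π (no phase wrapping); the function name and the synthetic two-source data are hypothetical.

```python
import numpy as np

def pairwise_delay_per_bin(X_i, X_j, omega):
    """Delay of microphone i relative to j at each frequency bin, from the cross-spectrum phase."""
    cross = X_i * np.conj(X_j)                  # cross-spectrum at each bin
    return -np.angle(cross) / omega             # valid while omega * |delay| < pi

# Two sources occupying disjoint frequency bins (idealized W-disjoint orthogonality).
omega = 2.0 * np.pi * np.linspace(100.0, 2000.0, 20)    # angular frequencies in rad/s
rng = np.random.default_rng(2)
S = rng.standard_normal(20) + 1j * rng.standard_normal(20)
d_a, d_b = 1.0e-4, -0.5e-4                               # pair delays of sources A and B (s)
delay = np.where(np.arange(20) < 10, d_a, d_b)           # bins 0-9 belong to A, 10-19 to B
X_j = S
X_i = S * np.exp(-1j * omega * delay)
est = pairwise_delay_per_bin(X_i, X_j, omega)
```

The per-bin estimates cluster around one delay value per source, which is what allows the contributions of different sources to be separated before the steered response power is accumulated.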
Embodiment 1: localization of two sound sources
1. Selection of the uniform circular array localization model
The simulation experiments are carried out under different signal-to-noise ratios and reverberation conditions. The uniform circular array is placed in a room of 7 m × 8 m × 3.5 m, and its 8 microphone elements are located at [3.25, -1.6, 1.5], [3.25, 1.1, 1.5], [1.87, 3.75, 1.5], [1.0, 3.75, 1.5], [3.25, 1.8, 1.5], [3.25, -1.0, 1.5], [2.2, -3.75, 1.5], [0.6, -3.75, 1.5].
2. Selection of the sound sources
The sound sources are randomly generated speech signals with a signal-to-noise ratio of 0-30 dB. The random interference is a Gaussian signal that simulates an air conditioner, an electric fan, and noise from outside the window; its power can reach at most 10 dB, and the corresponding reverberation time is determined by the reflection coefficients of the room walls, floor, and ceiling.
3. Short-time Fourier transform (STFT) of the signals received by the array
Given a window function W, the short-time Fourier transform of $s_j$ is $S_j$, and
The type and size of the window function are decisive for obtaining a good time-frequency masking result. Here a Hamming window is chosen, with a window size of 1024 points.
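A short-time Fourier transform with a Hamming window can be sketched with NumPy alone. The 1024-point window follows the text; the 50% hop, the 16 kHz sampling rate, the test tone, and the function name are assumptions made for illustration.

```python
import numpy as np

def stft_hamming(x, win_len=1024, hop=512):
    """STFT of a 1-D signal: rows are frames, columns are frequency bins 0..win_len/2."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop : k * hop + win_len] * window
                       for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

fs = 16000                                    # assumed sampling rate in Hz
t = np.arange(fs) / fs                        # one second of signal
x = np.sin(2 * np.pi * 1000.0 * t)            # 1 kHz test tone
S = stft_hamming(x)
peak_bin = int(np.argmax(np.abs(S[0])))       # strongest frequency bin in the first frame
```

With these assumed values the bin spacing is fs / 1024 = 15.625 Hz, so a longer window sharpens the frequency resolution at the cost of time resolution; this trade-off is why the window choice affects the masking result.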
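4. Generalized cross-correlation with phase transform
By choosing a suitable window function, good separation can be achieved and approximate W-disjoint orthogonality is satisfied. On this basis, the generalized cross-correlation can be computed.
The generalized cross-correlation function $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ can be expressed in the frequency domain as:
$R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi)) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \psi_{ij}(\omega)\, S_i(\omega) S_j^*(\omega)\, e^{j\omega\,\Delta\tau_{ij}(\theta,\varphi)}\, d\omega$
where $\psi_{ij}(\omega)$ is the weighting function, given by:
$\psi_{ij}(\omega) = \dfrac{1}{|S_i(\omega) S_j^*(\omega)|}$
Summing the generalized cross-correlation $R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$ over all microphone pairs gives
$P(\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N) = \sum_{i=1}^{N}\sum_{j=1}^{N} R_{s_i s_j}(\Delta\tau_{ij}(\theta,\varphi))$
where $\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N$ are the steering delays of the N microphones, $\Delta\tau_i = \tau_i - \tau_0$, i = 1, …, N, and $\tau_0$ is the reference delay, taken as the minimum of all microphone delays.
Once the maximum of $P(\Delta\tau_1, \Delta\tau_2, \ldots, \Delta\tau_N)$ has been found, the pitch angle θ and azimuth angle φ of the sound source can be determined.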
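When several sources are present, the directions are read off the largest local maxima of the steered power rather than a single global maximum. The sketch below shows one simple way to do this for K sources, assuming the power has already been evaluated on a uniform azimuth grid over [0°, 360°) with wrap-around; the function name, the grid, and the synthetic spectrum are illustrative only.

```python
import numpy as np

def top_k_azimuths(power, azimuths_deg, k):
    """Return the azimuths of the k largest local maxima of a circular power spectrum."""
    left = np.roll(power, 1)                    # value at the previous grid point (wraps)
    right = np.roll(power, -1)                  # value at the next grid point (wraps)
    is_peak = (power > left) & (power >= right)
    peak_idx = np.flatnonzero(is_peak)
    strongest = peak_idx[np.argsort(power[peak_idx])[::-1][:k]]
    return np.sort(azimuths_deg[strongest])

az = np.arange(0, 360)                          # 1-degree azimuth grid

def bump(center, width=8.0):
    d = (az - center + 180) % 360 - 180         # signed circular angular distance
    return np.exp(-0.5 * (d / width) ** 2)

# Two source lobes (74 and 282 degrees, i.e. -78 degrees) plus a weak spurious ripple.
power = 1.0 * bump(74) + 0.8 * bump(282) + 0.05 * bump(180)
found = top_k_azimuths(power, az, k=2)
```

Ranking the local maxima by height keeps the two dominant lobes and discards the weaker ripple, which mirrors the distinction between true and false spectral peaks.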
5. Results of the above steps
Figures 10 and 11 show the source wave-field images of the circular array at signal-to-noise ratios of 20 dB and 30 dB, respectively. In the figures, the microphone positions are marked, ○ denotes the estimated sound sources, and * denotes the positions of the interference.
In Figure 10, the two sound sources are located at [0.59, 2.08, 1.5] and [0.29, -1.37, 1.5], and the signal-to-noise ratio is 20 dB. The random interference is a Gaussian signal that simulates an air conditioner, an electric fan, and noise from outside the window, located at [2, -4, 1.5] and [3.5, -3.2, 1.5]; its power can reach at most 10 dB, and the corresponding reverberation time is determined by the reflection coefficients of the room walls, floor, and ceiling.
In Figure 11, the two sound sources are located at [1.5, 2.1, 1.5] and [2.1, 0.8, 1.5], and the signal-to-noise ratio is 30 dB. The interference simulating the air conditioner, electric fan, and outside noise is located away from the two sources.
The SRP-PHAT algorithm combined with approximate W-disjoint orthogonality is used for direction estimation, with a Hamming window of 1024 points. It can be seen that, under the same background noise, the higher the signal-to-noise ratio of the signal, the higher the localization accuracy.
Figures 12 and 13 show the measured source azimuths. In Figure 12 the azimuths are φ1 = 74° and φ2 = -78°. Although the two signals are close in azimuth and the signal-to-noise ratio is low, the 2 sources can essentially be resolved: spectral peaks appear at the true directions, no false spectral peak appears, and the target directions are still estimated correctly. In Figure 13 the measured azimuths are φ1 = 17° and φ2 = 52°. Although the two signals are relatively close in azimuth, the signal-to-noise ratio is high and the two angles still differ considerably, so the 2 sources are fully resolved. As the signal-to-noise ratio increases, the estimation error decreases and the estimation accuracy improves. The larger the angular difference between the two signals, the more accurate the estimate; once the angular difference is large enough, the estimation accuracy levels off.
The azimuth and pitch angles shown in Figure 14 are (φ1 = 74°, θ1 = 46°) and (φ2 = -78°, θ2 = 0°).
Embodiment 2: localization of three sound sources
When the number of sources increases to 3, the false-spectral-peak problem cannot be solved well at low signal-to-noise ratio. At high signal-to-noise ratio the problem is essentially solved, and multiple sources are resolved well.
The implementation steps are the same as in Embodiment 1 and are omitted here.
Figure 15 shows the two-dimensional image of three-source localization at a signal-to-noise ratio of 30 dB.
Figures 16 and 17 show the source azimuths measured at relatively high signal-to-noise ratio by the method proposed here and by the conventional SRP-PHAT method, respectively. The SRP-PHAT method based on approximate W-disjoint orthogonality effectively solves the false-spectral-peak problem and resolves the 3 signal sources, whereas the conventional SRP-PHAT method produces false spectral peaks and cannot resolve the 3 useful signals.
Figure 18 shows the source signals received by the 8-element microphone array; the interference source has a larger effect on microphone No. 7, which is closest to it. Figure 19 shows the curves of direction-angle error versus signal-to-noise ratio for different reverberation times T60, with RT60 chosen as 300 ms, 450 ms and 600 ms. As T60 increases, the estimation error grows and the estimation accuracy decreases. Under strong reverberation the target direction is hard to resolve; the method is suitable for localization under moderate reverberation.
The simulation results show that the SRP-PHAT algorithm with a uniform ring array has good localization performance, particularly when the SNR is high and the reverberation is moderate.
Because the W-disjoint orthogonality hypothesis, which is based on the sparsity of speech signals, is not satisfied by multiple sources, the present invention introduces two key properties, the signal retention ratio after time-frequency masking and the signal-to-interference ratio after time-frequency masking, and derives the approximate W-disjoint orthogonality condition. The multi-source signal is divided into non-overlapping sets of time-frequency points, each set containing only the time-frequency components of a single source signal, and the relative delay with which each source signal arrives at the microphone array is estimated in the time-frequency domain. A circular array, which has higher spatial resolution, is used so that the azimuth and pitch angles of multiple source signals are estimated simultaneously with high resolution. This realizes the spatial localization of the source signals and overcomes the inability of existing sound source localization methods to localize multiple aliased sources effectively in three dimensions.
Finally, it should be noted that the above are only preferred embodiments of the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, a person skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.