Disclosure of Invention
The invention provides an indoor sound source following and enhancing method based on a microphone array.
The technical solution for realizing the purpose of the invention is as follows: an indoor sound source following and enhancing method based on a microphone array comprises the following steps:
step 1, performing anti-aliasing low-pass filtering on analog sound signals picked up by each microphone, and performing A/D sampling and frame division on the filtered analog signals;
step 2, carrying out frame-level detection on each path of sound signal and classifying each frame as either a sound frame or a noise frame; if the frame is a noise frame, storing it; if the frame is a sound frame, carrying out spectrum cancellation with the most recent noise frame;
step 3, selecting a reference path signal, estimating a wavelength according to a reference path sound frame, and performing cross-correlation operation on each path of sound frame and a current reference path sound frame to further estimate a wave arrival angle;
step 4, determining a weight coefficient corresponding to each sound frame according to the estimated wavelength and arrival angle, in combination with the geometric shape of the microphone array;
step 5, carrying out weighted summation on each path of sound frame to determine an enhanced voice digital signal, carrying out D/A conversion on the obtained voice digital signal, generating an analog signal and transmitting the analog signal to a loudspeaker;
and step 6, repeating steps 1-5 to process the sound signal of the next frame until the sound pickup is finished.
Compared with the prior art, the invention has the following remarkable advantages: 1) after the sound source is located, spatial filtering and noise spectrum cancellation are carried out, so that the noise suppression capability and the sound fidelity of the system are improved; 2) the invention has strong self-healing capability: the system can continue to work normally when any single microphone is damaged, which ensures the overall performance of the system.
Detailed Description
The principles and aspects of the present invention are further described below in conjunction with the following figures and the detailed description.
Fig. 1 shows the flow of the method for indoor sound source following and enhancement based on a microphone array. For convenience of explanation, the speech system composed of the 4-element linear microphone array shown in fig. 2 is taken as an example, in which the speaker is located at the left front of the array and a loudspeaker is directed at the student seats in front of the microphone array. The method of the invention includes the following steps:
step 1, anti-aliasing filtering
Generally, the highest frequency that the human ear can perceive is about 20 kHz, so sound above this band is filtered out before sampling to prevent spectrum aliasing. Since the sampling rate only needs to satisfy the Nyquist sampling theorem, a sampling rate of 40.96 kHz > 2 × 20 kHz is adopted in the exemplary implementation of the invention.
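Purely as an illustration of this step (in the described system the anti-aliasing filter is analog and precedes the A/D converter), the following Python sketch builds a 20 kHz low-pass filter with scipy; the 192 kHz raw capture rate and the filter order are assumed example values, not taken from the invention.

import numpy as np
from scipy import signal

RAW_RATE = 192_000   # hypothetical high-rate capture, not specified by the invention
CUTOFF = 20_000      # upper limit of the audible band, Hz

# 8th-order Butterworth low-pass, realized as second-order sections for numerical stability
sos = signal.butter(8, CUTOFF, btype="low", fs=RAW_RATE, output="sos")

def antialias(x_raw: np.ndarray) -> np.ndarray:
    """Suppress content above 20 kHz before the signal is sampled at 40.96 kHz."""
    return signal.sosfilt(sos, x_raw)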
Step 2, A/D sampling and frame division
The forward masking effect of the human ear is exploited here: as long as the system finishes its data processing and plays the enhanced sound signal within the specified time, the listener perceives only the sound played by the loudspeaker rather than the sound coming directly from the speaker, so the experience in the venue is not affected.
It is therefore important that the processing time of the system is not too long: the frame must be the unit of signal processing, and the frame duration should not exceed the masking time. For example, the forward masking effect of the human ear can be as short as about 5 ms. If the sampling rate is fs, the number of samples within this duration is about 0.005·fs ≈ 200 points (at fs = 40.96 kHz). Considering the requirement that the number of fast Fourier transform points be a power of two, a frame of 2^⌊log2(0.005·fs)⌋ points can be taken as an optimal frame, where ⌊x⌋ represents the largest integer not exceeding x. In addition, in practical situations the loudspeaker sound is much louder than the sound arriving directly from the source, so the sound-intensity masking effect significantly prolongs the forward masking time in the time domain, and the time available for the subsequent signal processing is entirely sufficient for such a framing scheme.
In addition, the frame interval (frame shift) should be smaller than the frame length, so that adjacent frames overlap and useful signal is not seriously lost when the sound frames of the different paths are not synchronized. An example of frame division with a frame length of 200 and a frame interval of 100 is shown in fig. 4, where the data points in the solid box are the previous data frame and the data points in the dashed box are the current data frame.
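A minimal framing sketch matching the example above (frame length 200 samples, frame interval 100 samples); the function name and array layout are ours, not part of the invention.

import numpy as np

FRAME_LEN = 200     # frame length in samples, as in the example of fig. 4
FRAME_SHIFT = 100   # frame interval (shift) in samples

def split_frames(x: np.ndarray) -> np.ndarray:
    """Split one channel into overlapping frames of shape (num_frames, FRAME_LEN)."""
    num_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    starts = np.arange(num_frames) * FRAME_SHIFT
    return np.stack([x[s:s + FRAME_LEN] for s in starts])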
Step 3, frame type judgment
The level amplitude of a noise frame is small, while that of a sound frame is large, so the classification is performed on level values; this is intuitive, simple and accurate. Concretely, a decision level is computed from each frame according to a certain rule and compared with a threshold level, and the frame is thereby classified as a sound frame or a noise frame. A noise frame is only transformed by FFT and stored, while a sound frame proceeds to step 4.
The operation rule and the threshold level can be chosen freely in many ways; the rule adopted in the exemplary implementation of the invention is to average the absolute values of all points in the frame, and the threshold level is 1/10 of the full-scale level of the sound acquisition system.
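A sketch of this decision rule, assuming samples normalized so that full scale is 1.0 (the actual full-scale level depends on the acquisition hardware):

import numpy as np

FULL_SCALE = 1.0               # assumed full-scale level of the acquisition system
THRESHOLD = FULL_SCALE / 10    # threshold level: 1/10 of full scale

def is_sound_frame(frame: np.ndarray) -> bool:
    """Decision level = mean absolute value; above the threshold means a sound frame."""
    return float(np.mean(np.abs(frame))) > THRESHOLD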
Step 4, spectrum cancellation
For random noise, due to the randomness of the random noise, the superposition operation among all paths of frame signals cannot obviously enhance the noise; for coherent noise, due to its coherence, the addition enhances the sound signal and also enhances the coherent noise to an equal extent, so measures should be taken to suppress the coherent noise.
Because coherent noise has a relatively fixed frequency spectrum, the coherent noise in the space the array beam points to can be suppressed by subtracting the FFT of the most recently stored noise frame from the FFT of the sound frame signal.
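A hedged sketch of this cancellation: the text only states that the two FFTs are subtracted, so the variant below subtracts magnitudes while keeping the phase of the sound frame, which is one common way to realize such a spectral subtraction.

import numpy as np

def cancel_noise_spectrum(sound_frame: np.ndarray, noise_fft: np.ndarray) -> np.ndarray:
    """Subtract the stored noise spectrum from the current sound frame."""
    sound_fft = np.fft.fft(sound_frame)
    cleaned_mag = np.maximum(np.abs(sound_fft) - np.abs(noise_fft), 0.0)
    cleaned_fft = cleaned_mag * np.exp(1j * np.angle(sound_fft))
    return np.real(np.fft.ifft(cleaned_fft))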
Step 5, wavelength estimation
The human voice signal is a short-time stationary signal, and in a conference or teaching environment the speech of a person talking normally can be regarded as a narrow-band signal. Therefore, the FFT result of the sound frame is analyzed, the center of the frequency band whose FFT magnitude exceeds a set threshold is selected as the dominant frequency fest, and the wavelength λest is deduced from the speed of sound c:
λest = c/fest
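A sketch of this estimate; the 50%-of-maximum threshold is an assumption (the text only says "a set threshold"), and c = 340 m/s is used for the speed of sound, consistent with the wavelength example in the DOA discussion below.

import numpy as np

C = 340.0     # speed of sound in air, m/s
FS = 40_960   # sampling rate, Hz

def estimate_wavelength(sound_frame: np.ndarray) -> float:
    """Dominant frequency = centre of the band above threshold; wavelength = c / f_est."""
    spectrum = np.abs(np.fft.rfft(sound_frame))
    freqs = np.fft.rfftfreq(len(sound_frame), d=1.0 / FS)
    band = freqs[spectrum > 0.5 * spectrum.max()]
    f_est = 0.5 * (band.min() + band.max())
    return C / f_est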
Step 6, cross-correlation operation
The cross-correlation function describes the degree of correlation between two time series. The cross-correlation function of two digital sound signal sequences f[n] and g[n] is defined as:
Rf,g[m] = Σn f[n]·g[n+m]
where Rf,g[m] represents the result of sliding the sequence g[n] m points to the left, multiplying it point by point with f[n], and summing. If g[n] is a time-delayed version of f[n], then after sliding g[n] by J points it aligns with f[n]; the two sequences are then most similar, and the cross-correlation function peaks at that point.
In general, cross-correlation and convolution are not computed directly in a signal processing system because of the large amount of computation involved; they are instead converted to the frequency domain to ensure processing speed. Therefore, the invention applies a Fourier transform followed by an inverse Fourier transform, so that the cross-correlation is actually computed in the signal processing system by the following formula:
Rf,g[n]=IFFT{F[n]G*[n]}
where F[n] and G[n] are the discrete Fourier transforms of f[n] and g[n], G*[n] denotes the conjugate of the sequence G[n], and IFFT{ } denotes the inverse Fourier transform operation.
Because the function reaches its peak when the sequences are most similar, the delay difference between the sequences can be obtained by locating the peak position of the cross-correlation result. For example, for two sound frame signals f[n] and g[n] with frame length N, if their cross-correlation function Rf,g[m] has its maximum at point m = L, i.e. g[n] shifted left by L points is most similar to f[n], and the time interval corresponding to one sampling point is 1/fs, then the delay between the two sequences is solved as:
Δt = L/fs
Fig. 5 shows the cross-correlation result of two sound sequences with frame length 512 and sampling rate 40.96 kHz; the actual delay difference is 1.2489 × 10^-4 s, and the delay Δt solved by the algorithm matches this value well, showing that the calculated result agrees with the actual one.
Step 7, DOA estimation
Because the delay difference of the sound signal reaching each microphone depends on the position of the speaker and the geometrical shape of the microphone array, the DOA can be estimated after the delay difference is obtained and the geometrical shape of the microphone array is known.
Let r be the distance from the sound source to the center of the array and L be the length of the array. The highest frequency fmax of normal human speech is around 4 kHz, so the wavelength is about λ = c/fmax ≈ 8.5 cm. According to array antenna theory, to realize a composite beam with low sidelobes or deep nulls, the condition r ≥ 10L²/λ must be satisfied. In practical application scenarios r is between 2 m and 10 m, so the array length can be calculated to be between 0.13 m and 0.29 m. Since the array aperture is very small compared with r, a far-field model can be used, and the DOA at each microphone can be considered equal, as shown in fig. 7.
As can be seen from fig. 6, the delay difference of the sound signal between adjacent microphones is Δt = d·sinθ/c. Since Δt can be obtained as a concrete value through the cross-correlation operation, and the array element spacing d and the sound velocity c are known, the arrival angle at each array element is θ = arcsin(c·Δt/d), and the results are finally averaged.
Figure 7 shows that the algorithm solves for an angle of arrival of 20.6 degrees at an actual DOA (relative to the geometric center of the array) of 20.2 degrees. The deviation between the solution value and the actual value is small, and the reliability of the algorithm is proved.
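A sketch of the angle computation for a uniform linear array; the element spacing d = 0.06 m is an assumed example value within the array lengths derived above, not a value prescribed by the invention.

import numpy as np

C = 340.0   # speed of sound, m/s
D = 0.06    # assumed spacing between adjacent array elements, m

def estimate_doa(adjacent_delays) -> float:
    """theta = arcsin(c * dt / d) for each adjacent pair of elements, then averaged (degrees)."""
    args = np.clip(C * np.asarray(adjacent_delays, dtype=float) / D, -1.0, 1.0)
    return float(np.degrees(np.mean(np.arcsin(args))))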
Step 8, weight coefficient generation
After the arrival angle has been solved, according to array antenna theory, phase compensation must be applied to the signal at each array element through weighting in order to align the beam with the signal at that angle and suppress signals arriving from other directions. In existing digital beamforming technology there are many ways of generating weight coefficients to implement spatial filtering, such as the LCMV filtering algorithm, which can enhance the signal in one direction while simultaneously placing nulls in another, interfering direction, or the sample matrix inversion (SMI) method; the choice of the weight-coefficient generation method is therefore very free.
The basic array antenna spatial filtering algorithm is explained in detail below.
When the arrival angle is θ, the phase difference between the sound signal at array element k of the uniform linear array and that at the reference array element is:
Δφk = 2π·k·d·sinθ/λest
At this time, the weighting coefficient taken at array element k is:
wk = exp(-jΔφk)
i.e. the main lobe of the beam is aligned with the direction of the arrival angle θ.
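A sketch of this weight generation for a 4-element uniform linear array; the element spacing is again the assumed 0.06 m, and the sign convention matches the phase compensation given above.

import numpy as np

def steering_weights(theta_deg: float, wavelength: float,
                     num_elements: int = 4, d: float = 0.06) -> np.ndarray:
    """Weights w_k = exp(-j * 2*pi * k * d * sin(theta) / lambda_est), k = 0..N-1."""
    k = np.arange(num_elements)
    phase = 2 * np.pi * k * d * np.sin(np.radians(theta_deg)) / wavelength
    return np.exp(-1j * phase)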
Fig. 8 shows, under the example algorithm, the directional gain patterns of the array produced by the weighting coefficients for θ = 0 degrees and θ = 30 degrees, respectively. It can be seen that the prediction fits the actual behavior well.
In this example, the wide main lobe of the 4-element linear array compensates for deviations in the wavelength estimate. In a conference or teaching scene only one speaker talks at a time, and the speaker's voice is the loudest sound in his or her vicinity, so the wide main lobe does not impair the spatial filtering effect. A larger or more elaborate array would be more effective, but this brings little additional benefit in a conference or classroom environment.
Step 9, weighted summation
The sound frame signal of each array element is weighted by the weight coefficient of that array element calculated in step 8, and the weighted frames are then summed.
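One reasonable realization of this step is sketched below: the complex weight of each element is applied to the analytic (Hilbert) signal of its sound frame so that the weight acts as a phase shift, and the channels are then summed; the invention only specifies "weight, then sum", so the analytic-signal detail is our assumption.

import numpy as np
from scipy.signal import hilbert

def weighted_sum(frames: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """frames: (num_elements, frame_length) real samples; weights: (num_elements,) complex."""
    analytic = hilbert(frames, axis=1)                    # per-channel analytic signal
    return np.real((weights[:, None] * analytic).sum(axis=0))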
Fig. 9 compares the effect of the weighted sum with that of a direct sum; the sound source is a sample recording of human laughter provided by MATLAB. The left column shows the time-domain waveform of the signal, and the right column shows the spectrum of the complete signal; the top row is the original digital signal, the middle row is the digital signal processed by the classroom speaker sound following and enhancing system based on the microphone array, and the bottom row is the digital signal obtained by direct addition, without any processing, using the same microphone array. It is apparent from the figure that the system proposed by the invention can effectively enhance the signal while maintaining the original shape of the sound signal spectrum, whereas the direct addition without any processing, although it also enhances the signal, produces significant distortion in the spectrum of the sound signal and results in poor sound quality from the loudspeaker.
In addition, because the system uses a microphone array and the final result is a weighted sum over the array elements, the whole system will not fail outright when a single microphone is damaged, so the self-healing capability is strong.
Step 10, D/A conversion
And converting the weighted and summed digital signals into analog signals, and outputting the analog signals to a loudspeaker to realize the enhanced playing of the voice of the speaker.
Step 11, determining whether to close the system according to the session status
If the conversation is not finished, the system processes the next frame signal immediately; if the conversation is over, the manager can close the system power supply and stop the system operation.