Disclosure of Invention
The invention provides an indoor sound source following and enhancing method based on a microphone array.
The technical solution for realizing the purpose of the invention is as follows: an indoor sound source following and enhancing method based on a microphone array comprises the following steps:
step 1, performing anti-aliasing low-pass filtering on analog sound signals picked up by each microphone, and performing A/D sampling and frame division on the filtered analog signals;
step 2, carrying out frame-level detection on each path of sound signal and classifying each frame as either a sound frame or a noise frame; if the frame is a noise frame, storing it; if the frame is a sound frame, carrying out spectrum cancellation with the most recent noise frame;
step 3, selecting a reference path signal, estimating a wavelength according to a reference path sound frame, and performing cross-correlation operation on each path of sound frame and a current reference path sound frame to further estimate a wave arrival angle;
step 4, determining a weight coefficient corresponding to each sound frame according to the estimated wavelength and arrival angle, in combination with the geometric shape of the microphone array;
step 5, carrying out weighted summation on each path of sound frame to determine an enhanced voice digital signal, carrying out D/A conversion on the obtained voice digital signal, generating an analog signal and transmitting the analog signal to a loudspeaker;
and step 6, repeating steps 1-5 to process the sound signal of the next frame until the sound pickup is finished.
Compared with the prior art, the invention has the following remarkable advantages: 1) after the sound source is located, spatial filtering and noise spectrum cancellation are carried out, so that the noise suppression capability and the sound fidelity of the system are improved; 2) the invention has strong self-healing capability: the system can continue to work normally when any single microphone is damaged, which ensures the overall performance of the system.
Detailed Description
The principles and aspects of the present invention are further described below in conjunction with the following figures and the detailed description.
Fig. 1 shows the flow of the method for indoor sound source following and enhancement based on a microphone array. For convenience of explanation, the speech system composed of the 4-element linear microphone array shown in fig. 2 is taken as an example, in which the speaker is located at the left front of the array and a loudspeaker is directed at the student seats in front of the microphone array. The method of the invention includes the following steps:
step 1, anti-aliasing filtering
Generally, the highest frequency that the human ear can perceive is about 20 kHz, so sound above this band is filtered out before sampling to prevent spectrum aliasing. Since the sampling rate only needs to satisfy the Nyquist sampling theorem, a sampling rate of 40.96 kHz > 2 × 20 kHz is adopted in the exemplary implementation of the invention.
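Purely as an illustration of this step (in the described system the anti-aliasing filter is analog and precedes the A/D converter), the following Python sketch builds a 20 kHz low-pass filter with scipy; the 192 kHz raw capture rate and the filter order are assumed example values, not taken from the invention.

import numpy as np
from scipy import signal

RAW_RATE = 192_000   # hypothetical high-rate capture, not specified by the invention
CUTOFF = 20_000      # upper limit of the audible band, Hz

# 8th-order Butterworth low-pass, realized as second-order sections for numerical stability
sos = signal.butter(8, CUTOFF, btype="low", fs=RAW_RATE, output="sos")

def antialias(x_raw: np.ndarray) -> np.ndarray:
    """Suppress content above 20 kHz before the signal is sampled at 40.96 kHz."""
    return signal.sosfilt(sos, x_raw)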
Step 2, A/D sampling and frame division
The forward masking effect of the human ear is exploited here: as long as the system finishes its data processing and plays the enhanced sound signal within the specified time, the listener perceives only the sound played by the loudspeaker rather than the sound coming directly from the speaker, so the experience in the venue is not affected.
It is therefore important that the processing time of the system is not too long: the frame must be the unit of signal processing, and the frame duration should not exceed the masking time. For example, the forward masking effect of the human ear can be as short as about 5 ms. If the sampling rate is fs, the number of samples within this duration is about 0.005·fs ≈ 200 points (at fs = 40.96 kHz). Considering the requirement that the number of fast Fourier transform points be a power of two, a frame of 2^⌊log2(0.005·fs)⌋ points can be taken as an optimal frame, where ⌊x⌋ represents the largest integer not exceeding x. In addition, in practical situations the loudspeaker sound is much louder than the sound arriving directly from the source, so the sound-intensity masking effect significantly prolongs the forward masking time in the time domain, and the time available for the subsequent signal processing is entirely sufficient for such a framing scheme.
In addition, the frame interval (frame shift) should be smaller than the frame length, so that adjacent frames overlap and useful signal is not seriously lost when the sound frames of the different paths are not synchronized. An example of frame division with a frame length of 200 and a frame interval of 100 is shown in fig. 4, where the data points in the solid box are the previous data frame and the data points in the dashed box are the current data frame.
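A minimal framing sketch matching the example above (frame length 200 samples, frame interval 100 samples); the function name and array layout are ours, not part of the invention.

import numpy as np

FRAME_LEN = 200     # frame length in samples, as in the example of fig. 4
FRAME_SHIFT = 100   # frame interval (shift) in samples

def split_frames(x: np.ndarray) -> np.ndarray:
    """Split one channel into overlapping frames of shape (num_frames, FRAME_LEN)."""
    num_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    starts = np.arange(num_frames) * FRAME_SHIFT
    return np.stack([x[s:s + FRAME_LEN] for s in starts])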
Step 3, frame type judgment
The level amplitude of a noise frame is small, while that of a sound frame is large, so the classification is performed on level values; this is intuitive, simple and accurate. Concretely, a decision level is computed from each frame according to a certain rule and compared with a threshold level, and the frame is thereby classified as a sound frame or a noise frame. A noise frame is only transformed by FFT and stored, while a sound frame proceeds to step 4.
The operation rule and the threshold level can be chosen freely in many ways; the rule adopted in the exemplary implementation of the invention is to average the absolute values of all points in the frame, and the threshold level is 1/10 of the full-scale level of the sound acquisition system.
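A sketch of this decision rule, assuming samples normalized so that full scale is 1.0 (the actual full-scale level depends on the acquisition hardware):

import numpy as np

FULL_SCALE = 1.0               # assumed full-scale level of the acquisition system
THRESHOLD = FULL_SCALE / 10    # threshold level: 1/10 of full scale

def is_sound_frame(frame: np.ndarray) -> bool:
    """Decision level = mean absolute value; above the threshold means a sound frame."""
    return float(np.mean(np.abs(frame))) > THRESHOLD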
Step 4, spectrum cancellation
For random noise, due to the randomness of the random noise, the superposition operation among all paths of frame signals cannot obviously enhance the noise; for coherent noise, due to its coherence, the addition enhances the sound signal and also enhances the coherent noise to an equal extent, so measures should be taken to suppress the coherent noise.
Because coherent noise has a relatively fixed frequency spectrum, the coherent noise in the space the array beam points to can be suppressed by subtracting the FFT of the most recently stored noise frame from the FFT of the sound frame signal.
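A hedged sketch of this cancellation: the text only states that the two FFTs are subtracted, so the variant below subtracts magnitudes while keeping the phase of the sound frame, which is one common way to realize such a spectral subtraction.

import numpy as np

def cancel_noise_spectrum(sound_frame: np.ndarray, noise_fft: np.ndarray) -> np.ndarray:
    """Subtract the stored noise spectrum from the current sound frame."""
    sound_fft = np.fft.fft(sound_frame)
    cleaned_mag = np.maximum(np.abs(sound_fft) - np.abs(noise_fft), 0.0)
    cleaned_fft = cleaned_mag * np.exp(1j * np.angle(sound_fft))
    return np.real(np.fft.ifft(cleaned_fft))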
Step 5, wavelength estimation
The human voice signal is a short-time stationary signal, and in a conference or teaching environment the speech of a person talking normally can be regarded as a narrow-band signal. Therefore, the FFT result of the sound frame is analyzed, the center of the frequency band whose FFT magnitude exceeds a set threshold is selected as the dominant frequency fest, and the wavelength λest is deduced from the speed of sound c:
λest = c/fest
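A sketch of this estimate; the 50%-of-maximum threshold is an assumption (the text only says "a set threshold"), and c = 340 m/s is used for the speed of sound, consistent with the wavelength example in the DOA discussion below.

import numpy as np

C = 340.0     # speed of sound in air, m/s
FS = 40_960   # sampling rate, Hz

def estimate_wavelength(sound_frame: np.ndarray) -> float:
    """Dominant frequency = centre of the band above threshold; wavelength = c / f_est."""
    spectrum = np.abs(np.fft.rfft(sound_frame))
    freqs = np.fft.rfftfreq(len(sound_frame), d=1.0 / FS)
    band = freqs[spectrum > 0.5 * spectrum.max()]
    f_est = 0.5 * (band.min() + band.max())
    return C / f_est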
Step 6, cross-correlation operation
The cross-correlation function describes the degree of correlation between two time series. The cross-correlation function of two digital sound signal sequences f[n] and g[n] is defined as:
Rf,g[m] = Σn f[n]·g[n+m]
where Rf,g[m] represents the result of sliding the sequence g[n] m points to the left, multiplying it point by point with f[n], and summing. If g[n] is a time-delayed version of f[n], then after sliding g[n] by J points it aligns with f[n]; the two sequences are then most similar, and the cross-correlation function peaks at that point.
In general, cross-correlation and convolution are not computed directly in a signal processing system because of the large amount of computation involved; they are instead converted to the frequency domain to ensure processing speed. Therefore, the invention applies a Fourier transform followed by an inverse Fourier transform, so that the cross-correlation is actually computed in the signal processing system by the following formula:
Rf,g[n]=IFFT{F[n]G*[n]}
where F[n] and G[n] are the discrete Fourier transforms of f[n] and g[n], G*[n] denotes the conjugate of the sequence G[n], and IFFT{ } denotes the inverse Fourier transform operation.
Because the function reaches its peak when the sequences are most similar, the delay difference between the sequences can be obtained by locating the peak position of the cross-correlation result. For example, for two sound frame signals f[n] and g[n] with frame length N, if their cross-correlation function Rf,g[m] has its maximum at point m = L, i.e. g[n] shifted left by L points is most similar to f[n], and the time interval corresponding to one sampling point is 1/fs, then the delay between the two sequences is solved as:
Δt = L/fs
Fig. 5 shows the cross-correlation result of two sound sequences with frame length 512 and sampling rate 40.96 kHz; the actual delay difference is 1.2489 × 10^-4 s, and the delay Δt solved by the algorithm matches this value well, showing that the calculated result agrees with the actual one.
Step 7, DOA estimation
Because the delay difference of the sound signal reaching each microphone depends on the position of the speaker and the geometrical shape of the microphone array, the DOA can be estimated after the delay difference is obtained and the geometrical shape of the microphone array is known.
Let r be the distance from the sound source to the center of the array and L be the length of the array. The highest frequency fmax of normal human speech is around 4 kHz, so the wavelength is about λ = c/fmax ≈ 8.5 cm. According to array antenna theory, to realize a composite beam with low sidelobes or deep nulls, the condition r ≥ 10L²/λ must be satisfied. In practical application scenarios r is between 2 m and 10 m, so the array length can be calculated to be between 0.13 m and 0.29 m. Since the array aperture is very small compared with r, a far-field model can be used, and the DOA at each microphone can be considered equal, as shown in fig. 7.
As can be seen from fig. 6, the delay difference of the sound signal between adjacent microphones is Δt = d·sinθ/c. Since Δt can be obtained as a concrete value through the cross-correlation operation, and the array element spacing d and the sound velocity c are known, the arrival angle at each array element is θ = arcsin(c·Δt/d), and the results are finally averaged.
Figure 7 shows that the algorithm solves for an angle of arrival of 20.6 degrees at an actual DOA (relative to the geometric center of the array) of 20.2 degrees. The deviation between the solution value and the actual value is small, and the reliability of the algorithm is proved.
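A sketch of the angle computation for a uniform linear array; the element spacing d = 0.06 m is an assumed example value within the array lengths derived above, not a value prescribed by the invention.

import numpy as np

C = 340.0   # speed of sound, m/s
D = 0.06    # assumed spacing between adjacent array elements, m

def estimate_doa(adjacent_delays) -> float:
    """theta = arcsin(c * dt / d) for each adjacent pair of elements, then averaged (degrees)."""
    args = np.clip(C * np.asarray(adjacent_delays, dtype=float) / D, -1.0, 1.0)
    return float(np.degrees(np.mean(np.arcsin(args))))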
Step 8, weight coefficient generation
After the arrival angle has been solved, according to array antenna theory, phase compensation must be applied to the signal at each array element through weighting in order to align the beam with the signal at that angle and suppress signals arriving from other directions. In existing digital beamforming technology there are many ways of generating weight coefficients to implement spatial filtering, such as the LCMV filtering algorithm, which can enhance the signal in one direction while simultaneously placing nulls in another, interfering direction, or the sample matrix inversion (SMI) method; the choice of the weight-coefficient generation method is therefore very free.
The basic array antenna spatial filtering algorithm is explained in detail below.
When the arrival angle is θ, the phase difference between the sound signal at array element k of the uniform linear array and that at the reference array element is:
Δφk = 2π·k·d·sinθ/λest
At this time, the weighting coefficient taken at array element k is:
wk = exp(-jΔφk)
i.e. the main lobe of the beam is aligned with the direction of the arrival angle θ.
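A sketch of this weight generation for a 4-element uniform linear array; the element spacing is again the assumed 0.06 m, and the sign convention matches the phase compensation given above.

import numpy as np

def steering_weights(theta_deg: float, wavelength: float,
                     num_elements: int = 4, d: float = 0.06) -> np.ndarray:
    """Weights w_k = exp(-j * 2*pi * k * d * sin(theta) / lambda_est), k = 0..N-1."""
    k = np.arange(num_elements)
    phase = 2 * np.pi * k * d * np.sin(np.radians(theta_deg)) / wavelength
    return np.exp(-1j * phase)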
Fig. 8 shows, under the example algorithm, the directional gain patterns of the array produced by the weighting coefficients for θ = 0 degrees and θ = 30 degrees, respectively. It can be seen that the prediction fits the actual behavior well.
In this example, the wide main lobe of the 4-element linear array compensates for deviations in the wavelength estimate. In a conference or teaching scene only one speaker talks at a time, and the speaker's voice is the loudest sound in his or her vicinity, so the wide main lobe does not impair the spatial filtering effect. A larger or more elaborate array would be more effective, but this brings little additional benefit in a conference or classroom environment.
Step 9, weighted summation
The sound frame signal of each array element is weighted by the weight coefficient of that array element calculated in step 8, and the weighted frames are then summed.
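One reasonable realization of this step is sketched below: the complex weight of each element is applied to the analytic (Hilbert) signal of its sound frame so that the weight acts as a phase shift, and the channels are then summed; the invention only specifies "weight, then sum", so the analytic-signal detail is our assumption.

import numpy as np
from scipy.signal import hilbert

def weighted_sum(frames: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """frames: (num_elements, frame_length) real samples; weights: (num_elements,) complex."""
    analytic = hilbert(frames, axis=1)                    # per-channel analytic signal
    return np.real((weights[:, None] * analytic).sum(axis=0))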
Fig. 9 compares the effect of the weighted sum with that of a direct sum; the sound source is a sample recording of human laughter provided by MATLAB. The left column shows the time-domain waveform of the signal, and the right column shows the spectrum of the complete signal; the top row is the original digital signal, the middle row is the digital signal processed by the classroom speaker sound following and enhancing system based on the microphone array, and the bottom row is the digital signal obtained by direct addition, without any processing, using the same microphone array. It is apparent from the figure that the system proposed by the invention can effectively enhance the signal while maintaining the original shape of the sound signal spectrum, whereas the direct addition without any processing, although it also enhances the signal, produces significant distortion in the spectrum of the sound signal and results in poor sound quality from the loudspeaker.
In addition, because the system uses a microphone array and the final result is a weighted sum over the array elements, the whole system will not fail outright when a single microphone is damaged, so the self-healing capability is strong.
Step 10, D/A conversion
And converting the weighted and summed digital signals into analog signals, and outputting the analog signals to a loudspeaker to realize the enhanced playing of the voice of the speaker.
Step 11, determining whether to close the system according to the session status
If the conversation is not finished, the system processes the next frame signal immediately; if the conversation is over, the manager can close the system power supply and stop the system operation.