TW201303854A - A method for noise reduction and speech enhancement

Info

Publication number: TW201303854A
Application number: TW100123330A
Authority: TW
Inventors: Ming-Sian R Bai; Ying-Ting Liou; Chen-Yi Kuei; Wei-Chih Hsu
Original assignee: Univ Nat Chiao Tung
Priority date: 2011-07-01
Filing date: 2011-07-01
Publication date: 2013-01-16
Also published as: TWI517143B

Abstract

A method for noise reduction and speech enhancement includes the steps of: providing at least two microphones to receive at least two microphone signals; transforming the microphone signals into the frequency domain by fast Fourier transform (FFT) to obtain a speech signal and a noise signal therein; calculating the angle between the speech signal and the noise signal, and obtaining an interaural time difference (ITD) by applying a phase difference estimation algorithm; calculating a threshold value of the ITD corresponding to the angle; applying a binary mask principle to the ITD and its threshold value to separate the speech signal from the noise signal; and transforming the speech signal back into the time domain to obtain a purified speech signal, thereby removing the noise signal and enhancing speech quality.

Description

Method for eliminating noise and improving speech recognition rate

The present invention relates to a method for eliminating microphone noise, and in particular to a method that eliminates noise and reduces echo interference during communication so as to effectively improve the speech recognition rate.

In general, microphones receive sound signals in either a single-channel or a dual-channel configuration. Single-channel noise reduction requires estimating a noise reduction ratio, whereas dual-channel sensing mostly relies on beamforming, in which an array is used to produce a directional microphone system.

Such a microphone system is more sensitive to the human voice, so it picks up sound from the direction of the speaker and is less sensitive to background noise. However, because the system contains two or more microphones, the beam it forms is rather wide, which easily leads to insufficient directivity.

Noise cancellation devices currently used for mobile phone communication in vehicles or ordinary indoor settings mostly rely on a large number of microphones, various filters, and large matrix operations. The heavy computational load, large memory footprint, and numerous microphones place a considerable burden on hardware cost.

Furthermore, because of insufficient directivity, neither the products on the market nor the patents and literature on microphone arrays can effectively eliminate noise in a noisy environment without distorting the speech.

In addition, ordinary mobile phones and in-vehicle communication devices often suffer from excessive echo during a call, which degrades communication quality.

Therefore, providing a microphone sound-pickup method that effectively eliminates environmental noise and improves speech quality is one of the problems that those skilled in the art urgently need to solve.

The main object of the present invention is to provide a method for eliminating noise and improving the speech recognition rate that uses the golden-section search together with Taylor's theorem to compute an optimal interaural time difference threshold, so that the speech signal at every angle attains the best speech quality.

Another object of the present invention is to provide a method for eliminating noise and improving the speech recognition rate that uses a composite echo cancellation system to filter out the main acoustic echo and environmental disturbances of the speech signal, thereby removing the echo produced during communication and further improving speech quality.

To achieve the above objects, the present invention relates to a method for eliminating noise and improving the speech recognition rate, comprising the following steps: providing two or more microphones to receive at least two microphone signals; transforming the microphone signals into the frequency domain by fast Fourier transform to obtain a speech signal and a noise signal therein; calculating the angle between the speech signal and the noise signal, and using a phase difference algorithm to find an interaural time difference; calculating a threshold for the interaural time difference according to the angle between the speech signal and the noise signal; applying a masking rule according to the interaural time difference and the threshold to obtain the speech signal and remove the noise signal; and converting the speech signal to the time domain for output using an inverse fast Fourier transform and overlap-add module.

The present invention further relates to a method for eliminating noise and improving the speech recognition rate, comprising the following steps: providing two or more microphones to receive at least two microphone signals; transforming the microphone signals into the frequency domain by fast Fourier transform to obtain a speech signal and a noise signal in the microphone signals; calculating the angle between the speech signal and the noise signal, and, according to this angle, using a phase difference algorithm together with mask estimation to obtain the speech signal in the microphone signals and remove the noise signal; converting the speech signal to the time domain for output using an inverse fast Fourier transform and overlap-add module; and connecting a composite echo cancellation system in series after the speech signal converted back to the time domain to filter out acoustic disturbances of the speech signal.

The objects, technical content, features, and effects of the present invention will be more readily understood from the following detailed description of specific embodiments in conjunction with the accompanying drawings.

The present invention provides a method for eliminating noise and improving the speech recognition rate that uses the phase difference between two microphones to obtain a mask of the microphone signals in the time and frequency domains, eliminating noise and thereby improving speech quality.

Please refer to FIG. 1, a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to an embodiment of the present invention. It comprises a microphone array (including at least two microphones 14, 14'), at least two fast Fourier transform modules 16, 16', a computing module 18, a mask estimation module 20, and an inverse fast Fourier transform and overlap-add module 22.

Please refer to FIG. 2, a flow chart of the steps of a method for eliminating noise and improving the speech recognition rate according to an embodiment of the present invention. The following description of this embodiment should be read with reference to FIGS. 1 and 2.

In step S202, after the sounds of the speech source 10 and the noise source 12 are emitted, the microphones 14, 14' receive microphone signals that contain both the noise signal and the speech signal.

Next, in step S204, the fast Fourier transform modules 16, 16' convert the microphone signals received by the microphones 14, 14' to the frequency domain to obtain the speech signal and the noise signal in the microphone signals.

Next, in step S206, the computing module 18, which is connected to the microphones 14, 14', calculates the angle between the speech signal and the noise signal in the microphone signals. Based on this angle, the computing module 18 uses a phase difference algorithm to find the interaural time difference (ITD).

In step S208, after finding the ITD, the computing module 18 further calculates the ITD threshold corresponding to each angle between the noise signal and the speech signal.

Next, in step S210, the mask estimation module 20 applies a masking rule based on the calculated ITD and threshold to obtain the speech signal and remove the noise signal.

Finally, in step S212, the inverse fast Fourier transform and overlap-add module 22 converts the speech signal from the frequency domain back to the time domain, yielding a speech signal with the noise removed and a higher speech recognition rate.

In step S204, after the noise signal and the speech signal are received by the microphones 14, 14', the fast Fourier transform modules 16, 16' apply a Hamming window and a fast Fourier transform (FFT) to convert them to the frequency domain. The two microphone signals $P_1(k,l)$ and $P_2(k,l)$ are given by equations (1) and (2):

where $(k,l)$ denotes the $k$-th frequency bin and the $l$-th frame, $X$ is the speech signal, $N_i$ is the $i$-th noise source, $P_m$ is the signal received by the $m$-th microphone, $\omega_k = 2\pi k/N$ with $0 \le k \le N/2-1$, and $N$ is the FFT length.
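The windowing and transform step can be illustrated with a minimal sketch. It is not the patent's implementation: it substitutes a naive DFT for the FFT (same result, slower) so that it runs on the Python standard library alone, and the function names `hamming` and `frame_spectrum` are hypothetical.

```python
import cmath
import math


def hamming(n_len):
    """Symmetric Hamming window of length n_len."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def frame_spectrum(frame):
    """Window one time-domain frame and return DFT bins k = 0 .. N/2 - 1.

    A naive O(N^2) DFT stands in for the FFT; the bin frequencies are
    omega_k = 2*pi*k/N, matching the definition in the text.
    """
    n_len = len(frame)
    windowed = [w * x for w, x in zip(hamming(n_len), frame)]
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_len)
            for n, x in enumerate(windowed))
        for k in range(n_len // 2)
    ]
```

Applying `frame_spectrum` to successive frames of each microphone signal would give per-frame spectra playing the role of $P_1(k,l)$ and $P_2(k,l)$.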

Next, in step S206, the computing module 18 calculates the angle between the noise signal and the speech signal in the two microphone signals $P_1(k,l)$ and $P_2(k,l)$, that is, the angle between the speech source 10 and the noise source 12, in order to find the interaural time difference (ITD).

In general, if the speech signal is assumed to arrive from directly in front of the microphones, its ITD is 0. The ITD of noise arriving from other directions is denoted $d_i(k,l)$; it depends on both time and frequency. If a time-frequency bin $(k_j, l_j)$ is dominated by a single strongest interferer, equations (1) and (2) simplify to equations (3) and (4):

The ITD can then be obtained by computing the phase difference between the two microphone signals, as in equation (5):
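A minimal sketch of a phase-difference ITD estimate for a single time-frequency bin follows. It assumes the second microphone sees the dominant source with an extra phase of $-\omega_k d$, and it resolves the $2\pi$ ambiguity by taking the principal value of the phase difference, which is only valid while $|\omega_k d| < \pi$. The function name and the choice of samples as the unit are illustrative assumptions, not taken from the source.

```python
import cmath
import math


def itd_from_phase(p1, p2, k, n_fft):
    """Estimate |ITD| (in samples) for bin k from two complex spectra values.

    p1, p2: complex values of the two microphone spectra at the same bin.
    k: frequency bin index; n_fft: FFT length, so omega_k = 2*pi*k/n_fft.
    """
    if k == 0:
        return 0.0  # the DC bin carries no usable phase information
    omega_k = 2 * math.pi * k / n_fft
    # phase(p1 * conj(p2)) = angle(p1) - angle(p2), wrapped to [-pi, pi]
    dphi = cmath.phase(p1 * p2.conjugate())
    return abs(dphi) / omega_k
```

A source directly in front of the array gives identical phases at both microphones and therefore an estimate of zero, consistent with the text.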

Next, in step S208, the computing module 18 further calculates the ITD threshold corresponding to the angle between the noise signal and the speech signal. According to an embodiment of the present invention, the computing module 18 uses the golden-section search (GSS) together with Taylor's theorem to find the optimal threshold $\tau$ for each angle.

Assume a function $f(x)$ is continuous on $[a,b]$ and has exactly one minimum there. Two points $c$ and $d$ are selected in $[a,b]$, related as in equation (9):

where $d$ is the point symmetric to $c$ on the segment. The values $f(c)$ and $f(d)$ are compared: if $f(c) < f(d)$, the new search range becomes $[a,d]$; otherwise it becomes $[c,b]$. A further point is then taken in the new range, the two interior points are compared again, and the step is repeated so that the range keeps shrinking. Once the range is acceptably small, it is taken as the minimum of $f(x)$ on $[a,b]$. According to Taylor's theorem, when $x$ is close to $x_m$, the value of $f(x)$ is approximately:

If $f(x)$ is close enough to $f(x_m)$, the trailing second-derivative term is negligibly small, so equation (10) can be expressed as equation (11):

where $\varepsilon$ is $10^{-3}$. Using speech distortion, the degree of noise reduction, and overall speech quality as the parameters of the function in the golden-section search, the threshold $\tau$ is obtained as a function of the angle, as in equation (12):

$$\tau(i) = (-7.76\times10^{-5})\,i^{2} + (1.69\times10^{-2})\,i - (5.45\times10^{-2}) \tag{12}$$

where $i$ is the angle between the speech signal and the noise signal. The threshold $\tau$ corresponding to the angle $i$ gives the processed signal the best speech quality.
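The search procedure and the fitted polynomial of equation (12) can be sketched as follows. Several points are assumptions rather than statements from the source: the stopping rule here is the interval width falling below `eps` (the source's own criterion, equation (11), is not reproduced), the objective `f` is whatever quality cost the caller supplies, and the angle is taken to be in degrees. The function names are hypothetical.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # reciprocal of the golden ratio, about 0.618


def golden_section_min(f, a, b, eps=1e-3):
    """Locate the minimiser of a unimodal function f on [a, b].

    c and d are placed symmetrically inside [a, b]; if f(c) < f(d) the
    interval shrinks to [a, d], otherwise to [c, b], as in the text.
    """
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while (b - a) > eps:
        if fc < fd:
            b, d, fd = d, c, fc            # keep [a, d]; old c becomes new d
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd            # keep [c, b]; old d becomes new c
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


def tau_threshold(angle):
    """Evaluate the fitted threshold polynomial of equation (12)."""
    return -7.76e-5 * angle ** 2 + 1.69e-2 * angle - 5.45e-2
```

Because each iteration reuses one of the two interior points, only one new function evaluation is needed per step, which matters when each evaluation means processing and scoring speech.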

Therefore, once the optimal ITD threshold $\tau$ has been obtained, in step S210 the mask estimation module 20 estimates the mask signal of the microphone signals from equation (6) according to the binary mask principle:

Only signals whose ITD is smaller than $\tau$ are regarded as the target speech signal.

The final speech signal $S(k,l)$ is obtained by multiplying the average of the two microphone signals at $(k,l)$ by the mask signal $B(k_j,l_j)$, as in equations (7) and (8):
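The masking and averaging described here can be sketched for one frame as below. The sketch assumes a strict 0/1 mask with rejected bins set to zero; whether equation (6) instead keeps a small floor value for rejected bins is not recoverable from the text. The function name is hypothetical.

```python
def apply_binary_mask(p1, p2, itd, tau):
    """Return the masked, averaged spectrum for one frame.

    p1, p2: lists of complex bins from the two microphones.
    itd:    list of |ITD| estimates, one per bin.
    tau:    ITD threshold; bins with itd < tau are kept as target speech.
    """
    out = []
    for a, b, d in zip(p1, p2, itd):
        mask = 1.0 if d < tau else 0.0   # binary mask B
        out.append(mask * (a + b) / 2)   # mask times the two-channel average
    return out
```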

After the speech signal has been obtained in step S210 and thus separated from the noise signal, in step S212 the inverse fast Fourier transform and overlap-add module 22 converts this frequency-domain speech signal into a time-domain output signal by inverse fast Fourier transform (IFFT) and the overlap-add method (OLA), yielding a speech signal with the noise removed and a higher speech recognition rate.
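A minimal sketch of the resynthesis step, under stated assumptions: a naive inverse DFT stands in for the IFFT, each frame is given as a full set of N bins with conjugate symmetry so that the result is real, and the hop size is chosen by the caller. The function names are hypothetical.

```python
import cmath
import math


def inverse_dft(bins):
    """Inverse DFT of a full N-bin spectrum; returns the real part per sample."""
    n_len = len(bins)
    return [
        (sum(b * cmath.exp(2j * math.pi * k * n / n_len)
             for k, b in enumerate(bins)) / n_len).real
        for n in range(n_len)
    ]


def overlap_add(frames, hop):
    """Sum equal-length time-domain frames placed `hop` samples apart."""
    if not frames:
        return []
    n_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for n, v in enumerate(frame):
            out[start + n] += v
    return out
```

With a 50% hop, consecutive frames overlap by half their length, so every interior output sample is the sum of two frame contributions.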

Please refer to FIG. 3, a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to another embodiment of the present invention. As shown in FIG. 3, within the architecture proposed by the present invention, the inverse fast Fourier transform and overlap-add module 22 can additionally be connected to an automatic speech recognition module 24, which receives the speech signal output by the module 22 and performs speech recognition.

Furthermore, for the case where the sound source is not directly in front of the microphone array, the present invention also proposes a beam-steering technique, which controls the beam steering angle of the microphones by adding a different delay to each microphone so that the beam is steered toward the position of the sound source.

Assuming the steering angle is $\theta_M$, the frequency-domain filter factor for beam steering is given by equation (13):

where $n$ denotes the $n$-th microphone, $\omega$ is the frequency variable, $f_s$ is the sampling frequency, and $d$ is the microphone spacing. In the time domain, this filter can be written as a delay, as in equation (14):

Because this delay is not an integer, Lagrange interpolation must be used to realize it more easily. The interpolation can be implemented simply with the infinite impulse response (IIR) filter of equation (15):

where $N$ is the order of the filter (first order is used here) and $D$ is the fractional part of the delay.
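As an illustration of a non-integer delay, the sketch below uses the common FIR form of first-order Lagrange interpolation, which is plain linear interpolation between two neighbouring samples. This is a swapped-in technique: the filter of equation (15) is not reproduced here and the source describes it as IIR, so the code should be read as an assumption about the general idea only. The function name is hypothetical.

```python
def fractional_delay(x, delay):
    """Delay sequence x by a non-negative, possibly non-integer number of samples.

    The delay is split into an integer part and a fractional part D;
    the fractional part is realised as (1 - D) * x[m] + D * x[m - 1].
    Samples before the start of x are treated as zero.
    """
    whole = int(delay)       # integer part of the delay
    frac = delay - whole     # fractional part D, 0 <= D < 1
    out = []
    for n in range(len(x)):
        cur = x[n - whole] if n - whole >= 0 else 0.0
        prev = x[n - whole - 1] if n - whole - 1 >= 0 else 0.0
        out.append((1 - frac) * cur + frac * prev)
    return out
```

Delaying the ramp `[0, 1, 2, 3, 4]` by 1.5 samples yields values that sit halfway between the ramp delayed by one and by two samples.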

According to an embodiment of the present invention, the beam steering angle ranges from 0 to 180 degrees. That is, after the microphone array receives the microphone signals, the microphones first perform beam steering over the full range (0° to 180°). After each steering step, spectral analysis is performed to compute the ITD, and the masking rule of equation (6) is applied to retain the target sound source and suppress interference. After this speech purification process, the beam energy of each microphone at each steering angle is computed in order to detect the direction of the speech source (direction of arrival estimation, DOA).

The reason is that the energy should be greatest when the microphones are steered toward the actual direction of the sound source (because the energy of the target source passes the masking rule of equation (6)), which allows the correct source direction to be determined. The energy can be computed as in equation (16):

where the first quantity, indexed by $(k,l)$, is the two-channel signal after purification by the phase difference algorithm; $k$ and $l$ are the frequency and time indices, respectively; $e^{-jk\lambda}$ is the beam-steering filter as a function of frequency; and $\lambda$ must lie between the minimum and maximum delay times, as in equation (17):
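The angle scan described above reduces to picking the steering angle whose purified output carries the most energy. The sketch below assumes the steering and masking have already been applied for each candidate angle, and that the results are passed in as a mapping from angle in degrees to a list of frames of complex bins. Both function names and this data layout are hypothetical.

```python
def beam_energy(spectrum):
    """Total energy of a purified spectrum given as frames of complex bins."""
    return sum(abs(v) ** 2 for frame in spectrum for v in frame)


def estimate_doa(purified_by_angle):
    """Return the steering angle whose purified spectrum has maximum energy.

    purified_by_angle: dict mapping steering angle (degrees) to the
    spectrum obtained after steering to that angle and applying the mask.
    """
    return max(purified_by_angle,
               key=lambda angle: beam_energy(purified_by_angle[angle]))
```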

Please refer to FIG. 4, a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to another embodiment of the present invention. It comprises a microphone array (including at least two microphones 14, 14'), at least two fast Fourier transform modules 16, 16', the computing module 18, the mask estimation module 20, the inverse fast Fourier transform and overlap-add module 22, a fixed filter 26, and an adaptive filter 28.

Please refer to FIG. 5, a flow chart of the steps of a method for eliminating noise and improving the speech recognition rate according to another embodiment of the present invention. The following description of this embodiment should be read with reference to FIGS. 4 and 5.

Steps S502 to S508 are the same as steps S202 to S212 of the previous embodiment and are not repeated here. Notably, this embodiment further includes step S510: connecting a composite echo cancellation system (the fixed filter 26 and the adaptive filter 28) in series after the speech signal converted back to the time domain, so that the composite echo cancellation system filters out acoustic disturbances of the speech signal.

Specifically, when the user talks with a remote third party, the system includes a loudspeaker 30, and the remote third party produces far-end audio 32, the loudspeaker 30 and the microphones 14, 14' form a fixed echo path. The present invention connects a fixed filter 26 and an adaptive filter 28 in series after the inverse fast Fourier transform and overlap-add module 22 to form the composite echo cancellation system.

In this embodiment, the fixed filter 26 filters out the main acoustic echo of the speech signal, while the adaptive filter 28 filters out the effects of disturbances to the speech signal in the surrounding environment. For example, the dynamic characteristics of the system can be identified off-line before the fixed filter is added to the system. The adaptive algorithm of this composite system may be, but is not limited to, the filtered-x LMS (FXLMS) algorithm, which includes: the complete echo cancellation path given by equation (18):

where $f(n)$ is the fixed filter and $w(n)$ is the adaptive filter, with $w_0(n) = 1$, $\Delta\mathbf{w}(n) = [w_1(n)\ w_2(n)\ \dots\ w_{L-1}(n)]$, and $\delta(n)$ the unit impulse sequence; computing the error signal as in equation (19):

$$e(n) = d(n) - y(n) = d(n) - \mathbf{w}^{T}(n)\,[f(n) * x(n)] \tag{19}$$

where $d(n)$ is the signal received by the microphone, $y(n)$ is the filter output signal, $\mathbf{w}(n) = [w_1(n)\ w_2(n)\ \dots\ w_{L-1}(n)]^{T}$ is the vector of adaptive filter coefficients at time $n$, and $\mathbf{x}(n) = [x(n)\ x(n-1)\ \dots\ x(n-L+1)]^{T}$ is the input signal vector at time $n$; and updating the adaptive filter with the FXLMS algorithm according to equation (20):
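The structure of equation (19) can be sketched as a fixed pre-filter followed by an LMS-adapted filter. The update rule used below, $\mathbf{w} \leftarrow \mathbf{w} + \mu\, e(n)\, \mathbf{x}'(n)$ with $x'(n) = f(n) * x(n)$, is the standard LMS step and is an assumption, since equation (20) is not reproduced in the text. The step size `mu`, the zero initialisation, and the choice to adapt every tap (the source fixes $w_0 = 1$) are likewise simplifications, and the function names are hypothetical.

```python
def convolve_causal(h, x):
    """y[n] = sum_k h[k] * x[n - k], truncated to len(x) samples."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def cancel_echo(x, d, f, num_taps, mu):
    """Run a fixed-plus-adaptive echo canceller over whole signals.

    x: far-end reference signal; d: microphone signal containing the echo.
    f: fixed filter coefficients (identified off-line).
    Returns the error signal e(n) and the final adaptive coefficients.
    """
    xf = convolve_causal(f, x)            # x'(n) = f(n) * x(n)
    w = [0.0] * num_taps
    errors = []
    for n in range(len(d)):
        # Most recent num_taps samples of x'(n), newest first.
        buf = [xf[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = d[n] - y                      # equation (19)
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]  # assumed LMS step
        errors.append(e)
    return errors, w
```

In a practical system the update line would be skipped whenever double talk is detected, which is the purpose of the detector described next.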

It is worth noting, however, that when the remote third party produces the far-end audio 32, the signal received by the microphones 14, 14' is no longer only the acoustic echo produced by the system, which causes the adaptive filter 28 to diverge. For this reason, as shown in FIG. 6, the present invention further includes steps S602 to S608 before the adaptive filter 28 is updated.

In steps S602 to S604, a double-talk detector (DTD) in the system detects whether the speech signal produced by the speech source 10 and the far-end audio 32 produced by the remote third party occur at the same time. Then, in step S606, if the two occur simultaneously (that is, the user and the remote third party speak at the same time), updating of the adaptive filter 28 is stopped. Otherwise, in step S608, if the two do not occur simultaneously (that is, the user and the remote third party do not speak at the same time), the adaptive filter 28 continues to be updated.

Specifically, the present invention mainly compares the signal received by the microphones 14, 14' with the signal output by the adaptive filter 28. Because comparing energy levels directly would switch the adaptive filter 28 on and off too abruptly, the envelopes $v_d(n)$, $v_x(n)$, and $v_y(n)$ of the microphone signal $d(n)$, the fixed filter output signal $x'(n)$, and the adaptive filter output signal $y(n)$ are computed according to equations (21) to (23), with $\alpha = 0.99$:

$$v_x(n) = \alpha\,v_x(n-1) + (1-\alpha)\,|x(n)| \tag{21}$$

$$v_d(n) = \alpha\,v_d(n-1) + (1-\alpha)\,|d(n)| \tag{22}$$

$$v_y(n) = \alpha\,v_y(n-1) + (1-\alpha)\,|y(n)| \tag{23}$$
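Equations (21) to (23) share one recursion, a first-order smoother applied to the magnitude of a signal. A minimal sketch follows, assuming the envelope starts from zero (the initial condition is not stated in the source); the function name is hypothetical.

```python
def envelope(signal, alpha=0.99):
    """Track the envelope v(n) = alpha * v(n-1) + (1 - alpha) * |s(n)|."""
    v = 0.0          # assumed initial condition v(-1) = 0
    out = []
    for s in signal:
        v = alpha * v + (1 - alpha) * abs(s)
        out.append(v)
    return out
```

With $\alpha = 0.99$ the envelope responds over roughly a hundred samples, which smooths out the rapid on/off switching that a direct energy comparison would cause.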

Then, according to equations (24) and (25), the detection function $\xi(n)$ and the dynamic threshold function $T(n)$ are obtained from $v_d(n)$, $v_x(n)$, and $v_y(n)$. When the detection function $\xi(n)$ is greater than the dynamic threshold function $T(n)$, the remote third party is producing the far-end audio 32, and the adaptive filter 28 stops updating accordingly.

Only when the detection function $\xi(n)$ is less than the dynamic threshold function $T(n)$ does the adaptive filter 28 continue to update. The experimentally determined optimum value of $\gamma$ is 0.05; the small positive real number $\beta$ is added as a margin to guard against detection errors.

In summary, the present invention proposes a method for eliminating noise and improving the speech recognition rate that implements acoustic signal processing in a telecommunication system. The method not only uses the phase difference between two microphones to obtain the sound source angle and thereby determine the beam opening width, improving the speech recognition rate, but also automatically detects the sound source position through beam steering. Beam steering also handles the case where the speech signal is not on the main axis.

The method proposed by the present invention can also be applied to speech barge-in systems in combination with the composite echo cancellation system, effectively reducing the interference of echo with the recognition rate. The system is suitable for devices that require a speech recognition system, such as mobile phones and smart toys, so that the recognition system achieves a good recognition rate even in spaces with severe noise and reverberation.

The embodiments described above merely illustrate the technical ideas and features of the present invention. Their purpose is to enable those skilled in the art to understand and practice the invention; they are not to be construed as limiting its patent scope. All equivalent changes or modifications made in accordance with the spirit disclosed by the present invention shall remain within its patent scope.

10: Speech source

12: Noise source

14, 14': Microphone

16, 16': Fast Fourier transform module

18: Computing module

20: Mask estimation module

22: Inverse fast Fourier transform and overlap-add module

24: Automatic speech recognition module

26: Fixed filter

28: Adaptive filter

30: Loudspeaker

32: Far-end audio

FIG. 1 is a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to an embodiment of the present invention.

FIG. 2 is a flow chart of the steps of a method for eliminating noise and improving the speech recognition rate according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to another embodiment of the present invention.

FIG. 4 is a schematic diagram of a microphone array that eliminates noise and improves the speech recognition rate according to another embodiment of the present invention.

FIG. 5 is a flow chart of the steps of a method for eliminating noise and improving the speech recognition rate according to another embodiment of the present invention.

FIG. 6 is a flow chart of the steps performed before the adaptive filter is updated, according to FIG. 5.

Claims

A method for eliminating noise and improving speech recognition rate, comprising the steps of: providing two or more microphones for receiving at least two microphone signals; converting the microphone signals to a frequency domain by using fast Fourier to obtain the microphone signals a voice signal and a noise signal; calculating an angle between the voice signal and the noise signal, and using a phase difference algorithm to further find an inter-ear time difference; calculating the angle according to the voice signal and the noise signal a threshold value of the time difference between the ears; according to the time difference between the ear and the threshold, a masking rule is used to obtain the voice signal to remove the noise signal; and the voice signal is subjected to an inverse fast Fourier transform and superposition mode The group goes to the time domain output.

A method for eliminating noise and improving a speech recognition rate as described in claim 1, wherein the calculation of the threshold is performed by using a Golden-Section Search with Taylor's theory.

The method of claim 2, which can eliminate noise and improve speech recognition rate, wherein the golden ratio search method selects two points in a continuous range, and compares the function value of one of the two points to reduce the continuous range. And repeating the steps of optionally selecting two points and comparing the function values to continue narrowing the continuous range to find a minimum value of the function value in the continuous range, the threshold value being obtained by using the minimum value in conjunction with Taylor's theory.

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein the masking rule further comprises the steps of: comparing the interaural time difference with the threshold value to obtain a masking signal; and multiplying the average of the microphone signals by the masking signal to obtain the voice signal in the microphone signals.

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein the inverse fast Fourier transform and overlap-add module converts the voice signal in the frequency domain into a time-domain signal by using an inverse fast Fourier transform and an overlap-add method.
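
The overlap-add step can be illustrated with a minimal sketch. It assumes the frames have already been inverse-transformed to the time domain, share one length, and are spaced by a fixed hop size; the function name `overlap_add` and the parameter `hop` are hypothetical and do not come from the patent.

```python
def overlap_add(frames, hop):
    """Sum equal-length time-domain frames at offsets of `hop` samples.

    Illustrative sketch: the frames are assumed to be the outputs of an
    inverse FFT, one frame per analysis window.
    """
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for n, sample in enumerate(frame):
            # Overlapping regions of adjacent frames add together.
            out[start + n] += sample
    return out
```

With a hop smaller than the frame length, consecutive frames overlap and their contributions are summed, which is what lets a windowed frame-by-frame analysis be stitched back into a continuous signal.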

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein the interaural time difference is zero when the voice signal is located directly in front of the microphones.

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein the microphone signal is regarded as the voice signal when the interaural time difference is less than the threshold value.
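
The masking rule of the preceding claims can be sketched for two microphones as follows. This is a hedged illustration, not the patented implementation: the per-bin ITD is assumed here to be the inter-microphone phase difference divided by the angular frequency, and the function name `itd_binary_mask`, the argument layout, and the handling of the zero-frequency bin are all assumptions of the sketch.

```python
import cmath
import math


def itd_binary_mask(spec_a, spec_b, bin_freqs_hz, threshold_s):
    """Keep frequency bins whose estimated ITD is below the threshold.

    spec_a, spec_b: complex spectra of one frame from two microphones.
    bin_freqs_hz:   centre frequency of each bin in Hz.
    threshold_s:    ITD threshold in seconds (hypothetical input).
    """
    out = []
    for xa, xb, f in zip(spec_a, spec_b, bin_freqs_hz):
        mean_bin = (xa + xb) / 2.0  # average of the microphone signals
        if f <= 0.0:
            # Assumption of this sketch: the DC bin carries no usable
            # phase-delay information, so it is passed through.
            out.append(mean_bin)
            continue
        phase_diff = cmath.phase(xa * xb.conjugate())
        itd = abs(phase_diff) / (2.0 * math.pi * f)
        # Binary mask: 1 when the bin is treated as voice, else 0.
        mask = 1.0 if itd < threshold_s else 0.0
        out.append(mask * mean_bin)
    return out
```

A source directly in front of the microphones reaches both at the same time, so its bins have a phase difference near zero and pass the mask, while bins dominated by an off-axis source show a larger ITD and are zeroed.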

The method of claim 1, wherein the microphones are arranged in an array.

The method for eliminating noise and improving the speech recognition rate of claim 1, further comprising using an automatic speech recognition module to receive the voice signal output by the inverse fast Fourier transform and overlap-add module and to perform speech recognition.

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein the step of receiving the microphone signals further comprises the step of: adding a delay to each of the microphones to control the beam steering angle of the microphones.

The method for eliminating noise and improving the speech recognition rate of claim 10, wherein after adding the delay to each of the microphones, the method further comprises the step of: calculating the beam energy of the microphones at each steering angle to determine the direction of the sound source of the voice signal.

The method for eliminating noise and improving the speech recognition rate of claim 11, wherein the beam steering angle of the microphones ranges from 0 degrees to 180 degrees.
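
The steering and energy scan of claims 10 to 12 can be illustrated with a narrowband delay-and-sum sketch. Everything specific here is an assumption made for illustration: a uniform linear array, a single analysis frequency, a speed of sound of 343 m/s, a 1-degree scan step, and the function names themselves.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg):
    """Per-microphone delays (s) that steer a uniform linear array."""
    cos_t = math.cos(math.radians(angle_deg))
    return [m * spacing_m * cos_t / SPEED_OF_SOUND for m in range(num_mics)]


def beam_energy(mic_phasors, freq_hz, spacing_m, angle_deg):
    """Energy of the delay-and-sum beam at one steering angle.

    mic_phasors: one complex value per microphone (a single FFT bin).
    """
    delays = steering_delays(len(mic_phasors), spacing_m, angle_deg)
    beam = sum(
        x * cmath.exp(2j * math.pi * freq_hz * tau)
        for x, tau in zip(mic_phasors, delays)
    )
    return abs(beam) ** 2


def estimate_direction(mic_phasors, freq_hz, spacing_m, step_deg=1):
    """Scan 0 to 180 degrees and return the angle of maximum beam energy."""
    angles = range(0, 181, step_deg)
    return max(
        angles,
        key=lambda a: beam_energy(mic_phasors, freq_hz, spacing_m, a),
    )
```

When the steering delays match the true arrival delays, the microphone contributions add in phase and the beam energy peaks, which is why the angle with the largest energy is taken as the source direction. The spacing should stay below half a wavelength at the analysis frequency to avoid ambiguous peaks.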

The method for eliminating noise and improving the speech recognition rate of claim 1, wherein after converting the voice signal to the time domain, the method further comprises the step of: connecting a composite echo cancellation system to the time-domain voice signal to filter out acoustic disturbances in the voice signal.

The method for eliminating noise and improving the speech recognition rate of claim 13, wherein the composite echo cancellation system comprises a fixed filter and an adaptive filter, the fixed filter filtering out the acoustic echo of the voice signal, and the adaptive filter filtering out the disturbance of the voice signal caused by the environment.

The method for eliminating noise and improving the speech recognition rate of claim 14, further comprising the step of: updating the adaptive filter by using a filtered-x least mean squares (FXLMS) algorithm.

The method for eliminating noise and improving the speech recognition rate of claim 15, wherein before updating the adaptive filter, the method further comprises the steps of: providing a double-talk detector (DTD); detecting whether the voice signal coincides with a far-end audio signal; and stopping the adaptation of the adaptive filter when the voice signal coincides with the far-end audio signal.
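
The interaction between the adaptive filter and the double-talk detector can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions: a plain normalized LMS (NLMS) update stands in for the FXLMS algorithm named in the claims, and a Geigel-style level comparison stands in for the unspecified DTD. The fixed filter of the composite system is not modelled, and all names and parameter values are hypothetical.

```python
def geigel_double_talk(mic_sample, far_end_history, ratio=0.5):
    """Flag double talk when the microphone level exceeds a fraction of
    the recent far-end peak (Geigel-style test, assumed here)."""
    peak = max((abs(x) for x in far_end_history), default=0.0)
    return abs(mic_sample) > ratio * peak


def nlms_echo_cancel(far_end, mic, num_taps=8, step=0.5, eps=1e-8, ratio=0.5):
    """Subtract an adaptive echo estimate from the microphone signal.

    NLMS is used in place of FXLMS to keep the sketch short.
    Returns (residual signal, final filter weights).
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate
        residual.append(e)
        if geigel_double_talk(d, history, ratio):
            # Near-end and far-end speech coincide: freeze adaptation
            # so the near-end talker does not corrupt the weights.
            continue
        power = sum(h * h for h in history) + eps
        weights = [w + step * e * h / power for w, h in zip(weights, history)]
    return residual, weights
```

Freezing the update during double talk matters because the error signal then contains near-end speech, and adapting on it would pull the weights away from the true echo path.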

A method for eliminating noise and improving the speech recognition rate, comprising the steps of: providing at least two microphones to receive at least two microphone signals; converting the microphone signals to the frequency domain by a fast Fourier transform to obtain a voice signal and a noise signal in the microphone signals; calculating an angle between the voice signal and the noise signal, and applying a phase difference algorithm together with mask estimation according to the angle to obtain the voice signal in the microphone signals and remove the noise signal; converting the voice signal to the time domain for output by means of an inverse fast Fourier transform and overlap-add module; and connecting a composite echo cancellation system to the time-domain voice signal to filter out acoustic disturbances in the voice signal.

The method for eliminating noise and improving the speech recognition rate of claim 17, wherein the composite echo cancellation system comprises a fixed filter and an adaptive filter, the fixed filter filtering out the acoustic echo of the voice signal, and the adaptive filter filtering out the disturbance of the voice signal caused by the environment.

The method for eliminating noise and improving the speech recognition rate of claim 18, further comprising the step of: updating the adaptive filter by using a filtered-x least mean squares (FXLMS) algorithm.

The method for eliminating noise and improving the speech recognition rate of claim 19, wherein before updating the adaptive filter, the method further comprises the steps of: providing a double-talk detector (DTD); detecting whether the voice signal coincides with a far-end audio signal; and stopping the adaptation of the adaptive filter when the voice signal coincides with the far-end audio signal.