CN1623186A - Voice activity detector and validator for noisy environments

Info

Publication number: CN1623186A
Application number: CNA038026821A
Authority: CN
Inventors: Douglas Ralph Ealey; Holly Louise Kelleher; David John Benjamin Pearce
Original assignee: Motorola Inc
Current assignee: Motorola Mobility LLC; Google Technology Holdings LLC
Priority date: 2002-01-24
Filing date: 2003-01-10
Publication date: 2005-06-01
Anticipated expiration: 2023-01-10
Also published as: KR20040075959A; CN1307613C; GB2384670B; KR20090127182A; WO2003063138A1; GB2384670A; JP2005516247A; FI124869B; KR100976082B1; JP2010061151A; GB0201585D0; FI20041013L

Abstract

A communication unit (100) comprising an audio processing unit (109) with a voice activity detection mechanism (130, 135). A voice activity detection mechanism (130, 135) measures an energy acceleration rate of a signal input into the communication unit (100) and determines from the measurement whether the input signal is speech or noise. A method of detecting speech and a method of deciding whether an input signal is speech or noise are also described. Using energy acceleration rate based voice activity detectors and verifiers offers the advantages of noise robustness, fast response and input speech level independence especially for noisy environments.

Description

Voice Activity Detector and Validator for Noisy Environments

Technical Field

The present invention relates to the detection of speech in noisy environments, commonly referred to as voice activity detection (VAD). The invention is applicable to, but not limited to, energy acceleration rate measurement of voice signals in speech detection systems.

Background

Many voice communication systems, such as the Global System for Mobile Communications (GSM) cellular telephone standard and the Terrestrial Trunked Radio (TETRA) system for personal mobile radio users, use speech processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analog speech pattern into a digital format suitable for transmission. A speech decoder converts a received digital speech signal into an audible analog speech pattern.

Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates under the assumption that speech is present in only part of the audio signal. This assumption is usually correct, since many intervals of an audio signal contain only silence or background noise.

Voice activity detectors can be used for many purposes. These include suppressing overall transmission activity in a transmission system when there is no speech, thereby potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, transmission activity can be restarted.

Voice activity detectors can also be used in conjunction with speech storage devices, to distinguish portions of the audio that contain speech from "no speech" portions. The portions containing speech are then stored in the storage device and the "no speech" portions are discarded.

Existing methods for detecting speech are based, at least in part, on detecting and estimating the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with limited processing resources. Their main disadvantage is that background noise may inadvertently cause "speech" to be detected when no "speech" is actually present. Conversely, "speech" that is actually present may go undetected because it is obscured by background noise.

Some methods for detecting speech activity are aimed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision is made. Because the method is used with different speakers and in different environments, the frequency spectrum and noise level may vary. The input filter and threshold are therefore usually adaptive, in order to track these variations.

Examples of these methods are provided in the GSM specification 06.42 Voice Activity Detector (VAD) for half-rate, full-rate and enhanced full-rate speech traffic channels respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in Annex B of ITU G.729. These methods are accurate in noisy environments but are complex to implement.

All of these methods require the speech signal as input. Some applications that employ speech decompression schemes need to perform speech detection during the speech decompression process.

European Patent Application No. EP-A-0785419, by Benyassine et al., relates to a method for voice activity detection comprising the following steps:

(i) extracting a predetermined set of parameters from the incoming speech signal for each frame, and

(ii) making a frame voicing decision for the incoming speech signal for each frame, according to a set of deviation measures extracted from the predetermined set of parameters.

VADs in cellular systems are biased to ensure that, when a party speaks, the radio equipment, including the speech codec and RF circuitry, is activated to convey that speech to the other party in the presence of background noise and other impairments. However, this results in data being transmitted when the party is not speaking. The cost of this approach is slightly reduced battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.

In these systems there is no concept of a limited resource being available for a duplex call. The uplink and downlink, usually on different carriers, can both use their entire bandwidth simultaneously.

It is known in the field of this invention that some voice activity or voice onset detectors (VAD/VOD) attempt to use speech characteristics such as harmonic structure (e.g. via autocorrelation) to identify voiced speech. In noise, however, these structural indicators may fail, either because the speech structure is corrupted or because the noise itself has structure. Examples are engine, tire or air-conditioning noise in a car. Finally, these methods are weak at detecting unvoiced speech.

The alternative is simply to use frame energy levels to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to indicate speech. However, this approach fails in many real noise conditions.

For non-normalized databases, or in practical applications, the noise level in one set of examples may well be higher than the speech level in another set, which makes it impossible to set a fixed threshold. The existing way of overcoming this problem is to average approximately the first 100 milliseconds of an utterance, assume that this represents noise, and so create a threshold specific to that utterance. However, this is again insufficient for non-stationary noise, where the noise may deviate rapidly from the initial estimate, where the noise has high variance, or where the first few frames actually contain speech rather than the assumed noise.
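The utterance-specific threshold described above can be sketched as follows. This is a minimal illustration only: the function name, the 10 millisecond frame length and the `margin` factor are assumptions made for the example and do not come from any cited method.

```python
from typing import List, Sequence


def energy_threshold_vad(frame_energies: Sequence[float],
                         frame_ms: float = 10.0,
                         noise_ms: float = 100.0,
                         margin: float = 2.0) -> List[bool]:
    """Flag frames whose energy exceeds a threshold derived from the
    first `noise_ms` of the utterance, which is assumed to be noise."""
    if not frame_energies:
        return []
    n_noise = max(1, int(noise_ms / frame_ms))
    head = frame_energies[:n_noise]
    noise_level = sum(head) / len(head)          # assumed noise estimate
    # `margin` is a hypothetical multiplier placing the threshold above the noise.
    return [energy > noise_level * margin for energy in frame_energies]
```

If the background noise later rises above the fixed threshold, every subsequent frame is flagged as speech, which is the weakness with non-stationary noise noted above.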

There is therefore a need for an improved voice activity detector and verifier for noisy environments, in which the above disadvantages are alleviated.

Summary of the Invention

According to a first aspect of the present invention, there is provided a communication unit as claimed in claim 1.

According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.

According to a third aspect of the present invention, there is provided a method of determining whether a signal input to a communication unit is speech or noise, as claimed in claim 14.

Further aspects of the invention are as described in the dependent claims.

In summary, the present invention aims to handle non-stationary noise of arbitrary magnitude by using an energy acceleration rate measurement (preferably an energy magnitude measurement) to indicate the presence or absence of speech.

Brief Description of the Drawings

Exemplary embodiments of the invention will now be described with reference to the accompanying drawings, in which:

FIG. 1 shows a block diagram of a communication unit adapted to perform voice activity detection and verification according to a preferred embodiment of the present invention;

FIG. 2 shows a flowchart of an energy acceleration rate based voice activity detector for noisy environments according to a preferred embodiment of the present invention;

FIG. 3 shows a flowchart of energy acceleration rate based voice activity verification for noisy environments according to a preferred embodiment of the present invention; and

FIG. 4 shows buffer operation according to a preferred embodiment of the present invention.

Detailed Description

Voiced speech has relatively high energy acceleration rate values, because the onset of voiced speech relies on the action of vocal cords that are either vibrating or stationary. Similarly, unvoiced onsets (e.g. plosives) also have a high energy acceleration rate.

The inventors have recognized that, in representative domains where speech characteristics are pronounced, such as the narrowband power spectrum or the Mel spectrum, the resulting energy acceleration rate is much higher than that of non-stationary noise. The only major exception is impulsive noise (e.g. clapping).

Therefore, according to a preferred embodiment of the present invention, the inventors have found that speech can additionally be distinguished from such noises by concentrating on the energy in the frequency region likely to contain the fundamental pitch of the voice signal. Specifically, the inventors propose using a non-structural feature of speech, namely the energy acceleration rate (or the acceleration rate of some measure reflecting the speech energy or its components).

In particular, a preferred application of the inventive concepts described herein is the Distributed Speech Recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 V1.1.2 (2000-04), April 2000.

Referring now to FIG. 1, there is shown a block diagram of an audio subscriber unit 100 adapted to support the inventive concepts of the preferred embodiment of the present invention.

The preferred embodiment of the invention is described with reference to a wireless audio communication unit, for example one capable of operating under the 3rd Generation Partnership Project (3GPP) standard for future cellular wireless communication systems and providing DSR capability. However, it is within the scope of the invention that the inventive concepts described herein for voice activity detection and verification apply equally to any electronic device that responds to voice signals and could benefit from improved voice activity detection circuitry.

As is known in the art, the audio subscriber unit 100 contains an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between the receive chain and the transmit chain within the audio subscriber unit 100.

The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or baseband frequency conversion). The front-end circuitry 106 is serially coupled to a signal processing function 108 (generally realized by a digital signal processor (DSP)). The signal processing function 108 performs signal demodulation, error correction and formatting. Data recovered by the signal processing function 108 is passed serially to an audio processing function 109, which formats the received signal in a suitable manner for sending to an audio enunciator/display 111.

In different embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is arranged to control the information flow and operational state of the components of the subscriber unit 100.

As regards the transmit chain, this essentially includes an audio input device 120 coupled in series through the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The signal processing function 108, transmitter/modulation circuitry 122 and power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104 and the antenna 102, to radiate the final radio frequency signal.

In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130, operably coupled to a voice activity decision function 135. According to the preferred embodiment of the invention, the VAD function 130 and the voice activity decision function 135 are adapted to provide an improved speech detection and decision mechanism, the operation of which is described further with reference to FIG. 2 and FIG. 3. It should be noted that the voice activity detector function 130 comprises a frame-by-frame detection stage made up of three measurements. These three frequency range measurements are:

(i) the whole spectrum;

(ii) a spectral sub-band; and

(iii) the spectral variance.

The voice activity decision function 135 then makes a decision based on a buffer of these measurements, analyzing them for speech likelihood. The final decision of the decision stage is applied retrospectively to the earliest frame in the buffer.

In the preferred embodiment of the invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.

The signal processing function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as distinct, operably coupled processing components. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a further alternative embodiment, the above functions may be implemented as a mix of hardware, software or firmware components, using application specific integrated circuits (ASICs) and/or processors such as digital signal processors (DSPs).

Of course, the various components within the audio subscriber unit 100 can be realized in discrete or integrated form, the final structure being merely an arbitrary selection.

To this end, there are several ways of obtaining the indication of energy acceleration rate used in the preferred embodiment of the invention.

(i) The theoretically ideal approach is to double-differentiate the energy level accurately over successive frames of the utterance, as shown in the previously published US 6009391. The drawback of this approach is that it may introduce delay, because several frames on either side of the frame under analysis are needed.

(ii) A zero-delay estimate of the energy acceleration rate can be obtained by comparing the instantaneous value with a short-term average, for example:

using a frame average:

$\tilde{A} = \frac{x_t}{(x_t + x_{t-1} + \cdots + x_{t-n}) / (n + 1)}$ [1]

or using a rolling average:

$\tilde{A} = \frac{x_t}{a x_t + b x_{t-1} + \cdots + k x_{t-n}}$ [2]

In each case the method returns a value that can be interpreted as 'deceleration' < '1' < 'acceleration'. Empirical values of $\tilde{A}$ and of the denominator length that best separate speech from noise can then be found.

The inventors have recognized that the preferred solution is to find a denominator that tracks non-stationary noise quickly, yet is too long to track speech onsets. A suggested sequence of values for the rolling average is a = 0.2, b = 0.8 × a, c = 0.8 × b, and so on, which can be expressed simply as the recursion:

$d_t = 0.2\,x_t + 0.8\,d_{t-1}$ [3]

then:

$\tilde{A} = x_t / d_t$ [4]
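The two zero-delay estimates can be sketched in a few lines, as an illustration rather than a reference implementation. The function names, the handling of the first frames (where fewer than n + 1 values exist) and the seeding of the rolling average with the first value are assumptions made for this example.

```python
from typing import Iterable, List


def acceleration_frame_average(energies: List[float], n: int) -> List[float]:
    """Equation [1]: x_t divided by the mean of the last n + 1 frame values."""
    ratios = []
    for t, x_t in enumerate(energies):
        window = energies[max(0, t - n): t + 1]   # shorter at the start (assumption)
        mean = sum(window) / len(window)
        ratios.append(x_t / mean if mean > 0 else 1.0)
    return ratios


def acceleration_rolling_average(energies: Iterable[float],
                                 alpha: float = 0.2) -> List[float]:
    """Equations [3] and [4]: d_t = alpha*x_t + (1 - alpha)*d_{t-1}, ratio = x_t / d_t."""
    ratios = []
    d = None
    for x_t in energies:
        # Seeding d with the first value is an assumption of this sketch.
        d = x_t if d is None else alpha * x_t + (1.0 - alpha) * d
        ratios.append(x_t / d if d > 0 else 1.0)
    return ratios
```

For a steady input both functions return values close to 1. A sudden rise gives a ratio above 1 (acceleration) and a sudden fall a ratio below 1 (deceleration).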

The preferred VAD and parameter initialization scheme of the detection stage is outlined in the flowchart of FIG. 2. In non-stationary noise, a long-term energy threshold is not a reliable indicator of speech. Similarly, in high-noise conditions the structure of speech (e.g. harmonics) cannot be wholly relied upon as an indicator, since it may be corrupted by the noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust characteristic of speech, namely the energy acceleration rate associated with speech onset.

Referring now to FIG. 2, a flowchart 200 of the preferred detection process is shown. As noted above, the process operates frame by frame. The preferred VAD mechanism involves a 'whole spectrum' measurement process. First, the frame counter is evaluated to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example of the preferred embodiment, 'N' is set to '15', assuming a frame increment of, for example, 10 milliseconds. If the frame counter is less than 'N' in step 205, the rolling average used for the initial acceleration test is updated, as in step 210. If the frame counter is not less than 'N' in step 205, step 210 is skipped.

A determination is then made as to whether the estimated energy acceleration rate measurement lies within one or more specified limits, as shown in step 235. If it does, the rolling average is updated with the result of the further energy acceleration rate test, as in step 240. If it does not, step 240 is skipped.

A determination is then made as to whether the estimated energy acceleration rate measurement is greater than a specified threshold, as shown in step 260. If it is, the frame is deemed a speech frame, as in step 265. If it is not, the frame is deemed a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

As an enhancement to this process, a sub-region measurement process, shown as optional steps 215 and 245, may be performed instead of, or in addition to, the whole spectrum measurement process. The particular sub-region of the spectrum is chosen as one that is likely to contain the fundamental pitch.

In the sub-region process, when the rolling average of the initial acceleration test has been updated in step 210 of the whole spectrum measurement, a check is made as to whether the energy acceleration rate measurement is greater than a threshold value, as shown in step 220. If it is greater than the threshold value in step 220, the initialization of other parameters is suspended, as shown in step 225. If it is not greater than the threshold value in step 220, the initialization of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.

A further preferred determination is made after the determination in step 235 of whether the estimated energy acceleration rate measurement lies within one or more specified limits. The deceleration value is evaluated in step 250 to determine whether it is 'high' and, if so, the rolling average of the energy acceleration rate test is updated slowly, as shown in step 255. The process then returns to the whole spectrum method at step 260.

In this way, the higher signal-to-noise ratio (SNR) of the sub-region detector makes it more robust to noise. However, it is vulnerable to adverse microphone and speaker variations and to band-limited noise, so this measurement should not be relied upon in all circumstances. The preferred embodiment of the invention therefore incorporates the sub-region detector as a reinforcement of the whole spectrum measurement.

A further measurement process is preferably performed using the 'acceleration rate' of, for example, the variance of the values within the lower half of the spectrum of each frame. This variance measurement detects structure within the lower half of the spectrum, making it highly sensitive to voiced speech. The variance measurement follows the approach of the sub-region process, with the lower half of the spectrum as the particular sub-region chosen. It further complements the whole spectrum measurement, which is better able to detect unvoiced and plosive speech.

All three measurements take their raw input from a spectral representation of the filter gain produced by the first stage of a double Wiener filter, as described in US patent application US 09/427497, assigned to Motorola and naming Yan-Ming Chen as inventor. As noted above, each measurement uses a different aspect of this data.

Specifically, the whole spectrum detector uses the known Mel-filtered spectral representation of the filter gain produced by the first stage of the double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.

In the preferred embodiment of the invention, the whole spectrum detector applies the following process to every frame:

Step one initializes the noise estimate tracker (Tracker) as follows:

If Frame < 15 and Acceleration < 2.5,

then Tracker = MAX(Tracker, Input).

If speech occurs within the 15-frame lead-in time, the energy acceleration rate measurement prevents the tracker from being updated.

Step two updates the tracker as follows if the current input is similar to the noise estimate:

If Input < Tracker × Upper Limit and

Input > Tracker × Lower Limit,

then Tracker = a × Tracker + (1 - a) × Input.

Step three provides a fail-safe mechanism for those instances where speech, or uncharacteristically loud noise, is present within the first few frames. It causes the resulting erroneously high noise estimate to decay. Step three preferably proceeds as follows:

If Input < Tracker × Floor,

then Tracker = b × Tracker + (1 - b) × Input.

Step four returns a 'true' speech determination, as follows, if the current input exceeds 165% of the tracker:

If Input > Tracker × Threshold,

then output 'true', otherwise output 'false'.

The ratio of the instantaneous input to the short-term mean tracker is a function of the energy acceleration rate of successive inputs.

where, in the above:

a = 0.8 and b = 0.97;

the Upper Limit is 150% and the Lower Limit is 75%;

the Floor is 50%; and

the Threshold is 165%.
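The four steps and the parameter values above can be gathered into a small per-frame state machine, sketched here for illustration. The class and attribute names, the initial tracker value of zero and a frame counter starting at zero are assumptions of this sketch, and the acceleration value used in step one is taken as an argument supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class WholeSpectrumDetector:
    a: float = 0.8
    b: float = 0.97
    upper_limit: float = 1.50
    lower_limit: float = 0.75
    floor: float = 0.50
    threshold: float = 1.65
    lead_in_frames: int = 15
    lead_in_acceleration: float = 2.5
    tracker: float = 0.0    # assumed initial noise estimate
    frame: int = 0          # assumed to count from zero

    def step(self, value: float, acceleration: float) -> bool:
        """Process one frame input and return True if it is judged to be speech."""
        # Step one: initialize the tracker during the lead-in, unless a speech onset is likely.
        if self.frame < self.lead_in_frames and acceleration < self.lead_in_acceleration:
            self.tracker = max(self.tracker, value)
        # Step two: follow inputs that look like the current noise estimate.
        if self.tracker * self.lower_limit < value < self.tracker * self.upper_limit:
            self.tracker = self.a * self.tracker + (1 - self.a) * value
        # Step three: fail-safe, slowly decay an erroneously high estimate.
        if value < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * value
        # Step four: speech decision.
        is_speech = value > self.tracker * self.threshold
        self.frame += 1
        return is_speech
```

In use, one detector instance is kept per utterance and `step` is called once per frame. Inputs well above the upper limit leave the tracker unchanged, so a sustained speech burst does not pull the noise estimate upward.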

It should be noted that no update takes place if the value is greater than the upper limit or lies between the lower limit and the floor. In addition, as noted above, the energy acceleration rate input can be calculated by double-differentiating successive inputs, or estimated by tracking the ratio of two rolling averages of the input.

It should be noted that the ratio of a fast-adapting to a slow-adapting rolling average reflects the energy acceleration rate of successive inputs.

For example, the contribution rates to the averages used above are:

(i) 0 × Mean + 1 × Input, and

(ii) ((Frame - 1) × Mean + 1 × Input) / Frame,

which make the energy acceleration rate measurement increasingly sensitive over the first fifteen frames.

The sub-band detector preferably uses the average of the second, third and fourth Mel filter banks derived for the 'whole spectrum' measurement. The detector then applies the following process to every frame:

(i) Input = p × Current Input + (1 - p) × Previous Input;

(ii) If Frame < 15,

then Tracker = MAX(Tracker, Input);

(iii) If Input < Tracker × Upper Limit and

Input > Tracker × Lower Limit,

then Tracker = a × Tracker + (1 - a) × Input;

(iv) If Input < Tracker × Floor,

then Tracker = b × Tracker + (1 - b) × Input;

(v) If Input > Tracker × Threshold,

then output 'true', otherwise output 'false'.

where, for the sub-band measurement:

p = 0.75.

All other parameters are the same as for the whole-spectrum measurement, except for the threshold, which is equal to 3.25.
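The per-frame processing of the sub-band detector can be sketched as below. The class and field names are hypothetical. Three points are assumptions rather than statements from the text: the first frame uses its own value as the "previous input"; step (iv) is assumed to update the tracker with the slower coefficient b; and step (v) is assumed to output a 'true' speech indication when the input exceeds the tracker times the threshold. The text as reproduced here gives only the conditions of steps (iv) and (v), not their consequences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubBandDetector:
    p: float = 0.75          # input smoothing factor (sub-band measurement)
    a: float = 0.8           # tracker coefficient inside the limits
    b: float = 0.97          # ASSUMED tracker coefficient below the floor
    upper: float = 1.50      # upper limit, 150%
    lower: float = 0.75      # lower limit, 75%
    floor: float = 0.50      # floor, 50%
    threshold: float = 3.25  # sub-band threshold
    tracker: float = 0.0
    previous: Optional[float] = None
    frame: int = 0

    def step(self, current: float) -> bool:
        self.frame += 1
        if self.previous is None:      # ASSUMPTION: no history on frame 1
            self.previous = current
        # (i) smooth the input
        x = self.p * current + (1 - self.p) * self.previous
        self.previous = current
        # (ii) during the first frames the tracker follows the maximum
        if self.frame < 15:
            self.tracker = max(self.tracker, x)
        # (iii) input within the limits: normal tracker update
        if self.tracker * self.lower < x < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * x
        # (iv) ASSUMED: input below the floor gives a slow update with b
        if x < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * x
        # (v) ASSUMED: speech is indicated when the threshold is exceeded
        return x > self.tracker * self.threshold
```

Because the tracker is frozen whenever the input is above the upper limit, a sudden burst of energy does not inflate the noise reference it is compared against.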

For the spectral variance measurement, the variance of the values comprising the lower-frequency half of the narrowband spectral representation of the gain for each frame is used as the input. The detector then applies the same processing as for the whole-spectrum measurement.

The variance is calculated as:

$\frac{1}{N}\sum_{i=0}^{N-1} w_i^{2} - \left(\sum_{i=0}^{N-1} w_i\right)^{2} / N^{2} \qquad [5]$

where:

N = FFT length / 4, and

w_i are the values of the narrowband spectral representation of the gain.
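Equation [5] is the mean of the squares minus the square of the mean. A minimal sketch follows, assuming that `gains` is a hypothetical list holding the narrowband gain values ordered from the lowest frequency upward, so that its first N entries are the lower-frequency half.

```python
def spectral_variance(gains, fft_length):
    """Variance of the first N = fft_length / 4 gain values, per equation [5]."""
    n = fft_length // 4
    w = gains[:n]
    if len(w) < n or n == 0:
        raise ValueError("need at least fft_length / 4 gain values")
    mean_of_squares = sum(x * x for x in w) / n
    square_of_mean = sum(w) ** 2 / n ** 2
    return mean_of_squares - square_of_mean
```

The result is then fed to the same tracker processing as the other two measurements.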

According to a preferred embodiment of the present invention, the three measurements detailed above are provided to the VAD decision algorithm, as shown in the flowchart of FIG. 3. Successive inputs are presented to a buffer, which provides contextual analysis. This introduces a frame delay equal to the buffer length minus one frame.

Referring now to FIG. 3, there is shown a flowchart 300 of an acceleration-based voice activity verification process for noisy environments, in accordance with a preferred embodiment of the present invention.

For an N = 7 frame buffer, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number of the following steps, and preferably each of them:

Step 1:

V_N = measurement 1 OR measurement 2 OR measurement 3

The input V_N is defined as 'true' (T) if any one of the three measurements returns a true speech indication.

Step 2:

The algorithm searches the buffer for the longest contiguous sequence of 'true' values, as in step 310; the length of that sequence is denoted M. Thus, for example, for the sequence 'TTFTTTF', M equals 3.

Step 3:

If M ≥ S_P and T < L_S, T = L_S;

where S_P corresponds to the first threshold in step 315. If, in step 315, the longest sequence of 'true' (T) speech values equals or exceeds the first threshold, i.e. there are S_P = 3 or more consecutive 'true' values, the buffer is judged to contain 'possible' speech. If the timer is not already at (or above) that value, as determined in step 320, a short timer T (time_1) of, for example, L_S = 5 frames is started in step 325.

Step 4:

If M ≥ S_L: T = L_M if F > F_S, otherwise T = L_L;

where S_L corresponds to the second threshold in step 330. If there are S_L = 4 or more consecutive 'true' values, the buffer is judged to contain 'likely' speech. If the current frame F is outside an initial lead-in safety period F_S, as determined in step 335, a medium timer T of, for example, L_M = 22 frames is started in step 340. Otherwise, a failsafe long timer T of, for example, L_L = 40 frames is used in step 345. The long timer covers the case where speech is present early in the utterance, which would make the VAD's initial noise estimate too high.

Step 5:

If M < S_P and T > 0, T--;

If the process determines in step 350 that there are fewer than S_P = 3 consecutive 'true' values, and the timer is greater than zero in step 355, the timer is decremented in step 360.

Step 6:

If T > 0, output 'true', otherwise output 'false';

If the timer is greater than zero in step 365, the process outputs a 'true' speech decision, as shown in step 370. If the timer is not greater than zero in step 365, the process outputs a 'noise' decision, as shown in step 375.

Step 7:

Frame++, shift the buffer left and return to step 1.

In preparation for the next frame, the buffer is shifted left in step 380 to accommodate the next input, as shown in FIG. 4. The output speech decision applies to the frame leaving the buffer. The process is then repeated from step 305 for the next true/false input entered into the data buffer.
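Steps 1 to 7 can be sketched as a small state machine. This is an illustration under stated assumptions, not the definitive algorithm: the class and method names are hypothetical; step 4 is read, following the prose, as choosing between the medium and long timers only when M ≥ S_L; the frame counter is assumed to start at 1; and the lead-in safety period `f_s` must be supplied by the caller because the text gives no value for it. A `deque` with `maxlen` performs the left shift of step 7 when the next input is appended.

```python
from collections import deque


def longest_true_run(flags):
    """Step 2: length M of the longest contiguous run of True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


class VadDecision:
    def __init__(self, f_s, n=7, s_p=3, s_l=4, l_s=5, l_m=22, l_l=40):
        self.f_s = f_s                      # lead-in safety period (caller-supplied)
        self.s_p, self.s_l = s_p, s_l       # first and second thresholds
        self.l_s, self.l_m, self.l_l = l_s, l_m, l_l  # short, medium, long timers
        self.buffer = deque([False] * n, maxlen=n)
        self.timer = 0
        self.frame = 1                      # ASSUMPTION: frames counted from 1

    def push(self, m1, m2, m3):
        """Feed the three measurement flags for one frame; return the decision."""
        v_n = bool(m1 or m2 or m3)          # step 1
        self.buffer.append(v_n)             # store at position N (oldest drops out)
        m = longest_true_run(self.buffer)   # step 2
        if m >= self.s_p and self.timer < self.l_s:   # step 3: 'possible' speech
            self.timer = self.l_s
        if m >= self.s_l:                             # step 4: 'likely' speech
            self.timer = self.l_m if self.frame > self.f_s else self.l_l
        if m < self.s_p and self.timer > 0:           # step 5: hangover countdown
            self.timer -= 1
        decision = self.timer > 0                     # step 6
        self.frame += 1                               # step 7
        return decision
```

The returned decision applies to the frame about to leave the buffer, so a caller would pair each output with the input received N - 1 frames earlier.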

Alternative mechanisms for making the speech or noise decision from the energy acceleration rate processing described above are also within the contemplation of the present invention. For example, the decision mechanism need not be based on one or more timers, and may instead be based entirely on whether one or more energy acceleration rate thresholds are exceeded.

Referring now to FIG. 4, an example of a buffer operation 400 in accordance with a preferred embodiment of the present invention is shown in greater detail. Assume that the first threshold is set to three consecutive 'true' values. At time 't' 410, assume that only the current input (frame #7) 425 and the previous input (frame #6) 420 are 'true'. Therefore, when the buffer is shifted, the first frame (frame #1) 415 is marked 'false'.

At time 't+1' 430, a third 'true' input (frame #8) 450 has been received, adding to the two previous 'true' inputs 440 and 445. Therefore, when the buffer is shifted, the next output frame (frame #2) 435 is marked 'true'.

It should be noted that, in the decision process described above, the only constraints are:

(i) time_1 < time_2 < time_3, and

(ii) threshold_1 < threshold_2.

Assuming that only these three inputs (frame #6, frame #7 and frame #8) are 'true', the entire output sequence is:

Output: F T T T T T T T T T T

Frame: 1 2 3 4 5 6 7 8 9 10 11

Output: T T T T T T T F F F F

Frame: 12 13 14 15 16 17 18 19 20 21 22

where frames #2-#5 are indicated as 'true' owing to the buffer lead-in function. Frames #6-#8 are indicated as 'true' as the location of the actual initial 'true' speech inputs. Frames #9-#12 are indicated as 'true' owing to the buffer lead-out function. Frames #13-#18 are indicated as 'true' in response to the timer delay used. Once all frames of the utterance have been input, the buffer shifts out 'false' entries (frames #19-#L_M) until it is empty.

It is also within the scope of the present invention that the buffer length and the delay timers may be adjusted dynamically to suit the needs of the audio communication unit. Likewise, the preferred embodiment with a buffer length 'N' of 8 and a delay timer of 5 frames is used for explanatory purposes only. It should be noted, however, that the buffer length 'N' should always satisfy N ≥ S_L.
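Because the buffer length and timers may be adjusted dynamically, a configuration check along the following lines could guard the stated constraints. The function name is hypothetical, and the mapping of time_1, time_2, time_3 onto L_S, L_M, L_L and of threshold_1, threshold_2 onto S_P, S_L is an assumption drawn from the step descriptions.

```python
def check_vad_parameters(n, s_p, s_l, l_s, l_m, l_l):
    """Return a list of violated constraints (empty if the settings are valid)."""
    problems = []
    if not (l_s < l_m < l_l):
        problems.append("timers must satisfy L_S < L_M < L_L")
    if not (s_p < s_l):
        problems.append("thresholds must satisfy S_P < S_L")
    if not (n >= s_l):
        problems.append("buffer length must satisfy N >= S_L")
    return problems
```

For the example values used in the description (N = 7, S_P = 3, S_L = 4, L_S = 5, L_M = 22, L_L = 40) the check reports no violations.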

In addition to its use as a VAD in its own right, it is within the contemplation of the present invention that the energy acceleration rate measurement performed in the method steps of FIG. 2 may be used to validate the initialisation of other parameters. For example, spectral subtraction schemes require an initial estimate of the noise, based on the first ten frames of speech (typically 100 milliseconds). Even in stationary noise, several events may occur that invalidate this initial estimate. Examples of such events include:

(a) Ramp-up of the signal:

For various possible reasons, the start of a recording may 'ramp up' to full value within the estimation period. Causes of such a ramp-up include buffer filling in digital systems and capacitance or lead connections in analog systems. The effect of these events is to invalidate the estimate. The energy acceleration rate measurement can therefore be used to detect such a ramp-up and prevent this error.

(b) Glitch in the initial signal:

A common 'glitch' accompanies the full actuation of the push-to-talk (PTT) button on a user's wireless unit, where electrical contact occurs very slightly before the button strikes the back of the switch. As described above, when such an event occurs, the energy acceleration rate measurement can be used to suspend the estimation process, as shown in step 225 of FIG. 2.

(c) Speech in the initial signal:

Another commonly occurring event, particularly in PTT systems, is that the user begins speaking immediately on pressing the PTT button, so that electrical contact is made after speech has started. The energy acceleration rate measurement can identify this and either suspend the noise-based initialisation, as shown in step 225 of FIG. 2, or force the use of a fallback estimate.
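The validation idea in cases (a) to (c) can be sketched as a guard around the initial noise estimate. Everything here is illustrative: the function name, the use of a simple mean over the first ten frames, and the caller-supplied `fallback` value are assumptions, and `speech_flags` stands for the per-frame output of an energy acceleration rate detector.

```python
def initial_noise_estimate(frame_energies, speech_flags, fallback, init_frames=10):
    """Mean energy of the first frames, unless the detector fired in that window."""
    window = frame_energies[:init_frames]
    flags = speech_flags[:init_frames]
    if not window:
        raise ValueError("no frames available")
    # A ramp-up, glitch or early speech makes the window unreliable,
    # so the estimate is not taken from it.
    if any(flags):
        return fallback
    return sum(window) / len(window)
```

A real system might instead postpone the estimate until a clean window is found; returning a fallback is only one of the two options the text mentions.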

In summary, a communication unit comprising an audio processing unit having a voice activity detection mechanism has been described. The voice activity detection mechanism provides an indication of the energy acceleration rate of a signal input to the communication unit and determines, from that indication, whether the input signal is speech or noise.

In addition, a method of detecting a speech signal input to a communication unit has been described. The method comprises the steps of: indicating an acceleration rate of an input signal input to the communication unit; and determining, from the indicating step, whether the input signal is speech or noise.

Furthermore, a method of deciding whether a signal input to a communication unit is speech or noise has been described. The method comprises the step of deciding, based on the energy acceleration rate, whether the input signal is speech or noise, for example using a frame average or rolling average of a number of input signals.

It will therefore be understood that the energy acceleration rate based voice activity detector and verifier for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses a measurement that depends on the energy acceleration rate rather than on absolute levels, the inventive concepts described herein can be applied to speech at any input level.

While specific and preferred implementations of embodiments of the invention have been described above, it is clear that a person skilled in the art could readily apply variations and modifications of such inventive concepts that fall within the scope of the invention.

Thus, an improved voice activity detector and verifier for noisy environments has been described, in which the above-mentioned disadvantages associated with prior art arrangements are substantially alleviated.

Claims

1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), the communication unit (100) characterised in that the voice activity detection mechanism (130, 135) measures an energy acceleration rate of a signal input to the communication unit (100) and determines, from the measurement, whether the input signal is speech or noise.

2. A communication unit (100) as claimed in claim 1, wherein the voice activity detection mechanism comprises a voice activity detector function block (130) that performs frame-by-frame detection of speech on the signal input to the voice activity detection mechanism (130, 135).

3. A communication unit (100) as claimed in claim 2, wherein the frame-by-frame detection comprises performing an energy acceleration rate measurement on the signal input to the voice activity detection mechanism (130, 135) for one or more of the following:

(i) the whole spectrum;

(ii) a sub-band of the spectrum; and

(iii) the spectral variance.

4. A communication unit (100) as claimed in claim 3, wherein the voice activity detection mechanism comprises a voice activity decision function block (135), operably coupled to the voice activity detector function block (130), to decide whether the input signal is speech based on a buffering operation on one or more of the measurements.

5. A communication unit (100) as claimed in claim 4, wherein the voice activity decision function block (135) uses a frame average or rolling average of a plurality of the input signals to decide whether the input signal is speech.

6. A communication unit (100) as claimed in any one of claims 2 to 5, wherein an input frame is deemed to be a speech frame (265) if the energy acceleration rate measurement yields an energy acceleration rate value greater than an energy acceleration rate threshold.

7. A communication unit (100) as claimed in claim 6, wherein the determination (265) that an input frame is a speech frame is applied retrospectively to preceding frames in a buffer of the input signal.

8. A communication unit (100) as claimed in claim 6 or claim 7, wherein an input frame is deemed to be a speech frame (370) if the energy acceleration rate measurement yields an energy acceleration rate value greater than the energy acceleration rate threshold for a plurality of consecutive frames.

9. A communication unit (100) as claimed in any one of claims 3 to 8 when dependent on claim 3, wherein, if a sub-region of the input signal spectrum is selected, the selection is made on the basis that the sub-region is the most likely to contain the fundamental pitch of the speech signal.

10. A communication unit (100) as claimed in any preceding claim, wherein the voice activity detection mechanism (130, 135) uses the acceleration rate of an energy-related characteristic of speech to validate the parameter initialisation of other speech-related or noise-related metrics, for example spectral subtraction schemes.

11. A method of detecting a speech signal input to a communication unit, characterised by the steps of:

measuring an acceleration rate of, or variation in, the energy of an input signal input to the communication unit; and

determining (315, 330, 350), from the measuring step, whether the input signal is speech (370) or noise (375).

12. A method of detecting a speech signal as claimed in claim 11, further characterised by the step of:

performing frame-by-frame detection of speech on the signal input to the communication unit.

13. A method of detecting a speech signal as claimed in claim 12, wherein the frame-by-frame detection comprises the step of:

performing an energy acceleration rate measurement on the input signal for one or more of the following:

(i) the whole spectrum;

(ii) a sub-band of the spectrum; and

(iii) the spectral variance.

14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably in accordance with any one of preceding claims 11 to 13, the method characterised by the step of:

deciding (315, 330, 350) whether the input signal is speech (370) or noise (375) based on an energy acceleration rate of, or variation in, an energy measurement of the input signal, for example using a frame average or rolling average of a plurality of input signals.

15. A method of deciding whether a signal input to a communication unit is speech or noise as claimed in claim 14, wherein the deciding step comprises:

determining that an input frame is a speech frame (265) if the energy acceleration rate measurement yields an energy acceleration rate value greater than an energy acceleration rate threshold; and

applying the determination retrospectively to preceding frames in the buffer of the input signal.