
HK1114939A - Method and apparatus for robust speech classification


Info

Publication number
HK1114939A
Authority
HK
Hong Kong
Prior art keywords
speech
parameter
parameters
classification
classifier
Application number
HK08104621.1A
Other languages
Chinese (zh)
Inventor
P. Huang
Original Assignee
Qualcomm Incorporated
Application filed by Qualcomm Incorporated
Publication of HK1114939A


Description

Robust speech classification method and apparatus
The present application is a divisional application of Chinese patent application No. 01822493.8, entitled "Method and Apparatus for Robust Speech Classification", filed on December 4, 2001.
Background
I. Field of the invention
The disclosed embodiments relate to the field of speech processing, and more particularly, to a novel and improved method and apparatus for robust speech classification.
II. Background
The transmission of voice by digital techniques has become widespread, particularly in long-distance and digital wireless telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over the channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, the data rate can be significantly reduced through the use of speech analysis, followed by appropriate coding and re-synthesis at the receiver. The more accurately the speech analysis is performed and the more appropriately the data is coded, the greater the achievable reduction in data rate.
Devices that employ techniques for compressing speech by extracting parameters associated with a model of human speech generation are known as speech coders. The speech encoder divides the incoming speech signal into blocks of time or analysis frames. Speech coders generally include an encoder and a decoder or codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters and then quantizes the parameters into a binary representation, i.e., into a set of bits or binary data packets. The data packets are transmitted over a communication channel to a receiver and decoder. The decoder processes the data packets, dequantizes them to produce parameters, and then re-synthesizes speech frames using the dequantized parameters.
The role of a speech encoder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. Digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has Ni bits and the data packet produced by the speech coder has No bits, the compression factor achieved by the speech coder is Cr = Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process performs at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
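By way of a worked illustration only (the 16-bit sample width is an assumption, not taken from this disclosure), the following sketch computes the compression factor Cr for the 8 kHz, 20 ms frames described later in this specification and a full-rate 8 kbps packet:

    /* Illustrative sketch only: compression factor Cr = Ni/No.
     * Assumes 16-bit linear PCM input samples; the disclosure does not fix
     * the input sample width. */
    #include <stdio.h>

    int main(void) {
        const int samples_per_frame = 160;  /* 8 kHz x 20 ms, per the description */
        const int bits_per_sample   = 16;   /* assumption for illustration */
        const int Ni = samples_per_frame * bits_per_sample;  /* input bits per frame */
        const int No = 160;                 /* full-rate packet: 8 kbps x 20 ms */
        printf("Cr = %d / %d = %.1f\n", Ni, No, (double)Ni / No);
        return 0;
    }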
The speech encoder may be implemented as a time-domain coder, which attempts to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, the speech encoder may be implemented as a frequency-domain coder, which attempts to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employs a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored code vectors in accordance with known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).
A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is hereby incorporated by reference in its entirety. In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a Linear Prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, No, for each frame) or at a variable rate, in which different bit rates are used for different types of frame contents. Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Patent No. 5,414,796, assigned to the assignee of the present invention and incorporated herein by reference.
Time-domain coders such as the CELP coder typically rely on a high number of bits, No, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits No per frame is relatively large (e.g., 8 kbps or above). At low bit rates (4 kbps and below), however, time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.
In general, the CELP scheme employs a short-term prediction (STP) filter and a long-term prediction (LTP) filter. An analysis-by-synthesis (AbS) approach is employed at the encoder to find the LTP delays and gains, as well as the best stochastic codebook gains and indices. Current state-of-the-art CELP coders, such as the Enhanced Variable Rate Coder (EVRC), can achieve high-quality synthesized speech at a data rate of approximately 8 kilobits per second.
It is also known that unvoiced speech does not exhibit periodicity. The bandwidth consumed by encoding the LTP filter in conventional CELP schemes is not used as efficiently for unvoiced speech as for voiced speech, where the periodicity of speech is strong and LTP filtering is meaningful. Therefore, a more efficient (i.e., lower bit rate) coding scheme is desirable for unvoiced speech. Accurate speech classification is necessary for selecting the most efficient coding scheme and achieving the lowest data rate.
For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded, and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model with a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.
Nevertheless, low-bit-rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism, rendering the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronics Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-dequantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (i.e., the major pulses will not be in synchrony). It has therefore proven difficult to adopt any closed-loop performance measure, such as signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.
One effective technique for encoding speech efficiently at low bit rates is multi-mode coding. Multi-mode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multi-mode coding technique is described in Amitava Das et al., Multimode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis ch. 7 (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multi-mode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (non-speech), in the most efficient manner. The success of such multi-mode coding techniques is highly dependent on correct mode decisions, i.e., speech classifications. An external, open-loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures. An exemplary open-loop mode decision for a speech codec is described in U.S. Patent No. 5,414,796, assigned to the assignee of the present invention and incorporated herein by reference in its entirety.
Multi-mode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes. The goal of variable-rate coding is to use only the number of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significantly lower average rate using variable-bit-rate (VBR) techniques. An exemplary variable-bit-rate speech coder is described in U.S. Patent No. 5,414,796. There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet-loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth. A low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of coder specifications and deliver robust performance under channel error conditions.
Multi-mode VBR speech coding is therefore an effective mechanism for encoding speech at low bit rates. Conventional multi-mode schemes require the design of efficient coding schemes, or modes, for various segments of speech (e.g., unvoiced, voiced, transition), as well as a mode for background noise, or silence. The overall performance of the speech coder depends on the robustness of the mode classification and on how well each mode performs. The average rate of the coder depends on the bit rates of the different modes for unvoiced, voiced, and other segments of speech. In order to achieve the target quality at a low average rate, it is necessary to correctly determine the speech mode under varying conditions. Typically, voiced and unvoiced speech segments are captured at high bit rates, and background noise and silence segments are represented with modes working at significantly lower rates. Multi-mode variable-bit-rate coders require correct speech classification to accurately capture and encode a high percentage of speech segments using a minimal number of bits per frame. More accurate speech classification produces a lower average encoded bit rate and higher-quality decoded speech. Previously, speech classification techniques considered only a minimal number of parameters for isolated frames of speech, producing few and inaccurate speech mode classifications. Thus, a high-performance speech classifier is needed to correctly classify numerous modes of speech under varying environmental conditions in order to enable maximum performance of multi-mode variable-bit-rate coding techniques.
Summary
The disclosed embodiments are directed to a robust speech classification technique that estimates many characteristic parameters of speech to classify various modes of speech with a high degree of accuracy under different conditions. Thus, in one aspect, a method of speech classification is disclosed. The method includes inputting classification parameters from an external component into a speech classifier, generating internal classification parameters from at least one input parameter within the speech classifier, setting a normalized autocorrelation coefficient function threshold and selecting a parameter analyzer based on a signal environment, and analyzing the input parameters and the internal parameters to generate a speech pattern classification.
In another aspect, a speech classifier is disclosed. The speech classifier includes: a generator for generating an internal classification parameter from at least one external input parameter, a normalized autocorrelation coefficient function threshold generator for setting a normalized autocorrelation coefficient function threshold and selecting a parameter analyzer according to a signal environment, and a parameter analyzer for analyzing the at least one external input parameter and the internal parameter to generate a speech pattern classification.
Brief Description of Drawings
The features, nature, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout and wherein:
FIG. 1 is a block diagram of communication channels terminated at endpoints by a speech encoder;
FIG. 2 is a block diagram of a robust speech classifier that may be used by the encoder of FIG. 1;
FIG. 3 is a flow chart illustrating the speech classification steps of the robust speech classifier;
FIGS. 4A, 4B and 4C are state diagrams used by the disclosed embodiment of the speech classifier;
FIGS. 5A, 5B and 5C are decision tables used by the disclosed embodiments of the speech classifier; and
FIG. 6 is an exemplary diagram of one embodiment of a speech signal with classification parameters and speech mode values.
Description of The Preferred Embodiment
The disclosed embodiments provide a method and apparatus for improved speech classification within a vocoder. Novel classification parameters are analyzed to produce more speech classifications with higher accuracy than previously available. A novel decision-making process is used to classify speech on a frame-by-frame basis. Parameters derived from the original input speech, SNR information, noise-suppressed output speech, voice activity information, Linear Prediction Coefficient (LPC) analysis, and open-loop pitch estimation are employed by a novel state-based decision maker to accurately classify various modes of speech. Each frame of speech is classified by analyzing past and future frames, as well as the current frame. Modes of speech that can be classified by the disclosed embodiments include transient, transitions to active speech, transitions at the end of words, voiced, unvoiced, and silence.
The disclosed embodiments present a speech classification technique for a variety of speech modes in environments with varying levels of ambient noise. Speech modes can be reliably and accurately identified for encoding in the most efficient manner.
In FIG. 1, a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 12, or communication channel 12, to a first decoder 14. The first decoder 14 decodes the encoded speech samples and synthesizes an output speech signal sSYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted on a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal sSYNTH(n).
The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, such as pulse code modulation (PCM), companded mu-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates may be used. As used herein, the terms "full rate" or "high rate" generally refer to data rates that are greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates that are less than or equal to 4 kbps. Varying the data transmission rate is beneficial because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
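As an informal illustration of the rate set described above, the sketch below computes the number of bits available per 20 ms frame at each of the example data rates; it is not part of the described embodiments:

    /* Sketch (not from the patent text): bits per 20 ms packet implied by the
     * example data rates of the description. */
    #include <stdio.h>

    int main(void) {
        const double frame_ms     = 20.0;
        const double rates_kbps[] = { 8.0, 4.0, 2.0, 1.0 };   /* full, half, quarter, eighth */
        const char  *names[]      = { "full", "half", "quarter", "eighth" };
        for (int i = 0; i < 4; i++) {
            double bits = rates_kbps[i] * frame_ms;           /* kbps x ms = bits */
            printf("%-8s rate: %.0f bits per frame\n", names[i], bits);
        }
        return 0;
    }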
The first encoder 10 and the second decoder 20 together comprise a first speech encoder or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech encoder. Those skilled in the art will appreciate that the speech coder may be implemented using a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and microprocessor. The software modules may reside in RAM memory, flash memory, registers, or any other storage medium known in the art that can be written to. In addition, any conventional processor, controller or state machine may be substituted for the microprocessor. ASICs specifically designed for speech coding are exemplified in U.S. patent nos. 5727123 and 5784532, assigned to the assignee of the present invention and incorporated herein by reference.
FIG. 2 illustrates an exemplary embodiment of a robust speech classifier. In one embodiment, the speech classifier of FIG. 2 may reside within the encoders (10, 16) of FIG. 1. In another embodiment, the robust speech classifier may stand alone, providing a speech classification mode output to devices such as the encoders (10, 16) of FIG. 1.
In FIG. 2, input speech is provided to a noise suppressor (202). The input speech is typically generated by analog-to-digital conversion of a voice signal. The noise suppressor (202) filters noise components from the input speech signal, producing a noise-suppressed output speech signal and SNR information for the current output speech. The SNR information and output speech signal are input to a speech classifier (210). The output speech signal of the noise suppressor (202) is also input to a voice activity detector (204), an LPC analyzer (206), and an open-loop pitch estimator (208). The SNR information is used by the speech classifier (210) to set periodicity thresholds and to distinguish between clean and noisy speech. The SNR parameter is hereinafter referred to as curr_ns_snr. The output speech signal is hereinafter referred to as t_in. If, in one embodiment, the noise suppressor (202) is not present, or is turned off, the SNR parameter curr_ns_snr should be pre-set to a default value.
The voice activity detector (204) outputs voice activity information of the current frame to the speech classifier (210). The voice activity information output indicates if the current speech is active or inactive. In an exemplary embodiment, the voice activity information output may be binary, i.e., active or inactive. In another embodiment, the voice activity information output may be multi-valued. The voice activity information parameter is referred to as vad below.
The LPC analyzer (206) outputs the current output speech LPC reflection coefficients to the speech classifier (210). The LPC analyzer (206) may also output other parameters such as LPC coefficients. The LPC reflection coefficient parameter is referred to as refl hereinafter.
The open-loop pitch estimator (208) outputs a Normalized Autocorrelation Coefficient Function (NACF) value and NACF around pitch values to the speech classifier (210). The NACF parameter is hereinafter referred to as nacf, and the NACF around pitch parameter is hereinafter referred to as nacf_at_pitch. A more periodic speech signal produces a higher value of nacf_at_pitch. A higher value of nacf_at_pitch is more likely to be associated with a stationary, voiced output speech type. The speech classifier (210) maintains an array of nacf_at_pitch values, which are computed on a sub-frame basis. In an exemplary embodiment, two open-loop pitch estimates are measured for each frame of output speech by measuring two sub-frames per frame. The nacf_at_pitch value is computed from the open-loop pitch estimate for each sub-frame. In the exemplary embodiment, a five-dimensional array of nacf_at_pitch values (i.e., nacf_at_pitch[5]) contains values for two and one-half frames of output speech. The nacf_at_pitch array is updated for each frame of output speech. The novel use of an array for the nacf_at_pitch parameter provides the speech classifier (210) with the ability to use current, past, and look-ahead (future) signal information to make more accurate and robust speech mode decisions.
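A minimal sketch of how such a five-element nacf_at_pitch array might be maintained from frame to frame is given below; the exact shifting discipline is not spelled out in the text, so the two-entry shift shown here is an assumption:

    /* Assumed sketch of maintaining the nacf_at_pitch[5] array: the two oldest
     * entries are discarded each frame and the two newest sub-frame values are
     * appended.  The indexing discipline itself is not given in the text. */
    #include <string.h>

    #define NACF_LEN 5

    void update_nacf_at_pitch(float nacf_at_pitch[NACF_LEN],
                              float subframe1_nacf, float subframe2_nacf)
    {
        /* shift the history left by the two per-frame sub-frame estimates */
        memmove(&nacf_at_pitch[0], &nacf_at_pitch[2], (NACF_LEN - 2) * sizeof(float));
        nacf_at_pitch[NACF_LEN - 2] = subframe1_nacf;  /* newer open-loop pitch estimate  */
        nacf_at_pitch[NACF_LEN - 1] = subframe2_nacf;  /* newest open-loop pitch estimate */
    }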
In addition to information from external components being input to the speech classifier (210), additional new parameters are generated internally to the speech classifier (210) from the output speech for use in the speech mode decision process.
In one embodiment, the speech classifier (210) internally generates a zero crossing rate parameter, hereinafter referred to as zcr. The zcr parameter of the current output speech is defined as the number of sign changes in the speech signal per frame of speech. In voiced speech, the zcr value is low, while unvoiced speech (or noise) has a high zcr value because the signal is very random. The zcr parameter is used by the speech classifier (210) to classify voiced and unvoiced speech.
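A minimal sketch of the zcr computation described above, assuming the output speech t_in is held as 16-bit samples, might look like the following:

    /* Minimal sketch of the zero crossing rate parameter zcr: the number of
     * sign changes of the output speech signal within one frame. */
    int zero_crossing_rate(const short *t_in, int frame_len)
    {
        int zcr = 0;
        for (int i = 1; i < frame_len; i++) {
            if ((t_in[i - 1] >= 0 && t_in[i] < 0) ||
                (t_in[i - 1] < 0 && t_in[i] >= 0)) {
                zcr++;
            }
        }
        return zcr;
    }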
In one embodiment, the speech classifier (210) internally generates a current frame energy parameter, hereinafter referred to as E. E can be used by the speech classifier (210) to identify transitional speech by comparing the energy in the current frame with the energy in past and future frames. The parameter Eprev is the previous frame's E.
In one embodiment, the speech classifier (210) internally generates a look-ahead frame energy parameter, hereinafter referred to as Enext. Enext may contain energy values from a portion of the current frame and a portion of the next frame of output speech. In one embodiment, Enext represents the energy in the second half of the current frame and the energy in the first half of the next frame of output speech. Enext is used by the speech classifier (210) to identify transitional speech. At the end of speech, the energy of the next frame drops dramatically compared to the energy of the current frame. The speech classifier (210) can compare the energy of the current frame with the energy of the next frame to identify end-of-speech and beginning-of-speech conditions, or up-transient and down-transient speech modes.
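The following sketch illustrates one possible computation of E and Enext consistent with the embodiment above; the use of plain sums of squared samples as the energy measure is an assumption:

    /* Sketch of the frame energy parameters E and Enext.  Enext is taken here
     * as the energy of the second half of the current frame plus the first
     * half of the next (look-ahead) frame, per the embodiment described above;
     * the sample-squared-sum energy measure is an assumption. */
    double frame_energy(const short *frame, int len)
    {
        double e = 0.0;
        for (int i = 0; i < len; i++)
            e += (double)frame[i] * (double)frame[i];
        return e;
    }

    double lookahead_energy(const short *cur, const short *next, int len)
    {
        return frame_energy(cur + len / 2, len - len / 2)   /* 2nd half of current frame */
             + frame_energy(next, len / 2);                  /* 1st half of next frame    */
    }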
In one embodiment, the speech classifier (210) internally generates a band energy ratio parameter, defined as log2(EL/EH), where EL is the low-band current frame energy from 0 to 2 kHz and EH is the high-band current frame energy from 2 kHz to 4 kHz. The band energy ratio parameter is hereinafter referred to as bER. The bER parameter allows the speech classifier (210) to identify voiced speech and unvoiced speech modes, since, in general, voiced speech concentrates energy in the low band, while noisy unvoiced speech concentrates energy in the high band.
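A minimal sketch of the bER computation follows. How the band energies EL (0 to 2 kHz) and EH (2 kHz to 4 kHz) are obtained (filter bank, FFT, etc.) is not specified here, so they are taken as inputs, and the small floor guarding the logarithm is an assumption:

    /* Sketch of the band energy ratio parameter bER = log2(EL/EH).  The floor
     * avoiding log2(0) is an assumption, not part of the described embodiment. */
    #include <math.h>

    double band_energy_ratio(double EL, double EH)
    {
        const double floor_e = 1e-10;          /* assumed guard against log2(0) */
        if (EL < floor_e) EL = floor_e;
        if (EH < floor_e) EH = floor_e;
        return log2(EL / EH);
    }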
In one embodiment, the speech classifier (210) internally generates a three-frame average voiced energy parameter from the output speech, hereinafter referred to as vEav. In other embodiments, vEav may be averaged over a number of frames other than three. If the current speech mode is active and voiced, vEav computes a running average of the energy in the last three frames of output speech. Averaging the energy over the last three frames gives the speech classifier (210) more stable statistics on which to base speech mode decisions than single-frame energy calculations alone. vEav is used by the speech classifier (210) to classify the end of voiced speech, or down-transient mode, since the current frame energy E drops dramatically compared to the average voiced energy vEav when speech has stopped. vEav is updated only if the current frame is voiced, and is reset to a fixed value for unvoiced or inactive speech. In one embodiment, the fixed reset value is 0.01.
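One possible reading of the vEav update described above is sketched below; the particular running-average form is an assumption, while the reset value of 0.01 is taken from the text:

    /* Sketch of the three-frame average voiced energy vEav: updated only for
     * active voiced frames, reset to 0.01 otherwise, per the embodiment above.
     * The running-average form is one possible reading of the text. */
    double update_vEav(double vEav, double E, int is_active_voiced)
    {
        if (!is_active_voiced)
            return 0.01;                       /* fixed reset value from the text      */
        return (2.0 * vEav + E) / 3.0;         /* assumed three-frame running average  */
    }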
In one embodiment, the speech classifier (210) internally generates a previous-three-frame average voiced energy parameter, hereinafter referred to as vEprev. In other embodiments, vEprev may be averaged over a number of frames other than three. vEprev is used by the speech classifier (210) to identify transitional speech. At the beginning of speech, the current frame energy E rises dramatically compared to the energy of the previous three voiced frames. The speech classifier (210) can compare the energy of the current frame with the energy of the previous three frames to identify beginning-of-speech conditions, or up-transient speech modes. Similarly, at the end of voiced speech the current frame energy drops off dramatically, so vEprev can also be used to classify transitions at the end of speech.
In one embodiment, the speech classifier (210) internally generates a current frame energy to previous-three-frame average voiced energy ratio parameter, defined as 10*log10(E/vEprev). In other embodiments, vEprev may be averaged over a number of frames other than three. The current frame energy to previous-three-frame average voiced energy ratio parameter is hereinafter referred to as vER. vER is used by the speech classifier (210) to classify the beginning of voiced speech and the end of voiced speech, or up-transient mode and down-transient mode, since vER is large when speech has started again and small at the end of voiced speech. The vER parameter may be used in conjunction with the vEprev parameter in classifying transient speech.
In one embodiment, the speech classifier (210) internally generates a current frame energy to three-frame average voiced energy parameter, defined as MIN(20, 10*log10(E/vEav)). The current frame energy to three-frame average voiced energy parameter is hereinafter referred to as vER2. vER2 is used by the speech classifier (210) to classify transient voice modes at the end of voiced speech.
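The two energy-ratio parameters defined above, vER and vER2, can be sketched directly from their formulas; the guard against a zero denominator is an assumption:

    /* Sketch of the energy-ratio parameters defined above:
     *   vER  = 10*log10(E / vEprev)
     *   vER2 = MIN(20, 10*log10(E / vEav))
     * The small floor on the denominators is an assumption to keep the
     * logarithm finite. */
    #include <math.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    double compute_vER(double E, double vEprev)
    {
        if (vEprev < 1e-10) vEprev = 1e-10;    /* assumed guard */
        return 10.0 * log10(E / vEprev);
    }

    double compute_vER2(double E, double vEav)
    {
        if (vEav < 1e-10) vEav = 1e-10;        /* assumed guard */
        return MIN(20.0, 10.0 * log10(E / vEav));
    }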
In one embodiment, the speech classifier (210) internally generates a maximum sub-frame energy index parameter. The speech classifier (210) evenly divides the current frame of output speech into sub-frames and computes the Root Mean Square (RMS) energy value of each sub-frame. In one embodiment, the current frame is divided into ten sub-frames. The maximum sub-frame energy index parameter is the index of the sub-frame having the largest RMS energy value within the current frame. The maximum sub-frame energy index parameter is hereinafter referred to as maxsfe_idx. Dividing the current frame into sub-frames provides the speech classifier (210) with information about the location of peak energy, including the location of the largest peak energy, within a frame. More resolution is achieved by dividing the frame into more sub-frames. maxsfe_idx is used in conjunction with other parameters by the speech classifier (210) to classify transient speech modes, since the energy of unvoiced or silence speech modes is generally stable, while energy ramps up or tapers off in a transient speech mode.
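A minimal sketch of the maxsfe_idx computation for a 160-sample frame divided into ten sub-frames follows:

    /* Sketch of the maximum sub-frame energy index maxsfe_idx: the current
     * frame is split into ten sub-frames, an RMS energy is computed per
     * sub-frame, and the index of the largest one is returned, per the
     * embodiment above. */
    #include <math.h>

    int max_subframe_energy_index(const short *frame, int frame_len)
    {
        const int n_sub   = 10;
        const int sub_len = frame_len / n_sub;      /* e.g. 160/10 = 16 samples */
        int    maxsfe_idx = 0;
        double max_rms    = -1.0;
        for (int s = 0; s < n_sub; s++) {
            double sum = 0.0;
            for (int i = 0; i < sub_len; i++) {
                double x = frame[s * sub_len + i];
                sum += x * x;
            }
            double rms = sqrt(sum / sub_len);
            if (rms > max_rms) { max_rms = rms; maxsfe_idx = s; }
        }
        return maxsfe_idx;
    }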
The speech classifier (210) uses the novel parameters input directly from encoding components, as well as the novel internally generated parameters, to classify the modes of speech more accurately and robustly than was previously possible. The speech classifier (210) applies a novel decision-making process to the directly input and internally generated parameters to produce improved speech classification results. The decision-making process is described in detail below with reference to FIGS. 4A-4C and 5A-5C.
In one embodiment, the speech modes output by the speech classifier (210) comprise: transient, up-transient, down-transient, voiced, unvoiced, and silence modes. Transient mode is voiced but less periodic speech, optimally coded with full-rate CELP. Up-transient mode is the first voiced frame in active speech, optimally coded with full-rate CELP. Down-transient mode is low-energy voiced speech, typically at the end of a word, optimally coded with half-rate CELP. Voiced mode is highly periodic voiced speech, comprising mainly vowels. Voiced mode speech may be coded at full rate, half rate, quarter rate, or eighth rate. The data rate for coding voiced mode speech is selected to meet Average Data Rate (ADR) requirements. Unvoiced mode, comprising mainly consonants, is optimally coded with quarter-rate Noise Excited Linear Prediction (NELP). Silence mode is inactive speech, optimally coded with eighth-rate CELP.
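As a hedged illustration of the mode-to-coder mapping just described, the sketch below returns a coding rate for each mode; the voiced-mode rate is left as a parameter because the text selects it to satisfy the ADR requirement rather than fixing it:

    /* Sketch of the mode-to-coder mapping described above.  The voiced-mode
     * rate is passed in because the text selects it to meet the Average Data
     * Rate (ADR) requirement rather than fixing it. */
    typedef enum {
        MODE_SILENCE, MODE_UNVOICED, MODE_VOICED,
        MODE_TRANSIENT, MODE_UP_TRANSIENT, MODE_DOWN_TRANSIENT
    } speech_mode_t;

    typedef enum { RATE_FULL, RATE_HALF, RATE_QUARTER, RATE_EIGHTH } rate_t;

    rate_t rate_for_mode(speech_mode_t mode, rate_t voiced_rate_for_adr)
    {
        switch (mode) {
        case MODE_TRANSIENT:      return RATE_FULL;           /* full-rate CELP    */
        case MODE_UP_TRANSIENT:   return RATE_FULL;           /* full-rate CELP    */
        case MODE_DOWN_TRANSIENT: return RATE_HALF;           /* half-rate CELP    */
        case MODE_UNVOICED:       return RATE_QUARTER;        /* quarter-rate NELP */
        case MODE_SILENCE:        return RATE_EIGHTH;         /* eighth-rate CELP  */
        case MODE_VOICED:         return voiced_rate_for_adr; /* chosen to meet ADR */
        }
        return RATE_EIGHTH;
    }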
Those skilled in the art will appreciate that the parameters and speech patterns are not limited to those of the disclosed embodiments. Additional parameters and speech modes may be used without departing from the scope of the disclosed embodiments.
FIG. 3 is a flow diagram illustrating one embodiment of the speech classification steps of the robust speech classification technique.
In step 300, the classification parameters input from external components comprise curr_ns_snr and t_in input from the noise suppressor component, nacf and nacf_at_pitch parameters input from the open-loop pitch estimator component, vad input from the voice activity detector component, and refl input from the LPC analysis component. Control flow proceeds to step 302.
In step 302, additional internally generated parameters are computed from the classification parameters input from external components. In an exemplary embodiment, zcr, E, Enext, bER, vEav, vEprev, vER, vER2, and maxsfe_idx are computed from t_in. When the internally generated parameters have been computed for each frame of output speech, control flow proceeds to step 304.
In step 304, NACF thresholds are determined, and a parameter analyzer is selected according to the environment of the speech signal. In an exemplary embodiment, the NACF threshold is determined by comparing the curr_ns_snr parameter input in step 300 with an SNR threshold value. The curr_ns_snr information, derived from the noise suppressor, provides a novel adaptive control of the periodicity decision threshold. In this way, different periodicity thresholds are applied in the classification process for speech signals with different levels of noise components. A more accurate speech classification decision is produced when the most appropriate nacf, or periodicity, threshold for the noise level of the speech signal is selected for each frame of output speech. Determining the most appropriate periodicity threshold for a speech signal allows the selection of the best parameter analyzer for the speech signal.
Clean and noisy speech signals inherently differ in periodicity. When noise is present, speech corruption is present. When speech corruption is present, the measure of the periodicity, or nacf, is lower than that of clean speech. Thus, the nacf threshold is lowered to compensate for a noisy signal environment, or raised for a clean signal environment. The novel speech classification technique of the disclosed embodiments does not fix periodicity thresholds for all environments regardless of noise level, producing more accurate and robust mode decisions.
In the exemplary embodiment, if the value of curr_ns_snr is greater than or equal to an SNR threshold of 25 dB, the nacf thresholds for clean speech are applied. Exemplary nacf thresholds for clean speech are defined by the following table.
Threshold type    Threshold name    Threshold value
Voiced            VOICEDTH          0.75
Transitional      LOWVOICEDTH       0.5
Unvoiced          UNVOICEDTH        0.35
TABLE 1
In the exemplary embodiment, if the value of curr_ns_snr is less than the SNR threshold of 25 dB, the nacf thresholds for noisy speech are applied. Exemplary nacf thresholds for noisy speech are defined by the following table.
Threshold type    Threshold name    Threshold value
Voiced            VOICEDTH          0.65
Transitional      LOWVOICEDTH       0.5
Unvoiced          UNVOICEDTH        0.35
TABLE 2
Noisy speech is the same as clean speech with added noise. With adaptive periodicity threshold control, the robust speech classification technique is more likely than was previously possible to produce identical classification decisions for clean and noisy speech. When the nacf thresholds have been set for each frame, control flow proceeds to step 306.
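The adaptive threshold selection of step 304 can be sketched as follows, using the values of Tables 1 and 2 and the 25 dB SNR threshold given above:

    /* Sketch of the adaptive NACF threshold selection of step 304: the
     * clean-speech thresholds of Table 1 are used when curr_ns_snr is at or
     * above the 25 dB SNR threshold, otherwise the noisy-speech thresholds of
     * Table 2 are used. */
    typedef struct {
        double voicedth;      /* voiced threshold       */
        double lowvoicedth;   /* transitional threshold */
        double unvoicedth;    /* unvoiced threshold     */
    } nacf_thresholds_t;

    nacf_thresholds_t select_nacf_thresholds(double curr_ns_snr_db)
    {
        const nacf_thresholds_t clean = { 0.75, 0.50, 0.35 };  /* Table 1 */
        const nacf_thresholds_t noisy = { 0.65, 0.50, 0.35 };  /* Table 2 */
        return (curr_ns_snr_db >= 25.0) ? clean : noisy;
    }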
In step 306, the parameters input from external components and the internally generated parameters are analyzed to produce a speech mode classification. A state machine, or any other method of analysis selected according to the signal environment, is applied to the parameters. In the exemplary embodiment, the parameters input from external components and the internally generated parameters are applied to a state-based mode decision-making process described in detail with reference to FIGS. 4A-4C and 5A-5C. The decision-making process produces a speech mode classification. In the exemplary embodiment, a speech mode classification of transient, up-transient, down-transient, voiced, unvoiced, or silence is produced. When a speech mode decision has been produced, control flow proceeds to step 308.
In step 308, state variables and various parameters are updated to include the current frame. In the exemplary embodiment, vEav, vEprev, and the voiced state of the current frame are updated. The current frame energy E, nacf_at_pitch, and the current frame speech mode are updated for classifying the next frame.
Steps 300-308 are repeated for each speech frame.
FIGS. 4A-4C illustrate embodiments of the mode decision-making process of an exemplary embodiment of the robust speech classification technique. The decision-making process selects a state machine for speech classification based on the periodicity of the speech frame. For each frame of speech, a state machine best suited to the periodicity, or noise component, of the speech frame is selected for the decision-making process by comparing the speech frame periodicity measure, i.e., the nacf_at_pitch value, with the NACF thresholds set in step 304 of FIG. 3. The level of periodicity of the speech frame limits and controls the state transitions of the mode decision process, producing a more robust classification.
FIG. 4A illustrates an embodiment of the state machine selected in the exemplary embodiment when vad is 1 (there is active speech) and the third value of nacf_at_pitch (i.e., nacf_at_pitch[2], zero-indexed) is very high, or greater than VOICEDTH. VOICEDTH is defined in step 304 of FIG. 3. FIG. 5A illustrates the parameters evaluated in each state.
The initial state is silence. If vad is 0 (i.e. no voice activity), the current frame is always classified as silence regardless of the previous state.
When the previous state is silence, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as unvoiced if nacf_at_pitch[3] is very low, zcr is high, bER is low, and vER is very low, or if a combination of these conditions is met. Otherwise the classification defaults to up-transient.
When the previous state is unvoiced, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as unvoiced if nacf_at_pitch[3] is low, nacf_at_pitch[4] is low, zcr is high, bER is low, vER is low, and E is less than vEprev, or if a combination of these conditions is met. Otherwise the classification defaults to up-transient.
When the previous state is voiced, the current frame may be classified as either unvoiced, transient, down-transient, or voiced. The current frame is classified as unvoiced if vER is low and E is less than vEprev. The current frame is classified as transient if nacf_at_pitch[1] and nacf_at_pitch[3] are low and E is greater than half of vEprev, or if a combination of these conditions is met. The current frame is classified as down-transient if vER is low and nacf_at_pitch[3] has a moderate value. Otherwise the classification defaults to voiced.
When the previous state is transient or up-transient, the current frame may be classified as either unvoiced, transient, down-transient, or voiced. The current frame is classified as unvoiced if vER is low and E is less than vEprev. The current frame is classified as transient if nacf_at_pitch[1] is low, nacf_at_pitch[3] has a moderate value, nacf_at_pitch[4] is low, and the previous state is not transient, or if a combination of these conditions is met. The current frame is classified as down-transient if nacf_at_pitch[3] has a moderate value and E is less than 0.05 times vEav. Otherwise the classification defaults to voiced.
When the previous state is down-transient, the current frame may be classified as either unvoiced, transient, or down-transient. The current frame is classified as unvoiced if vER is low. The current frame is classified as transient if E is greater than vEprev. Otherwise the classification defaults to down-transient.
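For illustration, a partial sketch of the FIG. 4A logic is given below, covering only the case in which the previous frame was voiced. The numeric stand-ins for the qualitative "low" and "moderate" conditions are assumptions (the clean-speech thresholds of Table 1 are reused for the nacf comparisons), and the fixed value for "vER is low" is purely hypothetical:

    /* Partial sketch of the FIG. 4A decision logic, covering only the case
     * where the previous frame was voiced.  The comparisons marked "assumed"
     * stand in for the qualitative "low"/"moderate" conditions of the text,
     * which does not give numeric values for them. */
    typedef enum {                      /* same enumeration as in the earlier sketch */
        MODE_SILENCE, MODE_UNVOICED, MODE_VOICED,
        MODE_TRANSIENT, MODE_UP_TRANSIENT, MODE_DOWN_TRANSIENT
    } speech_mode_t;

    speech_mode_t classify_after_voiced(double vER, double E, double vEprev,
                                        const float *nacf_at_pitch)
    {
        const double VER_LOW    = -10.0;  /* assumed stand-in for "vER is low" */
        const double UNVOICEDTH = 0.35;   /* clean-speech threshold, Table 1   */
        const double VOICEDTH   = 0.75;   /* clean-speech threshold, Table 1   */

        if (vER < VER_LOW && E < vEprev)
            return MODE_UNVOICED;                          /* unvoiced       */
        if (nacf_at_pitch[1] < UNVOICEDTH && nacf_at_pitch[3] < UNVOICEDTH &&
            E > 0.5 * vEprev)
            return MODE_TRANSIENT;                         /* transient      */
        if (vER < VER_LOW &&
            nacf_at_pitch[3] >= UNVOICEDTH && nacf_at_pitch[3] < VOICEDTH)
            return MODE_DOWN_TRANSIENT;                    /* down-transient */
        return MODE_VOICED;                                /* default        */
    }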
FIG. 4B illustrates an embodiment of the state machine selected in the exemplary embodiment when vad is 1 (there is active speech) and the third value of nacf_at_pitch is very low, or less than UNVOICEDTH. UNVOICEDTH is defined in step 304 of FIG. 3. FIG. 5B illustrates the parameters evaluated in each state.
The initial state is silence. If vad is 0 (i.e. no voice activity), the current frame is always classified as silence regardless of the previous state.
When the previous state is silence, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as up-transient if nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderate values, zcr is very low to moderate, bER is high, and vER has a moderate value, or if a combination of these conditions is met. Otherwise the classification defaults to unvoiced.
When the previous state is unvoiced, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as up-transient if nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderate to high values, zcr is very low or moderate, vER is not low, bER is high, refl is low, nacf has a moderate value, and E is greater than vEprev, or if a combination of these conditions is met. The combinations and thresholds for these conditions may vary depending on the noise level of the speech frame, as reflected in the curr_ns_snr parameter. Otherwise the classification defaults to unvoiced.
When the previous state is voiced, up-transient, or transient, the current frame may be classified as either unvoiced, transient, or down-transient. The current frame is classified as unvoiced if bER is less than or equal to zero, vER is very low, bER is greater than zero, and E is less than vEprev, or if a combination of these conditions is met. The current frame is classified as transient if bER is greater than zero, nacf_at_pitch[2-4] show an increasing trend, zcr is not high, vER is not low, refl is low, nacf_at_pitch[3] and nacf have moderate values, and bER is less than or equal to zero, or if a combination of these conditions is met. The combinations and thresholds for these conditions may vary depending on the noise level of the speech frame, as reflected in the curr_ns_snr parameter. The current frame is classified as down-transient if bER is greater than zero, nacf_at_pitch[3] has a moderate value, E is less than Eprev, zcr is not high, and vER2 is less than negative fifteen.
When the previous state is down-transient, the current frame may be classified as either unvoiced, transient, or down-transient. The current frame is classified as transient if nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderately high values, vER is not low, and E is greater than twice Eprev, or if a combination of these conditions is met. The current frame is classified as down-transient if vER is not low and zcr is low. Otherwise the classification defaults to unvoiced.
FIG. 4C illustrates an embodiment of the state machine selected in the exemplary embodiment when vad is 1 (there is active speech) and the third value of nacf_at_pitch (i.e., nacf_at_pitch[3]) is moderate, i.e., greater than UNVOICEDTH and less than VOICEDTH. UNVOICEDTH and VOICEDTH are defined in step 304 of FIG. 3. FIG. 5C illustrates the parameters evaluated in each state.
The initial state is silence. If vad is 0 (i.e. no voice activity), the current frame is always classified as silence regardless of the previous state.
When the previous state is silence, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as up-transient if nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderate to high values, zcr is not high, bER is high, vER has a moderate value, zcr is very low, and E is greater than twice vEprev, or if a combination of these conditions is met. Otherwise the classification defaults to unvoiced.
When the previous state is unvoiced, the current frame may be classified as either unvoiced or up-transient. The current frame is classified as up-transient if nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderate to very high values, zcr is not high, vER is not low, bER is high, refl is low, E is greater than vEprev, zcr is very low, nacf is not low, maxsfe_idx points to the last sub-frame, and E is greater than twice vEprev, or if a combination of these conditions is met. The combinations and thresholds for these conditions may vary depending on the noise level of the speech frame, as reflected in the curr_ns_snr parameter. Otherwise the classification defaults to unvoiced.
When the previous state is voiced, up-transient, or transient, the current frame may be classified as either unvoiced, voiced, transient, or down-transient. The current frame is classified as unvoiced if bER is less than or equal to zero, vER is very low, Enext is less than E, nacf_at_pitch[3-4] are very low, bER is greater than zero, and E is less than vEprev, or if a combination of these conditions is met. The current frame is classified as transient if bER is greater than zero, nacf_at_pitch[2-4] show an increasing trend, zcr is not high, vER is not low, refl is low, and nacf_at_pitch[3] and nacf are not low, or if a combination of these conditions is met. The combinations and thresholds for these conditions may vary depending on the noise level of the speech frame, as reflected in the curr_ns_snr parameter. The current frame is classified as down-transient if bER is greater than zero, nacf_at_pitch[3] is not high, E is less than vEprev, zcr is not high, vER is less than negative fifteen, and vER2 is less than negative fifteen, or if a combination of these conditions is met. The current frame is classified as voiced if nacf_at_pitch[2] is greater than LOWVOICEDTH, bER is greater than or equal to zero, and vER is not low, or if a combination of these conditions is met.
When the previous state is down-transient, the current frame may be classified as either unvoiced, transient, or down-transient. The current frame is classified as transient if bER is greater than zero, nacf_at_pitch[2-4] show an increasing trend, nacf_at_pitch[3-4] have moderately high values, vER is not high, and E is greater than twice Eprev, or if a combination of these conditions is met. The current frame is classified as down-transient if vER is not low and zcr is low. Otherwise the classification defaults to unvoiced.
FIGS. 5A-5C are embodiments of decision tables used by disclosed embodiments of speech classifiers.
FIG. 5A illustrates, in accordance with one embodiment, the parameters evaluated and the state transitions when the third value of nacf_at_pitch (i.e., nacf_at_pitch[2]) is high, or greater than VOICEDTH. The decision table illustrated in FIG. 5A is used by the state machine described in FIG. 4A. The speech mode classification of the previous frame of speech is shown in the leftmost column. When the parameters are valued as shown in the row associated with each previous mode, the speech mode classification transitions to the current mode identified in the top row of the associated column.
FIG. 5B illustrates, in accordance with one embodiment, the parameters evaluated and the state transitions when the third value of nacf_at_pitch (i.e., nacf_at_pitch[2]) is low, or less than UNVOICEDTH. The decision table illustrated in FIG. 5B is used by the state machine described in FIG. 4B. The speech mode classification of the previous frame of speech is shown in the leftmost column. When the parameters are valued as shown in the row associated with each previous mode, the speech mode classification transitions to the current mode identified in the top row of the associated column.
FIG. 5C illustrates, in accordance with one embodiment, the parameters evaluated and the state transitions when the third value of nacf_at_pitch (i.e., nacf_at_pitch[3]) is moderate, i.e., greater than UNVOICEDTH but less than VOICEDTH. The decision table illustrated in FIG. 5C is used by the state machine described in FIG. 4C. The speech mode classification of the previous frame of speech is shown in the leftmost column. When the parameters are valued as shown in the row associated with each previous mode, the speech mode classification transitions to the current mode identified in the top row of the associated column.
FIG. 6 is a timeline diagram of an exemplary embodiment of a speech signal with associated parameter values and speech classification.
Those skilled in the art will appreciate that speech classification may be implemented by a DSP, an ASIC, discrete gate logic, firmware, or any conventional programmable software module and microprocessor. The software modules may reside in RAM memory, flash memory, registers, or any other storage medium known in the art that can be written to. In addition, any conventional processor, controller or state machine may be substituted for the microprocessor.
The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (64)

1. A method of speech classification, comprising:
inputting the classification parameters from the external component to a speech classifier;
generating, within the speech classifier, internal classification parameters from at least one input parameter;
setting a normalized autocorrelation coefficient function threshold and selecting a parameter analyzer according to a signal environment; and
the input parameters and the internal parameters are analyzed to produce a speech pattern classification.
2. The method of claim 1, wherein the input parameters comprise a noise-suppressed speech signal.
3. The method of claim 1, wherein the input parameters comprise signal-to-noise ratio information of a noise-suppressed speech signal.
4. The method of claim 1, wherein the input parameters comprise voice activity information.
5. The method of claim 1, wherein the input parameters comprise linear predictive reflection coefficients.
6. The method of claim 1, wherein the input parameters comprise normalized autocorrelation coefficient function information.
7. The method of claim 1, wherein the input parameters comprise a normalized autocorrelation coefficient function at pitch information.
8. The method of claim 7, wherein the normalized autocorrelation coefficient function at the pitch information is an array of values.
9. The method of claim 1, wherein the internal parameter comprises a zero crossing rate parameter.
10. The method of claim 1, wherein the internal parameters comprise a current frame energy parameter.
11. The method of claim 1, wherein the internal parameters comprise a look-ahead frame energy parameter.
12. The method of claim 1, wherein the internal parameters comprise a band energy ratio parameter.
13. The method of claim 1, wherein the internal parameters comprise a three frame averaged voiced energy parameter.
14. The method of claim 1, wherein the internal parameters comprise a voiced energy parameter averaged over the previous three frames.
15. The method of claim 1, wherein the internal parameters include a parameter of a ratio of a current frame energy to an average voiced energy of previous three frames.
16. The method of claim 1, wherein the internal parameters comprise a current frame energy to three-frame average voiced energy parameter.
17. The method of claim 1, wherein the internal parameter comprises a maximum subframe energy index parameter.
18. The method of claim 1, wherein setting the normalized autocorrelation coefficient function threshold comprises comparing a signal-to-noise information parameter to a predetermined signal-to-noise value.
19. The method of claim 1, in which the analyzing comprises applying a parameter to a state machine.
20. The method of claim 19, wherein the state machine comprises a state for each speech classification mode.
21. The method of claim 1, in which the speech mode classification comprises a transient mode.
22. The method of claim 1, in which the speech mode classification comprises an up-transient mode.
23. The method of claim 1, in which the speech mode classification comprises a down-transient mode.
24. The method of claim 1, wherein the speech mode classification comprises voiced patterns.
25. The method of claim 1, wherein the speech mode classification comprises an unvoiced mode.
26. The method of claim 1, wherein the speech mode classification comprises a silence mode.
27. The method of claim 1, further comprising updating at least one parameter.
28. The method of claim 27, wherein the updated parameters comprise normalized autocorrelation coefficient functions at pitch parameters.
29. The method of claim 27, wherein the updated parameter comprises a three frame averaged voiced energy parameter.
30. The method of claim 27, wherein the updated parameter comprises a look ahead frame energy parameter.
31. The method of claim 27, wherein the updated parameters comprise a previous three frame average voiced energy parameter.
32. The method of claim 27, wherein the updated parameters comprise voice activity detection parameters.
33. A speech classifier, comprising:
a generator for generating classification parameters;
a normalized autocorrelation coefficient function threshold generator for setting a normalized autocorrelation coefficient function threshold and selecting a parameter analyzer according to a signal environment; and
a parameter analyzer for analyzing the at least one external input parameter and the internal parameter to generate a speech pattern classification.
34. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from a noise suppressed speech signal.
35. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from signal-to-noise ratio information.
36. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from voice activity information.
37. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from linear predictive reflection coefficients.
38. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from normalized autocorrelation coefficient function information.
39. The speech classifier of claim 33 wherein the generator for generating classification parameters generates parameters from a normalized autocorrelation coefficient function at pitch information.
40. The speech classifier of claim 39 wherein the normalized autocorrelation coefficient function at the pitch information is an array of values.
41. The speech classifier of claim 33 wherein the generated parameters include a zero crossing rate parameter.
42. The speech classifier of claim 33 wherein the generated parameters include a current frame energy parameter.
43. The speech classifier of claim 33 wherein the generated parameters include a look-ahead frame energy parameter.
44. The speech classifier of claim 33 wherein the generated parameters comprise a band energy ratio parameter.
45. The speech classifier of claim 33 wherein the generated parameters comprise a three frame averaged voiced energy parameter.
46. The speech classifier of claim 33 wherein the generated parameters include a previous three frame average voiced energy parameter.
47. The speech classifier of claim 33 wherein the generated parameters include a parameter of a ratio of a current frame energy to an average voiced energy of the previous three frames.
48. The speech classifier of claim 33 wherein the generated parameters include a current frame energy versus three frame average voiced energy parameter.
49. The speech classifier of claim 33 wherein the generated parameters comprise a maximum subframe energy index parameter.
50. The speech classifier of claim 33 wherein setting the normalized autocorrelation coefficient function threshold comprises comparing a signal to noise information parameter to a predetermined signal to noise value.
51. The speech classifier of claim 33 wherein the analysis comprises applying parameters to a state machine.
52. The speech classifier of claim 51 wherein the state machine comprises a state for each speech classification mode.
53. The speech classifier of claim 33 wherein the speech mode classification comprises a transient mode.
54. The speech classifier of claim 33 wherein the speech mode classification comprises an up-transient mode.
55. The speech classifier of claim 33 wherein the speech mode classification comprises a down-transient mode.
56. The speech classifier of claim 33 wherein the speech pattern classification comprises voiced patterns.
57. The speech classifier of claim 33 wherein the speech mode classification comprises an unvoiced mode.
58. The speech classifier of claim 33 wherein the speech mode classification comprises a silence mode.
59. The speech classifier of claim 33 further comprising updating at least one parameter.
60. The speech classifier of claim 59 wherein the updated parameters include normalized autocorrelation coefficient functions at pitch parameters.
61. The speech classifier of claim 59 wherein the updated parameters comprise a three frame averaged voiced energy parameter.
62. The speech classifier of claim 59 wherein the updated parameters comprise a look-ahead frame energy parameter.
63. The speech classifier of claim 59 wherein the updated parameters comprise an average voiced energy parameter for the first three frames.
64. The speech classifier of claim 59 wherein the updated parameters comprise voice activity detection parameters.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09/733,740 2000-12-08

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
HK04110328.8A Addition HK1067444B (en) 2000-12-08 2001-12-04 Method and apparatus for robust speech classification

Related Child Applications (1)

Application Number Title Priority Date Filing Date
HK04110328.8A Division HK1067444B (en) 2000-12-08 2001-12-04 Method and apparatus for robust speech classification

Publications (1)

Publication Number Publication Date
HK1114939A 2008-11-14
