Algorithms For Speech Processing
Speech/Non-speech detection
Rule-based method using log energy and zero crossing rate
Assumes a single speech interval embedded in background noise
Voiced/Unvoiced/Background classification
Bayesian approach using 5 speech parameters
Needs to be trained (mainly to establish statistics for background signals)
Pitch detection
Estimation of the pitch period (or pitch frequency) during regions of voiced speech
Implicitly needs classification of the signal as voiced speech
Algorithms in the time domain, frequency domain, cepstral domain, or using LPC-based processing methods
Formant estimation
Estimation of the frequencies of the major resonances during voiced speech regions
Implicitly needs classification of the signal as voiced speech
Need to handle birth and death processes as formants appear and disappear depending on spectral intensity
The Problem
Pitch period discontinuities that need to be smoothed for use in speech processing systems:
- individual pitch period errors
- individual voiced/unvoiced errors (pitch period set to 0)
- regions of pitch period errors
The solution: the median smoother
Running Medians
Non-Linear Smoothing
Linear smoothers (filters) are not always appropriate for smoothing parameter estimates because they smear and blur discontinuities; linear smoothing of a pitch period contour would emphasize errors and distort the contour
Use a combination of a non-linear smoother based on running medians and a linear smoother
- linear smoothing => separation of signals based on non-overlapping frequency content
- non-linear smoothing => separation of signals based on their character (smooth or noise-like)
x[n] = S(x[n]) + R(x[n]) -- smooth + rough components
y(x[n]) = median(x[n]) = M_L(x[n])
M_L(x[n]) = median of x[n], x[n−1], ..., x[n−L+1]
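A minimal Python sketch (numpy assumed) of the running median M_L(x[n]); the window length L = 5 and the shrinking-window edge handling are illustrative choices, not specified in the slides.

```python
import numpy as np

def running_median(x, L=5):
    """Running median M_L(x[n]) over the L most recent samples x[n-L+1..n].

    Edges are handled by shrinking the window, one of several reasonable
    conventions (not specified in the slides)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    for n in range(len(x)):
        start = max(0, n - L + 1)          # window covers x[n-L+1] ... x[n]
        y[n] = np.median(x[start:n + 1])
    return y

# Example: a smooth pitch contour corrupted by isolated errors
x = np.linspace(50, 60, 40)
x[10] = 0.0        # isolated voiced/unvoiced error (pitch period set to 0)
x[25] = 120.0      # isolated gross pitch period error
print(running_median(x, L=5)[8:13])  # outliers removed, contour preserved
```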
Median Smoothing
Nonlinear Smoother
[Block diagram: the smoother splits x(n) into a smooth component S[x(n)] and a rough component R[x(n)]]
- y[n] is an approximation to the signal S(x[n])
- a second pass of non-linear smoothing improves performance, based on: y[n] ≈ S(x[n])
- the difference signal z[n] is formed as: z[n] = x[n] − y[n] ≈ R(x[n])
- a second pass of nonlinear smoothing of z[n] yields a correction term that is added to y[n] to give w[n], a refined approximation to S(x[n]): w[n] = S(x[n]) + S[R(x[n])]
- if z[n] = R(x[n]) exactly, i.e., the non-linear smoother was ideal, then S[R(x[n])] would be identically zero and the correction term would be unnecessary
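A sketch of the combined median-plus-linear smoother with the second-pass correction described above; the particular 5-point median and the [1/4, 1/2, 1/4] linear filter are assumed choices (the slides do not fix them), and delay compensation is omitted.

```python
import numpy as np

def median5(x):
    # 5-point running median (shrinking window at the edges)
    return np.array([np.median(x[max(0, n - 4):n + 1]) for n in range(len(x))])

def hann3(x):
    # 3-point linear smoother with weights [1/4, 1/2, 1/4]; an assumed choice
    return np.convolve(x, [0.25, 0.5, 0.25], mode='same')

def smooth(x):
    # one pass of nonlinear smoothing: running median followed by linear filter
    return hann3(median5(x))

def nonlinear_smoother(x):
    """Two-pass smoother: y[n] ~ S(x[n]); z[n] = x[n] - y[n]; w[n] = y[n] + S[z[n]]."""
    x = np.asarray(x, dtype=float)
    y = smooth(x)            # first-pass approximation to the smooth component
    z = x - y                # estimate of the rough component R(x[n])
    w = y + smooth(z)        # add the smoothed correction term
    return w
```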
Algorithm #1
Speech/Non-Speech Detection Using Simple Rules
Need to detect the beginning and end of speech (endpoint detection) to enable:
- computation reduction (don't have to process the background signal)
- better recognition performance (can't mistake background for speech)
A non-trivial problem except for high-SNR recordings
1. Sampling Rate Conversion -- to a standard sampling rate of 10 kHz
2. Highpass Filter -- to eliminate DC offset and hum, using a 101-point FIR equiripple highpass filter
3. Short-Time Analysis -- frame size of 40 msec; frame shift of 10 msec; compute short-time log energy and short-time zero crossing rate (per 10 msec interval)
Speech/Non-Speech Detection
6. Move backwards from N1, comparing Z100 to IZCT, and find the first point where Z100 exceeds IZCT; similarly, move forward from N2, comparing Z100 to IZCT, and find the last point where Z100 exceeds IZCT.
1. Find the heart of the signal via a conservative energy threshold => Interval 1
2. Refine the beginning and ending points using a tighter threshold on energy => Interval 2
3. Check outside these regions using the zero crossing rate and an unvoiced threshold => Interval 3 (see the sketch below)
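A simplified sketch of the short-time analysis and the three interval-refinement rules; the threshold values ITU, ITL, IZCT (assumed to be set from background-signal statistics) and the 25-frame (250 msec) search region are illustrative assumptions, not values given in the slides.

```python
import numpy as np

def short_time_features(x, fs=10000, frame_ms=40, shift_ms=10):
    """Per-10-msec log energy and zero-crossing rate (per 100 samples) using 40-msec frames."""
    N = int(fs * frame_ms / 1000)
    R = int(fs * shift_ms / 1000)
    logE, zcr = [], []
    for start in range(0, len(x) - N + 1, R):
        frame = x[start:start + N]
        logE.append(10 * np.log10(np.sum(frame ** 2) + 1e-10))
        zcr.append(100 * np.sum(np.abs(np.diff(np.sign(frame)))) / (2 * (N - 1)))
    return np.array(logE), np.array(zcr)

def detect_endpoints(logE, zcr, ITU, ITL, IZCT):
    """Three-rule refinement; thresholds ITU > ITL (energy) and IZCT (ZCR) assumed given."""
    core = np.where(logE > ITU)[0]            # Interval 1: conservative energy threshold
    if core.size == 0:
        return None
    N1, N2 = core[0], core[-1]
    while N1 > 0 and logE[N1 - 1] > ITL:      # Interval 2: tighter energy threshold
        N1 -= 1
    while N2 < len(logE) - 1 and logE[N2 + 1] > ITL:
        N2 += 1
    # Interval 3: search outward (25 frames assumed) for unvoiced, high-ZCR speech
    for n in range(N1 - 1, max(-1, N1 - 26), -1):
        if zcr[n] > IZCT:
            N1 = n
    for n in range(N2 + 1, min(len(zcr), N2 + 26)):
        if zcr[n] > IZCT:
            N2 = n
    return N1, N2
```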
Algorithm #2
Voiced/Unvoiced/Background Classification
- Utilize a Bayesian statistical approach to classification of frames as voiced speech, unvoiced speech, or background signal (i.e., a 3-class recognition/classification problem)
- Use 5 short-time speech parameters as the basic feature set
- Utilize a (hand) labeled training set to learn the statistics (means and variances for a Gaussian model) of each of the 5 short-time speech parameters for each of the classes
Speech Parameters
X = [x1, x2, x3, x4, x5]
x1 = log E_S -- short-time log energy of the signal
x2 = Z100 -- short-time zero crossing rate of the signal for a 100-sample frame
x3 = C1 -- short-time autocorrelation coefficient at unit sample delay
x4 = α1 -- first predictor coefficient of a p-th order linear predictor
x5 = E_p -- normalized energy of the prediction error of a p-th order linear predictor
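A sketch of computing the 5-parameter feature vector for one frame; the LPC order p = 12 and the autocorrelation-method solution of the normal equations are assumed details (the slides only say "p-th order"), and the ZCR is normalized to crossings per 100 samples as an approximation to Z100.

```python
import numpy as np

def frame_features(frame, p=12):
    """Compute X = [x1, ..., x5] for one frame (LPC order p assumed)."""
    N = len(frame)
    x1 = 10 * np.log10(np.sum(frame ** 2) + 1e-10)                       # log energy
    x2 = 100 * np.sum(np.abs(np.diff(np.sign(frame)))) / (2 * (N - 1))   # ZCR per 100 samples
    r = np.array([np.dot(frame[:N - k], frame[k:]) for k in range(p + 1)])
    x3 = r[1] / (r[0] + 1e-10)                                           # C1: normalized lag-1 autocorrelation
    Rm = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(Rm + 1e-6 * np.eye(p), r[1:p + 1])               # predictor coefficients alpha_1..alpha_p
    x4 = a[0]                                                            # first predictor coefficient
    Ep = r[0] - np.dot(a, r[1:p + 1])                                    # prediction error energy
    x5 = Ep / (r[0] + 1e-10)                                             # normalized prediction error
    return np.array([x1, x2, x3, x4, x5])
```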
Manual Training
Using a designated training set of sentences, each 10 msec interval is classified manually (based on waveform displays and plots of parameter values) as either:
- Voiced speech -- clear periodicity seen in the waveform
- Unvoiced speech -- clear indication of frication or whisper
- Background signal -- lack of voicing or unvoicing traits
- Unclassified -- unclear as to whether low-level voiced, low-level unvoiced, or background signal (usually at speech beginnings and endings); not used as part of the training set
Each classified frame is used to train a single Gaussian model for each speech parameter and for each pattern class; i.e., the mean and variance of each speech parameter are measured for each of the 3 classes
Bayesian Classifier
Class 1, ω_i, i = 1, representing the background signal class
Class 2, ω_i, i = 2, representing the unvoiced class
Class 3, ω_i, i = 3, representing the voiced class
Bayesian Classifier
Maximize the probability:
p(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)
where
p(x) = Σ_{i=1}^{3} p(x | ω_i) P(ω_i)
and p(x | ω_i) is the multivariate Gaussian class-conditional density:
p(x | ω_i) = (2π)^{−5/2} |W_i|^{−1/2} exp[ −(1/2)(x − m_i)^T W_i^{−1}(x − m_i) ]
Bayesian Classifier
Maximize p(ω_i | x) using the monotonic discriminant function
g_i(x) = ln p(ω_i | x) = ln[ p(x | ω_i) P(ω_i) ] − ln p(x) = ln p(x | ω_i) + ln P(ω_i) − ln p(x)
Disregard the term ln p(x) since it is independent of the class ω_i, giving
g_i(x) = −(1/2)(x − m_i)^T W_i^{−1}(x − m_i) + ln P(ω_i) + c_i
c_i = −(5/2) ln(2π) − (1/2) ln|W_i|
Bayesian Classifier
Ignore the bias term c_i and the a priori class probability ln P(ω_i). Then we can convert the maximization to a minimization by reversing the sign, giving the decision rule:
Decide class ω_i if and only if
d_i(x) = (x − m_i)^T W_i^{−1}(x − m_i) ≤ d_j(x)  for all j ≠ i
Utilize a confidence measure, based on relative decision scores, to enable a no-decision output when no reliable class information is obtained.
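A sketch of training the per-class Gaussian statistics and applying the minimum-distance decision rule d_i(x) = (x − m_i)^T W_i^{−1}(x − m_i); the covariance regularization and the ratio-based no-decision test (margin = 1.5) are assumptions, since the slides only say a confidence measure based on relative decision scores is used.

```python
import numpy as np

def train_class_models(frames_by_class):
    """Estimate per-class mean m_i and inverse covariance W_i^{-1} from labeled frames.

    frames_by_class: dict mapping class index (1=background, 2=unvoiced, 3=voiced)
    to an (n_frames, 5) array of feature vectors."""
    models = {}
    for i, X in frames_by_class.items():
        m = X.mean(axis=0)
        W = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularized (assumed)
        models[i] = (m, np.linalg.inv(W))
    return models

def classify(x, models, margin=1.5):
    """Minimum-distance rule; return 0 (no decision) if best and second-best
    distances are too close (ratio below `margin`, an assumed value)."""
    d = {i: float((x - m) @ Winv @ (x - m)) for i, (m, Winv) in models.items()}
    ranked = sorted(d, key=d.get)
    best, second = ranked[0], ranked[1]
    if d[second] < margin * d[best]:
        return 0                      # no-decision output
    return best
```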
Classification Performance
Training Set
Background -- Class 1; Unvoiced -- Class 2; Voiced -- Class 3
Typical Classifications
VUS classification and confidence scores (scaled by factor of 3) for: (a) Synthetic vowel sequence (b) All voiced utterance (c) - (e) Speech utterances with mixtures of voiced, unvoiced, and background regions
[Table residue: per-class frame counts of 76, 57, 313 and 94, 82, 375]
Algorithm #3
Pitch Detection (Pitch Period Estimation Methods)
[Figure: periodic voiced waveform with pitch period T0 = 50 samples (time axis in samples)]
In reality, we can't get either (although we use signal processing to either try to flatten the signal spectrum, or eliminate all harmonics but the fundamental)
[Figures: time waveforms (amplitude vs. time in samples) and log magnitude spectra (frequency, 0-5000 Hz) of voiced speech]
Pitch Detector
1. Filter the speech to the 900 Hz region (adequate for all ranges of pitch; eliminates extraneous signal harmonics)
2. Find all positive and negative peaks in the waveform
3. At each positive peak:
   - determine the peak amplitude pulse (positive pulses only)
   - determine the peak-valley amplitude pulse (positive pulses only)
   - determine the peak-previous peak amplitude pulse (positive pulses only)
4. At each negative peak, determine the corresponding three measurements (negative pulses only), giving 6 elementary pulse trains in all
5. Filter the pulses with an exponential (peak detecting) window to eliminate false positives and negatives that are far too short to be pitch pulse estimates
6. Determine the pitch period estimate as the time between remaining major pulses in each of the six elementary pitch period detectors
7. Vote for the best pitch period estimate by combining the 3 most recent estimates from each of the 6 pitch period detectors
8. Clean up errors using some type of non-linear smoother
a set of peaks and valleys (local maxima and minima) are located, and from their locations and amplitudes, 6 impulse trains are derived
Each impulse train is processed by a time-varying non-linear system (called a peak detecting exponential window):
- when an impulse of sufficient amplitude is detected => the output is reset to the value of the impulse and held for a blanking interval, τ(n), during which no new pulses can be detected
- after the blanking interval, the detector output decays exponentially, with a rate of decay dependent on the most recent estimate of the pitch period
- the decay continues until an impulse that exceeds the level of the decay is detected
- the output is a quasi-periodic sequence of pulses, and the duration between estimated pulses is an estimate of the pitch period
- the pitch period is estimated periodically, e.g., 100 times/sec
- the 6 current estimates are combined with the two most recent estimates for each of the 6 detectors
- the pitch period with the most occurrences (to within some tolerance) is declared the pitch period estimate at that time
- the algorithm works well for voiced speech
- there is a lack of pitch period consistency for unvoiced speech or background signal
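A sketch of the peak-detecting exponential window applied to one of the six elementary impulse trains; the blanking and decay fractions and the initial period guess are assumed values (the slides state only that both depend on the most recent pitch period estimate).

```python
import numpy as np

def exponential_window_detector(pulses, blank_frac=0.4, decay_frac=0.7):
    """Apply a peak-detecting exponential window to one impulse train.

    `pulses` is a sparse array: impulse amplitude at its sample index, 0 elsewhere.
    Returns the sample indices of the accepted pulses; the spacing between
    successive indices is this elementary detector's pitch-period estimate."""
    period = 100                      # initial period guess in samples (assumed)
    level, last = 0.0, -period
    accepted = []
    for n, a in enumerate(pulses):
        dt = n - last
        if dt <= blank_frac * period:
            continue                  # inside blanking interval: no new pulses detected
        # exponential decay after the blanking interval ends
        level_now = level * np.exp(-(dt - blank_frac * period) / (decay_frac * period))
        if a > 0 and a > level_now:
            accepted.append(n)
            if last >= 0:
                period = n - last     # update the pitch-period estimate
            level, last = a, n
    return np.array(accepted)
```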
Pitch Detector
Autocorrelation Method of Pitch Detection
- using synthetic speech gives a measure of the accuracy of the algorithm
- pitch period estimates are generally within 2 samples of the actual pitch period
- the first 10-30 msec of voicing is often classified as unvoiced, since the decision method needs about 3 pitch periods before the consistency check works properly => a delay of 2 pitch periods in detection
need some type of spectrum flattening so that the speech signal more closely approximates a periodic impulse train => center clipping spectrum flattener
Center Clipping
[Block diagram: frame x[n], n = 0,1,...,559 → center clipper with clipping level CL = % of Amax (e.g., 30%) → autocorrelation R[k], k = 0,1,...,pmax+10 → peak search over pmin ≤ ploc ≤ pmax]
Center Clipper definition:
- if x(n) ≥ CL: y(n) = x(n) − CL
- if |x(n)| < CL: y(n) = 0
- if x(n) ≤ −CL: y(n) = x(n) + CL
3-Level Center Clipper definition:
- y(n) = +1 if x(n) > CL
- y(n) = −1 if x(n) < −CL
- y(n) = 0 otherwise
- significantly simplified computation (no multiplications)
- the autocorrelation function is very similar to that from a conventional center clipper => most of the extraneous peaks are eliminated and a clear indication of periodicity is retained
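A sketch of the 3-level center clipper followed by an autocorrelation pitch estimate; the 30% clipping level follows the example above, while the 60-400 Hz search range is an assumed choice.

```python
import numpy as np

def three_level_clip(x, frac=0.3):
    """3-level center clipper: +1 above CL, -1 below -CL, 0 otherwise.
    CL is set to a fraction of the frame's peak amplitude (30% assumed)."""
    CL = frac * np.max(np.abs(x))
    y = np.zeros_like(x, dtype=float)
    y[x > CL] = 1.0
    y[x < -CL] = -1.0
    return y

def autocorr_pitch(frame, fs=10000, fmin=60.0, fmax=400.0):
    """Pitch-period estimate from the autocorrelation of the clipped frame.
    The 60-400 Hz search range is an assumed choice."""
    y = three_level_clip(frame)
    r = np.correlate(y, y, mode='full')[len(y) - 1:]
    pmin, pmax = int(fs / fmax), int(fs / fmin)
    ploc = pmin + np.argmax(r[pmin:pmax + 1])
    return ploc, fs / ploc            # period in samples, F0 in Hz
```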
Center Clipping
Autocorrelation functions of center-clipped speech using L = 401 sample analysis frames:
(a) Clipping level set at 90% of max
(b) Clipping level set at 60% of max
(c) Clipping level set at 30% of max
Second and fourth harmonics are much stronger than the first and third harmonics, leading to a potential pitch doubling error.
Fourth harmonic strongest; second harmonic stronger than first; fourth harmonic stronger than third (or second or first); a potential pitch doubling error results.
Harmonic product spectrum:
P_n(e^{jω}) = Π_{r=1}^{K} |X_n(e^{jωr})|
Log harmonic product spectrum:
log P_n(e^{jω}) = Σ_{r=1}^{K} log |X_n(e^{jωr})|
- log P_n is a sum of K frequency-compressed replicas of log |X_n(e^{jω})| => for periodic voiced speech, the harmonics will all align at the fundamental frequency and reinforce each other
- sharp peak at F0
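A sketch of the log harmonic product spectrum; the number of compressed replicas K = 5, the Hamming window, the FFT size, and the 60-400 Hz search range are assumed details.

```python
import numpy as np

def log_harmonic_product_spectrum_f0(frame, fs=10000, K=5, nfft=4096):
    """Sum K frequency-compressed copies of log|X(e^jw)| and pick the peak as F0."""
    X = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) + 1e-10
    logX = np.log(X)
    M = len(logX) // K                 # common length after the largest compression
    P = np.zeros(M)
    for r in range(1, K + 1):
        P += logX[::r][:M]             # replica compressed in frequency by factor r
    lo, hi = int(60 * nfft / fs), int(400 * nfft / fs)   # assumed F0 search range
    f0_bin = lo + np.argmax(P[lo:hi + 1])
    return f0_bin * fs / nfft          # F0 estimate in Hz
```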
15 frames of voiced speech from male talker; pitch frequency goes from 175 Hz down to 130 Hz
15 frames of voiced speech from female talker; pitch frequency goes from 190 Hz up to 240 Hz
A strong cepstral peak in the 3-20 msec range is a strong indication of voiced speech; the absence of such a peak does not guarantee unvoiced speech.
- the cepstral peak depends on the length of the window and on the formant structure
- the maximum height of the pitch peak is 1 (rectangular window (RW), unchanging pitch, window contains exactly N periods); the height varies dramatically with Hamming window (HW), changing pitch, and window interactions with the pitch period
=> need at least 2 full pitch periods in the window to define the pitch period well in the cepstrum
=> need a 40 msec window for a low-pitch male, but this is way too long for a high-pitch female
need a very low threshold (e.g., 0.1) on the cepstral pitch peak, with lots of secondary verifications of the pitch period
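A sketch of cepstral pitch detection using the low peak threshold (0.1) and the 3-20 msec quefrency range from the slides; the Hamming window and FFT length are assumed details, and the secondary verifications are omitted.

```python
import numpy as np

def cepstral_pitch(frame, fs=10000, threshold=0.1, tmin_ms=3.0, tmax_ms=20.0):
    """Real-cepstrum pitch detector: find the peak in the 3-20 msec quefrency
    range and accept as voiced only if it exceeds the (low) threshold."""
    w = frame * np.hamming(len(frame))
    nfft = 2 ** int(np.ceil(np.log2(2 * len(frame))))
    c = np.fft.irfft(np.log(np.abs(np.fft.rfft(w, nfft)) + 1e-10))
    nmin, nmax = int(tmin_ms * fs / 1000), int(tmax_ms * fs / 1000)
    k = nmin + np.argmax(c[nmin:nmax + 1])
    if c[k] < threshold:
        return None                    # no reliable pitch peak => likely unvoiced
    return k, fs / k                   # pitch period (samples), F0 (Hz)
```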
- sampling rate reduced from 10 kHz to 2 kHz
- p = 4 LPC analysis
- inverse filter the signal to give a spectrally flat result
- compute the short-time autocorrelation and find the strongest peak in the estimated pitch region
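A sketch of this LPC-based procedure; scipy's decimate anti-aliasing filter, the ridge regularization of the normal equations, and the 60-400 Hz search range are assumed details.

```python
import numpy as np
from scipy.signal import decimate, lfilter

def lpc_pitch(frame_10k, fs=10000, fmin=60.0, fmax=400.0):
    """Decimate 10 kHz -> 2 kHz, p=4 LPC analysis, inverse filter to flatten the
    spectrum, then pick the strongest autocorrelation peak in the pitch range."""
    x = decimate(frame_10k, 5)                          # 10 kHz -> 2 kHz
    fs2, p = fs // 5, 4
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R + 1e-6 * np.eye(p), r[1:])    # predictor coefficients
    e = lfilter(np.concatenate(([1.0], -a)), [1.0], x)  # inverse-filtered (residual) signal
    re = np.correlate(e, e, mode='full')[len(e) - 1:]
    pmin, pmax = int(fs2 / fmax), int(fs2 / fmin)
    ploc = pmin + np.argmax(re[pmin:pmax + 1])
    return fs2 / ploc                                   # F0 estimate in Hz
```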
Speech Synthesis
- can use cepstrally (or LPC) estimated parameters to control a speech synthesis model
- for voiced speech the vocal tract transfer function is modeled as
V(z) = Π_{k=1}^{4} [1 − 2 e^{−σ_k T} cos(2π F_k T) + e^{−2σ_k T}] / [1 − 2 e^{−σ_k T} cos(2π F_k T) z^{−1} + e^{−2σ_k T} z^{−2}]
-- a cascade of digital resonators (F1-F4) with unity gain at f = 0
-- estimate F1-F3 using formant estimation methods, with F4 fixed at 4000 Hz
-- formant bandwidths (σ1-σ4) fixed
-- a fixed spectral compensation approximates the glottal pulse shape and radiation:
S(z) = [(1 − e^{−aT})(1 + e^{−bT})] / [(1 − e^{−aT} z^{−1})(1 + e^{−bT} z^{−1})],  a = 400π, b = 5000π
Speech Synthesis
- for unvoiced speech the model is a complex pole and zero of the form
V(z) = [(1 − 2 e^{−σT} cos(2π F_p T) + e^{−2σT})(1 − 2 e^{−σT} cos(2π F_z T) z^{−1} + e^{−2σT} z^{−2})] / [(1 − 2 e^{−σT} cos(2π F_p T) z^{−1} + e^{−2σT} z^{−2})(1 − 2 e^{−σT} cos(2π F_z T) + e^{−2σT})]
-- F_p = largest peak in the smoothed spectrum above 1000 Hz
-- F_z is set from F_p and the spectral level difference Δ = 20 log10 |H(e^{j2π F_p T})| − 20 log10 |H(e^{j0})| through the empirical factors (0.0065 F_p + 4.5) and (0.014 F_p + 28)
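A sketch of the cascade formant synthesizer for voiced speech; the resonator form matches V(z) above with unity gain at f = 0, while the bandwidth convention σ = π·BW, the example formant and bandwidth values, the simple impulse-train source, and the omission of the spectral compensation S(z) are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(F, BW, fs):
    """Second-order digital resonator with unity gain at f = 0:
    numerator = 1 - 2 e^{-sT} cos(2 pi F T) + e^{-2sT} (a constant),
    denominator = 1 - 2 e^{-sT} cos(2 pi F T) z^{-1} + e^{-2sT} z^{-2},
    with s = pi * BW (an assumed bandwidth convention)."""
    T = 1.0 / fs
    r = np.exp(-np.pi * BW * T)
    c = 2 * r * np.cos(2 * np.pi * F * T)
    b = [1 - c + r * r]               # constant numerator => H(1) = 1 (unity gain at f = 0)
    a = [1.0, -c, r * r]
    return b, a

def synthesize_voiced(formants, bandwidths, pitch_period, n_samples, fs=10000):
    """Cascade formant synthesizer sketch: impulse-train excitation -> 4 resonators."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0           # quasi-periodic excitation (illustrative source)
    y = e
    for F, BW in zip(formants, bandwidths):
        b, a = resonator(F, BW, fs)
        y = lfilter(b, a, y)
    return y

# e.g. synthesize_voiced([500, 1500, 2500, 4000], [60, 90, 120, 150], 80, 4000)
```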
- essential features of the signal are well preserved
- very intelligible synthetic speech
- the speaker is easily identified
formant synthesis
600 bps total rate for voiced speech with 100 bps for V/UV decisions
(a) original; (b) smoothed; (c) quantized and decimated by a 3-to-1 ratio -- little perceptual difference
Based on the model of speech production, we can build a speech synthesizer on the basis of the speech parameters estimated by the above set of algorithms and synthesize intelligible speech.