GB2186160A - Method and apparatus for processing speech signals - Google Patents
Method and apparatus for processing speech signals
- Publication number
- GB2186160A (application GB8700378A)
- Authority
- GB
- United Kingdom
- Legal status: Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B1/00—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
- H04B1/66—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission for reducing bandwidth of signals; for improving efficiency of transmission
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
Abstract
A method and apparatus for processing speech signals is disclosed which is applicable to a variety of speech processing including narrowband, mediumband and wideband coding. The speech signal is modified by a normalization process using the envelope of the speech signal such that the modified signal has more desirable characteristics as seen by the intended processing algorithm (30), eg narrower bandwidth. The modification is achieved by a point-by-point division (normalization) (26) of the signal by an amplitude function (AF), which is obtained by lowpass filtering (16) the magnitude of the signal. Several examples of the normalized signal are presented for both short and long term normalization. Application to pitch detection and speech coding is described.
Description
SPECIFICATION
Method and apparatus for processing speech signals
Background
1. Field of the Invention
This invention relates generally to the field of speech processing. More particularly, this invention relates to a speech processing method and apparatus which utilizes a point-by-point normalization technique to allow for improved processing of speech during periods of short term and very short term variations in amplitude. These amplitude variations would normally not be accurately reproduced by frame or subframe oriented speech processing systems. The present invention has wide application in the speech processing field including the areas of coding and pitch detection.
2. Background Of The Invention
The following references may be useful to the understanding of the present invention and are referenced by number throughout this specification.
REFERENCES
[1] B.S. Atal and M.R. Schroeder, "Adaptive Predictive Coding of Speech Signals", Bell Syst. Tech. J., Vol. 49, pp. 1973-1986, Oct. 1970.
[2] M.J. Sabin and R.M. Gray, "Product Code Vector Quantizers for Waveform and Voice Coding", IEEE Trans. on Acoust., Speech, and Signal Processing, Vol. ASSP-32, pp. 474-488, June 1984.
[3] T. Tremain, "The Government Standard Adaptive Predictive Coding Algorithm", Speech Technology, pp. 52-62, February 1985.
[4] M.M. Sondhi, "New Methods of Pitch Extraction", IEEE Trans. Audio and Electroacoustics, Vol. AU-16, pp. 262-266, June 1968.
[5] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, pp. 150-157, Prentice-Hall, Englewood Cliffs, N.J., 1978.
[6] N.S. Jayant, "Adaptive Quantization with One Word Memory", Bell Syst. Tech. J., Vol. 52, pp. 1119-1144, September 1973.
[7] M. Honda and F. Itakura, "Bit Allocation in Time and Frequency Domain for Predictive Coding of Speech", IEEE Trans. on Acoust., Speech, and Signal Processing, Vol. ASSP-32, pp. 465-473, June 1984.
Block-by-block normalization of speech waveforms is widely used in speech processing systems. Two examples are: 1) the APC method [1], with a block (or frame) size of 10-30 milliseconds, where the residual signal is normalized by a gain; and 2) shape-gain vector quantization (SGVQ) [2], with a block (or vector) size of 0.5-1 milliseconds, where a vector is normalized by its r.m.s. value.
In the APC method, the normalization is done over the whole frame by a single gain value. This causes obvious problems when the amplitude of the signal changes rapidly across the frame. A partial solution to this problem is to find several gain values (scale factors) for each frame, as in [3], where a block is divided into several sub-blocks. A similar sort of problem is also encountered in autocorrelation methods of pitch detection, where center clipping is used to avoid correlation peaks due to harmonic structure [4]-[5].
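As an illustration of this problem (not part of the patent), the following Python sketch shows why a single per-frame gain cannot flatten an amplitude jump inside a frame; `frame_rms_normalize` is a hypothetical helper standing in for the block normalization of [1], not the APC algorithm itself:

```python
import math

def frame_rms_normalize(frame):
    # APC-style block normalization (hypothetical helper): a single
    # r.m.s. gain is computed for the whole frame and divided out.
    gain = math.sqrt(sum(x * x for x in frame) / len(frame))
    return [x / gain for x in frame]

# A frame whose amplitude jumps mid-frame: a quiet half then a loud half.
frame = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
normalized = frame_rms_normalize(frame)

# One gain per frame cannot flatten the jump: the 10:1 amplitude ratio
# inside the frame survives normalization unchanged.
ratio = abs(normalized[4]) / abs(normalized[0])
print(round(ratio, 1))  # 10.0
```

A point-by-point amplitude function, by contrast, tracks the jump within the frame and removes it.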
As will be shown later, these methods of center clipping do not achieve the intended goals in certain cases.
Jayant's backward-adaptive quantizer [6] does an implicit point-by-point normalization of the signal being fed to the quantizer by predicting the signal amplitude. However, this form of normalization is specifically suited to the quantization method, and neither the normalizing function (step-size) nor the normalized signal is useful for other purposes. The present invention utilizes a point-by-point amplitude normalization method with very wide applicability in many areas of speech processing, including coding.
None of the above techniques uses subrate sampling of the positive half of the envelope of the speech signal together with interpolation and normalization to more accurately reproduce or code the digitized speech as taught by the present invention.
Summary of the Invention
It is an object of the present invention to provide an improved method and apparatus for processing speech.
It is another object of the present invention to provide a technique for achieving improved processing of speech signals with a relatively low overhead.
It is another object of the present invention to provide a method for processing speech which provides for improved processing of speech in periods of significant short term or very short term amplitude variations.
It is a further object of the present invention to provide a speech processing method and apparatus which provides point-by-point normalization of a speech signal by utilizing the speech envelope as a normalizing function.
These objectives are accomplished by finding a smooth function which changes with the amplitude of the signal. In the preferred embodiment, this smooth function takes the form of the upper half envelope of the speech signal, but those skilled in the art may recognize other functions which are appropriate. When a point-by-point division of the signal by this function is performed, the resulting waveform has fairly uniform amplitude throughout. For purposes of the present document, those amplitude variations which occur from pitch-to-pitch or frame-to-frame will be called short term (ST), and those which occur within a pitch period will be called very-short-term (VST) or intra-pitch variations. The application and desired results determine whether short-term or very-short-term normalization is desirable. Therefore, in the preferred embodiment the method of computing the amplitude function is parameter selectable to yield the desired normalization rate.
These and other objects of the invention will become apparent to those skilled in the art upon consideration of the following description of the invention.
In one embodiment of the present invention, a method for processing a speech signal includes the steps of taking the absolute value of the speech signal and low pass filtering it to obtain a positive half-envelope function (the Amplitude Function). This positive half-envelope function is then subrate sampled, and values between the subrate samples are determined by interpolation to produce a replica of the amplitude function. The speech signal is then normalized by this amplitude function and wideband coded.
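The steps of this embodiment can be sketched as follows. This is an illustrative Python rendering with assumed parameters, not the patent's exact implementation: a moving average stands in for the lowpass filter, and the subrate and window widths are arbitrary choices for the demonstration.

```python
import math

def amplitude_normalize(signal, lpf_halfwidth=8, subrate=8):
    # Sketch of the pipeline: rectify, lowpass, subrate-sample,
    # interpolate, then divide point by point.
    n = len(signal)
    rectified = [abs(x) for x in signal]           # absolute value step
    envelope = []                                  # lowpass (moving average)
    for i in range(n):
        lo, hi = max(0, i - lpf_halfwidth), min(n, i + lpf_halfwidth + 1)
        envelope.append(sum(rectified[lo:hi]) / (hi - lo))
    samples = [envelope[i] for i in range(0, n, subrate)]   # subrate sampling
    af = []                                        # linear interpolation
    for i in range(n):
        k = min(i // subrate, len(samples) - 2)
        t = (i - k * subrate) / subrate
        af.append(samples[k] + t * (samples[k + 1] - samples[k]))
    eps = 1e-9                                     # guard against divide-by-zero
    normalized = [x / max(a, eps) for x, a in zip(signal, af)]
    return normalized, af

# A tone whose amplitude ramps from 0.1 to 1.0: after normalization the
# early and late peaks come out nearly equal.
sig = [(0.1 + 0.9 * i / 256) * math.sin(2 * math.pi * i / 16) for i in range(256)]
norm, af = amplitude_normalize(sig)
peak_early = max(abs(x) for x in norm[16:48])
peak_late = max(abs(x) for x in norm[208:240])
```

In the raw signal the late peaks are several times larger than the early ones; in the normalized signal the ratio is close to one.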
In another embodiment of the present invention, an apparatus for processing a speech signal includes an input circuit for receiving a speech signal. A circuit is coupled to the speech input for converting the speech input signal to an amplitude function having the characteristics of a positive (or negative) half-envelope of the speech signal. A normalizing circuit, coupled to the input circuit, divides the speech signal by the amplitude function to effect a normalization of the speech signal.
The features of the invention believed to be novel are set forth with particularity in the appended claims. The invention itself, however, both as to organization and method of operation, together with further objects and advantages thereof, may be best understood by reference to the following description taken in conjunction with the accompanying drawing.
Brief Description of the Drawing
Figure 1 is a block diagram of the speech processing method and apparatus of the present invention.
Figure 2 shows three sets of examples of signals having short-term amplitude variation and the result of normalization according to the present invention. In each case a) represents the input signal, b) represents the amplitude function AF, and c) represents the normalized signal.
Figure 3 shows three examples of the application of the present invention to the autocorrelation method of pitch detection. In this Figure, a) represents the input signal, b) represents the AF, c) represents the normalized signal, d) represents the center clipped signal, and e) represents the autocorrelation derived from the center clipped signal. The method of [4] was used in d.1 and e.1, the method of [5] was used in d.2 and e.2, and the normalized signal was used in d.3 and e.3.
Figure 4 shows three examples of very short term amplitude normalization in a wideband coding application. In this Figure, a) represents the input signal, b) represents the step-size from Jayant's 4 bit APCM, c) represents the AF, d) represents the normalized signal and e) is a histogram of the normalized signal.
Detailed Description of the Preferred Embodiment
Turning now to Fig. 1, a block diagram of the normalization method of the present invention is shown. In the preferred embodiment, the present invention is implemented by a programmed processor, as will be appreciated by those skilled in the art. Therefore, the diagram of Fig. 1 may be thought of as both a functional block diagram and an operational flow diagram.
An input signal present at node 10 is applied to an absolute value block 12 as well as a delay block 14. The output of absolute value block 12 is applied to a lowpass filter 16 which removes higher frequency components to produce a positive half-envelope signal at its output 18. The output signal at 18 is sampled at a subsampling block 20. The output of subsampler 20 is applied to a quantizer 22, where the signal is quantized to any of a predetermined number of quantization levels.
In the preferred embodiment, only four or five bits of quantization are used but this is of course not to be limiting and depends greatly on the particular application at hand. The output of 22 is applied to an interpolation block 24. The output of the interpolation block is applied to a divider 26 which is used to divide the delayed output from delay 14 by the output of 24 to provide the normalized signal at node 28. The output of interpolator 24 is essentially a sampled envelope signal shown as AF in the figure and referred to herein as the Amplitude Function. This AF may be used by block 30 as well as the normalized signal at node 28 and the delayed input signal at node 32 in further processing.
The further processing shown in block 30 may take many forms, as will be appreciated by those skilled in the art. For example, the speech data may be coded and assembled into a stream or frame of binary data by circuitry including, for example, an adaptive differential Pulse Code Modulation (ADPCM) coder. This digital signal can then be transmitted, for example by modem, for receipt at a remote location. The remote receiver can then simply use the AF (the quantized, subrate sampled signal from 22 may be easily coded with low overhead) to reconstruct the input signal by denormalization (multiplication). Of course, if the signal from 22 is transmitted rather than the replica of the AF, interpolation is also used at the receiver. In another example, coding and/or pitch detection can be part of the further processing.
Those skilled in the art will recognize numerous variations in the application and implementation of the present invention.
In operation, the input signal at 10 is first processed by the absolute value circuit 12 to produce a signal at its output having all positive values. Those skilled in the art will recognize that other equivalent operations may be substituted for this operation. For example, a half wave rectifier may be used to clip off the bottom (or top) half of the speech signal; or, the speech signal can be squared to remove the negative component and then the square root may be taken. In any event, the object is to detect the peak values in the input signal in preparation for low-pass filter 16. Low-pass filter 16 is used to smooth the signal to produce a signal resembling the positive half of an envelope of the speech signal (assuming symmetry of the speech signal). This signal at 18 is then sampled by a subrate sampler 20 operating at a rate substantially lower than the actual sampling rates required for sampling the speech signal itself, for example approximately 100 Hz for short term amplitude normalization (STAN) and approximately 1000 Hz for very short term amplitude normalization (VSTAN).
In most speech applications, subrate sampling rates in the range of about 100 to 1000 Hz will be appropriate but this is not to be limiting. The output of the sampler is quantized at 22 and then passed on to 24. At 24, an interpolation process is carried out in order to create an Amplitude Function as described later. The interpolation process creates a point in the Amplitude Function for each sample of the input speech signal. This Amplitude Function is then divided point-by-point into the speech signal to effect a normalization of the speech signal.
In the preferred embodiment, low pass filtering is used for interpolation; simple linear interpolation may also be used to create points on the amplitude function in between the subrate samples to enhance computational efficiency. Other known interpolation and curve fitting techniques may prove advantageous in some applications.
Such normalization provides many advantages to some forms of further processing.
Since the quantized amplitude signal is sampled at a low frequency compared with the sampling rate for the speech signal, transmission to a receiver incurs very little overhead while resulting in enhanced accuracy in the reproduction of transient frames of speech. Further, the normalization process has significant advantages when applied to speech coding. Since the signal is normalized, the signal variation, and thus the number of bits required to code the signal accurately, is reduced, resulting in a reduction in the bandwidth required for transmission.
The amplitude function (AF) is preferably obtained by lowpass filtering of the absolute value of the input digital signal. The cut-off frequency of this filter determines the normalization rate. A 9-bit linear representation of the AF is adequate, and thus the division can be very efficiently implemented by a table lookup. For those speech coding applications where knowledge of the normalization function is required at the receiver, the amplitude function is sampled at the Nyquist rate and quantized using 5-bit log-PCM. A reconstruction of the quantized AF is obtained both at the transmitter and the receiver by interpolation so that the same AF is used at both ends. If explicit knowledge of the AF is not required at the receiver, the blocks within the dotted lines may be bypassed. The amplitude normalized signal, the delayed input signal and the AF are then fed to the processing block.
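The 5-bit log-PCM coding of the AF might be sketched as below. The dynamic range (60 dB) and the companding law are assumptions chosen for illustration, not figures taken from the patent:

```python
import math

def log_pcm_quantize(value, bits=5, v_min=1e-3, v_max=1.0):
    # Hypothetical 5-bit log-PCM: code the logarithm of the (positive)
    # AF sample uniformly over an assumed 60 dB dynamic range.
    levels = (1 << bits) - 1
    x = min(max(value, v_min), v_max)
    return round(levels * math.log(x / v_min) / math.log(v_max / v_min))

def log_pcm_dequantize(code, bits=5, v_min=1e-3, v_max=1.0):
    levels = (1 << bits) - 1
    return v_min * (v_max / v_min) ** (code / levels)

# The round-trip error is a roughly constant *percentage* across the
# range, which suits an envelope spanning loud and quiet speech.
errors = [abs(log_pcm_dequantize(log_pcm_quantize(v)) - v) / v
          for v in (0.002, 0.05, 0.8)]
```

With 31 steps over 60 dB, each step is about 1.9 dB, so the relative reconstruction error stays under roughly 12% for any in-range value.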
As will be demonstrated, certain parameters are better estimated from the normalized signal, others from the input signal and yet others from the renormalized version of a processed signal derived from the normalized signal.
For short-term amplitude normalization (STAN), the cut-off frequency of the lowpass filter is set at 25-30 Hz in the preferred embodiment, with the stop-band rejection at 50 Hz and the sampling rate at 100 Hz. For very-short-term amplitude normalization (VSTAN), all the frequencies are preferably set at ten times the above values (approximately 250 to 300 Hz for the cutoff frequency, 500 Hz for the stop-band rejection) and a 1000 Hz sampling rate. In other embodiments, other filter characteristics may be desirable, but in general the cutoff will usually lie somewhere within or between these ranges. Of course, these quantities are merely intended to be illustrative and are not to be limiting, as they will be largely determined by the application at hand. If the transmission of the AF is not required, an IIR filter (Infinite Impulse Response) can be used. Otherwise, an FIR filter (Finite Impulse Response) is preferred due to better computational efficiency in the decimation and interpolation process. The computation required for normalization is nearly 10 operations per point, which would consume 2-5% of real-time on most DSPs.
The delay introduced for STAN is 10-20 milliseconds and for VSTAN it is 2-4 milliseconds. These delays are determined by, and set approximately equal to, the delay inherent in the low-pass filter and are therefore dependent upon the exact nature of the low-pass filter being used for any particular application.
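For a linear-phase FIR low-pass filter, the inherent delay is (N-1)/2 samples, which is one plausible way delays in the quoted ranges arise; the tap count and sampling rate below are hypothetical examples, not figures from the patent:

```python
def fir_group_delay_ms(num_taps, sample_rate_hz):
    # A linear-phase FIR filter delays the signal by (N-1)/2 samples;
    # convert that to milliseconds at the given sampling rate.
    return 1000.0 * (num_taps - 1) / (2.0 * sample_rate_hz)

# e.g. a hypothetical 241-tap FIR at 8 kHz gives a 15 ms delay,
# inside the 10-20 ms range quoted for STAN.
print(fir_group_delay_ms(241, 8000))  # 15.0
```

A ten-times-higher VSTAN cutoff allows a proportionally shorter filter, which is consistent with the ten-times-smaller delay quoted.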
The delay allows the input signal to be properly normalized by the portion of the envelope created by that portion of the input signal so that a more accurate normalization occurs.
Fig. 2 shows three examples, representative of a variety, of short-term amplitude normalization obtained through computer simulation of the present invention. It should be noted that the normalized signal has fairly uniform amplitude throughout and that it has not altered the intra-pitch characteristics for voiced sounds. It will be clear to those skilled in the art that the normalized signal has more desirable amplitude characteristics for pitch detection and quantization for APC, but not for computation of prediction parameters. Example 3 shows a segment containing an unvoiced sound followed by a voiced sound of varying amplitude followed by silence. The 125 millisecond segment contains several frames worth of data. In the frames containing voicing transitions (part of the frame containing voiced sound and the remaining part containing unvoiced/silence), the autocorrelation, and hence the prediction parameters, computed from the normalized signal would have a much greater influence of unvoiced/silence as compared with that computed from the input signal. This would result in poorer prediction.
Therefore, it is generally better to use the input (un-normalized) signal for computation of linear prediction parameters. For this reason, selective use of the normalized signal is used depending upon the actual embodiment.
The usefulness of STAN in the autocorrelation method of pitch detection is evidenced by referring to Fig. 3. A voice segment with changing amplitude is shown in Fig. 3a. The AF and the normalized signal are shown in Figs. 3b and 3c. As suggested in [4], center clipping the signal before computing the autocorrelation function is helpful in attenuating extraneous peaks caused by strong harmonics. At least three known methods of center clipping may be utilized in accordance with the present invention: (1) as suggested in [4], divide the frame into 4 millisecond mini-frames; for each mini-frame, find the maximum absolute value and set the center clipping threshold at 30% of this value; (2) as suggested in [5], find the maximum absolute value in the first third and last third of the frame and set the center clipping threshold at 64% of the minimum of the two values; (3) use the amplitude normalized signal for center clipping, with the center clipping threshold set at 50% of the maximum absolute value.
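Method (3) can be sketched in Python as follows. The clipping rule (zero everything inside the threshold, shift the remainder toward zero) follows the center clipper of [4]; the sample values and the unit AF are invented purely for illustration:

```python
def center_clip(signal, threshold):
    # Center clipping: zero everything inside +/-threshold and shift
    # the surviving samples toward zero by the threshold amount.
    out = []
    for x in signal:
        if x > threshold:
            out.append(x - threshold)
        elif x < -threshold:
            out.append(x + threshold)
        else:
            out.append(0.0)
    return out

def method3(normalized, af):
    # Clip the amplitude-normalized signal at 50% of its maximum
    # absolute value, then re-normalize by multiplying back by the AF.
    thr = 0.5 * max(abs(x) for x in normalized)
    clipped = center_clip(normalized, thr)
    return [c * a for c, a in zip(clipped, af)]

norm = [0.2, -0.9, 1.0, -0.3, 0.8, -1.0, 0.6]
af = [1.0] * 7  # unit AF so the effect of clipping alone is visible
out = method3(norm, af)
# Samples at or below the 0.5 threshold are zeroed; peaks survive, shifted.
```

Because the signal is already amplitude-normalized, a single 50% threshold holds for both loud and quiet regions, which is the advantage over the frame-local thresholds of methods (1) and (2).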
Figs. 3d and 3e show the center clipped waveforms and the autocorrelation functions derived from them. The time scale for the autocorrelation function has been expanded three times for better resolution. For method 3, a re-normalized center clipped signal, obtained by multiplying the center clipped signal by the AF, is used for the autocorrelation computation. It is clear from these figures that method 3 is the most effective in eliminating the harmonic structure while preserving the pitch peaks of the low level signal after center clipping. This success is reflected in the autocorrelation function for method 3, which is free of the extraneous peaks that are present for methods 1 and 2.
Fig. 4 shows two examples of VSTAN and its application to wideband coding. Figs. 4a, 4b, 4c and 4d depict the input waveform, the step-size adaptation in Jayant's 4-bit APCM [6], the very-short-term amplitude function, and the normalized waveform, respectively. Those skilled in the art will recognize that the main difference between the normalized signals in Figs. 3 and 4 is that the latter has also normalized the intra-pitch amplitude variations. The normalized signal of Fig. 4 would be highly undesirable as an input to a pitch detector. On the other hand, it has very desirable amplitude characteristics for a fixed step quantizer. An eleven level uniform quantizer (3.46 bits) gave quality very close to the 4-bit APCM. If the bit-rate required to transmit the amplitude function is included, the two methods have nearly the same rate. However, if an adaptive bit-allocation in the time domain is used based on knowledge of the AF [7], while keeping the total number of bits in a frame (10-20 msec.) constant, the quality is better than the APCM. It should be noted that the APCM is relatively simpler and involves no coding delay. Fig. 4e shows a histogram of the normalized signal.
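A fixed-step quantizer of the kind described (11 levels, log2(11) ≈ 3.46 bits per sample) might look like this mid-tread sketch; the input range [-1, 1] is an assumption, since the normalized signal's actual range is set by the AF:

```python
import math

def uniform_quantize(x, num_levels=11, x_max=1.0):
    # Hypothetical mid-tread uniform quantizer: 11 levels spread over
    # [-x_max, x_max]; the normalized signal needs no step adaptation.
    step = 2.0 * x_max / (num_levels - 1)
    level = round(max(-x_max, min(x_max, x)) / step)
    return level * step

# log2(11) ~= 3.46 bits per sample, as quoted for the normalized signal.
print(round(math.log2(11), 2))  # 3.46
```

Because normalization fixes the signal's dynamic range in advance, a constant step size suffices where APCM would need backward step-size adaptation.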
Thus, a point-by-point normalization of a speech signal as a preprocessing technique to improve subsequent processing is disclosed herein. An amplitude function, obtained by lowpass filtering the magnitude of the signal to produce a signal having a shape resembling half of the envelope of the speech signal, is used for normalization. The cut-off frequency of the lowpass filter plays a very important role in the properties of the normalized signal. Two normalization rates have been found to give the desired amplitude characteristics for different applications.
The present invention may also provide improvements when combined with the vector quantization of the VST amplitude function and the normalized signal similar to that of [2].
The present invention is preferably implemented utilizing a programmed processor, such as a microcomputer, for real time applications, but this is not to be limiting. The present invention may also be implemented using a dedicated hardware processor if desired, or by a more powerful mainframe computer, without departing from the present invention. Those skilled in the art will appreciate that many variations of the implementation of the present invention are possible.
Thus it is apparent that in accordance with the present invention an apparatus that fully satisfies the objectives, aims and advantages is set forth above. While the invention has been described in conjunction with a specific embodiment, it is evident that many alternatives, modifications and variations will become apparent to those skilled in the art in light of the foregoing description. Accordingly, it is intended that the present invention embrace all such alternatives, modifications and variations as fall within the spirit and broad scope of the appended claims.
Claims (7)
1. A method for processing a speech signal, comprising the steps of:
detecting peak signal levels in said speech signal; computing an amplitude function signal from said peak signal levels of said speech signal, said amplitude function signal having an amplitude approximating an envelope of the amplitude of said peak signal of said speech signal;
normalizing said speech signal by said amplitude function; and
wideband coding said normalized speech signal.
2. The method of claim 1, wherein said computing step includes the step of:
low-pass filtering said peak signal levels to produce a low-pass filtered signal.
3. The method of claim 2, wherein said low-pass filtering step is followed by the steps of:
subrate sampling said low-pass filtered signal; and
interpolating between points in said subrate sampled signal to produce a replica of said amplitude function.
4. The method of any preceding claim, wherein said steps of normalizing and computing provide very short term amplitude normalization.
5. A method for processing a speech signal substantially as hereinbefore described with reference to the accompanying drawings.
6. An apparatus for processing a speech signal, comprising:
input means for receiving said speech signal;
delay means, receiving said speech signal for producing a delayed speech signal;
converting means, coupled to said input means, for converting said speech signal to an amplitude function signal approximating a positive half-envelope of said speech signal;
normalizing means for normalizing said delayed speech signal by said amplitude signal; and
wherein said converting means includes:
means for subrate sampling said positive half-envelope; and
means for interpolating between points of said subrate sampled half-envelope signal.
7. An apparatus for processing a speech signal substantially as hereinbefore described with reference to the accompanying drawings.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US82298286A | 1986-01-24 | 1986-01-24 |
Publications (3)
Publication Number | Publication Date |
---|---|
GB8700378D0 GB8700378D0 (en) | 1987-02-11 |
GB2186160A true GB2186160A (en) | 1987-08-05 |
GB2186160B GB2186160B (en) | 1989-11-01 |
Family
ID=25237469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB8700378A Expired GB2186160B (en) | 1986-01-24 | 1987-01-08 | Method and apparatus for processing speech signals |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPH0198000A (en) |
GB (1) | GB2186160B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015191102A (en) * | 2014-03-28 | 2015-11-02 | パイオニア株式会社 | acoustic device and signal processing method |
-
1987
- 1987-01-08 GB GB8700378A patent/GB2186160B/en not_active Expired
- 1987-01-23 JP JP1503187A patent/JPH0198000A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0424161A2 (en) * | 1989-10-18 | 1991-04-24 | Victor Company Of Japan, Ltd. | System for coding and decoding an orthogonally transformed audio signal |
EP0424161A3 (en) * | 1989-10-18 | 1992-05-06 | Victor Company Of Japan, Ltd. | System for coding and decoding an orthogonally transformed audio signal |
WO1994025958A2 (en) * | 1993-04-22 | 1994-11-10 | Frank Uldall Leonhard | Method and system for detecting and generating transient conditions in auditory signals |
WO1994025958A3 (en) * | 1993-04-22 | 1995-02-02 | Frank Uldall Leonhard | Method and system for detecting and generating transient conditions in auditory signals |
US5884260A (en) * | 1993-04-22 | 1999-03-16 | Leonhard; Frank Uldall | Method and system for detecting and generating transient conditions in auditory signals |
Also Published As
Publication number | Publication date |
---|---|
GB2186160B (en) | 1989-11-01 |
GB8700378D0 (en) | 1987-02-11 |
JPH0198000A (en) | 1989-04-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) | ||
PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20000108 |