CA1199730A - Method and apparatus for continuous word string recognition - Google Patents
- Publication number
- CA1199730A
- Authority
- CA
- Canada
- Prior art keywords
- word
- pattern
- score
- frame
- register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired
Abstract
METHOD AND APPARATUS FOR
CONTINUOUS WORD STRING RECOGNITION
ABSTRACT OF THE DISCLOSURE
A speech recognition method and apparatus for recognizing word strings in a continuous audio signal are disclosed. The word strings are made up of a plurality of elements, and each element, generally a word, is represented by an element template defined by a plurality of target patterns. Each target pattern is represented by a plurality of statistics describing the expected behavior of a group of spectra selected from plural short-term spectra generated by processing of the incoming audio. The incoming audio spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns and are compared, using likelihood statistics, with the target patterns of the element templates. Each multi-frame pattern is forced to contribute to each of a plurality of pattern scores as represented by the element templates. A concatenation technique is employed, using dynamic programming techniques, to determine the correct identity of the word string.
Description
This application is a divisional of application serial number 412,943 filed October 5, 1982.
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition method and apparatus, and more particularly to a method of and apparatus for recognizing, in real time, word strings in a continuous audio signal. Various speech recognition systems have been proposed heretofore to recognize isolated utterances by comparing an unknown isolated audio signal, suitably processed, with one or more previously prepared representations of known keywords.
In this context, "keywords" is used to mean a connected group of phonemes and sounds and may be, for example, a portion of a syllable, a word, a phrase, etc. While many systems have met with limited success, one system, in particular, has been employed successfully, in commercial applications, to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Patent No. 4,038,503, granted July 26, 1977, assigned to the assignee of this application, and provides a successful method for recognizing one of a restricted vocabulary of keywords provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies upon the presumption that the interval, during which the unknown audio signal occurs, is well defined and contains a single keyword utterance.
In a continuous audio signal, such as continuous conversational speech, wherein the keyword boundaries are not a priori known or marked, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units, such as phonemes, syllables, words, sentences, etc., prior to initiation of a keyword recognition process.
These prior continuous speech systems, however, have achieved only a limited success, in part because a satisfactory segmenting process has not been found. Other substantial problems still exist: for example, only limited vocabularies can be consistently recognized with a low false alarm rate; the recognition accuracy is highly sensitive to the differences between voice characteristics of different talkers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs, for example, in audio signals transmitted over ordinary telephone communications apparatus.
The continuous speech recognition methods described in the applicant's U.S. Patent Nos. 4,227,176, which issued October 7, 1980; 4,241,329, which issued December 23, 1980; and 4,227,177, which issued October 7, 1980, respectively, describe commercially acceptable and effective procedures for successfully recognizing, in real time, keywords in continuous speech. The general methods described in these patents are presently in commercial use and have been proved, both experimentally and in practical field testing, to effectively provide a high reliability and low error rate in a speaker-independent environment. Nevertheless, even these systems, while at the forefront of present day technology and the concept upon which they were developed, have shortcomings in both the false-alarm rate and speaker-independent performance.
The continuous speech recognition methods described in the above-identified U.S. patents are directed primarily to recognizing or spotting one of a plurality of keywords in continuous speech. In other applications, a continuous word string can be recognized wherein the result of the recognition process is the identity of each of the individual word elements of the continuous word string. A continuous word string in this context is a plurality of recognizable elements which are bounded by silence. This is related, for example, to the commercial equipment noted above with respect to the isolated word application in which the boundaries are a priori known. Here, however, the boundaries, silence, are unknown and must be determined by the recognition system itself. In addition, the elements being examined are no longer keyword elements but a plurality of elements "strung" together to form the word string. Various methods and apparatus have been suggested in the art for recognizing continuous word strings. These apparatus and methods, however, have various shortcomings, again, for example, in false alarm rate, speaker independent performance, and real time operation.
Therefore, a principal object of the present invention is a speech recognition method and apparatus having improved effectiveness in recognizing continuous word strings in a continuous, unmarked audio signal.
Other objects of the invention are a method and apparatus which are relatively insensitive to phase and amplitude distortion of the unknown audio input signal data, which are relatively insensitive to variations in the articulation rate of the unknown audio input signals, which will respond equally well to different speakers and hence different voice characteristics, which are reliable and have an improved lower false-alarm rate, and which will operate in real time.
SUMMARY OF THE INVENTION
The invention relates to a method and apparatus for analyzing audio signals. Specifically, the audio signal is a speech signal and the method and apparatus recognize keywords in the speech. Each keyword is characterized by a keyword template having at least one target pattern. Each target pattern represents at least one short term power spectrum and each pattern further has associated with it at least one required dwell time position followed by at least one optional dwell time position. In general, each target pattern will have a plurality of required and optional dwell time positions.
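The template organization just described (an ordered sequence of target patterns, each with required dwell positions followed by optional ones) can be sketched as a simple data structure. This is an illustrative sketch only; the class names and fields are hypothetical, not the patent's implementation.

```python
from dataclasses import dataclass

@dataclass
class TargetPattern:
    stats: list            # statistics describing the expected short-term spectra (placeholder)
    required_dwells: int   # frame positions the pattern must occupy
    optional_dwells: int   # frame positions it may additionally occupy

@dataclass
class KeywordTemplate:
    word: str
    patterns: list         # ordered sequence of TargetPattern

# A toy two-pattern template: its minimum duration in frames is the sum of
# the required dwell counts; its maximum adds the optional ones.
template = KeywordTemplate(
    word="zero",
    patterns=[TargetPattern(stats=[], required_dwells=3, optional_dwells=2),
              TargetPattern(stats=[], required_dwells=2, optional_dwells=4)])
min_frames = sum(p.required_dwells for p in template.patterns)
max_frames = min_frames + sum(p.optional_dwells for p in template.patterns)
print(min_frames, max_frames)  # 5 11
```

The required/optional split is what lets a pattern stretch or compress in time while still bounding word duration.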
The recognition method features the steps of forming, at a repetitive frame time, a sequence of frame patterns from, and representing, the audio signal.
Numerical measures of the similarity of each frame pattern with each target pattern are then generated.
The method further features the steps of accumulating, for each of the target pattern required and optional dwell time positions, using the numerical measures, a numerical value representing the alignment of the just formed audio representing frame pattern with the respective target pattern dwell time positions, and generating a recognition decision based upon the numerical values when a predetermined event occurs in the audio signal. Preferably, the predetermined event is the recognition of "silence".
In another aspect, the accumulation step features the steps of (1) accumulating, for each target pattern second and later dwell position, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the present numerical measure associated with the target pattern; (2) accumulating, for each keyword first target pattern, first required dwell time position, the sum of the best accumulated score during the previous frame time which is associated with the end of a keyword, and the present numerical measure associated with the keyword first target pattern; and (3) accumulating, for each other target pattern first dwell position, the sum of the best ending accumulated value of the previous target pattern of the same keyword and the present numerical measure associated with the target pattern.
The method further features the steps of storing, in association with each frame time position, the identity and duration, in frame time positions, of the keyword having the best score and a valid ending at the frame time position, and storing, in association with each dwell time position accumulated score, a word duration count corresponding to the time position length of the keyword associated with the accumulated score at that dwell time position. Thereby, the generating step further includes the step of tracing back, through the stored keyword identity and duration information, for determining each keyword in a word string.
In yet another aspect, the method further features storing, in association with each accumulated score corresponding to the dwell time positions, a keyword duration count. The duration count corresponds to the number of numerical measures, that is, the time position count, which have been accumulated to form that position score for the present keyword pattern. In another aspect, the method further features directing the transfer of the accumulated scores in response to a syntax controlled element.
The apparatus of the invention can be accommodated in either hardware, software, or a mixture of the two. Hardware employed for implementing the method of the invention is described in further detail hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects, features, and advantages of the invention will appear from the following description of a preferred embodiment taken together with the drawings in which:
Figure 1 is a flow chart illustrating in general terms the sequence of operations performed in accordance with the practice of the present invention;
Figure 1A is an electrical block diagram of apparatus according to a preferred embodiment of the invention;
Figure 2 is a schematic block diagram of electronic apparatus for performing certain preprocessing operations in the overall process illustrated in Figure 1;
Figure 3 is a flow diagram of a digital computer program performing certain procedures in the process of Figure 1;
Figure 4 is a graphical representation of the pattern alignment process according to the invention;
Figure 5 is an electrical block diagram of a likelihood function processor according to a preferred embodiment of the invention;
Figure 6 is an electrical schematic block diagram of the subtract and absolute value circuit according to a preferred embodiment of the invention;
Figure 7 is an electrical circuit diagram of an overflow detection logic circuit according to a preferred embodiment of the invention;
Figure 8 is a truth table for the circuit diagram of Figure 7;
Figure 9 is a schematic flow representation of a syntax processor according to one particular embodiment of the processor of the invention; and
Figure 10 is an electrical block diagram showing a sequential decoding pattern alignment circuit configuration according to a preferred particular embodiment of the invention.
Corresponding reference characters indicate corresponding elements throughout the several views of the drawings.
DESCRIPTION OF A PREFERRED EMBODIMENT
In one of the particular preferred embodiments which is described herein, speech recognition is performed by an overall apparatus which involves both a specially constructed electronic system for effecting certain analog and digital processing of incoming audio data signals, generally speech, and a general purpose digital computer which is programmed in accordance with the present invention to effect certain other data reduction steps and numerical evaluations. The division of tasks between the hardware portion and the software portion of this system has been made so as to obtain an overall system which can accomplish speech recognition in real time at moderate cost. However, it should be understood that some of the tasks being performed in hardware in this particular system could well be performed in software, and that some of the tasks being performed by software programming in this example might also be performed by special purpose circuitry in a different embodiment of the invention. In this latter connection, where available, hardware and software implementations of the apparatus will be described.
One aspect of the present invention is the provision of apparatus which will recognize a word string in continuous speech signals even though those signals are distorted, for example, by a telephone line.
Thus, referring in particular to Figure 1, the voice input signal, indicated at 10, may be considered a voice signal produced by a carbon element telephone transmitter and receiver over a telephone line encompassing any arbitrary distance or number of switching interchanges. A typical application of the invention is therefore recognizing continuous word strings in audio data from an unknown source received over the telephone system. On the other hand, the input signal may also be any audio data signal, for example, a voice input signal taken from a radio telecommunications link, for example, from a commercial broadcast station, from a private dedicated communications link, or an operator standing near the equipment.
As will become apparent from the description, the present method and apparatus are concerned with the recognition of speech signals containing a sequence of sounds or phonemes, or other recognizable indicia. In the description herein, and in the claims, reference is made to either "a word," "an element," "a sequence of target patterns," "a template pattern," or "an element template," the five terms being considered as generic and equivalent. This is a convenient way of expressing a recognizable sequence of audio sounds, or representations thereof, which combine to constitute the word string which the method and apparatus can detect and recognize. The terms should be broadly and generically construed to encompass anything from a single phoneme, syllable, or sound, to a series of words (in the grammatical sense) as well as a single word.
An analog-to-digital (A/D) converter 13 receives the incoming analog audio signal data on line 10 and converts the signal amplitude of the incoming data to a digital form. The illustrated A/D converter is designed to convert the input signal data to a twelve-bit binary representation, the conversions occurring at the rate of 8,000 conversions per second. (In other embodiments, other sampling rates can be employed; for example, a 16 kHz rate can be used when a high quality signal is available.) The A/D converter 13 applies its output over lines 15 to an autocorrelator 17. The autocorrelator 17 processes the digital input signals to generate a short-term autocorrelation function one hundred times per second and applies its output, as indicated, over lines 19. Each autocorrelation function has thirty-two values or channels, each value being calculated to a 30 bit resolution. The autocorrelator is described in greater detail hereinafter with reference to Figure 2.
The autocorrelation functions over lines 19 are Fourier transformed by a Fourier transformation apparatus 21 to obtain corresponding short-term windowed power spectra over lines 23. The spectra are generated at the same repetition rate as the autocorrelation functions, that is, 100 per second, and each short-term power spectrum has thirty-one numerical terms having a resolution of 16 bits each. As will be understood, each of the thirty-one terms in the spectrum represents the signal power within a frequency band. The Fourier transformation apparatus also preferably includes a Hamming or similar window function to reduce spurious adjacent-band responses.
In the first illustrated embodiment, the Fourier transformation as well as subsequent processing steps are preferably performed under the control of a general purpose digital computer, appropriately programmed, utilizing a peripheral array processor for speeding the arithmetic operations required repetitively according to the present method. The particular computer employed is a model PDP-11* manufactured by the Digital Equipment Corporation of Maynard, Massachusetts.
The particular array processor employed is described in U.S. Patent 4,228,498, assigned to the assignee of this application. The programming described hereinafter with reference to Figure 3 is substantially predicated upon the capabilities and characteristics of these available digital processing units.
*Trade Mark

The short-term windowed power spectra are frequency-response equalized, as indicated at 25, equalization being performed as a function of the peak amplitudes occurring in each frequency band or channel as described in greater detail hereinafter. The frequency-response equalized spectra, over lines 26, are generated at the rate of one hundred per second and each spectrum has thirty-one numerical terms evaluated to 16 bit accuracy. To facilitate the final evaluation of the incoming audio data, the frequency-response equalized and windowed spectra over lines 26 are subjected to an amplitude transformation, as indicated at 35, which imposes a non-linear amplitude transformation on the incoming spectra. This transformation is described in greater detail hereinafter, but it may be noted at this point that it improves the accuracy with which the unknown incoming audio signal may be matched with target pattern templates in a reference vocabulary. In the illustrated embodiment, this transformation is performed on all of the frequency-response equalized and windowed spectra at a time prior to the comparison of the spectra with patterns representing the elements of the reference vocabulary.
The amplitude transformed and equalized short-term spectra over lines 38 are then compared against the element templates at 40 as described in detail below. The reference patterns, designated at 42, represent the elements of the reference vocabulary in a statistical fashion with which the transformed and equalized spectra can be compared. Each time "silence" is detected, a decision is made with regard to the identity of the just received word string. This is indicated at 44. Candidate words are thus selected according to the closeness of the comparison; and in the illustrated embodiment, the selection process is designed to minimize the likelihood of a missed keyword.
Referring to Figure 1A, a speech recognition system, according to the invention, employs a controller 45 which may be, for example, a general purpose digital computer such as a PDP-11 or a hardware controller specifically built for the apparatus. In the illustrated embodiment, the controller 45 receives preprocessed audio data from a preprocessor 46 which is described in greater detail in connection with Figure 2.
The preprocessor 46 receives audio input analog signals over a line 47 and provides processed data over interface lines 48 to the control processor.
Generally, the operational speed of the control processor, if a general purpose element, is not fast enough to process the incoming data in real time. As a result, various special purpose hardware can be advantageously employed to effectively increase the processing speed of element 45. In particular, a vector processing element 48a such as that described in U.S. Patent 4,228,498, assigned to the assignee of this invention, provides significantly increased array processing capability by using a pipeline effect. In addition, as described in more detail in connection with Figures 4, 5, and 6, a likelihood function processor 48b can be used in connection with the vector processor in order to still further increase the operating speed of the apparatus by tenfold.
While in the preferred embodiment of the invention control processor 45 is a digital computer, in another particular embodiment, described in connection with Figure 10, a significant portion of the processing capability is implemented externally of the control processor in a sequential decoding processor 49. The structure of this processor is described in greater detail in connection with Figure 10. Thus, the apparatus for implementing speech recognition illustrated herein has great flexibility both in its speed capabilities and in the ability to be implemented in both hardware, software, or an advantageous combination of hardware and software elements.
Preprocessor

In the apparatus illustrated in Figure 2, an autocorrelation function with its intrinsic averaging is performed digitally on the digital data stream generated by the analog-to-digital converter 13 operating on the incoming analog audio data over line 10, generally a voice signal. The converter 13 provides a digital input signal over lines 15. The digital processing functions, as well as the input analog-to-digital conversion, are timed under the control of a clock oscillator 51. The clock oscillator provides a basic timing signal of 256,000 pulses per second, and this signal is applied to a frequency divider 52 to obtain a second timing signal at 8,000 pulses per second. The slower timing signal controls the analog-to-digital converter 13 together with a latch register 53 which holds the twelve-bit results of the last conversion until the next conversion is completed.
The autocorrelation products are generated by a digital multiplier 56 which multiplies the number contained in register 53 by the output of a thirty-two word shift register 58. Shift register 58 is operated in a recirculating mode and is driven by the faster clock frequency, so that one complete circulation of the shift register data is accomplished for each analog-to-digital conversion. An input to shift register 58 is taken from register 53 once during each complete circulation cycle. One input to the digital multiplier 56 is taken directly from the latch register 53 while the other input to the multiplier is taken (with one exception described below) from the current output of the shift register through a multiplexer 59. The multiplications are performed at the higher clock frequency.
Thus, each value obtained from the A/D conversion is multiplied with each of the preceding 31 conversion values. As will be understood by those skilled in the art, the signals thereby generated are equivalent to multiplying the input signal by itself, delayed in time by thirty-two different time increments (one of which is the zero delay). To produce the zero delay correlation, that is, the power of the signal, multiplexer 59 causes the current value of the latch register 53 to be multiplied by itself at the time each new value is being introduced into the shift register. This timing function is indicated at 60.
As will also be understood by those skilled in the art, the products from a single conversion, together with its 31 predecessors, will not be fairly representative of the energy distribution or spectrum over a reasonable sampling interval. Accordingly, the apparatus of Figure 2 provides for averaging of these sets of products.
An accumulation process, which effects averaging, is provided by a thirty-two word shift register 63 which is interconnected with an adder 65 to form a set of thirty-two accumulators. Thus, each word can be recirculated after having been added to the corresponding increment from the digital multiplier. The circulation loop passes through a gate 67 which is controlled by a divide-by-N divider circuit 69 driven by the low frequency clock signal. The divider 69 divides the lower frequency clock by a factor which determines the number of instantaneous autocorrelation functions which are accumulated, and thus averaged, before the shift register 63 is read out.
In the illustrated example, eighty samples are accumulated before being read out. In other words, N for the divide-by-N divider circuit 69 is equal to eighty. After eighty conversion samples have thus been correlated and accumulated, the divider circuit 69 triggers a computer interrupt circuit 71 over a line 72.
At this time, the contents of the shift register 63 are successively read into the computer memory through a suitable interface circuitry 73, the thirty-two successive words in the register being presented in ordered sequence to the computer through the interface 73. As will be understood by those skilled in the art, this data transfer from a peripheral unit, the autocorrelator preprocessor, to the computer may be typically performed by a direct memory access procedure.
Predicated on an averaging of eighty samples, at an initial sampling rate of 8,000 samples per second, it will be seen that 100 averaged autocorrelation functions are provided to the computer every second.
While the shift register contents are being read out to the computer, the gate 67 is closed so that each of the words in the shift register is effectively reset to zero to permit the accumulation process to begin again.
Expressed in mathematical terms, the operation of the apparatus shown in Figure 2 can be described as follows. Assuming that the analog-to-digital converter generates the time series S(t), where t = 0, To, 2To, ..., and To is the sampling interval (1/8000 second in the illustrated embodiment), the illustrated digital correlation circuitry of Figure 2 may be considered, ignoring start-up ambiguities, to compute the autocorrelation function

    a(j, t) = sum (k = 0 to 79) S(t + kTo) S(t + (k - j)To)        (1)

where j = 0, 1, 2, ..., 31; and t = 80To, 160To, ..., 80nTo, ... . These autocorrelation functions correspond to the correlation output on lines 19 of Figure 1.
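The shift-register scheme of Equation 1 can be sketched directly in software: each new sample is multiplied by itself and by each of its 31 predecessors, and the 32 running products are accumulated over 80 samples before a frame is emitted. A direct, unoptimized sketch (frame/block sizes per the text; the function name is illustrative):

```python
def autocorrelate(samples, lags=32, block=80):
    """Return one `lags`-value autocorrelation function per `block` samples."""
    frames = []
    acc = [0] * lags
    for t, s in enumerate(samples):
        # multiply the current sample by itself (j = 0) and its predecessors
        for j in range(lags):
            if t - j >= 0:
                acc[j] += s * samples[t - j]
        # after `block` conversions, read out the accumulators and reset them
        if (t + 1) % block == 0:
            frames.append(acc)
            acc = [0] * lags
    return frames
```

Note that, as in the hardware, lagged products near a block boundary reach back into the previous block's samples, since the shift register is never cleared, only the accumulators are.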
Referring now to Figure 3, the digital correlator operates continuously to transmit to the computer a series of data blocks at the rate of one complete autocorrelation function every ten milliseconds. This is indicated at 77 (Fig. 3). Each block of data represents the autocorrelation function derived from a corresponding subinterval of time. As noted above, the illustrated autocorrelation functions are provided to the computer at the rate of one hundred 32-word functions per second. This analysis interval is referred to hereinafter as a "frame".
In the first illustrated embodiment, the processing of the autocorrelation function data is performed by an appropriately programmed, special purpose digital computer. The flow chart, which includes the function provided by the computer program, is given in Figure 3. Again, however, it should be pointed out that various of the steps could also be performed by hardware (as described below) rather than software, and that likewise certain of the functions performed by the apparatus of Figure 2 could additionally be performed in software by a corresponding revision of the flow chart of Figure 3.
Although the digital correlator of Figure 2 performs some time-averaging of the autocorrelation functions generated on an instantaneous basis, the average autocorrelation functions read out to the computer may still contain some anomalous discontinuities or unevenness which might interfere with the orderly processing and evaluation of the samples.
Accordingly, each block of data, that is, each autocorrelation function a(j,t), is first smoothed with respect to time. This is indicated in the flow chart of Figure 3 at 78. The preferred smoothing process is one in which the smoothed autocorrelation output as(j,t) is given by

    as(j, t) = C0 a(j, t) + C1 a(j, t-T) + C2 a(j, t-2T)        (2)

where a(j,t) is the unsmoothed input autocorrelation defined in Equation 1, as(j,t) is the smoothed autocorrelation output, j denotes the delay time, t denotes real time, and T denotes the time interval between consecutively generated autocorrelation functions (frames), equal to .01 second in the preferred embodiment. The weighting functions C0, C1, C2 are preferably chosen to be 1/4, 1/2, 1/4 in the illustrated embodiment, although other values could be chosen. For example, a smoothing function approximating a Gaussian impulse response with a frequency cutoff of, say, 20 Hertz could have been implemented in the computer software. However, experiments indicate that the illustrated, easier to implement, smoothing function of Equation 2 provides satisfactory results. As indicated, the smoothing function is applied separately for each value j of delay.
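The Equation 2 smoothing can be sketched as a three-tap filter applied independently at each delay j, with the stated weights 1/4, 1/2, 1/4. The handling of the first two frames is an assumption here, since the patent ignores start-up ambiguities:

```python
def smooth(frames, c=(0.25, 0.5, 0.25)):
    """frames: list of autocorrelation functions (lists), one per 10 ms frame.
    Applies as(j,t) = C0*a(j,t) + C1*a(j,t-T) + C2*a(j,t-2T) at each lag j."""
    out = []
    for t, frame in enumerate(frames):
        prev1 = frames[t - 1] if t >= 1 else frame   # start-up assumption
        prev2 = frames[t - 2] if t >= 2 else prev1
        out.append([c[0] * x + c[1] * p1 + c[2] * p2
                    for x, p1, p2 in zip(frame, prev1, prev2)])
    return out
```

With the 100 Hz frame rate, these weights give a low-pass characteristic comparable to the 20 Hz Gaussian alternative the text mentions, at far lower cost.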
It will become clear that subsequent analysis involves various operations on the short-term Fourier power spectrum of the speech signal, and for reasons of hardware simplicity and processing speed, the transformation of the autocorrelation function to the frequency domain is carried out in eight-bit arithmetic in the illustrated embodiment. At the high end of the band pass, near three kilohertz, the spectral power density decreases to a level at which resolution is inadequate in eight-bit quantities. Therefore, the frequency response of the system is tilted at a rising rate of 6 db per octave. This is indicated at 79. This high frequency emphasis is accomplished by taking the second derivative of the autocorrelation function with respect to its argument, i.e., the time delay or lag.
The derivative operation is

    b(j, t) = -a(j+1, t) + 2a(j, t) - a(j-1, t)        (3)

To evaluate the derivative for j = 0, it is assumed that the autocorrelation function is symmetrical about 0, so that a(-j,t) = a(+j,t). Also, there is no data for a(32) so the derivative at j = 31 is taken to be the same as the derivative when j = 30.
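The second-difference emphasis of Equation 3, including both boundary conventions (symmetry about lag 0 for j = 0, and reuse of the j = 30 derivative at j = 31), can be sketched as follows; a short list stands in for the 32-point function:

```python
def emphasize(a):
    """Second-difference high-frequency emphasis of one autocorrelation function."""
    n = len(a)
    b = []
    for j in range(n):
        if j == n - 1:
            b.append(b[-1])                      # no a(32): b(31) = b(30)
        else:
            left = a[j - 1] if j > 0 else a[1]   # a(-1) = a(+1) by symmetry
            b.append(-a[j + 1] + 2 * a[j] - left)
    return b
```

The sign convention matters: the negated second difference boosts rapidly varying (high-frequency) components, giving the roughly 6 db per octave tilt the text describes.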
As indicated in the flow chart of Fig~ 3, the next step in the analysis procedure, after high frequency emphasis, is to estimate the signal power in the current frame interval by finding the peak absolute value of the autocorrelation. The power estimate, P(t), is P(t) = max ¦ b(i,t)¦ (4) In order to prepare the autocorrelation for the eightbit spectrum analysis, the smoothed autocorrelation function is block normalized with respect to P(t) (at 80~ and the most significant eight bits of each normalized value are input to the spectrum analysis hardware. The normalized (and smoothed) autocorrelation function is, therefore:
c(j,t) = 127 b(j,t)/P(t)    (5)

As indicated at 81, a cosine Fourier transform is then applied to each time smoothed, frequency emphasized, normalized autocorrelation function, c(j,t), to generate a 31 point power spectrum. The matrix of cosine values is given by:
S(i,j) = 126 g(i) cos(2π i f(j)/8000),  j = 0, 1, 2, ..., 31    (6)
where S(i,j) is the spectral energy in a band centered at f(j) Hz, at time t; g(i) = 1/2 (1 + cos 2πi/63) is the (Hamming) window function envelope to reduce side lobes;
and f(j) = 30 + 1000 (0.0552j + 0.438)^(1/0.63) Hz;  j = 0, 1, 2, ..., 31    (7)

which are the analysis frequencies equally spaced on the so-called "mel" curve of subjective musical pitch. As will be understood, this corresponds to a subjective pitch (mel scale) frequency-axis spacing for frequencies in the bandwidth of a typical communication channel of about 300-3500 Hertz.
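The mel-curve band centers of Equation 7 can be computed directly; the exponent 1/0.63 is a reconstruction from the related patents in this family and should be treated as an assumption. The sketch below checks that the 32 bands span roughly the stated 300-3500 Hz communication-channel bandwidth.

```python
def analysis_frequency(j):
    """Center frequency f(j) of band j on the 'mel' curve (Eq. 7),
    assumed form: f(j) = 30 + 1000*(0.0552*j + 0.438)**(1/0.63) Hz.
    """
    return 30 + 1000 * (0.0552 * j + 0.438) ** (1 / 0.63)
```

Evaluating j = 0 gives roughly 300 Hz and j = 31 roughly 3400 Hz, monotonically increasing in between.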
Since the spectrum analysis requires summation over lags from -31 to +31, by making the assumption that the autocorrelation is symmetric about zero, only the positive values of j are required. However, to avoid counting the lag zero term twice, the cosine matrix is adjusted so that

S(0,j) = 126/2 = 63, for all j    (8)

Thus the computed power spectrum is given by

S'(j,t) = | Σ (i = 0 to 31) c(i,t) S(i,j) |,  j = 0, 1, ..., 31    (9)

where the jth result corresponds to the frequency f(j).
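A direct software sketch of the cosine transform of Equations 6-9 follows (the hardware version described next is much faster; this is only to make the arithmetic concrete, and the window and frequency formulas repeat the reconstructions given above):

```python
import math

def power_spectrum(c):
    """Cosine transform of the normalized autocorrelation c(0..31):
    S(i,j) = 126*g(i)*cos(2*pi*f(j)*i/8000) with the i = 0 row halved
    to 63 (Eq. 8), then S'(j) = |sum_i c(i)*S(i,j)| (Eq. 9).
    """
    def g(i):  # raised-cosine window envelope (Eq. 6)
        return 0.5 * (1 + math.cos(2 * math.pi * i / 63))

    def f(j):  # mel-spaced analysis frequency (Eq. 7, assumed exponent)
        return 30 + 1000 * (0.0552 * j + 0.438) ** (1 / 0.63)

    out = []
    for j in range(32):
        s = 63 * c[0]  # lag-zero term counted once: S(0,j) = 63
        for i in range(1, 32):
            s += c[i] * 126 * g(i) * math.cos(2 * math.pi * f(j) * i / 8000)
        out.append(abs(s))
    return out
```

A unit impulse at lag zero (a flat spectrum) produces the constant value 63 in every band, which verifies the half-weighting of Equation 8.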
As will also be understood, each point or value within each spectrum represents a corresponding band of frequencies. While this Fourier transform can be performed completely within the conventional computer hardware, the process may be speeded considerably if an external hardware multiplier or Fast Fourier Transform (FFT) peripheral device is utilized. The construction and operation of such modules are well known in the art, however, and are not described in detail herein.
Advantageously built into the hardware Fast Fourier Transform peripheral device is the frequency smoothing function wherein each of the spectra are smoothed in frequency according to the preferred (Hamming) window weighting function g(i) defined above. This is indicated at 83 of the block 85 which corresponds to the hardware Fourier transform implementation.
If the background noise is significant, an estimate of the power spectrum of the background noise should be subtracted from S'(j,t) at this stage. The frame or frames selected to represent the noise should not contain any speech signals. The optimum rule for selecting noise frame intervals will vary with the application. If the talker is engaged in two-way communication, for example, with a machine controlled by the speech recognition apparatus, it is convenient to choose a frame arbitrarily in the interval immediately after the machine has finished speaking by its voice response unit. In less constrained situations, the noise frame may be found by choosing a frame of minimum amplitude during the past one or two seconds of audio input.
As successive smoothed power spectra are received from the Fast Fourier Transform peripheral 85, a communications channel equalization is obtained by determining a (generally different) peak power spectrum envelope for the spectra from peripheral 85, and modifying the output of the Fast Fourier Transform apparatus accordingly, as described below. Each newly generated peak amplitude spectrum p(j,t), corresponding to and updated by an incoming windowed power spectrum S'(j,t), where j is indexed over the plural frequency bands of the spectrum, is the result of a fast attack, slow decay, peak detecting function for each of the spectrum channels or bands. The windowed power spectra are normalized with respect to the respective terms of the corresponding peak amplitude spectrum. This is indicated at 87.
According to the illustrated embodiment, the values of the "old" peak amplitude spectrum p(j,t-T), determined prior to receiving a new windowed spectrum, are compared on a frequency band by frequency band basis with the new incoming spectrum S'(j,t). The new peak spectrum p(j,t) is then generated according to the following rules. The power amplitude in each band of the "old" peak amplitude spectrum is multiplied by a fixed fraction, for example, 1023/1024, in the illustrated example. This corresponds to the slow decay portion of the peak detecting function. If the power amplitude in a frequency band j of the incoming spectrum S'(j,t) is greater than the power amplitude in the corresponding frequency band of the decayed peak amplitude spectrum, then the decayed peak amplitude spectrum value for that frequency band is replaced by the spectrum value of the corresponding band of the incoming windowed spectrum.
This corresponds to the fast attack portion of the peak detecting function. Mathematically, the peak detecting function can be expressed as

p(j,t) = max { p(j,t-T)(1 - E); P(t); S'(j,t) },  j = 0, 1, ..., 31    (10)

where j is indexed over each of the frequency bands, p(j,t) is the resulting peak spectrum, p(j,t-T) is the "old" or previous peak spectrum, S'(j,t) is the new incoming, partially processed, power spectrum, P(t) is the power estimate at time t, and E is the decay parameter.
According to equation 10, the peak spectrum normally decays, absent a higher value spectrum input, by a factor of 1 - E. Typically E equals 1/1024. It may however be undesirable to permit decay of the peak spectrum during intervals of silence, particularly if no rapid change in the communication channel or voice characteristics is expected. To define the silence frame, the same method employed to choose background noise frames can be employed. The amplitudes (square root of P(t)) of the past 128 frames are inspected, and the minimum value found. If the amplitude of the current frame is less than four times this minimum, the current frame is determined to be silence and the value "zero" is substituted for the value 1/1024, or E.
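The fast-attack, slow-decay peak tracker, with the silence rule just described, can be sketched per band as follows. For clarity this sketch omits the P(t) floor term of Equation 10 and keeps only the decay and attack terms; the function name and the `is_silence` flag are illustrative.

```python
def update_peak(peak_prev, spectrum, is_silence=False, E=1/1024):
    """One update of the per-band peak envelope (simplified Eq. 10):
    each band of the old peak decays by (1 - E), then is replaced by the
    incoming band value if that value is larger (fast attack).  During a
    silence frame E is forced to zero so the channel estimate holds.
    """
    decay = 0.0 if is_silence else E
    return [max(p * (1 - decay), s) for p, s in zip(peak_prev, spectrum)]
```

With E = 1/1024 a peak of 1024 decays by exactly one count per frame, while a louder incoming band captures the envelope immediately.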
After the peak spectrum is generated, the resulting peak amplitude spectrum p(j,t) is frequency smoothed at 89 by averaging each frequency band peak value with peak values corresponding to adjacent frequencies of the newly generated peak spectra, the width of the overall band of frequencies contributing to the average value being approximately equal to the typical frequency separation between formant frequencies. As will be understood by those skilled in the speech recognition art, this separation is on the order of about 1000 Hz. By averaging in this particular way, the useful information in the spectra, that is, the local variations revealing formant resonances, is retained whereas overall or gross emphasis in the frequency spectrum is suppressed. According to the preferred embodiment the peak spectrum is smoothed with respect to frequency by a moving average function covering seven adjacent frequency bands. The averaging function is:
e(j,t) = h(j) Σ (k = j-3 to j+3) p(k,t)    (11)

At the ends of the passband, p(k,t) is taken to be 0, for k less than 0 and k greater than 31. The normalizing envelope h(j) takes into account the number of valid data elements actually summed: thus, h(0) = 7/4, h(1) = 7/5, h(2) = 7/6, h(3) = 1, ..., h(28) = 1, h(29) = 7/6, h(30) = 7/5, and h(31) = 7/4. The resulting smoothed peak amplitude spectrum e(j,t) is then employed to normalize and frequency equalize the just received power spectrum, S'(j,t), by dividing the amplitude value of each frequency band of the incoming smoothed spectrum S'(j,t) by the corresponding frequency band value in the smoothed peak spectrum e(j,t). Mathematically, this corresponds to

sn(j,t) = (S'(j,t) / e(j,t)) 32767    (12)
where sn(j,t) is the peak-normalized, smoothed power spectrum and j is indexed over each of the frequency bands. This step is indicated at 91. There results a sequence of frequency equalized and normalized short-term power spectra which emphasizes changes in the frequency content of the incoming audio signals while suppressing any generalized long-term frequency emphasis or distortion. This method of frequency compensation has been found to be highly advantageous in the recognition of speech signals transmitted over frequency distorting communication links such as telephone lines, in comparison to the more usual systems of frequency compensation in which the basis for compensation is the average power level, either in the whole signal or in each respective frequency band.
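The seven-band smoothing of Equation 11 and the normalization of Equation 12 can be combined in a short sketch; note that the envelope h(j) = 7/(number of valid bands) reproduces exactly the values h(0) = 7/4, h(1) = 7/5, h(2) = 7/6, h(3..28) = 1 listed above. Python and the function name are illustrative.

```python
def equalize(spectrum, peak):
    """Frequency-smooth the peak spectrum over 7 adjacent bands (Eq. 11),
    then normalize the incoming spectrum band by band (Eq. 12)."""
    n = len(peak)  # 32 bands
    sn = []
    for j in range(n):
        ks = [k for k in range(j - 3, j + 4) if 0 <= k < n]  # clip at edges
        h = 7 / len(ks)                        # normalizing envelope h(j)
        e = h * sum(peak[k] for k in ks)       # smoothed peak e(j,t)
        sn.append(spectrum[j] / e * 32767)     # peak-normalized sn(j,t)
    return sn
```

For a flat peak envelope every e(j,t) is the same, so a spectrum equal to that envelope maps to the full-scale constant 32767 in every band.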
It is useful to point out that, while successive spectra have been variously processed and equalized, the data representing the incoming audio signals still comprises spectra occurring at a rate of one hundred per second.
The normalized and frequency equalized spectra, indicated at 91, are subjected to an amplitude transformation, indicated at 93, which effects a non-linear scaling of the spectrum amplitude values.
Designating the individual equalized and normalized spectra as sn(j,t) (from Equation 12) where j indexes the different frequency bands of the spectrum and t denotes real time, the non-linear scaled spectrum x(j,t) is defined by the linear fraction function

x(j,t) = 128 (sn(j,t) - A) / (sn(j,t) + A),  j = 0, 1, ..., 30    (13)

where A is the average value of the spectrum sn(j,t) over j = 0 to 31, and is defined as follows:

A = (1/32) Σ (j = 0 to 31) sn(j,t)    (14)

where j indexes over the frequency bands of the power spectrum.
The thirty-first term of the spectrum is replaced by the logarithm of A so that

x(31,t) = 16 log2 A    (15)

This scaling function (Eq. 13) produces a soft threshold and gradual saturation effect for spectral intensities which deviate greatly from the short-term average A. Mathematically, for intensities near the average, the function is approximately linear; for intensities further from the average, it is approximately logarithmic; and at the extreme values of intensity, it is substantially constant. On a logarithmic scale, the function x(j,t) is symmetric about zero and exhibits threshold and saturation behavior that is suggestive of an auditory nerve firing-rate function. In practice, the overall recognition system performs significantly better with this particular non-linear scaling function than it does with either a linear or a logarithmic scaling of the spectrum amplitudes.
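Equations 13-15 can be sketched together; a band exactly at the average A maps to zero, and the thirty-second word carries the log-energy term. Python and the function name are illustrative.

```python
import math

def amplitude_transform(sn):
    """Non-linear scaling of Eqs. 13-15:
    x(j) = 128*(sn(j) - A)/(sn(j) + A) for j = 0..30, with A the mean of
    all 32 bands (Eq. 14), and x(31) = 16*log2(A) (Eq. 15).  Near A the
    fraction is roughly linear; far from A it saturates toward +/-128.
    """
    A = sum(sn) / len(sn)                            # Eq. 14
    x = [128 * (s - A) / (s + A) for s in sn[:31]]   # Eq. 13
    x.append(16 * math.log2(A))                      # Eq. 15
    return x
```

A perfectly flat spectrum transforms to zeros in bands 0-30 with only the average energy surviving in the last word, which is the intended suppression of gross spectral emphasis.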
There is thus generated a sequence of amplitude transformed, frequency-response equalized, normalized, short-term power spectra x(j,t) where t equals .01, .02, .03, .04, ..., seconds and j = 0, ..., 30 (corresponding to the frequency bands of the generated power spectra). Thirty-two words are provided for each spectrum; and the value of A (Equation 15), the average value of the spectrum values, is stored as the thirty-second word. The amplitude transformed, short-term power spectra, hereinafter referred to as "frames", are stored, as indicated at 95, in a first-in, first-out circulating memory having storage capacity, in the illustrated embodiment, for 256 thirty-two-word spectra. There is thus made available for analysis, in the illustrated embodiment, 2.56 seconds of the audio input signal. This storage capacity provides the recognition system with the flexibility, if required, to select spectra at different real times, for analysis and evaluation, and thus with the ability to go forward and backward in time as the analysis requires.
Thus, the frames for the last 2.56 seconds are stored in the circulating memory and are available as needed. In operation, in the illustrated embodiment, each frame is stored for 2.56 seconds. Thus, a frame which enters the circulating memory at time t1 is lost or shifted from the memory 2.56 seconds later as a new frame, corresponding to a time t1 + 2.56, is stored.
The frames passing through the circulating memory are compared, preferably in real time, against a known vocabulary of words to determine and identify the input data in word groups called a word string. Each vocabulary word is represented by a template pattern statistically representing a plurality of processed power spectra formed into plural non-overlapping
multi-frame (preferably three frames) design set patterns. These patterns are preferably selected to best represent significant acoustical events of the vocabulary words and are stored at 94.
The spectra forming the design set patterns are generated for the words spoken in various contexts using the same system described hereinabove for processing the continuous unknown speech input on line 10 as shown in Figure 3.
Thus, each vocabulary word has associated with it a generally plural sequence of design set patterns, P(i)1, P(i)2, ..., which represent, in a domain of short-term power spectra, one designation of that ith keyword. The collection of design set patterns for each keyword forms the statistical basis from which the target patterns are generated.
In the illustrated embodiment of the invention, the design set patterns P(i)j can each be considered a 96 element array comprising three selected frames arranged in a series sequence. The frames forming the pattern should preferably be spaced at least 30 milliseconds apart to avoid spurious correlation due to time domain smoothing. In other embodiments of the invention, other sampling strategies can be implemented for choosing the frames; however, the preferred strategy is to select frames spaced by a constant time duration, preferably 30 milliseconds, and to space the non-overlapping design set patterns throughout the time interval defining the keyword. Thus, a first design set pattern P1 corresponds to a portion of a keyword near its beginning; a second pattern P2 corresponds to a
portion later in time, etc., and the patterns P1, P2, ... form the statistical basis for the series or sequence of target patterns, the word template, against which the incoming audio data will be matched. The target patterns t1, t2, ..., each comprise the statistical data, generated from the corresponding P(i)j by assuming the P(i)j are comprised of independent Gaussian variables, which enable a likelihood statistic to be generated between incoming frames, defined below, and the target patterns. Thus, the target patterns consist of an array wherein the entries comprise the mean, standard deviation and area normalization factor for a corresponding collection of design set pattern array entries. A more refined likelihood statistic is described below.
It will be obvious to those skilled in the art that substantially all words will have more than one contextual and/or regional pronunciation and hence more than one "spelling" of design set patterns. Thus, a vocabulary word having the patterned spelling P1, P2, ... referred to above, can in actuality be generally expressed as P(i)1, P(i)2, ..., i = 1, 2, ..., M, where each of the P(i)j are possible alternative descriptions of the jth class of design set patterns, there being a total of M different spellings for the word.
The target patterns t1, t2, ..., ti, ..., in the most general sense, therefore, each represent plural alternative statistical spellings for the ith group or class of design set patterns. In the illustrated embodiment described herein, the term "target pattern" is thus used in the most general sense and each target pattern may therefore have more than one permissible alternative "statistical spelling."
Preprocessing of the incoming unknown audio signals and the audio forming the reference patterns is now complete.
Processing the Stored Spectra

A more in-depth study of the keyword recognition method of concatenating phonetic patterns into detected words, described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, has shown that it is a special case of a more general and possibly superior recognition method. Referring to Figure 4, the word recognition search can be represented as the problem of finding an appropriate path through an abstract state space. In the figure, each circle represents a possible state, also designated a dwell time position or register, through which the decision making process can pass. The space between dashed vertical lines 120, 122 represents each of the hypothetical states in which the decision making process can pass in determining whether a pattern matches or does not match a current phoneme. This space is divided into a required dwell time portion 124 and an optional dwell time portion 126. The required dwell time portion is the minimum duration of the particular "current"
phoneme or pattern. The optional dwell time portion represents the additional maximum duration of a pattern.
Each of the circles within the optional or required dwell time portions represents one frame time of the continuum of formed frames and corresponds to the 0.01 second intervals from frame to frame. Thus, each circle identifies a hypothesized current phonetic position in a word spelling and, together with the number of (.01 second) frames hypothesized to have elapsed since the current phoneme began, corresponding to the number of
earlier "circles" or positions in that phoneme or target pattern, represents the present duration of the pattern.
After a pattern (phoneme) has begun and the minimum dwell time interval has elapsed, there are several possible paths of advancing to the first node or position (circle) 128 of the next target pattern (phoneme). This depends upon when the decision to move to the next pattern (phoneme) of the spelling is made.
These decision possibilities are represented in the figure by the several arrows leading to circle 128. A
transition to the next pattern (phoneme), the beginning of which is represented by circle 128, might be made at any node or position during the optional dwell time of the current pattern (phoneme) or at the last node of the required dwell time interval.
The key word recognition method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, makes the transition at the first such node for which the likelihood score relative to the next pattern (phoneme) is better than the likelihood score relative to the current pattern (phoneme); that is, a frame matches the next phoneme or pattern better than the present phoneme or pattern. The total word score, however, is the average pattern (phoneme) score per frame (i.e., per node included in the path). This same "total score" definition applied to a word score up to the current node can be used to decide when to make the transition;
that is, whether to make the transition to the next pattern at a first opportunity, corresponding for example to a transition indicating line 130, or at a later time, corresponding to, for example, a transition indicating line 132. Optimally, one chooses that path into the next pattern (phoneme) for which the average
score per node is best. Since the standard keyword method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, does not examine any of the potential paths after it has made the decision to move to the next pattern (phoneme), it may make a sub-optimal decision as measured by average score per node.
Accordingly, the present invention employs an average score per node strategy for word string recognition. The problem arises, when used in connection with word string recognition as described in detail hereinafter, that one must either normalize all partial word scores by the number of nodes included, which is computationally inefficient, or else one must bias the accumulation so that an explicit normalization is not necessary. A natural bias to use in the closed vocabulary task is the unnormalized score for the best word ending at the present analysis time; then the accumulated scores at all nodes will always be the sum of the same number of elementary pattern scores. Furthermore, the score is transformed by this bias into the score of the best string of words ending at the current analysis node.
The average score per node decision strategy is efficiently implemented in the Vector Processor described in U.S. Patent 4,228,498, by a dynamic programming technique. When programmed in this manner the processing speed is somewhat faster than for the standard key word recognition method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, even though more hypothesis tests are required.
Generally speaking, to recognize strings of words, the program remembers the name of the best hypothesized vocabulary word ending at each analysis node. It also remembers the node (time) at which this best word began. The best string of words is then found by tracing back from the end of the utterance, noting the stored word name and finding the next previous word at the indicated beginning time of the current word.
By including silence as a vocabulary word, it becomes unnecessary to specify how many words are contained in the string of words. The operation of tracing back to find the string is executed whenever the silence word has the best word score, and the operation terminates at the next previously detected silence.
Thus a string is found every time the talker pauses for breath.
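The traceback just described can be sketched as follows, assuming the recognizer has stored, for each analysis node, the name of the best word ending there and the node at which that word began; the data structures and names are illustrative only.

```python
def trace_string(best_word, word_start, end_t):
    """Recover the best word string by tracing back from the utterance end.

    best_word[t]  -- name of the best vocabulary word ending at node t
    word_start[t] -- node (time) at which that best word began
    Tracing stops when a word's beginning reaches node 0 (which could
    equally be the previously detected silence).
    """
    string, t = [], end_t
    while t > 0:
        string.append(best_word[t])  # note the stored word name...
        t = word_start[t]            # ...and jump to its beginning time
    return list(reversed(string))    # words in spoken order
```

Each backward hop is one word, so the cost of recovering a string is linear in its length, independent of the vocabulary size.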
The word string recognition method described herein is one level of abstraction higher than the detection of individual key words. Since the word string scoring forces all speech throughout the utterance to be included in some word of the string, it has an advantage over the simpler word spotting approach, which frequently detects false short words within longer words.
Advantageously, no timing patterns are necessary for the word string, since the word concatenator outputs a word beginning time for each word ending hypothesis. The simplest string concatenator assumes that these word beginning times are correct. On detecting silence, it assumes that the string of words has just ended, and that the beginning of the last word is the end of the previous word (which may be silence). It is then a simple matter to trace backward through the string, choosing the word with the best ending score at each word boundary. Since there is usually a context-dependent transition between each pair of words in the string, it may be preferable to permit the apparatus to search the neighborhood of each word beginning for the best ending of the previous word.
The method and apparatus, including hardware and software embodiments, are now described in greater detail. Referring to Fig. 3, the stored spectra, or frames, at 95, representing the incoming continuous audio data, are compared with the stored template of target patterns indicated at 96, representing keywords of the vocabulary, according to the following method.
For each 10 millisecond frame, a pattern for comparison with the stored reference patterns is formed at 97 by adjoining the current spectrum vector s(j,t), the spectrum s(j,t-.03) from three frames ago, and the spectrum s(j,t-.06) from six frames ago, to form a 96 element pattern:

         ( s(j,t-.06),     j = 0, ..., 31
x(j,t) = ( s(j-32,t-.03),  j = 32, ..., 63
         ( s(j-64,t),      j = 64, ..., 95

As noted above, the stored reference patterns consist of the mean values, standard deviations, and area normalizing terms of previously collected 96
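The 96-element pattern formation above amounts to concatenating three 32-band frames, oldest first; a minimal sketch (illustrative names, frames indexed by frame number rather than by seconds):

```python
def form_pattern(frames, t):
    """Adjoin the frame at time index t with those from 3 and 6 frames
    earlier to form the 96-element comparison pattern x(.,t):
    bands 0..31 come from t-6, 32..63 from t-3, 64..95 from t.
    """
    return frames[t - 6][:32] + frames[t - 3][:32] + frames[t][:32]
```

Because successive patterns overlap in time, every incoming frame contributes to several consecutive 96-element patterns, as the recognition method requires.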
element patterns belonging to the various speech pattern classes to be recognized. The comparison is accomplished by a probability model of the values x(j,t) to be expected if the input speech belongs to a particular class.
While a Gaussian distribution can be used for the probability model (see e.g. U.S. Patents 4,241,329; 4,227,176; and 4,227,177, referred to above), the Laplace distribution

p(x) = (1/(√2 s')) exp(-√2 |x - m| / s')

(where m is the statistical mean and s' the standard deviation of the variable x) requires less computation and has been found to perform nearly as well as the Gaussian distribution in, for example, the talker independent, isolated word recognition method described in U.S. Patent 4,038,503. The degree of similarity L(x|k) between an unknown input pattern x and the kth stored reference pattern is proportional to the logarithm of the probability and is estimated at 100 by the following formula:
L(x|k) = Σ (i = 1 to 96) |x_i - u_ik| / s'_ik + A_k    (17)

where A_k = Σ (i = 1 to 96) ln s'_ik

In order to combine the likelihood scores L of a sequence of patterns to form the likelihood score of a spoken word or phrase, the score L(x|k) for each frame is adjusted by subtracting the best (smallest) score of all the reference patterns for that frame, as follows:

L'(x|k) = L(x|k) - min over i of L(x|i)    (18)

Thus the best-fitting pattern on each frame will have a score of zero. The adjusted scores for a hypothesized sequence of reference patterns can be accumulated from frame to frame to obtain a sequence score related directly to the probability that a decision in favor of the indicated sequence would be the correct decision.
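A direct, unoptimized sketch of Equations 17 and 18 follows (the dedicated hardware module described below performs the same arithmetic far faster; function names are illustrative, and the constant √2 factor of the density is dropped since it is common to every class):

```python
import math

def laplace_score(x, mean, sdev):
    """Per-frame Laplace negative-log-likelihood distance (Eq. 17):
    L(x|k) = sum_i |x_i - u_ik| / s'_ik + A_k,  A_k = sum_i ln s'_ik.
    Smaller values indicate a better match.
    """
    return (sum(abs(xi - m) / s for xi, m, s in zip(x, mean, sdev))
            + sum(math.log(s) for s in sdev))

def adjusted_scores(scores):
    """Eq. 18: subtract the best (smallest) score of the frame so the
    best-fitting reference pattern scores exactly zero."""
    best = min(scores)
    return [s - best for s in scores]
```

Accumulating the adjusted scores over frames then yields a string score that is already relative to the best competing pattern at every step.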
Comparison of unknown input spectrum patterns against stored known patterns is accomplished by computing the function

q = Σ (i = 1 to 96) s_ik |x_i - u_ik| + C_k    (19)

(where s_ik equals 1/s'_ik) for the kth reference pattern.
In a normal software implemented computation, the following instructions would be executed to compute the algebraic function s|x-u| (of Equation 19):

1. compute x-u
2. test the sign of x-u
3. if x-u is negative, negate to form the absolute value
4. multiply by s
5. add the result into an accumulator

In a typical speech recognition system having a 20-word vocabulary, there would be about 222 different reference patterns. The number of steps required to evaluate them is then 5x96x222 = 106560 steps, not including overhead operations, and this must be done in less than 10 milliseconds in order to keep up with the real time spectrum frame rate. The processor must therefore be capable of executing nearly 11 million instructions per second just to evaluate the likelihood functions. In view of the necessary speed, a special purpose likelihood function hardware module 200 (Fig. 4), which is compatible with a system Vector Processor as disclosed in U.S. Patent 4,228,498, is employed.
In this special purpose hardware, the five steps listed above are performed simultaneously with two sets of the arguments s, x, u, so that in effect ten instructions are performed in the time it normally takes to execute one instruction. Since the basic Vector Processor operates at a rate of 8 million instructions per second, the effective computation rate for the likelihood function becomes about 80 million instructions per second with the special purpose hardware module 200 being employed.
Hardware module 200, referring to Fig. 5, employs a combination of hardware pipelining and parallel processing to provide the simultaneous execution of the ten steps. Two identical sections 202, 204 each perform five arithmetic steps upon the independent input data arguments and the two results are combined by an adder 206 connected to their outputs.
The accumulation of the summations from adder 206 forms the summation from 1 to 96 of Equation 19 and is handled by the arithmetic unit of the standard Vector Processor described in U.S. Patent 4,228,498.
In operation, pipelining registers hold the intermediate data at the following stages of the processing:
1. input arguments (clocked registers 208, 210, 212, 214, 216, 218)
1. This application is a divisional of application serial number 412,943 filed October 5, 1982.
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition method and apparatus, and more particularly to a method of and apparatus for recognizing, in real time, word strings in a continuous audio signal. Various speech recognition systems have been proposed heretofore to recognize isolated utterances by comparing an unknown isolated audio signal, suitably processed, with one or more previously prepared representations of known keywords.
In this context, "keywords" is used to mean a connected group of phonemes and sounds and may be, for example, a portion of a syllable, a word, a phrase, etc. While many systems have met with limited success, one system, in particular, has been employed successfully, in commercial applications, to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Patent No. 4,038,503, granted July 26, 1977, assigned to the assignee of this application, and provides a successful method for recognizing one of a restricted vocabulary of keywords provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies upon the presumption that the interval, during which the unknown audio signal occurs, is well defined and contains a single keyword utterance.
In a continuous audio signal, such as continuous conversational speech, wherein the keyword boundaries are not a priori known or marked, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units, such as phonemes, syllables, words, sentences, etc., prior to initiation of a keyword recognition process.
These prior continuous speech systems, however, have achieved only a limited success in part because a satisfactory segmenting process has not been found. Other substantial problems still exist: for example, only limited vocabularies can be consistently recognized with a low false alarm rate; the recognition accuracy is highly sensitive to the differences between voice characteristics of different talkers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs, for example, in audio signals transmitted over ordinary telephone communications apparatus.
The continuous speech recognition methods described in the applicant's U.S. Patent Nos. 4,227,176 which issued October 7, 1980; 4,241,329 which issued December 23, 1980; and 4,227,177 which issued October 7, 1980, respectively, describe commercially acceptable and effective procedures for successfully recognizing, in real time, keywords in continuous speech. The general methods described in these patents are presently in commercial use and have been proved both experimentally and in practical field testing to effectively provide a high reliability and low error rate, in a speaker-independent environment. Nevertheless, even these systems, while at the forefront of present day technology, and the concept upon which they were developed, have shortcomings in both the false-alarm rate and speaker-independent performance.
The continuous speech recognition methods described in the above-identified U.S. patents are
directed primarily to recognizing or spotting one of a plurality of keywords in continuous speech. In other applications, a continuous word string can be recognized wherein the result of the recognition process is the identity of each of the individual word elements of the continuous word string. A continuous word string in this context is a plurality of recognizable elements which are bounded by silence. This is related, for example, to the commercial equipment noted above with respect to the isolated word application in which the boundaries are a priori known. Here, however, the boundaries, silence, are unknown and must be determined by the recognition system itself. In addition, the elements being examined are no longer keyword elements but a plurality of elements "strung" together to form the word string. Various methods and apparatus have been suggested in the art for recognizing continuous word strings. These apparatus and methods, however, have various shortcomings, again, for example, in false alarm rate, speaker independent performance, and real time operation.
Therefore, a principal object of the present invention is a speech recognition method and apparatus having improved effectiveness in recognizing continuous word strings in a continuous, unmarked audio signal.
Other objects of the invention are a method and apparatus which are relatively insensitive to phase and amplitude distortion of the unknown audio input signal data, which are relatively insensitive to variations in the articulation rate of the unknown audio input signals, which will respond equally well to different speakers and hence different voice characteristics, which are reliable and have an improved lower false-alarm rate, and which will operate in real time.
SUMMARY OF THE INVENTION
The invention relates to a method and apparatus for analyzing audio signals. Specifically, the audio signal is a speech signal and the method and apparatus recognize keywords in the speech. Each keyword is characterized by a keyword template having at least one target pattern. Each target pattern represents at least one short-term power spectrum, and each pattern further has associated with it at least one required dwell time position followed by at least one optional dwell time position. In general, each target pattern will have a plurality of required and optional dwell time positions.
The recognition method features the steps of forming, at a repetitive frame time, a sequence of frame patterns from, and representing, the audio signal.
Numerical measures of the similarity of each frame pattern with each target pattern are then generated.
The method further features the steps of accumulating, for each of the target pattern required and optional dwell time positions, using the numerical measures, a numerical value representing the alignment of the just formed audio representing frame pattern with the respective target pattern dwell time positions, and generating a recognition decision based upon the numerical values when a predetermined event occurs in the audio signal. Preferably, the predetermined event is the recognition of "silence".
In another aspect, the accumulation step features the steps of (1) accumulating, for each target pattern second and later dwell position, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the present numerical measure associated with the target pattern; (2) accumulating, for each keyword first target pattern, first required dwell time position, the sum of the best accumulated score during the previous frame time which is associated with the end of a keyword, and the present numerical measure associated with the keyword first target pattern; and (3) accumulating, for each other target pattern first dwell position, the sum of the best ending accumulated value for the previous target pattern of the same keyword and the present numerical measure associated with the target pattern.
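The three accumulation rules above can be sketched in software as a single frame-time update. The data layout, the function and variable names, and the convention that lower scores are better are illustrative assumptions, not the patent's implementation:

```python
def update_scores(scores, measures, templates, best_end_score):
    """One frame-time update of the accumulated dwell-position scores.

    scores[(kw, p)]   -- list of accumulated costs, one per dwell position
                         (required positions first, then optional ones)
    measures[(kw, p)] -- current-frame numerical measure (cost) of the
                         frame pattern against target pattern p of keyword kw
    templates[kw]     -- list of (n_required, n_optional) per target pattern
    best_end_score    -- best accumulated score ending any keyword at the
                         previous frame time (seed for rule 2)
    Lower scores are assumed to be better.
    """
    new_scores = {}
    for kw, patterns in templates.items():
        for p, (n_req, n_opt) in enumerate(patterns):
            old = scores[(kw, p)]
            m = measures[(kw, p)]
            new = [float("inf")] * (n_req + n_opt)
            # Rule 1: second and later dwell positions extend the previous
            # dwell position of the same pattern from the previous frame.
            for d in range(1, n_req + n_opt):
                new[d] = old[d - 1] + m
            if p == 0:
                # Rule 2: first pattern, first required dwell position is
                # seeded by the best score that ended a keyword.
                new[0] = best_end_score + m
            else:
                # Rule 3: first dwell position of a later pattern continues
                # from the best valid ending (last required dwell position
                # or any optional one) of the previous pattern.
                n_req_prev, _ = patterns[p - 1]
                prev = scores[(kw, p - 1)]
                new[0] = min(prev[n_req_prev - 1:]) + m
            new_scores[(kw, p)] = new
    return new_scores
```

Iterating this update once per frame, and recording the best keyword ending at each frame, yields the score lattice from which the traceback described below recovers the word string.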
The method further features the steps of storing, in association with each frame time position, the identity and duration, in frame time positions, of the keyword having the best score and a valid ending at the frame time position, and storing, in association with each dwell time position accumulated score, a word duration count corresponding to the time position length of the keyword associated with the accumulated score at that dwell time position. Thereby, the generating step further includes the step of tracing back, through the stored keyword identity and duration information, for determining each keyword in a word string.
In yet another aspect, the method further features storing, in association with each accumulated score corresponding to the dwell time positions, a keyword duration count. The duration count corresponds to the number of numerical measures, that is, the time position count, which have been accumulated to form that position score for the present keyword pattern. In another aspect, the method further features directing the transfer of the accumulated scores in response to a syntax controlled element.
The apparatus of the invention can be accommodated in either hardware, software, or a mixture of the two. Hardware employed for implementing the method of the invention is described in further detail hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects, features, and advantages of the invention will appear from the following description of a preferred embodiment taken together with the drawings, in which:
Figure 1 is a flow chart illustrating in general terms the sequence of operations performed in accordance with the practice of the present invention;
Figure 1A is an electrical block diagram of apparatus according to a preferred embodiment of the invention;
Figure 2 is a schematic block diagram of electronic apparatus for performing certain preprocessing operations in the overall process illustrated in Figure 1;
Figure 3 is a flow diagram of a digital computer program performing certain procedures in the process of Figure 1;
Figure 4 is a graphical representation of the pattern alignment process according to the invention;
Figure 5 is an electrical block diagram of a likelihood function processor according to a preferred embodiment of the invention;
Figure 6 is an electrical schematic block diagram of the subtract and absolute value circuit according to a preferred embodiment of the invention;
Figure 7 is an electrical circuit diagram of an overflow detection logic circuit according to a preferred embodiment of the invention;
Figure 8 is a truth table for the circuit diagram of Figure 7;
Figure 9 is a schematic flow representation of a syntax processor according to one particular embodiment of the processor of the invention; and

Figure 10 is an electrical block diagram showing a sequential decoding pattern alignment circuit configuration according to a preferred particular embodiment of the invention.
Corresponding reference characters indicate corresponding elements throughout the several views of the drawings.
DESCRIPTION OF A PREFERRED EMBODIMENT
In one of the particular preferred embodiments which is described herein, speech recognition is performed by an overall apparatus which involves both a specially constructed electronic system for effecting certain analog and digital processing of incoming audio data signals, generally speech, and a general purpose digital computer which is programmed in accordance with the present invention to effect certain other data reduction steps and numerical evaluations. The division of tasks between the hardware portion and the software portion of this system has been made so as to obtain an overall system which can accomplish speech recognition in real time at moderate cost. However, it should be understood that some of the tasks being performed in hardware in this particular system could well be performed in software, and that some of the tasks being performed by software programming in this example might also be performed by special purpose circuitry in a different embodiment of the invention. In this latter connection, where available, hardware and software implementations of the apparatus will be described.
One aspect of the present invention is the provision of apparatus which will recognize a word string in continuous speech signals even though those signals are distorted, for example, by a telephone line.
Thus, referring in particular to Figure 1, the voice input signal, indicated at 10, may be considered a voice signal produced by a carbon element telephone transmitter and receiver over a telephone line encompassing any arbitrary distance or number of switching interchanges. A typical application of the invention is therefore recognizing continuous word strings in audio data from an unknown source received over the telephone system. On the other hand, the input signal may also be any audio data signal, for example, a voice input signal, taken from a radio telecommunications link, for example, from a commercial broadcast station, from a private dedicated communications link, or an operator standing near the equipment.
As will become apparent from the description, the present method and apparatus are concerned with the recognition of speech signals containing a sequence of sounds or phonemes, or other recognizable indicia. In the description herein, and in the claims, reference is made to either "a word," "an element," "a sequence of target patterns," "a template pattern," or "an element template," the five terms being considered as generic and equivalent. This is a convenient way of expressing a recognizable sequence of audio sounds, or representations thereof, which combine to constitute the word string which the method and apparatus can detect and recognize. The terms should be broadly and generically construed to encompass anything from a single phoneme, syllable, or sound, to a series of words (in the grammatical sense), as well as a single word.
An analog-to-digital (A/D) converter 13 receives the incoming analog audio signal data on line 10 and converts the signal amplitude of the incoming data to a digital form. The illustrated A/D converter is designed to convert the input signal data to a twelve-bit binary representation, the conversions occurring at the rate of 8,000 conversions per second. (In other embodiments, other sampling rates can be employed; for example, a 16 kHz rate can be used when a high quality signal is available.) The A/D converter 13 applies its output over lines 15 to an autocorrelator 17. The autocorrelator 17 processes the digital input signals to generate a short-term autocorrelation function one hundred times per second and applies its output, as indicated, over lines 19. Each autocorrelation function has thirty-two values or channels, each value being calculated to a 30-bit resolution. The autocorrelator is described in greater detail hereinafter with reference to Figure 2.
The autocorrelation functions over lines 19 are Fourier transformed by a Fourier transformation apparatus 21 to obtain corresponding short-term windowed power spectra over lines 23. The spectra are generated at the same repetition rate as the autocorrelation functions, that is, 100 per second, and each short-term power spectrum has thirty-one numerical terms having a resolution of 16 bits each. As will be understood, each of the thirty-one terms in the spectrum represents the signal power within a frequency band. The Fourier transformation apparatus also preferably includes a Hamming or similar window function to reduce spurious adjacent-band responses.
In the first illustrated embodiment, the Fourier transformation as well as subsequent processing steps are preferably performed under the control of a general purpose digital computer, appropriately programmed, utilizing a peripheral array processor for speeding the arithmetic operations required repetitively according to the present method. The particular computer employed is a model PDP-11* manufactured by the Digital Equipment Corporation of Maynard, Massachusetts.
The particular array processor employed is described in U.S. Patent 4,228,498, assigned to the assignee of this application. The programming described hereinafter with reference to Figure 3 is substantially predicated upon the capabilities and characteristics of these available digital processing units.
*Trade Mark

The short-term windowed power spectra are frequency-response equalized, as indicated at 25, equalization being performed as a function of the peak amplitudes occurring in each frequency band or channel, as described in greater detail hereinafter. The frequency-response equalized spectra, over lines 26, are generated at the rate of one hundred per second, and each spectrum has thirty-one numerical terms evaluated to 16-bit accuracy. To facilitate the final evaluation of the incoming audio data, the frequency-response equalized and windowed spectra over lines 26 are subjected to an amplitude transformation, as indicated at 35, which imposes a non-linear amplitude transformation on the incoming spectra. This transformation is described in greater detail hereinafter, but it may be noted at this point that it improves the accuracy with which the unknown incoming audio signal may be matched with target pattern templates in a reference vocabulary. In the illustrated embodiment, this transformation is performed on all of the frequency-response equalized and windowed spectra at a time prior to the comparison of the spectra with patterns representing the elements of the reference vocabulary.
The amplitude transformed and equalized short-term spectra over lines 38 are then compared against the element templates at 40, as described in detail below. The reference patterns, designated at 42, represent the elements of the reference vocabulary in a statistical fashion with which the transformed and equalized spectra can be compared. Each time "silence" is detected, a decision is made with regard to the identity of the just received word string. This is indicated at 44. Candidate words are thus selected according to the closeness of the comparison; and in the illustrated embodiment, the selection process is designed to minimize the likelihood of a missed keyword.
Referring to Figure 1A, a speech recognition system according to the invention employs a controller 45 which may be, for example, a general purpose digital computer such as a PDP-11 or a hardware controller specifically built for the apparatus. In the illustrated embodiment, the controller 45 receives preprocessed audio data from a preprocessor 46 which is described in greater detail in connection with Figure 2.
The preprocessor 46 receives audio input analog signals over a line 47 and provides processed data over interface lines 48 to the control processor.
Generally, the operational speed of the control processor, if a general purpose element, is not fast enough to process the incoming data in real time.
As a result, various special purpose hardware can be advantageously employed to effectively increase the processing speed of element 45. In particular, a vector processing element 48a such as that described in U.S. Patent 4,228,498, assigned to the assignee of this invention, provides significantly increased array processing capability by using a pipeline effect. In addition, as described in more detail in connection with Figures 4, 5, and 6, a likelihood function processor 48b can be used in connection with the vector processor in order to still further increase the operating speed of the apparatus by tenfold.
While in the preferred embodiment of the invention control processor 45 is a digital computer, in another particular embodiment, described in connection with Figure 10, a significant portion of the processing capability is implemented externally of the control processor in a sequential decoding processor 49. The structure of this processor is described in greater detail in connection with Figure 10. Thus, the apparatus for implementing speech recognition illustrated herein has great flexibility, both in its speed capabilities and in the ability to be implemented in hardware, software, or an advantageous combination of hardware and software elements.
Preprocessor

In the apparatus illustrated in Figure 2, an autocorrelation function with its intrinsic averaging is performed digitally on the digital data stream generated by the analog-to-digital converter 13 operating on the incoming analog audio data over line 10, generally a voice signal. The converter 13 provides a digital input signal over lines 15. The digital processing functions, as well as the input analog-to-digital conversion, are timed under the control of a clock oscillator 51. The clock oscillator provides a basic timing signal of 256,000 pulses per second, and this signal is applied to a frequency divider 52 to obtain a second timing signal at 8,000 pulses per second. The slower timing signal controls the analog-to-digital converter 13 together with a latch register 53 which holds the twelve-bit results of the last conversion until the next conversion is completed.
The autocorrelation products are generated by a digital multiplier 56 which multiplies the number contained in register 53 by the output of a thirty-two word shift register 58. Shift register 58 is operated in a recirculating mode and is driven by the faster clock frequency, so that one complete circulation of the shift register data is accomplished for each analog-to-digital conversion. An input to shift register 58 is taken from register 53 once during each complete circulation cycle. One input to the digital multiplier 56 is taken directly from the latch register 53 while the other input to the multiplier is taken (with one exception described below) from the current
output of the shift register through a multiplexer 59.
The multiplications are performed at the higher clock frequency.
Thus, each value obtained from the A/D
conversion is multiplied with each of the preceding 31 conversion values. As will be understood by those skilled in the art, the signals thereby generated are equivalent to multiplying the input signal by itself, delayed in time by thirty-two different time increments (one of which is the zero delay). To produce the zero delay correlation, that is, the power of the signal, multiplexer 59 causes the current value of the latch register 53 to be multiplied by itself at the time each new value is being introduced into the shift register.
This timing function is indicated at 60.
As will also be understood by those skilled in the art, the products from a single conversion, together with its 31 predecessors, will not be fairly representative of the energy distribution or spectrum over a reasonable sampling interval. Accordingly, the apparatus of Figure 2 provides for averaging of these sets of products.
An accumulation process, which effects averaging, is provided by a thirty-two word shift register 63 which is interconnected with an adder 65 to form a set of thirty-two accumulators. Thus, each word can be recirculated after having been added to the corresponding increment from the digital multiplier.
The circulation loop passes through a gate 67 which is controlled by a divide-by-N divider circuit 69 driven by the low frequency clock signal. The divider 69 divides the lower frequency clock by a factor which determines the number of instantaneous autocorrelation functions which are accumulated, and thus averaged, before the shift register 63 is read out.
In the illustrated example, eighty samples are accumulated before being read out. In other words, N
for the divide-by-N divider circuit 69 is equal to eighty. After eighty conversion samples have thus been correlated and accumulated, the divider circuit 69 triggers a computer interrupt circuit 71 over a line 72.
At this time, the contents of the shift register 63 are successively read into the computer memory through suitable interface circuitry 73, the thirty-two successive words in the register being presented in ordered sequence to the computer through the interface 73. As will be understood by those skilled in the art, this data transfer from a peripheral unit, the autocorrelator preprocessor, to the computer may be typically performed by a direct memory access procedure.
Predicated on an averaging of eighty samples, at an initial sampling rate of 8,000 samples per second, it will be seen that 100 averaged autocorrelation functions are provided to the computer every second.
While the shift register contents are being read out to the computer, the gate 67 is closed so that each of the words in the shift register is effectively reset to zero to permit the accumulation process to begin again.
Expressed in mathematical terms, the operation of the apparatus shown in Figure 2 can be described as follows. Assuming that the analog-to-digital converter generates the time series S(t), where t = 0, T0, 2T0, ..., and T0 is the sampling interval (1/8000 sec in the illustrated embodiment), the illustrated digital correlation circuitry of Figure 2 may be considered, ignoring start-up ambiguities, to compute the autocorrelation function

    a(j, t) = Σ [k=0 to 79] S(t + kT0) S(t + (k - j)T0)        (1)

where j = 0, 1, 2, ..., 31; and t = 80 T0, 160 T0, ..., 80n T0, ... . These autocorrelation functions correspond to the correlation output on lines 19 of Figure 1.
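A direct, unoptimized software rendering of Equation 1 may make the hardware's behavior concrete. The handling of the first few samples (which the text sets aside as "start-up ambiguities") is an assumption here: the first 32 samples are used only as history for the lagged term.

```python
import numpy as np

def averaged_autocorrelation(samples, n_lags=32, frame_len=80):
    """One 32-value autocorrelation function per 80-sample frame, after
    Equation 1 (80 samples at 8 kHz gives 100 functions per second).

    The first n_lags samples serve only as history so that the lagged
    index (k - j) never runs off the front of the buffer.
    """
    samples = np.asarray(samples, dtype=float)
    n_frames = (len(samples) - n_lags) // frame_len
    a = np.zeros((n_frames, n_lags))
    for f in range(n_frames):
        t = n_lags + f * frame_len          # frame start time index
        for j in range(n_lags):             # lag channel
            for k in range(frame_len):      # 80-sample accumulation
                a[f, j] += samples[t + k] * samples[t + k - j]
    return a
```

The triple loop mirrors the hardware directly: the inner accumulation over k corresponds to the divide-by-N (N = 80) accumulator, and the j loop corresponds to the thirty-two recirculating shift-register channels.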
Referring now to Figure 3, the digital correlator operates continuously to transmit to the computer a series of data blocks at the rate of one complete autocorrelation function every ten milliseconds. This is indicated at 77 (Fig. 3). Each block of data represents the autocorrelation function derived from a corresponding subinterval of time. As noted above, the illustrated autocorrelation functions are provided to the computer at the rate of one hundred
32-word functions per second. This analysis interval is referred to hereinafter as a "frame".
In the first illustrated embodiment, the processing of the autocorrelation function data is performed by an appropriately programmed, special purpose digital computer. The flow chart, which includes the functions provided by the computer program, is given in Figure 3. Again, however, it should be pointed out that various of the steps could also be performed by hardware (as described below) rather than software, and that likewise certain of the functions performed by the apparatus of Figure 2 could additionally be performed in software by a corresponding revision of the flow chart of Figure 3.
Although the digital correlator of Figure 2 performs some time-averaging of the autocorrelation functions generated on an instantaneous basis, the average autocorrelation functions read out to the computer may still contain some anomalous discontinuities or unevenness which might interfere with the orderly processing and evaluation of the samples.
Accordingly, each block of data, that is, each autocorrelation function a(j,t), is first smoothed with respect to time. This is indicated in the flow chart of Figure 3 at 78. The preferred smoothing process is one in which the smoothed autocorrelation output as(j,t) is given by

    as(j, t) = C0 a(j, t) + C1 a(j, t-T) + C2 a(j, t-2T)        (2)

where a(j,t) is the unsmoothed input autocorrelation defined in Equation 1, as(j,t) is the smoothed autocorrelation output, j denotes the delay time, t denotes real time, and T denotes the time interval between consecutively generated autocorrelation functions (frames), equal to .01 second in the preferred embodiment. The weighting functions C0, C1, C2 are preferably chosen to be 1/4, 1/2, 1/4 in the illustrated embodiment, although other values could be chosen. For example, a smoothing function approximating a Gaussian impulse response with a frequency cutoff of, say, 20 Hertz could have been implemented in the computer software. However, experiments indicate that the illustrated, easier to implement, smoothing function of Equation 2 provides satisfactory results. As indicated, the smoothing function is applied separately for each value j of delay.
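The smoothing of Equation 2 is a three-tap weighted average applied independently to each lag channel. A minimal sketch follows; the handling of the first two frames, where no history yet exists, is an assumption (the patent leaves start-up unspecified):

```python
def smooth_frames(frames, c0=0.25, c1=0.5, c2=0.25):
    """Time-smooth a sequence of autocorrelation frames per Equation 2:
    as(j,t) = C0*a(j,t) + C1*a(j,t-T) + C2*a(j,t-2T).

    frames: list of equal-length lists, one frame per frame time T.
    The default weights are the preferred 1/4, 1/2, 1/4 values.
    """
    out = []
    for t, frame in enumerate(frames):
        prev1 = frames[t - 1] if t >= 1 else frame   # assumed start-up
        prev2 = frames[t - 2] if t >= 2 else prev1   # assumed start-up
        out.append([c0 * a + c1 * b + c2 * c
                    for a, b, c in zip(frame, prev1, prev2)])
    return out
```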
It will become clear that subsequent analysis involves various operations on the short-term Fourier power spectrum of the speech signal, and for reasons of hardware simplicity and processing speed, the transformation of the autocorrelation function to the frequency domain is carried out in eight-bit arithmetic in the illustrated embodiment. At the high end of the band pass, near three kilohertz, the spectral power density decreases to a level at which resolution is inadequate in eight-bit quantities. Therefore, the frequency response of the system is tilted at a rising rate of 6 dB per octave. This is indicated at 79. This high frequency emphasis is accomplished by taking the second derivative of the autocorrelation function with respect to its argument, i.e., the time delay or lag.
The derivative operation is

    b(j, t) = -a(j+1, t) + 2a(j, t) - a(j-1, t)        (3)
To evaluate the derivative for j = 0, it is assumed that the autocorrelation function is symmetrical about 0, so that a(-j,t) = a(+j,t). Also, there is no data for a(32), so the derivative at j = 31 is taken to be the same as the derivative when j = 30.
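The second-difference operation with these two boundary rules can be sketched as follows (a list-based rendering; the patent performs this in fixed-point arithmetic):

```python
def high_freq_emphasis(a):
    """Equation 3 with the stated boundary rules: a(-1) = a(+1) by
    symmetry about lag zero, and the j = 31 value copies the j = 30
    derivative since a(32) does not exist."""
    n = len(a)
    b = [0.0] * n
    for j in range(n - 1):
        a_prev = a[j - 1] if j > 0 else a[1]      # symmetry at j = 0
        b[j] = -a[j + 1] + 2.0 * a[j] - a_prev    # Equation 3
    b[n - 1] = b[n - 2]                            # copy j = 30 into j = 31
    return b
```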
As indicated in the flow chart of Fig. 3, the next step in the analysis procedure, after high frequency emphasis, is to estimate the signal power in the current frame interval by finding the peak absolute value of the autocorrelation. The power estimate, P(t), is

    P(t) = max over i of |b(i, t)|        (4)

In order to prepare the autocorrelation for the eight-bit spectrum analysis, the smoothed autocorrelation function is block normalized with respect to P(t) (at 80) and the most significant eight bits of each normalized value are input to the spectrum analysis hardware. The normalized (and smoothed) autocorrelation function is, therefore:
    c(j, t) = b(j, t) / P(t)        (5)

As indicated at 81, a cosine Fourier transform is then applied to each time smoothed, frequency emphasized, normalized autocorrelation function, c(j,t), to generate a 31 point power spectrum. The matrix of cosine values is given by:
    S(i, j) = 126 g(i) cos(2π i f(j)/8000),   j = 0, 1, 2, ..., 31        (6)

where S(i,j) is the spectral energy in a band centered at f(j) Hz, at time t; g(i) = 1/2 (1 + cos(2π i/63)) is the (Hamming) window function envelope to reduce side lobes; and

    f(j) = 30 + 1000 (0.0552j + 0.438)^1.63 Hz;   j = 0, 1, 2, ..., 31        (7)

which are the analysis frequencies equally spaced on the so-called "mel" curve of subjective musical pitch. As will be understood, this corresponds to a subjective pitch (mel scale) frequency-axis spacing for frequencies in the bandwidth of a typical communication channel of about 300-3500 Hertz.
Since the spectrum analysis requires summation over lags from -31 to +31, by making the assumption that the autocorrelation is symmetric about zero, only the positive values of j are required. However, to avoid counting the lag zero term twice, the cosine matrix is adjusted so that

    S(0, j) = 126/2 = 63, for all j        (8)

Thus the computed power spectrum is given by

    S'(j, t) = | Σ [i=0 to 31] c(i, t) S(i, j) |,   j = 0, 1, ..., 31        (9)

where the jth result corresponds to the frequency f(j).
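Equations 6 through 9 can be sketched together in code. Caution: the exponent 1.63 in Equation 7, the exact form of the window envelope g(i), and the use of c(i,t) in Equation 9 are readings of garbled source text and should be treated as assumptions:

```python
import numpy as np

def mel_frequencies(n=32):
    """Equation 7: analysis frequencies on the subjective-pitch curve."""
    j = np.arange(n)
    return 30.0 + 1000.0 * (0.0552 * j + 0.438) ** 1.63

def cosine_matrix(n=32):
    """Equations 6 and 8: the windowed cosine transform matrix S(i, j)."""
    f = mel_frequencies(n)
    i = np.arange(n).reshape(-1, 1)
    g = 0.5 * (1.0 + np.cos(2.0 * np.pi * i / 63.0))   # window envelope
    s = 126.0 * g * np.cos(2.0 * np.pi * i * f / 8000.0)
    s[0, :] = 63.0        # Equation 8: lag-zero row halved (126/2)
    return s

def power_spectrum(c):
    """Equation 9: 32-point spectrum from one normalized
    autocorrelation frame c."""
    return np.abs(np.asarray(c) @ cosine_matrix(len(c)))
```

A useful sanity check on the design: a unit impulse at lag zero (a flat-spectrum signal) produces the constant value 63 in every band, confirming that the lag-zero adjustment of Equation 8 removes the double counting.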
As will also be understood, each point or value within each spectrum represents a corresponding band of frequencies. While this Fourier transform can be performed completely within the conventional computer hardware, the process may be speeded considerably if an external hardware multiplier or Fast Fourier Transform (FFT) peripheral device is utilized. The construction and operation of such modules are well known in the art, however, and are not described in detail herein.
Advantageously built into the hardware Fast Fourier Transform peripheral device is the frequency smoothing function wherein each of the spectra are smoothed in frequency according to the preferred (Hamming) window weighting function g(i) defined above. This is indicated at 83 of the block 85 which corresponds to the hardware Fourier transform implementation.
If the background noise is significant, an estimate of the power spectrum of the background noise should be subtracted from S'(j,t) at this stage. The frame or frames selected to represent the noise should not contain any speech signals. The optimum rule for selecting noise frame intervals will vary with the application. If the talker is engaged in two-way communication, for example, with a machine controlled by the speech recognition apparatus, it is convenient, for example, to choose a frame arbitrarily in the interval immediately after the machine has finished speaking by its voice response unit. In less constrained situations, the noise frame may be found by choosing a frame of minimum amplitude during the past one or two seconds of audio input.
As successive smoothed power spectra are received from the Fast Fourier Transform peripheral 85, a communications channel equalization is obtained by determining a (generally different) peak power spectrum envelope for the spectra from peripheral 85, and modifying the output of the Fast Fourier Transform apparatus accordingly, as described below. Each newly generated peak amplitude spectrum p(j, t), corresponding to and updated by an incoming windowed power spectrum S'(j, t), where j is indexed over the plural frequency bands of the spectrum, is the result of a fast attack, slow decay, peak detecting function for each of the spectrum channels or bands. The windowed power spectra are normalized with respect to the respective terms of the corresponding peak amplitude spectrum. This is indicated at 87.
According to the illustrated embodiment, the values of the "old" peak amplitude spectrum p(j, t - T), determined prior to receiving a new windowed spectrum, are compared on a frequency band by frequency band basis with the new incoming spectrum S'(j, t). The new peak spectrum p(j,t) is then generated according to the following rules. The power amplitude in each band of the "old" peak amplitude spectrum is multiplied by a fixed fraction, for example, 1023/1024, in the illustrated example. This corresponds to the slow decay portion of the peak detecting function. If the power amplitude in a frequency band j of the incoming spectrum S'(j,t) is greater than the power amplitude in the corresponding frequency band of the decayed peak amplitude spectrum, then the decayed peak amplitude spectrum value for that (those) frequency band(s) is (are) replaced by the spectrum value of the corresponding band of the incoming windowed spectrum. This corresponds to the fast attack portion of the peak detecting function. Mathematically, the peak detecting function can be expressed as

    p(j, t) = max { p(j, t-T) (1 - E); P(t); S'(j, t) },   j = 0, 1, ..., 31        (10)

where j is indexed over each of the frequency bands, p(j,t) is the resulting peak spectrum, p(j, t-T) is the "old" or previous peak spectrum, S'(j,t) is the new incoming, partially processed, power spectrum, P(t) is the power estimate at time t, and E is the decay parameter.
According to equation 10, the peak spectrum normally decays, absent a higher value spectrum input, by a factor of 1 - E. Typically, E equals 1/1024. It may, however, be undesirable to permit decay of the peak spectrum during intervals of silence, particularly if no rapid change in the communication channel or voice characteristics is expected. To define the silence frame, the same method employed to choose background noise frames can be employed. The amplitudes (square root of P(t)) of the past 128 frames are inspected, and the minimum value found. If the amplitude of the current frame is less than four times this minimum, the current frame is determined to be silence, and the value "zero" is substituted for the value 1/1024, or E.
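The fast-attack, slow-decay update and the silence override can be sketched as below. The placement of P(t) as a third argument of the max follows the reconstructed Equation 10 and should be treated as an assumption:

```python
def update_peak_spectrum(p_old, s_new, power, silent, e=1.0 / 1024.0):
    """Equation 10 with the silence rule: each band decays by (1 - E)
    unless the frame is silence (E forced to zero), and is replaced
    whenever the incoming spectrum or power estimate exceeds it."""
    decay = 0.0 if silent else e
    return [max(p * (1.0 - decay), power, s)
            for p, s in zip(p_old, s_new)]

def is_silence(current_amplitude, past_amplitudes):
    """Silence test from the text: the current frame amplitude is less
    than four times the minimum over the past 128 frame amplitudes."""
    return current_amplitude < 4.0 * min(past_amplitudes[-128:])
```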
After the peak spectrum is generated, the resulting peak amplitude spectrum p(j,t) is frequency smoothed at 89 by averaging each frequency band peak value with peak values corresponding to adjacent frequencies of the newly generated peak spectra, the width of the overall band of frequencies contributing to the average value being approximately equal to the typical frequency separation between formant frequencies. As will be understood by those skilled in the speech recognition art, this separation is in the order of about 1000 Hz. By averaging in this particular way, the useful information in the spectra, that is, the local variations revealing formant resonances, are retained whereas overall or gross emphasis in the frequency spectrum is suppressed. According to the preferred embodiment, the peak spectrum is smoothed with respect to frequency by a moving average function covering seven adjacent frequency bands. The averaging function is:

    e(j, t) = h(j) Σ [k=j-3 to j+3] p(k, t)        (11)

At the ends of the passband, p(k,t) is taken to be 0, for k less than 0 and k greater than 31. The normalizing envelope h(j) takes into account the number of valid data elements actually summed: thus, h(0) = 7/4, h(1) = 7/5, h(2) = 7/6, h(3) = 1, ..., h(28) = 1, h(29) = 7/6, h(30) = 7/5, and h(31) = 7/4. The resulting smoothed peak amplitude spectrum e(j,t) is then employed to normalize and frequency equalize the just received power spectrum, S'(j,t), by dividing the amplitude value of each frequency band of the incoming smoothed spectrum S'(j,t) by the corresponding frequency band value in the smoothed peak spectrum e(j,t). Mathematically, this corresponds to

    sn(j, t) = (S'(j, t) / e(j, t)) 32767        (12)
where sn(j,t) is the peak-normalized, smoothed power spectrum and j is indexed over each of the frequency bands. This step is indicated at 91. There results a sequence of frequency equalized and normalized short-term power spectra which emphasizes changes in the frequency content of the incoming audio signals while suppressing any generalized long-term frequency emphasis or distortion. This method of frequency compensation has been found to be highly advantageous in the recognition of speech signals transmitted over frequency distorting communication links such as telephone lines, in comparison to the more usual systems of frequency compensation in which the basis for compensation is the average power level, either in the whole signal or in each respective frequency band.
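Equations 11 and 12 in sketch form; the h(j) envelope is computed as 7 divided by the number of in-band terms, which reproduces the tabulated values (h(0) = 7/4, h(3)..h(28) = 1, h(31) = 7/4):

```python
def smooth_peak(p):
    """Equation 11: seven-band moving sum scaled by the normalizing
    envelope h(j) = 7 / (number of valid in-band terms)."""
    n = len(p)
    e = []
    for j in range(n):
        valid = [p[k] for k in range(j - 3, j + 4) if 0 <= k < n]
        e.append((7.0 / len(valid)) * sum(valid))
    return e

def equalize(s_prime, e, scale=32767.0):
    """Equation 12: band-by-band division of the incoming spectrum by
    the smoothed peak spectrum, rescaled to the 16-bit range."""
    return [scale * s / ev for s, ev in zip(s_prime, e)]
```

For a flat peak spectrum, every band of e(j,t) comes out equal, so the equalization leaves the relative shape of S'(j,t) untouched; only long-term tilts in the channel response are divided out.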
It is useful to point out that, while successive spectra have been variously processed and equalized, the data representing the incoming audio signals still comprise spectra occurring at a rate of one hundred per second.
The normalized and frequency equalized spectra, indicated at 91, are subjected to an amplitude transformation, indicated at 93, which effects a non-linear scaling of the spectrum amplitude values.
Designating the individual equalized and normalized spectra as sn(j,t) (from Equation 12), where j indexes the different frequency bands of the spectrum and t denotes real time, the non-linear scaled spectrum x(j,t) is defined by the linear fraction function

x(j,t) = 128 · (sn(j,t) − A) / (sn(j,t) + A),   j = 0, 1, ..., 30          (13)
where A is the average value of the spectrum sn(j,t), and is defined as follows:

A = (1/31) Σ sn(j,t),  summed over j = 0 to 30          (14)

where j indexes over the frequency bands of the power spectrum.
The thirty-first term of the spectrum is replaced by the logarithm of A so that

x(31,t) = 16 log2 A          (15)

This scaling function (Eq. 13) produces a soft threshold and gradual saturation effect for spectral intensities which deviate greatly from the short-term average A. Mathematically, for intensities near the average, the function is approximately linear; for intensities further from the average, it is approximately logarithmic; and at the extreme values of intensity, it is substantially constant. On a logarithmic scale, the function x(j,t) is symmetric about zero and the function exhibits threshold and saturation behavior that is suggestive of an auditory nerve firing-rate function. In practice, the overall recognition system performs significantly better with this particular non-linear scaling function than it does with either a linear or a logarithmic scaling of the spectrum amplitudes.
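Equations 13 through 15 can be sketched as one transform from an equalized spectrum to a 32-word frame. This is a sketch under stated assumptions, not the patented implementation: in particular, we assume A is averaged over the 31 bands j = 0..30 that Equation 13 rescales, and the function name is ours.

```python
import math

def amplitude_transform(sn):
    """Non-linear scaling of Eqs. (13)-(15).  `sn` is one equalized,
    normalized 32-band spectrum.  Returns a 32-word frame whose last
    entry carries 16*log2(A), per Eq. (15)."""
    A = sum(sn[:31]) / 31.0                              # Eq. (14), short-term average
    x = [128.0 * (v - A) / (v + A) for v in sn[:31]]     # Eq. (13): soft threshold, saturation
    x.append(16.0 * math.log2(A))                        # Eq. (15)
    return x
```

A band exactly at the average maps to zero, bands far above the average approach +128, and bands far below approach −128, which is the threshold-and-saturation behavior described above.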
There is thus generated a sequence of amplitude transformed, frequency-response equalized, normalized, short-term power spectra x(j,t) where t equals .01, .02, .03, .04, ..., seconds and j = 0, ..., 30 (corresponding to the frequency bands of the generated power spectra). Thirty-two words are provided for each spectrum; and the value of A (Equation 15), the average value of the spectrum values, is stored as the thirty-second word. The amplitude transformed, short-term power spectra, hereinafter referred to as "frames", are stored, as indicated at 95, in a first-in, first-out circulating memory having storage capacity, in the illustrated embodiment, for 256 thirty-two-word spectra. There is thus made available for analysis, in the illustrated embodiment, 2.56 seconds of the audio input signal. This storage capacity provides the recognition system with the flexibility, if required, to select spectra at different real times for analysis and evaluation, and thus with the ability to go forward and backward in time as the analysis requires.
Thus, the frames for the last 2.56 seconds are stored in the circulating memory and are available as needed. In operation, in the illustrated embodiment, each frame is stored for 2.56 seconds. Thus, a frame which enters the circulating memory at time t1 is lost or shifted from the memory 2.56 seconds later as a new frame, corresponding to a time t1 + 2.56, is stored.
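The circulating memory just described behaves like a fixed-capacity first-in, first-out queue. A minimal sketch (class and method names are ours, not from the patent):

```python
from collections import deque

class FrameBuffer:
    """Sketch of the 256-frame first-in, first-out circulating memory: at
    one frame per 10 ms this holds the most recent 2.56 s of input, and the
    oldest frame is shifted out as each new frame arrives."""
    def __init__(self, capacity=256):
        self.frames = deque(maxlen=capacity)

    def push(self, frame):
        self.frames.append(frame)       # silently drops the oldest frame when full

    def seconds_available(self, frame_period=0.01):
        return len(self.frames) * frame_period
```

A `deque` with `maxlen` gives exactly the shift-out behavior described: a frame pushed at time t1 disappears once 256 newer frames have arrived.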
The frames passing through the circulating memory are compared, preferably in real time, against a known vocabulary of words to determine and identify the input data in word groups called a word string. Each vocabulary word is represented by a template pattern statistically representing a plurality of processed power spectra formed into plural non-overlapping multi-frame (preferably three frames) design set patterns. These patterns are preferably selected to best represent significant acoustical events of the vocabulary words and are stored at 94.
The spectra forming the design set patterns are generated for the words spoken in various contexts using the same system described hereinabove for processing the continuous unknown speech input on line 10, as shown in Figure 3.
Thus, each vocabulary word has associated with it a generally plural sequence of design set patterns, P(i)1, P(i)2, ..., which represent, in a domain of short-term power spectra, one designation of that ith keyword. The collection of design set patterns for each keyword forms the statistical basis from which the target patterns are generated.
In the illustrated embodiment of the invention, the design set patterns P(i)j can each be considered a 96 element array comprising three selected frames arranged in a series sequence. The frames forming the pattern should preferably be spaced at least 30 milliseconds apart to avoid spurious correlation due to time domain smoothing. In other embodiments of the invention, other sampling strategies can be implemented for choosing the frames; however, the preferred strategy is to select frames spaced by a constant time duration, preferably 30 milliseconds, and to space the non-overlapping design set patterns throughout the time interval defining the keyword. Thus, a first design set pattern P1 corresponds to a portion of a keyword near its beginning; a second pattern P2 corresponds to a portion later in time, etc., and the patterns P1, P2, ... form the statistical basis for the series or sequence of target patterns, the word template, against which the incoming audio data will be matched. The target patterns t1, t2, ..., each comprise the statistical data, generated from the corresponding P(i)j by assuming the P(i)j are comprised of independent Gaussian variables, which enable a likelihood statistic to be generated between incoming frames, defined below, and the target patterns. Thus, the target patterns consist of an array wherein the entries comprise the mean, standard deviation and area normalization factor for a corresponding collection of design set pattern array entries. A more refined likelihood statistic is described below.
It will be obvious to those skilled in the art that substantially all words will have more than one contextual and/or regional pronunciation and hence more than one "spelling" of design set patterns. Thus, a vocabulary word having the patterned spelling P1, P2, ... referred to above can in actuality be generally expressed as p(i)1, p(i)2, ..., i = 1, 2, ..., M, where each of the p(i)j are possible alternative descriptions of the jth class of design set patterns, there being a total of M different spellings for the word.
The target patterns t1, t2, ..., ti, ..., in the most general sense, therefore, each represent plural alternative statistical spellings for the ith group or class of design set patterns. In the illustrated embodiment described herein, the term "target pattern" is thus used in the most general sense and each target pattern may therefore have more than one permissible alternative "statistical spelling."

Preprocessing of the incoming unknown audio signals and the audio forming the reference patterns is now complete.
Processing the Stored Spectra

A more in-depth study of the keyword recognition method of concatenating phonetic patterns into detected words, described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, has shown that it is a special case of a more general and possibly superior recognition method. Referring to Figure 4, the word recognition search can be represented as the problem of finding an appropriate path through an abstract state space. In the figure, each circle represents a possible state, also designated a dwell time position or register, through which the decision making process can pass. The space between dashed vertical lines 120, 122 represents each of the hypothetical states in which the decision making process can pass in determining whether a pattern matches or does not match a current phoneme. This space is divided into a required dwell time portion 124 and an optional dwell time portion 126. The required dwell time portion is the minimum duration of the particular "current" phoneme or pattern. The optional dwell time portion represents the additional maximum duration of a pattern.
Each of the circles within the optional or required dwell time portions represents one frame time of the continuum of formed frames and corresponds to the 0.01 second intervals from frame to frame. Thus, each circle identifies a hypothesized current phonetic position in a word spelling and, together with the number of (.01 second) frames hypothesized to have elapsed since the current phoneme began, corresponding to the number of earlier "circles" or positions in that phoneme or target pattern, represents the present duration of the pattern.
After a pattern (phoneme) has begun and the minimum dwell time interval has elapsed, there are several possible paths of advancing to the first node or position (circle) 128 of the next target pattern (phoneme). This depends upon when the decision to move to the next pattern (phoneme) of the spelling is made. These decision possibilities are represented in the figure by the several arrows leading to circle 128. A transition to the next pattern (phoneme), the beginning of which is represented by circle 128, might be made at any node or position during the optional dwell time of the current pattern (phoneme) or at the last node of the required dwell time interval.
The key word recognition method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177 makes the transition at the first such node for which the likelihood score relative to the next pattern (phoneme) is better than the likelihood score relative to the current pattern (phoneme); that is, a frame matches the next phoneme or pattern better than the present phoneme or pattern. The total word score, however, is the average pattern (phoneme) score per frame (i.e., per node included in the path). This same "total score" definition, applied to a word score up to the current node, can be used to decide when to make the transition; that is, whether to make the transition to the next pattern at a first opportunity, corresponding for example to a transition indicating line 130, or at a later time, corresponding, for example, to a transition indicating line 132. Optimally, one chooses that path into the next pattern (phoneme) for which the average score per node is best. Since the standard keyword method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177 does not examine any of the potential paths after it has made the decision to move to the next pattern (phoneme), it may make a sub-optimal decision as measured by average score per node.
Accordingly, the present invention employs an average score per node strategy for word string recognition. The problem arises, when used in connection with word string recognition as described in detail hereinafter, that one must either normalize all partial word scores by the number of nodes included, which is computationally inefficient, or else one must bias the accumulation so that an explicit normalization is not necessary. A natural bias to use in the closed vocabulary task is the unnormalized score for the best word ending at the present analysis time; then the accumulated scores at all nodes will always be the sum of the same number of elementary pattern scores. Furthermore, the score is transformed by this bias into the score of the best string of words ending at the current analysis node.
The average score per node decision strategy is efficiently implemented in the Vector Processor described in U.S. Patent 4,228,498 by a dynamic programming technique. When programmed in this manner, the processing speed is somewhat faster than for the standard key word recognition method described in U.S. Patents 4,241,329; 4,227,176; and 4,227,177, even though more hypothesis tests are required.
Generally speaking, to recognize strings of words, the program remembers the name of the best hypothesized vocabulary word ending at each analysis node. It also remembers the node (time) at which this best word began. The best string of words is then found by tracing back from the end of the utterance, noting the stored word name and finding the next previous word at the indicated beginning time of the current word.
By including silence as a vocabulary word, it becomes unnecessary to specify how many words are contained in the string of words. The operation of tracing back to find the string is executed whenever the silence word has the best word score, and the operation terminates at the next previously detected silence.
Thus a string is found every time the talker pauses for breath.
The word string recognition method described herein is one level of abstraction higher than the detection of individual key words. Since the word string scoring forces all speech throughout the utterance to be included in some word of the string, it has an advantage over the simpler word spotting approach, which frequently detects false short words within longer words.
Advantageously, no timing patterns are necessary for the word string, since the word concatenator outputs a word beginning time for each word ending hypothesis. The simplest string concatenator assumes that these word beginning times are correct. On detecting silence, it assumes that the string of words has just ended, and that the beginning of the last word is the end of the previous word (which may be silence). It is then a simple matter to trace backward through the string, choosing the word with the best ending score at each word boundary. Since there is usually a context-dependent transition between each pair of words in the string, it may be preferable to permit the apparatus to search the neighborhood of each word beginning for the best ending of the previous word.
The method and apparatus, including hardware and software embodiments, are now described in greater detail. Referring to Fig. 3, the stored spectra, or frames, at 95, representing the incoming continuous audio data, are compared with the stored template of target patterns indicated at 96, representing keywords of the vocabulary, according to the following method.
For each 10 millisecond frame, a pattern for comparison with the stored reference patterns is formed at 97 by adjoining the current spectrum vector s(j,t), the spectrum s(j,t−.03) from three frames ago, and the spectrum s(j,t−.06) from six frames ago, to form a 96 element pattern:

           ( s(j,t−.06),       j = 0, ..., 31
x(j,t) =   ( s(j−32,t−.03),    j = 32, ..., 63          (16)
           ( s(j−64,t),        j = 64, ..., 95

As noted above, the stored reference patterns consist of the mean values, standard deviations, and area normalizing terms of previously collected 96 element patterns belonging to the various speech pattern classes to be recognized. The comparison is accomplished by a probability model of the values x(j,t) to be expected if the input speech belongs to a particular class.
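The three-frame pattern formation above can be sketched as a simple concatenation of stored spectra; this is an illustrative sketch (the function name is ours), not the patented implementation:

```python
def form_pattern(frames, t):
    """Adjoin the spectrum from six frames ago, three frames ago, and the
    current frame (10 ms frame period, so 60 ms and 30 ms offsets) into one
    96-element pattern, following the equation above.  `frames` is a list
    of 32-band spectra indexed by frame number."""
    return list(frames[t - 6]) + list(frames[t - 3]) + list(frames[t])
```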
While a Gaussian distribution can be used for the probability model (see, e.g., U.S. Patents 4,241,329; 4,227,176; and 4,227,177, referred to above), the Laplace distribution

p(x) = (1/(√2 s')) exp(−√2 |x − m| / s')

(where m is the statistical mean and s' the standard deviation of the variable x) requires less computation and has been found to perform nearly as well as the Gaussian distribution in, for example, the talker independent, isolated word recognition method described in U.S. Patent 4,038,503. The degree of similarity L(x | k) between an unknown input pattern x and the kth stored reference pattern is proportional to the logarithm of the probability and is estimated at 100 by the following formula:

L(x | k) = Σ |xi − uik| / s'ik + Ak,  summed over i = 1 to 96          (17)

where Ak = Σ ln s'ik, summed over i = 1 to 96.

In order to combine the likelihood scores L of a sequence of patterns to form the likelihood score of a spoken word or phrase, the score L(x | k) for each frame is adjusted by subtracting the best (smallest) score of all the reference patterns for that frame, as follows:
L'(x | k) = L(x | k) − min L(x | i), minimized over i          (18)

Thus the best-fitting pattern on each frame will have a score of zero. The adjusted scores for a hypothesized sequence of reference patterns can be accumulated from frame to frame to obtain a sequence score related directly to the probability that a decision in favor of the indicated sequence would be the correct decision.
Comparison of unknown input spectrum patterns against stored known patterns is accomplished by computing the function

q = Σ sik |xi − uik| + Ck,  summed over i = 1 to 96          (19)

(where sik equals 1/s'ik) for the kth reference pattern.
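Equations 17 and 18 can be sketched directly in software; this is a hedged illustration of the scoring arithmetic (function names ours), not the Vector Processor implementation described below:

```python
import math

def laplace_score(x, u, s):
    """Eq. (17): L(x|k) = sum |x_i - u_ik| / s'_ik + A_k, where
    A_k = sum ln s'_ik.  Lower scores indicate a better match."""
    return sum(abs(xi - ui) / si for xi, ui, si in zip(x, u, s)) \
           + sum(math.log(si) for si in s)

def adjust_scores(x, patterns):
    """Eq. (18): subtract the best (smallest) score over all reference
    patterns, so the best-fitting pattern scores zero on each frame.
    `patterns` is a list of (mean, standard_deviation) vector pairs."""
    raw = [laplace_score(x, u, s) for (u, s) in patterns]
    best = min(raw)
    return [r - best for r in raw]
```

With the adjustment of Equation 18, accumulating the returned values frame by frame gives the relative sequence scores described above, without the absolute scale of the per-frame likelihoods drifting.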
In a normal software implemented computation, the following instructions would be executed to compute the algebraic function s|x − u| (of Equation 19):

1. compute x − u
2. test the sign of x − u
3. if x − u is negative, negate to form the absolute value
4. multiply by s
5. add the result into an accumulator

In a typical speech recognition system having a 20-word vocabulary, there would be about 222 different reference patterns. The number of steps required to evaluate them is then 5 × 96 × 222 = 106,560 steps, not including overhead operations, and this must be done in less than 10 milliseconds in order to keep up with the real time spectrum frame rate. The processor must therefore be capable of executing nearly 11 million instructions per second just to evaluate the likelihood functions. In view of the necessary speed, a special purpose likelihood function hardware module 200 (Fig. 4), which is compatible with a system Vector Processor as disclosed in U.S. Patent 4,228,498, is employed.
In this special purpose hardware, the five steps listed above are performed simultaneously with two sets of the arguments s, x, u, so that in effect ten instructions are performed in the time it normally takes to execute one instruction. Since the basic Vector Processor operates at a rate of 8 million instructions per second, the effective computation rate for the likelihood function becomes about 80 million instructions per second with the special purpose hardware module 200 being employed.
Hardware module 200, referring to Fig. 5, employs a combination of hardware pipelining and parallel processing to provide the simultaneous execution of the ten steps. Two identical sections 202, 204 each perform five arithmetic steps upon the independent input data arguments, and the two results are combined by an adder 206 connected to their outputs.
The accumulation of the summations from adder 206 forms the summation from 1 to 96 of Equation 19 and is handled by the arithmetic unit of the standard Vector Processor described in U.S. Patent 4,228,498.
In operation, pipelining registers hold the intermediate data at the following stages of the processing:
1. input arguments (clocked registers 208, 210, 212, 214, 216, 218)

2. absolute value of x − u (clocked registers 220, 222)

3. output of multiplier (clocked registers 224, 226)

With the input data held in clocked registers 208-218, the magnitude of x − u is determined by subtract and absolute value elements 228, 230. Referring to Fig. 6, the subtraction and absolute value elements 228, 230 each contain first and second subtracters 232, 234, one to find x − u and the other to find u − x, and a multiplexer 236 to select the positive result. The input arguments x and u over lines 238, 240 from registers 208, 210 respectively are 8-bit numbers ranging from −128 to +127. Since the difference output of the 8-bit subtracter may overflow to 9 bits (for example, +127 − (−128) = 255), extra circuitry is needed and employed to handle an arithmetic overflow condition. (The condition is determined by an overflow detector 235 whose inputs are the sign of "x" (over a line 235a), the sign of "u" (over a line 235b) and the sign of "x − u" (over a line 235c).) The overflow detectors, referring to Fig. 7, are, in this illustrative embodiment, combinatorial circuits having three-input AND gates 268, 270, and an OR gate 272. The truth table of Fig. 8 defines the overflow condition as a function of its inputs.
The overflow condition is handled by providing four choices in the multiplexer 236, the element which selects the positive subtracter output. The choices are defined by the binary levels on lines 242 and 244. The level on line 242 represents the sign of x − u. The sign on line 244 represents an overflow if "1". Thus the choices are:

line 242   line 244
   0          0       select the subtracter 232 output
   1          0       select the subtracter 234 output
   0          1       select the subtracter 232 output shifted down 1 bit
   1          1       select the subtracter 234 output shifted down 1 bit

The multiplexer is thus controlled to act like an 8-pole, 4-position electrical switch. The "shift" operation is performed combinatorially by connecting (gating) the subtracter outputs to the appropriate multiplexer inputs. The shift has the effect of dividing arithmetically by two.
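The shift-on-overflow scheme can be modeled in a few lines of software. This is an illustrative behavioral model of the datapath described above (function names ours), not a description of the circuit itself; note that, as in the hardware, one least significant bit of the difference is lost when an overflow occurs.

```python
def abs_diff_with_overflow(x, u):
    """Model of the subtract / absolute-value stage for 8-bit signed inputs
    (-128..+127).  When |x - u| needs 9 bits (e.g. +127 - (-128) = 255),
    the multiplexer selects the result shifted down one bit and an
    overflow flag is latched for the later correction stage."""
    d = abs(x - u)
    if d > 127:
        return d >> 1, 1       # shifted magnitude, overflow flag set
    return d, 0

def scaled_product(x, u, s):
    """s * |x - u| with the post-multiplier correction: the product is
    shifted up one bit (multiplied by two) whenever the overflow bit
    was set, restoring the correct scale factor."""
    d, ovf = abs_diff_with_overflow(x, u)
    prod = s * d
    return prod << 1 if ovf else prod
```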
If an overflow has occurred during the subtraction, the output of the multiplexer will be the output of a subtracter divided by two. It is therefore necessary to remember that condition later in the computation so that the final result can be multiplied by two, to restore the correct scale factor. This restoration occurs at the output of the multiplier after the final pipelining register. Therefore an extra bit is provided in the pipeline registers 220, 222, 224, 226 to control second multiplexers 248, 250 which shift, respectively, the multiplicative product of an 8 × 8 bit multiplier 252, 254 up by one bit, to multiply by two, whenever the overflow bit is set (equal to "1"). The multiplication arithmetic is carried out in a standard commercial integrated circuit device, such as the TRW part number MPY-8-HJ, which accepts two 8-bit numbers and outputs their product.
Multipliers 252, 254 thus produce the product of s and |x − u| at each clock pulse (the value of s being properly timed by the extra data registers 256, 258). The outputs of multipliers 252, 254 are buffered in registers 224, 226 and are output to the remaining circuit apparatus over lines 260, 262 and through adder 206.
The same special purpose hardware module 200 is also employed for computing the inner product of two vectors, as required in matrix multiplication. This is accomplished by gating circuits 264, 266 which permit bypassing, in the subtraction and absolute value circuit, components 228, 230. In this mode of operation, the data "x" and "s" input buses are applied directly to the pipeline registers 220, 222, as the multiplier inputs.
Word level pattern alignment

A dynamic programming method (at 101) is preferably employed to optimize the correspondence between unknown input speech and each vocabulary word template. Each word template consists not only of the sequence of reference pattern statistics referred to above, but also a minimum and maximum dwell time associated with each reference pattern. According to the dynamic programming approach, a set of storage registers is provided for each vocabulary word. The number of registers is equal to the sum of the maximum dwell times of the reference patterns making up that word; i.e., it is proportional to the longest permissible word duration. These registers correspond to the circles in Figure 4, one register for each circle.
For every frame of input speech, all the registers are read and written. Each register will contain, as described in detail below, the accumulated likelihood score corresponding to the hypothesis that the indicated vocabulary word is being spoken and that the current position in the word corresponds to the particular reference pattern and dwell time associated with that register. All the registers are initialized to contain poor likelihood scores, to indicate that initially none of the represented hypotheses is acceptably likely.
The rules for updating the registers are as follows. The first register of each word template (i.e., the register corresponding to the hypothesis that the word has just begun to be uttered) contains the sum of a) the likelihood score of the present frame relative to the first reference pattern of the word and b) the best score of all last registers of all vocabulary words (i.e., the accumulated likelihood score for the hypothesis that some word was completed on the previous frame).
The second register of a word template contains the sum of a) the likelihood score of the present frame relative to the first reference pattern of the word and b) the contents of the first register from the previous frame. Thus the second register contains the score of the hypothesis that the indicated word is being uttered and that it began on the previous frame.
During the process of updating those registers corresponding to dwell times between the minimum and maximum duration (the optional dwell interval), a separate memory register is employed to store the best accumulated likelihood score (register contents) in the registers corresponding to the optional dwell time interval for each successive "present frame". This best score, found at the previous frame time, is used to calculate the next contents of the first register corresponding to the required dwell time interval of a next target pattern or template for the word. Thus, the present contents of the first register of the next reference pattern is generated by adding that best score (of the previous target pattern) to the likelihood score of the present input frame relative to the said next reference or target pattern.
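The register update rules above can be sketched for a single word template as follows. This is a simplified illustration under stated assumptions (the names, data layout, and the use of a large constant for the "poor likelihood" initialization are ours), not the Vector Processor implementation:

```python
INF = 1e9   # stands in for the "poor likelihood score" initialization

def update_word(registers, pattern_scores, dwell, best_word_end):
    """One input frame of register updates for one word template.
    registers[p][r]: previous-frame accumulated score for pattern p at
    dwell position r; pattern_scores[p]: this frame's likelihood score
    against pattern p (lower is better); dwell[p] = (required, optional)
    dwell counts; best_word_end: best score of any word ending on the
    previous frame (the string-level bias described earlier)."""
    new = []
    entry = best_word_end                       # feeds the word's first register
    for p, (req, opt) in enumerate(dwell):
        old, s = registers[p], pattern_scores[p]
        row = [entry + s]                       # hypothesis: pattern p just began
        for r in range(1, req + opt):
            row.append(old[r - 1] + s)          # pattern p continued one more frame
        # a transition into the next pattern may leave from the last
        # required register or any optional register (previous-frame scores)
        entry = min(old[req - 1:req + opt])
        new.append(row)
    word_end = min(new[-1][dwell[-1][0] - 1:])  # word may end anywhere in its exit region
    return new, word_end
```

Only the best score in each optional interval is carried forward, reflecting the dynamic programming observation below that dominated dwell-time hypotheses can never recover.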
In Figure 4, the multiple arrows leading in to the first register 128 of the required dwell interval of a reference pattern are meant to indicate that the transition from the optional register or state to the required dwell time register or state can occur at any time during the optional dwell time interval or from the last register of the required dwell time interval. Thus, on the basis of current information, the best fitting correspondence between word template and the input patterns is the one which hypothesizes that, when the next pattern is just beginning, the previous pattern has had a duration corresponding to the register containing the best score in the preceding optional dwell interval (plus the last register of the previous required time interval, register 300 in the illustrated embodiment).
According to the theory of dynamic programming, it is not necessary to save previously accumulated scores corresponding to all possible dwell times since, according to the theory, any dwell time transition which produced a worse score will continue to produce worse scores at all future stages of processing.
Analysis proceeds in the manner described using all registers of all reference patterns of all word templates. The last register(s) of the last pattern of each word template contains the score of the hypothesis that that word has just ended.
During the accumulation of likelihood scores, a sequence of duration counts is kept for determining the duration of the best word ending at each frame time.
The count is initiated at "one" at the first register of the first template pattern of the word. For each second and succeeding register of a template pattern, the count associated with the previous register is incremented by "one". However, for each register corresponding to the beginning of a reference pattern (other than the first reference pattern of a word), that is, for example, the first register 128 of the required dwell time interval, it is the count of the optional dwell time register (or last required dwell time register) of the previous reference pattern, having the best likelihood score in the previous frame time, that is incremented to form the duration count for the register.
In order to provide a mechanism for "tracing back", as described in more detail below, for each frame time the identification of the best scoring word ending at that time, and its duration, are transferred to a circulating buffer memory. When a sequence of words ends, the stored word durations permit tracing backward, from the end of the last "best" word, via its duration, to the best preceding word ending just prior to the "last word", etc., until all words of the word string have been identified.
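The traceback over the circulating buffer can be sketched as follows. The mapping-style buffer and the function name are assumptions for this illustration; the apparatus stores the same two fields (best word identity and its duration) per frame time.

```python
def trace_back(buffer, end_frame):
    """Recover the uttered word string from the circulating buffer.

    buffer maps a frame time to (best_word, duration) for the best
    scoring word ending at that frame.  Starting from the end of the
    last word, each stored duration points back to the frame at which
    the best preceding word ended.
    """
    words = []
    t = end_frame
    while t >= 0:
        word, duration = buffer[t]
        words.append(word)
        t -= duration          # frame just before this word began
    words.reverse()            # traceback yields the string back-to-front
    return words
```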
Strings of continuously uttered vocabulary words are bounded by silence. One of the word templates therefore corresponds to silence, or background noise.
Whenever the silence word has the best likelihood score, it is presumed that a sequence of words has just ended.
A flag register is tested to see if any word other than silence has had the best score since the last initialization of the recognition process. If at least one word other than "silence" has had a "best score" (at 103), the word string in the circulating buffer is traced backwards (at 105) and the resulting recognized message is transmitted to a display or other controlled equipment. Then the circulating buffer is cleared to prevent repeated transmission of the message, and the flag register is cleared. The apparatus is thus initialized to recognize the next "word string" (at 107).
Training of reference patterns

To obtain sample means, u, and variances, s, for construction of reference patterns, a number of utterances of each vocabulary word are entered into the speech recognition system and the ensemble statistics of corresponding preprocessed spectrum frames are evaluated. Crucial to successful operation of the equipment is the choice of which input spectrum frames should correspond to which target or reference patterns.
In the absence of better information, such as manually chosen significant acoustical phonemes for the input word, the time interval between the beginning and end of a spoken word is divided into a number of uniformly spaced subintervals. Each of these subintervals is forced to correspond to a unique reference pattern. One or more three-frame patterns beginning in each interval are formed and classified according to the reference pattern associated with that interval. Subsequent examples of the same vocabulary word are similarly divided into a like number of uniformly spaced intervals. The mean values and variances of the elements of the three-frame patterns extracted from correspondingly ordered intervals are accumulated over all available examples of the vocabulary word to form the set of reference patterns for that word. The number of intervals (number of reference patterns) should be on the order of two or three per linguistic phoneme contained in the vocabulary word.
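The uniform-subinterval statistics accumulation can be sketched as below. The function name and the list-of-lists data layout are illustrative assumptions; each utterance is represented as a sequence of (three-frame) pattern vectors.

```python
def train_word_patterns(examples, n_patterns):
    """Accumulate mean/variance reference patterns for one vocabulary word.

    examples:   list of utterances; each utterance is a list of pattern
                vectors (e.g. three-frame spectral patterns)
    n_patterns: number of reference patterns, roughly two to three per
                linguistic phoneme in the word
    Each utterance is cut into n_patterns uniformly spaced subintervals,
    and patterns from correspondingly ordered subintervals are pooled
    across all utterances before the statistics are taken.
    """
    pools = [[] for _ in range(n_patterns)]
    for utt in examples:
        n = len(utt)
        for k in range(n_patterns):
            lo = k * n // n_patterns          # uniform interval boundaries
            hi = (k + 1) * n // n_patterns
            pools[k].extend(utt[lo:hi])
    dim = len(examples[0][0])
    means, variances = [], []
    for pool in pools:
        m = [sum(v[d] for v in pool) / len(pool) for d in range(dim)]
        s = [sum((v[d] - m[d]) ** 2 for v in pool) / len(pool)
             for d in range(dim)]
        means.append(m)
        variances.append(s)
    return means, variances
```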
For best results, the start and end of each vocabulary word are marked through a procedure involving manual examination of the recorded audio waveform and spectrum frames. To implement this procedure automatically, it is necessary to have words spoken one at a time, bounded by silence, in order for the apparatus to find word boundaries accurately. The reference patterns may be initialized from one such sample of each word spoken in isolation, all variances being set to a convenient constant in the reference patterns. Thereafter the training material may comprise utterances typical of those to be recognized, with word and segment boundaries as found by the recognition process.
After statistics from a suitable number of training utterances have been accumulated, the reference patterns so found replace the initial reference patterns. A second pass through the training material is then made. This time the words are divided into intervals on the basis of the decisions made by the recognition processor as in Figure 3. Every three-frame input pattern (or one typical input pattern for each reference pattern) is associated with some reference pattern by the previously described pattern alignment method. Mean values and variances are accumulated a second time to form the final set of reference patterns, derived in a manner wholly compatible with the method in which they are to be used by the recognition apparatus.
During each of the training passes it is preferable to ignore any training phrase which is not correctly recognized by the recognition processor, since a misrecognized utterance is likely to have poorly placed interval boundaries. On completion of the training pass, the previously misrecognized phrases can be attempted again with the new reference patterns, and the reference patterns can be further updated if recognition is then successful.
An alternative to ignoring the misrecognized phrases is to form a multiple-word template for each training utterance. This template is simply a concatenation of the templates for each of the words in the utterance in the correct order. The talker is prompted by a script to speak the indicated word sequence, and the recognition processor references only the multiple-word template and the silence template. The word boundaries and reference pattern classification will then be optimal for the given script and available reference patterns. A disadvantage of this procedure is that a larger number of passes through the training script may be required.
For highest possible recognition accuracy it is preferable to begin the training procedure with a set of previously determined talker-independent reference patterns for the vocabulary to be recognized.
The talker-independent patterns are obtained from phrases typical of those to be recognized, spoken by at least several different talkers. The word boundaries may be determined by manual examination of recorded audio waveforms. Then the two-step procedure just described is employed to develop the talker-independent patterns: in the first pass, subintervals are uniformly spaced within each word; in the second pass, subintervals are as determined by the recognition process using the first-pass reference patterns.
Ensemble statistics over all talkers are derived in each pass. To train the system to a particular talker, the talker-independent patterns are employed as if they were the product of the first training pass; and only the procedure of the second pass is performed (possibly twice).
The minimum (required) and maximum (required plus optional) dwell times are preferably determined during the training process. According to the preferred embodiment of the invention, the apparatus is trained as described above, using several speakers. Further, during the training procedure the recognition process automatically determines pattern boundaries in accordance with the process described above. These boundaries are recorded, and the dwell times for each of the apparatus-identified keywords are stored.
At the end of a training run, the dwell times for each pattern are examined and the minimum and maximum dwell times for the pattern are chosen.
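One way to derive these limits from the recorded training dwell times is a percentile cut over the observed durations. The nearest-rank percentile method and the function name below are illustrative assumptions; the preferred embodiment sets the cuts at the twenty-fifth and seventy-fifth percentiles.

```python
def dwell_limits(dwell_times, lo_pct=25, hi_pct=75):
    """Choose (minimum, maximum) dwell times for one reference pattern
    from the dwell times observed over a training run, taken at the
    requested percentiles of the observed distribution."""
    d = sorted(dwell_times)

    def pct(p):
        # nearest-rank percentile of the sorted observations
        idx = max(0, min(len(d) - 1, round(p / 100 * len(d)) - 1))
        return d[idx]

    return pct(lo_pct), pct(hi_pct)
```

Widening the limits (a lower minimum, a larger maximum) trades a higher false alarm rate for higher recognition accuracy, as the text notes.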
According to a preferred embodiment of the invention, a histogram of the dwell times is generated, and the minimum and maximum dwell times are set at the twenty-fifth and seventy-fifth percentiles. This provides a high recognition accuracy while maintaining a low false alarm rate. Alternatively, other choices of minimum and maximum dwell times can be made, there being a trade-off between recognition accuracy and false alarm rate.
Thus, if a low minimum dwell time and a large maximum dwell time are chosen, a higher recognition accuracy will generally result, at the cost of a correspondingly high false alarm rate.

Syntax processor

Concatenation of two or more specific word templates is a trivial example of syntax control in the decision process. Referring to Fig. 9, a syntax circuit arrangement 308 to detect word sequences containing an odd number (1, 3, 5, 7, ...) of words has two independent sets of pattern alignment registers 310, 312, maintained for each vocabulary word. The entering score for the first template is the score for silence or the best score from the set of second templates, whichever is better. The entering score for the second template is the best score from the first set of templates. This score also feeds a second silence detector template at node 313. On detection of silence at the end of the utterance, as measured by the detector template at node 313, the labels and durations of the words uttered may be traced back alternately from the traceback buffers of the first and second set of templates. Importantly, the position of the silence detector template ensures that only silence after a word sequence having an odd number of words can be detected.
Somewhat more complex syntax networks may be implemented by associating with each syntax node, such as nodes 313a and 313b of Fig. 9, a list of acceptable word string lengths. For example, in the syntax network of Fig. 9, which accepts any string containing an odd number of words, the string length may be fixed at a particular odd number, say 5, by examining the string length at the input to the second silence register 313a. If the length of the string at that point is not 5, the register becomes inactive (for the present analysis interval), and no string score can be reported from that register; but if the string length is 5, a string detection can be reported. Similarly, the first vocabulary register 310 can be enabled only if the incoming string length is 0, 2, or 4 and the second register only if the incoming string length is 1 or 3. Although the optimal results for a five-word string would require five complete sets of dynamic programming accumulators, this method permits a lesser number of accumulators to perform multiple duty with only a slight reduction in typical recognition accuracy.
The Realized System Using the Speech Recognition Method

As indicated previously, a presently preferred embodiment of the invention was constructed in which the signal and data manipulation, beyond that performed by the preprocessor of Figure 2, was implemented on and controlled by a Digital Equipment Corporation PDP-11 computer working in combination with a special purpose Vector Computer Processor such as that described in copending United States Patent No. 4,228,498.
In addition to the use of a computer programming implementation of the inventive method, a hardware implementation of the inventive method can be employed.
In operation, the apparatus of Figure 10 operates in accordance with the dynamic programming technique. Each new likelihood score sequence (that is, the sequence of likelihood scores relative to each reference pattern in a known predetermined order) from the computer over lines 320 is added to existing scores in one of memories 322 and 324. These memories alternate functions as described below, under the control of (a) the syntax processor 308, which receives the scores corresponding to the end of each possible word, (b) a minimum score register 326, which can replace the output of memories 322 and 324 depending upon the memory select and next phoneme signals, and (c) the other control and clock signals.
In operation, the circuit follows the rules for updating the registers corresponding to each of the "circles" of Figure 4 to provide, at each rest or silence recognition, a decision mechanism by which the best "match" can be achieved.
Memories 322 and 324 have the same configuration and are interchanged every ten milliseconds, that is, every time a new frame is analyzed. The memories each contain a plurality of thirty-two bit words, the number of thirty-two bit words corresponding to the total number of registers (or circles in Figure 4) associated with the words of the machine vocabulary. Initially, one memory, for example memory 322, is filled with "bad" likelihood scores, that is, scores which in the present example have a large value. Thereafter, the memory 322 is read sequentially, in a predetermined sequence corresponding to the sequence of new likelihood scores from the Vector Processor over line 320, and the scores are then updated as described below and rewritten into the other memory, memory 324. In the next ten millisecond frame, the now old scores from memory 324 are read and new scores are written into the now other memory 322. This alternating function or relationship continues under the control of the syntax processor, the minimum score register 326, and other control and clock signals. As noted above, each word of memories 322 and 324 is a 32-bit number.
The lower 16 bits, bits 0-15, are employed to store the accumulated likelihood scores. In addition, bits 16-23 are employed for recording the phoneme duration, and bits 24-31 are employed for storing the word durations at that register.
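The bit layout of each 32-bit memory word can be illustrated directly; the function names are assumptions for this sketch, but the field positions are as stated above.

```python
def pack(score, phoneme_dur, word_dur):
    """Pack one 32-bit memory word: bits 0-15 hold the accumulated
    likelihood score, bits 16-23 the phoneme duration, and bits 24-31
    the word duration."""
    assert 0 <= score < 1 << 16
    assert 0 <= phoneme_dur < 1 << 8 and 0 <= word_dur < 1 << 8
    return score | (phoneme_dur << 16) | (word_dur << 24)

def unpack(word):
    """Inverse of pack(): (score, phoneme duration, word duration)."""
    return word & 0xFFFF, (word >> 16) & 0xFF, (word >> 24) & 0xFF
```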
The incoming likelihood scores from the computer are stored, for each frame time, in a pattern score memory 328. This information is provided in a "burst" from the computer, at a very high data transfer rate, and is read out of the pattern score memory at a slower rate employed by the circuitry of Figure 10.
Thus, absent any interceding control from the syntax processor or the minimum score register, the output of the selected memory 322 or 324, through the corresponding selected gate 330 or 332, is applied to lines 334. The lines 334 are connected to adders 336, 338, 340 for updating the likelihood score, the phoneme or target pattern duration count, and the word duration count, respectively. Thus, the likelihood score corresponding to the "previous frame" score coming from one of memories 322, 324 is output from the pattern score memory over lines 342, added to the old likelihood score, and is then stored in the memory not being used for writing. The memory select function is provided by the signal level on lines 344. Simultaneously, the word and phoneme duration counts are incremented by "one".
In this manner, the word duration count, the phoneme duration count, and the likelihood scores are normally updated.
The two exceptions to the usual updating rule recited above correspond to the beginning of a new phoneme and the beginning of a new word. At the beginning of a new phoneme which is not the beginning of a new word, the first register of the phoneme is not updated in accordance with the usual rule; instead, the likelihood score over line 342 is added to the minimum score from the previous reference frame or phoneme optional dwell time registers, or the last register of the previous phoneme required dwell time.
This is implemented by employing the minimum score register 326. The output of the minimum score register represents the minimum score in the previous frame time for the earlier phoneme. This score is attained by continuously updating the contents of the minimum score register whenever a new "minimum score" is provided.
The new minimum score is loaded into the minimum score register by employing the sign bit output of a subtraction arithmetic element 346. Element 346 effectively compares the present minimum score with the new minimum score from the just updated register. The minimum score register further stores the word duration count and phoneme duration count corresponding to the register having the minimum score. All of this information is output onto lines 334 at the start of a new phoneme. This output process is controlled using the gating element 348, enabled at the start of a new phoneme, in combination with control signals to gates 332 and 330, which disable those gates from operation during the start of a new phoneme.
The syntax processor 308 is employed for updating the first register of the first phoneme of a new word with the best score, taking into account the syntax, of a word ending in the previous frame. Thus, when the score of a register corresponding to the first register of the first phoneme of a new word is to be updated by an incoming likelihood score, it is not the output of one of memories 322, 324 which is employed.
Instead, it is the best likelihood score, preferably taking into account syntax, for the words ending in the previous frame. This function is enabled by disabling gates 330 and 332, and simultaneously enabling a gate 350 for placing the best available score, stored in a register 352, onto lines 334, for addition with the incoming pattern likelihood score over lines 342.
In this manner, therefore, each register corresponding to a dwell time of a reference frame is continuously updated in this hardware embodiment. When the likelihood scores represent the silence word, the syntax processor is designed to provide the necessary control systems for enabling a hardware or computer apparatus to trace backwards to determine the recognized words.
In view of the foregoing, it may be seen that several objects of the present invention are achieved and other advantageous results have been obtained.
It will be appreciated that the word string continuous speech recognition method and apparatus described herein include isolated speech recognition as a special application. Additions, subtractions, deletions, and other modifications of the described preferred embodiments will be obvious to those skilled in the art, and are within the scope of the following claims.
According to the theory of dynamic programming it i9 not necessary to save previously accumulated scores corresponding to all possible dwell times, since, according to the theory any dwell time transition which produced a worse score will continue to produce worse scores at all future stages of processing.
Analysis proceeds in the manner described using all registers of all reference patterns of all word templates. The last register(s) of the last pattern of each word templa~e contains the score of the hypothesis that that word has just ended~
During the accumulation ~f likelihood scores9 a se~uence oflduration counts is kept ~or determining the duration of the best word ending at each frame time~
The count is initiated at "one" at the first register of the first template pattern of the word~ For each second and succeeding register, of a template pattern, the ~9~3~
-~5-1 count associated with the previous register is incremented by "one". However~ for each register corresponding to the beginning of a reference pattern (other than the first reference pattern of a word), that is, for example, the irst register 128 of the required dwell time interval, it i5 the count of optional d~ell time register (or last required dwell time register) of the previous reference pattern, h~ving the best likelihood score in the previous frame timel that is incremented to for~ the duration count for the register.
In order to provide a mechanism ~or "tracing back" as described in more detail below, for each frame timee the identiication of the best scoring word ending at that time, and its duration, are transferred to a circulating buffer memory. When a sequence of words ends, the stored word durations permit tracing backward, from the end of the last "best" word, via its duration, to the best preceading word ending just prior to the "last word", e-tc.~ until all words of the word string ~0 have been identified.
Strings of continuously uttered vocabulary words are bounded by silence. One of the word templates therefore coxresponds to silence, or background noise.
Whenever the silence word has the best likelihood score, it is presumed that a sequence of words has just ended~
A flag register is tested to see if any word other than si7ence has had the best score since the last initialization o~ the recognition process~ If at least one word other than "silence" has had a "best score" ~at 1033, the word string in the circulating buffer is traced backw~rds (at 105) and the resulting recognized message is ~ransmi~ed to a display or other controlled ( :~9~'7~
~46-1 equipment. Then the circulating buffer is cleared to prevent repeated transmission of the message, and the flag register is cleared. The apparatus is thus initialized to recognize the next "word strin~" (at 107).
Trainin~ of referenc~_~atterns To obtain sample means, u~ and variances, s~, for construction of reference patterns, a number of u~erances of each vocabulary word are en~ered into the speech reco~nition system and the ensemble statistics of corresponding preprocessed spectrum ~rames are evaluated. Crucial to successful operation of the equipment is the choice of whi~h input spectrum frames should correspond to which target or reference patterns.
In the absence of be~ter information such as manually chosen significant acoustical phonemes for the input word, the ~ime interval between the beginning and end of a spoken word is divided into a number o uniformly spaced subintervals. E~ch of these subintervals is forced to correspond to a unique reerence pattern. One or mor~ three-frame patterns beginning in each interval are formed and classified according to the reference pattern associated with that interval. Subsequent examples of the same vocabulary word are similarly divided into a like number of uniformly spaced intervals. The mean values and variances of the elements of the three~frame patterns extrac~ed from correspondingly order~d intervals are a~cumulated over all available examples of the vocabulary word ~o form the set of referenc0 patterns ~or that wo~d~ The number of in~ervals (number of reerence patterns) should be in the order of two or 73~
1 three per linguistic phoneme contained in the vocabulary word.
For best results, the start and end of each vocabulary word are marked through a procedure involving manual examination of ~he recorded audio waveform and spectrum frames. To implement this procedure automatically~ it is necessary to have words spoken one at a time, bounded by silence, in order for the apparatus to find word boundaries accurately. The reference patterns may be initialized ~rom one such sample o~ each word spoken in isolation, all variances being set to a convenient constant in the refarence patterns. Thereafter the training material may comprise utterances typical of those to be recognized, with word and segment boundaries as found by the recognition process.
After s~atistics from a suitable number of traininy utterances have been accumulated, the reference patterns so found replace the initîal re~erence patterns. A second pass through the training material is then made. This time the words are divided into intervals on the basis of the decisions made by the recognition processor as in Figure 3. Every three-frame input pattern ~or one ty~ical input pattern for each re~erence pattern) is associated with some reference pattern by the previously described pattern alignment method~ Meanlvalues and variances are accumulated a second time to ~orm the final set of ~eference patterns derived in a manner wholly compatible with the method in which they are to be used by the recogni~ion apparatus~
During each o~ the trai~ing passes it is preferable to ignore any training phrase which is not 3~
-~8-1 correctly recogniæed by the recognition processor, since a misreco~nized utterance is likely to have poorly placed interval boundaries~ On completion of the traininy pass, the previously misrecGgnized phrases can be attempted again with the new reference patterns, and the reference patterns can be further updated if recoynition is ~hen successful.
An alterna~ive to ignoring the misrecognized phrases is to orm a multiple-word template for each trainin~ utterance. This template is simply a concatenation of the teMplates for each of the words in the utterance in the correct order. The talker is prompted by a script to speak the indicated word sequence, and the recognition p~ocessor references only the multipl~ template and the silence template. The word boundaries and xeference pattern classification will then be optimal for the given script and available reference patterns. A disadvantage of this procedure is that a lar~er number of passes through the training script may be required.
For highest possible recogn;tion accuracy it is ~referrable to begin the training procedure with a set of previously determined talker-independent reference patterns for the vocabulary to be recognized~
The talker-independent patterns are ob~ained from phrases typical of those to be recogni~ed~ spoken by at least several different talkers. The word boundaries may bs determined by manual examination of recorded audio waveforms. Then the two step procedure just described is employed to develop the talker-independent patterns: in ~he first pass, ~ubintervals are uniformly sp~ced within each word; in the second pass~
( 3Cli 1 subintervals are as determined by the recognition process using the first-pass reference patterns.
Ensemble statistics over all talkers are derived in each pass. To train the system to a particular talker, the talker-independent patterns are employed as if they were the product of the first training pass; and only the procedure of the second pass is performed (possibly twice).
The minimum (required) and maximum (required plus optional) dwell times are preferably determined during the training process. According to the preferred embodiment of ~he invention,~ the apparatus is trained as described above, usiny several speakers. Fùrtherj as described above, the recognition process automatically determines, during the training procedure, pattern boundaries in accordance with the process described above. Thus boundaries are recorded and the dwell times for each of the apparatus identified keywords are stored.
At the end of a training runr the dwell times for each pattern are examined and the minimum and maximum dwell times for ~he pattern are chosen.
According to a preferred embodiment of the invention, a histogram of the dwell time is generated and the minimum and maximum dwell times arP set at the twenty~fifth and seventy-fi~h percentiles~ This provides a high reco~nition accuracy while maintaining a low false alarm rate. Alternately, other choices of minimum and maximum dwell times can be chosen, there being a trade of between recogni~ion accuracy and false alarm rate.
Thus, if a low minimum dwell time and large maximum dwell time are chosenr a higher recognition accuracy ( ~9~73(~
1 will yenerally result at the cost of a correspondingly high ~alse alarm rate, Syntax processor Concatenation of two or more specific word templates is a trivial example of syntax control in the decision process. Referring to ~ig. 9, a syntax circuit arrangement 308 to detect word sequences containing an odd number (1,3,5,7,..,) of words has two independent sets of pat~ern alignment registers 310, 312, maintained for each vocabulary word, The entering score for the f ir5t template is the score for silence or the best score from the set of second templates, whichever is better. The entering score for the second template is the best score from the first set of templates. This score also feeds a second silence detector template at nodP 313. On detection of silence at the end of the utterance, as measured by the detector template at node 313, the labels and durations of the words uttered may be traced back alternately from the traceback buffer~ oE
the first and second set of templates. I~portantly, the position of the silence detector template ensures that only silence after a word sequence having an odd number of words can be detected.
Somewhat more complex syntax networks may be implemented by associating with each syntax node such as nodes 313a and 313b of Fig. 9, a list oE acceptable word string lengths. For example, in the ~yn~ax network of Fig, 9 which accepts any string containing an odd nu~ber o~ words, the string length may be ~ixed at a 30 particular odd number, say 5, by examining the string length at the input to the second silence register 313a~
If the length of the string at that point is not 5, the \~
( '7~V
1 register becomes inactive ~for the present analysis interval), and no string score can be reported from that register; but if the string length is 5, a string detection can be reported. Similarly the first vocabulary register 310 can be enabled if the incoming string length is 0, 2~ or 4 and the second register only if the incoming string l~ngth is 1 or 3. Although the optimal results for a five-word string would require five eomplete sets of dynamic programming accumulators, this method permits a lesser number of accumulators to perform multiple duty with only a sli~ht reduction in typical recognition accuracy.
The Realized System [Jsin~ the Speech Recognition Method As indicated previously, a presently preferred embodimen~ of the invention was constructed in which the signal and data manipulation, heyond that performed by the preprocessor of Figure 2, was implemented on and controlled by a Di~ital Equipment Corporation PDP-11 computer working in combination with the special purpose Vector Computer Processor such as that described in copending United States Patent No. 4,228,498.
In addition to the use of a computer programming implementation of the inventive method 7 a hardware implementation of the inventive method can be emplo~ed.
In operation, the apparatus of ~igure 10 operates in a~cordance with the dynamic programming technique Each new likelihood score sequence ~hat is, the sequence of likelihood scores relative to each re~erence pa~ern in a known predetermined order, from the computer over lines 320 is added ~o exis~ing scores ( 73~3 1 in one of memories 322 and 32~. These memories alternate functiorls as described below, under the control of (a) the syntax processor 308 which receives the scores corresponding to the end of each possible word, (~) a minimum score register 326 which can replace the output of mem~ries 322 and 324 dependin~ upon the memory select and next phoneme signals, and (c) the other control and clock signals.
In operation, the circuit follows the rules for updating the regis~ers corresponding to each of the "circles" of ~igure 4 to provide at each rest or silence recognition a decision mechanism by which the best Umatch'' can be achieved.
Memories 322 and 324 have the same configuration and are interchanged every ten milliseconds, that is, every time a new frame is analyzed. The memories each contain a plurality of thirty-two bit words, the number of thirty-two bit words corresponding to the total number of registers (or circles in Figure 4) associated with the words of the machine vocabulary. Initially, one memory, for example memory 322, is filled with "bad" likelihood scores, that is, scores which in the present example have a large value. Thereafter, the memory 322 is read sequentially, in a predetermined sequence corresponding to the sequence of new likelihood scores from the Vector Processor over line 320, and the scores are then updated as described below and rewritten into the other memory, memory 324. In the next ten millisecond frame, the now old scores from memory 324 are read and new scores are written into the now other memory 322. This alternating function or relationship continues under the control of the syntax processor, the minimum score register 326, and other control and clock signals. As noted above, each word of memories 322 and 324 is a 32 bit number.
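The ping-pong use of the two memories is what is now commonly called double buffering: each ten-millisecond frame, one memory is read while the updated scores are written into the other, and the roles are then swapped. A minimal software sketch of this pattern (all identifiers are illustrative, not from the patent) might look like:

```python
# Double-buffering sketch: each 10 ms frame, read the "old" memory,
# write updated scores into the other, then swap roles.

BAD_SCORE = 0xFFFF  # large "bad" score used to initialize a memory


def run_frames(num_registers, frames_of_scores, update):
    """Apply `update(old_score, new_score)` to every register for each
    frame, alternating which memory is read and which is written."""
    mem_a = [BAD_SCORE] * num_registers
    mem_b = [BAD_SCORE] * num_registers
    read_mem, write_mem = mem_a, mem_b
    for scores in frames_of_scores:          # one entry per 10 ms frame
        for i in range(num_registers):
            write_mem[i] = update(read_mem[i], scores[i])
        read_mem, write_mem = write_mem, read_mem  # swap every frame
    return read_mem                          # last-written memory
```

A caller supplies the per-register update rule, for example accumulating a score once the initial "bad" value has been replaced.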
The lower 16 bits, bits 0-15, are employed to store the accumulated likelihood scores. In addition, bits 16-23 are employed for recording the phoneme duration and bits 24-31 are employed for storing the word durations at that register.
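The bit layout described above (score in bits 0-15, phoneme duration in bits 16-23, word duration in bits 24-31) can be modeled directly in software. The following sketch uses the field boundaries from the text; the function names themselves are illustrative only:

```python
# Sketch of the 32-bit register word layout described above:
#   bits 0-15  : accumulated likelihood score
#   bits 16-23 : phoneme (target pattern) duration count
#   bits 24-31 : word duration count

def pack_register(score, phoneme_dur, word_dur):
    """Pack the three fields into one 32-bit register word."""
    assert 0 <= score < 1 << 16
    assert 0 <= phoneme_dur < 1 << 8
    assert 0 <= word_dur < 1 << 8
    return score | (phoneme_dur << 16) | (word_dur << 24)


def unpack_register(word):
    """Recover (score, phoneme_dur, word_dur) from a packed word."""
    return word & 0xFFFF, (word >> 16) & 0xFF, (word >> 24) & 0xFF
```

Packing and unpacking are inverses, so a register word round-trips without loss as long as each field fits its width.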
The incoming likelihood scores from the computer are stored, for each frame time, in a pattern score memory 328. This information is provided in a "burst" from the computer, at a very high data transfer rate, and is read out of the pattern score memory at a slower rate employed by the circuitry of Figure 10.
Thus, absent any interceding control from the syntax processor or the minimum score register, the output of the selected memory 322 or 324, through the corresponding selected gate 330 or 332, is applied to lines 334. The lines 334 are connected to adders 336, 338, 340 for updating the likelihood score, the phoneme or target pattern duration count, and the word duration count respectively. Thus, the new likelihood score is output from the pattern score memory over lines 342, added to the "previous frame" score coming from one of memories 322, 324, and is then stored in the memory not being used for writing. The memory select function is provided by the signal level on lines 344. Simultaneously, the word and phoneme duration counts are incremented by "one".
In this manner, the word duration count, the phoneme duration count, and the likelihood scores are normally updated.
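In software terms, the normal updating rule amounts to one pass over the registers per frame: add each register's new likelihood score to its accumulated score from the previous frame, and increment both duration counts. A minimal sketch, assuming registers are held as (score, phoneme_dur, word_dur) tuples rather than the packed hardware words (all names illustrative):

```python
# Normal updating rule for one frame: for each register, add the new
# likelihood score for its target pattern to the previous frame's
# accumulated score, and increment both duration counts by one.

BAD_SCORE = 0xFFFF  # large "bad" score; scores are clamped at this value


def update_frame(old_registers, pattern_scores):
    """old_registers: list of (score, phoneme_dur, word_dur) tuples read
    from the memory holding the previous frame's values; pattern_scores:
    the new likelihood score for each register's target pattern.
    Returns the new contents to be written into the other memory."""
    new_registers = []
    for (score, p_dur, w_dur), like in zip(old_registers, pattern_scores):
        new_registers.append(
            (min(score + like, BAD_SCORE), p_dur + 1, w_dur + 1)
        )
    return new_registers
```

Registers still holding the initial "bad" value stay pinned there by the clamp, so paths that have never been entered cannot win.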
The two exceptions to the usual updating rule recited above correspond to the beginning of a new phoneme and the beginning of a new word. At the beginning of a new phoneme which is not the beginning of a new word, the first register of the phoneme is not updated in accordance with the usual rule; instead, the likelihood score over line 342 is added to the minimum score from the previous reference frame or phoneme optional dwell time registers, or the last register of the previous phoneme required dwell time.
This is implemented by employing the minimum score register 326. The output of the minimum score register represents the minimum score in the previous frame time for the earlier phoneme. This score is attained by continuously updating the contents of the minimum score register whenever a new "minimum score" is provided.
The new minimum score is loaded into the minimum score register by employing the sign bit output of a subtraction arithmetic element 346. Element 346 effectively compares the present minimum score with the new minimum score from the just updated register. The minimum score register further stores the word duration count and phoneme duration count corresponding to the register having the minimum score. All of this information is output onto lines 334 at the start of a new phoneme. This output process is controlled using the gating element 348, enabled at the start of a new phoneme, in combination with control signals to gates 332 and 330 which disable those gates from operation during the start of a new phoneme.
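The role of the minimum score register can be modeled as a running minimum maintained alongside the register updates, carrying the duration counts of whichever register currently holds the minimum. In this sketch the hardware's sign-bit test on the subtractor output is replaced by an ordinary comparison, and the class and method names are illustrative:

```python
class MinScoreRegister:
    """Software analogue of minimum score register 326 together with
    subtraction element 346: tracks the minimum score offered so far,
    plus the duration counts of the minimizing register."""

    def __init__(self):
        self.score = float("inf")   # no score seen yet
        self.phoneme_dur = 0
        self.word_dur = 0

    def offer(self, score, phoneme_dur, word_dur):
        # Equivalent to testing the sign bit of (current_min - new_score):
        # load the register only when the new score is smaller.
        if score < self.score:
            self.score = score
            self.phoneme_dur = phoneme_dur
            self.word_dur = word_dur
```

Each just-updated register of the earlier phoneme is "offered" in turn, and at the start of the next phoneme the stored triple is what would be gated onto lines 334.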
The syntax processor 308 is employed for updating the first register of the first phoneme for a new word, with the best score, taking into account the syntax, of a word ending in the previous frame. Thus, when the score of a register corresponding to the first register of the first phoneme of a new word is to be updated by an incoming likelihood score, it is not the output of one of memories 322, 324 which is employed. Instead, it is the best likelihood score, preferably taking into account syntax, for the words ending in the previous frame. This function is enabled by disabling gates 330 and 332, and simultaneously enabling a gate 350 for placing the best available score, stored in a register 352, onto lines 334, for addition with the incoming pattern likelihood score over lines 342.
In this manner, therefore, each register corresponding to a dwell time of a reference frame is continuously updated in this hardware embodiment. When the likelihood scores represent the silence word, the syntax processor is designed to provide the necessary control signals for enabling a hardware or computer apparatus to track backwards to determine the recognized words.
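Putting the normal rule and its two exceptions together, one frame of updating for a single word template can be summarized as follows. This is a sketch under the assumption of a simple list-of-phonemes word model in which scores shift along each phoneme's dwell-register chain; none of the identifiers come from the patent:

```python
# One frame of dynamic-programming updating for one word template:
#   - first register of the first phoneme: seeded from the best
#     (syntax-weighted) score of any word ending in the previous frame;
#   - first register of each later phoneme: seeded from the previous
#     phoneme's minimum score;
#   - remaining registers: previous register's old score plus the new
#     likelihood (the normal rule, shifting down the dwell chain).

def update_word_registers(old, likes, best_word_end_score):
    """old: per-phoneme lists of accumulated scores from the last frame;
    likes: this frame's likelihood score for each phoneme's pattern;
    best_word_end_score: best score of a word ending in the previous
    frame. Returns the updated per-phoneme register lists."""
    new = []
    prev_min = best_word_end_score              # seed for the very first register
    for scores, like in zip(old, likes):
        updated = [prev_min + like]             # exception: first register
        updated += [s + like for s in scores[:-1]]  # normal rule: shift/add
        new.append(updated)
        prev_min = min(scores)                  # minimum for the next phoneme
    return new
```

With lower scores being better, the word's final registers then compete, via the syntax processor, to seed the first registers of whichever words may follow.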
In view of the foregoing, it may be seen that several objects of the present invention are achieved and other advantageous results have been obtained.
It will be appreciated that the word string continuous speech recognition method and apparatus described herein include isolated speech recognition as a special application. Additions, subtractions, deletions, and other modifications of the described preferred embodiments will be obvious to those skilled in the art, and are within the scope of the following claims.
Claims (2)
1. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, a method for modeling silence in the incoming audio signal comprising the steps of:
monitoring the amplitude of predetermined short time duration portions of the incoming audio signal over a selected time duration greater than approximately one second and selecting a noise frame by choosing a frame of minimum amplitude during said time period.
2. The method of claim 1 wherein each keyword is characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, the method further comprising the steps of forming at a repetitive frame time, a sequence of frame patterns from and representing said audio signal, examining said frames over said preselected time period and choosing one of said frames as representing background noise in the incoming audio signal.
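The silence-modeling step of the claims, selecting the minimum-amplitude frame over a window longer than approximately one second, can be sketched in software as follows. The ten-millisecond frame period is taken from the specification; the function and parameter names are assumptions:

```python
def choose_noise_frame(frame_amplitudes, frame_period_s=0.01, window_s=1.0):
    """Return the index of the minimum-amplitude frame within the most
    recent window of at least `window_s` seconds (about one hundred
    ten-millisecond frames), as in claims 1 and 2. That frame is taken
    to represent background noise in the incoming audio signal."""
    n = max(1, int(window_s / frame_period_s))
    window = frame_amplitudes[-n:]              # most recent ~1 s of frames
    i = min(range(len(window)), key=lambda k: window[k])
    return len(frame_amplitudes) - len(window) + i
```

The selected frame's spectrum would then serve as the "silence" template against which rest between words is recognized.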
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA000469856A CA1199730A (en) | 1981-10-05 | 1984-12-11 | Method and apparatus for continuous word string recognition |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US309,208 | 1981-10-05 | ||
US06/309,208 US4489435A (en) | 1981-10-05 | 1981-10-05 | Method and apparatus for continuous word string recognition |
CA000412843A CA1182222A (en) | 1981-10-05 | 1982-10-05 | Method and apparatus for continuous word string recognition |
CA000469856A CA1199730A (en) | 1981-10-05 | 1984-12-11 | Method and apparatus for continuous word string recognition |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000412843A Division CA1182222A (en) | 1981-10-05 | 1982-10-05 | Method and apparatus for continuous word string recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1199730A true CA1199730A (en) | 1986-01-21 |
Family
ID=25669828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000469856A Expired CA1199730A (en) | 1981-10-05 | 1984-12-11 | Method and apparatus for continuous word string recognition |
Country Status (1)
Country | Link |
---|---|
CA (1) | CA1199730A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112651402A (en) * | 2019-10-11 | 2021-04-13 | 中国电信股份有限公司 | Character recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US4489435A (en) | Method and apparatus for continuous word string recognition | |
CA1182223A (en) | Continuous speech recognition | |
US4489434A (en) | Speech recognition method and apparatus | |
CA1172363A (en) | Continuous speech recognition method | |
Dubnowski et al. | Real-time digital hardware pitch detector | |
AU685788B2 (en) | A method and apparatus for speaker recognition | |
US4038503A (en) | Speech recognition apparatus | |
US4227176A (en) | Continuous speech recognition method | |
AU682177B2 (en) | Speech processing | |
US4032711A (en) | Speaker recognition arrangement | |
US4677673A (en) | Continuous speech recognition apparatus | |
US5144672A (en) | Speech recognition apparatus including speaker-independent dictionary and speaker-dependent | |
EP0118484B1 (en) | Lpc word recognizer utilizing energy features | |
US5764853A (en) | Voice recognition device and method using a (GGM) Guaranteed Global minimum Mapping | |
CA1199730A (en) | Method and apparatus for continuous word string recognition | |
Landaur et al. | Teaching a Minimally Structured Back-Propagation Network to Recognise Speech Sounds | |
Essa et al. | A comparison of combined classifier architectures for Arabic speech recognition | |
CA1199731A (en) | Speech recognition method and apparatus | |
Ullah et al. | Development of a novel system for speaker verification | |
Zebulum et al. | A comparison of different spectral analysis models for speech recognition using neural networks | |
JPS59126599A (en) | Continuous word string recognition method and apparatus | |
CA1180813A (en) | Speech recognition apparatus | |
JPS59127099A (en) | Improvement in continuous voice recognition | |
Wang et al. | A phonetic labeling method for MAT database processing | |
JPS59126598A (en) | Voice recognition method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKEX | Expiry |