DE19710953A1

DE19710953A1 - Sound signal recognition method

Info

Publication number: DE19710953A1
Application number: DE1997110953
Authority: DE
Inventors: Dr. rer. nat. Frank Kowalewski
Original assignee: Individual
Current assignee: Individual
Priority date: 1997-03-17
Filing date: 1997-03-17
Publication date: 1997-07-24

Abstract

The components of the individual spectra are first weighted with frequency-dependent sensitivity factors (11). From the weighted spectra, masking spectra and spectra to be masked are calculated. Component-wise subtraction (20) of the masking spectra from the spectra to be masked removes information that human hearing does not perceive. The spectra to be masked are obtained by a transformation (12) onto static loudness, low-pass filtering (13), and a subsequent transformation (14) onto a loudness discrimination scale. The masking spectra are obtained by low-pass filtering (15), a time delay (16), the static loudness transformation (17), a linear transformation that smears along frequency (18), and a transformation (19) onto the loudness discrimination scale.

Description

The invention relates to a method and a device for recognizing sound signals, in particular speech signals.

In today's common speech recognizers, the sound signals to be recognized are first converted into a frequency-time representation (spectrogram). These spectrograms are compared with predefined spectrograms (isolated-word recognition), or parts of the spectrograms to be recognized are compared with predefined partial spectrograms (continuous-speech recognition). The spectrogram most similar to the spectrogram to be recognized, or the partial spectrograms most similar to the partial spectrograms to be recognized, are determined, and recognition signals are derived from them.

The spectrograms are determined by short-time FFT, filter-bank, LPC, or cepstrum analysis. LPC and cepstrum analysis correspond to the excitation/vocal-tract model of human voice production; they attempt to separate the excitation-dependent and vocal-tract-dependent properties of the sound signal. Forming power spectra by means of a short-time FFT or a filter bank can be regarded as a coarse model of human auditory perception. Post-processing of the spectrograms is sometimes used in an attempt to account for simple psychoacoustic effects. Besides the relatively widespread frequency transformation onto the Bark scale [US 4,956,865; Bridle, 1974; Mermelstein, 1976; Cohen, 1989; Gramß, 1989; Hermansky, 1990; Ruske, 1992], less frequently modelled are human static loudness perception [Hanson, 1984; Schotola, 1984; Cohen, 1989; Gramß, 1989; Hermansky, 1990; Ruske, 1992], hearing sensitivity [Cohen, 1989; Hermansky, 1990], or temporal masking phenomena [Cohen, 1989; Gramß, 1989; Paping, 1991; Gramß, 1992; Aikawa, 1993; Pavel, 1994]. In some cases the 3-Bark integration is used [Hermansky, 1990; Kowalewski, 1991], which emerged from psychoacoustic studies but is disputed as a suitable model.

The spectrograms are also post-processed according to non-psychoacoustic criteria, above all to achieve recognition that is more robust to disturbances [US 4,905,286; US 4,914,692; US 5,220,610; US 5,590,242; Porter, 1984]. Dynamic spectrogram features are often determined to make recognition independent of slow spectral changes [US 4,956,865; Elenius, 1982; Furui, 1986; Hanson, 1990; Hermansky, 1994]. Speaker independence is achieved either with a sufficiently large set of reference spectrograms containing utterance versions from many different speakers, or through speaker adaptation [Lee, 1991; Ahadi, 1995; Kamm, 1995].

Because the spectrogram computation methods used in speech recognition today either originate from speech coding or model human hearing only coarsely, the computed spectrograms generally contain information that humans do not perceive. Accordingly, the recognition performance achieved with them deviates strongly from that of humans. The additional imperceptible information makes machine recognizers more dependent on the utterance version and the speaker, and more susceptible to disturbances (e.g. background noise or transmission losses).

Compensating for version and speaker dependence with many reference spectrograms of different utterance versions or speakers has the disadvantage of greater effort in determining the most similar reference spectrograms. Speaker adaptation requires an adaptation phase before the actual recognition, which costs additional time and is not justifiable for short recognition tasks.

Methods for eliminating disturbances of the speech signal that are not psychoacoustically oriented have the disadvantage that they generally remove different information than human hearing does. The robustness of recognition to disturbances then deviates from that of humans.

The object of the invention is to improve the recognition rates of today's recognizers for sound signals. Recognition is to be made more robust to disturbances. Sound signals are to be recognized in a way that is as similar as possible to human recognition. Speech signals are to be recognized more independently of utterance version and speaker.

This object is achieved by the method having the features of claim 1.

The method imitates human auditory perception. Compared with methods in common use today, this makes the recognition of sound signals more similar to human recognition. The method takes into account the subjective pitch perception of humans, the frequency dependence of hearing sensitivity, static loudness perception, and simultaneous and post-masking phenomena. In addition to these psychoacoustic effects, some of which are also taken into account by other methods, it imitates effects that other methods do not: the human hearing threshold, the ability to discriminate intensities, and the dependence of post-masking on the test-tone duration. Static loudness and simultaneous masking are taken into account more correctly.

Signal components that humans do not perceive are eliminated. Components that carry no linguistic information are removed from speech signals. This makes speech recognition more independent of utterance version and speaker. In isolated-word recognition experiments, compared with a method using the current state of the art, we found an increase in speaker-dependent recognition rates from 92.7% to 98.8%. The speaker-independent rates rose from an average of 70.9% to 87.4% (Table 1).

These recognition rates were achieved with the exemplary embodiment of the method described below. The A/D converter used sampled at 16 kHz, i.e. with a sampling period of $T_s = 1/(16\,\mathrm{kHz})$. Spectra with N = 64 components were computed. The comparison rates were determined with an arrangement that, instead of the entire non-linear filter described below, uses only those parts that correspond to the prior art, namely the weighting (11) of the spectrum components according to the human hearing curve and the computation of the static loudness (12) by the monomial $W^{1/4}$. In both the speaker-dependent and the speaker-independent experiments, both speech recognizers were trained with one version, spoken by a male speaker, of each of the 62 words to be recognized. In the test phase, other versions from the same speaker or from other speakers had to be recognized.

Owing to the improved version and speaker independence of the method, similar or better recognition rates can be achieved with fewer reference spectrograms than with conventional methods. The particularly time-critical spectrogram recognition process can thereby be accelerated.

Using the method makes recognition more robust to disturbances. The speaker-dependent recognition rates for noisy speech and for speech with boosted high frequencies were raised by the method from an average of 58.3% to 97.2% (Table 1). The experiments were carried out with the same undisturbed training data as above.

Table 1: Recognition rates

The recognition rates can be increased further by adapting the filter to the type and strength of the disturbances present. If the parameter $W_0$ described in the exemplary embodiment below is increased in the case of additive disturbances, the recognition rate for noisy speech rises further from 96.0% to 99.2% (Table 1).

Since the non-linear filter operates on spectrograms that are coarsely sampled in time, it requires only little additional computational effort compared with methods that do not take psychoacoustic effects into account.

An exemplary embodiment of the invention is shown in Fig. 1 and Fig. 2. It is a method for isolated-word recognition. Fig. 1 gives an overview.

The sound signal to be recognized is first subjected to a short-time frequency analysis (1), which delivers power spectra subdivided according to the Bark scale. The power spectra form the spectrogram of the sound signal. This spectrogram is transformed by a two-dimensional non-linear filter (2) into a form that corresponds better to human auditory recognition. The transformed spectrograms are compared with predefined spectrograms by a comparator (3). The reference spectrogram most similar to the spectrogram to be recognized is determined.

The essential and novel element of the recognition method is the two-dimensional filter (2). It is responsible for the increased recognition rates. Fig. 2 shows its structure.

The components of the individual spectra are first weighted with frequency-dependent sensitivity factors (11). From the weighted spectra, masking spectra and spectra to be masked are computed. Component-wise subtraction (20) of the masking spectra from the spectra to be masked removes from the latter the information that humans do not perceive.

The spectra to be masked are computed from the weighted spectra by component-wise transformation (12) onto static loudness, low-pass filtering (13), and subsequent transformation (14) onto a loudness discrimination scale. The masking spectra are obtained from the weighted spectra by low-pass filtering (15), a time delay (16), application of the static loudness transformation (17), a linear transformation that smears along frequency (18), and transformation (19) onto the loudness discrimination scale.
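The data flow of the two-dimensional filter described above can be sketched as follows. This is only a structural sketch under stated assumptions, not the patent's implementation: the function name `two_dimensional_filter` and its parameter names are hypothetical, every processing stage (12) to (19) is passed in as a callable because its concrete formula is defined separately, and a spectrogram is assumed to be a list of frames, each frame a list of components.

```python
def two_dimensional_filter(weighted, static_loudness, lowpass, delay,
                           smear, discrimination):
    """Combine the two signal paths of the filter (hypothetical interface).

    weighted: spectrogram after the sensitivity weighting (11),
              given as a list of frames, each a list of components.
    The remaining arguments are callables mapping a spectrogram to a
    spectrogram of the same shape.
    """
    # Path of the spectra to be masked: (12) -> (13) -> (14)
    to_be_masked = discrimination(lowpass(static_loudness(weighted)))

    # Path of the masking spectra: (15) -> (16) -> (17) -> (18) -> (19)
    masking = discrimination(smear(static_loudness(delay(lowpass(weighted)))))

    # Component-wise subtraction (20): masking from to-be-masked
    return [
        [a - b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(to_be_masked, masking)
    ]
```

Passing the stages as callables keeps the ordering of the two paths visible without committing to any particular formula for the individual transformations.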

The individual steps of the exemplary embodiment are described in more detail below.

The sound signal $S_\nu = S(t_\nu)$, obtained by A/D conversion and available at the discrete times $t_\nu = \nu \cdot T_s$, $\nu \in \mathbb{N}$, $T_s = 1/(16\,\mathrm{kHz})$, is discretely Fourier-t-transformed [Terhardt, 1985] according to the computation rule

[equation not reproduced]

with analysis parameters adapted to the Bark scale

[equation not reproduced]

where

[equation not reproduced]

is taken from [Traunmüller, 1987] as an approximation of the Bark scale. The components $S_{n,\nu}$ of the Fourier-t spectra correspond to the sound signal filtered with different band-pass filters. The squared transfer functions of the band-pass filters are shown in Fig. 3 for $T_s = 1/(16\,\mathrm{kHz})$ and N = 64.
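For orientation, a frequently used closed-form approximation of the Bark scale attributed to Traunmüller is z = 26.81·f/(1960 + f) − 0.53, with f in Hz. That the embodiment uses exactly this expression is an assumption. It does, however, give Z(8 kHz) ≈ 21, which agrees with the value N = Z(f_Nyq) = 21 quoted for Fig. 5 at a 16 kHz sampling rate. A short sketch (the function name `hz_to_bark` is hypothetical):

```python
def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark for a frequency in Hz.

    Assumed formula (commonly attributed to Traunmueller):
        z = 26.81 * f / (1960 + f) - 0.53
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


# Nyquist frequency for a 16 kHz sampling rate
f_nyquist = 16000.0 / 2.0
print(round(hz_to_bark(f_nyquist), 2))  # prints 21.0
```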

The power signals $P_{n,\nu} = |S_{n,\nu}|^2$ are smoothed in time and sampled at intervals of $T$ according to

$$P'_{n,0} = 0$$
$$P'_{n,\nu} = \alpha \cdot P'_{n,\nu-1} + (1-\alpha) \cdot P_{n,\nu}$$
$$Q_{n,\mu} = P'_{n,c\cdot\mu}$$

where $T \approx 10$ ms is an integer multiple of $T_s$. The top of Fig. 4 shows a spectrogram of the word "Senken" obtained in this way. The power spectra are then fed to the two-dimensional filter (2).
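The smoothing and sampling step can be sketched for a single frequency band as follows. This is a minimal sketch under assumptions: the value of α is not given in the text and is therefore left as a parameter, c is taken to be the ratio T/T_s (160 for T = 10 ms at 16 kHz), the frame index µ is assumed to start at 1, and the function name `smooth_and_sample` is hypothetical.

```python
def smooth_and_sample(power, alpha, c):
    """Recursively smooth one band of power values and keep every c-th sample.

    power: sequence P_1, P_2, ... of power values for one band
    alpha: smoothing coefficient in [0, 1) (value is an assumption)
    c:     decimation factor, assumed to equal T / T_s
    """
    smoothed = 0.0            # P'_0 = 0
    sampled = []
    for nu, p in enumerate(power, start=1):
        # P'_nu = alpha * P'_(nu-1) + (1 - alpha) * P_nu
        smoothed = alpha * smoothed + (1.0 - alpha) * p
        if nu % c == 0:
            sampled.append(smoothed)   # Q_mu with mu = nu / c
    return sampled


# Constant input of 1.0: the smoothed value approaches 1.0 from below.
print(smooth_and_sample([1.0, 1.0, 1.0, 1.0], alpha=0.5, c=2))  # [0.75, 0.9375]
```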

The spectrum components are first weighted (11) with sensitivity factors $w_n$ according to

$$W_{n,\mu} = w_n \cdot Q_{n,\mu}$$

The factors follow from the human hearing threshold $L(\omega)$ as

[equation not reproduced]

$L(\omega)$ can be approximated by linear interpolation of the values shown in Table 2.

Table 2: Human hearing threshold

The static loudness transformations (12) and (17), which are needed for the subsequent computation of the masking spectra and the spectra to be masked, can sensibly be approximated by

[equation not reproduced]

Here, as in the following, $W$ and $W'$ denote the input and output signal of the processing step, respectively. For recognition that is as human-like as possible, $W_0$ is to be set to the input value $W$ that a 1 kHz tone with a sound level of 36 dB produces at the point of the associated weighted spectrum that is most sensitive to 1 kHz.

The low-pass filters (13) and (15) determine the temporal masking properties of the two-dimensional filter. They can be identical and implemented as leaky integrators:

$$W'_{n,\mu} = \beta \cdot W'_{n,\mu-1} + (1-\beta) \cdot W_{n,\mu}$$

These filters have the advantage of being very simple to compute. A value of 0.6 proves favourable for $\beta$.
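A leaky integrator of this form, applied along time to every component of a spectrogram, might look like the sketch below. The zero initial state and the names `leaky_integrate` and `time_constant_ms` are assumptions, and the input is assumed to be a non-empty list of frames. The second helper uses the standard result that the impulse response of such a filter decays as β^k, so it falls to 1/e after −1/ln β frames; with β = 0.6 and a frame period of about 10 ms this is roughly 20 ms.

```python
import math


def leaky_integrate(frames, beta=0.6):
    """Apply W'_mu = beta * W'_(mu-1) + (1 - beta) * W_mu to each component.

    frames: non-empty list of frames, each a list of N components.
    The initial state is assumed to be zero.
    """
    state = [0.0] * len(frames[0])
    output = []
    for frame in frames:
        state = [beta * s + (1.0 - beta) * w for s, w in zip(state, frame)]
        output.append(state)
    return output


def time_constant_ms(beta, frame_period_ms=10.0):
    """Time after which the impulse response has decayed to 1/e."""
    return -frame_period_ms / math.log(beta)


# Response of one component to a unit impulse in the first frame
impulse_response = leaky_integrate([[1.0], [0.0], [0.0]])
```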

The delay (16) in the computation of the masking spectra can only be approximated for delay times that are not integer multiples of $T$, for example by the linear interpolation

[equation not reproduced]

A value of 1.0 proves sensible for $\gamma$.
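As a generic illustration of approximating a non-integer delay by linear interpolation between neighbouring frames, consider the sketch below. How the parameter γ enters the patent's own interpolation formula is not shown here, so the argument `delay_frames` (the delay expressed in units of the frame period T) is a hypothetical stand-in, as is the function name. Values before the start of the signal are assumed to be zero.

```python
import math


def fractional_delay(values, delay_frames):
    """Delay a sequence by a possibly non-integer number of frames.

    The delayed value at index m is interpolated linearly between the
    inputs at m - k and m - k - 1, where k = floor(delay_frames).
    """
    k = math.floor(delay_frames)
    frac = delay_frames - k

    def at(i):
        # Assumed boundary handling: zero outside the available range
        return values[i] if 0 <= i < len(values) else 0.0

    return [
        (1.0 - frac) * at(m - k) + frac * at(m - k - 1)
        for m in range(len(values))
    ]


print(fractional_delay([2.0, 4.0, 6.0], 1.0))  # [0.0, 2.0, 4.0]
print(fractional_delay([2.0, 4.0], 0.5))       # [1.0, 3.0]
```

For an integer delay the interpolation weight of the second term vanishes and the operation reduces to a plain shift by whole frames.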

The linear transformation (18) needed to smear the masking spectra

[equation not reproduced]

is intended to model simultaneous masking effects of human auditory perception. The rows of the transformation matrix should therefore correspond to the reciprocal of psychoacoustic tuning curves. This is achieved by the choice

[equation not reproduced]

$\delta$ can sensibly be set to 0.05. Fig. 5 shows the resulting matrix for $N = Z(f_{Nyq}) = 21$ in pictorial form.

Given the above choice of the static loudness transformations (12) and (17), it makes sense to perform the loudness discrimination transformations (14) and (19) according to

[equation not reproduced]

The bottom of Fig. 4 shows the word "Senken" from the panel above after processing by the described two-dimensional filter (2).

A DTW method (3) is used to compare the filtered spectrograms with predefined spectrograms. The predefined spectrograms are computed from word versions whose word classes are known, using the same method as for the spectrograms to be recognized. The reference spectrogram whose DTW distance to the spectrogram to be recognized is smallest is determined, and its word class is output.

Advantageously, a modified DTW method can be used that permits arbitrary time warping and weights steps without time warping by a factor $C_{Diag} \in [0, 1]$, preferably $C_{Diag} = 0.7$. The distance $D(W^{(1)}, W^{(2)})$ between two spectrograms $W^{(1)}_{n,\mu}$, $W^{(2)}_{n,\nu}$ ($\mu = 1, \ldots, M_1$; $\nu = 1, \ldots, M_2$) is then computed according to:

[equation not reproduced]
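Because the distance formula itself is not available here, the following is only a generic sketch of a DTW distance with a down-weighted diagonal step, followed by nearest-reference classification. Several points are assumptions rather than the patent's definition: the city-block local distance between frames, the three step types (horizontal, vertical, diagonal), the way the factor multiplies the local distance on diagonal steps, the absence of path-length normalization, and the names `dtw_distance` and `classify`.

```python
def dtw_distance(spec_a, spec_b, c_diag=0.7):
    """DTW distance between two spectrograms (lists of frames).

    Diagonal steps (no time warping) have their local distance
    multiplied by c_diag; horizontal and vertical steps are unweighted.
    """
    def local(u, v):
        # Assumed local distance: city-block distance between frames
        return sum(abs(x - y) for x, y in zip(u, v))

    inf = float("inf")
    m1, m2 = len(spec_a), len(spec_b)
    # acc[i][j]: cheapest accumulated cost aligning the first i and j frames
    acc = [[inf] * (m2 + 1) for _ in range(m1 + 1)]
    acc[0][0] = 0.0
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            d = local(spec_a[i - 1], spec_b[j - 1])
            acc[i][j] = min(
                acc[i - 1][j] + d,               # vertical step
                acc[i][j - 1] + d,               # horizontal step
                acc[i - 1][j - 1] + c_diag * d,  # diagonal step
            )
    return acc[m1][m2]


def classify(unknown, references, c_diag=0.7):
    """Return the word class of the reference with the smallest distance.

    references: list of (word_class, spectrogram) pairs.
    """
    best = min(references, key=lambda ref: dtw_distance(unknown, ref[1], c_diag))
    return best[0]
```

With this weighting, an alignment that follows the diagonal is cheaper than one that reaches the same frames through horizontal and vertical steps, which favours undistorted time alignment while still allowing arbitrary warping.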

References

Ahadi, S.M. and P.C. Woodland: "Rapid speaker adaptation using model prediction", Proc. IEEE Internat. Conf. Acoust. Speech 1995, Detroit, MI, 684-687.
Aikawa, K., H. Singer, H. Kawahara and Y. Tohkura: "A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1993, Minneapolis, MN, II-668-671.
Bridle, J.S. and M.D. Brown: "An experimental automatic word recognition system", JSRU Report No. 1003, Ruislip, England: Joint Speech Research Unit, 1974.
Cohen, J.R.: "Application of an auditory model to speech recognition", J. Acoust. Soc. Amer., Vol. 85 (1989), No. 6, 2623-2629.
Elenius, K. and M. Blomberg: "Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1982, Paris, France, 535-537.
Furui, S.: "Speaker-independent isolated word recognition using dynamic features of speech spectrum", IEEE Trans. Acoust. Speech Signal Process., Vol. 34 (1986), 52-59.
Gramß, T. and H.W. Strube: "Entwicklung mehrschichtiger neuronaler Netzwerke zur Worterkennung und -reproduktion", Informationstechnik, Vol. 5 (1989), 324-333.
Gramß, T.: "Worterkennung mit einem künstlichen neuronalen Netz", Dissertation, Georg-August-Universität Göttingen, 1992.
Hanson, B. and D. Wong: "The harmonic magnitude suppression (HMS) technique for intelligibility enhancement in the presence of interfering speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1984, 18.A.5.1.-18.A.5.4.
Hanson, B.A. and T.H. Applebaum: "Robust speaker-independent word recognition using static, dynamic and acceleration features: experiments with Lombard and noisy speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1990, 857-860.
Hermansky, H.: "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Amer., Vol. 87 (1990), No. 4, 1738-1752.
Hermansky, H. and N. Morgan: "RASTA processing of speech", IEEE Trans. Speech Audio Process., Vol. 2 (1994), No. 4, 578-589.
Kamm, T., A.G. Andreou and J. Cohen: "Vocal tract normalization in speech recognition: Compensation for systematic speaker variability", Proc. 15th Annual Speech Research Symposium 1995, Johns Hopkins University, Baltimore, MD, 175-179.
Kowalewski, F.: "Rückgekoppelte und wachsende neuronale Netze zur dynamischen Erkennung von Sprache", Dissertation, Georg-August-Universität Göttingen, 1991.
Lee, C-H., C-H. Lin and B-H. Juang: "A study on speaker adaptation of the parameters of continuous density hidden Markov models", IEEE Trans. Signal Process., Vol. 39 (1991), No. 4, 806-814.
Mermelstein, P.: "Distance measures for speech recognition, psychological and instrumental", in Pattern Recognition and Artificial Intelligence, ed. R.C.H. Chen, Academic Press, New York, 1976, 374-388.
Paping, M. and H.W. Strube: "Psychoakustische Vorverarbeitung zur Spracherkennung", Fortschritte der Akustik - DAGA '91, 1991, 997-1000.
Pavel, M. and H. Hermansky: "Temporal masking in automatic speech recognition", J. Acoust. Soc. Amer., Vol. 95 (1994), No. 5, 2876ff.
Porter, J.E. and S.F. Boll: "Optimal estimators for spectral restoration of noisy speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1984, San Diego, CA, 18.A.2.1.-18.A.2.4.
Ruske, G. and M. Beham: "Gehörbezogene automatische Spracherkennung", in Sprachliche Mensch-Maschine-Kommunikation, ed. H. Mangold, Oldenbourg, München etc., 1992, 33-47.
Schotola, T.: "On the use of demisyllables in automatic word recognition", Speech Comm., Vol. 3 (1984), 63-87.
Terhardt, E.: "Fourier transformation of time signals: conceptual revision", Acustica, Vol. 57 (1985), 242-256.
Traunmüller, H. and F. Lacerda: "Perceptual relativity in identification of two-formant vowels", Speech Comm., Vol. 6 (1987), 143-157.
Aikawa, K., H. Singer, H. Kawahara and Y. Tohkura: "A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1993, Minneapolis, MN, II-668-671.
Bridle, J.S. and M.D. Brown: "An experimental automatic word recognition system", JSRU Report No. 1003, Ruislip, England: Joint Speech Research Unit, 1974.
Cohen, J.R.: "Application of an auditory model to speech recognition", J. Acoust. Soc. Amer., Vol. 85 (1989), No. 6, 2623-2629.
Elenius, K. and M. Blomberg: "Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1982, Paris, France, 535-537.
Furui, S.: "Speaker-independent isolated word recognition using dynamic features of speech spectrum", IEEE Trans. Acoust. Speech Signal Process., Vol. 34 (1986), 52-59.
Gramß, T. and H.W. Strube: "Development of multilayer neural networks for word recognition and reproduction", Informationstechnik, Vol. 5 (1989), 324-333.
Gramß, T.: "Word recognition with an artificial neural network", dissertation, Georg-August-Universität Göttingen, 1992.
Hanson, B. and D. Wong: "The harmonic magnitude suppression (HMS) technique for intelligibility enhancement in the presence of interfering speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1984, 18.A.5.1.-18.A.5.4.
Hanson, B.A. and T.H. Applebaum: "Robust speaker-independent word recognition using static, dynamic and acceleration features: experiments with Lombard and noisy speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1990, 857-860.
Hermansky, H.: "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Amer., Vol. 87 (1990), No. 4, 1738-1752.
Hermansky, H. and N. Morgan: "RASTA processing of speech", IEEE Trans. Speech Audio Process., Vol. 2 (1994), No. 4, 578-589.
Kamm, T., A.G. Andreou and J. Cohen: "Vocal tract normalization in speech recognition: Compensation for systematic speaker variability", Proc. 15th Annual Speech Research Symposium 1995, Johns Hopkins University, Baltimore, MD, 175-179.
Kowalewski, F.: "Feedback and growing neural networks for the dynamic recognition of speech", dissertation, Georg-August-Universität Göttingen, 1991.
Lee, C.-H., C.-H. Lin and B.-H. Juang: "A study on speaker adaptation of the parameters of continuous density hidden Markov models", IEEE Trans. Signal Process., Vol. 39 (1991), No. 4, 806-814.
Mermelstein, P.: "Distance measures for speech recognition, psychological and instrumental", in Pattern Recognition and Artificial Intelligence, ed. R.C.H. Chen, Academic Press, New York, 1976, 374-388.
Paping, M. and H.W. Strube: "Psychoacoustic preprocessing for speech recognition", Fortschritte der Akustik - DAGA '91, 1991, 997-1000.
Pavel, M. and H. Hermansky: "Temporal masking in automatic speech recognition", J. Acoust. Soc. Amer., Vol. 95 (1994), No. 5, 2876ff.
Porter, J.E. and S.F. Boll: "Optimal estimators for spectral restoration of noisy speech", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process. 1984, San Diego, CA, 18.A.2.1.-18.A.2.4.
Ruske, G. and M. Beham: "Hearing-related automatic speech recognition", in Sprachliche Mensch-Maschine-Kommunikation, H. Mangold (ed.), Oldenbourg, Munich etc., 1992, 33-47.
Schotola, T.: "On the use of demisyllables in automatic word recognition", Speech Comm., Vol. 3 (1984), 63-87.
Terhardt, E.: "Fourier transformation of time signals: conceptual revision", Acustica, Vol. 57 (1985), 242-256.
Traunmüller, H. and F. Lacerda: "Perceptual relativity in identification of two-formant vowels", Speech Comm., Vol. 6 (1987), 143-157.

U.S. Patent No. 4,905,286
U.S. Patent No. 4,914,692
U.S. Patent No. 4,956,865
U.S. Patent No. 5,220,610
U.S. Patent No. 5,590,242

Claims

1. A method for recognizing sound signals, which has a processing step for obtaining Bark-scale-adapted short-term power spectra and a spectrogram recognition stage, characterized by:

i) weighting (11) the components of the Bark power spectra of the sound signal to be recognized in accordance with the human hearing curve;
ii) calculating spectra to be masked from the spectra obtained in method step i) by transformation (12) to the static loudness, low-pass filtering (13) and subsequent transformation (14) to a loudness discrimination scale;
iii) calculating masking spectra from the spectra obtained in method step i) by low-pass filtering (15), delay (16), a loudness transformation (17) identical to the static loudness transformation (12) from step ii), a smearing linear transformation (18) and a transformation (19) to the loudness discrimination scale (14) from step ii);
iv) component-wise subtraction (20) of the masking spectra from step iii) from the spectra to be masked from step ii), and forwarding of the resulting spectra to the spectrogram recognition stage.
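
The data flow of steps i) to iv) can be sketched roughly as follows. This is only an illustration under stated assumptions. The function name `masked_difference_spectra` and all parameter names are hypothetical. The loudness transformation, low-pass filter, smearing transformation and loudness discrimination transformation are passed in as callables, because their formulas are not reproduced in the text of this section. A delay measured in whole frames and zero padding before the first frame are likewise assumptions.

```python
def masked_difference_spectra(spectra, sensitivity, loudness, lowpass,
                              delay_frames, smear, discrimination):
    """Sketch of steps i) to iv) for a list of Bark power spectra (one list per frame).

    All callables are placeholders supplied by the caller:
      loudness(value) and discrimination(value) act on single components,
      lowpass(frames) filters a list of frames along time,
      smear(frame) applies the linear smearing to one frame.
    """
    n = len(sensitivity)

    # i) weight each component with its frequency-dependent sensitivity factor (11)
    weighted = [[s * w for s, w in zip(sensitivity, frame)] for frame in spectra]

    # ii) spectra to be masked: loudness (12), low-pass (13), discrimination scale (14)
    loud = [[loudness(v) for v in frame] for frame in weighted]
    to_be_masked = [[discrimination(v) for v in frame] for frame in lowpass(loud)]

    # iii) masking spectra: low-pass (15), delay (16), loudness (17),
    #      smearing (18), discrimination scale (19)
    smoothed = lowpass(weighted)
    padding = [[0.0] * n for _ in range(delay_frames)]  # assumed: silence before frame 0
    delayed = (padding + smoothed)[:len(smoothed)]
    masking = [[discrimination(v) for v in smear([loudness(v) for v in frame])]
               for frame in delayed]

    # iv) component-wise subtraction (20) of masking spectra from spectra to be masked
    return [[t - m for t, m in zip(t_frame, m_frame)]
            for t_frame, m_frame in zip(to_be_masked, masking)]


# Usage with identity placeholders: the result is each frame minus the previous frame.
identity = lambda v: v
frames = [[1.0, 2.0], [3.0, 5.0]]
result = masked_difference_spectra(frames, [1.0, 1.0], identity, identity,
                                   1, identity, identity)
```

With real transformations substituted for the identity placeholders, the returned spectra would be the ones forwarded to the spectrogram recognition stage.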

2. A method according to claim 1, characterized in that the short-term power spectra are calculated by fast Fourier transformation, neighboring components of the Fourier power spectra being combined into Bark-wide components.
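
A minimal sketch of how neighboring Fourier power components might be combined into Bark-wide components is given below. The claim does not specify a frequency-to-Bark mapping; the Zwicker-Terhardt approximation used here is an assumption, and the function names `hz_to_bark` and `bark_power_spectrum` are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale (an assumed choice)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_power_spectrum(power, sample_rate):
    """Sum neighboring FFT power components into bands one Bark wide.

    power: power values of FFT bins spaced evenly from 0 Hz to sample_rate / 2.
    Returns a list whose entry k holds the summed power of all bins
    whose Bark value z satisfies k <= z < k + 1.
    """
    n_bins = len(power)
    nyquist = sample_rate / 2.0
    n_bands = int(hz_to_bark(nyquist)) + 1
    bands = [0.0] * n_bands
    for i, p in enumerate(power):
        f = i * nyquist / (n_bins - 1) if n_bins > 1 else 0.0
        bands[int(hz_to_bark(f))] += p
    return bands


# Usage: a flat 257-bin power spectrum sampled at 16 kHz.
flat = [1.0] * 257
bark_bands = bark_power_spectrum(flat, 16000)
```

Because every bin is assigned to exactly one band, the total power is preserved; the higher bands collect more bins because Bark bands widen with frequency.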

3. A method according to claim 1, characterized in that the short-term power spectra are calculated by Fourier-t transformation, the analysis parameters being selected according to the Bark scale.

4. A method according to claim 1, characterized in that the short-term power spectra are calculated by a filter bank whose filters are selected according to the Bark scale.

5. A method according to claim 1, characterized in that the static loudness transformations (12) and (17) of method steps ii) and iii) of claim 1 are given by

6. A method according to claim 5, characterized in that the constant W₀ of the loudness transformation from claim 5 is set to the input value W that a 1 kHz tone with a sound level of 36 dB generates at the point of the associated weighted spectrum that is most sensitive to 1 kHz.

7. A method according to claim 1, characterized in that the low-pass filters (13) and (15) of method steps ii) and iii) of claim 1 are identical leaky integrators.

8. A method according to claim 7, characterized in that the leaky integrators from claim 7 are given by

$$W'_{n,\mu} = \beta \cdot W'_{n,\mu-1} + (1-\beta) \cdot W_{n,\mu}$$

$\beta = 0.6$
$W_{n,\mu}$ = nth component of the µth input spectrum
$W'_{n,\mu}$ = nth component of the µth output spectrum
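
The leaky integrator recursion of claim 8 can be illustrated with a short sketch. The function name `leaky_integrate` is hypothetical, and the zero initial state before the first spectrum is an assumption that the claim does not specify.

```python
def leaky_integrate(spectra, beta=0.6):
    """Low-pass filter a sequence of spectra along time with a leaky integrator.

    spectra: list of spectra, each a list of component values W[n] for one frame mu.
    Returns a list of the same shape with
        W'[n][mu] = beta * W'[n][mu - 1] + (1 - beta) * W[n][mu].
    Assumption: the filter state before the first frame is zero.
    """
    if not spectra:
        return []
    previous = [0.0] * len(spectra[0])
    filtered = []
    for frame in spectra:
        current = [beta * p + (1.0 - beta) * w for p, w in zip(previous, frame)]
        filtered.append(current)
        previous = current
    return filtered


# Usage: a unit step in a single component approaches 1 as 1 - beta ** (mu + 1).
step_response = leaky_integrate([[1.0], [1.0], [1.0], [1.0]])
```

With β = 0.6, each output keeps 60 % of the previous output and takes 40 % of the new input, so rapid frame-to-frame fluctuations are attenuated while slow changes pass through.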

9. A method according to claim 1, characterized in that the smearing linear transformation (18) from step iii) of claim 1 is given by:

$f_{\mathrm{Nyq}} = 1/(2 T_s)$
$N$ = number of spectrum components
$T_s$ = sampling period
$W_{m,\mu}$ = mth component of the µth input spectrum
$W'_{n,\mu}$ = nth component of the µth output spectrum

10. A method according to claim 9, characterized in that the parameter δ for calculating the smear matrix $M_{n,m}$ from claim 9 has the value 0.05.

11. A method according to claim 5, characterized in that the loudness discrimination transformations (14) and (19) from steps ii) and iii) of claim 1 are given by:

12. A method according to claim 1, characterized in that the parameters of the processing steps described in claim 1 are chosen differently for different disturbances of the sound signal to be recognized and, where appropriate, vary slowly over time, and in that both the spectrogram to be recognized and the comparison spectrograms are processed with these parameters.

13. A method according to claim 12, characterized in that the type and strength of existing disturbances are estimated from the unfiltered power spectrograms of the sound signals to be recognized, and the parameters required for the method according to claim 12 are derived from this estimate.

14. A device for recognizing sound signals, which has a device for obtaining Bark-scale-adapted short-term power spectra and a device for recognizing spectrograms, characterized by:

i) a device for weighting (11) the components of the Bark power spectra of the sound signal to be recognized, supplied by the short-term frequency analysis device, in accordance with the human hearing curve;
ii) a device for calculating spectra to be masked from the spectra obtained from device i) by transformation (12) to the static loudness, low-pass filtering (13) and subsequent transformation (14) to a loudness discrimination scale;
iii) a device for calculating masking spectra from the spectra obtained from device i) by low-pass filtering (15), delay (16), a loudness transformation (17) identical to the static loudness transformation (12) of device ii), a smearing linear transformation (18) and a transformation (19) to the loudness discrimination scale (14) of device ii);
iv) a device for component-wise subtraction (20) of the masking spectra provided by device iii) from the spectra to be masked provided by device ii), and for forwarding the resulting spectra to the spectrogram recognition device.

15. A device according to claim 14, characterized in that the short-term power spectra are calculated by fast Fourier transformation, neighboring components of the Fourier power spectra being combined into Bark-wide components.

16. A device according to claim 14, characterized in that the short-term power spectra are calculated by Fourier-t transformation, the analysis parameters being selected according to the Bark scale.

17. A device according to claim 14, characterized in that the short-term power spectra are calculated by a filter bank whose filters are selected according to the Bark scale.

18. A device according to claim 14, characterized in that the static loudness transformations (12) and (17) of devices ii) and iii) of claim 14 are given by

19. A device according to claim 18, characterized in that the constant W₀ of the loudness transformation from claim 18 is set to the input value W that a 1 kHz tone with a sound level of 36 dB generates at the point of the associated weighted spectrum that is most sensitive to 1 kHz.

20. A device according to claim 14, characterized in that the low-pass filters (13) and (15) of devices ii) and iii) of claim 14 are identical leaky integrators.

21. A device according to claim 20, characterized in that the leaky integrators from claim 20 are given by

$$W'_{n,\mu} = \beta \cdot W'_{n,\mu-1} + (1-\beta) \cdot W_{n,\mu}$$

$\beta = 0.6$
$W_{n,\mu}$ = nth component of the µth input spectrum
$W'_{n,\mu}$ = nth component of the µth output spectrum

22. A device according to claim 14, characterized in that the smearing linear transformation (18) of device iii) of claim 14 is given by:

$f_{\mathrm{Nyq}} = 1/(2 T_s)$
$N$ = number of spectrum components
$T_s$ = sampling period
$W_{m,\mu}$ = mth component of the µth input spectrum
$W'_{n,\mu}$ = nth component of the µth output spectrum

23. A device according to claim 22, characterized in that the parameter δ for calculating the smear matrix $M_{n,m}$ from claim 22 has the value 0.05.

24. A device according to claim 18, characterized in that the loudness discrimination transformations (14) and (19) of devices ii) and iii) of claim 14 are given by:

25. A device according to claim 14, characterized in that the parameters of the devices described in claim 14 are chosen differently for different disturbances of the sound signal to be recognized and, where appropriate, vary slowly over time, and in that both the spectrogram to be recognized and the comparison spectrograms are processed with these parameters.

26. A device according to claim 25, characterized in that the type and strength of existing disturbances are estimated from the unfiltered power spectrograms of the sound signals to be recognized, and the parameters required for the device according to claim 25 are derived from this estimate.