JP4531350B2 - Voice input device and voice recognition processing system

Info

Publication number: JP4531350B2
Application number: JP2003159025A
Authority: JP
Inventors: 真吾木内
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2003-06-04
Filing date: 2003-06-04
Publication date: 2010-08-25
Anticipated expiration: 2023-06-04
Also published as: JP2004361604A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力された音声信号を音声認識処理の対象となる音声データに変換する音声入力装置に関する。
【０００２】
【従来の技術】
マイクロホンで収集した音声の内容を認識する音声認識装置が知られており、車載のナビゲーション装置の入力装置等に応用されている。このような音声認識装置では、利用者が自然に発声した音声に対して完全にその内容を認識するまでには至っておらず、認識率を高める各種の工夫がなされている。例えば、入力音声を増幅する増幅器の利得を、入力音声のダイナミックレンジに応じて設定することにより、認識対象となる音声の振幅を調整した音声認識装置が知られている（例えば、特許文献１参照。）。この音声認識装置では、小さな声の利用者に対しては増幅器の利得が高く設定され、反対に大きな声の利用者に対しては増幅器の利得が低く設定されるため、認識対象となる音声のダイナミックレンジを常に最適レベルに維持することが可能になり、認識率を高めることができる。
【０００３】
【特許文献１】
特開昭６１−１８０２９６号公報（第２頁、図１）
【０００４】
【発明が解決しようとする課題】
ところで、上述した特許文献１に開示された音声認識装置では、入力音声のダイナミックレンジに基づいて増幅器の利得が設定され、その後の入力音声に対して最適なダイナミックレンジが設定されるため、音声が最初に入力されてから増幅器の利得設定が終了するまでは、認識率を高めることができないという問題があった。また、同じ利用者あるいは複数の利用者が大きな声と小さな声を交互に発声した場合のように、入力音声のダイナミックレンジ自体が変化する場合には適用できず、認識率を高めることができないという問題があった。
【０００５】
一般に、人の声のダイナミックレンジは、ささやき声から怒鳴り声まで６０ｄＢ程度あるといわれている。しかも、声の大きさが人によってばらつくことを考慮すると、音声全体のダイナミックレンジは、さらに大きくなると考えられる。このような入力音声を、一般に用いられる１６ビット量子化のアナログ−デジタル変換器を用いて音声データに変換した場合には、１５ビットから５ビットの範囲のデータに相当する。
【０００６】
一方、音声認識処理の認識可能な入力音声のダイナミックレンジは、現状では４０ｄＢ程度が上限であり、１６ビット量子化のアナログ−デジタル変換器を用いた場合に１５ビットから９ビットの範囲のデータに相当する。すなわち、大きな声に対応する利得の設定がなされているときに小さな声に対応する音声が入力されると、この内容については認識することができなくなってしまう。
【０００７】
本発明は、このような点に鑑みて創作されたものであり、その目的は、広いダイナミックレンジを有する音声に対する認識率を高めることができる音声入力装置を提供することにある。
【０００８】
【課題を解決するための手段】
上述した課題を解決するために、本発明の音声入力装置は、音声認識装置の前段に設けられ、入力音声信号に対応する音声データを生成する音声入力装置であって、入力音声信号に対して所定の利得で減衰あるいは増幅を行うことにより、振幅が異なる複数の音声信号を生成する信号生成手段と、信号生成手段によって生成された複数の音声信号のそれぞれをデジタルデータに変換する複数のアナログ−デジタル変換手段と、複数のアナログ−デジタル変換手段から出力される複数のデジタルデータを合成するデータ合成手段と、信号生成手段によって生成される複数の音声信号のそれぞれのレベル検出を行うレベル検出手段とを備え、データ合成手段は、複数のアナログ−デジタル変換手段から出力される複数のデジタルデータのそれぞれのビット位置を、レベル検出手段によって検出された複数の音声信号のそれぞれのレベルの比に応じたビット数分ずらして合成している。これにより、一つのアナログ−デジタル変換手段の量子化ビット数では足りないようなダイナミックレンジの広い音声信号に対しても、波形の部分的な欠落がない符号化処理を行うことが可能になり、広いダイナミックレンジの確保とともに、音声波形全体が含まれるデータを生成して音声認識装置に入力することによって音声認識処理の認識率を高めることが可能になる。また、入力音声信号に対して所定の利得で減衰あるいは増幅を行うことにより、一の入力音声信号に対して、振幅（利得）が異なる複数の音声信号を容易に生成することができる。また、信号生成手段によって生成される複数の音声信号のそれぞれのレベル検出を行うレベル検出手段を備え、データ合成手段は、レベル検出手段によって検出された複数の音声信号のそれぞれのレベルの比に応じて、合成処理の際にずらすビット数を決定しているため、素子定数や製造上のばらつきを考慮したデータの合成を行うことが可能になり、入力音声信号に対応する歪みの少ないデータを生成することができる。
【０００９】
また、音声を集音して入力音声信号を出力するマイクロホンをさらに備えることが望ましい。これにより、マイクロホンで集音したダイナミックレンジが広い各種の音声をそのまま音声認識用のデジタルデータに変換することが可能になる。
【００１４】
また、上述したアナログ−デジタル変換手段の数は２であり、ステレオ用のアナログ−デジタル変換器を用いることが望ましい。これにより、一般にステレオ用として市販されている２個一組のアナログ−デジタル変換器を用いることにより、部品コストを下げることができる。
【００１５】
また、上述したデータ合成手段は、所定期間に入力されるデジタルデータを蓄積し、この期間内で最も振幅が大きな音声信号が含まれる所定ビット数のデータを、蓄積された各デジタルデータの中から切り出して出力する処理を行うことが望ましい。これにより、音声認識に必要な音声波形のピークを含む所定ビット数のデータを生成することが可能になる。また、所定ビット数のデータを生成することにより、所定ビット数のデータに対して音声認識処理を行う従来の音声認識装置を用いることができるため、音声認識処理システム全体のコスト上昇を抑えることが可能になる。
【００１６】
また、上述した所定期間は、入力音声信号が途切れるまでの音声入力区間であることが望ましい。これにより、音声認識の対象となる一連の音声について、その波形に含まれるピークの情報を保持した所定ビット数のデータを切り出すことが可能になり、認識率を高めることができる。
【００１７】
また、上述した所定期間は、入力音声信号が途切れるまでの音声入力区間よりも短い分割期間であり、後の分割期間に対応する音声信号の振幅がそれ以前の分割期間に対応する音声信号の振幅よりも大きい場合には、切り出す所定ビット数のデータの切り出し位置をこの大きい振幅に対応する位置に変更して、蓄積された各デジタルデータの中から切り出して出力する処理をそれ以前の分割期間から繰り返すことが望ましい。これにより、音声認識処理の遅延時間を少なくすることが可能になる。
【００１８】
また、本発明の音声認識処理システムは、上述した音声入力装置と、この音声入力装置から出力されるデータに対して音声認識処理を行う音声認識装置とを備えており、所定期間に入力されるデジタルデータについては、所定ビット数のデータを切り出す位置は固定であり、音声認識装置は、データ合成手段から出力されるデータが一のアナログ−デジタル変換手段から出力されるデジタルデータを用いて生成される場合と、複数のアナログ−デジタル変換手段から出力されるデジタルデータを用いて生成される場合とで、音声認識処理に用いられる複数の音響辞書を使い分けている。特に、上述した複数の音響辞書には、一のアナログ−デジタル変換手段から出力されるデジタルデータを合成する際に発生する歪みを考慮した歪み学習音響辞書と、この歪みが考慮されていない通常音響辞書が含まれていることが望ましい。これにより、データの合成に伴って発生する歪みを考慮した音声認識処理が可能になり、認識率を高めることができる。
【００１９】
【発明の実施の形態】
以下、本発明を適用した一実施形態の音声入力装置について、図面を参照しながら詳細に説明する。
〔第１の実施形態〕
図１は、第１の実施形態の音声入力装置の構成を示す図である。図１に示す本実施形態の音声入力装置１００は、音声認識装置の前段に設けられて入力音声信号に対応する音声データを生成するためのものであり、マイクロホン１０、増幅器１２、アナログ−デジタル変換器（Ａ／Ｄ）１４、１８、減衰器１６、波形推定部２０を含んで構成されている。また、この音声入力装置１００とその後段に接続された音声認識装置２００を含んで音声認識処理システムが構成されている。
【００２０】
マイクロホン１０は、音声認識対象となる利用者の音声を集音して、この音声に対応する入力音声信号を出力する。増幅器１２は、アナログ−デジタル変換器１４、１８による処理が可能な振幅レベルになるように、入力音声信号を所定のゲインで増幅する。一方のアナログ−デジタル変換器１４は、増幅器１２から出力される増幅後の音声信号を所定ビット数のデジタルデータに変換する。例えば音声信号は、符号ビットが１、データビットが１５の合計１６ビットの音声データ（中間データ）に変換される。
【００２１】
減衰器１６は、増幅器１２から出力される音声信号を減衰させて、減衰後の音声信号を出力する。例えば、減衰の利得が（１／２）¹⁵倍に設定されている。他方のアナログ−デジタル変換器１８は、減衰器１６から出力される減衰後の音声信号を所定ビット数のデジタルデータに変換する。例えば、一方のアナログ−デジタル変換器１４と同様に、音声信号は符号ビットが１、データビットが１５の合計１６ビットの音声データ（中間データ）に変換される。なお、一般にステレオ用として市販されている２個一組のアナログ−デジタル変換器１４、１８を用いることにより、部品コストを下げることができる。
【００２２】
波形推定部２０は、２つのアナログ−デジタル変換器１４、１８のそれぞれから出力される１６ビットの中間データを合成して、符号ビットが１、データビットが３０の合計３１ビットの合成データを生成する。例えば、波形推定部２０は、アナログ−デジタル変換器１４の出力データが飽和していない場合にこの出力データを用い、この出力データが飽和した場合にはこの飽和した部分の波形形状を他のアナログ−デジタル変換器１８の飽和していない出力データに基づいて推定することにより、中間データの合成処理を行う。
【００２３】
図２は、波形推定部２０の詳細構成を示す図である。図２に示すように、波形推定部２０は、倍精度データ生成部２２、音声区間終了判定部２４、有効データ位置監視部２６、認識処理用データ生成部２８を備えている。
上述したように、一方のアナログ−デジタル変換器１４に入力される音声信号に対して、他方のアナログ−デジタル変換器１８に入力される音声信号は、信号レベルが（１／２）¹⁵倍に減衰しているため、これらの音声信号を３１ビット長のデジタルデータに変換すると、それぞれの音声信号の波形情報が現れる位置は１５ビットシフトしている。実際には、アナログ−デジタル変換器１４、１８は、入力される音声信号を１５ビットのデータにしか変換できない。
【００２４】
このため、大きな信号レベルの音声信号が入力された場合には、一方のアナログ−デジタル変換器１４では、許容入力電圧を超えてしまい、音声波形のピーク部分が飽和した状態で音声データに変換される。このとき、他方のアナログ−デジタル変換器１８では、大きな信号レベルの音声信号が減衰した状態で入力されるため、音声波形のピーク部分が正常に音声データに変換される。
【００２５】
また、小さな信号レベルの音声信号が入力された場合には、一方のアナログ−デジタル変換器１４では、許容入力電圧の範囲内の音声信号が入力されるため、音声波形のピーク部分が正常に音声データに変換される。
倍精度データ生成部２２は、一方のアナログ−デジタル変換器１４から出力される中間データが飽和していない場合にはこの中間データをそのまま用いて３０ビットのデータビットを生成し、一方のアナログ−デジタル変換器１４から出力される中間データが飽和している場合には他方のアナログ−デジタル変換器１８から出力される中間データを２¹⁵倍して３０ビットのデータビットを生成し、符号ビット１、データビット３０で合計３１ビットの倍精度データ（合成データ）を生成する。合成データに含まれる符号ビットは、アナログ−デジタル変換器１４、１８から出力される中間データの符号ビットがそのまま用いられる。
【００２６】
なお、倍精度データ生成部２２の他の動作例としては、３１ビットの音声データに含まれるデータビットの中の上位１５ビットに、他方のアナログ−デジタル変換器１８から出力される中間データの中のデータビットを当てはめ、３１ビットの音声データに含まれるデータビットの中の下位１５ビットに、一方のアナログ−デジタル変換器１４から出力される中間データの中のデータビットを当てはめることにより（この中間データが飽和している場合には下位１５ビットの各ビットを“０”とする）、符号ビット１、データビット３０で合計３１ビットの合成データを生成するようにしてもよい。
【００２７】
音声区間終了判定部２４は、倍精度データ生成部２２によって生成される合成データを監視することにより、音声区間の終了タイミングを判定する。例えば、音声信号の入力が開始された後、合成データの値が「０」あるいは所定値よりも小さくなったときに、認識対象としてのひとまとまりの音声入力が終了したものとして判定される。
【００２８】
有効データ位置監視部２６は、合成データに含まれる３０ビットのデータビットの各値を調べ、値が“１”となる最上位のビット位置を検出し、そのビット位置を有効データ位置として抽出する。この有効データ位置は、次に入力される合成データに対応する有効データ位置の方が上位ビット側にある場合には、それまでの値が更新され、それ以外の場合には廃棄される。このようにして、最も信号レベルが大きい音声データに対応する有効データ位置が保持される。
【００２９】
認識処理用データ生成部２８は、音声区間終了判定部２４によって音声区間の終了タイミングが判定されるまでの間、倍精度データ生成部２２から出力される合成データを蓄積する。また、認識処理用データ生成部２８は、この蓄積期間終了後に、有効データ位置監視部２６で検出された有効データ位置を含む下位１５ビットの抽出位置を決定し、蓄積順に合成データを読み出してこの抽出位置に対応する１５ビットデータを抽出し、さらに符号ビットを加えた合計１６ビットの認識処理用データを生成する。このようにして生成された認識処理用データは、音声入力装置１００の後段に接続された音声認識装置２００に入力される。
【００３０】
上述した増幅器１２、減衰器１６が信号生成手段、利得変更手段に、アナログ−デジタル変換器１４、１８がアナログ−デジタル変換手段に、波形推定部２０がデータ合成手段にそれぞれ対応する。
このように、本実施形態の音声入力装置１００では、一つのアナログ−デジタル変換器の量子化ビット数では足りないようなダイナミックレンジの広い入力音声信号に対しても、２つのアナログ−デジタル変換器１４、１８を用いることにより波形の部分的な欠落がない符号化処理を行うことが可能になり、広いダイナミックレンジの確保とともに、音声波形全体が含まれるデータを生成して音声認識装置２００に入力することによって音声認識処理の認識率を高めることが可能になる。
【００３１】
また、増幅器１２や減衰器１６のそれぞれを単独であるいは組み合わせて用いることにより、一の入力音声信号に対して、振幅（利得）が異なる２つの音声信号を生成することが可能になる。
また、波形推定部２０では、減衰器１６の利得に対応してビット数をシフトして２つの中間データを合成することにより、元の入力音声信号の波形全体を含むビット数が多い合成データを容易に生成することができる。
【００３２】
また、波形推定部２０においてビット長の多い合成データの中から所定ビット数の認識処理用データを切り出して出力することにより、所定ビット数のデータに対して音声認識処理を行う従来の音声認識装置２００を用いることができるため、音声認識システム全体のコスト上昇を抑えることが可能になる。
【００３３】
また、波形推定部２０は、音声区間が終了するまでの一連の合成データ蓄積し、この区間が終了した後に認識処理用データを切り出しているため、音声認識の対象となる一連の音声について、その波形に含まれるピークの情報を保持した所定ビット数の認識処理用データを切り出すことが可能になり、認識率を高めることができる。
【００３４】
なお、上述した本実施形態の音声入力装置１００では、波形推定部２０内の認識処理用データ生成部２８は、音声区間終了判定部２４によって音声区間の終了タイミングが判定されるまでの期間合成データを蓄積し、この蓄積期間が終了した後認識処理用データを出力していたため、この蓄積期間に相当する遅延時間が発生する。この遅延時間を短くするために、例えば、短い分割期間を設定し、この分割期間毎に認識処理用データ生成部２８から認識処理用データを出力するようにしてもよい。但し、それ以前の分割期間に対応して出力された認識処理用データよりも大きな値を有する認識処理用データが出力されると、これ以後の分割期間における認識処理用データの合成データ中の切り出し位置が変更されてしまい、それ以前の認識処理用データが無効になってしまう。この時点で、後段の音声認識装置２００にその旨を通知して音声認識処理を中断させるとともに、最初から切り出し位置を変更した認識処理用データを再度出力する必要がある。後段の音声認識装置２００では、このようにして再度出力された一連の認識処理用データを用いて音声認識処理を行う。
【００３５】
また、上述した本実施形態の音声入力装置１００を用いて１６ビットの認識処理用データを生成した場合には、信号レベルが小さな入力音声信号に対しては音声データの合成が行われない認識処理用データが生成され、信号レベルが大きな入力音声信号に対しては音声データの合成が行われて認識処理用データが生成される。合成処理によって認識処理用データに含まれる歪み（誤差）が増加する場合には、合成処理の有無に応じて音声認識装置２００での認識方式を変更することが望ましい。
【００３６】
図３は、音声入力装置および音声認識装置の変形例を示す図である。図３に示す音声入力装置１００Ａは、図１に示した音声入力装置１００に対してレベルメータ３０を追加した点が異なっている。レベルメータ３０は、増幅器１２から出力される音声信号の信号レベルを検出する。この信号レベルが所定値以上になったときに、アナログ−デジタル変換器１４の入力許容電圧範囲を超えて音声データの合成処理が行われるため、レベルメータ３０の検出出力を監視することにより、波形推定部２０から出力される認識処理用データが合成処理によって生成されたものであるか否かを判別することが可能になる。
【００３７】
また、図３に示す音声認識装置２００Ａは、認識処理部２１０、通常音響辞書２１２、歪み学習音響辞書２１４、切替部２１６を備えている。通常音響辞書２１２には、合成処理が行われていない認識処理用データの内容を認識するための照合用波形データが格納されている。また、歪み学習音響辞書２１４には、合成処理が行われた認識処理用データの内容を認識するための照合用波形データが格納されている。切替部２１６は、レベルメータ３０によって検出された音声信号のレベル値が所定値を超えていないときに通常音響辞書２１２を認識処理部２１０に接続し、所定値を超えているときに歪み学習音響辞書２１４を認識処理部２１０に接続する。認識処理部２１０は、音声入力装置１００Ａ内の波形推定部２０から出力される１６ビットの認識処理用データに対して、接続された通常音響辞書２１２あるいは歪み学習音響辞書２１４に格納された照合用波形データを用いて音声認識処理を実行する。
【００３８】
このように、同じ１６ビットの認識処理用データであっても、合成処理によって得られたものか否かによって、使用する辞書を切り替えることにより、認識率をさらに向上させることが可能になる。
〔第２の実施形態〕
図４は、第２の実施形態の音声入力装置の構成を示す図である。図４に示す本実施形態の音声入力装置１００Ｂは、マイクロホン１０、増幅器１２、アナログ−デジタル変換器（Ａ／Ｄ）１４、１８、減衰器１６、波形推定部２０Ｂ、レベルメータ３２、３４を含んで構成されている。この音声入力装置１００Ｂは、図１に示した音声入力装置１００に対して、レベルメータ３２、３４を追加するとともに、波形推定部２０を波形推定部２０Ｂに変更した点が異なっている。
【００３９】
一方のレベルメータ３２は、一方のアナログ−デジタル変換器１４に入力される音声信号の信号レベルを検出する。また、他方のレベルメータ３４は、他方のアナログ−デジタル変換器１８に入力される音声信号の信号レベルを検出する。これらのレベルメータ３２、３４がレベル検出手段に対応する。
【００４０】
波形推定部２０Ｂは、レベルメータ３２、３４によって検出される２つの音声信号のレベル比に基づいて、アナログ−デジタル変換器１４、１８から出力される２つの中間データを合成する。
上述した第１の実施形態では、減衰器１６の利得が（１／２）¹⁵倍に設定されているものとしたが、実際には減衰器１６を構成する素子の製造上のばらつき等があるため、正確にこの利得を実現することは難しい。本実施形態の波形推定部２０Ｂは、レベルメータ３２、３４によって検出される２つの音声信号のレベル比に基づいて２つの中間データを合成する際のビット位置を決定している。例えば、減衰器１６の設計上の利得を（１／２）¹⁵倍に設定したときに、レベルメータ３２、３４の検出結果からこの設計値に一致する音声信号のレベル比（減衰比）が確かめられた場合には、アナログ−デジタル変換器１４、１８から出力される２つの中間データを１５ビットシフトさせて合成データが生成される。また、レベルメータ３２、３４の検出結果から、減衰器１６の実際の利得が（１／２）¹⁴倍であることが確かめられた場合には、アナログ−デジタル変換器１４、１８から出力される２つの中間データを１４ビットシフトさせて合成データが生成される。
【００４１】
このように、本実施形態の音声入力装置１００Ｂでは、製造上のばらつき等を考慮して音声データの合成を行うこにより、歪みの少ない認識処理用データを生成することが可能になり、後段の音声認識装置２００における音声認識処理の認識率を高めることが可能になる。
【００４２】
なお、本発明は上記実施形態に限定されるものではなく、本発明の要旨の範囲内において種々の変形実施が可能である。上述した実施形態では、２つのアナログ−デジタル変換器を用いて音声入力装置１００、１００Ａ、１００Ｂを構成したが、３つ以上のアナログ−デジタル変換器を用いて音声入力装置を構成するようにしてもよい。
【００４３】
図５は、３つ以上のアナログ−デジタル変換器を用いた音声入力装置の構成を示す図である。図５に示す音声入力装置１００Ｃは、マイクロホン１０、増幅器１２、アナログ−デジタル変換器１４、複数の減衰器１６、複数のアナログ−デジタル変換器１８、波形推定部２０Ｃを含んで構成されている。例えば、アナログ−デジタル変換器１４、１８の合計の個数をｎ、それぞれから出力される中間データの符号ビットを除くデータビットのビット長をｍとすると、波形推定部２０Ｃは、ｎ×ｍビットのデータビットに１ビットの符号ビットを加えた合成データを生成し、その中から所定ビット数の認識処理用データを生成する。このように、アナログ−デジタル変換器の数を増やすことにより、入力音声信号を認識処理用データに変換する際のダイナミックレンジをさらに広くすることができる。また、それぞれのアナログ−デジタル変換器は、ビット数の少ない安価なものを用いることができるようになるため、装置全体のコストダウンを図ることが可能になる。
【００４４】
また、上述した各実施形態では、各アナログ−デジタル変換器によって変換される中間データのビット数を全て同じにしたが、異なるビット数のアナログ−デジタル変換器を組み合わせて用いるようにしてもよい。
また、上述した各実施形態では、波形推定部２０等において生成したビット数の多い合成データの中から所定ビット数の認識処理用データを抜き出しているが、音声認識装置２００、２００Ａにおいてこの合成データをそのまま処理することができる場合には、合成データそのものを認識処理用データとして出力するようにしてもよい。このような場合であっても、ビット数が少ないアナログ−デジタル変換器を用いることが可能であり、入力音声信号に対して広いダイナミックレンジを確保しつつ、コストダウンを図ることが可能になる。
【００４５】
また、上述した各実施形態では、減衰器１６を用いて振幅が異なる複数の音声信号を生成しているが、増幅器を用いたり、増幅器と減衰器１６とを組み合わせて用いて振幅が異なる複数の音声信号を生成するようにしてもよい。
【００４６】
【発明の効果】
上述したように、本発明によれば、一つのアナログ−デジタル変換手段の量子化ビット数では足りないようなダイナミックレンジの広い音声信号に対しても、波形の部分的な欠落がない符号化処理を行うことが可能になり、広いダイナミックレンジの確保とともに、音声波形全体が含まれるデータを生成して音声認識装置に入力することによって音声認識処理の認識率を高めることが可能になる。
【図面の簡単な説明】
【図１】第１の実施形態の音声入力装置の構成を示す図である。
【図２】波形推定部の詳細構成を示す図である。
【図３】音声入力装置および音声認識装置の変形例を示す図である。
【図４】第２の実施形態の音声入力装置の構成を示す図である。
【図５】３つ以上のアナログ−デジタル変換器を用いた音声入力装置の構成を示す図である。
【符号の説明】
１０マイクロホン
１２増幅器
１４、１８アナログ−デジタル変換器（Ａ／Ｄ）
１６減衰器
２０、２０Ｂ、２０Ｃ波形推定部
２２倍精度データ生成部
２４音声区間終了判定部
２６有効データ位置監視部
２８認識処理用データ生成部
１００、１００Ａ、１００Ｂ、１００Ｃ音声入力装置
２００、２００Ａ音声認識装置
２１０認識処理部
２１２通常音響辞書
２１４歪み学習音響辞書
２１６切替部
[0001]
[Technical Field of the Invention]
The present invention relates to a voice input device that converts an input voice signal into voice data to be subjected to voice recognition processing.
[0002]
[Prior art]
A voice recognition device that recognizes the content of speech collected by a microphone is known and is applied to, for example, the input device of an in-vehicle navigation device. Such voice recognition devices cannot yet fully recognize the content of speech uttered naturally by a user, and various measures have been devised to increase the recognition rate. For example, a voice recognition device is known in which the gain of the amplifier that amplifies the input speech is set according to the dynamic range of the input speech, thereby adjusting the amplitude of the speech to be recognized (see, for example, Patent Document 1). In this voice recognition device, the amplifier gain is set high for a user with a quiet voice and, conversely, low for a user with a loud voice, so the dynamic range of the speech to be recognized can always be kept at the optimum level and the recognition rate can be increased.
[0003]
[Patent Document 1]
JP 61-180296 A (2nd page, FIG. 1)
[0004]
[Problems to be solved by the invention]
In the voice recognition device disclosed in Patent Document 1, however, the amplifier gain is set based on the dynamic range of the input speech, and the optimum dynamic range is applied only to speech input after that. Consequently, the recognition rate cannot be increased during the interval from the first input of speech until the amplifier gain setting is completed. In addition, the technique cannot be applied when the dynamic range of the input speech itself changes, as when the same user or several users alternate between loud and quiet utterances, so the recognition rate cannot be increased in that case either.
[0005]
In general, the dynamic range of the human voice is said to be about 60 dB, from a whisper to a shout. Moreover, considering that loudness varies from person to person, the dynamic range of speech as a whole is considered to be even larger. When such input speech is converted into voice data using a commonly used analog-digital converter with 16-bit quantization, it corresponds to data in the range of 15 bits to 5 bits.
[0006]
On the other hand, the dynamic range of input speech that voice recognition processing can recognize currently has an upper limit of about 40 dB, which corresponds to data in the range of 15 bits to 9 bits when an analog-digital converter with 16-bit quantization is used. That is, if a quiet voice is input while the gain is set for a loud voice, its content cannot be recognized.
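The relation between the decibel figures and the bit ranges quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the usual rule that each binary digit of amplitude corresponds to 20·log10(2), or about 6.02 dB; the function name is hypothetical, and the bit ranges in the text should be read as approximate.

```python
import math

# Assumption: one binary digit of amplitude spans 20*log10(2) dB (about 6.02 dB).
DB_PER_BIT = 20 * math.log10(2)


def bits_for_dynamic_range(db: float) -> float:
    """Return how many binary digits are needed to span a dynamic range in dB."""
    return db / DB_PER_BIT


if __name__ == "__main__":
    for db in (60, 40):
        print(f"{db} dB spans about {bits_for_dynamic_range(db):.2f} bits")
```

Under this rule, 60 dB spans roughly 10 bits and 40 dB roughly 6 to 7 bits, which is in line with the approximate ranges "15 bits to 5 bits" and "15 bits to 9 bits" given in the text.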
[0007]
The present invention has been created in view of such a point, and an object thereof is to provide a voice input device capable of increasing the recognition rate for voice having a wide dynamic range.
[0008]
[Means for Solving the Problems]
In order to solve the above-described problem, the voice input device of the present invention is a voice input device that is provided in the stage preceding a voice recognition device and generates voice data corresponding to an input voice signal. It comprises: signal generating means for generating a plurality of audio signals having different amplitudes by attenuating or amplifying the input voice signal with a predetermined gain; a plurality of analog-digital conversion means for converting each of the plurality of audio signals generated by the signal generating means into digital data; data synthesizing means for synthesizing the plurality of digital data output from the plurality of analog-digital conversion means; and level detection means for detecting the level of each of the plurality of audio signals generated by the signal generating means. The data synthesizing means synthesizes the plurality of digital data output from the plurality of analog-digital conversion means after shifting their respective bit positions by a number of bits that depends on the ratio of the levels of the plurality of audio signals detected by the level detection means. As a result, even for an audio signal whose dynamic range is so wide that the number of quantization bits of a single analog-digital conversion means is insufficient, encoding can be performed without partial loss of the waveform. A wide dynamic range is secured, and the recognition rate of the voice recognition processing can be increased by generating data that contains the entire speech waveform and inputting it to the voice recognition device. Also, by attenuating or amplifying the input voice signal with a predetermined gain, a plurality of audio signals having different amplitudes (gains) can easily be generated from one input voice signal.
In addition, because level detection means for detecting the level of each of the plurality of audio signals generated by the signal generating means is provided, and the data synthesizing means determines the number of bits to shift during the synthesis processing according to the ratio of the levels detected by the level detection means, the data can be synthesized in a way that takes element constants and manufacturing variations into account, and data with little distortion corresponding to the input voice signal can be generated.
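How a shift amount might be derived from two measured levels can be sketched as follows. This is only an illustration under stated assumptions: the levels are positive linear amplitudes, the shift is the nearest whole number of bits to log2 of their ratio, and the function name is hypothetical. The text itself only says that the number of bits depends on the level ratio.

```python
import math


def shift_bits_from_levels(level_main: float, level_attenuated: float) -> int:
    """Estimate the bit shift from two detected signal levels.

    Assumption: levels are positive linear amplitudes, and the shift is the
    nearest integer to log2(level_main / level_attenuated).
    """
    if level_main <= 0 or level_attenuated <= 0:
        raise ValueError("levels must be positive")
    return round(math.log2(level_main / level_attenuated))


if __name__ == "__main__":
    # Attenuator exactly at the design value (1/2)**15
    print(shift_bits_from_levels(1.0, 2.0 ** -15))
    # Attenuator that actually realizes (1/2)**14
    print(shift_bits_from_levels(1.0, 2.0 ** -14))
```

Rounding to a whole bit is a simplification chosen here so that the combination can be done with plain bit shifts.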
[0009]
It is desirable to further include a microphone that collects sound and outputs an input sound signal. As a result, it is possible to directly convert various sounds collected by the microphone with a wide dynamic range into digital data for speech recognition.
[0014]
The number of analog-digital conversion means described above is two, and it is desirable to use a stereo analog-digital converter. Accordingly, the cost of components can be reduced by using a set of two analog-digital converters that are generally commercially available for stereo use.
[0015]
Further, it is desirable that the data synthesizing means described above accumulate the digital data input during a predetermined period, cut out from each piece of accumulated digital data the data of a predetermined number of bits that contains the audio signal with the largest amplitude within this period, and output it. As a result, data of a predetermined number of bits that includes the peak of the speech waveform necessary for voice recognition can be generated. In addition, because data of a predetermined number of bits is generated, a conventional voice recognition device that performs voice recognition processing on data of a predetermined number of bits can be used, so an increase in the cost of the entire voice recognition processing system can be suppressed.
[0016]
Moreover, it is desirable that the predetermined period described above is a voice input section until the input voice signal is interrupted. This makes it possible to cut out a predetermined number of bits of data holding peak information included in the waveform of a series of voices that are subject to voice recognition, thereby increasing the recognition rate.
[0017]
Alternatively, it is desirable that the predetermined period be a divided period shorter than the voice input section that lasts until the input voice signal is interrupted, and that, when the amplitude of the audio signal in a later divided period is larger than the amplitude of the audio signal in the earlier divided periods, the cutout position of the data of the predetermined number of bits be changed to the position corresponding to this larger amplitude and the process of cutting out and outputting data from each piece of accumulated digital data be repeated starting from the earlier divided periods. As a result, the delay time of the voice recognition processing can be reduced.
[0018]
The voice recognition processing system of the present invention includes the voice input device described above and a voice recognition device that performs voice recognition processing on the data output from this voice input device. For the digital data input during the predetermined period, the position at which the data of the predetermined number of bits is cut out is fixed, and the voice recognition device uses different acoustic dictionaries for the voice recognition processing depending on whether the data output from the data synthesizing means was generated using the digital data output from one analog-digital conversion means or using the digital data output from a plurality of analog-digital conversion means. In particular, it is desirable that the plurality of acoustic dictionaries include a distortion learning acoustic dictionary that takes into account the distortion generated when synthesizing the digital data output from one analog-digital conversion means, and a normal acoustic dictionary that does not take this distortion into account. As a result, voice recognition processing that takes into account the distortion accompanying the synthesis of data becomes possible, and the recognition rate can be increased.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a voice input device according to an embodiment to which the present invention is applied will be described in detail with reference to the drawings.
[First Embodiment]
FIG. 1 is a diagram illustrating the configuration of the voice input device according to the first embodiment. The voice input device 100 of the present embodiment shown in FIG. 1 is provided in the stage preceding a voice recognition device and generates voice data corresponding to an input voice signal. It includes a microphone 10, an amplifier 12, analog-digital converters (A/D) 14 and 18, an attenuator 16, and a waveform estimation unit 20. The voice input device 100 and the voice recognition device 200 connected to its subsequent stage together constitute a voice recognition processing system.
[0020]
The microphone 10 collects the voice of the user that is to be recognized and outputs an input voice signal corresponding to this voice. The amplifier 12 amplifies the input voice signal with a predetermined gain so that its amplitude level can be processed by the analog-digital converters 14 and 18. The one analog-digital converter 14 converts the amplified audio signal output from the amplifier 12 into digital data with a predetermined number of bits. For example, the audio signal is converted into 16-bit voice data (intermediate data) consisting of 1 sign bit and 15 data bits.
[0021]
The attenuator 16 attenuates the audio signal output from the amplifier 12 and outputs the attenuated audio signal. For example, the gain of the attenuation is set to (1/2)¹⁵ times. The other analog-digital converter 18 converts the attenuated audio signal output from the attenuator 16 into digital data with a predetermined number of bits. For example, as with the one analog-digital converter 14, the audio signal is converted into 16-bit voice data (intermediate data) consisting of 1 sign bit and 15 data bits. Note that component cost can be reduced by using a pair of analog-digital converters 14 and 18 of the kind generally sold as a set for stereo use.
[0022]
The waveform estimation unit 20 synthesizes the 16-bit intermediate data output from each of the two analog-digital converters 14 and 18 to generate 31-bit combined data consisting of 1 sign bit and 30 data bits. For example, the waveform estimation unit 20 uses the output data of the analog-digital converter 14 when this output data is not saturated; when it is saturated, the unit estimates the waveform shape of the saturated portion from the unsaturated output data of the other analog-digital converter 18, and thereby synthesizes the intermediate data.
[0023]
FIG. 2 is a diagram illustrating a detailed configuration of the waveform estimation unit 20. As shown in FIG. 2, the waveform estimation unit 20 includes a double precision data generation unit 22, a speech segment end determination unit 24, a valid data position monitoring unit 26, and a recognition processing data generation unit 28.
As described above, the audio signal input to the other analog-digital converter 18 is attenuated to (1/2)¹⁵ times the signal level of the audio signal input to the one analog-digital converter 14. If these audio signals were converted into 31-bit digital data, the positions at which the waveform information of the two signals appears would therefore differ by 15 bits. In practice, however, each of the analog-digital converters 14 and 18 can convert its input audio signal only into 15 bits of data.
[0024]
For this reason, when an audio signal with a large signal level is input, the allowable input voltage of the one analog-digital converter 14 is exceeded, and the signal is converted into voice data with the peak portion of the waveform saturated. At this time, the other analog-digital converter 18 receives the large audio signal in an attenuated state, so the peak portion of the waveform is converted into voice data normally.
[0025]
When an audio signal with a small signal level is input, on the other hand, the signal input to the one analog-digital converter 14 is within the allowable input voltage range, so the peak portion of the waveform is converted into voice data normally.
When the intermediate data output from the one analog-digital converter 14 is not saturated, the double precision data generation unit 22 uses this intermediate data as it is to generate the 30 data bits. When the intermediate data output from the one analog-digital converter 14 is saturated, the unit multiplies the intermediate data output from the other analog-digital converter 18 by 2¹⁵ to generate the 30 data bits. In either case it generates double precision data (combined data) of 31 bits in total, consisting of 1 sign bit and 30 data bits. The sign bit of the intermediate data output from the analog-digital converters 14 and 18 is used as it is for the sign bit of the combined data.
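The selection rule of the double precision data generation unit 22 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: samples are modeled as Python integers with a magnitude of at most 15 data bits, "saturated" is assumed to mean that the magnitude has reached full scale, and the names `FULL_SCALE` and `combine_samples` are hypothetical.

```python
FULL_SCALE = (1 << 15) - 1  # assumed saturation magnitude for 15 data bits


def combine_samples(main: int, sub: int, shift: int = 15) -> int:
    """Combine one sample from converter 14 (main) and converter 18 (sub).

    Assumption: the main sample counts as saturated when its magnitude
    reaches FULL_SCALE. Unsaturated main samples pass through unchanged;
    otherwise the attenuated sample is scaled back up by 2**shift.
    """
    if abs(main) < FULL_SCALE:
        return main
    return sub * (1 << shift)


if __name__ == "__main__":
    print(combine_samples(1000, 0))       # small signal, main path used
    print(combine_samples(32767, 3))      # saturated, rebuilt from the attenuated path
```

Because Python integers carry their own sign, the sign bit of the combined value follows the sample that was used, and the result always fits in 30 data bits for 15-bit inputs.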
[0026]
As another operation example, the double precision data generation unit 22 may place the data bits of the intermediate data output from the other analog-digital converter 18 in the upper 15 of the data bits of the 31-bit voice data, and place the data bits of the intermediate data output from the one analog-digital converter 14 in the lower 15 data bits (setting each of the lower 15 bits to “0” when this intermediate data is saturated), thereby generating combined data of 31 bits in total consisting of 1 sign bit and 30 data bits.
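The alternative bit-placement operation can be illustrated in the same style. The sketch below assumes sign-magnitude samples with 15 data bits, treats a magnitude at full scale as saturated, and takes the sign from the unattenuated sample; `pack_samples` and `FULL_SCALE` are hypothetical names introduced only for this example.

```python
FULL_SCALE = (1 << 15) - 1  # assumed saturation magnitude for 15 data bits


def pack_samples(main: int, sub: int) -> int:
    """Place sub's data bits in the upper 15 bits and main's in the lower 15.

    Assumptions: the lower bits are zeroed when main is saturated, and the
    sign is taken from main (both paths carry the same waveform).
    """
    upper = abs(sub) & 0x7FFF
    lower = 0 if abs(main) >= FULL_SCALE else abs(main) & 0x7FFF
    magnitude = (upper << 15) | lower
    return -magnitude if main < 0 else magnitude


if __name__ == "__main__":
    print(pack_samples(1000, 0))
    print(pack_samples(-32767, -3))
```

For a quiet signal the attenuated path quantizes to zero, so only the lower bits are populated; for a saturated signal only the upper bits are.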
[0027]
The voice segment end determination unit 24 determines the end timing of the voice segment by monitoring the combined data generated by the double precision data generation unit 22. For example, after input of the audio signal has started, when the value of the combined data becomes “0” or falls below a predetermined value, it is determined that one unit of voice input to be recognized has ended.
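A very simple form of this end-of-segment test is sketched below. It follows the rule exactly as stated in the text, so it ends the segment at the first low-level sample after speech has begun; a practical detector would presumably require the low level to persist for some time, since speech waveforms pass through zero, but that refinement is an assumption and is not modeled. The function name and the `threshold` parameter are hypothetical.

```python
def find_segment_end(samples, threshold=0):
    """Return the index of the first sample at or below threshold after
    the signal has started, or None if the segment has not ended."""
    started = False
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            started = True
        elif started:
            return index
    return None


if __name__ == "__main__":
    print(find_segment_end([0, 0, 5, 9, -7, 0, 0]))
```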
[0028]
The valid data position monitoring unit 26 examines the value of each of the 30 data bits included in the combined data, detects the most significant bit position whose value is “1”, and extracts that bit position as the valid data position. When the valid data position of the next combined data that is input lies on the higher-order side, the stored value is updated; otherwise the new position is discarded. In this way, the valid data position corresponding to the voice data with the highest signal level is retained.
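The tracking of the valid data position amounts to a running maximum over the position of the highest set bit. A minimal sketch, assuming combined samples are held as signed Python integers and positions are counted from 1 at the least significant data bit (both conventions, like the function names, are choices made for this example):

```python
def valid_data_position(sample: int) -> int:
    """Position of the highest data bit set to 1 (1 = least significant).

    Returns 0 when no data bit is set.
    """
    return abs(sample).bit_length()


def track_valid_data_position(samples) -> int:
    """Keep the highest valid data position seen over a run of samples."""
    held = 0
    for sample in samples:
        position = valid_data_position(sample)
        if position > held:
            held = position  # update only when the new position is higher
    return held


if __name__ == "__main__":
    print(track_valid_data_position([1000, -98304, 5]))
```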
[0029]
The recognition processing data generation unit 28 accumulates the combined data output from the double precision data generation unit 22 until the voice segment end determination unit 24 determines the end timing of the voice segment. After this accumulation period ends, the recognition processing data generation unit 28 determines an extraction position consisting of the 15 bits from the valid data position detected by the valid data position monitoring unit 26 downward, reads out the combined data in the order of accumulation, extracts the 15 bits of data at this extraction position, and adds the sign bit to generate recognition processing data of 16 bits in total. The recognition processing data generated in this way is input to the voice recognition device 200 connected to the subsequent stage of the voice input device 100.
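The extraction step can be pictured as dropping the same number of low-order bits from every accumulated sample so that the loudest sample just fits in 15 data bits. The sketch below makes that concrete under some assumptions: samples are signed Python integers, the shift is applied to the magnitude so that the sign is preserved separately, and `extract_recognition_data` is a hypothetical name.

```python
def extract_recognition_data(samples, out_bits=15):
    """Cut a fixed window of out_bits data bits out of accumulated samples.

    The window starts at the highest valid data position found in the
    whole segment, so the peak of the waveform is preserved.
    """
    peak_position = max((abs(s).bit_length() for s in samples), default=0)
    drop = max(peak_position - out_bits, 0)  # low-order bits discarded
    result = []
    for s in samples:
        magnitude = abs(s) >> drop
        result.append(-magnitude if s < 0 else magnitude)
    return result


if __name__ == "__main__":
    print(extract_recognition_data([1000, -98304, 5]))
```

When the whole segment already fits in 15 data bits, nothing is dropped and the samples pass through unchanged.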
[0030]
The amplifier 12 and the attenuator 16 described above correspond to signal generation means and gain change means, the analog-digital converters 14 and 18 correspond to analog-digital conversion means, and the waveform estimation unit 20 corresponds to data synthesis means.
As described above, in the voice input device 100 of the present embodiment, the use of the two analog-digital converters 14 and 18 makes it possible to encode, without partial loss of the waveform, even an input voice signal whose dynamic range is so wide that the number of quantization bits of a single analog-digital converter is insufficient. A wide dynamic range is secured, and the recognition rate of the voice recognition processing can be increased by generating data that contains the entire speech waveform and inputting it to the voice recognition device 200.
[0031]
Further, by using each of the amplifier 12 and the attenuator 16 alone or in combination, it is possible to generate two audio signals having different amplitudes (gains) for one input audio signal.
Further, by shifting the two intermediate data by the number of bits corresponding to the gain of the attenuator 16 and synthesizing them, the waveform estimation unit 20 can easily generate combined data with a large number of bits that contains the entire waveform of the original input voice signal.
[0032]
Also, because the waveform estimation unit 20 cuts out recognition processing data of a predetermined number of bits from the combined data with its long bit length and outputs it, a conventional voice recognition device 200 that performs voice recognition processing on data of a predetermined number of bits can be used, and an increase in the cost of the entire voice recognition system can be suppressed.
[0033]
Further, because the waveform estimation unit 20 accumulates the series of combined data until the voice segment ends and cuts out the recognition processing data after the segment has ended, it can cut out, for the series of speech to be recognized, recognition processing data of a predetermined number of bits that retains the peak information contained in the waveform, and the recognition rate can be increased.
[0034]
In the voice input device 100 of the present embodiment described above, the recognition processing data generation unit 28 in the waveform estimation unit 20 accumulates combined data until the voice segment end determination unit 24 determines the end timing of the voice segment and outputs the recognition processing data only after this accumulation period has ended, so a delay time corresponding to the accumulation period occurs. To shorten this delay time, for example, short divided periods may be set and the recognition processing data may be output from the recognition processing data generation unit 28 for each divided period. However, if recognition processing data with a larger value than the recognition processing data output for the earlier divided periods is output, the cutout position within the combined data changes for the recognition processing data of the subsequent divided periods, and the recognition processing data output before then becomes invalid. At that point, it is necessary to notify the subsequent voice recognition device 200 of this so that it interrupts the voice recognition processing, and to output again, from the beginning, the recognition processing data with the changed cutout position. The subsequent voice recognition device 200 performs voice recognition processing using the series of recognition processing data output again in this way.
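The divided-period variant can be sketched as a loop that emits data chunk by chunk and, whenever a louder chunk forces the cutout position upward, re-emits everything stored so far with the new position. This is only an illustration under assumptions: the event labels "data" and "restart", the function names, and the representation of the notification to the recognizer as a returned list are all hypothetical.

```python
def _cut(sample: int, drop: int) -> int:
    """Drop low-order magnitude bits while preserving the sign."""
    magnitude = abs(sample) >> drop
    return -magnitude if sample < 0 else magnitude


def stream_with_restart(chunks, out_bits=15):
    """Process divided periods; re-output from the start when the window moves.

    Returns a list of (kind, samples) events, where kind is "data" for a
    normal divided period and "restart" when all stored data is re-output
    with the changed cutout position.
    """
    stored = []
    drop = 0
    events = []
    for chunk in chunks:
        stored.extend(chunk)
        peak = max((abs(s).bit_length() for s in chunk), default=0)
        needed = max(peak - out_bits, 0)
        if needed > drop:
            drop = needed
            events.append(("restart", [_cut(s, drop) for s in stored]))
        else:
            events.append(("data", [_cut(s, drop) for s in chunk]))
    return events


if __name__ == "__main__":
    for event in stream_with_restart([[1000, -5], [98304, 4], [8]]):
        print(event)
```

The trade-off is visible in the sketch: output starts after the first divided period, but a late loud sample costs one full re-output of the stored data.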
[0035]
When 16-bit recognition processing data is generated using the voice input device 100 of the present embodiment described above, the recognition processing data for an input voice signal with a low signal level is generated without synthesizing voice data, whereas the recognition processing data for an input voice signal with a high signal level is generated by synthesizing voice data. If the synthesis processing increases the distortion (error) contained in the recognition processing data, it is desirable to change the recognition method in the speech recognition apparatus 200 according to whether or not the synthesis processing was performed.
[0036]
FIG. 3 is a diagram illustrating a modification of the voice input device and the voice recognition device. The voice input device 100A shown in FIG. 3 differs from the voice input device 100 shown in FIG. 1 in that a level meter 30 is added. The level meter 30 detects the signal level of the audio signal output from the amplifier 12. When this signal level exceeds a predetermined value, the signal exceeds the allowable input voltage range of the analog-digital converter 14 and voice data synthesis processing is performed. Therefore, by monitoring the detection output of the level meter 30, it is possible to determine whether or not the recognition processing data output from the waveform estimation unit 20 was generated by the synthesis processing.
[0037]
The speech recognition apparatus 200A shown in FIG. 3 includes a recognition processing unit 210, a normal acoustic dictionary 212, a distortion learning acoustic dictionary 214, and a switching unit 216. The normal acoustic dictionary 212 stores collation waveform data for recognizing the content of recognition processing data that has not been subjected to synthesis processing. The distortion learning acoustic dictionary 214 stores collation waveform data for recognizing the content of recognition processing data that has undergone synthesis processing. The switching unit 216 connects the normal acoustic dictionary 212 to the recognition processing unit 210 when the level value of the audio signal detected by the level meter 30 does not exceed the predetermined value, and connects the distortion learning acoustic dictionary 214 to the recognition processing unit 210 when the level value exceeds the predetermined value. The recognition processing unit 210 executes voice recognition processing on the 16-bit recognition processing data output from the waveform estimation unit 20 in the voice input device 100A, using the collation waveform data of whichever of the normal acoustic dictionary 212 and the distortion learning acoustic dictionary 214 is connected.
[0038]
In this way, even if the same 16-bit recognition processing data is used, it is possible to further improve the recognition rate by switching the dictionary to be used depending on whether it is obtained by the synthesis processing.
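The switching rule of the switching unit 216 reduces to a single threshold comparison, which the following Python sketch illustrates. The names `AcousticDictionary` and `select_dictionary` are hypothetical, and the numeric level and threshold values used in any call are placeholders, since the text does not specify the predetermined value.

```python
from enum import Enum


class AcousticDictionary(Enum):
    # Enum values reuse the reference numerals of the two dictionaries.
    NORMAL = 212
    DISTORTION_LEARNING = 214


def select_dictionary(level, threshold):
    """Return the dictionary to connect to the recognition processing unit.

    The distortion learning dictionary is chosen only when the detected
    level exceeds the threshold, i.e. when synthesis processing took place.
    """
    if level > threshold:
        return AcousticDictionary.DISTORTION_LEARNING
    return AcousticDictionary.NORMAL


if __name__ == "__main__":
    print(select_dictionary(0.4, 1.0))
    print(select_dictionary(1.6, 1.0))
```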
[Second Embodiment]
FIG. 4 is a diagram illustrating a configuration of the voice input device according to the second embodiment. The voice input device 100B shown in FIG. 4 includes a microphone 10, an amplifier 12, analog-digital converters (A/D) 14 and 18, an attenuator 16, a waveform estimation unit 20B, and level meters 32 and 34. This voice input device 100B differs from the voice input device 100 shown in FIG. 1 in that the level meters 32 and 34 are added and the waveform estimation unit 20 is replaced with the waveform estimation unit 20B.
[0039]
One level meter 32 detects the signal level of the audio signal input to one analog-digital converter 14. The other level meter 34 detects the signal level of the audio signal input to the other analog-digital converter 18. These level meters 32 and 34 correspond to level detecting means.
[0040]
The waveform estimation unit 20B synthesizes the two intermediate data output from the analog-digital converters 14 and 18 based on the level ratio between the two audio signals detected by the level meters 32 and 34.
In the first embodiment described above, the gain of the attenuator 16 is (1/2)¹⁵. However, it is difficult to realize this gain accurately because the elements constituting the attenuator 16 are in practice subject to manufacturing variations. The waveform estimation unit 20B according to the present embodiment therefore determines the bit position used when combining the two intermediate data based on the level ratio of the two audio signals detected by the level meters 32 and 34. For example, when the design gain of the attenuator 16 is set to (1/2)¹⁵ and the detection results of the level meters 32 and 34 confirm a level ratio (attenuation ratio) of the audio signals that matches this design value, the two intermediate data output from the analog-digital converters 14 and 18 are shifted by 15 bits relative to each other to generate composite data. If, on the other hand, the detection results of the level meters 32 and 34 show that the actual gain of the attenuator 16 is (1/2)¹⁴, the two intermediate data output from the analog-digital converters 14 and 18 are shifted by 14 bits relative to each other to generate composite data.
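The relationship between the measured level ratio and the shift amount can be sketched in Python as below. This is a hedged illustration, not the embodiment's circuit: the function names, the rounding of the ratio to the nearest power of two, and in particular the rule in `combine` that substitutes the attenuated sample when the direct sample sits at full scale are assumptions made only to produce a runnable example.

```python
import math


def shift_from_levels(level_direct, level_attenuated):
    """Number of bits to shift, from the levels seen by the two level meters.

    Assumption: the ratio is rounded to the nearest power of two.
    """
    if level_direct <= 0 or level_attenuated <= 0:
        raise ValueError("levels must be positive")
    return round(math.log2(level_direct / level_attenuated))


def combine(direct, attenuated, shift, full_scale=32767):
    """Combine one pair of intermediate samples into a wide sample.

    Assumption: when the direct sample is at full scale (clipped), the
    attenuated sample shifted up by `shift` bits is used in its place.
    """
    if abs(direct) >= full_scale:
        return attenuated << shift
    return direct


if __name__ == "__main__":
    shift = shift_from_levels(1.0, 2 ** -14)
    print(shift, combine(32767, 3, shift))
```

Measuring the ratio instead of trusting the design value means that an attenuator whose actual gain is (1/2)¹⁴ yields a 14-bit shift, so the two data streams line up at the correct scale.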
[0041]
As described above, the voice input device 100B according to the present embodiment synthesizes voice data in consideration of manufacturing variations and the like, and can therefore generate recognition processing data with less distortion. This makes it possible to increase the recognition rate of the speech recognition processing in the speech recognition apparatus 200.
[0042]
The present invention is not limited to the above embodiments, and various modifications are possible within the scope of the gist of the present invention. In the above-described embodiments, the voice input devices 100, 100A, and 100B are configured using two analog-digital converters. However, the voice input device may be configured using three or more analog-digital converters.
[0043]
FIG. 5 is a diagram showing a configuration of a voice input device using three or more analog-digital converters. A voice input device 100C shown in FIG. 5 includes a microphone 10, an amplifier 12, an analog-digital converter 14, a plurality of attenuators 16, a plurality of analog-digital converters 18, and a waveform estimation unit 20C. For example, if the total number of analog-digital converters 14 and 18 is n and the bit length of the data bits, excluding the sign bit, of the intermediate data output from each is m, the waveform estimation unit 20C generates composite data by adding one sign bit to n × m data bits, and generates recognition processing data having a predetermined number of bits from the composite data. By increasing the number of analog-digital converters in this way, the dynamic range available when converting the input voice signal into recognition processing data can be widened further. In addition, since an inexpensive analog-digital converter with a small number of bits can be used for each converter, the cost of the entire apparatus can be reduced.
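The bit-length arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the dynamic range figure relies on the general rule of thumb of 20·log10(2), about 6.02 dB, per data bit, which is a standard property of linear quantization rather than a value stated for this device.

```python
import math


def composite_bits(n_converters, data_bits):
    """Total width of the composite data: n x m data bits plus one sign bit."""
    return n_converters * data_bits + 1


def dynamic_range_db(total_data_bits):
    """Approximate dynamic range, assuming about 6.02 dB per data bit."""
    return 20 * math.log10(2) * total_data_bits


if __name__ == "__main__":
    for n in (2, 3):
        bits = composite_bits(n, 15)
        print(n, bits, round(dynamic_range_db(bits - 1), 1))
```

With two 16-bit converters (m = 15) this gives 31-bit composite data, and adding a third converter extends it to 46 bits.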
[0044]
In each of the embodiments described above, the number of bits of the intermediate data converted by each analog-digital converter is all the same, but analog-digital converters having different numbers of bits may be used in combination.
In each of the above-described embodiments, recognition processing data having a predetermined number of bits is extracted from the synthesized data having a large number of bits generated by the waveform estimation unit 20 or the like. However, if the speech recognition apparatuses 200 and 200A can process this synthesized data as it is, the synthesized data itself may be output as the recognition processing data. Even in such a case, analog-digital converters having a small number of bits can be used, and the cost can be reduced while a wide dynamic range is ensured for the input audio signal.
[0045]
In each of the above-described embodiments, a plurality of audio signals having different amplitudes are generated using the attenuator 16. However, the plurality of audio signals having different amplitudes may instead be generated using an amplifier, or using a combination of an amplifier and the attenuator 16.
[0046]
[Effects of the Invention]
As described above, according to the present invention, encoding without partial loss of the waveform can be performed even for an audio signal whose dynamic range is so wide that the number of quantization bits of a single analog-digital conversion means is insufficient. In addition to ensuring a wide dynamic range, the recognition rate of the speech recognition processing can be increased by generating data that contains the entire speech waveform and inputting it to the speech recognition apparatus.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of a voice input device according to a first embodiment.
FIG. 2 is a diagram illustrating a detailed configuration of a waveform estimation unit.
FIG. 3 is a diagram showing a modification of the voice input device and the voice recognition device.
FIG. 4 is a diagram illustrating a configuration of a voice input device according to a second embodiment.
FIG. 5 is a diagram showing a configuration of a voice input device using three or more analog-digital converters.
[Explanation of symbols]
10 Microphone
12 Amplifier
14, 18 Analog-digital converter (A / D)
16 Attenuator
20, 20B, 20C Waveform estimation unit
22 Double precision data generator
24 Voice segment end determination unit
26 Valid data position monitoring unit
28 Recognition processing data generator
100, 100A, 100B, 100C Voice input device
200, 200A Voice recognition device
210 Recognition processing unit
212 Normal acoustic dictionary
214 Distortion learning acoustic dictionary
216 Switching unit

Claims

A voice input device that is provided in a front stage of a voice recognition device and generates voice data corresponding to an input voice signal,
Signal generating means for generating a plurality of audio signals having different amplitudes by attenuating or amplifying the input audio signal with a predetermined gain;
A plurality of analog-digital conversion means for converting each of the plurality of audio signals generated by the signal generation means into digital data;
Data combining means for combining the plurality of digital data output from the plurality of analog-digital conversion means;
Level detecting means for detecting the level of each of the plurality of audio signals generated by the signal generating means;
wherein the data combining means combines the plurality of digital data output from the plurality of analog-digital conversion means by shifting their respective bit positions by a number of bits corresponding to the ratio of the levels of the plurality of audio signals detected by the level detecting means.

In claim 1,
A voice input device further comprising a microphone for collecting voice and outputting the input voice signal.

In claim 1 or 2,
The voice input device, wherein the number of the analog-digital conversion means is two, and a stereo analog-digital converter is used for them.

In any one of Claims 1-3,
The voice input device, wherein the data combining means accumulates the digital data input during a predetermined period, and performs processing of cutting out, from each accumulated digital data, data of a predetermined number of bits that contains the audio signal having the largest amplitude in the period, and outputting the cut-out data.

In claim 4,
The voice input device, wherein the predetermined period is a voice input section until the input voice signal is interrupted.

In claim 4,
The voice input device, wherein the predetermined period is a divided period shorter than the voice input section that lasts until the input audio signal is interrupted, and when the amplitude of the audio signal corresponding to a subsequent divided period is larger than the amplitude of the audio signal corresponding to a previous divided period, the cut-out position of the data of the predetermined number of bits is changed to a position corresponding to this larger amplitude, and the processing of cutting out and outputting from each accumulated digital data is repeated starting from the previous divided period.

A speech recognition processing system comprising: the speech input device according to any one of claims 4 to 6; and a speech recognition device that performs speech recognition processing on data output from the speech input device,
wherein the position for cutting out the data of the predetermined number of bits is fixed for the digital data input during the predetermined period, and
the speech recognition device switches among a plurality of acoustic dictionaries used for speech recognition processing depending on whether the data output from the data combining means was generated using the digital data output from one analog-digital conversion means or using the digital data output from a plurality of analog-digital conversion means.

In claim 7,
The speech recognition processing system, wherein the plurality of acoustic dictionaries include a distortion learning acoustic dictionary that takes into account the distortion that occurs when the digital data output from the analog-digital conversion means are combined, and a normal acoustic dictionary that does not take this distortion into account.