JP6577930B2

JP6577930B2 - Feature extraction device, acoustic model learning device, acoustic model selection device, feature extraction method, and program

Info

Publication number: JP6577930B2
Application number: JP2016225632A
Authority: JP
Inventors: 弘章伊藤; 智子川瀬; 和則小林
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-21
Filing date: 2016-11-21
Publication date: 2019-09-18
Anticipated expiration: 2036-11-21
Also published as: JP2018084594A

Description

The present invention relates to a technique for extracting a feature amount from an impulse response that represents the acoustic characteristics of a user environment, and for evaluating the similarity of impulse responses.

In a system that uses speech as an input interface, performance depends heavily on the environment in which the user operates the system. Performance degradation can stem either from the utterance itself or from the sound environment. To address degradation caused by the sound environment, the user environment must be reproduced on the system. Conventional approaches either reproduce the environment using data recorded in the real environment or simulate the user environment by computer simulation. Non-Patent Document 1 describes a method that trains an acoustic model on training speech generated by convolving artificial reverberation impulse responses with clean speech, and then measures speech recognition performance.

Non-Patent Document 1: 西亀健太, 渡部晋治, 西本卓也, 小野順貴, 嵯峨山茂樹, "Study of speech recognition robust to unknown reverberant environments using speech under multiple reverberation characteristics for single-model training" (複数残響特性下の音声を単一モデル学習に用いた未知残響環境に頑健な音声認識の検討), IEICE Technical Report, vol. 108(66), pp. 43-48, May 2008.

Recording speech data in a real environment is costly. Simulating the user environment by computer reduces the cost, but requires a guideline for how to perform the simulation. If a feature amount could be extracted from information acquired in the user environment (an impulse response) and the similarity of acoustic characteristics could be computed from that feature amount, appropriate data for simulating the user environment could be selected. However, it has not been clear which feature amount allows the similarity of acoustic characteristics to be evaluated appropriately.

In view of the above, an object of the present invention is to provide a feature extraction technique that makes it possible to select data suited to a user environment based on a feature amount of an impulse response.

To solve the above problem, a feature extraction device according to a first aspect of the present invention includes: a feature amount calculation unit that calculates a feature amount representing the ratio of the power of late reverberation in an impulse response to the total power; an impulse response measurement unit that measures an impulse response of a user environment; a data storage unit that stores data related to a plurality of different acoustic environments in association with the feature amounts calculated from the impulse responses measured in those acoustic environments; and a data selection unit that selects data related to the acoustic environment corresponding to the user environment, based on the distance between the feature amount calculated from the impulse response of the user environment and the feature amounts calculated from the impulse responses of the acoustic environments.

An acoustic model learning device according to a second aspect of the present invention includes: a clean speech storage unit that stores clean speech; an impulse response storage unit that stores impulse responses measured in a plurality of different acoustic environments; a feature amount calculation unit that calculates a feature amount representing the ratio of the power of late reverberation in an impulse response to the total power; an impulse response classification unit that classifies the impulse responses into a plurality of groups based on the distances between the feature amounts of the impulse responses of the acoustic environments; a learning data generation unit that convolves each impulse response classified into each group with the clean speech to generate training speech corresponding to each group; and an acoustic model learning unit that trains an acoustic model corresponding to each group using the training speech corresponding to that group.

An acoustic model selection device according to a third aspect of the present invention includes: a feature amount calculation unit that calculates a feature amount representing the ratio of the power of late reverberation in an impulse response to the total power; an impulse response measurement unit that measures an impulse response of a user environment; an acoustic model storage unit that stores acoustic models trained on training speech corresponding to a plurality of different acoustic environments, in association with the feature amounts calculated from the impulse responses measured in those acoustic environments; and an acoustic model selection unit that selects the acoustic model corresponding to the user environment, based on the distance between the feature amount of the impulse response of the user environment and the feature amount of the impulse response corresponding to each acoustic model.

According to the present invention, the similarity of acoustic characteristics can be evaluated from the distance between feature amounts calculated from impulse responses, so data suited to the user environment can be selected. In particular, an acoustic model used for speech recognition can be trained and used in accordance with the acoustic characteristics of the user environment, which improves speech recognition accuracy.

FIG. 1 is a diagram illustrating the functional configuration of the feature extraction device.
FIG. 2 is a diagram illustrating the processing procedure of the feature extraction method.
FIG. 3 is a diagram illustrating the functional configuration of the acoustic model learning device.
FIG. 4 is a diagram illustrating the functional configuration of the acoustic model selection device.
FIG. 5 is a diagram illustrating the processing procedure of the speech recognition method.
FIG. 6 is a diagram showing experimental results.
FIG. 7 is a diagram showing experimental results.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

<First embodiment>
The first embodiment of the present invention is a feature extraction device and method that select data for simulating a user environment based on a feature amount calculated from an impulse response measured in that user environment. As shown in FIG. 1, the feature extraction device of the first embodiment includes an impulse response measurement unit 1, a feature amount calculation unit 2, a data selection unit 3, and a data storage unit 9. The feature extraction method of the first embodiment is realized by this feature extraction device performing the processing of each step shown in FIG. 2.

The feature extraction device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The feature extraction device executes each process under the control of the central processing unit, for example. Data input to the feature extraction device and data obtained in each process are stored, for example, in the main storage device, and the stored data is read out as needed and used in other processes. At least some of the processing units of the feature extraction device may be implemented in hardware such as an integrated circuit. Each storage unit of the feature extraction device can be implemented by, for example, a main storage device such as RAM; an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory; or middleware such as a relational database or a key-value store. The storage units need only be logically separate and may reside on a single physical storage device.

The data storage unit 9 of the feature extraction device stores data for simulating a plurality of acoustic environments, in association with the feature amounts calculated from the impulse responses measured in the respective acoustic environments. These feature amounts are calculated by the feature amount calculation unit 2 described later. The data for simulating an acoustic environment is the data that conventional techniques require to run a computer simulation, and its content depends on the specifications of the simulation. Here, it is assumed to be the waveform data of the impulse response measured in the acoustic environment.

In step S1, the impulse response measurement unit 1 of the feature extraction device measures an impulse response representing the spatial acoustic characteristics of the user environment. The measured impulse response is denoted h(t) below, where t represents time. The impulse response may be measured using any known impulse response measurement technique. The measured impulse response h(t) is sent to the feature amount calculation unit 2.

In step S2, the feature amount calculation unit 2 of the feature extraction device receives the impulse response h(t) from the impulse response measurement unit 1, separates h(t) into a direct sound component, an early reflection component, and a late reverberation component, and calculates the ratio of the power of the late reverberation component to the total power as the feature amount. The calculated feature amount is denoted D below. The feature amount D can be calculated by, for example, Expression (1).

Here, t₁ denotes the time at which late reverberation starts. The calculated feature amount D of the impulse response h(t) is sent to the data selection unit 3.
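The definition of D given in words above can be sketched for a sampled impulse response as D = (sum of h[n]^2 for n >= n1) / (sum of h[n]^2 over all n), where n1 is the sample index corresponding to t₁. This is a minimal illustration under stated assumptions, not the patent's own implementation: the function name late_reverb_ratio, the sampling rate fs, the choice of t1, and the toy impulse response are all hypothetical.

```python
from typing import Sequence


def late_reverb_ratio(h: Sequence[float], fs: float, t1: float) -> float:
    """Ratio of late-reverberation power to total power of an impulse response.

    h  : impulse response samples, with h[0] assumed to be at t = 0
    fs : sampling rate in Hz (assumption; the text does not fix one)
    t1 : start time of late reverberation in seconds (assumption: set by the caller)
    """
    if fs <= 0 or t1 < 0:
        raise ValueError("fs must be positive and t1 must be non-negative")
    start = int(round(t1 * fs))  # first sample counted as late reverberation
    total = sum(x * x for x in h)
    if total == 0.0:
        raise ValueError("impulse response has zero energy")
    late = sum(x * x for x in h[start:])
    return late / total


# Toy impulse response (hypothetical): a direct-sound spike, a short gap,
# then a weak, exponentially decaying tail.
h = [1.0] + [0.0] * 3 + [0.1 * (0.5 ** n) for n in range(8)]
D = late_reverb_ratio(h, fs=1000.0, t1=0.004)
```

D always lies between 0 and 1: a value near 0 indicates an environment dominated by direct sound and early reflections, and a larger value indicates a more reverberant environment.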

In step S3, the data selection unit 3 of the feature extraction device receives the feature amount D of the impulse response h(t) from the feature amount calculation unit 2 and, based on the distance between D and the feature amount calculated from each impulse response stored in the data storage unit 9, selects and outputs data for simulating the user environment. The output data is used to simulate the user environment in computer simulation, which makes it possible to analyze the user environment. The distance between feature amounts may be the absolute value of their difference or the Euclidean distance. Data is selected with a certain tolerance rather than only when the distance is zero. For example, feature amounts are also regarded as matching when the distance between them is shorter than a predetermined threshold. This prevents performance degradation caused by slight changes in the user environment.
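The selection in step S3 can be sketched as follows, using the absolute difference as the distance and a threshold as the tolerance. The function name, the environment names, and the threshold value are hypothetical; the stored D values are borrowed from the experiment section purely as sample numbers.

```python
from typing import Dict, List


def select_environments(d_user: float, stored: Dict[str, float],
                        threshold: float) -> List[str]:
    """Return the stored environments whose feature amount lies closer than
    `threshold` to the user environment's feature amount, nearest first."""
    distances = {name: abs(d_user - d) for name, d in stored.items()}
    matches = [name for name, dist in distances.items() if dist < threshold]
    return sorted(matches, key=lambda name: distances[name])


# Hypothetical store: environment name -> feature amount D of its impulse response.
stored = {"room_a": 0.028, "room_b": 0.056, "room_c": 0.075}
selected = select_environments(d_user=0.060, stored=stored, threshold=0.02)
```

With these sample values, room_b and room_c fall within the tolerance and room_a does not, so a small drift in the user environment still maps to usable data.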

With the above configuration, the feature extraction device of the first embodiment can extract a feature amount with which the similarity of impulse responses representing spatial acoustic characteristics can be evaluated. Based on the similarity (distance) between the feature amounts of impulse responses, data for simulating the user environment can therefore be selected appropriately from data prepared in advance for simulating various acoustic environments.

<Second embodiment>
The second embodiment of the present invention comprises an acoustic model learning device and method that generate training speech simulating various acoustic environments and use it to train acoustic models corresponding to those environments, and an acoustic model selection device and method that select an acoustic model suited to the user environment from among the acoustic models corresponding to the various acoustic environments.

As shown in FIG. 3, the acoustic model learning device of the second embodiment includes an impulse response storage unit 11, a feature amount calculation unit 12, an impulse response classification unit 13, a learning data generation unit 14, an acoustic model learning unit 15, an acoustic model storage unit 16, and a clean speech storage unit 19. As shown in FIG. 4, the acoustic model selection device of the second embodiment includes the acoustic model storage unit 16, an impulse response measurement unit 21, a feature amount calculation unit 22, and an acoustic model selection unit 23. The speech recognition method of the second embodiment is realized by the acoustic model learning device and the acoustic model selection device executing the steps shown in FIG. 5.

The impulse response storage unit 11 of the acoustic model learning device stores a plurality of impulse responses measured in various acoustic environments.

The clean speech storage unit 19 of the acoustic model learning device stores clean speech prepared in advance.

In step S12, the feature amount calculation unit 12 of the acoustic model learning device calculates, for each impulse response stored in the impulse response storage unit 11, the ratio of the power of the late reverberation component to the total power as the feature amount. The calculation method is the same as that of the feature amount calculation unit 2 of the first embodiment. The calculated feature amount of each impulse response is sent to the impulse response classification unit 13.

In step S13, the impulse response classification unit 13 of the acoustic model learning device receives the feature amounts of the plurality of impulse responses from the feature amount calculation unit 12 and classifies the impulse responses into a plurality of groups based on the distances between the feature amounts. The impulse responses are sent, group by group, to the learning data generation unit 14.
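The text does not specify how the groups are formed from the distances. As one possible sketch only, the impulse responses can be sorted by D and a new group started whenever the gap to the previous value exceeds a limit. The function name, the max_gap parameter, and the sample feature values below are hypothetical.

```python
from typing import Dict, List


def group_by_feature(features: Dict[str, float], max_gap: float) -> List[List[str]]:
    """Group impulse responses whose sorted feature amounts are separated by
    at most `max_gap` (a simple one-dimensional gap-based grouping)."""
    ordered = sorted(features, key=lambda name: features[name])
    groups: List[List[str]] = []
    for name in ordered:
        # Join the current group if this value is close to its last member.
        if groups and features[name] - features[groups[-1][-1]] <= max_gap:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups


# Hypothetical feature amounts for five stored impulse responses.
features = {"ir1": 0.027, "ir2": 0.029, "ir3": 0.055, "ir4": 0.058, "ir5": 0.075}
groups = group_by_feature(features, max_gap=0.005)
```

Any other clustering over the one-dimensional feature amount (for example, fixed-width bins) would fit the description equally well; the gap-based rule is chosen here only because it needs no preset number of groups.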

In step S14, the learning data generation unit 14 of the acoustic model learning device receives the impulse responses for each group from the impulse response classification unit 13 and convolves each impulse response in a group with the clean speech stored in the clean speech storage unit 19 to generate the training speech corresponding to that group. For example, if there are 10 clean speech samples and a group contains 5 impulse responses, 50 training speech samples are generated. The generated training speech is sent, group by group, to the acoustic model learning unit 15.
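The generation of training speech in step S14 can be sketched with a direct-form convolution. This is an illustration under assumptions: the function names and the toy signals are hypothetical, and a practical system would normally use an FFT-based convolution for long signals.

```python
from typing import List, Sequence


def convolve(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Full linear convolution of a signal x with an impulse response h."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def make_training_speech(clean_utterances: Sequence[Sequence[float]],
                         impulse_responses: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convolve every impulse response of one group with every clean utterance."""
    return [convolve(x, h) for h in impulse_responses for x in clean_utterances]


# Toy data (hypothetical): 10 clean utterances and a group of 5 impulse responses.
clean = [[1.0, 0.5 * k, -0.25] for k in range(10)]
group_irs = [[1.0, 0.1 * k] for k in range(1, 6)]
training = make_training_speech(clean, group_irs)
```

As in the example in the text, 10 clean utterances and 5 impulse responses yield 50 training utterances for the group.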

In step S15, the acoustic model learning unit 15 of the acoustic model learning device receives the training speech for each group from the learning data generation unit 14 and trains an acoustic model for each group using the training speech corresponding to that group. The trained acoustic model for each group is stored in the acoustic model storage unit 16 in association with the impulse responses included in that group.

In step S21, the impulse response measurement unit 21 of the acoustic model selection device measures an impulse response h(t) representing the spatial acoustic characteristics of the user environment. The measurement method is the same as that of the impulse response measurement unit 1 of the first embodiment. The measured impulse response h(t) is sent to the feature amount calculation unit 22.

In step S22, the feature amount calculation unit 22 of the acoustic model selection device receives the impulse response h(t) from the impulse response measurement unit 21, separates h(t) into a direct sound component, an early reflection component, and a late reverberation component, and calculates the ratio of the power of the late reverberation component to the total power as the feature amount D. The calculation method is the same as that of the feature amount calculation unit 2 of the first embodiment. The calculated feature amount D of the impulse response h(t) is sent to the acoustic model selection unit 23.

In step S23, the acoustic model selection unit 23 of the acoustic model selection device receives the feature amount D of the impulse response h(t) from the feature amount calculation unit 22 and, based on the distance between D and the feature amounts calculated from the impulse responses corresponding to each acoustic model stored in the acoustic model storage unit 16, selects and outputs the acoustic model corresponding to the user environment. The output acoustic model is used to recognize speech picked up in the user environment. The distance calculation and the selection method are the same as those of the data selection unit 3 of the first embodiment: the model is selected with a certain tolerance to prevent performance degradation caused by slight changes in the user environment.
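The selection in step S23 can be sketched as a nearest-feature lookup with a tolerance. The function name, the model names, the threshold, and the assumption that a clean speech model corresponds to D = 0 are all hypothetical; the other D values are sample numbers taken from the experiment section.

```python
from typing import Dict, List, Optional


def select_acoustic_model(d_user: float, model_features: Dict[str, List[float]],
                          threshold: float) -> Optional[str]:
    """Return the model whose associated impulse-response feature amounts
    contain the value nearest to d_user, or None if nothing is within
    `threshold`."""
    best_name: Optional[str] = None
    best_dist = float("inf")
    for name, feats in model_features.items():
        if not feats:
            continue  # skip models with no associated impulse responses
        dist = min(abs(d_user - d) for d in feats)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None


# Hypothetical store: model name -> feature amounts of its group's impulse responses.
models = {"model_clean": [0.0], "model_d0028": [0.028], "model_d0075": [0.075]}
chosen = select_acoustic_model(d_user=0.056, model_features=models, threshold=0.03)
```

For a user environment with D = 0.056 and no exactly matching model, this sketch picks the D = 0.075 model as the nearest one within the tolerance.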

With the above configuration, the acoustic model learning device and the acoustic model selection device of the second embodiment select an acoustic model whose impulse response feature amount is close to that of the user environment. An acoustic model appropriate for the user environment can therefore be used, which improves speech recognition accuracy.

<Experimental results>
Training data corresponding to a plurality of acoustic environments was generated by computer simulation, and speech recognition experiments were carried out. FIGS. 6 and 7 show the results.

FIG. 6 shows the relationship between the distance between feature amounts and the recognition rate. Acoustic models corresponding to various acoustic environments were trained from the training speech (training data) for each environment, and the recognition rates obtained when each acoustic model was used to recognize speech recorded in a certain user environment (evaluation data) were plotted. The horizontal axis is the distance between the feature amounts of the training data and the evaluation data, and the vertical axis is the recognition rate. The distance between feature amounts and the recognition rate are highly correlated: the smaller the distance, the higher the recognition rate.

FIG. 7 shows how the recognition rate changes when speech recorded in a certain user environment (evaluation data) is recognized using acoustic models trained on training speech (training data) corresponding to several types of acoustic environment. The graph shows the recognition rates obtained when four types of speech (clean speech, and speech from acoustic environments with D=0.028, D=0.056, and D=0.075) were recognized with an acoustic model trained on speech from an acoustic environment with a matching feature amount and with an acoustic model trained on speech from an acoustic environment with a non-matching feature amount. The recognition rate level drawn as a thick dotted line (80%) represents the performance at which speech recognition is considered practical.

For example, when speech from the acoustic environment with D=0.028 was recognized with an acoustic model trained on clean speech (the clean speech model) and with an acoustic model trained on speech from the D=0.028 environment (the D=0.028 model), the latter gave the higher recognition rate. That is, a higher recognition rate is achieved by using an acoustic model whose feature amount matches or is similar. In every pattern, practical recognition accuracy is achieved when the feature amounts match. For speech from the environment with D=0.056, the experiment used the clean speech model and the D=0.075 model (that is, neither feature amount matches), and the D=0.075 model, whose feature amount is closer, gave the higher recognition rate. This means that even when the feature amounts do not match exactly, an acoustic model that is adequate in practice can be chosen by selecting with a certain tolerance.

With the above configuration, the feature extraction technique of the present invention allows the user environment to be simulated simply by the user selecting the usage environment, which improves convenience. When applied to a speech recognition system, the amount of speech data that must be recorded in the actual user environment can be reduced, so the cost of training acoustic models can be greatly reduced. In addition, because an acoustic model close to the actual user environment can be trained, the speech recognition rate in the real environment can be improved.

Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments. Design changes made as appropriate without departing from the spirit of the present invention are, needless to say, included in the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program that was read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions solely through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer; however, at least part of the processing may instead be implemented in hardware.

Description of symbols

1 Impulse response measurement unit
2 Feature amount extraction unit
3 Data selection unit
9 Data storage unit
11 Impulse response storage unit
12 Feature amount extraction unit
13 Impulse response classification unit
14 Learning data generation unit
15 Acoustic model learning unit
19 Clean speech storage unit
21 Impulse response measurement unit
22 Feature amount extraction unit
23 Acoustic model selection unit

Claims

1. A feature amount extraction device that extracts a feature amount of a user environment in order to appropriately select data for simulating the user environment from data prepared to simulate various acoustic environments, the device comprising:
a feature amount calculation unit that calculates a feature amount representing the ratio of the power of the late reverberation in an impulse response to the total power of the impulse response;
an impulse response measurement unit that measures the impulse response of the user environment;
a data storage unit that stores, in association with each other, data relating to a plurality of different acoustic environments and the feature amounts calculated from the impulse responses measured in those acoustic environments; and
a data selection unit that selects the data relating to the acoustic environment corresponding to the user environment, based on the distance between the feature amount calculated from the impulse response of the user environment and the feature amount calculated from the impulse response of each acoustic environment.
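
To make the feature and the selection step concrete, the following is a minimal sketch of how the power ratio of the "rear" (late) reverberation and a distance-based selection might be computed. It is not the claimed implementation. The claim does not state where late reverberation begins, so the 50 ms boundary after the direct-sound peak, the absolute difference used as the distance, and the names `late_reverb_ratio`, `select_environment`, and `synthetic_ir` are all assumptions introduced for illustration. The impulse responses are synthetic exponential decays, not measured data.

```python
import math


def late_reverb_ratio(ir, sample_rate, boundary_ms=50.0):
    """Ratio of late-reverberation power to total power of an impulse response.

    Assumption: "late" means everything more than boundary_ms after the
    direct-sound peak; the 50 ms default is a common convention, not a value
    taken from the claim.
    """
    total = sum(x * x for x in ir)
    if total == 0.0:
        raise ValueError("impulse response has no energy")
    # Index of the direct sound, taken here as the largest absolute sample.
    peak = max(range(len(ir)), key=lambda i: abs(ir[i]))
    start = peak + int(round(boundary_ms * 1e-3 * sample_rate))
    late = sum(x * x for x in ir[start:])
    return late / total


def select_environment(user_feature, stored):
    """Return the name of the stored environment whose feature is nearest.

    Assumption: the feature is a scalar, so the distance is |a - b|.
    """
    return min(stored, key=lambda name: abs(stored[name] - user_feature))


def synthetic_ir(decay_s, sample_rate=1000, length_s=1.0):
    """Toy exponentially decaying impulse response (hypothetical test data)."""
    n = int(length_s * sample_rate)
    return [math.exp(-t / (decay_s * sample_rate)) for t in range(n)]


# Features stored in advance for two hypothetical acoustic environments.
stored_features = {
    "small_room": late_reverb_ratio(synthetic_ir(0.05), 1000),
    "large_hall": late_reverb_ratio(synthetic_ir(0.40), 1000),
}

# Feature of the (synthetic) user environment, then nearest-feature selection.
user_feature = late_reverb_ratio(synthetic_ir(0.35), 1000)
selected = select_environment(user_feature, stored_features)
```

In this toy setup the slowly decaying user response lies closest to the `large_hall` entry, so the data associated with that environment would be the one selected.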

2. An acoustic model learning device that learns an acoustic model suited to each acoustic environment from learning speech generated using impulse responses selected from impulse responses measured in various acoustic environments prepared in advance, the device comprising:
a clean speech storage unit that stores clean speech;
an impulse response storage unit that stores impulse responses measured in a plurality of different acoustic environments;
a feature amount calculation unit that calculates a feature amount representing the ratio of the power of the late reverberation in an impulse response to the total power of the impulse response;
an impulse response classification unit that classifies the impulse responses into a plurality of groups based on the distances between the feature amounts of the impulse responses of the acoustic environments;
a learning data generation unit that convolves each of the impulse responses classified into each group with the clean speech to generate learning speech corresponding to that group; and
an acoustic model learning unit that learns an acoustic model corresponding to each group using the learning speech corresponding to that group.
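
The classification and learning-data generation steps can be sketched as follows. This is an illustration under stated assumptions rather than the claimed method: the claim does not name a clustering algorithm, so a simple gap-based grouping of scalar features stands in for the classification unit, and `max_gap`, the toy impulse responses, the feature values, and all function names are hypothetical. Acoustic model training itself is omitted; the sketch stops at producing one set of learning speech per group.

```python
def convolve(signal, ir):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def group_by_feature(features, max_gap):
    """Group item indices whose sorted features differ by at most max_gap.

    Assumption: this gap-based grouping of scalar features is a stand-in for
    whatever distance-based classification is actually used.
    """
    if not features:
        return []
    order = sorted(range(len(features)), key=lambda i: features[i])
    groups = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if features[cur] - features[prev] <= max_gap:
            groups[-1].append(cur)   # close to the previous item: same group
        else:
            groups.append([cur])     # large gap: start a new group
    return groups


def make_learning_speech(clean_utterances, impulse_responses, groups):
    """For each group, convolve every clean utterance with every member IR."""
    return [
        [convolve(u, impulse_responses[i]) for i in g for u in clean_utterances]
        for g in groups
    ]


# Hypothetical toy data: three short impulse responses and their features.
impulse_responses = [[1.0, 0.1], [1.0, 0.5, 0.4], [1.0, 0.12]]
features = [0.01, 0.29, 0.014]  # illustrative late-reverberation power ratios
clean_utterances = [[1.0, -1.0, 0.5]]

groups = group_by_feature(features, max_gap=0.05)
learning_speech = make_learning_speech(clean_utterances, impulse_responses, groups)
```

Each entry of `learning_speech` would then be passed to whatever training procedure produces the acoustic model for that group.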

3. An acoustic model selection device that selects an acoustic model suited to a user environment from acoustic models, each suited to an acoustic environment, learned using learning speech corresponding to various acoustic environments prepared in advance, the device comprising:
a feature amount calculation unit that calculates a feature amount representing the ratio of the power of the late reverberation in an impulse response to the total power of the impulse response;
an impulse response measurement unit that measures the impulse response of the user environment;
an acoustic model storage unit that stores, in association with each other, acoustic models learned using learning speech corresponding to a plurality of different acoustic environments and the feature amounts calculated from the impulse responses measured in those acoustic environments; and
an acoustic model selection unit that selects the acoustic model corresponding to the user environment, based on the distance between the feature amount of the impulse response of the user environment and the feature amount of the impulse response corresponding to each acoustic model.

4. A feature amount extraction method for extracting a feature amount of a user environment in order to appropriately select data for simulating the user environment from data prepared to simulate various acoustic environments, wherein:
a data storage unit stores data relating to a plurality of different acoustic environments;
an impulse response measurement unit measures the impulse response of the user environment;
a feature amount calculation unit calculates a feature amount representing the ratio of the power of the late reverberation in the impulse response of the user environment to the total power of that impulse response;
the feature amount calculation unit calculates a feature amount representing the ratio of the power of the late reverberation in the impulse response measured in each acoustic environment to the total power of that impulse response; and
a data selection unit selects the data relating to the acoustic environment corresponding to the user environment, based on the distance between the feature amount calculated from the impulse response of the user environment and the feature amount calculated from the impulse response of each acoustic environment.

5. A program for causing a computer to function as the feature amount extraction device according to claim 1, the acoustic model learning device according to claim 2, or the acoustic model selection device according to claim 3.