JP2005221752A

JP2005221752A - Speech recognition apparatus, speech recognition method and program

Info

Publication number: JP2005221752A
Application number: JP2004029344A
Authority: JP
Inventors: Tsukasa Shimizu; 司清水; Toshihiro Wakita; 敏裕脇田
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2004-02-05
Filing date: 2004-02-05
Publication date: 2005-08-18

Abstract

PROBLEM TO BE SOLVED: To obtain a highly robust speech recognition result while improving semantic suitability.

SOLUTION: A first decoder 71 computes N or more maximum likelihood solutions and their likelihoods (scores) using the speech feature quantity extracted by a feature quantity extraction part 2, the phoneme HMM stored in an acoustic model storage part 3, and the word 2-gram model stored in a language model storage part 4. The top-ranked solution among these maximum likelihood solutions is then selected using the concept expression pattern model stored in a concept expression pattern model storage part 5 and the concept appearance pattern model stored in a concept appearance pattern model storage part 6, and is taken as the final recognition result.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a program, and more particularly to a speech recognition apparatus, speech recognition method, and program that improve semantic suitability.

Commonly used speech recognition apparatuses employ, in addition to an acoustic model that statistically models speech features, a language model that statistically models the local concatenation of words as a linguistic constraint (for example, a word n-gram model). However, relying on the language model's constraints alone can produce recognition results that are syntactically or semantically ill-formed as a whole.

To address this, a speech recognition apparatus has been proposed that improves recognition accuracy by parsing the top N maximum likelihood recognition candidates obtained from a conventional speech recognition apparatus with a context-free grammar prepared in advance as grammatical knowledge, judging their grammatical well-formedness, and suppressing the output of grammatically ill-formed candidates (see, for example, Patent Document 1).

Also proposed are a language model generation device that generates a language model which represents sentence structure more accurately and analytically and yields a high recognition rate, and a speech recognition apparatus that performs speech recognition using this language model (see, for example, Patent Document 2).

In Patent Document 2, a word (or word class) N-gram model is generated from training text data into which bigram markers have been inserted using a transfer-driven machine translation device (TDML device). This N-gram model carries local sentence-structure constraints. In addition, from the syntax trees obtained by parsing the marker-inserted training text data, a word pattern model is generated that captures the concatenation relations among word patterns consisting of the words contained in a single subtree or in several adjoining subtrees. The word pattern model models the relationship between a partial word string and related words at a long distance.

By using these models, the technique described in Patent Document 2 incorporates not only the conventional local word-chain constraint but also local sentence-structure constraints and constraints between distant words that a conventional word N-gram model cannot capture, thereby aiming to improve speech recognition accuracy.

Patent Document 1: JP 2000-293196 A
Patent Document 2: JP 2002-268676 A

Spoken language is inherently less grammatically strict than written language, so a user (speaker) does not necessarily utter grammatically well-formed sentences. Moreover, it is nearly impossible to write a grammar that covers such grammatically variable spoken language.

The speech recognition apparatus of Patent Document 1, however, may reject recognition results that are grammatically ill-formed. Consequently, when the speaker's utterance was itself grammatically ill-formed, the apparatus may reject the result at the grammaticality judgment stage even if the utterance was recognized correctly at the initial stage.

Furthermore, because it relies solely on parsing with a context-free grammar, the above apparatus may accept a result as correct as long as it is syntactically valid, even if it is semantically ill-formed. For example, the sentence “I want to go to Naka-ku in Chikusa-ku” is semantically ill-formed but syntactically correct, so it could be accepted as correct.

In Patent Document 2, if the training text data contains grammatically ill-formed utterances, the approach can be expected to be robust, to some extent, against grammatically variable spoken language. However, recognition result candidates contain word substitutions, insertions, and deletions caused by misrecognition, so parsing them is not always possible.

Because the technique of Patent Document 2 must parse each candidate and obtain a syntax tree, it cannot be said to be highly robust to recognition result candidates that contain many misrecognitions. As with Patent Document 1, it may also accept a candidate that is syntactically correct but semantically ill-formed.

The present invention has been proposed to solve the problems described above, and its object is to provide a speech recognition apparatus, a speech recognition method, and a program that offer improved semantic suitability and high robustness.

The present invention comprises: feature quantity extraction means for extracting a feature quantity of a speech signal; acoustic model storage means for storing an acoustic model that statistically models the feature quantity; language model storage means for storing a language model that statistically models chains of words; concept appearance pattern model storage means for storing a concept appearance pattern model having a plurality of appearance patterns, each representing an ordered combination of one or more concepts; and recognition means for recognizing the speech represented by the speech signal based on the feature quantity extracted by the feature quantity extraction means, the acoustic model stored in the acoustic model storage means, the language model stored in the language model storage means, and the concept appearance pattern model stored in the concept appearance pattern model storage means.

The feature quantity extraction means extracts a feature quantity obtained by analyzing the speech signal. The acoustic model storage means stores an acoustic model that statistically models the feature quantity. The acoustic model is not limited to phoneme units; it may use other subword units such as syllables or morae.

The language model storage means stores a language model that statistically models chains of words. Here a “word” may be a phoneme, a word, a morpheme, or a combination of these. In other words, the language model is a model of linguistic constraints.

The concept appearance pattern model storage means stores a concept appearance pattern model. The concept appearance pattern model represents a plurality of appearance patterns, each giving an ordered combination of one or more concepts. The order in which concepts are combined carries a semantic constraint; that is, it is semantically well-formed. An appearance pattern therefore disregards which words, and how many, occur between the concepts, yet is always semantically well-formed with respect to the order of the concepts. A score may be associated with each appearance pattern.

The recognition means recognizes the speech represented by the speech signal by applying the acoustic model, the language model, and the concept appearance pattern model to the feature quantity extracted by the feature quantity extraction means.

According to the invention, therefore, the semantic suitability of speech recognition can be improved by recognizing the speech represented by the speech signal based on the feature quantity of the speech signal, the acoustic model, the language model, and the concept appearance pattern model having a plurality of appearance patterns that each represent an ordered combination of one or more concepts.

In the invention, the recognition means may take a plurality of recognition result candidates, together with their scores, obtained from the feature quantity, the acoustic model, and the language model; weight each candidate's score using a concept appearance pattern model in which a score is associated with each appearance pattern; and output the candidate whose weighted score ranks highest.

The invention may further comprise expression pattern model storage means for storing an expression pattern model representing a plurality of expression patterns for each concept, and the recognition means may additionally use the expression pattern model stored in the expression pattern model storage means to recognize the speech. Here, an expression pattern contains a word representing a given concept, or the partial word string surrounding that word. By including the surrounding partial word string and not only the word representing the concept, robustness can be increased without being affected by word substitutions or deletions caused by misrecognition.

Furthermore, in the invention, the recognition means may additionally weight the score of each of the plurality of recognition result candidates, obtained from the feature quantity, the acoustic model, and the language model, using a concept expression pattern model in which a score is associated with each expression pattern.

According to the speech recognition apparatus, speech recognition method, and program of the present invention, the semantic suitability of speech recognition can be improved by recognizing the speech represented by the speech signal based on the feature quantity of the speech signal, the acoustic model, the language model, and the concept appearance pattern model having a plurality of appearance patterns that each represent an ordered combination of one or more concepts.

In the following, the principle of the present invention is described first, and then the best mode for carrying out the invention is described in detail with reference to the drawings.

[Principle of the present invention]
The problem of speech recognition can be viewed as the problem of finding the word sequence W that maximizes (Equation 1).

where
W: word sequence $W_1, W_2, W_3, \ldots, W_M$
O: speech feature sequence $O_1, O_2, O_3, \ldots, O_N$

That is, the problem is to find the word sequence W that maximizes the product of the probability P(W) of the word sequence W, obtained from the language model, and the probability P(O|W) of the speech feature sequence given the word sequence W, obtained from the acoustic model.
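The image for (Equation 1) is not reproduced in this text. From the description of the product of the language model probability and the acoustic model probability, it corresponds to the standard decoding criterion, which can be written as follows (the hat notation for the selected word sequence is an assumption):

```latex
\hat{W} = \operatorname*{argmax}_{W} \; P(W)\, P(O \mid W) \tag{1}
```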

Relative to (Equation 1), the present invention adds the score $\mathrm{Score}(w_i^{i+K} \mid C_j)$ obtained from the concept expression pattern model whenever a partial word string $w_i^{i+K}$ in the word sequence W corresponds to a predefined concept $C_j$, and also adds the score $\mathrm{Score}(C_{1j}, C_{2j}, \ldots, C_{Lj})$ obtained from the concept appearance pattern model based on the appearance pattern (the sequence of concepts) in W. Here, $\mathrm{Score}(w_i^{i+K} \mid C_j)$ is a score determined from the frequency with which the partial word string $w_i^{i+K}$ appears within the concept $C_j$. $\mathrm{Score}(C_{1j}, C_{2j}, \ldots, C_{Lj})$, where each $C_{ij}$ is any one of the plurality of concepts, is a score determined from the frequency with which the concepts $C_{ij}$ co-occur in a single utterance.

The speech recognition problem of the present invention can therefore be viewed as the problem of maximizing (Equation 2).

Although (Equation 2) uses addition and multiplication for convenience, the present invention is not limited to this. As long as the conventional recognition likelihood, which uses only the acoustic model and the language model, can be weighted by the score of the concept expression pattern model and the score of the concept appearance pattern model, subtraction or division may be used instead of the addition and multiplication above.
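The image for (Equation 2) did not survive in this text. One form consistent with the description above, in which the two concept scores are added to the conventional product of probabilities, is sketched below. It is offered only as a reconstruction under that reading, not as the published formula; the summation over matched partial word strings and the exact placement of the added terms are assumptions.

```latex
\hat{W} = \operatorname*{argmax}_{W} \Bigl[\, P(W)\, P(O \mid W)
  + \sum_{j} \mathrm{Score}\bigl(w_i^{i+K} \mid C_j\bigr)
  + \mathrm{Score}\bigl(C_{1j}, C_{2j}, \ldots, C_{Lj}\bigr) \Bigr] \tag{2}
```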

[Embodiment]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention. This embodiment describes, as an example, a speech recognition apparatus suited to use in a destination task dialogue system. That is, the speech recognition apparatus of this embodiment receives speech containing at least one of the “city name”, “address”, “store name”, and “business type” needed to set a destination, and recognizes the destination.

The speech recognition apparatus comprises: a microphone 1 that generates a speech signal in response to a speaker's utterance; a feature quantity extraction unit 2 that extracts the feature quantity of the speech signal (the speech feature quantity); an acoustic model storage unit 3 that stores an acoustic model; a language model storage unit 4 that stores a language model; a concept expression pattern model storage unit 5 that stores a concept expression pattern model; a concept appearance pattern model storage unit 6 that stores a concept appearance pattern model; a decoder 7 that performs speech recognition using the speech feature quantity and each model; and an output unit 8 that outputs the recognition result to a predetermined system.

The microphone 1 generates a speech signal corresponding to the speech uttered by the speaker and supplies this signal to the feature quantity extraction unit 2. The feature quantity extraction unit 2 analyzes the speech signal supplied from the microphone 1, cuts out speech segments delimited by silence, and extracts the speech feature quantity.

The acoustic model storage unit 3 stores, for example, phoneme hidden Markov models (HMMs) as the acoustic model. A phoneme HMM is a model that statistically represents the speech feature quantity for each phoneme unit, such as “a” or “i”. Depending on the application, the models may instead be defined per syllable or per word.

The language model storage unit 4 stores, for example, a word 2-gram (bigram) model as the language model that statistically models chains of words. A word 2-gram model is a language model in which the probability of a word depends only on the single immediately preceding word, and it is trained, for example, on training text data.
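As an illustration of a word 2-gram model, the following minimal sketch estimates bigram probabilities by relative frequency from tokenized training sentences. The function names, the toy corpus, and the use of "<" and ">" as sentence-boundary tokens are assumptions made for this example; the patent does not specify how its model is trained or smoothed.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    history_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        # "<" and ">" mark the start and end of an utterance (assumed convention).
        tokens = ["<"] + list(words) + [">"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: n / history_counts[pair[0]] for pair, n in pair_counts.items()}


def sentence_probability(model, words):
    """P(W) as a product of bigram probabilities; unseen pairs get 0 (no smoothing)."""
    tokens = ["<"] + list(words) + [">"]
    p = 1.0
    for pair in zip(tokens, tokens[1:]):
        p *= model.get(pair, 0.0)
    return p


# Hypothetical toy corpus.
corpus = [["go", "to", "restaurant"], ["go", "to", "station"]]
model = train_bigram(corpus)
```

Because each word is conditioned only on its predecessor, a model of this kind cannot express constraints between words that are far apart in the utterance.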

The concept expression pattern model storage unit 5 stores a concept expression pattern model representing expression patterns, each of which contains a word that is key to understanding the meaning or intent of an utterance, or the partial word string surrounding that word.

In the destination task dialogue system of this embodiment, it is important to identify the words in an utterance that represent concepts such as address, store name, and business type. For example, “soba shop” (ソバ屋), which belongs to the concept “business type”, also appears in various other expression patterns such as “ソバ屋さん”, “おソバ屋さん”, and “ソバの店”. Similarly, “ramen shop” (ラーメン屋), which also belongs to the concept “business type”, appears in other expression patterns such as “ラーメン店”. The concept expression pattern model is therefore obtained by learning, from training text data, the multiple expression patterns that represent a given concept and modeling each expression pattern within that concept.

FIG. 2 is a diagram showing the concept expression pattern model stored in the concept expression pattern model storage unit 5. In this embodiment, the concept expression pattern model represents the “expression patterns” and their “scores” within each of the concepts “city name (city)”, “detailed address (address)”, “store name (name)”, and “business type (type)”.

The “expression patterns” are extracted, for example manually, from the training text data used to train the word 2-gram model. For example, as shown in FIG. 2, the concept “business type (type)” contains “expression patterns” such as ソバ屋, ソバ屋さん, おソバ屋さん, レストラン (restaurant), 飲食店 (eating place), ラーメン屋, and ラーメン店. A “score” is associated with each “expression pattern”.

The “score” is a value based on the probability with which the “expression pattern” appears within the “concept”. For example, the score “0.152” for レストラン (restaurant) is a value based on the probability with which “restaurant” appears within the concept “business type”. The same holds for the other concepts: multiple expression patterns are prepared for each concept, and a score is associated with each expression pattern.
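A concept expression pattern model of this kind can be pictured as a table mapping each concept to its expression patterns and scores. The sketch below assumes the score is simply the relative frequency of a pattern within its concept; the text only says the score is "based on" that probability, so the exact formula, the function name, and the sample data are all hypothetical.

```python
from collections import Counter


def build_expression_model(labeled_phrases):
    """Map concept -> {expression pattern: relative frequency within the concept}."""
    counts = {}
    for concept, phrase in labeled_phrases:
        counts.setdefault(concept, Counter())[phrase] += 1
    model = {}
    for concept, phrase_counts in counts.items():
        total = sum(phrase_counts.values())
        model[concept] = {p: n / total for p, n in phrase_counts.items()}
    return model


# Hypothetical hand-labeled (concept, expression pattern) pairs from training text.
samples = [
    ("type", "レストラン"),
    ("type", "レストラン"),
    ("type", "ソバ屋"),
    ("type", "ラーメン店"),
    ("name", "ナカヨシまで"),
]
expression_model = build_expression_model(samples)
```

With this toy data, the scores within each concept sum to 1, so a frequent pattern such as レストラン contributes more to a candidate's score than a rare one.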

The concept appearance pattern model storage unit 6 stores a concept appearance pattern model that models the statistical tendency of the appearance patterns among concepts when one or more concepts appear in a single utterance (one recognition unit).

FIG. 3 is a diagram showing the concept appearance pattern model stored in the concept appearance pattern model storage unit 6. The concept appearance pattern model represents the “appearance patterns”, which are arbitrary ordered combinations of the four concepts above, and their “scores”.

The “appearance patterns” are created from text in which each “expression pattern” appearing in the training text data has been replaced with the concept symbol (for example, [city] or [type]) indicating the corresponding “concept”. As a result, an “appearance pattern” carries a semantic constraint on the order of the “concepts” even when many words of no particular significance occur between them. A “score” is associated with each “appearance pattern”. The “score” is a value based on the concept co-occurrence probability.
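The construction of appearance patterns can be sketched as follows, assuming the training utterances have already been rewritten so that every expression pattern is replaced by its concept symbol. Only the ordered sequence of concept symbols is kept; the words between them are discarded. Treating the score as a relative frequency over utterances is an assumption, as are the function name and the romanized filler words in the sample data.

```python
from collections import Counter


def build_appearance_model(annotated_utterances):
    """Map each ordered tuple of concept symbols to its relative frequency."""
    patterns = Counter()
    for tokens in annotated_utterances:
        # Keep only concept symbols such as "[city]"; ordinary words are ignored.
        concepts = tuple(t for t in tokens if t.startswith("[") and t.endswith("]"))
        if concepts:
            patterns[concepts] += 1
    total = sum(patterns.values())
    return {pattern: n / total for pattern, n in patterns.items()}


# Hypothetical annotated utterances ("<" and ">" are head and terminal symbols).
utterances = [
    ["<", "[address]", "[name]", "ikitai", ">"],
    ["<", "eeto", "[address]", "no", "[name]", ">"],
    ["<", "[city]", "no", "[type]", ">"],
    ["<", "[type]", ">"],
]
appearance_model = build_appearance_model(utterances)
```

Note that the first two utterances yield the same pattern even though different words surround the concepts, which is the property that makes the model insensitive to filler words.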

The decoder 7 comprises a first decoder 71 that outputs a plurality of recognition result candidates and a second decoder 72 that outputs the final recognition result from among those candidates.

The first decoder 71 computes a plurality of (for example, N or more) maximum likelihood solutions and their likelihoods (scores) using the speech feature quantity extracted by the feature quantity extraction unit 2, the phoneme HMMs stored in the acoustic model storage unit 3, and the word 2-gram model stored in the language model storage unit 4. Each maximum likelihood solution is obtained by connecting acoustic models into a network according to the language model, and is what is commonly called a recognition result candidate. The likelihood of a maximum likelihood solution obtained by the first decoder 71 (the likelihood parameter L described later) is expressed, for example, by (Equation 3). The parameters are the same as in (Equation 1).

The second decoder 72 uses the concept expression pattern model stored in the concept expression pattern model storage unit 5 and the concept appearance pattern model stored in the concept appearance pattern model storage unit 6 to select the top-ranked solution from among the plurality of maximum likelihood solutions, and takes this solution as the final recognition result.

FIG. 4 is a flowchart showing the processing procedure by which the second decoder 72 obtains the final recognition result.

In step ST1, the second decoder 72 sets the top N maximum likelihood solutions, and their scores, from among the plurality of maximum likelihood solutions obtained by the first decoder 71 into the maximum likelihood solution matrix C. It then sets i = 1 and proceeds to step ST2. Here, i is a natural number from 1 to N.

In step ST2, the second decoder 72 takes the i-th maximum likelihood solution from the maximum likelihood solution matrix C, sets it in the maximum likelihood solution parameter R, copies its likelihood (score) to the likelihood parameter L, and proceeds to step ST3. The processing from step ST3 onward is described using the following example as the i-th maximum likelihood solution.

FIG. 5(A) shows the i-th maximum likelihood solution as text, FIG. 5(B) illustrates the processing when the concept expression pattern model is applied to the i-th maximum likelihood solution, and FIG. 5(C) illustrates the processing when the concept appearance pattern model is applied to the i-th maximum likelihood solution.

As shown in FIG. 5(A), the i-th maximum likelihood solution in step ST2 is, as in the conventional approach, one of the recognition result candidates obtained with the acoustic model and the language model. Here, “<” denotes the head symbol and “>” denotes the terminal symbol.

In step ST3, the second decoder 72 matches the “expression patterns” in the concept expression pattern model against all partial word strings in the maximum likelihood solution parameter R, using longest match. It then assigns the corresponding concept symbol to each matched partial word string, adds the score of the matching “expression pattern” to the likelihood parameter L, and proceeds to step ST4.

For example, in the maximum likelihood solution parameter R shown in FIG. 5(A), “千種区にある” (“in Chikusa-ku”) matches one of the “expression patterns” in the concept “detailed address”, as shown in FIG. 5(B). The concept symbol [address] is therefore assigned to “千種区にある”, and the score corresponding to this expression pattern (= Score(千種区にある | [address])) is added to the likelihood parameter L.

In addition, “ナカヨシまで” (“to Nakayoshi”) matches one of the “expression patterns” in the concept “store name”. The concept symbol [name] is therefore assigned to “ナカヨシまで”, and the score corresponding to this expression pattern (= Score(ナカヨシまで | [name])) is further added to the likelihood parameter L. The likelihood parameter L obtained by the second decoder 72 at this point is expressed, for example, by (Equation 4).
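Step ST3 can be sketched as a greedy left-to-right longest-match scan over the tokens of a candidate. This is a minimal illustration, not the patented implementation: the token segmentation, the score values, and the function name `tag_concepts` are hypothetical, and the sketch replaces each matched word string with its concept symbol, whereas the patent describes assigning the symbol to the string.

```python
def tag_concepts(tokens, expression_model):
    """Longest-match tagging; returns (tagged tokens, total score to add to L)."""
    # Flatten concept -> {token tuple: score} into token tuple -> (concept, score).
    lexicon = {}
    for concept, patterns in expression_model.items():
        for phrase, score in patterns.items():
            lexicon[tuple(phrase)] = (concept, score)
    max_len = max((len(p) for p in lexicon), default=0)

    tagged, added, i = [], 0.0, 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            hit = lexicon.get(tuple(tokens[i:i + n]))
            if hit:
                concept, score = hit
                tagged.append(f"[{concept}]")
                added += score
                i += n
                break
        else:
            tagged.append(tokens[i])  # no expression pattern starts here
            i += 1
    return tagged, added


# Hypothetical model and candidate; scores are illustrative only.
example_model = {
    "address": {("千種区", "に", "ある"): 0.25, ("千種区",): 0.125},
    "name": {("ナカヨシ", "まで"): 0.5},
}
candidate = ["<", "千種区", "に", "ある", "ナカヨシ", "まで", ">"]
tagged, added = tag_concepts(candidate, example_model)
```

Because the longest span is tried first, “千種区 に ある” is tagged as a whole instead of the shorter pattern “千種区”, which is the behavior longest match is meant to guarantee.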

In step ST4, the second decoder 72 extracts the head symbol, the concept symbols, and the terminal symbol from the maximum likelihood solution parameter R, and matches the extracted sequence of one or more concept symbols, in order, against the “appearance patterns” of the concept appearance pattern model. It then obtains the score of the “appearance pattern” that matches the order of the concept symbols, adds this score to the likelihood parameter L, and proceeds to step ST5.

For example, the concept symbols [address] and [name] are extracted, in that order, from the text shown in FIG. 5(B). Since the appearance pattern ([address][name]) matches this sequence of concept symbols, the corresponding score (= Score([address], [name])) is added to the likelihood parameter L, as shown in FIG. 5(C). The likelihood parameter L obtained by the second decoder 72 at this point is expressed, for example, by (Equation 5).
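Step ST4 amounts to a table lookup keyed by the ordered sequence of concept symbols. In the sketch below, the function name, the sample scores, and the choice to add nothing when no appearance pattern matches are assumptions; the text does not say how an unmatched sequence is handled. For simplicity the sketch keeps only the concept symbols and drops the head and terminal symbols.

```python
def appearance_score(tagged_tokens, appearance_model, default=0.0):
    """Return (concept sequence, score) for a candidate tagged with concept symbols."""
    concepts = tuple(
        t for t in tagged_tokens if t.startswith("[") and t.endswith("]")
    )
    # Order matters: ("[address]", "[name]") and ("[name]", "[address]") differ.
    return concepts, appearance_model.get(concepts, default)


# Hypothetical appearance pattern scores.
sample_appearance_model = {
    ("[address]", "[name]"): 0.5,
    ("[name]", "[address]"): 0.125,
}
sequence, score = appearance_score(
    ["<", "[address]", "[name]", ">"], sample_appearance_model
)
```

A candidate whose concepts occur in an order never seen in training, such as an address followed by a city, receives the default score and is therefore ranked lower than a candidate with a well-attested order.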

ステップＳＴ５では、第２のデコーダ７２は、最尤解行列Ｃにおけるｉ番目の最尤解の尤度を尤度パラメータＬの値に更新して、ステップＳＴ６に移行する。 In step ST5, the second decoder 72 updates the likelihood of the i-th maximum likelihood solution in the maximum likelihood matrix C to the value of the likelihood parameter L, and proceeds to step ST6.

ステップＳＴ６では、第２のデコーダ７２は、ｉ＝Ｎであるかを判定し、肯定判定のときはステップＳＴ８に移行し、否定判定のときはステップＳＴ７に移行する。 In step ST6, the second decoder 72 determines whether i = N. If the determination is affirmative, the process proceeds to step ST8. If the determination is negative, the process proceeds to step ST7.

In step ST7, the second decoder 72 increments i (i = i + 1) and returns to step ST2. In this way, the second decoder 72 applies the processing of steps ST2 through ST5 to each of the N maximum likelihood solutions and their scores, and then proceeds to step ST8.

In step ST8, the second decoder 72 sorts the N maximum likelihood solutions in the maximum likelihood solution matrix C in order of likelihood and outputs the top-ranked solution as the final recognition result. The final recognition result data is then digital-to-analog converted and supplied to the output unit 8.
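The overall loop of steps ST2 through ST8 can be sketched in Python as below. This is an illustration under stated assumptions, not the patent's implementation: the function `rescore_n_best`, the two toy scoring callbacks, the English example hypotheses, and every numeric value are hypothetical. Higher likelihood is assumed to be better.

```python
def rescore_n_best(n_best, expression_score, appearance_score):
    """Re-rank (text, likelihood) pairs from the first decoder.

    expression_score(text) -> (concept_symbols, score)   # expression patterns
    appearance_score(concept_symbols) -> score           # step ST4
    """
    rescored = []
    for text, likelihood in n_best:              # loop over i = 1..N (ST2, ST6, ST7)
        symbols, bonus = expression_score(text)  # expression-pattern scores
        likelihood += bonus
        likelihood += appearance_score(symbols)  # ST4: appearance-pattern score
        rescored.append((text, likelihood))      # ST5: store the updated likelihood
    rescored.sort(key=lambda cand: cand[1], reverse=True)  # ST8: sort by likelihood
    return rescored[0][0], rescored


# Toy models (assumed phrases and scores, for illustration only).
TOY_EXPRESSIONS = {
    "in Kariya": ("[address]", 0.5),
    "to Nakayoshi": ("[name]", 0.5),
}
TOY_APPEARANCES = {("[address]", "[name]"): 1.0}


def toy_expression_score(text):
    # Collect matching phrases in the order they occur in the text.
    hits = sorted(
        (text.find(phrase), symbol, score)
        for phrase, (symbol, score) in TOY_EXPRESSIONS.items()
        if phrase in text
    )
    return [symbol for _, symbol, _ in hits], sum(score for _, _, score in hits)


def toy_appearance_score(symbols):
    return TOY_APPEARANCES.get(tuple(symbols), 0.0)


n_best = [
    ("in carrier to Nakayoshi", -10.0),  # first-pass top hypothesis
    ("in Kariya to Nakayoshi", -10.5),   # semantically consistent alternative
]
best, ranked = rescore_n_best(n_best, toy_expression_score, toy_appearance_score)
```

In this toy run the second hypothesis overtakes the first after rescoring, because it matches both expression patterns and the [address][name] appearance pattern.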

As described above, the speech recognition apparatus according to the present embodiment applies, to a plurality of maximum likelihood solutions, a concept appearance pattern model that imposes strong semantic constraints between concepts. It can thereby take into account semantic relations between distant concepts that an N-gram model cannot capture, and so improve recognition accuracy.

In addition, by using a concept expression pattern model that groups semantically important keywords and their surrounding words into expression patterns, the speech recognition apparatus imposes relatively strong grammatical constraints on partial word sequences of an utterance and can thereby improve recognition accuracy.

The present invention is not limited to the embodiment described above and is equally applicable to designs modified within the scope of the matters recited in the claims.

FIG. 6 is a block diagram showing another configuration of the speech recognition apparatus.

This speech recognition apparatus comprises a microphone 1 that obtains a speech signal; a feature extraction unit 2 that extracts speech features from the speech signal; an acoustic model storage unit 3 that stores an acoustic model; a language model storage unit 4 that stores a language model; a concept expression pattern model storage unit 5 that stores a concept expression pattern model; a concept appearance pattern model storage unit 6 that stores a concept appearance pattern model; a decoder 9 that performs speech recognition using the speech features and each of the models; and an output unit 8 that outputs the recognition result to a predetermined system.

The decoder 7 shown in FIG. 1 first computes N maximum likelihood solutions using the acoustic model and the language model, and then uses the concept expression pattern model and the concept appearance pattern model to output the top-ranked solution as the final recognition result. In contrast, the decoder 9 shown in FIG. 6 uses the acoustic model, the language model, the concept expression pattern model, and the concept appearance pattern model together to output the final recognition result in a single pass.

The concept expression pattern model may also be incorporated into the language model (language model storage unit 4) in advance. The speech recognition apparatus can then use the concept expression pattern model as a language model. Furthermore, if a finite state automaton (FSA) is built in advance from the four models described above, only the FSA needs to be used at decoding time.

Furthermore, instead of using the concept expression pattern model and the concept appearance pattern model at decoding time, the speech recognition apparatus may use a separate model, created offline in advance, that incorporates the knowledge of these models.

Needless to say, application of the speech recognition apparatus is not limited to a destination-setting task dialogue system.

The concept expression pattern model is not limited to "concept patterns" centered on nominals, as shown in FIG. 2; "concept patterns" centered on predicates, such as "I want to go", "I want to know", and "please tell me", may also be used. Likewise, the concept appearance pattern model is not limited to the one shown in FIG. 3 and may be changed according to the speaker's utterance history or the history of the states in which the speech recognition apparatus is used.

A program that executes the processing of steps ST1 through ST8 described above may also be installed on a computer. The computer then functions as a speech recognition apparatus and can execute the processing of steps ST1 through ST8.

The program may, of course, be transmitted over a communication line or recorded on a recording medium such as an optical disk, a magnetic disk, or a semiconductor memory.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the concept expression pattern model stored in the concept expression pattern model storage unit.
FIG. 3 is a diagram showing the concept appearance pattern model stored in the concept appearance pattern model storage unit.
FIG. 4 is a flowchart showing the processing procedure by which the second decoder obtains the final recognition result.
FIG. 5(A) is a diagram showing the i-th maximum likelihood solution as text, FIG. 5(B) is a diagram explaining the processing when the concept expression pattern model is applied to the i-th maximum likelihood solution, and FIG. 5(C) is a diagram explaining the processing when the concept appearance pattern model is applied to the i-th maximum likelihood solution.
FIG. 6 is a block diagram showing another configuration of the speech recognition apparatus.

Explanation of symbols

1 Microphone
2 Feature extraction unit
3 Acoustic model storage unit
4 Language model storage unit
5 Concept expression pattern model storage unit
6 Concept appearance pattern model storage unit
7, 9 Decoder
8 Output unit
71 First decoder
72 Second decoder

Claims

A speech recognition apparatus comprising:
feature amount extraction means for extracting a feature amount of a speech signal;
acoustic model storage means for storing an acoustic model in which the feature amount is statistically modeled;
language model storage means for storing a language model in which chains of words are statistically modeled;
concept appearance pattern model storage means for storing a concept appearance pattern model having a plurality of appearance patterns, each representing a combination order of one or more concepts; and
recognition means for recognizing the speech represented by the speech signal based on the feature amount extracted by the feature amount extraction means, the acoustic model stored in the acoustic model storage means, the language model stored in the language model storage means, and the concept appearance pattern model stored in the concept appearance pattern model storage means.

The speech recognition apparatus according to claim 1, wherein the recognition means weights the score of each of a plurality of recognition result candidates, the candidates and their scores being based on the feature amount, the acoustic model, and the language model, by using a concept appearance pattern model in which a score is associated with each appearance pattern, and outputs the recognition result candidate having the highest weighted score.

The speech recognition apparatus according to claim 1, further comprising expression pattern model storage means for storing an expression pattern model representing a plurality of expression patterns for each concept,
wherein the recognition means recognizes the speech by further using the expression pattern model stored in the expression pattern model storage means.

The speech recognition apparatus according to claim 3, wherein the recognition means weights the score of each of a plurality of recognition result candidates, the candidates and their scores being based on the feature amount, the acoustic model, and the language model, by further using a concept expression pattern model in which a score is associated with each expression pattern.

A speech recognition method comprising:
a feature amount extraction step of extracting a feature amount of a speech signal; and
a recognition step of recognizing the speech represented by the speech signal based on the feature amount extracted in the feature amount extraction step, an acoustic model in which the feature amount is statistically modeled, a language model in which chains of words are statistically modeled, and a concept appearance pattern model having a plurality of appearance patterns, each representing a combination order of one or more concepts.

The speech recognition method according to claim 5, wherein, in the recognition step, the score of each of a plurality of recognition result candidates, the candidates and their scores being based on the feature amount, the acoustic model, and the language model, is weighted by using a concept appearance pattern model in which a score is associated with each appearance pattern, and the recognition result candidate having the highest weighted score is output.

The speech recognition method according to claim 5, wherein, in the recognition step, the speech is recognized by further using an expression pattern model representing, for each concept, a plurality of expression patterns and their scores.

The speech recognition method according to claim 7, wherein, in the recognition step, the score of each of a plurality of recognition result candidates, the candidates and their scores being based on the feature amount, the acoustic model, and the language model, is weighted by further using a concept expression pattern model in which a score is associated with each expression pattern.

A speech recognition program that causes a computer to execute:
a feature amount extraction step of extracting a feature amount of a speech signal; and
a recognition step of recognizing the speech represented by the speech signal based on the feature amount extracted in the feature amount extraction step, an acoustic model in which the feature amount is statistically modeled, a language model in which chains of words are statistically modeled, and a concept appearance pattern model having a plurality of appearance patterns, each representing a combination order of one or more concepts.