JP2005134442A

JP2005134442A - Speech recognition device and method, recording medium, and program

Info

Publication number: JP2005134442A
Application number: JP2003367223A
Authority: JP
Inventors: Katsuki Minamino; 活樹南野; Koji Asano; 康治浅野; Hiroaki Ogawa; 浩明小川
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-10-28
Filing date: 2003-10-28
Publication date: 2005-05-26

Abstract

PROBLEM TO BE SOLVED: To speed up processing while preventing the accuracy of speech recognition from deteriorating. SOLUTION: A word preliminary selection unit 56 selects one or more words that follow the already-obtained words of a word string that is a candidate for the speech recognition result; a word candidate clustering unit 57 classifies the word candidates into sets of homophones; a matching unit 58 calculates an acoustic score for the word candidate that has the highest initial score value within each set, and that acoustic score is substituted for the acoustic scores of the other word candidates in the same set. Thereafter, a re-evaluation unit 59 sequentially corrects the word connection relations between the words of the word string that is a candidate for the speech recognition result. The invention is applicable, for example, to a speech dialogue system. COPYRIGHT: (C)2005, JPO & NCIPI

Description

The present invention relates to a speech recognition apparatus and method, a recording medium, and a program, and more particularly to a speech recognition apparatus and method, a recording medium, and a program capable of speeding up processing while preventing a decline in speech recognition accuracy.

In recent years, products and services that apply speech recognition have increasingly been put into practical use.

Speech recognition is a technology that automatically determines the word sequence corresponding to an input speech signal. Combining this technology with applications makes a variety of products and services possible.

FIG. 1 shows an example of the basic configuration of a speech recognition apparatus. In FIG. 1, speech uttered by a user is captured by a microphone 1, sampled by an AD (Analog to Digital) conversion unit 2, and converted into digital data. A feature extraction unit 3 applies processing such as frequency analysis to the digital data output from the AD conversion unit 2 at appropriate time intervals and converts it into parameters representing the spectrum and other acoustic features of the speech. As a result, a time series of speech feature values is obtained. These parameters are sent from the feature extraction unit 3 to a matching unit 4. The matching unit 4 uses the data held in an acoustic model database (DB) 5, a dictionary database (DB) 6, and a grammar database (DB) 7 to perform matching against the feature sequence supplied from the feature extraction unit 3, and outputs a speech recognition result.

The acoustic model database 5 holds models of the acoustic features of the individual phonemes, syllables, and the like used in the target language; hidden Markov models (HMMs), for example, are used. The dictionary database 6 holds information on the pronunciation of each word to be recognized, which associates the words with the acoustic models described above. As a result, an acoustic reference pattern is obtained for each word in the dictionary database 6. The grammar database 7 describes how the words in the dictionary database 6 can be chained together; descriptions based on regular grammars or context-free grammars, or grammars that include statistical word-chain probabilities (N-grams), are used.

The matching unit 4 uses the acoustic model database 5, the dictionary database 6, and the grammar database 7 to determine the word sequence that best matches the feature sequence input from the feature extraction unit 3. For example, when hidden Markov models (HMMs) are used as the acoustic models, the value obtained by accumulating the occurrence probability of each feature value along the feature sequence is used as the acoustic evaluation value (hereinafter, the acoustic score). The acoustic score is obtained for each word by using the reference patterns described above.
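The accumulation of occurrence probabilities along the feature sequence can be sketched as follows. This is a minimal illustration only: the use of the Viterbi (best-path) recursion in the log domain, the function name `acoustic_score`, and the toy two-state model with discrete symbols are all assumptions made for the example, not details taken from the specification.

```python
import math

NEG_INF = float("-inf")

def acoustic_score(features, log_init, log_trans, log_emit):
    """Best-path (Viterbi) log score of a feature sequence under a small HMM.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit(s, x)  : log probability that state s emits feature x
    """
    n = len(log_init)
    score = [log_init[s] + log_emit(s, features[0]) for s in range(n)]
    for x in features[1:]:
        # Accumulate the best predecessor score plus the new emission score.
        score = [
            max(score[p] + log_trans[p][s] for p in range(n)) + log_emit(s, x)
            for s in range(n)
        ]
    return max(score)

# Hypothetical two-state left-to-right model over the symbols "a" and "b".
emit_prob = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]

def log_emit(state, symbol):
    return math.log(emit_prob[state][symbol])

example_score = acoustic_score(["a", "a", "b"], log_init, log_trans, log_emit)
```

Working in the log domain turns the product of probabilities into a sum, which is why the score is described as an accumulated value.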

Likewise, when a bigram is used as the grammar, the linguistic likelihood of each word, based on its chain probability with the immediately preceding word, is expressed numerically, and that value is given as the linguistic evaluation value (hereinafter, the language score).

The word sequence that best matches the input speech signal is then determined by evaluating the acoustic score and the language score together.
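One common way to evaluate the two scores together is a weighted sum of log scores, sketched below. The weighting scheme, the sentence-start symbol `"<s>"`, the probability floor, and all function names are assumptions for illustration; the specification does not state how the scores are combined.

```python
import math

def language_score(prev_word, word, bigram_prob, floor=1e-6):
    """Log bigram probability of `word` given `prev_word` (floored if unseen)."""
    return math.log(bigram_prob.get((prev_word, word), floor))

def sequence_score(words, acoustic_scores, bigram_prob, lm_weight=1.0):
    """Sum of per-word acoustic log scores and weighted bigram log scores."""
    total = 0.0
    prev = "<s>"  # hypothetical sentence-start symbol
    for word, acoustic in zip(words, acoustic_scores):
        total += acoustic + lm_weight * language_score(prev, word, bigram_prob)
        prev = word
    return total

# Toy values; real scores would come from the acoustic and language models.
bigram_prob = {("<s>", "今日"): 0.5, ("今日", "は"): 0.25}
total = sequence_score(["今日", "は"], [-1.0, -2.0], bigram_prob)
```

The candidate word sequence with the highest total would be chosen as the recognition result under this sketch.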

As an example, consider the case where the user utters "今日はいい天気ですね。" ("Nice weather today, isn't it?"). In this case, a word sequence such as "今日" (today), "は" (topic particle), "いい" (nice), "天気" (weather), "ですね" (isn't it) is obtained as the recognition result. An acoustic score and a language score are given to each word.

Various methods of performing such speech recognition processing have been proposed. In general, however, speech recognition requires a large amount of computation and memory, so speeding it up calls for a device with faster computing power and a larger memory (storage means). Techniques have therefore been proposed that speed up speech recognition processing while keeping the required computation and memory low.

For example, the present applicant has proposed a method that sequentially determines word string hypotheses corresponding to the input speech (see, for example, Patent Document 1). In this method, the candidates for the next word following the words already obtained as the speech recognition result are narrowed down by a preliminary selection; matching processing is performed on the word candidates that survive the word preliminary selection to identify word strings that are candidates for the speech recognition result; the word connection relations between the words of those candidate word strings are then corrected; and the word string that becomes the speech recognition result is finalized based on the corrected word connection relations. This method allows speech recognition processing to be executed more efficiently.

The bundle search method has also been proposed as one way of speeding up speech recognition processing (see, for example, Non-Patent Document 1). An outline of the bundle search method proposed in Non-Patent Document 1 is described with reference to FIG. 2. In FIG. 2, A, B, and C denote word candidates, circles denote nodes, and the line segments connecting circles denote arcs. Time advances from left to right.

Let S1 be the cumulative score up to word A, and let T1 be the cumulative score up to word B.

In the example of FIG. 2, the same word C appears at different positions in the grammar: it appears in the position following word A and also in the position following word B. Node 2 and node 4 are located at the same time t1.

In conventional speech recognition processing that does not use the bundle search method, the score of word C following word A and the score of word C following word B were calculated independently. In speech recognition processing that uses the bundle search method, by contrast, suppose the score S2 of word C following word A has been calculated by running the matching process with the score S1 as the initial value. The score difference Δ = S2 − S1 is then used to estimate the score T2 of word C following word B as T2 = T1 + Δ, where T1 is the cumulative score up to word B, so that the matching process for the same word candidate C needs to be run only once. The actual score calculation for word C following word B can thus be omitted, which speeds up speech recognition processing.
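The estimate can be written out as a short sketch. It assumes that T1 denotes the cumulative score up to word B, as the formula T2 = T1 + Δ implies; the function name and the numeric values are hypothetical.

```python
def bundle_estimate(s1, s2, t1):
    """Estimate the score of word C after word B without re-running matching.

    s1: cumulative score up to word A
    s2: score of word C following word A (obtained by actual matching)
    t1: cumulative score up to word B (assumed meaning of T1)
    """
    delta = s2 - s1    # score gained by matching word C once
    return t1 + delta  # reuse that gain on the path through word B

# Toy log scores chosen only to illustrate the arithmetic.
t2 = bundle_estimate(s1=-10.0, s2=-25.0, t1=-12.0)
```

The estimate treats the score contributed by word C as independent of the preceding word, which is the approximation the rest of the background discussion examines.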

Thus, the bundle search method targets speech recognition processing based on frame-synchronous matching, and addresses the problem that matching against words at different positions in the grammar is performed separately by running the matching process for the same word only once.

Patent Document 1: JP 2001-242884 A

Non-Patent Document 1: "Acceleration of continuous speech recognition using the bundle search method," IEICE Transactions, November 1992, D-22, Vol. J75-D-II, No. 11, pp. 1761-1769

However, with conventional techniques that try to speed up speech recognition processing while keeping the required computation and memory low, it has been difficult to increase processing speed without lowering recognition accuracy.

For example, in the method described in Patent Document 1, the subsequent matching process is applied to every word candidate determined by the word preliminary selection, so the amount of matching processing becomes relatively large. In addition, when the acoustic score or language score determined by the matching process depends on the immediately preceding word or the like, performing the matching rigorously requires, for matching processes that start at the same time, one matching run per distinct preceding word, which further increases the amount of matching processing.

For example, FIGS. 3 and 4 show examples of word connection relation information when recognizing the utterance "窓を開ける" (mado o akeru, "open the window"). In FIGS. 3 and 4, circles denote nodes, the line segments connecting circles denote arcs, and time advances from left to right. In FIG. 3, the matching process for the input speech has advanced to time t1, and two partial word string hypotheses have been obtained: "窓を" (mado o, "window" plus the object particle), which passes through node 1, node 2, and node 3, and "ドア" (doa, "door"), which passes through node 1 and node 4. Suppose that the word preliminary selection process then yields four word candidates starting at time t1: "開ける" (akeru, "to open"), "空ける" (akeru, "to empty"), "明ける" (akeru, "to dawn"), and "閉める" (shimeru, "to close"). FIG. 4 shows the word connection relation information at this point. As shown in FIG. 4, matching is performed for each of the word candidates "開ける", "空ける", "明ける", and "閉める", so four matching runs are required.

That is, in FIG. 4, arcs extend from node 3, the end of the partial word string hypothesis "窓を" (mado o), to each of nodes 5 through 8. The arc from node 3 to node 5 corresponds to the word candidate "開ける" (akeru, "to open"), the arc from node 3 to node 6 to "明ける" (akeru, "to dawn"), the arc from node 3 to node 7 to "空ける" (akeru, "to empty"), and the arc from node 3 to node 8 to "閉める" (shimeru, "to close"). The matching process is executed for each of these four word candidates, and a score is obtained for each.

Furthermore, when an acoustic model (for example, cross-word triphones) or a language model (for example, a bigram) that depends on the preceding word is applied, the matching process must be performed for each of the two preceding partial word string hypotheses, "窓を" (mado o) and "ドア" (doa). Therefore, as shown in FIG. 4, the matching process must be executed on the same word candidates for the partial hypothesis "ドア" as well, separately from the matching for the partial hypothesis "窓を".

That is, in FIG. 4, arcs extend from node 4, the end of the partial word string hypothesis "ドア" (doa), to each of nodes 9 through 12. The arc from node 4 to node 9 corresponds to the word candidate "開ける" (akeru, "to open"), the arc from node 4 to node 10 to "明ける" (akeru, "to dawn"), the arc from node 4 to node 11 to "空ける" (akeru, "to empty"), and the arc from node 4 to node 12 to "閉める" (shimeru, "to close"). The matching process is executed for each of these four word candidates, and a score is obtained for each.

As a result, the matching process must be executed as many times as the number of word candidates (four in FIG. 4: "開ける", "空ける", "明ける", and "閉める") multiplied by the number of immediately preceding partial word string hypotheses (two in FIG. 4: "窓を" and "ドア"), that is, 4 × 2 = 8 times in FIG. 4, and the amount of computation increases.
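The count of matching runs in this example can be reproduced by enumerating every pairing of a preceding partial hypothesis with a word candidate. This is only an illustrative sketch; the function name `matching_runs` is hypothetical.

```python
from itertools import product

def matching_runs(hypotheses, candidates):
    """List every (preceding hypothesis, word candidate) pair that needs matching."""
    return list(product(hypotheses, candidates))

hypotheses = ["窓を", "ドア"]
candidates = ["開ける", "空ける", "明ける", "閉める"]
runs = matching_runs(hypotheses, candidates)
print(len(runs))
```

The number of runs grows with the product of the two list lengths, which is the cost the background section identifies.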

The bundle search method described in Non-Patent Document 1, on the other hand, has the problem that the matching process is no longer based on strict dynamic programming. For example, in FIG. 2, if the matching process for the same word C of interest is performed by dynamic programming using cross-word triphones, the acoustic model of word C changes according to the word that precedes it in the grammar (word A or word B), even though it is the same word. That is, in FIG. 2 the acoustic model of word C when it follows word A differs from the acoustic model of word C when it follows word B.

If the score estimation of the bundle search method described above is applied, however, the acoustic model of word C following word B is in effect replaced by the acoustic model of word C following word A, so an accurate acoustic score cannot be obtained. Moreover, when the matching process for the same word C is performed by dynamic programming, the time of transition into word C can vary, whereas with the bundle search score estimation the transition time t1 into word C ends up fixed across instances of the same word.

Consequently, in the bundle search method the estimated score T2 = T1 + Δ may differ from the exact score, which causes recognition accuracy to deteriorate.

The speech recognition apparatus of the present invention comprises: matching processing execution means for performing matching processing on one or more word candidates that follow the words already obtained in a word string, sharing part of the processing within each word group into which the candidates are classified by a common attribute; and correction means for correcting word connection relation information that is generated based on the result of the matching processing by the matching processing execution means and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result.

The apparatus may further include classification means for classifying the one or more word candidates that follow the words already obtained in the word string into the word groups, each consisting of words having the same attribute, and the matching processing execution means may share part of the processing within each word group classified by the classification means.

The apparatus may further include selection means for selecting the one or more word candidates that follow the words already obtained in the word string, and the classification means may classify the one or more word candidates selected by the selection means into the word groups, each consisting of words having the same attribute.

The one or more word candidates that follow the words already obtained may be classified into the word groups so that words having the same pronunciation fall in the same group.

The one or more word candidates that follow the words already obtained may be classified into the word groups so that acoustically similar words fall in the same group.

The apparatus may further include storage means for storing the word connection relation information generated based on the result of the matching processing by the matching processing execution means, and the correction means may correct the word connection relation information stored in the storage means.

The apparatus may further include storage means for storing a state transition model when the correction means has constructed that model to calculate the acoustic score of a partial word string in order to correct the word connection relation information, and the correction means may use the state transition model stored in the storage means when it recalculates the acoustic score of the same partial word string.

The apparatus may further include storage means for storing a state transition model when the matching processing execution means has constructed that model to calculate the acoustic score of a word, and the matching processing execution means may use the state transition model stored in the storage means when it recalculates the acoustic score of the same word.

The apparatus may further include storage means for storing intermediate values of the matching processing calculated by the matching processing execution means, and the matching processing execution means may use the stored intermediate values to execute the matching processing for a plurality of word strings that are candidates for the speech recognition result alternately.

The speech recognition method of the present invention includes: a matching processing execution step of performing matching processing on one or more word candidates that follow the words already obtained in a word string, sharing part of the processing within each word group into which the candidates are classified by a common attribute; and a correction step of correcting word connection relation information that is generated based on the result of the matching processing in the matching processing execution step and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result.

The program on the recording medium of the present invention includes: a matching processing execution step of performing matching processing on one or more word candidates that follow the words already obtained in a word string, sharing part of the processing within each word group into which the candidates are classified by a common attribute; and a correction step of correcting word connection relation information that is generated based on the result of the matching processing in the matching processing execution step and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result.

The program of the present invention causes a computer that controls processing for determining a word string corresponding to input speech to execute: a matching processing execution step of performing matching processing on one or more word candidates that follow the words already obtained in a word string, sharing part of the processing within each word group into which the candidates are classified by a common attribute; and a correction step of correcting word connection relation information that is generated based on the result of the matching processing in the matching processing execution step and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result.

In the speech recognition apparatus and method, recording medium, and program of the present invention, matching processing is performed on one or more word candidates that follow the words already obtained in a word string, with part of the processing shared within each word group into which the candidates are classified by a common attribute, and word connection relation information that is generated based on the result of the matching processing and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result is corrected.

The present invention is applicable, for example, to searching a database by voice, operating various devices by voice, entering data into devices by voice, and speech dialogue systems. More specifically, it is applicable to, for example, a database search apparatus that displays the corresponding map information in response to a spoken place-name query, an industrial robot that sorts packages in response to spoken commands, a dictation system that creates text by speech input instead of a keyboard, and a dialogue system in a robot that converses with a user.

According to the present invention, speech recognition processing can be executed. In particular, speech recognition processing can be sped up while suppressing increases in the computation and memory required and maintaining recognition accuracy.

The best mode of the present invention is described below. The correspondence between the disclosed invention and the embodiments is illustrated as follows. Even if there is an embodiment that is described in the specification but is not listed here as corresponding to an invention, that does not mean the embodiment does not correspond to that invention. Conversely, even if an embodiment is listed here as corresponding to an invention, that does not mean the embodiment does not correspond to inventions other than that one.

Furthermore, this description does not cover all of the inventions described in the specification. In other words, this description does not deny the existence of inventions that are described in the specification but not claimed in this application, that is, inventions that may in future be filed as divisional applications or that may appear and be added by amendment.

According to the present invention, a speech recognition apparatus is provided. This speech recognition apparatus comprises: matching processing execution means (for example, the matching unit 58 in FIG. 5) for performing matching processing on one or more word candidates that follow the words already obtained in a word string (for example, "開ける", "明ける", "空ける", and "閉める"), sharing part of the processing (for example, the calculation of the acoustic score) within each word group (for example, "あける" (akeru) and "しめる" (shimeru)) into which the candidates are classified by a common attribute (for example, being homophones); and correction means (for example, the re-evaluation unit 59 in FIG. 5) for correcting word connection relation information that is generated based on the result of the matching processing by the matching processing execution means and that indicates the connection relations between the words of a word string that is a candidate for the speech recognition result.
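The grouping of homophones and the sharing of the acoustic score calculation can be sketched as follows. The reading table, the stand-in scoring function, and all names are hypothetical; a real system would take readings from the dictionary database and scores from the acoustic models.

```python
from collections import defaultdict

def cluster_by_reading(candidates, reading_of):
    """Group word candidates that share the same pronunciation (reading)."""
    groups = defaultdict(list)
    for word in candidates:
        groups[reading_of[word]].append(word)
    return dict(groups)

def shared_acoustic_scores(groups, score_fn):
    """Compute the acoustic score once per group and reuse it for every member."""
    scores, calls = {}, 0
    for reading, words in groups.items():
        group_score = score_fn(reading)  # the shared part of the matching process
        calls += 1
        for word in words:
            scores[word] = group_score
    return scores, calls

# Hypothetical readings and dummy log scores for illustration only.
reading_of = {"開ける": "あける", "明ける": "あける", "空ける": "あける", "閉める": "しめる"}
dummy_scores = {"あける": -5.0, "しめる": -9.0}

groups = cluster_by_reading(list(reading_of), reading_of)
scores, calls = shared_acoustic_scores(groups, dummy_scores.__getitem__)
```

In this sketch four candidates need only two acoustic score calculations, because the three words read "あける" share one result. Language scores would still be evaluated per word, since homophones differ linguistically.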

According to the present invention, a speech recognition apparatus is provided. This speech recognition apparatus may further include classification means (for example, the word candidate clustering unit 57 in FIG. 5) for classifying the one or more word candidates that follow the words already obtained in the word string into the word groups, each consisting of words having the same attribute, and the matching processing execution means may share part of the processing within each word group classified by the classification means.

According to the present invention, a speech recognition apparatus is provided. This speech recognition apparatus may further include selection means (for example, the word preliminary selection unit 56 in FIG. 5) for selecting the one or more word candidates that follow the words already obtained in the word string, and the classification means may classify the one or more word candidates selected by the selection means into the word groups, each consisting of words having the same attribute.

In the speech recognition device, the one or more candidate words following the already obtained word may be classified into the word groups so that words having the same pronunciation fall into the same group.

In the speech recognition device, the one or more candidate words following the already obtained word may be classified into the word groups so that acoustically similar words fall into the same group.

The speech recognition device may further include storage means (for example, the word connection relation storage unit 60 in FIG. 5) that stores the word connection relation information generated from the result of the matching processing by the matching process execution means. The correction means can then correct the word connection relation information stored in the storage means.

The speech recognition device may further include storage means (for example, the re-evaluation process storage unit 201 in FIG. 22) that, when the correction means constructs a state transition model for calculating the acoustic score of a partial word string in order to correct the word connection relation information, stores the constructed state transition model. When the acoustic score of the same partial word string is calculated again, the correction means can reuse the state transition model stored in the storage means.

The speech recognition device may further include storage means (for example, the word model storage unit 301 in FIG. 25) that, when the matching process execution means constructs a state transition model for calculating the acoustic score of a word, stores the constructed state transition model. When the acoustic score of the same word is calculated again, the matching process execution means can reuse the state transition model stored in the storage means.

The speech recognition device may further include storage means (for example, the matching process storage unit 401 in FIG. 28) that stores values calculated by the matching process execution means as intermediate results of the matching processing. Using the stored intermediate values, the matching process execution means can execute the matching processing for a plurality of word strings that are candidates for the speech recognition result in an alternating manner.

According to the present invention, a speech recognition method is provided. The method includes a matching process execution step (for example, step S7 in FIG. 7) and a correction step (for example, step S4 in FIG. 7). The matching process execution step takes one or more candidate words that follow an already obtained word in a word string (for example, 「開ける」 (akeru, "to open"), 「明ける」 (akeru, "to dawn"), 「空ける」 (akeru, "to vacate"), and 「閉める」 (shimeru, "to close")) and performs matching processing in which part of the processing (for example, the calculation of the acoustic score) is shared within each word group (for example, 「あける」 and 「しめる」) obtained by classifying the candidates into words having the same attribute (for example, homophones). The correction step corrects word connection relation information, which is generated from the result of the matching processing in the matching process execution step and indicates the connection relations between the words of the word strings that are candidates for the speech recognition result.

According to the present invention, a program corresponding to the speech recognition method is also provided.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 5 shows a configuration example of a speech recognition device to which the present invention is applied.

Speech uttered by the user is input to the microphone 51 in FIG. 5, which converts it into an audio signal in the form of an electrical signal. This audio signal is supplied to the AD conversion unit 52. The AD conversion unit 52 samples and quantizes the analog audio signal from the microphone 51 and converts it into audio data, which is a digital signal. The audio data is supplied to the feature extraction unit 53.

The feature extraction unit 53 applies acoustic processing to the audio data from the AD conversion unit 52 for each appropriate frame, thereby extracting feature values such as MFCCs (Mel Frequency Cepstrum Coefficients), and supplies them to the control unit 54. The feature extraction unit 53 can also extract other feature values, such as spectra, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs. The sequence of feature values of the user's speech output by the feature extraction unit 53 is supplied to the control unit 54 frame by frame, and the control unit 54 supplies the feature values from the feature extraction unit 53 to the feature value storage unit 55.

The control unit 54 refers to the word connection relation information stored in the word connection relation storage unit 60 and controls the matching unit 58 and the re-evaluation unit 59. The control unit 54 also generates word connection relation information from the acoustic scores, language scores, and other matching results obtained when the matching unit 58 performs matching processing, and updates the contents of the word connection relation storage unit 60 with that information. In addition, the control unit 54 corrects the contents of the word connection relation storage unit 60 based on the output of the re-evaluation unit 59. Finally, the control unit 54 determines and outputs the final speech recognition result based on the word connection relation information stored in the word connection relation storage unit 60.

Here, the word connection relation information represents the connection (chaining or concatenation) relations between the words that make up the word strings that are candidates for the final speech recognition result. It includes the acoustic score and language score of each word, as well as the start time and end time of the utterance corresponding to each word.

In this specification, the description assumes that larger acoustic and language scores are better. The present invention is equally applicable under the opposite convention, in which smaller scores are better, for example by inverting the sign of the acoustic and language scores.

FIG. 6 shows, as a graph structure, an example of the word connection relation information stored in the word connection relation storage unit 60. In FIG. 6, the graph consists of arcs, which represent words (the line segments connecting the circles), and nodes, which represent the boundaries between words (the circles). The same applies to FIGS. 8 to 11, 14, 16, 17, 19 to 21, 24, and 26, described later.

Each node has time information, which represents the extraction time of the feature value corresponding to that node. As described later, the extraction time is the time at which a feature value output by the feature extraction unit 53 was obtained, measured with the start of the speech segment as time 0. In FIG. 6, therefore, node 1, which corresponds to the start of the speech segment, that is, the beginning of the first word, has time information 0. A node serves as the start or the end of an arc. The time information of a start node is the utterance start time of the word corresponding to that node, and the time information of an end node is the utterance end time of that word.

In FIG. 6, time passes from left to right. Of the two nodes at the left and right of an arc, the left node is therefore the start node and the right node is the end node. The same applies to FIGS. 8 to 11, 14, 16, 17, 19 to 21, 24, and 26, described later.

Each arc holds the word it corresponds to, together with the acoustic score and language score of that word. Arcs are connected one after another, each new arc taking the end node of a previous arc as its start node, and in this way the sequences of words that are candidates for the speech recognition result are built up. In FIG. 6, the word, acoustic score, and language score of each arc are shown in parentheses next to the arc. In the example of FIG. 6, arc 1 corresponds to the word 「部屋」 (heya, "room") with acoustic score A1 and language score L1; arc 2 corresponds to 「を」 (the object particle o) with acoustic score A2 and language score L2; arc 3 corresponds to 「空ける」 (akeru, "to vacate") with acoustic score A3 and language score L3; arc 4 corresponds to 「窓」 (mado, "window") with acoustic score A4 and language score L4; and arc 5 corresponds to 「を」 with acoustic score A5 and language score L5.

In the control unit 54, arcs corresponding to words that are probable as the speech recognition result are first connected to node 1, which represents the start of the speech segment. In the example of FIG. 6, arc 1, corresponding to 「部屋」, and arc 4, corresponding to 「窓」, are connected. Whether a word is probable as the speech recognition result is determined from the acoustic score and language score obtained by the matching unit 58.

Thereafter, in the same manner, arcs corresponding to probable words are connected to end node 2, which is the end of arc 1 corresponding to 「部屋」; to end node 3, which is the end of arc 2 corresponding to 「を」; and to end node 5, which is the end of arc 4 corresponding to 「窓」.

As arcs are connected in this way, one or more paths made up of arcs and nodes are formed from left to right, starting from the start of the speech segment. When, for example, all of these paths have reached the end of the speech segment (time T in the embodiment of FIG. 6), the control unit 54 accumulates, for each path formed from the start to the end of the speech segment, the acoustic scores and language scores held by the arcs of that path to obtain a final score. Then, for example, the word string corresponding to the arcs of the path with the highest final score is determined and output as the speech recognition result.
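The accumulation of scores along paths can be sketched as follows. This is a minimal illustration, not the implementation of the control unit 54: the `Arc` class, the `best_path` function, the node times, the node numbering beyond what FIG. 6 is described as showing, and all score values are assumptions made for the example. Scores follow the convention of this specification that larger is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    word: str
    start_node: int
    end_node: int
    acoustic: float  # acoustic score (larger is better)
    language: float  # language score (larger is better)


def best_path(arcs, node_time, initial_node, final_time):
    """Return (final_score, words) of the best path that reaches final_time, or None."""
    outgoing = {}
    for arc in arcs:
        outgoing.setdefault(arc.start_node, []).append(arc)

    best = None
    stack = [(initial_node, 0.0, [])]  # (node, accumulated score, words so far)
    while stack:
        node, score, words = stack.pop()
        if node_time[node] >= final_time:  # path reached the end of the speech segment
            if best is None or score > best[0]:
                best = (score, words)
            continue
        for arc in outgoing.get(node, []):
            stack.append((arc.end_node,
                          score + arc.acoustic + arc.language,
                          words + [arc.word]))
    return best


# Hypothetical node times (in frames) and scores; T = 100.
node_time = {1: 0, 2: 30, 3: 40, 4: 100, 5: 25, 6: 40, 7: 100}
arcs = [
    Arc("部屋", 1, 2, -10.0, -2.0),
    Arc("を", 2, 3, -3.0, -1.0),
    Arc("空ける", 3, 4, -20.0, -4.0),
    Arc("窓", 1, 5, -8.0, -2.0),
    Arc("を", 5, 6, -3.0, -1.0),
    Arc("開ける", 6, 7, -15.0, -2.0),
]
result = best_path(arcs, node_time, initial_node=1, final_time=100)
```

Because node times strictly increase along every arc, the graph has no cycles and the exhaustive traversal terminates. Paths that stop before time T are simply never scored.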

In the case described above, an arc is always connected to every node within the speech segment so that the paths extend from the start to the end of the segment. While such paths are being built, however, a path that is clearly unsuitable as a speech recognition result, judging from the score of the path built so far, can be terminated at that point (no further arcs are connected to it).

Under the path construction rule described above, the end of one arc becomes the start node of the one or more arcs connected next, so paths basically grow like spreading branches. As an exception, the end of one arc may coincide with the end of another arc; that is, the end node of one arc and the end node of another arc may be shared as a single node.

Specifically, when a bigram is used as the grammar rule, if two arcs extending from different nodes correspond to the same word, and the utterance end time of that word is also the same, the ends of the two arcs coincide.
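One way to picture this sharing rule is a lookup table keyed by the word and its end time. The sketch below is only an illustration under that assumption; the `NodeTable` class and its method names are hypothetical and do not come from the specification.

```python
class NodeTable:
    """Allocates node IDs and shares end nodes for arcs with the same word and end time."""

    def __init__(self):
        self._next_id = 1
        self._shared = {}  # (word, end_time) -> node ID
        self.time = {}     # node ID -> time information of the node

    def new_node(self, time):
        node_id = self._next_id
        self._next_id += 1
        self.time[node_id] = time
        return node_id

    def end_node_for(self, word, end_time):
        # Under a bigram, the continuation depends only on the last word,
        # so arcs for the same word ending at the same time can share a node.
        key = (word, end_time)
        if key not in self._shared:
            self._shared[key] = self.new_node(end_time)
        return self._shared[key]


table = NodeTable()
initial = table.new_node(0)
a = table.end_node_for("を", 40)  # arc for 「を」 extending from one node
b = table.end_node_for("を", 40)  # same word, same end time, different start node
c = table.end_node_for("を", 45)  # same word, different end time
```

Here `a` and `b` refer to the same node, while `c` is a separate node because its end time differs.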

Node sharing can be omitted, but it is preferable from the viewpoint of efficient use of memory.

The word connection relation information stored in the word connection relation storage unit 60 can be referred to as needed by the word preliminary selection unit 56, the matching unit 58, and the re-evaluation unit 59.

In FIG. 5, the feature value storage unit 55 stores the sequence of feature values supplied from the control unit 54. It stores the entire time series of supplied feature values so that they can be used retroactively. The control unit 54 supplies to the feature value storage unit 55, together with each feature value, the time at which that feature value output by the feature extraction unit 53 was obtained, measured relative to the start time of the speech segment (for example, 0); this time is hereinafter referred to as the extraction time where appropriate. The feature value storage unit 55 stores each feature value together with its extraction time. The stored feature values and extraction times can be referred to as needed by the word preliminary selection unit 56, the matching unit 58, and the re-evaluation unit 59.

When instructed by the matching unit 58 to perform word preliminary selection with the time of a designated node as the start time, the word preliminary selection unit 56 performs word preliminary selection processing, which selects one or more words to be subjected to matching processing by the matching unit 58. It does so using the feature values stored in the feature value storage unit 55, referring as needed to the word connection relation storage unit 60, the acoustic model database (DB) 61, the dictionary database (DB) 62, and the grammar database (DB) 63. The word preliminary selection unit 56 sends the one or more selected words to the word candidate clustering unit 57. For example, if the preliminary selection yields the four word candidates 「開ける」 (akeru), 「空ける」 (akeru), 「明ける」 (akeru), and 「閉める」 (shimeru) starting at time t1 in FIG. 6, these four word candidates are sent to the word candidate clustering unit 57.

The word candidate clustering unit 57 examines the pronunciation of each word candidate supplied from the word preliminary selection unit 56 and clusters the candidates that have the same pronunciation (homophones). This produces one or more word candidate sets. For example, when the candidates 「開ける」, 「空ける」, 「明ける」, and 「閉める」 are supplied, the word candidate clustering unit 57 clusters 「開ける」, 「空ける」, and 「明ける」, which are all pronounced akeru, into one word candidate set, and places 「閉める」 (shimeru) in a second word candidate set of which it is the only member. After clustering the supplied word candidates, the word candidate clustering unit 57 attaches to each candidate classification information that identifies the word candidate set to which it belongs, and sends the candidates to the matching unit 58.
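The clustering step can be sketched as grouping candidates by their reading. This is a minimal illustration; the function name `cluster_by_pronunciation` and the small `pronunciation` table, which stands in for the pronunciation information of the dictionary database 62, are assumptions made for the example.

```python
def cluster_by_pronunciation(candidates, pronunciation):
    """Return classification info: word candidate -> ID of its word candidate set."""
    set_id_of_reading = {}
    classification = {}
    for word in candidates:
        reading = pronunciation[word]
        # Candidates with the same reading receive the same set ID.
        set_id = set_id_of_reading.setdefault(reading, len(set_id_of_reading))
        classification[word] = set_id
    return classification


# Hypothetical stand-in for the pronunciation entries of the word dictionary.
pronunciation = {
    "開ける": "あける",
    "空ける": "あける",
    "明ける": "あける",
    "閉める": "しめる",
}
classification = cluster_by_pronunciation(
    ["開ける", "空ける", "明ける", "閉める"], pronunciation)
```

With this input, the three akeru candidates share set 0 and 「閉める」 forms set 1 on its own.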

Under the control of the control unit 54, the matching unit 58 instructs the word preliminary selection unit 56 to perform word preliminary selection with the time of a designated node as the start time. The matching unit 58 also performs matching processing on the word candidates obtained from the clustering by the word candidate clustering unit 57, using the feature values stored in the feature value storage unit 55 and referring as needed to the word connection relation storage unit 60, the acoustic model database 61, the dictionary database 62, and the grammar database 63. It thereby determines the acoustic score, language score, and end time of each word candidate.

The acoustic model database 61 stores acoustic models that represent the acoustic characteristics of individual phonemes, syllables, and the like in the language of the speech to be recognized. Since speech recognition here is based on the continuous-distribution HMM method, an HMM (Hidden Markov Model), for example, is used as the acoustic model. The dictionary database 62 stores a word dictionary that describes pronunciation information (phonological information) for each word (vocabulary item) to be recognized. The grammar database 63 stores grammar rules (a language model) that describe how the words registered in the word dictionary of the dictionary database 62 chain together (connect). As grammar rules, rules based on, for example, a context-free grammar (CFG) or statistical word chain probabilities (N-grams) can be used.

For each word candidate in a word candidate set obtained from the clustering by the word candidate clustering unit 57, the matching unit 58 calculates an initial value of the score. For the word with the largest initial value, it refers to the word dictionary of the dictionary database 62 and connects acoustic models stored in the acoustic model database 61 to construct an acoustic model of the word (a word model). Using that word model and the feature values, it determines the acoustic score and end time for each word candidate set. The matching unit 58 also refers to the grammar database 63 to determine the language score of each word candidate in the word candidate set.

The matching unit 58 supplies the control unit 54 with the acoustic score and end time of each word candidate set, and the language score of each word candidate, obtained as the result of the matching processing.
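The sharing of the acoustic score within a word candidate set can be sketched as follows. This is a simplified illustration rather than the processing of the matching unit 58: the function `match_candidate_sets`, the callables passed to it, and all numeric values are hypothetical, and the acoustic scoring is a stub in place of matching against an HMM word model.

```python
def match_candidate_sets(candidate_sets, initial_score, acoustic_score, language_score):
    """Score every candidate while running acoustic matching once per candidate set."""
    results = {}
    for members in candidate_sets:
        # The candidate with the largest initial score represents the set.
        representative = max(members, key=initial_score)
        shared_acoustic, end_time = acoustic_score(representative)
        for word in members:
            # Homophones reuse the representative's acoustic score and end time;
            # the language score is still determined per candidate.
            results[word] = (shared_acoustic, language_score(word), end_time)
    return results


# Hypothetical values for illustration only.
reading = {"開ける": "あける", "空ける": "あける", "明ける": "あける", "閉める": "しめる"}
initial = {"開ける": -1.0, "空ける": -2.5, "明ける": -3.0, "閉める": -1.5}
acoustic_calls = []


def stub_acoustic_score(word):
    """Stand-in for word-model matching; returns (acoustic score, end time)."""
    acoustic_calls.append(word)
    return {"あける": (-42.0, 75), "しめる": (-57.0, 78)}[reading[word]]


results = match_candidate_sets(
    [["開ける", "空ける", "明ける"], ["閉める"]],
    initial_score=initial.get,
    acoustic_score=stub_acoustic_score,
    language_score=initial.get,  # the initial score doubles as the language score here
)
```

In this run the stub is called twice for four candidates, which is the saving the clustering is meant to achieve.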

Under the control of the control unit 54, the re-evaluation unit 59 re-evaluates the word connection relation information stored in the word connection relation storage unit 60, using the feature values stored in the feature value storage unit 55 and referring as needed to the acoustic model database 61, the dictionary database 62, and the grammar database 63. It supplies the result of the re-evaluation to the control unit 54.

That is, for a node in the word connection relation information stored in the word connection relation storage unit 60 at which matching processing is to be continued, the re-evaluation unit 59 re-evaluates, before that matching processing is performed, the partial word sequence that extends back in time from the node, using the acoustic model database 61, the dictionary database 62, and the grammar database 63. As described later, the partial word string can be evaluated without fixing its word boundaries, so the word boundaries are determined by dynamic programming. The word connection relation information stored in the word connection relation storage unit 60 is then corrected based on the result of the re-evaluation. This processing is performed immediately before the next matching processing command is issued to the matching unit 58. For example, taking the two word string hypotheses 「窓を」 (mado o, "the window" followed by the object particle) and 「ドア」 (doa, "door") in FIG. 6, each is re-evaluated, after which the processing of the word preliminary selection unit 56, the word candidate clustering unit 57, and the matching unit 58 is performed. Similarly, when the newly extended word string hypotheses 「窓を開ける」 ("open the window") and 「ドア開ける」 ("open the door") are stored in the word connection relation storage unit 60, those hypotheses are also re-evaluated later. As a result, even for the same word 「開ける」, correct matching processing is performed for each hypothesis using acoustic models that depend on the immediately preceding word string hypothesis, 「窓を」 or 「ドア」, and both the word boundary and the acoustic score are corrected.
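The idea of determining a word boundary by dynamic programming, instead of fixing it in advance, can be illustrated with a small Viterbi alignment over a left-to-right chain of states. This is a heavily simplified sketch and not the re-evaluation procedure of the specification: each word is reduced to a single state, transition scores are omitted, and the function names and per-frame log-likelihood values are all hypothetical.

```python
NEG_INF = float("-inf")


def align(emit):
    """Viterbi alignment over a left-to-right state chain.

    emit[t][s] is the log-likelihood of frame t in state s. The path starts in
    state 0 and ends in the last state; at each frame it stays or advances by one.
    Returns (best score, state index for every frame).
    """
    n_frames, n_states = len(emit), len(emit[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emit[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG_INF
            prev = s - 1 if move > stay else s
            score[t][s] = max(stay, move) + emit[t][s]
            back[t][s] = prev
    # Trace the best state sequence back from the final state.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path


def boundary_frame(path, first_state_of_second_word):
    """First frame assigned to the second word, i.e. the word boundary."""
    return path.index(first_state_of_second_word)


# Hypothetical log-likelihoods for 5 frames and 2 one-state words.
emit = [
    [-1.0, -5.0],
    [-1.0, -5.0],
    [-4.0, -1.0],
    [-5.0, -1.0],
    [-5.0, -1.0],
]
total, path = align(emit)
boundary = boundary_frame(path, 1)
```

The boundary is an output of the alignment rather than an input, which is the property the re-evaluation relies on when it corrects word boundaries together with acoustic scores.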

As described above, the word connection relation storage unit 60 stores the word connection relation information supplied from the control unit 54 until the recognition result for the user's speech is obtained.

As described above, the word candidate clustering unit 57 clusters word candidates with the same pronunciation, which reduces the number of matching operations. The result is first stored in the word connection relation storage unit 60, and the re-evaluation unit 59 then re-evaluates the stored word string hypotheses. In the re-evaluation, acoustic models that depend on the immediately preceding word string hypothesis, which are not used during the matching processing, are applied, so the word boundaries and acoustic scores can be corrected to the proper values.

Next, the speech recognition processing performed by the speech recognition device in FIG. 5 is described with reference to the flowchart in FIG. 7.

When the user speaks, the uttered speech is converted into digital audio data via the microphone 51 and the AD conversion unit 52 and supplied to the feature extraction unit 53. The feature extraction unit 53 sequentially extracts speech feature values from the supplied audio data frame by frame and supplies them to the control unit 54.

The control unit 54 recognizes the speech segment by some method. Within the speech segment, it associates the sequence of feature values supplied from the feature extraction unit 53 with the extraction time of each feature value, and supplies them to the feature value storage unit 55 for storage.

After the speech segment starts, the control unit 54 generates, in step S1, a node representing the start of the speech segment (hereinafter referred to as the initial node where appropriate) and supplies it to the word connection relation storage unit 60 for storage. That is, in step S1 the control unit 54 stores, for example, node 1 in FIG. 6 in the word connection relation storage unit 60.

The processing then proceeds to step S2, where the control unit 54 refers to the word connection relation information in the word connection relation storage unit 60 to determine whether an intermediate node exists.

That is, as described above, in the word connection relation information shown for example in FIG. 6, paths extending from the start to the end of the speech segment are formed as arcs are connected to end nodes. In step S2, the end nodes to which no arc has yet been connected and which have not reached the final time T of the speech segment are searched for as intermediate nodes (for example, node 6 in FIG. 6), and it is determined whether such an intermediate node exists.

As described above, the speech segment is recognized by some method, and the time corresponding to an end node can be obtained from the time information of that end node. Whether an end node with no arc connected is an intermediate node that has not reached the end of the speech segment can therefore be determined by comparing the last time of the speech segment with the time information of the end node.

If it is determined in step S2 that an intermediate node exists, the processing proceeds to step S3, where the control unit 54 selects one of the intermediate nodes in the word connection relation information as the node for which the word of the arc to be connected to it is to be determined (hereinafter referred to as the node of interest where appropriate).

すなわち、制御部５４は、単語接続関係情報の中に１つの途中ノードしか存在しない場合には、その途中ノードを、注目ノードとして選択する。また、制御部５４は、単語接続関係情報の中に複数の途中ノードが存在する場合には、その複数の途中ノードのうちの１つを注目ノードとして選択する。具体的には、制御部５４は、例えば、複数の途中ノードそれぞれが有する時刻情報を参照し、その時刻情報が表す時刻が最も古いもの（音声区間の開始側のもの）を、注目ノードとして選択する。なお、同一の時刻に、時刻が最も古い途中ノードが複数存在した場合（例えば、後述する図１７に示される例の場合）、制御部５４は、時刻が最も古い複数の途中ノードを全て注目ノードとして選択する。 That is, when only one intermediate node exists in the word connection relation information, the control unit 54 selects that intermediate node as the node of interest. When a plurality of intermediate nodes exist, the control unit 54 selects one of them as the node of interest. Specifically, the control unit 54 refers, for example, to the time information held by each of the intermediate nodes and selects as the node of interest the one whose time is earliest (the one closest to the start of the speech segment). When a plurality of intermediate nodes share the earliest time (for example, in the example shown in FIG. 17 described later), the control unit 54 selects all of those intermediate nodes as nodes of interest.
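
Steps S2 and S3 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: the Node class, its field names, and the use of integer frame indices as time information are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    time: int               # time information held by the node (assumed: frame index)
    has_outgoing_arc: bool  # True once an arc has been connected to this node


def find_intermediate_nodes(nodes, final_time):
    """Step S2: terminal nodes with no arc connected that have not reached time T."""
    return [n for n in nodes if not n.has_outgoing_arc and n.time < final_time]


def select_nodes_of_interest(nodes, final_time):
    """Step S3: choose the intermediate node(s) whose time is earliest."""
    candidates = find_intermediate_nodes(nodes, final_time)
    if not candidates:
        return []
    earliest = min(n.time for n in candidates)
    # All intermediate nodes sharing the earliest time are selected together.
    return [n for n in candidates if n.time == earliest]


nodes = [
    Node(1, 0, True),
    Node(5, 40, False),
    Node(6, 25, False),
    Node(7, 25, False),
    Node(8, 100, False),  # already at the final time T, so not an intermediate node
]
selected = select_nodes_of_interest(nodes, final_time=100)
print([n.node_id for n in selected])  # [6, 7]
```

An empty result corresponds to the case in which step S2 finds no intermediate node.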

その後、制御部５４は、注目ノードが有する時刻情報を開始時刻としてマッチング処理を行なう旨の指令（以下、適宜、マッチング処理指令という）を、マッチング部５８および再評価部５９に出力する。 Thereafter, the control unit 54 outputs, to the matching unit 58 and the reevaluation unit 59, a command for performing the matching process using the time information of the node of interest as the start time (hereinafter referred to as a matching process command as appropriate).

再評価部５９は、制御部５４からマッチング処理指令を受信すると、ステップＳ４に進み、単語接続関係記憶部６０を参照することにより、初期ノードから注目ノードに至るまでのパス（以下、適宜、部分パスという）を構成するアークによって表される単語列（以下、適宜、部分単語列という）を認識し、その部分単語列の再評価を行なう。すなわち、部分単語列は、後述するようにして、単語予備選択部５６が予備選択した単語を対象に、マッチング部５８がマッチング処理を行なうことにより得られた音声認識結果の候補とする単語列の途中結果であるが、再評価部５９では、その途中結果が、再度、評価される。 When the re-evaluation unit 59 receives the matching process command from the control unit 54, the process proceeds to step S4. By referring to the word connection relation storage unit 60, the re-evaluation unit 59 recognizes the word string (hereinafter referred to as a partial word string, as appropriate) represented by the arcs constituting the path from the initial node to the node of interest (hereinafter referred to as a partial path, as appropriate), and re-evaluates that partial word string. That is, the partial word string is an intermediate result of a word string that is a candidate for the speech recognition result, obtained by the matching unit 58 performing the matching process on the words preliminarily selected by the word preliminary selection unit 56, as described later; the re-evaluation unit 59 evaluates this intermediate result again.

具体的には、再評価部５９は、部分単語列について、言語スコアおよび音響スコアを再計算するため、部分単語列に対応する特徴量の系列を、特徴量記憶部５５から読み出す。すなわち、再評価部５９は、例えば、部分パスの先頭のノードである初期ノードが有する時刻情報が表す時刻から、注目ノードが有する時刻情報が表す時刻までに対応付けられている特徴量の系列（特徴量系列）を、特徴量記憶部５５から読み出す。さらに、再評価部５９は、音響モデルデータベース６１、辞書データベース６２、および文法データベース６３を参照し、特徴量記憶部５５から読み出した特徴量系列を用いて、部分単語列について、言語スコアおよび音響スコアを再計算する。なお、この再計算は、部分単語列を構成する各単語の単語境界を固定せずに行われる。従って、再評価部５９では、部分単語列の言語スコアおよび音響スコアを再計算することにより、部分単語列について、それを構成する各単語の単語境界の決定が、動的計画法に基づいて行われることになる。 Specifically, to recalculate the language score and the acoustic score for the partial word string, the re-evaluation unit 59 reads the sequence of feature amounts corresponding to the partial word string from the feature amount storage unit 55. That is, the re-evaluation unit 59 reads, for example, the sequence of feature amounts (feature amount sequence) associated with the interval from the time represented by the time information of the initial node, which is the first node of the partial path, to the time represented by the time information of the node of interest. Further, the re-evaluation unit 59 refers to the acoustic model database 61, the dictionary database 62, and the grammar database 63, and recalculates the language score and the acoustic score for the partial word string using the feature amount sequence read from the feature amount storage unit 55. This recalculation is performed without fixing the word boundaries of the words constituting the partial word string. Consequently, by recalculating the language score and the acoustic score of the partial word string, the re-evaluation unit 59 determines the word boundaries of the words constituting the partial word string on the basis of dynamic programming.

再評価部５９は、以上のようにして、部分単語列の各単語の言語スコアおよび音響スコア、並びに単語境界を新たに得ると、その新たな言語スコアおよび音響スコアによって、単語接続関係記憶部６０の部分単語列に対応する部分パスを構成するアークが有する言語スコアおよび音響スコアを修正するとともに、新たな単語境界によって、単語接続関係記憶部６０の部分単語列に対応する部分パスを構成するノードが有する時刻情報を修正する。なお、本実施の形態では、再評価部５９による単語接続関係情報の修正は、制御部５４を介して行われるようになっている。 When the re-evaluation unit 59 has thus newly obtained the language score and the acoustic score of each word in the partial word string, together with the word boundaries, it uses the new language and acoustic scores to correct the language and acoustic scores held by the arcs constituting the partial path corresponding to the partial word string in the word connection relation storage unit 60, and uses the new word boundaries to correct the time information held by the nodes constituting that partial path. In the present embodiment, the re-evaluation unit 59 corrects the word connection relation information via the control unit 54.

以上に説明した再評価部５９の処理の例を、図８乃至図１１を参照して説明する。図８は、単語接続関係記憶部６０に記憶された単語列仮説の１つを抜き出して、グラフ化して示したものである。図８において、時刻０のノード１から時刻ａのノード２までを結ぶアーク１は、単語候補「窓」に対応し、単語候補「窓」の音響スコアはＡ１であり、言語スコアはＬ１である。また、時刻ａのノード２から時刻ｔのノード３までを結ぶアーク２は、単語候補「を」に対応し、単語候補「を」の音響スコアはＡ２であり、言語スコアはＬ２である。 An example of the processing of the re-evaluation unit 59 described above will be described with reference to FIGS. 8 to 11. FIG. 8 shows, as a graph, one of the word string hypotheses stored in the word connection relation storage unit 60. In FIG. 8, arc 1, which connects node 1 at time 0 to node 2 at time a, corresponds to the word candidate “窓” (mado, “window”); its acoustic score is A1 and its language score is L1. Arc 2, which connects node 2 at time a to node 3 at time t, corresponds to the word candidate “を” (the object particle wo); its acoustic score is A2 and its language score is L2.

今、制御部５４より、ノード３が有する時刻情報を開始時刻としてマッチング処理を行なう旨の指令が通知された場合、再評価部５９は、ノード１乃至ノード３に対応する単語列仮説「窓を」の単語列に対する再評価を行なう。その結果、図９に示されるように、「窓」と「を」の単語境界となるノード２の時刻ａが時刻ｂに修正され、アーク１に付与された音響スコアＡ１が、Ａ１ｂに修正され、アーク２に付与された音響スコアＡ２が、Ａ２ｂに修正される。その後、マッチング部５８により、時刻tを開始点として、「開ける」という単語に対するマッチング処理が行なわれたとする。この結果、図１０に示されるように、時刻uのノード４を終端ノードとして、単語候補「開ける」に対応するアーク３が追加される。図１０において、単語候補「開ける」の音響スコアはＡ３であり、言語スコアはＬ３である。 Suppose the control unit 54 now issues a command to perform the matching process with the time information of node 3 as the start time. The re-evaluation unit 59 then re-evaluates the word string of the word string hypothesis “窓を” corresponding to nodes 1 to 3. As a result, as shown in FIG. 9, the time a of node 2, which is the word boundary between “窓” and “を”, is corrected to time b, the acoustic score A1 assigned to arc 1 is corrected to A1b, and the acoustic score A2 assigned to arc 2 is corrected to A2b. Suppose the matching unit 58 then performs the matching process for the word “開ける” (akeru, “open”) with time t as the start point. As a result, as shown in FIG. 10, arc 3 corresponding to the word candidate “開ける” is added, with node 4 at time u as its terminal node. In FIG. 10, the acoustic score of the word candidate “開ける” is A3 and its language score is L3.

その後、さらに制御部５４より、ノード４が有する時刻情報を開始時刻としてマッチング処理を行なう旨の指令が通知された場合、再評価部５９は、ノード２乃至ノード４に対応する「を開ける」の単語列に対する再評価を行なう。その結果、図１１に示されるように、「を」と「開ける」の単語境界となるノード３の時刻tが時刻sに修正され、アーク２に付与された音響スコアＡ２ｂがＡ２ｃに修正され、アーク３に付与された音響スコアＡ３が、Ａ３ｂに修正される。 After that, when the control unit 54 further issues a command to perform the matching process with the time information of node 4 as the start time, the re-evaluation unit 59 re-evaluates the word string “を開ける” corresponding to nodes 2 to 4. As a result, as shown in FIG. 11, the time t of node 3, which is the word boundary between “を” and “開ける”, is corrected to time s, the acoustic score A2b assigned to arc 2 is corrected to A2c, and the acoustic score A3 assigned to arc 3 is corrected to A3b.
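
The boundary correction illustrated in FIGS. 8 to 11 can be pictured with a small dynamic programming sketch. It is a deliberately simplified illustration under stated assumptions: the function name realign, the per-frame log scores, and the frame indices are hypothetical, and a real re-evaluation unit would score the states of the connected acoustic models and also recompute language scores, rather than use one score per word and frame.

```python
def realign(frame_scores):
    """Find word boundaries that maximize the total score.

    frame_scores[w][t] is a hypothetical log score of frame t under word w.
    Words must appear in order and each must cover at least one frame.
    Returns (best_total, boundaries), where boundaries[w] is the first frame of word w.
    """
    n_words = len(frame_scores)
    n_frames = len(frame_scores[0])
    neg_inf = float("-inf")
    # best[w][t]: best score of frames 0..t when frame t belongs to word w
    best = [[neg_inf] * n_frames for _ in range(n_words)]
    entered = [[False] * n_frames for _ in range(n_words)]
    best[0][0] = frame_scores[0][0]
    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[w][t - 1]
            enter = best[w - 1][t - 1] if w > 0 else neg_inf
            if enter > stay:
                best[w][t] = enter + frame_scores[w][t]
                entered[w][t] = True  # word w starts at frame t
            else:
                best[w][t] = stay + frame_scores[w][t]
    # Trace back from the last word at the last frame to recover the boundaries.
    boundaries = [0] * n_words
    w = n_words - 1
    for t in range(n_frames - 1, 0, -1):
        if entered[w][t]:
            boundaries[w] = t
            w -= 1
    return best[-1][-1], boundaries


# Two words ("窓", "を") over six frames; the scores favor a boundary at frame 3.
scores = [
    [-1.0, -1.0, -1.0, -5.0, -5.0, -5.0],  # "窓"
    [-5.0, -5.0, -5.0, -1.0, -1.0, -1.0],  # "を"
]
total, boundaries = realign(scores)
print(total, boundaries)  # -6.0 [0, 3]
```

If the boundary stored earlier had been frame 2, this pass would move it to frame 3, which is the kind of correction from time a to time b shown in FIG. 9.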

再評価部５９は、単語接続関係記憶部６０の単語接続関係情報の修正を終了すると、その旨を、制御部５４を介して、マッチング部５８に供給する。 When the re-evaluation unit 59 finishes correcting the word connection relation information in the word connection relation storage unit 60, it notifies the matching unit 58 of that fact via the control unit 54.

マッチング部５８は、上述したように、制御部５４からマッチング処理指令を受信した後、再評価部５９から、制御部５４を介して、単語接続関係情報の修正が終了した旨を受信すると、注目ノード、およびそれが有する時刻情報を、単語予備選択部５６に供給し、それぞれに、単語予備選択処理を要求して、ステップＳ５に進む。 As described above, after receiving the matching process command from the control unit 54, the matching unit 58 receives, from the re-evaluation unit 59 via the control unit 54, notification that correction of the word connection relation information has finished. It then supplies the node of interest and the time information it holds to the word preliminary selection unit 56, requests the word preliminary selection process, and the process proceeds to step S5.

単語予備選択部５６は、マッチング部５８から、単語予備選択処理の要求を受信すると、ステップＳ５において、注目ノードに接続されるアークとなる単語の候補を選択する単語予備選択処理を、辞書データベース６２の単語辞書に登録された単語を対象として行なう。 When the word preliminary selection unit 56 receives a request for word preliminary selection processing from the matching unit 58, in step S5, the word preliminary selection unit 56 performs word preliminary selection processing for selecting a word candidate to be an arc connected to the target node. This is performed on words registered in the word dictionary.

すなわち、単語予備選択部５６は、言語スコアおよび音響スコアを計算するのに用いる特徴量の系列の開始時刻を、注目ノードが有する時刻情報から認識し、その開始時刻以降の、必要な特徴量の系列を特徴量記憶部５５から読み出す。さらに、単語予備選択部５６は、辞書データベース６２の単語辞書に登録された各単語の単語モデルを、音響モデルデータベース６１の音響モデルを接続することで構成し、その単語モデルに基づき、特徴量記憶部５５から読み出した特徴量の系列を用いて、音響スコアを計算する。 That is, the word preliminary selection unit 56 recognizes the start time of the feature amount series used to calculate the language score and the acoustic score from the time information of the node of interest, and sets the necessary feature amount after the start time. The series is read from the feature amount storage unit 55. Further, the word preliminary selection unit 56 configures the word model of each word registered in the word dictionary of the dictionary database 62 by connecting the acoustic model of the acoustic model database 61, and stores the feature amount based on the word model. The acoustic score is calculated using the feature amount sequence read from the unit 55.

また、単語予備選択部５６は、各単語モデルに対応する単語の言語スコアを、文法データベース６３に記憶された文法規則に基づいて計算する。 The word preliminary selection unit 56 calculates the language score of the word corresponding to each word model based on the grammar rules stored in the grammar database 63.

なお、単語予備選択部５６においては、単語接続関係情報を参照することにより、各単語の音響スコアの計算を、その単語の直前の単語（注目ノードが終端となっているアークに対応する単語）に依存するクロスワードモデルを用いて行なうことが可能である。 In the word preliminary selection unit 56, by referring to the word connection relation information, the acoustic score of each word can be calculated using a cross-word model that depends on the word immediately preceding it (the word corresponding to the arc whose terminal node is the node of interest).

また、単語予備選択部５６においては、単語接続関係情報を参照することにより、各単語の言語スコアの計算を、その単語が、その直前の単語と連鎖する確率を規定するバイグラムに基づいて行なうことが可能である。 Likewise, by referring to the word connection relation information, the word preliminary selection unit 56 can calculate the language score of each word on the basis of a bigram that defines the probability that the word follows the immediately preceding word.

単語予備選択部５６は、以上のようにして、各単語について音響スコアおよび言語スコアを求めると、その音響スコアおよび言語スコアを総合評価したスコア（以下、適宜、単語スコアと称する）を求め、その上位Ｌ個を、マッチング処理の対象とする単語候補として、単語候補クラスタリング部５７に供給する。 When the word preliminary selection unit 56 has obtained the acoustic score and the language score for each word as described above, it obtains a score that comprehensively evaluates the two (hereinafter referred to as a word score, as appropriate), and supplies the top L words to the word candidate clustering unit 57 as word candidates to be subjected to the matching process.
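
The top-L selection by word score might look like the following sketch. The score values, the lm_weight parameter, and the choice of a weighted sum of log scores as the "comprehensive evaluation" are assumptions made for illustration; the description does not fix how the two scores are combined at this stage.

```python
import heapq


def preselect(words, acoustic, language, top_l, lm_weight=1.0):
    """Return the top_l words ranked by a combined word score (assumed: weighted sum)."""

    def word_score(word):
        return acoustic[word] + lm_weight * language[word]

    # heapq.nlargest returns the results sorted from best to worst.
    return heapq.nlargest(top_l, words, key=word_score)


# Hypothetical log scores for five dictionary words.
acoustic = {"開ける": -10.0, "明ける": -10.5, "空ける": -11.0, "閉める": -12.5, "食べる": -30.0}
language = {"開ける": -2.0, "明ける": -4.0, "空ける": -3.0, "閉める": -2.5, "食べる": -1.0}

candidates = preselect(list(acoustic), acoustic, language, top_l=4)
print(candidates)  # ['開ける', '空ける', '明ける', '閉める']
```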

なお、ここでは、単語予備選択部５６において、各単語の音響スコアおよび言語スコアを総合評価した単語スコアに基づいて、単語を選択するようにしたが、単語予備選択部５６では、その他、例えば、音響スコアだけや、言語スコアだけに基づいて、単語を選択するようにすることが可能である。 Here, in the word preliminary selection unit 56, the word is selected based on the word score obtained by comprehensively evaluating the acoustic score and the language score of each word. However, in the word preliminary selection unit 56, for example, It is possible to select a word based on only the acoustic score or the language score.

また、単語予備選択部５６では、特徴量記憶部５５から読み出した特徴量の系列の最初の部分だけを用いて、音響モデルデータベース６１の音響モデルに基づき、対応する単語の最初の部分の幾つかの音韻を求め、最初の部分が、その音韻に一致する単語を選択するようにすることも可能である。 In addition, the word preliminary selection unit 56 uses only the first part of the feature quantity series read from the feature quantity storage unit 55 and uses some of the first parts of the corresponding words based on the acoustic model in the acoustic model database 61. It is also possible to obtain a phoneme and select a word whose first part matches the phoneme.

さらに、単語予備選択部５６では、単語接続関係情報を参照して、直前の単語（注目ノードが終端ノードとなっているアークに対応する単語）の品詞を認識し、その品詞に続く単語の品詞として可能性の高い品詞の単語を選択するようにすることも可能である。 Further, the word preliminary selection unit 56 refers to the word connection relation information, recognizes the part of speech of the immediately preceding word (the word corresponding to the arc whose target node is the terminal node), and the part of speech of the word following the part of speech. It is also possible to select a word with the most likely part of speech.

すなわち、単語予備選択部５６における単語の選択方法は、どのような方法を用いても良く、究極的には、単語を、ランダムに選択しても良い。 That is, any method may be used as a word selection method in the word preliminary selection unit 56, and ultimately a word may be selected at random.

単語候補クラスタリング部５７は、単語予備選択部５６から、マッチング処理に用いるＬ個の単語候補を受信すると、ステップＳ６において、その単語候補を対象として、クラスタリング処理を行なう。 When the word candidate clustering unit 57 receives L word candidates used for the matching process from the word preliminary selection unit 56, in step S6, the word candidate clustering unit 57 performs the clustering process on the word candidates.

すなわち、単語候補クラスタリング部５７はＬ個の単語候補を発音が同一の語（同音語）毎に分類して、これらを単語候補セットとする。従って、例えば、単語予備選択部５６から、単語候補「開ける」、「明ける」、「空ける」、および「閉める」が供給された場合、単語候補クラスタリング部５７は、同音語である単語候補「開ける」、「明ける」、および「空ける」を１つの単語候補セット「あける」に分類し、単語候補「閉める」を１つの単語候補セット「閉める」に分類する。単語候補クラスタリング部５７は、単語候補を単語候補セットに分類した後、単語候補がどの単語候補セットに分類されているかを示す分類情報を各単語候補に付与して、単語候補をマッチング部５８に供給する。 That is, the word candidate clustering unit 57 classifies the L word candidates into groups of words with identical pronunciation (homophones) and treats each group as a word candidate set. For example, when the word candidates “開ける”, “明ける”, “空ける”, and “閉める” are supplied from the word preliminary selection unit 56, the word candidate clustering unit 57 classifies the homophones “開ける”, “明ける”, and “空ける” (all pronounced akeru) into one word candidate set “あける”, and classifies the word candidate “閉める” (shimeru) into one word candidate set “閉める”. After classifying the word candidates into word candidate sets, the word candidate clustering unit 57 attaches to each word candidate classification information indicating the set to which it belongs, and supplies the word candidates to the matching unit 58.
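
The clustering of step S6 amounts to grouping the candidates by pronunciation. A minimal sketch, assuming a hypothetical pronunciation dictionary that maps each word to its reading:

```python
from collections import defaultdict


def cluster_by_pronunciation(candidates, pronunciation):
    """Group word candidates into word candidate sets keyed by their reading."""
    sets = defaultdict(list)
    for word in candidates:
        sets[pronunciation[word]].append(word)
    return dict(sets)


# Hypothetical dictionary entries: word -> reading.
pronunciation = {"開ける": "あける", "明ける": "あける", "空ける": "あける", "閉める": "しめる"}

candidate_sets = cluster_by_pronunciation(["開ける", "明ける", "空ける", "閉める"], pronunciation)
print(candidate_sets)
# {'あける': ['開ける', '明ける', '空ける'], 'しめる': ['閉める']}
```

Here the dictionary key plays the role of the classification information attached to each word candidate.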

マッチング部５８は、単語候補クラスタリング部５７から、分類情報が付与されたＬ個の単語候補を受信すると、ステップＳ７において、その単語候補を対象として、マッチング処理を行なう。ここで、図１２のフローチャートを参照して、図７のステップＳ７のマッチング処理について詳細に説明する。なお、以下の説明においては、例として、図６のノード６が注目ノードとして選択され（図７のステップＳ３）、単語予備選択部５６により単語候補「開ける」、「明ける」、「空ける」、および「閉める」が選択され（図７のステップＳ５）、単語候補クラスタリング部５７により、単語候補「開ける」、「明ける」、および「空ける」を含む単語候補セット「あける」、並びに単語候補「閉める」を含む単語候補セット「しめる」にクラスタリングされた（図７のステップＳ６）場合について説明する。 When the matching unit 58 receives the L word candidates with classification information from the word candidate clustering unit 57, it performs the matching process on those word candidates in step S7. The matching process of step S7 in FIG. 7 is described in detail below with reference to the flowchart of FIG. 12. The following description takes as an example the case where node 6 in FIG. 6 is selected as the node of interest (step S3 in FIG. 7), the word preliminary selection unit 56 selects the word candidates “開ける”, “明ける”, “空ける”, and “閉める” (step S5 in FIG. 7), and the word candidate clustering unit 57 clusters them into the word candidate set “あける”, which contains “開ける”, “明ける”, and “空ける”, and the word candidate set “しめる”, which contains “閉める” (step S6 in FIG. 7).

図１２のステップＳ１０１において、マッチング部５８は、単語接続関係記憶部６０に記憶された単語接続関係情報を参照することにより、単語候補を接続する直前の単語列仮説を特定し、単語列仮説に対するスコアを算出する。 In step S101 of FIG. 12, the matching unit 58 refers to the word connection relationship information stored in the word connection relationship storage unit 60 to identify the word string hypothesis immediately before connecting the word candidates, and for the word string hypothesis. Calculate the score.

単語列仮説に対応するスコアは、音響スコアと言語スコアの累積値で与えられる。例えば、単語列仮説「窓を」に対応するスコアをＳ（窓を）とすると、図６に示される、アーク４およびアーク５により構成される単語列仮説「窓を」に対応するスコアは以下の式で与えられる。 The score corresponding to a word string hypothesis is given by the cumulative value of the acoustic scores and language scores. For example, letting S(窓を) denote the score corresponding to the word string hypothesis “窓を”, the score for the hypothesis “窓を” composed of arc 4 and arc 5 shown in FIG. 6 is given by the following expression.

Ｓ（窓を）＝ (A4 + L4) + (A5 + L5) S(窓を) = (A4 + L4) + (A5 + L5)

すなわち、単語列仮説に対応するスコアは、その単語列仮説に含まれている各単語の音響スコアと言語スコアを足し算した総合のスコアを、単語列仮説に含まれている全ての単語分足し算して算出される。 In other words, the score corresponding to the word string hypothesis is obtained by adding the total score obtained by adding the acoustic score and language score of each word included in the word string hypothesis to all the words included in the word string hypothesis. Is calculated.

ステップＳ１０２において、マッチング部５８は、単語候補クラスタリング部５７から供給された単語候補セットの中から、ステップＳ１０３乃至ステップＳ１０７の処理を行なうべき単語候補セットを１つ選択する。例えば、単語候補クラスタリング部５７により、単語候補「開ける」、「明ける」、および「空ける」を含む単語候補セット「あける」、並びに単語候補「閉める」を含む単語候補セット「しめる」にクラスタリングされた場合、マッチング部５８は、単語候補セット「あける」および「しめる」のうち、いずれか１つを選択する。 In step S102, the matching unit 58 selects, from the word candidate sets supplied from the word candidate clustering unit 57, one word candidate set to be processed in steps S103 to S107. For example, when the word candidate clustering unit 57 has clustered the candidates into the word candidate set “あける”, which contains “開ける”, “明ける”, and “空ける”, and the word candidate set “しめる”, which contains “閉める”, the matching unit 58 selects one of the sets “あける” and “しめる”.

次に、ステップＳ１０３において、マッチング部５８は、例えばバイグラムに基づく確率から、ステップＳ１０２で選択された単語候補セットに含まれている全ての単語候補の言語スコアを求める。例えば、ステップＳ１０２で、単語候補セット「あける」が選択された場合、ステップＳ１０３において、マッチング部５８は、「窓を」に「開ける」が後続する場合の言語スコア、「窓を」に「明ける」が後続する場合の言語スコア、および「窓を」に「空ける」が後続する場合の言語スコアをそれぞれ求める。 Next, in step S103, the matching unit 58 obtains the language scores of all word candidates in the word candidate set selected in step S102, for example from bigram probabilities. For example, when the word candidate set “あける” is selected in step S102, the matching unit 58 obtains in step S103 the language score for “開ける” following “窓を”, the language score for “明ける” following “窓を”, and the language score for “空ける” following “窓を”.

ステップＳ１０４において、マッチング部５８は、ステップＳ１０１で算出された、単語候補を接続する直前の単語までの累積スコア、およびステップＳ１０３で算出された、単語候補の言語スコアに基づいて、スコアの初期値を算出する。 In step S104, the matching unit 58 calculates the initial score based on the cumulative score up to the word immediately before connecting the word candidate calculated in step S101 and the language score of the word candidate calculated in step S103. Is calculated.

スコアの初期値は、前に接続する単語列仮説に応じて異なったものとなる。例えば、単語列仮説「窓を」に対して「開ける」が後続する場合、その初期値Ｕ１は、以下の式で与えられる。 The initial value of the score differs depending on the preceding word string hypothesis. For example, when “開ける” follows the word string hypothesis “窓を”, the initial value U1 is given by the following expression.

Ｕ１＝Ｓ（窓を）＋Ｌ（開ける|窓を） ・・・（１） U1 = S(窓を) + L(開ける | 窓を) ... (1)

上記の式（１）において、Ｌ（開ける|窓を）は、「窓を」に「開ける」が後続する場合の言語スコアを表す。バイグラムを用いる場合には、この言語スコアは、（を、開ける）という２単語の連鎖確率で与えられることになる。 In expression (1), L(開ける | 窓を) represents the language score when “開ける” follows “窓を”. When a bigram is used, this language score is given by the chain probability of the two words (を, 開ける).

同様にして、単語列仮説「窓を」に対して「明ける」が後続する場合、その初期値Ｕ２は、以下の式で与えられる。なお、Ｌ（明ける|窓を）は、「窓を」に「明ける」が後続する場合の言語スコアを表す。 Similarly, when “明ける” follows the word string hypothesis “窓を”, the initial value U2 is given by the following expression, where L(明ける | 窓を) represents the language score when “明ける” follows “窓を”.

Ｕ２＝Ｓ（窓を）＋Ｌ（明ける|窓を） ・・・（２） U2 = S(窓を) + L(明ける | 窓を) ... (2)

同様にして、単語列仮説「窓を」に対して「空ける」が後続する場合、その初期値Ｕ３は、以下の式で与えられる。なお、Ｌ（空ける|窓を）は、「窓を」に「空ける」が後続する場合の言語スコアを表す。 Similarly, when “空ける” follows the word string hypothesis “窓を”, the initial value U3 is given by the following expression, where L(空ける | 窓を) represents the language score when “空ける” follows “窓を”.

Ｕ３＝Ｓ（窓を）＋Ｌ（空ける|窓を） ・・・（３） U3 = S(窓を) + L(空ける | 窓を) ... (3)

このようにして、新たな単語候補セットを接続するノード（例えば、図６のノード６）に至る単語列仮説（例えば、図６の「窓を」）のスコアに、単語候補セットに含まれている各単語（例えば、「開ける」、「明ける」、または「空ける」）を接続した場合の言語スコアを足し算することにより、各単語候補に対応するスコアの初期値を算出することができる。上記の例の場合、１つの単語列仮説と３つの単語候補に対して、３つのスコアの初期値Ｕ１乃至Ｕ３が求まることになる。 In this way, the initial value of the score for each word candidate is calculated by adding, to the score of the word string hypothesis (for example, “窓を” in FIG. 6) leading to the node to which the new word candidate set is connected (for example, node 6 in FIG. 6), the language score obtained when each word in the set (for example, “開ける”, “明ける”, or “空ける”) is connected. In the above example, three initial values U1 to U3 are obtained for one word string hypothesis and three word candidates.

図１３は、単語列仮説と単語候補の組み合わせと、それぞれのスコアの初期値Ｕ１乃至Ｕ３が与えられた例を示している。すなわち、図１３において、Ｕ１は単語列仮説「窓を」に対して「開ける」が後続する場合の初期値を表し、Ｕ２は単語列仮説「窓を」に対して「明ける」が後続する場合の初期値を表し、Ｕ３は単語列仮説「窓を」に対して「空ける」が後続する場合の初期値を表している。 FIG. 13 shows an example of the combinations of the word string hypothesis and the word candidates together with the initial values U1 to U3 of their scores. That is, in FIG. 13, U1 is the initial value when “開ける” follows the word string hypothesis “窓を”, U2 is the initial value when “明ける” follows it, and U3 is the initial value when “空ける” follows it.
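
Steps S101 to S104 can be traced with a few lines of arithmetic. The numeric scores below are hypothetical values chosen only to make the sums easy to follow, and the function names are illustrative; the structure follows the expression for S(窓を) and expressions (1) to (3).

```python
def hypothesis_score(arcs):
    """Step S101: sum (acoustic + language) over every arc of the hypothesis."""
    return sum(acoustic + language for acoustic, language in arcs)


def initial_scores(prefix_score, bigram, previous_word, candidates):
    """Steps S103 and S104: U = S(prefix) + L(candidate | prefix) for each candidate."""
    return {word: prefix_score + bigram[(previous_word, word)] for word in candidates}


# Hypothetical (acoustic, language) log scores for arc 4 ("窓") and arc 5 ("を").
s_mado_wo = hypothesis_score([(-20.0, -3.0), (-8.0, -1.0)])

# Hypothetical bigram log scores for words following "を".
bigram = {("を", "開ける"): -2.0, ("を", "明ける"): -6.0, ("を", "空ける"): -4.0}

u = initial_scores(s_mado_wo, bigram, "を", ["開ける", "明ける", "空ける"])
print(s_mado_wo)  # -32.0
print(u)          # {'開ける': -34.0, '明ける': -38.0, '空ける': -36.0}
```

With these values U1 is the largest of the three initial values, which matches the case assumed in the description of step S105.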

次に、マッチング部５８は、ステップＳ１０５において、算出したスコアの初期値が１番大きな値の単語候補を、その単語候補を含む単語候補セットを代表する単語（以下、代表単語とも称する）として決定し、ステップＳ１０６において、ステップＳ１０５で決定された代表単語の音響スコアを算出する。例えば、図１３の(窓を、開ける)の組み合わせの初期値Ｕ１が最も高い場合、マッチング部５８は、「開ける」に対するマッチング処理を実行する。ここで、「開ける」に対応する音響モデルは、直前の単語列仮説「窓を」に依存したものが用いられる。具体的には、マッチング部５８は、音響スコアを計算するのに用いる特徴量の系列の開始時刻を、注目ノード（図６の例の場合、ノード６）が有する時刻情報から認識し、その開始時刻以降の、必要な特徴量の系列を特徴量記憶部５５から読み出す。さらに、マッチング部５８は、辞書データベース６２を参照することで、代表単語「開ける」の音韻情報を認識し、その音韻情報に対応する音響モデルを、音響モデルデータベース６１から読み出して接続することで、単語モデルを構成する。 Next, in step S105, the matching unit 58 determines the word candidate with the largest initial score value as the word that represents the word candidate set containing it (hereinafter also referred to as the representative word), and in step S106 it calculates the acoustic score of the representative word determined in step S105. For example, when the initial value U1 of the combination (窓を, 開ける) in FIG. 13 is the highest, the matching unit 58 executes the matching process for “開ける”. Here, the acoustic model used for “開ける” is one that depends on the immediately preceding word string hypothesis “窓を”. Specifically, the matching unit 58 recognizes the start time of the feature amount sequence used to calculate the acoustic score from the time information held by the node of interest (node 6 in the example of FIG. 6), and reads the necessary feature amount sequence from that start time onward from the feature amount storage unit 55. Further, the matching unit 58 refers to the dictionary database 62 to recognize the phoneme information of the representative word “開ける”, reads the acoustic models corresponding to that phoneme information from the acoustic model database 61, and connects them to construct a word model.

そして、マッチング部５８は、上述のようにして構成した単語モデルに基づき、特徴量記憶部５５から読み出した特徴量系列を用いて、代表単語の音響スコアを計算する。なお、マッチング部５８においては、単語接続関係情報を参照することにより、単語の音響スコアの計算を、クロスワードモデルに基づいて行なうようにすることが可能である。 Then, the matching unit 58 calculates the acoustic score of the representative word using the feature amount series read from the feature amount storage unit 55 based on the word model configured as described above. The matching unit 58 can calculate the acoustic score of the word based on the crossword model by referring to the word connection relation information.

ステップＳ１０７において、マッチング部５８は、ステップＳ１０６で求められた代表単語の音響スコアを、同一の単語候補セットに含まれている他の単語候補の音響スコアとして決定する。その結果、例えばステップＳ１０６で、図６の時刻t１を開始点とし、ある時刻を終了点とする音響スコアが代表単語「開ける」に対して求まっていた場合、この音響スコアを、残りの単語列仮説と単語候補の組み合わせに対しても近似値としてそのまま利用し、残りの２つの単語候補に対するマッチング処理は行なわない。すなわち、例えば、（窓を、明ける）の組み合わせおよび（窓を、空ける）の組み合わせについては、音響スコアの算出を省略し、（窓を、開ける）の組み合わせで算出された音響スコアで代用する。 In step S107, the matching unit 58 adopts the acoustic score of the representative word obtained in step S106 as the acoustic score of the other word candidates in the same word candidate set. For example, if step S106 has produced an acoustic score for the representative word “開ける” that starts at time t1 in FIG. 6 and ends at some time, this acoustic score is used as is, as an approximation, for the remaining combinations of the word string hypothesis and word candidates, and the matching process is not performed for the remaining two word candidates. That is, for the combinations (窓を, 明ける) and (窓を, 空ける), the calculation of the acoustic score is omitted, and the acoustic score calculated for the combination (窓を, 開ける) is used instead.

ステップＳ１０８において、マッチング部５８は、未選択の単語候補セットが存在するか否かを判定し、未選択の単語候補セットが存在する場合、処理はステップＳ１０２に戻り、上述したステップＳ１０２以降の処理を繰り返す。例えば、上述したように、単語候補セット「あける」および「しめる」のうち、単語候補セット「あける」についてステップＳ１０２乃至ステップＳ１０７の処理が実行された後、ステップＳ１０８において、マッチング部５８は、未選択の単語候補セット「しめる」が存在すると判定し、処理はステップＳ１０２に戻る。その後、単語候補セット「しめる」についてもステップＳ１０２乃至ステップＳ１０７の処理が実行される。なお、単語候補セット「しめる」のように、単語候補セットに単語候補が１つしか含まれていない場合、ステップＳ１０７の処理はスキップされる。 In step S108, the matching unit 58 determines whether an unselected word candidate set exists. If one exists, the process returns to step S102, and the processing from step S102 onward is repeated. For example, of the word candidate sets “あける” and “しめる”, after steps S102 to S107 have been executed for the set “あける”, the matching unit 58 determines in step S108 that the unselected set “しめる” exists, and the process returns to step S102. Steps S102 to S107 are then executed for the set “しめる” as well. When a word candidate set contains only one word candidate, as the set “しめる” does, the processing of step S107 is skipped.
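
The loop over word candidate sets in steps S102 to S108 can be sketched as follows. The acoustic scorer here is a stand-in that merely looks up a hypothetical value and records each call, so that the number of acoustic score calculations can be counted; a real matching unit would evaluate a word model against the feature amount sequence. All names and numbers are illustrative assumptions.

```python
def match_candidate_sets(candidate_sets, initial_score, acoustic_scorer):
    """Score one representative word per set and share its acoustic score (S102 to S108)."""
    acoustic = {}
    for words in candidate_sets.values():                            # S102 / S108 loop
        representative = max(words, key=lambda w: initial_score[w])  # S105
        shared = acoustic_scorer(representative)                     # S106
        for word in words:                                           # S107
            acoustic[word] = shared
    return acoustic


calls = []  # records every word for which an acoustic score was actually computed


def toy_acoustic_scorer(word):
    """Hypothetical stand-in for the acoustic matching of one word."""
    calls.append(word)
    return {"開ける": -15.0, "明ける": -15.5, "空ける": -15.2, "閉める": -22.0}[word]


candidate_sets = {"あける": ["開ける", "明ける", "空ける"], "しめる": ["閉める"]}
initial_score = {"開ける": -34.0, "明ける": -38.0, "空ける": -36.0, "閉める": -35.0}

acoustic = match_candidate_sets(candidate_sets, initial_score, toy_acoustic_scorer)
print(acoustic)  # every word in "あける" receives -15.0; "閉める" receives -22.0
print(calls)     # ['開ける', '閉める']
```

Four word candidates are scored with only two acoustic evaluations, at the cost of treating the representative's acoustic score as an approximation for its homophones.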

ステップＳ１０８において、マッチング部５８が、未選択の単語候補セットは存在しないと判定した場合、図１２のマッチング処理５８は終了し、処理は図７のステップＳ８に進む。 If the matching unit 58 determines in step S108 that no unselected word candidate set exists, the matching process of FIG. 12 ends, and the process proceeds to step S8 in FIG. 7.

マッチング部５８は、以上のようにして、単語候補クラスタリング部５７からの単語候補セットすべてについて、その音響スコアおよび言語スコアを求め、ステップＳ８に進む。 As described above, the matching unit 58 obtains the acoustic score and the language score for all the word candidate sets from the word candidate clustering unit 57, and proceeds to step S8.

ステップＳ８では、単語候補それぞれについて、その音響スコアおよび言語スコアを総合評価した単語スコアが求められ、その単語スコアに基づいて、単語接続関係記憶部６０に記憶された単語接続関係情報が更新される。 In step S8, a word score obtained by comprehensively evaluating the acoustic score and the language score is obtained for each word candidate, and the word connection relation information stored in the word connection relation storage unit 60 is updated based on the word score. .

すなわち、ステップＳ８では、マッチング部５８は、単語候補について単語スコアを求め、例えば、その単語スコアを所定の閾値と比較すること等によって、注目ノードに接続するアークとしての単語を、単語候補の中から絞り込む。そして、マッチング部５８は、その絞り込みの結果残った単語を、その音響スコア、言語スコア、およびその単語の終了時刻とともに、制御部５４に供給する。 That is, in step S8, the matching unit 58 obtains a word score for each word candidate and narrows down, from among the word candidates, the words to serve as arcs connected to the node of interest, for example by comparing each word score with a predetermined threshold. The matching unit 58 then supplies the words that remain after this narrowing to the control unit 54, together with their acoustic scores, language scores, and end times.

なお、マッチング部５８において、単語の終了時刻は、音響スコアを計算するのに用いた特徴量の抽出時刻から認識される。また、ある単語について、その終了時刻としての蓋然性の高い抽出時刻が複数得られた場合には、その単語については、各終了時刻と、対応する音響スコアおよび言語スコアとのセットが、制御部５４に供給される。 In the matching unit 58, the end time of a word is recognized from the extraction time of the feature amounts used to calculate the acoustic score. When a plurality of extraction times that are highly probable as the end time are obtained for a word, a set consisting of each end time and the corresponding acoustic score and language score is supplied to the control unit 54 for that word.

制御部５４は、上述のようにしてマッチング部５８から供給される単語の音響スコア、言語スコア、および終了時刻を受信すると、マッチング部５８からの各単語について、単語接続関係記憶部６０に記憶された単語接続関係情報における注目ノードを始端ノードとして、アークを延ばし、そのアークを、終了時刻の位置に対応する終端ノードに接続する。さらに、制御部５４は、各アークに対して、対応する単語、並びにその音響スコアおよび言語スコアを付与するとともに、各アークの終端ノードに対して、対応する終了時刻を時刻情報として与える。そして、ステップＳ２に戻り、以下、同様の処理が繰り返される。 When receiving the acoustic score, language score, and end time of the word supplied from the matching unit 58 as described above, the control unit 54 stores each word from the matching unit 58 in the word connection relation storage unit 60. The target node in the word connection relation information is set as the start node, the arc is extended, and the arc is connected to the end node corresponding to the position of the end time. Further, the control unit 54 gives a corresponding word, an acoustic score and a language score to each arc, and gives a corresponding end time as time information to the end node of each arc. And it returns to step S2 and the same process is repeated hereafter.
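
The narrowing in step S8 and the subsequent extension of the word connection relation information might be sketched as below. The Arc and WordGraph classes, the threshold value, and the choice to create a fresh terminal node for every arc are assumptions made for this illustration; the description only states that each arc is connected to a terminal node corresponding to the end time. Node numbering in the sketch is arbitrary and does not follow the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    word: str
    acoustic: float
    language: float
    start_node: int
    end_node: int


@dataclass
class WordGraph:
    node_time: dict = field(default_factory=dict)  # node id -> time information
    arcs: list = field(default_factory=list)

    def add_node(self, time):
        node_id = len(self.node_time) + 1
        self.node_time[node_id] = time
        return node_id

    def extend(self, start_node, word, acoustic, language, end_time):
        """Extend an arc from start_node to a new terminal node at end_time."""
        end_node = self.add_node(end_time)
        self.arcs.append(Arc(word, acoustic, language, start_node, end_node))
        return end_node


def prune(results, threshold):
    """Keep (word, acoustic, language, end_time) entries whose word score passes."""
    return [r for r in results if r[1] + r[2] >= threshold]


graph = WordGraph()
node_of_interest = graph.add_node(time=25)
results = [  # hypothetical output of the matching unit
    ("開ける", -15.0, -2.0, 60),
    ("明ける", -15.0, -6.0, 60),
    ("空ける", -15.0, -4.0, 60),
    ("閉める", -22.0, -3.0, 62),
]
for word, acoustic, language, end_time in prune(results, threshold=-30.0):
    graph.extend(node_of_interest, word, acoustic, language, end_time)
print(len(graph.arcs))  # 4
```

The new terminal nodes have no arcs yet, so they become intermediate nodes for the next pass through step S2 unless they have reached the final time T.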

その結果、例えば、図１４に示されるような単語接続関係情報が、単語接続関係記憶部６０に記憶される。図１４は図６のノード６に対して、単語候補セット「あける」に含まれている単語候補「開ける」、「明ける」、および「空ける」、並びに単語候補セット「しめる」に含まれている単語候補「閉める」が接続された場合の例を示している。アーク６は、単語候補「開ける」に対応しており、この音響スコアＡ６および言語スコアＬ６は、従来通り計算される。アーク７は、単語候補「明ける」に対応しており、この言語スコアＬ７は、従来通り計算されているが、音響スコアは、単語候補「開ける」の音響スコアＡ６で代用している。すなわち、単語候補「明ける」の音響スコアは計算されない。また、アーク８は、単語候補「空ける」に対応しており、この言語スコアＬ８は、従来通り計算されているが、音響スコアは、単語候補「開ける」の音響スコアＡ６で代用している。すなわち、単語候補「空ける」の音響スコアは計算されない。 As a result, word connection relation information such as that shown in FIG. 14, for example, is stored in the word connection relation storage unit 60. FIG. 14 shows an example in which the word candidates “開ける”, “明ける”, and “空ける” in the word candidate set “あける” and the word candidate “閉める” in the word candidate set “しめる” are connected to node 6 in FIG. 6. Arc 6 corresponds to the word candidate “開ける”; its acoustic score A6 and language score L6 are calculated in the conventional manner. Arc 7 corresponds to the word candidate “明ける”; its language score L7 is calculated in the conventional manner, but the acoustic score A6 of the word candidate “開ける” is substituted for its acoustic score. That is, the acoustic score of “明ける” is not calculated. Arc 8 corresponds to the word candidate “空ける”; its language score L8 is calculated in the conventional manner, but the acoustic score A6 of the word candidate “開ける” is substituted for its acoustic score. That is, the acoustic score of “空ける” is not calculated.

Arc 9 corresponds to the word candidate "閉める" (close), and its acoustic score A9 and language score L9 are calculated in the conventional manner.
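The arc layout of FIG. 14 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this specification: the `Arc` class, `extend_arcs`, and the `match` callback are hypothetical names, and the representative of each set is chosen by language score, which coincides with choosing the best initial value only when a single word string hypothesis precedes the set.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    word: str
    acoustic: float   # acoustic score, shared within a homophone set
    language: float   # language score, computed per word
    end_time: int     # time information of the end node


def extend_arcs(candidate_sets, language_score, match):
    """Build one arc per word while running `match` once per homophone set.

    candidate_sets: list of lists of homophonous word candidates.
    language_score: dict mapping word -> language score (higher is better).
    match: hypothetical callback, word -> (acoustic score, end time).
    """
    arcs = []
    for homophones in candidate_sets:
        # Only the representative of the set is actually matched.
        representative = max(homophones, key=lambda w: language_score[w])
        acoustic, end_time = match(representative)
        # Every member of the set reuses the representative's acoustic score.
        for word in homophones:
            arcs.append(Arc(word, acoustic, language_score[word], end_time))
    return arcs
```

With the sets ["開ける", "明ける", "空ける"] and ["閉める"], `match` is invoked twice and four arcs are produced, which corresponds to arcs 6 to 9 of FIG. 14.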

The above processing reduces the amount of computation required for speech recognition. In the example above, an acoustic score was conventionally obtained for each of the three combinations ("窓を", "開ける"), ("窓を", "明ける"), and ("窓を", "空ける"), where "窓を" means "the window" as a direct object and the three following words are the homophones "open", "dawn", and "empty". In contrast, the matching unit 58 needs to obtain an acoustic score in step S106 of FIG. 12 only for the combination with the highest initial value, ("窓を", "開ける"), so the number of acoustic score calculations is reduced to one third.

Conventionally, an acoustic score had to be calculated for each of the word candidates "開ける" (open), "明ける" (dawn), "空ける" (empty), and "閉める" (close), for a total of four acoustic score calculations. In contrast, the matching unit 58 needs only two calculations in total: one for the word candidate set "あける" (akeru), which contains "開ける", "明ける", and "空ける", and one for the word candidate set "しめる" (shimeru), which contains "閉める". The number of calculations is therefore reduced by two, that is, the conventional four minus the two required in the case of FIG. 8.

In the example of FIG. 14, each word candidate has only one end point (end-node time), namely T, but a word candidate may have several candidate end points.

The principle by which the end point is obtained in step S106 of FIG. 12 is now described. Consider matching the word candidate "開ける" (open) with node 6 of FIG. 6 as the start point. In this case, as described above, matching for "開ける" is performed with start time t1 and initial score value U1. The matching unit 58 obtains the acoustic reference pattern corresponding to "開ける" from the pronunciation information held in the acoustic model database 61 and the dictionary database 62.

When a hidden Markov model (HMM) is used as the acoustic model, the reference pattern is a state transition model such as that shown in FIG. 15, in which a transition probability is assigned to each transition between states and an output probability density function is assigned to each state. The initial value U1 is given to the initial state 101, and the acoustic score is calculated using the feature time series that starts at time t1 of FIG. 6. The scores are accumulated by the method known as Viterbi search. As a result, the cumulative score in the final state 102 is obtained at every time point, and the cumulative score in the final state 102 at each time point is the acoustic score obtained for "開ける" (open).
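The score accumulation described above can be illustrated with a short sketch. It assumes log-domain scores and a strictly left-to-right HMM with only self-loops and single-step forward transitions; the function name and argument layout are hypothetical and are not taken from this specification.

```python
NEG_INF = float("-inf")


def viterbi_final_scores(initial_score, log_self, log_next, log_emit):
    """Return the cumulative score of the final state at every frame.

    initial_score: initial value given to the first state (e.g. U1).
    log_self[s]:   log probability of staying in state s.
    log_next[s]:   log probability of moving from state s to s + 1.
    log_emit[t][s]: log output probability of frame t in state s.
    """
    n_states = len(log_self)
    prev = [NEG_INF] * n_states
    finals = []
    for t, emit in enumerate(log_emit):
        cur = [NEG_INF] * n_states
        for s in range(n_states):
            if t == 0:
                # The search starts in the initial state only.
                best = initial_score if s == 0 else NEG_INF
            else:
                stay = prev[s] + log_self[s]
                move = prev[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
                best = max(stay, move)
            cur[s] = best + emit[s]
        finals.append(cur[-1])  # acoustic score if the word ended at frame t
        prev = cur
    return finals
```

Each entry of the returned list corresponds to one candidate end time, which is why a single matching run can yield acoustic scores for several end points.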

At any given time, the cumulative score of one state between the initial state 101 and the final state 102 can be compared with that of another state, and the Viterbi search is carried out while keeping active only those states whose cumulative scores are relatively good. This technique, called beam search, is widely used to make matching more efficient. For the final state 102 of FIG. 15 as well, only the times at which its cumulative score is relatively good can be judged valid. The times at which the final state 102 is valid are therefore taken as the end points of the processing by the matching unit 58.

In general, the time at which the final state 102 becomes valid is not uniquely determined, so the final state may be valid at several times. FIG. 16 shows an example of the word connection relation information stored in the word connection relation storage unit 60 when the final state 102 is valid at two times, t2 and T.
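One possible way to decide when the final state is valid is sketched below. The specification does not state the pruning criterion, so the fixed beam width relative to the best state score at each frame is an assumption, as are the function and parameter names.

```python
def valid_end_times(state_scores, beam_width):
    """Return the frames at which the final state survives beam pruning.

    state_scores[t][s]: cumulative score of state s at frame t
                        (higher is better); the final state is the last one.
    beam_width: assumed pruning margin below the best score of the frame.
    """
    ends = []
    for t, scores in enumerate(state_scores):
        best = max(scores)
        final = scores[-1]
        # The final state is valid only if it is reachable and close to the best.
        if final > float("-inf") and final >= best - beam_width:
            ends.append(t)
    return ends
```

With a wide beam the final state can be valid at several frames, which gives multiple end points such as t2 and T; a narrow beam tends to leave a single end point.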

FIG. 16 shows an example in which the word candidates "開ける" (open), "明ける" (dawn), and "空ける" (empty), which belong to the word candidate set "あける" (akeru), and the word candidate "閉める" (close), which belongs to the word candidate set "しめる" (shimeru), are connected to node 6 of FIG. 6.

In FIG. 16, arc 6 corresponds to the word candidate "開ける" (open), and the time of its end node is the end time T of the speech section. The acoustic score A6 and language score L6 of this word candidate are calculated in the conventional manner. Arc 7 also corresponds to "開ける", and the time of its end node is time t2, which precedes T. The acoustic score A7 of arc 7 is obtained at time t2 in the course of the score calculation for "開ける", after which the acoustic score A6 is obtained at time T. Because arc 7 carries the same word candidate as arc 6, its language score is the same value, L6.

Arc 8 corresponds to the word candidate "明ける" (dawn), and the time of its end node is the end time T of the speech section, as for "開ける" (open) on arc 6. The language score L8 of "明ける" is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A6 of "開ける" on arc 6, which has the same end point. That is, no acoustic score is calculated for "明ける" on arc 8. Arc 9 also corresponds to "明ける", and the time of its end node is t2, as for "開ける" on arc 7. Its acoustic score is substituted by the acoustic score A7 of "開ける" on arc 7, which has the same end point. That is, no acoustic score is calculated for "明ける" on arc 9. Because arc 9 carries the same word candidate as arc 8, its language score is the same value, L8.

Arc 10 corresponds to the word candidate "空ける" (empty), and the time of its end node is the end time T of the speech section, as for "開ける" (open) on arc 6. The language score L10 of "空ける" is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A6 of "開ける" on arc 6, which has the same end point. That is, no acoustic score is calculated for "空ける" on arc 10. Arc 11 also corresponds to "空ける", and the time of its end node is t2, as for "開ける" on arc 7. Its acoustic score is substituted by the acoustic score A7 of "開ける" on arc 7, which has the same end point. That is, no acoustic score is calculated for "空ける" on arc 11. Because arc 11 carries the same word candidate as arc 10, its language score is the same value, L10.

Arc 12 corresponds to the word candidate "閉める" (close), and the time of its end node is the end time T of the speech section. The acoustic score A12 and language score L12 of this word candidate are calculated in the conventional manner. Arc 13 also corresponds to "閉める", and the time of its end node is time t2, which precedes T. The acoustic score A13 of arc 13 is calculated in the conventional manner. Because arc 13 carries the same word candidate as arc 12, its language score is the same value, L12.

In the example of FIG. 16, step S106 of FIG. 12 yields two candidate end points (end-node times) for each word candidate. In this case, for word candidates belonging to the same word candidate set, the acoustic score needs to be calculated only once for each candidate time (T and t2). Conventionally, too, the score calculation for one and the same word candidate could be done once, but because an acoustic score was obtained for each different word candidate, acoustic scores were obtained a total of four times for arcs 6 through 14. In contrast, for the word candidates in the same word candidate set, the matching unit 58 needs to calculate the acoustic score only once per end point candidate. In FIG. 16, therefore, only two acoustic score calculations are needed.

The number of calculations is therefore reduced by two, that is, the conventional four minus the two required in the case of FIG. 16. This is the same reduction as in the case of a single end point shown in FIG. 14. The same reduction is obtained even when there are three or more end point candidates.
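The sharing of acoustic scores across several end points, as in FIG. 16, might be expressed as in the sketch below. It assumes that a single matching run for the representative word returns one acoustic score per valid end time; the function name and the tuple layout are hypothetical.

```python
def arcs_for_set(homophones, language_score, end_point_scores):
    """Expand one homophone set into arcs for every candidate end point.

    homophones: words of one word candidate set, e.g. the set "あける".
    language_score: dict mapping word -> language score.
    end_point_scores: dict mapping end time -> acoustic score, taken from
        the single matching run of the set's representative (assumption).
    Returns tuples (word, end time, acoustic score, language score).
    """
    return [
        (word, end_time, acoustic, language_score[word])
        for word in homophones
        # Later end times first, mirroring the arc order T, t2 of FIG. 16.
        for end_time, acoustic in sorted(end_point_scores.items(), reverse=True)
    ]
```

For three homophones and two end points this yields six arcs from one matching run, which corresponds to arcs 6 to 11 of FIG. 16.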

In the examples of FIG. 14 and FIG. 16, only one word string hypothesis, "窓を" (the window, as a direct object), precedes the word candidate sets (that is, the sets are connected to a single node, node 6), but several word string hypotheses may precede the word candidate sets at the same time. That is, when several intermediate nodes exist at the same time in step S3 of FIG. 7, several intermediate nodes are selected. FIG. 17 shows an example of word connection relation information in which several word string hypotheses precede the word candidate sets at the same time.

The graph shown in FIG. 17 is the graph of FIG. 6 with arc 6 added. That is, in FIG. 17, arc 6, which corresponds to the word candidate "ドア" (door), is additionally connected to node 1. Node 7 is the end node of arc 6, and the time information held by node 7 is the same as that held by node 6.

As shown in FIG. 17, several word string hypotheses (here two, "窓を" and "ドア") may precede the word candidate sets at the same time t1.

In this case, in step S101 of FIG. 12, the score of each word string hypothesis is given as the accumulated value of its acoustic and language scores. For example, writing the scores of the word string hypotheses "窓を" (window) and "ドア" (door) as S(窓を) and S(ドア), the scores of the two hypotheses shown in FIG. 17 are given by the following expressions.

S(窓を) = (A4 + L4) + (A5 + L5)
S(ドア) = A6 + L6

Then, in step S103 of FIG. 12, a language score is obtained for each word candidate, and in step S104 the initial score values are obtained. The initial score value needed to match the three homophonous word candidates "開ける" (open), "空ける" (empty), and "明ける" (dawn) differs depending on the word string hypothesis that precedes them. When "開ける", "明ける", and "空ける" follow the word string hypothesis "窓を" (window), the initial values are given as U1, U2, and U3 by expressions (1), (2), and (3) above, respectively.

The initial value U4 when "開ける" (open) follows the word string hypothesis "ドア" (door), the initial value U5 when "明ける" (dawn) follows "ドア", and the initial value U6 when "空ける" (empty) follows "ドア" are given by the following expressions, respectively.

U4 = S(ドア) + L(開ける|ドア) ... (4)

U5 = S(ドア) + L(明ける|ドア) ... (5)

U6 = S(ドア) + L(空ける|ドア) ... (6)

In the expressions above, L(開ける|ドア) is the language score when "開ける" (open) follows "ドア" (door), L(明ける|ドア) is the language score when "明ける" (dawn) follows "ドア", and L(空ける|ドア) is the language score when "空ける" (empty) follows "ドア". These language scores are obtained in step S103 of FIG. 12.

In this way, six initial score values are obtained for the two word string hypotheses "窓を" (window) and "ドア" (door) and the three word candidates "開ける" (open), "明ける" (dawn), and "空ける" (empty). FIG. 18 shows the combinations of word string hypothesis and word candidate together with their initial values U1 to U6. In FIG. 18, U1, U2, and U3 are the initial values when "開ける", "明ける", and "空ける", respectively, follow the word string hypothesis "窓を", and U4, U5, and U6 are the initial values when "開ける", "明ける", and "空ける", respectively, follow the word string hypothesis "ドア".
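The six initial values and the choice of the best combination can be illustrated as follows. This sketch assumes that scores are log-domain values for which a larger value is better; the function names and the dictionary layout are hypothetical.

```python
def initial_values(hypothesis_score, language_score):
    """Compute U = S(hypothesis) + L(word | hypothesis) for every pair.

    hypothesis_score: dict mapping hypothesis -> accumulated score S.
    language_score: dict mapping (word, hypothesis) -> L(word | hypothesis).
    Returns a dict mapping (hypothesis, word) -> initial score value.
    """
    return {
        (hyp, word): hypothesis_score[hyp] + lm
        for (word, hyp), lm in language_score.items()
    }


def best_combination(values):
    """Return the (hypothesis, word) pair with the best initial value."""
    return max(values, key=values.get)
```

Only the pair returned by `best_combination` would then be passed to the matching process; the other pairs reuse its acoustic score.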

In step S105 of FIG. 12, the matching unit 58 selects, from the combinations shown in FIG. 18, the one with the best initial value, and performs matching for the corresponding word candidate based on that combination. For example, when the initial value U1 is the best and the combination ("窓を", "開ける") (window, open) is selected, the matching unit 58 performs matching for the word candidate "開ける" in step S106 of FIG. 12. The acoustic model used for "開ける" here is one that depends on the immediately preceding word string hypothesis "窓を". As a result, an acoustic score is obtained for "開ける" that starts at time t1 of FIG. 17 and ends at some later time.

In step S107, the matching unit 58 uses the acoustic score obtained in step S106 unchanged, as an approximation, for the remaining combinations of word string hypothesis and word candidate, and performs no matching for the remaining five combinations. That is, for the combinations ("窓を", "明ける"), ("窓を", "空ける"), ("ドア", "開ける"), ("ドア", "明ける"), and ("ドア", "空ける"), the calculation of the acoustic score is omitted as shown in FIG. 19, and the acoustic score calculated for the combination ("窓を", "開ける") is used instead.

FIG. 19 thus shows an example of the word connection relation information when the word candidates "開ける" (open), "明ける" (dawn), and "空ける" (empty), which belong to the word candidate set "あける" (akeru), and the word candidate "閉める" (close), which belongs to the word candidate set "しめる" (shimeru), are connected to each of node 6 and node 7 of FIG. 17.

In FIG. 19, arc 7, connected to node 6, corresponds to the word candidate "開ける" (open), and its acoustic score A7 and language score L7 are calculated in the conventional manner. Arc 8, connected to node 6, corresponds to the word candidate "明ける" (dawn); its language score L8 is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A7 of "開ける". That is, no acoustic score is calculated for "明ける" on arc 8. Arc 9, connected to node 6, corresponds to the word candidate "空ける" (empty); its language score L9 is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A7 of "開ける". That is, no acoustic score is calculated for "空ける" on arc 9. Arc 10, connected to node 6, corresponds to the word candidate "閉める" (close), and its acoustic score A10 and language score L10 are calculated in the conventional manner.

Arc 11, connected to node 7, corresponds to the word candidate "開ける" (open); its language score L11 is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A7 of "開ける" on arc 7. That is, no acoustic score is calculated for "開ける" on arc 11. Arc 12, connected to node 7, corresponds to the word candidate "明ける" (dawn); its language score L12 is calculated in the conventional manner, but its acoustic score is substituted by A7. That is, no acoustic score is calculated for "明ける" on arc 12. Arc 13, connected to node 7, corresponds to the word candidate "空ける" (empty); its language score L13 is calculated in the conventional manner, but its acoustic score is substituted by A7. That is, no acoustic score is calculated for "空ける" on arc 13. Arc 14 corresponds to the word candidate "閉める" (close); its language score L14 is calculated in the conventional manner, but its acoustic score is substituted by the acoustic score A10 of "閉める" on arc 10. That is, no acoustic score is calculated for "閉める" on arc 14.

The above processing reduces the amount of computation required for speech recognition. In the example above, an acoustic score was conventionally obtained for each of the six combinations ("窓を", "開ける"), ("窓を", "明ける"), ("窓を", "空ける"), ("ドア", "開ける"), ("ドア", "明ける"), and ("ドア", "空ける"). In contrast, the matching unit 58 needs to obtain an acoustic score only for the combination with the highest initial value, ("窓を", "開ける"), so the number of acoustic score calculations is reduced to one sixth.

Conventionally, an acoustic score had to be calculated for every combination of the word string hypotheses "窓を" (window) and "ドア" (door) with the word candidates "開ける" (open), "明ける" (dawn), "空ける" (empty), and "閉める" (close), for a total of eight acoustic score calculations. In contrast, the matching unit 58 needs only two calculations in total: one for the word candidate set "あける" (akeru), which contains "開ける", "明ける", and "空ける", and one for the word candidate set "しめる" (shimeru), which contains "閉める". The number of calculations is therefore reduced by six, that is, the conventional eight minus the two required in the case of FIG. 19.

When a single word string hypothesis precedes the word candidate sets at a given time, as shown in FIG. 14, the number of calculations is reduced by two as described above; when two word string hypotheses precede the sets at the same time, as shown in FIG. 19, the number of calculations is reduced by six. With three or more such word string hypotheses, the reduction relative to the conventional method is larger still. In other words, the more word string hypotheses precede the word candidate sets at the same time, the more calculations are saved, and the amount of computation in speech recognition can be cut substantially. Furthermore, when several word string hypotheses at the same time are selected in step S3 of FIG. 7 and several end point candidates are identified in step S106 of FIG. 12 for the representative word of a word candidate set, the amount of computation can be reduced even further.
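The counts quoted above follow from simple arithmetic, sketched here. The function name is hypothetical, and the sketch counts one matching run per (hypothesis, word) pair for the conventional method and one run per homophone set for the method described here.

```python
def acoustic_score_runs(num_hypotheses, candidate_sets):
    """Return (conventional, proposed) numbers of acoustic score calculations.

    num_hypotheses: word string hypotheses preceding the sets at one time.
    candidate_sets: list of homophone sets (lists of word candidates).
    """
    total_words = sum(len(words) for words in candidate_sets)
    conventional = num_hypotheses * total_words  # every pair is matched
    proposed = len(candidate_sets)               # one representative per set
    return conventional, proposed
```

For the sets ["開ける", "明ける", "空ける"] and ["閉める"], one hypothesis gives (4, 2) and two hypotheses give (8, 2), matching the reductions of two and six stated for FIG. 14 and FIG. 19.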

Returning to FIG. 7, through the processing described above the word connection relation information is updated successively on the basis of the results from the matching unit 58 and is further corrected successively by the re-evaluation unit 59, so the word preliminary selection unit 56 and the matching unit 58 can always carry out their processing using the word connection relation information.

When updating the word connection relation information, the control unit 54 merges end nodes as described above wherever possible.

When it is determined in step S2 of FIG. 7 that no intermediate node exists, processing proceeds to step S9. The control unit 54 refers to the word connection relation information and, for each path it contains, accumulates the word scores to obtain a final score. It then outputs, as the speech recognition result for the user's utterance, the word string corresponding to the arcs of, for example, the path with the largest final score, and ends the processing.
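The final selection in step S9 can be sketched as an exhaustive walk over the paths of the word graph. The sketch assumes an acyclic graph, takes the word score of an arc to be a single number (for instance the sum of its acoustic and language scores), and uses hypothetical names throughout; a practical system would more likely use dynamic programming than path enumeration.

```python
def best_word_string(arcs, start, final_nodes):
    """Return (final score, word list) of the best path in an acyclic graph.

    arcs: dict mapping node -> list of (word, word score, next node).
    start: start node of the graph.
    final_nodes: set of nodes whose time is the end of the speech section.
    """
    best = (float("-inf"), [])
    stack = [(start, 0.0, [])]
    while stack:
        node, score, words = stack.pop()
        if node in final_nodes:
            # A complete path: keep it if its accumulated score is the best.
            if score > best[0]:
                best = (score, words)
            continue
        for word, word_score, next_node in arcs.get(node, []):
            stack.append((next_node, score + word_score, words + [word]))
    return best
```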

As described above, the word preliminary selection unit 56 selects one or more words that follow the words already obtained in a word string that is a candidate for the speech recognition result; the word candidate clustering unit 57 groups homophones into one set; and the matching unit 58 calculates an acoustic score for the representative word candidate of each set and substitutes that score for the acoustic scores of the other word candidates in the same set, whereby word strings that are candidates for the speech recognition result are formed. The re-evaluation unit 59 then corrects the word connection relations between the words of those word strings, and the control unit 54 determines the word string that becomes the speech recognition result on the basis of the corrected word connection relations. Highly accurate speech recognition can therefore be performed while limiting the increase in resources required for processing.

That is, the matching unit 58 does not calculate acoustic scores for all the word candidates selected by the word preliminary selection unit 56; instead, homophones are grouped into one set and the acoustic score is calculated only for the representative word, which reduces the amount of computation. In addition, because the re-evaluation unit 59 corrects the word boundaries in the word connection information, the time information held by the node of interest represents the word boundary more accurately, and the word preliminary selection unit 56 and the matching unit 58 carry out their processing using the feature sequence from the time indicated by that more accurate time information onward. Consequently, even if the criterion by which the word preliminary selection unit 56 selects words, or the criterion by which the matching unit 58 narrows down words, is made stricter, the possibility that a word belonging to the correct speech recognition result is excluded can be kept extremely low.

When the criterion by which the word preliminary selection unit 56 selects words is made stricter, fewer words are subject to matching in the matching unit 58, and as a result the amount of computation and the memory capacity required by the matching unit 58 can also be reduced.

Because the matching unit 58 reuses the acoustic score of the representative word of a word candidate set for the other word candidates in the same set, the relationship with the immediately preceding word is not taken into account and the acoustic score is not necessarily accurate. However, because the re-evaluation unit 59 corrects the word boundaries in the word connection information, the final acoustic score can be corrected to a more accurate value.

Furthermore, suppose that, among the words constituting the word string that is the correct speech recognition result, a word starting at a certain time is not selected by the word preliminary selection unit 56 at that time. Even so, if the word is selected at an erroneous time shifted from that time, the re-evaluation unit 59 corrects the erroneous time, and the word string that is the correct speech recognition result can be obtained. In other words, even if the word preliminary selection unit 56 misses a word that belongs to the correct word string, the re-evaluation unit 59 can remedy the omission so that the correct word string is obtained.

Accordingly, the re-evaluation unit 59 can correct not only errors in the end time detected by the matching unit 58 but also errors in word selection by the word preliminary selection unit 56.

When speech recognition is performed as described above, similar processing may be repeated many times within the re-evaluation unit 59. For example, suppose that word connection relation information corresponding to the graph structure shown in FIG. 20 is stored in the word connection relation storage unit 60. In FIG. 20, arc 1, arc 2, and arc 3, which connect node 1, node 2, node 3, and node 4, form the word string hypothesis "ドアを開ける" ("open the door"). Arc 4 and arc 5, which connect node 1, node 5, and node 6, form the word string hypothesis "窓を" ("the window", followed by the object particle "を"). In FIG. 20, node 6 is an intermediate node. FIG. 21 shows an example in which the subsequent word candidate "開ける" ("open") is connected to node 6 of FIG. 20 by the matching process. In the example of FIG. 21, arc 6 and arc 7, both corresponding to the same word candidate "開ける", extend from node 6 to node 7 and node 8, which have two different end times t1 and t2.

Consider, as in the example of FIG. 21, the case where several nodes with different end times are obtained for the same word candidate, and compare the re-evaluation of the word string from node 5 to node 7 with the re-evaluation of the word string from node 5 to node 8. Both re-evaluations have the same start time, and the word string to be re-evaluated, "を" "開ける", is also the same. The only difference is the end time. In other words, in this example the same word string is re-evaluated from the same start time more than once. In particular, as described above, the end time is not fixed in the matching process of the matching unit 58, and word hypotheses with several different end times are generated, so situations in which the same word string is re-evaluated from the same start time arise frequently.

The speech recognition apparatus of FIG. 22 therefore performs re-evaluation of the same word string from the same start time more efficiently. The apparatus of FIG. 22 is configured by adding a re-evaluation process storage unit 201 to the re-evaluation unit 59 of the speech recognition apparatus of FIG. 5, and parts identical to those of the apparatus of FIG. 5 are given the same reference numerals.

The re-evaluation unit 59 and the re-evaluation process storage unit 201 of the speech recognition apparatus of FIG. 22 are described in detail with reference to FIG. 21. When the matching process starting from node 7 in FIG. 21 is to be performed, a partial word sequence going back in time from node 7 is re-evaluated first. Here, the case of going back two words is described. In this example, the word string "を" "開ける" is re-evaluated. The start time is time t0, corresponding to node 5, and the end time is time t1, corresponding to node 7. To perform the re-evaluation, a state transition model corresponding to the word string "を" "開ける" is first constructed from the acoustic model database 61 and the dictionary database 62. FIG. 23 shows this state transition model. Three-state left-to-right phoneme HMMs are assumed for the acoustic model database 61. Based on the pronunciations in the dictionary database 62, the phoneme HMMs "o", "a", "k", "e", "r", "u" corresponding to the word string are concatenated, yielding a state transition model that has an initial node 251, an end node 253, and a word boundary node 252, with transitions between the nodes. The nodes of the state transition model are drawn as circles, and the arcs that move from left to right between nodes and the self-transition arcs at each node are drawn as solid lines. An output probability density function is assigned to each of the three nodes of every phoneme, and a transition probability is assigned to each arc. In addition, a language score is given at the nodes from which a word is entered, namely the initial node 251 and the word boundary node 252. Using this state transition model, re-evaluation is performed on the feature sequence from the start time t0 to the end time t1.

In the re-evaluation process, an initial score is first given to the initial node 251, and the score at each node is then updated while the feature sequence is consumed in time order. The score update accumulates acoustic and language scores by dynamic programming. When the scores have been updated up to the end time t1, the optimal path is searched for as the state transition sequence from the initial node 251 to the end node 253. Based on the search result, the following are determined: the time t' (not shown) at which the path passes the word boundary node 252, the acoustic score given to "を" between the initial node 251 and the word boundary node 252, and the acoustic score given to "開ける" between the word boundary node 252 and the end node 253. Re-evaluation thus consists of constructing the state transition model, updating the scores on that model, searching for the optimal path, and determining the word boundary time t' and the acoustic scores of "を" and "開ける". The word boundary time t', the acoustic score A5' (not shown) given to "を", and the acoustic score A6' (not shown) given to "開ける" are then sent to the control unit 54, and the nodes and arcs in the word connection relation storage unit 60 that correspond to the re-evaluated word string are corrected.
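
The dynamic-programming update and the optimal path search that fix the word boundary time can be illustrated with a small Viterbi sketch. It assumes a strictly left-to-right chain with self-loops, uses precomputed per-frame log-likelihoods `emis[t][s]`, and leaves out transition probabilities and language scores for brevity; the function name and data layout are illustrative choices, not taken from the patent.

```python
import math

def viterbi_boundary(emis, boundary_state):
    """Find the best left-to-right path and where it enters `boundary_state`.

    emis[t][s] is the log-likelihood of frame t in state s. The path must
    start in state 0 at the first frame and end in the last state at the
    last frame. Returns (total log score, first frame index in boundary_state).
    """
    num_frames, num_states = len(emis), len(emis[0])
    score = [emis[0][0]] + [-math.inf] * (num_states - 1)
    back = []  # back[i][s]: predecessor of state s at frame i + 1
    for t in range(1, num_frames):
        new, bp = [], []
        for s in range(num_states):
            stay = score[s]
            move = score[s - 1] if s > 0 else -math.inf
            bp.append(s - 1 if move > stay else s)
            new.append(max(stay, move) + emis[t][s])
        score = new
        back.append(bp)
    # Trace the optimal path backwards from the final state.
    path = [num_states - 1]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return score[-1], path.index(boundary_state)
```

If state 0 stands for "を" and state 1 for "開ける", the returned frame index plays the role of the word boundary time t'.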

Next, when the matching process starting from node 8 in FIG. 21 is performed, a partial word sequence going back in time from node 8 is re-evaluated, as in the case of node 7. The state transition model constructed for this purpose is identical to the one described above for node 7, and the score update on that model also starts at the same time t0. Therefore, by continuing the score update on the state transition model of FIG. 23 beyond time t1 up to time t2 and then searching for the optimal path, the re-evaluation of the word string between node 5 and node 8 in FIG. 21 is completed, and the corresponding word boundary time and the acoustic score of each word can be determined.

Accordingly, when the re-evaluation of the word string between node 5 and node 7 is finished, the re-evaluation unit 59 of FIG. 22 stores in the re-evaluation process storage unit 201, as re-evaluation process data, the constructed state transition model, the score of each node at time t1, and the score update history needed for the optimal path search. When the word string between node 5 and node 8 in FIG. 21 is re-evaluated, the re-evaluation process data up to time t1 stored in the re-evaluation process storage unit 201 is used, re-evaluation resumes from time t1, and the score update continues up to time t2. As a result, the construction of the state transition model and the score updates from the start time t0 to time t1 are shared.
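
A sketch of how stored process data lets re-evaluation resume at t1 instead of restarting at t0 is shown below. The class name and its fields are hypothetical; the per-state scores and the back-pointer history stand in for the "score of each node" and the "score update history", and transition probabilities and language scores are again omitted.

```python
import math

class ResumableReevaluation:
    """Keeps DP scores and back-pointers so a later end time can resume."""

    def __init__(self, num_states):
        self.num_states = num_states
        self.score = None     # cumulative log score per state at the last frame
        self.back = []        # back-pointer history needed for the path search
        self.frames_done = 0

    def advance(self, emis_frames):
        """Consume further frames; each frame is a list of log-likelihoods."""
        for emis in emis_frames:
            if self.score is None:
                self.score = [emis[0]] + [-math.inf] * (self.num_states - 1)
            else:
                new, bp = [], []
                for s in range(self.num_states):
                    stay = self.score[s]
                    move = self.score[s - 1] if s > 0 else -math.inf
                    bp.append(s - 1 if move > stay else s)
                    new.append(max(stay, move) + emis[s])
                self.score = new
                self.back.append(bp)
            self.frames_done += 1

    def best_path(self):
        """Return (score, state path) for the frames consumed so far."""
        path = [self.num_states - 1]
        for bp in reversed(self.back):
            path.append(bp[path[-1]])
        return self.score[-1], path[::-1]
```

Calling `advance` with the frames up to t1, reading `best_path`, and then calling `advance` again with the frames from t1 to t2 gives the same result as one pass from t0 to t2, while the first segment is computed only once.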

For example, as shown in FIG. 24, when many hypotheses for the same word W2 expanded from the same node 2 are generated with different end times, such as node 3 through node 6, the conventional approach had to perform re-evaluation starting from node 1 separately for each of the word strings of arc 2 through arc 5. In the speech recognition apparatus of FIG. 22 these re-evaluations are shared, so the amount of calculation is equivalent to performing a single re-evaluation of the word string from node 1 to node 6, which has the latest end time. That is, the re-evaluation results for the other end times can be regarded as being obtained partway through the re-evaluation of the word string between node 1 and node 6. The re-evaluation of that word string is suspended partway, its state is stored in the re-evaluation process storage unit 201, and it can be continued as needed.

Because the process can be continued from the state stored in the re-evaluation process storage unit 201, it can also be skipped when it is not needed. For example, when word string hypotheses with poor scores are rejected, a process known as hypothesis pruning, not every word string hypothesis is re-evaluated, so it is important that re-evaluation can be performed only when necessary.

To look up the re-evaluation process data stored in the re-evaluation process storage unit 201, the corresponding word string and time information, or the node and arc information stored in the word connection relation storage unit 60, can readily be used. Also, if matching is continued for the nodes in the word connection relation storage unit 60 that still require it in ascending order of the time each node holds, re-evaluation in the re-evaluation unit 59 is likewise performed in ascending order of time.

Therefore, in FIG. 21 for example, when the word string between node 5 and node 7 is re-evaluated first, the re-evaluation unit 59, upon finishing that re-evaluation, examines the nodes stored in the word connection relation storage unit 60 and searches for the node that will use the re-evaluation process data next. In the example of FIG. 21, this is node 8, which differs only in end time. If such a node is found, the re-evaluation unit 59 stores the re-evaluation process data in the re-evaluation process storage unit 201 and, at the same time, sets up a link from the node that was found. If no such node is found, the re-evaluation will not be shared, so the re-evaluation unit 59 does not store the re-evaluation process data. This not only reduces the storage capacity needed in the re-evaluation process storage unit 201 but also makes the stored data easy to look up through the link. That is, if a link exists, processing continues using the re-evaluation process data the link points to; if there is no link, re-evaluation is performed from the time of the first node.
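
The store-only-if-needed rule and the link can be sketched as below. Nodes are modelled as plain dictionaries with hypothetical keys (`id`, `words`, `start`, `end`, `link`), and choosing the earliest later end time as the next user is an assumption that follows the ascending time order mentioned above.

```python
def find_next_user(nodes, done):
    """Find the node that can reuse the data produced for `done`.

    A node qualifies if it has the same word string and start time as `done`
    but a later end time. The earliest such node is returned, or None.
    """
    later = [n for n in nodes
             if n["words"] == done["words"]
             and n["start"] == done["start"]
             and n["end"] > done["end"]]
    return min(later, key=lambda n: n["end"]) if later else None

def finish_reevaluation(nodes, done, process_data, store):
    """Store process data only when another node will use it, and link it."""
    next_node = find_next_user(nodes, done)
    if next_node is not None:
        store[done["id"]] = process_data   # keep the data for resumption
        next_node["link"] = done["id"]     # link from the node that was found
    return next_node
```

A node whose `link` is set resumes from the stored data; a node without a link is re-evaluated from the start time.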

Doing so further reduces the amount of calculation required for speech recognition.

In a conventional speech recognition apparatus, the matching unit 58 may repeat the matching process for the same word many times with different start times. For example, suppose that in FIG. 20 the word preliminary selection unit 56 determines the word candidate "開ける" with the time corresponding to node 3 as its start time, and also determines the word candidate "開ける" with the time corresponding to node 6 as its start time. In that case, the matching process for the word candidate set "あける" (the shared reading), which contains the same word candidate "開ける", is performed with two different start times, corresponding to node 3 and node 6. Consequently, when the state transition model required for score calculation is constructed dynamically at matching time, the same model is built over and over, which increases the processing load.

FIG. 25 shows a configuration example of a speech recognition apparatus that can avoid this duplicated processing. The apparatus of FIG. 25 is configured by adding a word model storage unit 301 to the matching unit 58 of the speech recognition apparatus of FIG. 5, and parts identical to those of the apparatus of FIG. 5 are given the same reference numerals.

The operation of the speech recognition apparatus of FIG. 25 is described with reference to FIGS. 26 and 27.

FIG. 26 shows an example of the graph structure of the word connection relation information stored in the word connection relation storage unit 60. Consider performing the matching process starting from node 3 in FIG. 26. Suppose first that, with time ta corresponding to node 3 as the start time, the word candidate clustering unit 57 supplies the word candidate set "あける" corresponding to the word candidate "開ける". If the word candidate "開ける" has the highest initial score, a state transition model for that candidate (hereinafter also called a word model) is constructed from the acoustic model database 61 and the dictionary database 62. FIG. 27 shows this word model. Assuming three-state left-to-right phoneme HMMs for the acoustic model database 61, the phoneme HMMs "a", "k", "e", "r", "u" corresponding to the word candidate "開ける" are concatenated based on the pronunciation in the dictionary database 62, yielding a state transition model that has an initial node 351 and an end node 352, with transitions between the nodes. The nodes and arcs of the state transition model are as described for FIG. 23, so their description is omitted.

Using this word model, scores are calculated with time ta as the start time. In the score calculation, an initial score is given to node 351, and the score at each node is updated while the feature sequence from time ta onward is consumed in time order. The score update accumulates acoustic scores by dynamic programming. Because the end time is not fixed, the score of the end node 352 is examined at each time, and when that score is good, for example when it exceeds a certain threshold, a word hypothesis is generated whose acoustic score is the score accumulated from the initial node 351 to the end node 352. Word hypotheses for several end times may be generated for the same word candidate. The generated word hypotheses are sent to the control unit 54 and stored in the word connection relation storage unit 60. The matching unit 58 of FIG. 25 also stores the word model (state transition model) for the word candidate "開ける", constructed for the score calculation, in the word model storage unit 301.

Next, suppose that when the matching process starting from node 5 in FIG. 26 is performed, the word candidate clustering unit 57 supplies the word candidate set "あける" containing the word candidate "開ける", and that "開ける" again has the highest initial score. A conventional speech recognition apparatus would have to construct the word model for "開ける" again for the score calculation. In the apparatus of FIG. 25, by contrast, the matching unit 58 reads the word model for "開ける" from the word model storage unit 301 and uses it. This eliminates the model construction step, so only the score calculation remains.

As described above, the word model constructed during matching is stored in the word model storage unit 301 and reused in later matching, which prevents the increase in processing caused by building the word model for the same word repeatedly. The efficiency gain is especially large when matching for the same word with different start times occurs frequently.
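
The build-once, reuse-later behaviour of a word model store can be sketched as a simple cache. The pronunciation table `PHONEMES`, the representation of a word model as a list of state labels, and the class and function names are all assumptions made for illustration.

```python
# Hypothetical pronunciation dictionary (stands in for the dictionary database).
PHONEMES = {"開ける": ["a", "k", "e", "r", "u"], "を": ["o"]}

def build_word_model(word, states_per_phoneme=3):
    """Concatenate three-state left-to-right phoneme models into state labels."""
    return [f"{p}{i}" for p in PHONEMES[word] for i in range(states_per_phoneme)]

class WordModelStore:
    """Builds a word model on first request and returns the stored one afterwards."""

    def __init__(self):
        self._models = {}
        self.builds = 0  # number of times a model was actually constructed

    def get(self, word):
        if word not in self._models:
            self._models[word] = build_word_model(word)
            self.builds += 1
        return self._models[word]
```

Requesting the model for "開ける" at two different start times triggers one construction; the second request only reads the stored model.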

The word model construction step can also be removed from the matching process by creating word models in advance for all words in the dictionary database 62. In that case, however, word models for all words must be stored, which requires storage means with a large capacity. In the apparatus of FIG. 25, by contrast, a word model is constructed only when it becomes necessary, so the required storage capacity is small. The difference in storage capacity is especially large when acoustic models that depend on the immediately preceding word are used, because word models that depend on the immediately preceding word then have to be constructed.

If constructed word models are simply stored, however, the amount of data held in the word model storage unit 301 keeps growing. This can be kept in check by, for example, erasing the word models from the word model storage unit 301 when speech recognition of one utterance is completed, or erasing them when the number of stored word models reaches a certain number.
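
Both erasure policies can be sketched in a few lines. The text does not say which models are erased when the count limit is reached, so this sketch assumes that all of them are; the class name, the `build` callback, and `max_models` are hypothetical.

```python
class BoundedWordModelStore:
    """Word model store that is emptied per utterance or when it fills up."""

    def __init__(self, build, max_models):
        self.build = build            # hypothetical model-construction callback
        self.max_models = max_models  # assumed limit on stored models
        self.models = {}

    def get(self, word):
        if word not in self.models:
            if len(self.models) >= self.max_models:
                self.models.clear()   # assumption: erase everything at the limit
            self.models[word] = self.build(word)
        return self.models[word]

    def end_utterance(self):
        """Erase all stored models once recognition of an utterance finishes."""
        self.models.clear()
```

A finer-grained policy such as evicting the least recently used model would also fit the description, at the cost of extra bookkeeping.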

If the matching unit 58 completes matching for a word candidate starting at one time before beginning matching for a word candidate starting at another time, a problem arises when the speech signal is processed in real time: the second matching process cannot start until the speech signal has been input up to the time needed to complete the first.

For example, consider performing matching for "開ける" in FIG. 20 starting from node 3 and continuing up to time T. With real-time input, matching starting from node 3 cannot be completed until speech up to time T has been input. Matching starting from node 6, on the other hand, can begin as soon as speech after time t has been input, even if input up to time T is not complete. It is therefore desirable to be able to start matching from node 6 even when matching from node 3 has not finished.

FIG. 28 shows a configuration example of a speech recognition apparatus that can process input speech in real time. The apparatus of FIG. 28 is configured by adding a matching process storage unit 401 to the matching unit 58 of the speech recognition apparatus of FIG. 5, and parts identical to those of the apparatus of FIG. 5 are given the same reference numerals.

The operation of the speech recognition apparatus of FIG. 28 is described with reference to FIG. 26. Consider performing matching starting from node 3 in FIG. 26. Suppose that, with time ta corresponding to node 3 as the start time, the word candidate clustering unit 57 supplies a word candidate set to the matching unit 58. As in the apparatus of FIG. 5, matching then consists of constructing a word model, updating the scores on that word model, and generating word hypotheses.

When input speech is processed in real time, however, the speech after time ta is not necessarily fully available when matching starting at time ta is attempted. For example, if matching needs to run up to time T but only speech up to time tc has been input, matching cannot be completed. The matching unit 58 of FIG. 28 therefore updates the scores on the word model using only the feature sequence up to time tc, and stores the constructed word model, the score of each state of that model, and the score update history in the matching process storage unit 401 as matching process data.

The matching unit 58 then performs matching starting from node 5 in FIG. 26. With time tb corresponding to node 5 as the start time, the word candidate clustering unit 57 supplies a word candidate set. For the word candidate with the highest initial score in that set, a word model is constructed, the scores on the word model are updated, and word hypotheses are generated, just as for node 3 in FIG. 26.

Speech continues to be input during this processing, so the end of the input speech moves to the right of tc in the figure, that is, to later times. Suppose the end of the input speech has now moved to time td. If matching for the word connected to node 5 also needs to run up to time T, it cannot be completed at time td, so the matching unit 58 stores the constructed word model, the score of each state of that model, and the score update history in the matching process storage unit 401 as matching process data. Meanwhile, matching starting from node 3 has only been carried out up to time tc, so it can now be advanced to time td. The matching unit 58 therefore uses the matching process data stored in the matching process storage unit 401 to resume matching starting from node 3 from time tc. In this way, matching for the portion following node 3 and for the portion following node 5 is advanced alternately as speech is input.
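
The alternating progress of suspended matching processes can be simulated as follows. This is a simplified sketch: each task advances whenever new frames become available, the score update is replaced by a plain sum over per-frame values, and the names `MatchingTask`, `run_interleaved`, and `arrivals` are illustrative assumptions.

```python
class MatchingTask:
    """One matching process that can be suspended and resumed."""

    def __init__(self, name, start, end):
        self.name = name
        self.next_frame = start  # first frame not yet consumed
        self.end = end           # frame index (exclusive) needed to finish
        self.score = 0.0         # stand-in for the stored per-state scores

    def advance(self, features, available):
        """Consume frames up to `available` (exclusive); return True when done."""
        stop = min(available, self.end)
        for frame in range(self.next_frame, stop):
            self.score += features[frame]  # placeholder for the DP update
        self.next_frame = max(self.next_frame, stop)
        return self.next_frame >= self.end

def run_interleaved(tasks, features, arrivals):
    """Advance every pending task each time more input frames have arrived."""
    log = []
    pending = list(tasks)
    for available in arrivals:
        for task in list(pending):
            before = task.next_frame
            done = task.advance(features, available)
            if task.next_frame > before:
                log.append((task.name, before, task.next_frame))
            if done:
                pending.remove(task)
    return log
```

The returned log shows the two tasks taking turns over successive stretches of input, mirroring the alternation between the portions following node 3 and node 5.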

As described above, matching is advanced as far as the input speech allows, then suspended and stored so that it can be continued when needed, which speeds up real-time processing of input speech.

In the above description, the speech recognition apparatuses of FIGS. 22, 25, and 28 were described independently. However, a speech recognition apparatus may also have any two, or all three, of the re-evaluation process storage unit 201, the word model storage unit 301, and the matching process storage unit 401 shown in FIGS. 22, 25, and 28, respectively.

For a given word, the state transition model used in the re-evaluation process and the word model (state transition model) used in the matching process may be identical. Since the state transition model itself does not depend on time, it may be shared in such cases. That is, the word models stored in the word model storage unit 301 of FIG. 25 may be made available to the re-evaluation unit 59. This makes the construction of state transition models in the re-evaluation unit 59 more efficient.

As described above, the word candidate clustering unit 57 clusters the word candidates determined by the word preliminary selection unit 56, and the matching process is performed only once for each resulting word candidate set, so the amount of computation required for matching can be greatly reduced. The matching process also does not depend on the number of immediately preceding word string hypotheses. In addition, the re-evaluation unit 59 applies acoustic models that depend on the immediately preceding word string hypothesis and corrects the word boundaries and acoustic scores determined by the matching process, so the reduction in matching causes almost no degradation in recognition accuracy. In other words, processing can be sped up without lowering speech recognition accuracy.
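The idea of matching once per word candidate set can be illustrated with a short sketch. It assumes candidates arrive as (word, pronunciation, initial score) triples; the function names, the example words, and `toy_acoustic_score` are hypothetical stand-ins, and a real acoustic score would come from matching a word model against the feature sequence.

```python
from collections import defaultdict


def cluster_by_pronunciation(candidates):
    """Group (word, pronunciation, initial_score) triples into homophone sets."""
    sets = defaultdict(list)
    for word, pron, init_score in candidates:
        sets[pron].append((word, init_score))
    return dict(sets)


def match_once_per_set(word_sets, acoustic_score):
    """Score only the best-initial-score member of each set; share the result."""
    scores, representatives = {}, {}
    for pron, members in word_sets.items():
        rep, _ = max(members, key=lambda m: m[1])  # highest initial score
        shared = acoustic_score(rep, pron)         # one matching run per set
        representatives[pron] = rep
        for word, _ in members:
            scores[word] = shared                  # reused for every homophone
    return scores, representatives


def toy_acoustic_score(word, pron):
    """Placeholder score; NOT a real acoustic model."""
    return -float(len(pron))


candidates = [
    ("橋", "hashi", -12.0), ("箸", "hashi", -10.0), ("端", "hashi", -15.0),
    ("雨", "ame", -9.0), ("飴", "ame", -11.0),
]
word_sets = cluster_by_pronunciation(candidates)
scores, representatives = match_once_per_set(word_sets, toy_acoustic_score)
```

In this toy input, five candidates fall into two sets, so the scoring function runs twice rather than five times.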

In fact, when the method of this embodiment was applied to large-vocabulary continuous speech recognition with a vocabulary of 60,000 words, the amount of processing required for matching was confirmed to fall to one tenth or less. In this experiment, the partial word string hypotheses considered at each time were limited to about 30, and word preliminary selection extracted about 1,000 word candidates to be expanded from each word string hypothesis. Results with and without the method of this embodiment were then compared.

Conventionally, at a given time, assuming 1,000 word candidates and about 30 partial word string hypotheses, the matching process would be performed 30,000 times. With the method of this embodiment, the number of matching runs no longer depends on the number of word string hypotheses, so it is held to 1,000 or fewer. Matching runs for identical pronunciations are also eliminated, and for Japanese, which has many homophones, a further reduction of several percent to about 10 percent is expected. That is, the amount of processing is expected to fall to one thirtieth or less. However, because hypothesis pruning known as beam search was already applied during matching, the amount of processing for each individual matching run was limited to some extent to begin with, which is presumably why the reduction observed in this experiment was to one tenth or less.
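The counts in this estimate can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the text (30 hypotheses, 1,000 candidates) and takes 10 percent as the homophone reduction, which is the upper end of the stated range; the function and variable names are made up for illustration.

```python
from fractions import Fraction


def matching_counts(num_hypotheses, num_candidates, homophone_reduction_percent):
    """Return matching-run counts for the three cases discussed in the text."""
    conventional = num_hypotheses * num_candidates  # one run per hypothesis/word pair
    shared = num_candidates                         # independent of hypothesis count
    clustered = shared * (100 - homophone_reduction_percent) // 100
    return conventional, shared, clustered


conventional, shared, clustered = matching_counts(30, 1000, 10)
ratio_shared = Fraction(shared, conventional)        # 1/30
ratio_clustered = Fraction(clustered, conventional)  # 3/100, below 1/30
```

These are counts of matching runs only; as the text notes, beam search pruning changes the cost of each run, so measured savings can differ from this ratio.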

Further, as in the speech recognition apparatus of FIG. 22, the apparatus may include means for sharing re-evaluation processing among re-evaluations of the same word string that have the same start time and differ only in end time, namely means for temporarily suspending and storing the re-evaluation process and means for continuing the process from the stored state. This makes the processing required for re-evaluation more efficient.

Also, as in the speech recognition apparatus of FIG. 25, the apparatus may include means for storing the word models constructed for score calculation in the matching process, and reuse a stored word model when the score for the same word needs to be calculated again. This allows the processing required for word model construction to be shared (made more efficient).
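Reusing stored word models amounts to a cache keyed by word. The following sketch is illustrative only: `WordModelStore`, `build_word_model`, the toy lexicon, and the choice of three states per phoneme are assumptions for the example and are not taken from the description.

```python
# Toy pronunciation lexicon (hypothetical entries).
TOY_LEXICON = {"ame": ["a", "m", "e"], "hashi": ["h", "a", "sh", "i"]}


def build_word_model(word):
    """Build a left-to-right state list, three states per phoneme (assumed)."""
    return [f"{p}_{k}" for p in TOY_LEXICON[word] for k in range(3)]


class WordModelStore:
    """Keeps constructed word models so each word is built at most once."""

    def __init__(self, build_model):
        self._build = build_model
        self._models = {}
        self.builds = 0  # number of times a model was actually constructed

    def get(self, word):
        if word not in self._models:
            self._models[word] = self._build(word)
            self.builds += 1
        return self._models[word]


store = WordModelStore(build_word_model)
first = store.get("ame")
second = store.get("ame")   # served from the store, no rebuild
store.get("hashi")
```

Because the state transition model does not depend on time, the same stored object can be handed to any later matching run for that word.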

Furthermore, as in the speech recognition apparatus of FIG. 28, providing means for temporarily suspending and storing the matching process and means for continuing the process from the stored state allows matching to proceed wherever it can for a speech signal input in real time, so computational resources can be used effectively.

As a result, speech recognition processing is sped up.

Here, the acoustic model database 61, the dictionary database 62, and the grammar database 63 were described as being shared by the word preliminary selection unit 56, the matching unit 58, and the re-evaluation unit 59, but each unit may use a different acoustic model database, dictionary database, and grammar database. For example, as the grammar, the word preliminary selection unit 56 may use unigrams, the matching unit 58 bigrams, and the re-evaluation unit 59 trigrams.

In the above description, the word candidate clustering unit 57 clusters word candidates with the same pronunciation (homophones), that is, classifies them based on the attribute "words having the same pronunciation". However, even when pronunciations are not exactly the same, a distance measure expressing the acoustic similarity between words may be defined, and word candidates whose distance is smaller than a set threshold, that is, candidates with high acoustic similarity, may be clustered (classified based on the attribute "acoustically similar words"). This can reduce the amount of matching processing further.
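Threshold-based clustering of acoustically similar words might look like the sketch below. The description does not specify the distance measure or the clustering procedure, so both are assumptions here: edit distance over phoneme sequences serves as a stand-in for acoustic distance, and a simple greedy pass assigns each candidate to the first cluster whose representative is closer than the threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def cluster_similar(candidates, threshold):
    """Greedy clustering: join the first cluster whose representative is near."""
    clusters = []  # list of (representative_pronunciation, [words])
    for word, pron in candidates:
        for rep, members in clusters:
            if edit_distance(rep, pron) < threshold:
                members.append(word)
                break
        else:
            clusters.append((pron, [word]))
    return clusters


# Hypothetical candidates: (word, phoneme sequence).
candidates = [("kaki", tuple("kaki")), ("kagi", tuple("kagi")),
              ("sora", tuple("sora"))]
loose = cluster_similar(candidates, threshold=2)   # near-homophones merge
strict = cluster_similar(candidates, threshold=1)  # only exact matches merge
```

With a threshold of 1 the procedure reduces to the homophone clustering described earlier, since only a distance of zero passes the test.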

The present invention can be applied to, for example, home or arcade game machines, mobile phones, portable terminal devices, and all other kinds of electrical appliances.

The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the programs constituting the software are installed from a recording medium or the like onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.

FIG. 29 shows an example of the internal configuration of a personal computer 500 that executes such processing. A CPU (Central Processing Unit) 501 of the personal computer executes various processes according to programs stored in a ROM (Read Only Memory) 502. A RAM (Random Access Memory) 503 stores, as appropriate, the data and programs that the CPU 501 needs to execute the various processes. An input unit 506 comprising a mouse, a keyboard, a microphone, an AD converter, and the like is connected to an input/output interface 505, which outputs signals input to the input unit 506 to the CPU 501. An output unit 507 comprising a display, a speaker, a DA converter, and the like is also connected to the input/output interface 505.

A storage unit 508 comprising a hard disk or the like, and a communication unit 509 that exchanges data with other devices via a network such as the Internet, are also connected to the input/output interface 505. A drive 510 is used to read data from and write data to recording media such as a magnetic disk 521, an optical disk 522, a magneto-optical disk 523, and a semiconductor memory 524.

As shown in FIG. 29, the program storage medium that stores a program to be installed on a computer and made executable by the computer consists of package media such as the magnetic disk 521 (including flexible disks), the optical disk 522 (including CD-ROM (Compact Disk-Read Only Memory) and DVD (Digital Versatile Disk)), the magneto-optical disk 523 (including MD (Mini-Disk)), or the semiconductor memory 524, or of the ROM 502 in which the program is stored temporarily or permanently, the hard disk constituting the storage unit 508, and the like. Programs are stored on the program storage medium, as needed, via an interface such as a router or modem, using wired or wireless communication media such as a local area network, the Internet, or digital satellite broadcasting.

In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the described order, but also processing that is executed in parallel or individually without necessarily being performed in time series.

In this specification, a system refers to an entire apparatus composed of a plurality of devices.

FIG. 1 is a block diagram showing a configuration example of a conventional speech recognition apparatus.
FIG. 2 is a diagram explaining the bundle search method.
FIG. 3 is a diagram showing an example of word connection relation information in a graph structure.
FIG. 4 is a diagram showing an example of word connection relation information in a graph structure.
FIG. 5 is a block diagram showing a configuration example of a speech recognition apparatus to which the present invention is applied.
FIG. 6 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 7 is a flowchart explaining the speech recognition processing of the speech recognition apparatus of FIG. 5.
FIG. 8 is a diagram explaining the re-evaluation process.
FIG. 9 is a diagram, continued from FIG. 8, explaining the re-evaluation process.
FIG. 10 is a diagram, continued from FIG. 9, explaining the re-evaluation process.
FIG. 11 is a diagram, continued from FIG. 10, explaining the re-evaluation process.
FIG. 12 is a flowchart explaining the processing of step S7 in FIG. 7 in detail.
FIG. 13 is a diagram explaining an example of combinations of initial score values.
FIG. 14 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 15 is a diagram showing an example of a word model (state transition model).
FIG. 16 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 17 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 18 is a diagram explaining an example of combinations of initial score values.
FIG. 19 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 20 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 21 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 22 is a block diagram showing another configuration example of a speech recognition apparatus to which the present invention is applied.
FIG. 23 is a diagram showing an example of a word model (state transition model).
FIG. 24 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 25 is a block diagram showing another configuration example of a speech recognition apparatus to which the present invention is applied.
FIG. 26 is a diagram showing an example of word connection relation information stored in the word connection relation storage unit.
FIG. 27 is a diagram showing an example of a word model (state transition model).
FIG. 28 is a block diagram showing another configuration example of a speech recognition apparatus to which the present invention is applied.
FIG. 29 is a block diagram showing a configuration example of a personal computer to which the present invention is applied.

Explanation of symbols

51 microphone, 52 AD conversion unit, 53 feature extraction unit, 54 control unit, 55 feature quantity storage unit, 56 word preliminary selection unit, 57 word candidate clustering unit, 58 matching unit, 59 re-evaluation unit, 60 word connection relation storage unit, 61 acoustic model database (DB), 62 dictionary database (DB), 63 grammar database (DB), 201 re-evaluation process storage unit, 301 word model storage unit, 401 matching process storage unit

Claims

A speech recognition apparatus that determines a word string corresponding to input speech, comprising:
matching process execution means for executing a matching process in which part of the processing is shared within each word group, the word groups being obtained by classifying one or more candidate words that follow the already obtained words of the word string according to a shared attribute; and
correction means for correcting word connection relation information that is generated based on the result of the matching process by the matching process execution means and that indicates the connection relations between words in the word strings that are candidates for the speech recognition result.

The speech recognition apparatus according to claim 1, further comprising classification means for classifying the one or more candidate words that follow the already obtained words of the word string into the word groups according to a shared attribute,
wherein the matching process execution means shares part of the processing within each word group classified by the classification means.

The speech recognition apparatus according to claim 2, further comprising selection means for selecting one or more candidate words that follow the already obtained words of the word string,
wherein the classification means classifies the one or more candidate words selected by the selection means into the word groups according to a shared attribute.

The speech recognition apparatus according to claim 1, wherein one or more word candidates subsequent to the already obtained word are classified into the word group for each word having the same pronunciation.

The speech recognition apparatus according to claim 1, wherein one or more word candidates following the already obtained word are classified into the word group for each of the acoustically similar words.

The speech recognition apparatus according to claim 1, further comprising storage means for storing the word connection relation information generated based on the result of the matching process by the matching process execution means,
wherein the correction means corrects the word connection relation information stored by the storage means.

The speech recognition apparatus further comprising storage means for storing a state transition model when the correction means constructs, in order to correct the word connection relation information, a state transition model for calculating an acoustic score of a partial word string,
wherein the correction means uses the state transition model stored by the storage means when calculating the acoustic score of the same partial word string again.

The speech recognition apparatus according to claim 1, further comprising storage means for storing a state transition model when the matching process execution means constructs a state transition model for calculating an acoustic score of a word,
wherein the matching process execution means uses the state transition model stored by the storage means when calculating the acoustic score of the same word again.

The speech recognition apparatus according to claim 1, further comprising storage means for storing values calculated by the matching process execution means as intermediate results of the matching process,
wherein the matching process execution means alternately executes the matching processes for a plurality of word strings that are candidates for the speech recognition result, using the intermediate results stored by the storage means.

A speech recognition method for a speech recognition apparatus that determines a word string corresponding to input speech, comprising:
a matching process execution step of executing a matching process in which part of the processing is shared within each word group, the word groups being obtained by classifying one or more candidate words that follow the already obtained words of the word string according to a shared attribute; and
a correction step of correcting word connection relation information that is generated based on the result of the matching process in the matching process execution step and that indicates the connection relations between words in the word strings that are candidates for the speech recognition result.

A recording medium on which a computer-readable program for a speech recognition apparatus that determines a word string corresponding to input speech is recorded, the program comprising:
a matching process execution step of executing a matching process in which part of the processing is shared within each word group, the word groups being obtained by classifying one or more candidate words that follow the already obtained words of the word string according to a shared attribute; and
a correction step of correcting word connection relation information that is generated based on the result of the matching process in the matching process execution step and that indicates the connection relations between words in the word strings that are candidates for the speech recognition result.

A program causing a computer that controls processing for determining a word string corresponding to input speech to execute:
a matching process execution step of executing a matching process in which part of the processing is shared within each word group, the word groups being obtained by classifying one or more candidate words that follow the already obtained words of the word string according to a shared attribute; and
a correction step of correcting word connection relation information that is generated based on the result of the matching process in the matching process execution step and that indicates the connection relations between words in the word strings that are candidates for the speech recognition result.