JPH09292899A

JPH09292899A - Voice recognizing device

Info

Publication number: JPH09292899A
Application number: JP8130956A
Authority: JP
Inventors: Ryosuke Isotani; 亮輔磯谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-04-26
Filing date: 1996-04-26
Publication date: 1997-11-11
Anticipated expiration: 2016-04-26
Also published as: JP3171107B2

Abstract

PROBLEM TO BE SOLVED: To enable recognition that exploits syllable durations by selecting from, or correcting, recognition result candidates on the basis of the average syllable length of the speech to be recognized and the duration of each syllable. SOLUTION: The input speech is also fed to a syllable boundary detection unit 104, which obtains candidate syllable boundary positions from information such as power and outputs them to an average syllable length estimation unit 105. The average syllable length estimation unit 105 exploits the fact that syllable durations are nearly constant to estimate the average syllable length of the input speech from the syllable boundary candidates. For each candidate output by a speech recognition unit 101, a candidate selection unit 106 checks whether the duration of every syllable lies between a minimum duration and a maximum duration determined from the average syllable length obtained by the average syllable length estimation unit 105. A candidate containing even one syllable whose duration falls outside this range is rejected; when a candidate whose syllables all have durations within the range is found, it is output as the recognition result.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that uses the durations of acoustic units such as syllables.

[0002]

[Description of the Related Art] Speech recognition apparatuses based on DP (dynamic programming) matching or on hidden Markov models (HMMs) have been put to practical use. Briefly, DP matching performs time-axis normalization: when computing the spectral distance between a reference pattern and the input speech over the whole speech interval, it stretches and compresses the time axis so that identical phonemes in the two are aligned. An HMM-based method builds, for each word or each phoneme, a transition diagram (Markov model) consisting of a small number of states, and recognizes the input by finding which model is most likely to have generated it. The model is called "hidden" because only the spectral sequence produced by the transitions is observed, not the states themselves. For each model, the occurrence probabilities of the spectral parameters in each state and the transition probabilities between states are estimated in advance from training samples. At recognition time, the input speech is applied to each model, and the model with the highest probability of generating the input speech is selected as the recognition result.
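As an illustration of the time-axis normalization performed by DP matching, the following minimal Python sketch computes the accumulated distance between two sequences under the best monotonic alignment. The function name dtw_distance, the use of scalar features, and the absolute-difference local distance are assumptions made for brevity; the source does not specify them.

```python
def dtw_distance(reference, observed):
    """Accumulated distance between two 1-D sequences under the best time warping."""
    inf = float("inf")
    rows, cols = len(reference), len(observed)
    # cost[i][j] = best accumulated distance aligning the first i reference
    # frames with the first j observed frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(reference[i - 1] - observed[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # reference frame repeated
                                     cost[i][j - 1],      # observed frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]


# A slower rendition of the same pattern aligns with zero distance.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))
```

Because the warping is unconstrained here, a stretched utterance matches as well as an unstretched one, which is exactly the behavior that duration constraints are meant to limit.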

[0003] When a speech recognition apparatus using DP matching or hidden Markov models matches the input pattern against reference patterns with subwords such as syllables as the recognition units, recognition performance can be improved by using information on the duration of each recognition unit. For example, a widely known method sets the maximum and minimum durations of each phoneme in advance and uses them to constrain the matching interval.

[0004] A conventional technique of this kind is described below with reference to FIG. 7.

[0005] Referring to FIG. 7, the input speech is input to a speech recognition unit 701 and converted into a time series of feature vectors. For each word in a word dictionary 702, the speech recognition unit 701 builds a word reference pattern by concatenating phoneme reference patterns 703 according to the reading written in the dictionary 702, and matches it against the input speech, converted into a time series of feature vectors, using DP matching. During matching, the maximum and minimum durations of each phoneme are read from phoneme duration information 704, and any alignment in which a phoneme's duration falls outside that range is prohibited. The word with the smallest distance after matching is output as the recognition result.

[0006] The maximum and minimum phoneme durations are obtained in advance from training data annotated with phoneme labels. This eliminates inappropriate alignments caused by extreme stretching or compression of the time axis. It also makes it easier to distinguish words that differ mainly in phoneme duration, such as "obaasan" (おばあさん, grandmother) and "obasan" (おばさん, aunt), so recognition performance improves.

[0007]

[Problems to Be Solved by the Invention] The conventional method described above is effective when speech is uttered at a particular fixed rate, but the maximum and minimum durations of phonemes or syllables vary with speaking rate. Because the conventional method uses fixed, predetermined values for the maximum and minimum phoneme durations, it is vulnerable to variations in speaking rate; that is, its recognition performance is easily affected by such variations.

[0008] The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition apparatus that achieves high recognition performance by using durations and is also robust to variations in speaking rate.

[0009]

[Means for Solving the Problems] To achieve this object, the speech recognition apparatus according to the present invention obtains an average syllable length from the input speech to be recognized and, based on that average syllable length and the duration of each syllable of the recognition result candidates, selects from the recognition result candidates or corrects the recognition result.

[0010] A speech recognition apparatus according to the present invention includes: speech recognition means for recognizing input speech and outputting a plurality of recognition result candidates with information on the duration of each syllable attached; syllable boundary candidate detection means for obtaining syllable boundary candidates from the input speech; average syllable length estimation means for obtaining an average syllable length from the syllable boundary candidates; and candidate selection means for selecting a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.

[0011] Another speech recognition apparatus according to the present invention includes: speech recognition means for recognizing input speech and outputting a plurality of recognition result candidates with either a per-syllable segmentation or information on the duration of each syllable attached; average syllable length estimation means for obtaining an average syllable length for each of the recognition result candidates; and candidate selection means for selecting a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.

[0012] According to the present invention, because recognition result candidates are selected or corrected using the average syllable length, candidates produced by inappropriate alignment due to extreme stretching or compression of the time axis are rejected or corrected, which improves recognition performance. Moreover, because the average syllable length is obtained from the input speech itself, the apparatus is less susceptible to variations in speaking rate.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0014] A first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the first embodiment.

[0015] The input speech is input to a speech recognition unit 101 and converted into a time series of feature vectors. For each word in a word dictionary 102, the speech recognition unit 101 builds a word reference pattern by concatenating syllable reference patterns 103 according to the reading written in the word dictionary 102, matches it against the input speech converted into a time series of feature vectors, and obtains a matching score and a per-syllable segmentation. DP matching, for example, can be used for the matching. Alternatively, the syllable reference patterns can be represented by hidden Markov models and the Viterbi algorithm used for matching.

[0016] The unit then selects a plurality of recognition result candidates based on the matching scores and outputs them to a candidate selection unit 106, together with their matching scores and the duration of each syllable. The duration of each syllable is easily computed from the per-syllable segmentation.

[0017] FIG. 2 shows an example of the recognition result candidates obtained when "obaasan" (おばあさん) is uttered. The speech recognition unit 101 can also perform continuous speech recognition of sentences or phrases by using syntax rules together with the word dictionary.

[0018] Referring again to FIG. 1, the input speech is also input to a syllable boundary detection unit 104, which uses information such as power to obtain candidate positions of syllable boundaries and outputs them to an average syllable length estimation unit 105.

[0019] FIG. 3 shows an example of syllable boundary candidates. The candidates may include spurious syllable boundaries or may miss correct ones. The average syllable length estimation unit 105 (see FIG. 1) nevertheless estimates the average syllable length of the input speech from the syllable boundary candidates by exploiting the fact that syllable durations are nearly constant.

[0020] If the interval between adjacent syllable boundary candidates is regarded as a syllable interval candidate, a candidate that is extremely long compared with the other syllable interval candidates is considered to actually contain several syllables, and a candidate that is extremely short is considered to form one syllable together with an adjacent syllable interval candidate.

[0021] In the example of FIG. 3, the interval between the second and third syllable boundary candidates (240 to 540 msec) is considered to actually contain two syllables. Taking this into account, the average syllable length in the example of FIG. 3 is estimated to be about 150 msec. Concretely, the average syllable length is estimated, for example, by the method described below.

[0022] Hypothesizing a syllable length, a number of syllables, and the start point of the first syllable determines all syllable boundaries. These hypothesized boundaries are compared with the syllable boundary candidates output by the syllable boundary detection unit 104, and their similarity is computed.

[0023] The similarity is computed using preset penalties for insertions, omissions, and positional shifts of syllable boundary candidates.

[0024] The syllable length, the number of syllables, and the start point of the first syllable are varied to find the hypothesis with the highest similarity, and the syllable length of that hypothesis is taken as the average syllable length. The estimation can also make use of the average syllable length estimated for earlier utterances, by adding the difference between that earlier average and the average syllable length hypothesized for the current utterance as a penalty when computing the similarity.
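The search over syllable length, syllable count, and start point can be sketched in Python as follows. This is only one possible reading of the procedure: the penalty values, the 40 msec matching tolerance, the grid ranges, the greedy nearest-boundary matching, and all function and parameter names are assumptions, and similarity is expressed as a cost to be minimized. Of the sample boundaries, only 240 and 540 msec appear in the text; the others are invented for the illustration.

```python
def boundary_cost(hypothesis, detected, tolerance=40, insertion=50.0,
                  dropout=50.0, shift_weight=1.0):
    """Penalty for the mismatch between hypothesised and detected boundaries (msec)."""
    unused = sorted(detected)
    cost = 0.0
    for h in hypothesis:
        near = [d for d in unused if abs(d - h) <= tolerance]
        if near:
            best = min(near, key=lambda d: abs(d - h))
            unused.remove(best)
            cost += shift_weight * abs(best - h)   # positional shift
        else:
            cost += dropout                        # detector missed this boundary
    return cost + insertion * len(unused)          # spurious detected boundaries


def estimate_average_syllable_length(detected, previous_length=None,
                                     history_weight=0.1):
    """Grid search over (syllable length, syllable count, start point)."""
    best = None
    for length in range(100, 310, 10):
        for count in range(1, 11):
            for start in range(0, 310, 10):
                hypothesis = [start + k * length for k in range(count + 1)]
                cost = boundary_cost(hypothesis, detected)
                if previous_length is not None:
                    # Optional penalty for drifting from earlier utterances.
                    cost += history_weight * abs(length - previous_length)
                if best is None or cost < best[0]:
                    best = (cost, length, count, start)
    return best


# Only the 240 and 540 msec boundaries come from the text; the rest are made up.
detected = [90, 240, 540, 690, 840]
cost, length, count, start = estimate_average_syllable_length(detected)
print(length, count, start)
```

With these invented boundaries the lowest-cost hypothesis has a 150 msec syllable length: the missing boundary inside the 240 to 540 msec interval costs one omission penalty, which is cheaper than any hypothesis that leaves detected boundaries unexplained.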

[0025] The candidate selection unit 106 examines the candidates output by the speech recognition unit 101 in order of matching score and checks whether the duration of each syllable lies between the minimum duration and the maximum duration determined from the average syllable length obtained by the average syllable length estimation unit 105.

[0026] If a candidate contains even one syllable whose duration is outside this range, the candidate is rejected.

[0027] When a candidate is found in which the durations of all syllables lie within the range, it is output as the recognition result and selection ends.

[0028] The minimum and maximum durations are set, for example, so that their difference from, or ratio to, the average syllable length is a predetermined value. A single difference or ratio can be shared by all syllables, or, for greater precision, a different value can be used for each syllable type.

[0029] Suppose, for example, that for some input speech the recognition result candidates shown in FIG. 2 are obtained and the average syllable length is estimated to be 150 msec. As shown in FIG. 2, the syllables and their durations are stored for each recognition result candidate. If the minimum and maximum durations are set to within 30 msec of the average syllable length, the minimum duration is 120 msec and the maximum duration is 180 msec.

[0030] Of the recognition result candidates shown in FIG. 2, the first candidate is rejected because its second syllable (310 msec) exceeds the maximum duration (180 msec), and the second candidate is selected and output as the recognition result.
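The selection rule can be sketched in Python as below. FIG. 2 is not reproduced in this text, so the candidate readings and every duration other than the 310 msec second syllable are hypothetical stand-ins. The function names, the romanized syllable labels, and the default ratio of 1.5 are likewise assumptions.

```python
def duration_bounds(average_length, margin=None, ratio=1.5):
    """Return (minimum, maximum) duration from a difference or, failing that, a ratio."""
    if margin is not None:
        return average_length - margin, average_length + margin
    return average_length / ratio, average_length * ratio


def select_candidate(candidates, average_length, margin=30):
    """Return the first candidate whose syllable durations all lie in range."""
    shortest, longest = duration_bounds(average_length, margin=margin)
    for syllables, durations in candidates:      # assumed sorted by matching score
        if all(shortest <= d <= longest for d in durations):
            return syllables
    return None                                  # every candidate was rejected


# Hypothetical stand-in for FIG. 2; only the 310 msec value is from the text.
candidates = [
    (["o", "ba", "sa", "n"], [140, 310, 150, 150]),
    (["o", "ba", "a", "sa", "n"], [140, 160, 150, 150, 150]),
]
print(duration_bounds(150, margin=30))   # (120, 180)
print(select_candidate(candidates, 150))
```

Returning None when every candidate is rejected is a choice made for the sketch; the text does not say what the apparatus outputs in that case.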

[0031] The candidate selection unit 106 can also be configured as follows. For each candidate output by the speech recognition unit 101, a duration score is computed for each syllable from its duration and the average syllable length obtained by the average syllable length estimation unit 105.

[0032] An overall score for each candidate is computed from these duration scores and the matching score obtained by the speech recognition unit 101, and the candidate with the highest overall score is output as the recognition result. The duration score can be obtained, for example, by defining a distribution of syllable durations based on the average syllable length and computing the likelihood of each syllable's duration under that distribution. The duration distribution can be determined, for example, by assuming a Gaussian distribution whose variance is set in advance for each syllable type and whose mean is the average syllable length.
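A minimal Python sketch of the score-based variant follows. It assumes that the matching score behaves like a log-likelihood (higher is better), that the duration score is a Gaussian log-likelihood, and that a single standard deviation of 30 msec is shared by all syllable types, whereas the text allows a separate variance per type. The weight, the sample scores, and the durations are hypothetical.

```python
import math


def duration_log_likelihood(durations, average_length, sigma=30.0):
    """Sum of Gaussian log-likelihoods of the syllable durations (msec)."""
    total = 0.0
    for d in durations:
        z = (d - average_length) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total


def total_score(matching_score, durations, average_length, weight=1.0):
    """Combine the matching score with the weighted duration score."""
    return matching_score + weight * duration_log_likelihood(durations, average_length)


# Hypothetical candidates: (matching score, syllable durations in msec).
first = total_score(-100.0, [140, 310, 150, 150], 150)
second = total_score(-102.0, [140, 160, 150, 150, 150], 150)
print(second > first)
```

In this example the 310 msec syllable costs the first candidate enough that the second candidate wins on overall score despite its slightly worse matching score. Unlike the hard range check, this variant degrades gradually rather than rejecting candidates outright.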

[0033] Next, a second embodiment of the present invention is described. FIG. 4 is a block diagram showing the configuration of the second embodiment.

[0034] Referring to FIG. 4, the input speech is input to a speech recognition unit 401, which performs matching in the same way as in the first embodiment and outputs a plurality of recognition result candidates, each with its matching score and per-syllable segmentation, to a candidate selection unit 405 and an average syllable length estimation unit 404, one at a time in order of matching score.

[0035] For each recognition result candidate, the average syllable length estimation unit 404 estimates the average syllable length of the input speech by treating the syllable boundaries given by that candidate's segmentation in the same way as the syllable boundary candidates of the first embodiment.

[0036] Because the average syllable length can be estimated from the relative positions of the syllable boundaries alone, the duration of each syllable may be attached to the recognition result candidates instead of the per-syllable segmentation.
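To see why durations suffice, note that the boundaries relative to the start of the utterance are just cumulative sums of the durations. A short Python sketch, with an assumed function name and invented durations:

```python
from itertools import accumulate


def boundaries_from_durations(durations, start=0):
    """Rebuild boundary positions (msec) from per-syllable durations."""
    return [start] + [start + t for t in accumulate(durations)]


print(boundaries_from_durations([150, 300, 150]))   # [0, 150, 450, 600]
```

The absolute start point is arbitrary, which does not matter because the estimation depends only on the spacing between boundaries.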

[0037] For each recognition result candidate, the candidate selection unit 405 checks whether the duration of each syllable, obtained from the segmentation, lies between the minimum and maximum durations determined from the average syllable length obtained by the average syllable length estimation unit 404.

[0038] If a candidate contains even one syllable whose duration is outside this range, the candidate is rejected. When a candidate is found in which the durations of all syllables lie within the range, it is output as the recognition result and selection ends.

[0039] The operation of the candidate selection unit 405 is the same as in the first embodiment, except that the average syllable length differs for each candidate.

[0040] Alternatively, the average syllable length obtained for the top-ranked candidate can be used for the other candidates as well.

[0041] Furthermore, as described for the first embodiment, a duration score can be computed for each recognition result candidate and the selection made by overall score.

[0042] Next, a third embodiment of the present invention is described. FIG. 5 is a block diagram showing the configuration of the third embodiment.

[0043] Referring to FIG. 5, the input speech is input to a speech recognition unit 501 and converted into a time series of feature vectors. The speech recognition unit 501 recognizes the input speech as an arbitrary syllable string using syllable reference patterns 503 and syllable string connection rules 502.

[0044] The syllable connection rules express general constraints on Japanese syllable sequences, for example "ん (the moraic nasal) does not occur consecutively and does not appear word-initially" and "っ (the geminate marker) is not followed by a vowel", and are preferably represented as a finite-state network.

[0045] A probability can also be assigned to each transition of the network.
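The two example rules can be written as a small finite-state acceptor, sketched here in Python. The state names, the romanized symbols ("N" for ん, "Q" for っ), and the restriction to just these two rules are assumptions for illustration. Transition probabilities are omitted.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def next_state(state, syllable):
    """Return the successor state, or None if the transition is forbidden."""
    if syllable == "N":                       # moraic nasal
        # Not allowed word-initially or directly after another "N".
        return None if state in ("start", "after_N") else "after_N"
    if syllable in VOWELS and state == "after_Q":
        return None                           # no bare vowel after the geminate marker
    return "after_Q" if syllable == "Q" else "other"


def is_allowed(syllables):
    """Check a syllable string against the connection rules."""
    state = "start"
    for syllable in syllables:
        state = next_state(state, syllable)
        if state is None:
            return False
    return True


print(is_allowed(["o", "ba", "a", "sa", "N"]))   # True
print(is_allowed(["N", "a"]))                    # False
```

A recognizer would use such a network to prune syllable hypotheses during decoding rather than to validate finished strings, but the accepted language is the same.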

[0046] The speech recognition unit 501 performs recognition with a continuous speech recognition algorithm controlled by the finite-state network, and outputs the syllable string with the best score as a provisional recognition result, together with its per-syllable segmentation, to a recognition result correction unit 506 and an average syllable length estimation unit 505.

[0047] The average syllable length estimation unit 505 estimates the average syllable length of the input speech from the segmentation of the provisional recognition result, in the same way as in the second embodiment.

[0048] The recognition result correction unit 506 compares the duration of each syllable of the provisional recognition result, obtained from the segmentation, with the average syllable length obtained by the average syllable length estimation unit 505. If the ratio of a syllable's duration to the average syllable length is outside a predetermined range, the unit consults syllable string correction rules 504 and corrects the provisional recognition result by replacing that syllable, or a syllable string containing it and an adjacent syllable, with another syllable or syllable string. The character string obtained after correction is output as the recognition result.

[0049] FIG. 6 shows an example of the syllable string correction rules in the third embodiment.

[0050] Referring to FIG. 6, the syllable string correction rules consist of splitting rules and merging rules. When a syllable's duration is roughly twice the average syllable length or more, the syllable is replaced with the corrected syllable string given by a splitting rule.

[0051] When a syllable's duration is roughly half the average syllable length or less and, combined with the duration of an adjacent syllable, is approximately equal to the average syllable length, the syllable and the adjacent syllable are together replaced with a single syllable according to a merging rule.
The syllables adjacent to the syllable are collectively replaced with one syllable.

【００５２】図６では、非常に簡単な音節列修正規則の
例を示したが、実際には継続時間長や前後の音節環境に
より規則の適用に制約や優先条件を付けたり、より複雑
な修正規則を与えたりすることも可能である。FIG. 6 shows an example of a very simple syllable sequence correction rule. However, in practice, the application of the rule may be restricted or given a priority condition depending on the duration and the syllable environment before and after, or a more complicated correction may be made. It is also possible to give rules.
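The splitting and merging steps can be sketched in Python as below. FIG. 6 is not reproduced in this text, so the rule tables are hypothetical, as are the function name, the romanized syllable labels, the 30% tolerance used for "approximately equal", and the sample durations. For brevity the sketch merges a short syllable only with the syllable that follows it, although the text allows either neighbor.

```python
# Hypothetical rule tables standing in for FIG. 6.
SPLIT_RULES = {"ba": ["ba", "a"], "o": ["o", "o"]}
MERGE_RULES = {("ba", "a"): "ba", ("o", "o"): "o"}


def correct(syllables, durations, average_length,
            long_ratio=2.0, short_ratio=0.5, tolerance=0.3):
    """Apply splitting and merging rules to a provisional syllable string."""
    corrected = []
    i = 0
    while i < len(syllables):
        syllable, duration = syllables[i], durations[i]
        # Splitting: the syllable lasts about two average lengths or more.
        if duration >= long_ratio * average_length and syllable in SPLIT_RULES:
            corrected.extend(SPLIT_RULES[syllable])
            i += 1
            continue
        # Merging: a short syllable plus its successor spans about one average length.
        if i + 1 < len(syllables) and duration <= short_ratio * average_length:
            pair = (syllable, syllables[i + 1])
            combined = duration + durations[i + 1]
            if (pair in MERGE_RULES
                    and abs(combined - average_length) <= tolerance * average_length):
                corrected.append(MERGE_RULES[pair])
                i += 2
                continue
        corrected.append(syllable)
        i += 1
    return corrected


print(correct(["o", "ba", "sa", "N"], [140, 310, 150, 150], 150))
print(correct(["o", "ba", "a", "sa", "N"], [150, 70, 80, 150, 150], 150))
```

The first call splits the 310 msec "ba" into "ba" and "a"; the second merges the short "ba" and "a" back into a single "ba".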

[0053] In the third embodiment, instead of obtaining the average syllable length from the provisional recognition result, a syllable boundary detection unit can be provided as in the first embodiment and the average syllable length obtained from the syllable boundary candidates it outputs.

[0054] In the embodiments described above, the syllable is used as both the unit of speech recognition and the unit for which durations are obtained. The unit is not limited to the syllable, however; any acoustic unit, such as the phoneme or the demisyllable, can be used.

[0055] In the present invention, the unit of speech recognition and the unit for which durations are obtained need not be the same. For example, phonemes can be used as the recognition unit and syllables as the duration unit. Furthermore, the average syllable length can be obtained not from syllable boundary candidates but, for example, from candidates for the centers of the syllables' vowels.

[0056]

[Effects of the Invention] As described above, the present invention provides a speech recognition apparatus capable of high-performance recognition that uses the durations of syllables and similar units regardless of variations in speaking rate.

[Brief description of drawings]

FIG. 1 is a block diagram showing a first embodiment of the present invention.

FIG. 2 is a diagram showing an example of recognition result candidates in the first embodiment of the present invention.

FIG. 3 is a diagram showing an example of syllable boundary candidates in the first embodiment of the present invention.

FIG. 4 is a block diagram showing a second embodiment of the present invention.

FIG. 5 is a block diagram showing a third embodiment of the present invention.

FIG. 6 is a diagram showing an example of syllable string correction rules in the third embodiment of the present invention.

FIG. 7 is a block diagram of a conventional example.

[Explanation of symbols]

101 speech recognition unit
102 word dictionary
103 syllable reference patterns
104 syllable boundary detection unit
105 average syllable length estimation unit
106 candidate selection unit

Claims

[Claims]

1. A speech recognition apparatus, characterized in that a recognition result is selected from recognition result candidates for an input speech based on a duration of each syllable of the input speech and an average syllable length obtained from the input speech.

2. A speech recognition apparatus comprising: speech recognition means for recognizing input speech and outputting a plurality of recognition result candidates with information on the duration of each syllable attached; syllable boundary candidate detection means for obtaining syllable boundary candidates from the input speech; average syllable length estimation means for obtaining an average syllable length from the syllable boundary candidates; and candidate selection means for selecting a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.

3. A speech recognition apparatus comprising: speech recognition means for recognizing input speech and outputting a plurality of recognition result candidates with a per-syllable segmentation or information on the duration of each syllable attached; average syllable length estimation means for obtaining an average syllable length for each of the recognition result candidates; and candidate selection means for selecting a recognition result from the plurality of recognition result candidates based on the recognition result candidates and the average syllable length.

4. The speech recognition apparatus according to claim 3, wherein, instead of the average syllable length estimation means obtaining an average syllable length for each of the recognition result candidates, the average syllable length obtained for the top-ranked recognition result candidate is used for all recognition result candidates.

5. The speech recognition apparatus according to any one of claims 2 to 4, wherein the average syllable length estimation means retains information on the average syllable length obtained for past input speech and uses that information as well when obtaining the average syllable length of the input speech.

6. The speech recognition apparatus according to any one of claims 2 to 5, wherein the candidate selection means obtains, from the average syllable length, at least one of a maximum duration and a minimum duration of a syllable, either common to all syllables or for each syllable type, decides whether to reject each of the recognition result candidates based on the at least one of the maximum duration and the minimum duration, and takes the top-ranked candidate among those not rejected as the recognition result.

7. The speech recognition apparatus according to any one of claims 2 to 5, wherein the speech recognition means outputs the recognition result candidates with matching score information attached, and the candidate selection means obtains, for each of the recognition result candidates, a duration score based on the duration of each of its syllables and the average syllable length, and selects a recognition result from the recognition result candidates based on the matching score and the duration score of each candidate.

8. A speech recognition apparatus, characterized in that a recognition result for input speech is corrected based on the duration of each syllable, an average syllable length obtained from the input speech, and syllable string correction rules.

9. A speech recognition apparatus comprising: speech recognition means for recognizing input speech and outputting a provisional recognition result with information on the duration of each syllable attached; syllable boundary candidate detection means for obtaining syllable boundary candidates from the input speech; average syllable length estimation means for obtaining an average syllable length from the syllable boundary candidates; and recognition result correction means for obtaining a recognition result by correcting the provisional recognition result based on the provisional recognition result, the average syllable length, and syllable string correction rules.

10. A speech recognition apparatus comprising: speech recognition means for recognizing input speech and outputting a provisional recognition result with a per-syllable segmentation or information on the duration of each syllable attached; average syllable length estimation means for obtaining an average syllable length from the provisional recognition result; and recognition result correction means for obtaining a recognition result by correcting the provisional recognition result based on the provisional recognition result, the average syllable length, and syllable string correction rules.