JP6179509B2

JP6179509B2 - Language model generation apparatus, speech recognition apparatus, language model generation method, and program storage medium

Info

Publication number: JP6179509B2
Application number: JP2014515497A
Authority: JP
Inventors: 雅弘西光; 亮輔磯谷; 祥史大西; 真寺尾
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-05-17
Filing date: 2013-05-13
Publication date: 2017-08-16
Anticipated expiration: 2033-05-13
Also published as: JPWO2013172014A1; WO2013172014A1

Description

The present invention relates to a technique for generating a language model used in speech recognition processing.

A language model is used when a computer recognizes speech. A language model is a model that defines linguistic constraints. Techniques exist for optimizing such a language model according to the recognition target.

For example, in Patent Document 1, a speech recognition apparatus uses, as a language model for conversation, a database that associates a vocabulary set for questions, a vocabulary set for responses, and the probabilities with which they are linked. In Patent Document 2, a speech recognition apparatus treats a conversation between two parties as the recognition target and uses the result of recognizing the speech (utterance) of one speaker to generate a language model suited to the utterance of the other speaker.

Patent Document 1: JP 2004-012713 A
Patent Document 2: JP 2011-107314 A

With the method described in Patent Document 1, however, the data constituting the language model may become enormous. In a conversation between a customer service representative and a customer at a contact center, or in a dialogue at a meeting, a question and its response may be repeated many times. When such a series of exchanges (multiple utterances) is the recognition target, the number of combinations of vocabulary candidates contained in the questions and in the responses to them (that is, the data constituting the language model) becomes enormous.

With the method described in Patent Document 2, the process of generating the language model uses only the result of recognizing the speech uttered by one of the speakers in the conversation, so no language model is generated for words that are not included in that recognition result. For this reason, the speech recognition apparatus described in Patent Document 2 cannot generate a language model suited to the utterance of the other speaker.

The present invention has been made to solve the above problems. That is, the main object of the present invention is to provide a technique that improves the recognition accuracy of speech recognition processing for the utterance of another speaker, even when a word that is not included in the utterance of one of the speakers in a conversation is included in the utterance of the other speaker responding to that utterance.

To achieve the above object, a language model generation apparatus according to the present invention includes:
extraction means for collating a word or word string included in an utterance of a first speaker, out of the first speaker and another speaker who are having the conversation to be recognized, against first data in a data group of related word pairs, and for extracting second data associated with the first data that corresponds to the word or word string, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is related to the first data among the words or word strings included in a second utterance responding to the first utterance; and
model generation means for generating a language model corresponding to the utterance of the other speaker by using the extracted second data.

A speech recognition apparatus according to the present invention includes:
recognition means for executing speech recognition processing that converts speech into text data; and
a language model generation apparatus that generates a language model used in the speech recognition processing,
wherein the language model generation apparatus is the language model generation apparatus of the present invention and generates a language model corresponding to the other speaker based on a word or word string that is included in the utterance of the first speaker, out of the first speaker and the other speaker who are having the conversation to be recognized, and that has been converted into text data by the recognition means.

A language model generation method according to the present invention includes:
collating a word or word string included in an utterance of a first speaker, out of the first speaker and another speaker who are having the conversation to be recognized, against first data in a data group of related word pairs, and extracting second data associated with the first data that corresponds to the word or word string, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is related to the first data among the words or word strings included in a second utterance responding to the first utterance; and
generating a language model corresponding to the utterance of the other speaker by using the extracted second data.

A program storage medium according to the present invention stores a computer program that causes a computer to execute:
a process of collating a word or word string included in an utterance of a first speaker, out of the first speaker and another speaker who are having the conversation to be recognized, against first data in a data group of related word pairs, and extracting second data associated with the first data that corresponds to the word or word string, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is related to the first data among the words or word strings included in a second utterance responding to the first utterance; and
a process of generating a language model corresponding to the utterance of the other speaker by using the extracted second data.

The main object of the present invention is also achieved by a language model generation method corresponding to the language model generation apparatus of the present invention. It is likewise achieved by a computer program that implements the language model generation apparatus, the speech recognition apparatus, and the language model generation method of the present invention on a computer, and by a program storage medium that stores that computer program.

According to the present invention, the recognition accuracy of processing that recognizes the utterance of another speaker can be improved even when a word not included in the utterance of one of the speakers in a conversation is included in the utterance of the other speaker responding to that utterance.

FIG. 1 is a simplified block diagram of the hardware configuration of the speech recognition apparatus of the first embodiment according to the present invention.
FIG. 2 is a simplified block diagram of the configuration of the language model generation apparatus of the first embodiment.
FIG. 3 is a diagram illustrating an example of related word pairs.
FIG. 4 is a flowchart showing an operation example in which the language model generation apparatus of the first embodiment generates a language model.
FIG. 5 is a diagram illustrating an example of attached information associated with a related word pair.
FIG. 6 is a simplified block diagram of the configuration of the speech recognition apparatus of the second embodiment according to the present invention.
FIG. 7 is a flowchart showing an operation example in which the speech recognition apparatus of the second embodiment recognizes speech.
FIG. 8 is a simplified block diagram of the configuration of the speech recognition apparatus of the third embodiment according to the present invention.
FIG. 9 is a flowchart showing an operation example in which the language model generation apparatus of the third embodiment generates related word pairs.

Embodiments according to the present invention are described below with reference to the drawings.

<First Embodiment>
FIG. 1 is a simplified block diagram of the hardware configuration of a speech recognition apparatus 1 that includes the language model generation apparatus of the first embodiment of the present invention. The speech recognition apparatus 1 includes a CPU (Central Processing Unit) 10, a memory 11 serving as a storage device, an HDD (Hard Disk Drive) 12 serving as a storage device, a communication IF (interface) 13, an input device 14, and a voice input device 15. The input device 14 inputs information to the speech recognition apparatus 1; examples include a keyboard and a pointing device such as a mouse. The voice input device 15 takes speech into the speech recognition apparatus 1 by converting it into an electrical signal (speech signal); an example is a microphone. The components 10 to 15 of the speech recognition apparatus 1 are connected to one another through a bus 16 and exchange data with one another.

FIG. 2 is a block diagram showing, in solid lines, a configuration example of the language model generation apparatus 18 included in the speech recognition apparatus 1. The language model generation apparatus 18 includes an extraction unit (extraction means) 100 and a model generation unit (model generation means) 101. The language model generation apparatus 18 is realized by the CPU 10 executing a computer program (hereinafter also simply called a program) 19 stored in the HDD 12 (or the memory 11). In other words, the language model generation apparatus 18 of the first embodiment can also be said to be constituted by the computer program 19 or by the computer-readable storage medium 12 (11) in which that program is stored. All or part of the functions of the language model generation apparatus 18 may instead be realized by hardware provided in the speech recognition apparatus 1.

In the first embodiment, the storage device (HDD 12) stores a data group 20 of collected related word pairs. FIG. 3 illustrates an example of the data group 20. A related word pair is data concerning a conversation in which first data based on a first utterance is associated with second data based on a second utterance that responds to the first utterance. The first data is a word or word string included in the first utterance. The second data is a related word (a word or word string) that is related to the first data among the words or word strings included in the second utterance. Taking a conversation between an operator and a customer as an example, a related word pair combines a word or word string in the operator's utterance (the first utterance) with a related word, that is, a word or word string that co-occurs in the customer's utterance (the second utterance) responding to it.

More specifically, suppose that, in response to the operator's utterance (question) "May I have your name?", the customer answers "This is Suzuki" or "This is Yamada". From such a conversation, data can be obtained showing that the related word "Suzuki" or "Yamada" in the customer's utterance (second utterance) co-occurs with (that is, is likely to be uttered in response to) the word "name" in the operator's utterance (first utterance). In this case, a related word pair associating "name" as first data with "Suzuki" as second data, or a related word pair associating "name" as first data with "Yamada" as second data, is generated.

Suppose further that the following conversation takes place between the operator and the customer.

Operator: "Ctrl"
Customer: "Ctrl, yes"
Operator: "Alt"
Customer: "Alt, yes"
Operator: "Delete"
Customer: "Delete, yes"
Operator: "Press those three buttons at the same time"
Customer: "Yes"
Operator: "That will let you restart"
Customer: "Restart, I see"

From such a conversation, data can be obtained showing, for example, that the related word "restart" in the customer's utterance (second utterance) co-occurs with the word string (combination of words) "Ctrl, Alt, Delete" in the operator's utterance (first utterance). In this case, a related word pair is generated that associates the word string "Ctrl, Alt, Delete" as first data with the related word "restart" as second data. From the same conversation, a related word pair is also generated that associates the word "restart" as first data with the related word "restart" as second data. Furthermore, a related word pair is generated that associates the word string "Ctrl, Alt, Delete", "simultaneously", "press" as first data with the related word "restart" as second data.

In this example, the operator's utterance is treated as the first utterance and the customer's utterance as the second utterance, so that a word or word string based on the operator's utterance is the first data and a word or word string based on the customer's utterance is the second data. Conversely, the customer's utterance may be treated as the first utterance and the operator's utterance as the second utterance, with a word or word string based on the customer's utterance as the first data and a word or word string based on the operator's utterance as the second data.
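
As an illustration only, the data group of related word pairs could be held in memory as a mapping from first data to a set of second data. The following is a minimal sketch; the function name `build_data_group` and the choice of a tuple of words as the key are assumptions made for this example and are not taken from the specification.

```python
from collections import defaultdict


def build_data_group(related_word_pairs):
    """Collect (first data, second data) pairs into one lookup table.

    Hypothetical representation: first data is stored as a tuple of words so
    that both a single word and a word string can be used as a key.
    """
    group = defaultdict(set)
    for first_data, second_data in related_word_pairs:
        group[tuple(first_data)].add(second_data)
    return dict(group)


# Related word pairs taken from the examples in the description.
data_group = build_data_group([
    (["name"], "Suzuki"),
    (["name"], "Yamada"),
    (["Ctrl", "Alt", "Delete"], "restart"),
    (["restart"], "restart"),
])
```

With this layout, several related word pairs that share the same first data simply accumulate in one set, which mirrors the "name" example above.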

The extraction unit 100 has a function of extracting second data from the data group 20 based on a word or word string included in the utterance of the first speaker, out of the first speaker and the other speaker who are having the conversation to be recognized. Specifically, the extraction unit 100 collates the word or word string included in the first speaker's utterance against the first data in the data group 20 and, when first data corresponding to that word or word string exists, extracts the second data associated with that first data. When there are multiple related word pairs that contain the same first data corresponding to the collated word or word string but differ in their second data, the extraction unit 100 extracts the second data of each of those related word pairs. For example, when the word "name" is collated against the data group 20 shown in FIG. 3, the extraction unit 100 extracts "Suzuki" and "Yamada" based on the related word pair of "name" and "Suzuki" and the related word pair of "name" and "Yamada".
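
The collation performed by the extraction unit can be sketched as follows. This is a simplified illustration, not the claimed implementation: it assumes that the utterance is already split into words, and that a first-data word string matches when every one of its words appears in the utterance. The function name `extract_second_data` is hypothetical.

```python
def extract_second_data(utterance_words, data_group):
    """Return every second data whose first data matches the utterance.

    Assumption for this sketch: first data (a tuple of words) matches when
    all of its words are present among the utterance words.
    """
    present = set(utterance_words)
    extracted = set()
    for first_data, second_data in data_group.items():
        if set(first_data) <= present:
            extracted |= second_data
    return extracted


data_group = {
    ("name",): {"Suzuki", "Yamada"},
    ("Ctrl", "Alt", "Delete"): {"restart"},
}
related = extract_second_data(["may", "I", "have", "your", "name"], data_group)
no_match = extract_second_data(["hello"], data_group)
```

Here `related` holds both "Suzuki" and "Yamada", and `no_match` is empty because no first data corresponds to the utterance.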

The model generation unit 101 has a function of generating a language model corresponding to the utterance of the other speaker by using the second data extracted by the extraction unit 100. There are various types of language model, such as the N-gram language model, the trigger language model, and the hierarchical Bayesian language model. Here, the model generation unit 101 generates a language model of a type selected in consideration of, for example, the circumstances in which speech recognition is performed. A description of the method for generating the language model is omitted here.
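
Because the specification leaves the generation method open, the following is only one conceivable illustration: a unigram (1-gram) model in which a fixed share of probability mass is moved onto the extracted second data. The function name `generate_unigram_model`, the `boost` parameter, and the interpolation scheme are all assumptions introduced for this sketch.

```python
def generate_unigram_model(base_probs, second_data, boost=0.3):
    """Mix a base unigram distribution with uniform mass on the second data.

    Assumes base_probs sums to 1. The result also sums to 1: the base
    distribution is scaled by (1 - boost) and the boost is shared equally
    among the extracted words.
    """
    if not second_data:
        return dict(base_probs)
    share = boost / len(second_data)
    model = {word: (1.0 - boost) * prob for word, prob in base_probs.items()}
    for word in second_data:
        model[word] = model.get(word, 0.0) + share
    return model


base = {"yes": 0.5, "no": 0.5}
model = generate_unigram_model(base, {"Suzuki", "Yamada"})
```

In this sketch, words such as "Suzuki" that were absent from the base model receive nonzero probability, which is the effect the description attributes to using the second data.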

Next, an operation example of the language model generation apparatus 18 is described with reference to the flowchart of FIG. 4. The flowchart of FIG. 4 represents the processing procedure written in the program executed by the language model generation apparatus 18.

For example, when the language model generation apparatus 18 receives a word or word string included in the utterance of the first speaker (step S101 in FIG. 4), the extraction unit 100 collates the received word or word string against the first data in the data group 20. When first data corresponding to that word or word string exists, the extraction unit 100 extracts the second data associated with that first data (step S102).

The model generation unit 101 then uses the extracted second data to generate a language model corresponding to the utterance of the other speaker who is conversing with the first speaker (step S103).

In the first embodiment, as described above, the language model generation apparatus 18 (speech recognition apparatus 1) extracts from the data group 20, based on the utterance of the first speaker, the words or word strings (that is, the second data) expected to co-occur in the utterance of the other speaker. The language model generation apparatus 18 (speech recognition apparatus 1) then generates a language model corresponding to the utterance of the other speaker based on the extracted second data. As a result, the language model generation apparatus 18 (speech recognition apparatus 1) of the first embodiment can keep the amount of data for the language model corresponding to the other speaker's utterance smaller than when the model is based on, for example, the vocabulary candidates contained in questions and responses covering a variety of assumed situations.

Moreover, the language model generation apparatus 18 (speech recognition apparatus 1) of the first embodiment can improve the speech recognition accuracy for the other speaker's utterance even when that utterance, made in response to the first speaker's utterance, does not contain the words or word strings included in the first speaker's utterance. In particular, it can improve speech recognition accuracy for a series of exchanges in which questions and responses are repeated many times, such as conversations at a contact center or in a meeting.

<Second Embodiment>
Next, a second embodiment according to the present invention is described. In the description of the second embodiment, parts with the same names as components of the speech recognition apparatus of the first embodiment are given the same reference numerals, and redundant description of the common parts is omitted.

FIG. 6 is a simplified block diagram of the configuration of the speech recognition apparatus of the second embodiment. The speech recognition apparatus 1 of the second embodiment has a speech recognition unit (recognition means) 103. The speech recognition unit 103 has a function of executing speech recognition processing that converts speech into text data. That is, when an electrical signal corresponding to speech is input from the voice input device (microphone) 15, the speech recognition unit 103 converts that electrical signal (speech) into text data. In the second embodiment, among the speakers having the conversation to be recognized, the speaker for whom the speech recognition processing is expected to achieve higher recognition accuracy is designated as the first speaker, and this information is given to the speech recognition apparatus 1. Accordingly, when executing speech recognition processing on the utterance of another speaker who is not the first speaker, the speech recognition unit 103 uses the language model generated by the model generation unit 101 and stored in the storage device (HDD) 12.

The speech recognition unit 103 may output multiple conversion candidates (text data) from the speech recognition processing. The speech recognition unit 103 also associates, as probability information, at least one of the conversion reliability, the part of speech of a word, an acoustic score, a language score, and an N-gram hit rate with the text data produced by the speech recognition processing. In addition, when the speech (electrical signal) received from the voice input device 15 is that of the first speaker, the speech recognition unit 103 associates with the resulting text data information indicating that it is the first speaker's utterance.

When the text data from the speech recognition unit 103 is based on the utterance of the first speaker, the extraction unit 100 extracts second data from the data group 20 based on the word or word string included in that text data, in the same manner as in the first embodiment.

When the speech recognition unit 103 outputs multiple conversion candidates (text data), the extraction unit 100 has a function of selecting one of them by using the probability information associated with the text data. For example, the extraction unit 100 selects, based on the probability information, the conversion candidate whose conversion is considered most appropriate. The extraction unit 100 then extracts second data from the data group 20 based on the word or word string included in the text data of the selected conversion candidate, in the same manner as in the first embodiment.
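
Selecting one conversion candidate by its probability information might look like the sketch below. It assumes, for illustration only, that the probability information is reduced to a single numeric reliability value per candidate; the `Candidate` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    reliability: float  # hypothetical stand-in for the "probability information"


def select_candidate(candidates):
    """Pick the conversion candidate with the highest reliability."""
    return max(candidates, key=lambda candidate: candidate.reliability)


candidates = [
    Candidate("may I have your game", 0.41),
    Candidate("may I have your name", 0.87),
]
best = select_candidate(candidates)
```

The words of `best.text` would then be collated against the first data in the same way as in the first embodiment.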

The rest of the configuration of the speech recognition apparatus 1 (language model generation apparatus) of the second embodiment is the same as in the first embodiment.

Next, an operation example of the speech recognition apparatus 1 of the second embodiment is described with reference to the flowchart of FIG. 7. The flowchart of FIG. 7 represents the processing procedure written in the program executed by the CPU 10 of the speech recognition apparatus 1.

In the speech recognition apparatus 1, when speech (an electrical signal) is supplied from the voice input device 15 to the speech recognition unit 103, the speech recognition unit 103 executes speech recognition processing that converts the speech into text data (step S201 in FIG. 7). When the source speech is an utterance of the first speaker, the extraction unit 100 extracts second data from the data group 20 based on the word or word string included in the text data produced by the speech recognition processing (step S202). The model generation unit 101 uses the extracted second data to generate a language model corresponding to the other speaker conversing with the first speaker (step S203). The generated language model is stored in the storage device (the HDD 12 or the memory 11).

Thereafter, when speech (an electrical signal) of the other speaker's utterance made in response to the first speaker's utterance is supplied to the speech recognition unit 103, the speech recognition unit 103 executes speech recognition processing that converts that speech into text data (step S204). In doing so, the speech recognition unit 103 uses the language model corresponding to the other speaker that was generated by the model generation unit 101 and stored in the storage device 12 (11).
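
The flow of steps S201 to S204 can be illustrated with a toy example in which actual recognition is replaced by given text. Everything below is an assumption made for illustration: the first speaker's recognized text is supplied directly, the "language model" is reduced to a set of favored words, and the other speaker's hypotheses carry made-up acoustic scores.

```python
def extract_boosted_words(first_speaker_text, data_group):
    """S202: collate the first speaker's words against the first data."""
    words = set(first_speaker_text.split())
    boosted = set()
    for first_data, second_data in data_group.items():
        if set(first_data) <= words:
            boosted |= second_data
    return boosted


def recognize_other_speaker(hypotheses, boosted_words, bonus=1.0):
    """S204: pick the hypothesis with the best acoustic score plus a bonus
    for each word favored by the generated model (a set in this sketch)."""
    def score(hypothesis):
        text, acoustic_score = hypothesis
        hits = sum(word in boosted_words for word in text.split())
        return acoustic_score + bonus * hits
    return max(hypotheses, key=score)[0]


data_group = {("name",): {"Suzuki", "Yamada"}}
first_text = "may I have your name"                      # S201 result, assumed given
boosted = extract_boosted_words(first_text, data_group)  # S202; S203 keeps this set
hypotheses = [("this is Susuki", -4.0), ("this is Suzuki", -4.2)]
result = recognize_other_speaker(hypotheses, boosted)    # S204
baseline = recognize_other_speaker(hypotheses, set())    # without the generated model
```

In this toy setting the hypothesis containing "Suzuki" wins only when the generated model is applied, even though "Suzuki" never appeared in the first speaker's utterance.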

Because the speech recognition apparatus 1 (language model generation apparatus 18) of the second embodiment includes the same extraction unit 100 and model generation unit 101 as the first embodiment, it can obtain the same effects as the first embodiment.

<Third Embodiment>
Next, a third embodiment according to the present invention is described. In the description of the third embodiment, parts with the same names as components of the speech recognition apparatuses of the first and second embodiments are given the same reference numerals, and redundant description of the common parts is omitted.

FIG. 8 is a simplified block diagram showing a configuration example of the speech recognition apparatus 1 according to the third embodiment. The language model generation device 18 constituting the speech recognition apparatus 1 of the third embodiment includes a data generation unit (data generation means) 104 in addition to the configuration of the second embodiment. A conversation corpus 22 is stored in the storage device (HDD 12). The conversation corpus 22 is a collection of data in which text data of sample conversations is associated with information related to those conversations. Specifically, the information related to a conversation includes speech features obtained from the voices of the speakers in the conversation, characteristics of the speaker who made each utterance, and the time of the utterance. Examples of speech features are the fundamental frequency of the uttered speech and the feature vectors used for speech recognition. Characteristics of the speaker include the speaker's name or role. Examples of roles include the chair of a meeting, and the operator and the customer in a call center. The utterance time includes the absolute time from the start of the conversation to the start of the utterance, or the duration of the utterance. In addition to the above, the conversation corpus 22 may also contain, as information related to the conversation, the speaker's emotion (for example, that the speaker is angry) and the style of the utterance (for example, whether it is a question or a statement).
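As an illustration only, one entry of such a conversation corpus could be modeled as below. This is a minimal sketch: the class names `Utterance` and `Conversation` and every field name are assumptions introduced here, and the source does not specify a storage format for the conversation corpus 22.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Utterance:
    """One utterance of a sample conversation (all field names are hypothetical)."""
    text: str                                  # text data of the utterance
    speaker_role: Optional[str] = None         # e.g. "operator", "customer", "chair"
    speaker_name: Optional[str] = None         # name of the speaker, if known
    start_offset_sec: Optional[float] = None   # time from conversation start to utterance start
    duration_sec: Optional[float] = None       # length of the utterance
    f0_hz: Optional[float] = None              # fundamental frequency, one possible speech feature
    emotion: Optional[str] = None              # optional, e.g. "angry"
    style: Optional[str] = None                # optional, e.g. "question" or "statement"


@dataclass
class Conversation:
    """A sample conversation: an ordered list of utterances."""
    utterances: List[Utterance] = field(default_factory=list)


# A tiny hypothetical corpus with one call-center exchange.
corpus = [
    Conversation([
        Utterance("may I have your name", speaker_role="operator",
                  start_offset_sec=4.0, duration_sec=1.5, style="question"),
        Utterance("my name is Tanaka", speaker_role="customer",
                  start_offset_sec=6.0, duration_sec=1.2, style="statement"),
    ])
]
```

Optional fields default to `None` because the description lists emotion and utterance style as information that may, but need not, be present.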

The data generation unit 104 is one of the functional units of the CPU 10, realized by the CPU 10 operating in accordance with the program 19. The data generation unit 104 takes, as first data, a word or word string included in the text data (the utterance of the first speaker) produced by the speech recognition processing of the speech recognition unit 103. The data generation unit 104 also has a function of extracting, as second data, a word or word string related to the first data from the conversation corpus 22. Furthermore, it has a function of generating a related word pair that associates the first data with the second data, and of adding that pair to the data group 20.

Specifically, the data generation unit 104 extracts from the conversation corpus 22 the text data of conversations that contain the word or word string taken as the first data. Using the extracted text data, the data generation unit 104 then identifies a word or word string related to the first data as the second data. It generates a related word pair by associating the first data with the second data, and adds the pair to the data group 20. Such a data generation unit 104 can function not only when the data group 20 is generated in advance; it can also give the speech recognition apparatus 1 the ability to learn (machine learning) the data group 20 while actually performing speech recognition processing on a conversation.

Next, an example of the operation in which the data generation unit 104 (CPU 10) generates a related word pair will be described with reference to the flowchart of FIG. 9. FIG. 9 is a flowchart showing an example of the operation in which the speech recognition apparatus 1 of the third embodiment generates a related word pair. In other words, the flowchart of FIG. 9 represents the processing procedure by which the speech recognition apparatus 1 generates a related word pair.

When the data generation unit 104 sets, as the first data, a word or word string included in the text data (the utterance of the first speaker) produced by the speech recognition processing of the speech recognition unit 103 (step S301 in FIG. 9), it collates the first data against the conversation corpus 22. The data generation unit 104 thereby extracts from the conversation corpus 22 the information on conversations that contain the first data. Based on the extracted conversation information, the data generation unit 104 extracts, as the second data, a word or word string that is included in an utterance responding to the utterance containing the first data and that is related to the first data (step S302). The data generation unit 104 then generates a related word pair by associating the first data with the second data (step S303), and adds the related word pair to the data group 20 in the storage device 12 (step S304).
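The flow of steps S301 to S304 can be sketched as follows. This is a hedged illustration, not the patented implementation: the source does not say how "related to the first data" is decided, so the sketch assumes a simple rule (a word counts as related if it appears in responses to the first data in at least `min_count` sample conversations). The function name `generate_related_pairs`, the `(speaker, text)` turn format, and the restriction of the first data to a single whitespace-separated word are all assumptions made here.

```python
from collections import Counter
from typing import List, Set, Tuple

Turn = Tuple[str, str]        # (speaker, text); hypothetical corpus format
Conversation = List[Turn]


def generate_related_pairs(first_data: str,
                           corpus: List[Conversation],
                           min_count: int = 2) -> Set[Tuple[str, str]]:
    """Return (first data, second data) pairs found in the corpus.

    Assumption: a response word is "related" when it occurs in at least
    `min_count` responses to utterances that contain `first_data`.
    """
    counts: Counter = Counter()
    for conversation in corpus:
        # Walk over each utterance together with the utterance that follows it.
        for (speaker, text), (next_speaker, next_text) in zip(conversation, conversation[1:]):
            # S302: keep responses by a different speaker to an utterance containing the first data.
            if next_speaker != speaker and first_data in text.split():
                counts.update(set(next_text.split()))
    # S303: associate the first data with each related response word.
    return {(first_data, word) for word, n in counts.items() if n >= min_count}


# Hypothetical sample corpus.
corpus = [
    [("operator", "do you have a reservation"), ("customer", "booked Friday")],
    [("operator", "is there a reservation"), ("customer", "booked Saturday")],
]

data_group: Set[Tuple[str, str]] = set()
first_data = "reservation"                                  # S301: word from the recognized utterance
data_group |= generate_related_pairs(first_data, corpus)    # S304: add the pairs to the data group
```

With this sample corpus only "booked" appears in both responses, so the data group ends up holding the single pair `("reservation", "booked")`. A practical system would likely need a stronger relatedness measure and some filtering of function words.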

The third embodiment also includes a language model generation device 18 similar to that of the first and second embodiments, so the same effects as in the first and second embodiments can be obtained. In addition, the data generation unit 104 of the third embodiment uses a conversation corpus when generating related word pairs. The data generation unit 104 can thereby generate related word pairs suited to the conversation.

<Other embodiments>
The present invention is not limited to the first to third embodiments and may take various forms. For example, in the data group 20, information concerning each related word pair may be associated with the pair as attached information, as shown in FIG. 5. The attached information includes, for example, the notation of the word or word string that is the first data or the second data, its part of speech, the role of the speaker, the time from the start of the conversation to the utterance, the number of utterances, and the number of words. One or more of these items of attached information may be associated with each related word pair in the data group 20.

When attached information is associated with the related word pairs in this way, the extraction unit 100 receives not only the first data based on the utterance of the first speaker but also information corresponding to the attached information, as reference information. For example, the extraction unit 100 receives time information concerning the conversation as reference information from a timer provided in the speech recognition apparatus 1. The extraction unit 100 may also receive, as reference information, speaker role information or the like entered through the input device 14.

The extraction unit 100 extracts the second data from the data group 20 based on the first data and the reference information described above.

For example, in a conversation at a company's contact center, the operator's utterance "May I have your name?" is often spoken at the beginning of the conversation or just before its end. Suppose that such information is associated, as attached information, with the related word pairs in the data group 20. In this case, when the extraction unit 100 receives the utterance "May I have your name?" as the first data, it can extract second data better suited to that first data by using the information on the utterance time (the time from the start of the conversation until the utterance is spoken) as reference information.

In a contact center conversation, there are also words or word strings that the operator utters but the customer does not. When such a word or word string is given to the extraction unit 100 as the first data, the extraction unit 100 can, for example, also use information on the speaker's role (reference information) when extracting the second data from the data group 20, and thereby extract second data that is more closely related to the first data.
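The use of reference information in the two examples above can be sketched as a filter over the data group. This is an illustration under stated assumptions: the record layout `RelatedPair`, the representation of the utterance time as a `(start, end)` window in seconds, the function name `extract_second_data`, and the sample entries are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RelatedPair:
    """A related word pair with optional attached information (hypothetical layout)."""
    first: str                                         # first data
    second: str                                        # second data
    role: Optional[str] = None                         # role of the speaker who utters `first`
    time_window: Optional[Tuple[float, float]] = None  # seconds from conversation start


def extract_second_data(first_data: str,
                        data_group: List[RelatedPair],
                        role: Optional[str] = None,
                        elapsed_sec: Optional[float] = None) -> List[str]:
    """Return the second data of every pair that matches the first data and the reference information."""
    results = []
    for pair in data_group:
        if pair.first != first_data:
            continue
        # Use the role only when both the reference information and the attached information exist.
        if role is not None and pair.role is not None and pair.role != role:
            continue
        # Likewise for the utterance time.
        if elapsed_sec is not None and pair.time_window is not None:
            start, end = pair.time_window
            if not (start <= elapsed_sec <= end):
                continue
        results.append(pair.second)
    return results


# Hypothetical data group.
data_group = [
    RelatedPair("name", "Tanaka", role="operator", time_window=(0.0, 60.0)),
    RelatedPair("name", "spelling", role="operator", time_window=(300.0, 600.0)),
    RelatedPair("refund", "receipt", role="customer"),
]

early = extract_second_data("name", data_group, role="operator", elapsed_sec=10.0)
late = extract_second_data("name", data_group, role="operator", elapsed_sec=400.0)
no_reference = extract_second_data("name", data_group)
```

Here `early` contains only "Tanaka" and `late` only "spelling", while `no_reference` returns both candidates. Pairs that lack a given item of attached information are kept rather than discarded, which is one possible design choice among several.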

In the first to third embodiments, the extraction unit 100 extracts the second data related to the first data. Alternatively, the extraction unit 100 may extract not just the second data but the related word pair in which the first data and the second data are associated. In that case, the model generation unit 101 generates the language model using the extracted related word pair.

The present invention has been described above with reference to the embodiments, but it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2012-113534, filed on May 17, 2012, the entire disclosure of which is incorporated herein.

Some or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to them.
(Supplementary note 1)
A language model generation device comprising:
a related word extraction unit that, based on related word pairs each associating a word string included in an utterance of a first speaker with a related word included in an utterance of a second speaker responding to that utterance, extracts a related word associated with the word string included in conversation data; and
a language model generation unit that generates a language model of the second speaker using the related word extracted by the related word extraction unit.
(Supplementary note 2)
The language model generation device according to supplementary note 1, further comprising a speech recognition unit that performs speech recognition on the utterance of the first speaker,
wherein the related word extraction unit extracts, based on the related word pairs, a related word associated with the word string included in the result of speech recognition by the speech recognition unit.
(Supplementary note 3)
The language model generation device according to supplementary note 2, wherein the related word extraction unit treats, as the first speaker, the speaker among those in the conversation whose speech is recognized by the speech recognition unit with higher accuracy than the others.
(Supplementary note 4)
The language model generation device according to supplementary note 2 or 3, further comprising a related word pair generation unit that generates, as the related word pair, a combination of the speech recognition result of the first speaker and text in a conversation corpus that is highly related to that speech recognition result, based on the word string included in the speech recognition result of the first speaker and on the conversation corpus, the conversation corpus associating information converted into text data from uttered speech in a specific conversation, speech features obtained from the uttered speech or the specific conversation, and speaker information obtained from the uttered speech or the specific conversation.
(Supplementary note 5)
The language model generation device according to any one of supplementary notes 1 to 4, wherein the related word extraction unit extracts, as related words, one or more word strings included in the utterance of the first speaker and one or more word strings included in the utterance of the second speaker responding to that utterance.
(Supplementary note 6)
The language model generation device according to any one of supplementary notes 1 to 5, wherein the related word extraction unit extracts a related word from the related word pairs using at least one of the notation of a word string included in the utterance text of the first speaker, its part of speech, the role of the speaker, the time from the start of the conversation to the utterance, the number of utterances, and the number of words.
(Supplementary note 7)
The language model generation device according to any one of supplementary notes 2 to 4, wherein the related word extraction unit selects the speech recognition result of the first speaker using one or more of the reliability, part of speech, acoustic score, language score, and N-gram hit rate included in the speech recognition result of the first speaker.
(Supplementary note 8)
The language model generation device according to supplementary note 1, further comprising a speech recognition unit that performs speech recognition on the utterance of the second speaker based on the language model of the second speaker generated by the language model generation unit.
(Supplementary note 9)
A language model generation method comprising:
extracting, based on related word pairs each associating a word string included in an utterance of a first speaker with a related word included in an utterance of a second speaker responding to that utterance, a related word associated with the word string included in conversation data; and
generating a language model of the second speaker using the extracted related word.
(Supplementary note 10)
A computer program that causes a computer to execute:
a process of extracting, based on related word pairs each associating a word string included in an utterance of a first speaker with a related word included in an utterance of a second speaker responding to that utterance, a related word associated with the word string included in conversation data; and
a process of generating a language model of the second speaker using the extracted related word.

The present invention is effective in fields that use technology for recognizing speech in conversation.

1 Speech recognition apparatus
10 CPU
18 Language model generation device
100 Extraction unit
101 Model generation unit
103 Speech recognition unit
104 Data generation unit

Claims

A language model generation device comprising:
extraction means for collating a word or word string included in an utterance of a first speaker, among the first speaker and another speaker who are having a conversation to be recognized, against the first data in a data group of related word pairs, and for extracting the second data associated with the first data corresponding to that word or word string, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is a word or word string included in a second utterance responding to the first utterance and that is related to the first data; and
model generation means for generating a language model corresponding to the utterance of the other speaker using the extracted second data.

The language model generation device according to claim 1, wherein, when the extraction means detects that the data group contains a plurality of related word pairs that have the same first data corresponding to the collated word or word string but different second data, the extraction means extracts the second data of each of the plurality of related word pairs.

The language model generation device according to claim 1, wherein each related word pair in the data group is associated, as attached information, with at least one of the notation of the first data, the part of speech of the first data, the role of the speaker who utters the first data, the time from the start of the conversation until the first data is uttered, and the number of words of the first data within a predetermined period, and
the extraction means receives not only a word or word string included in the utterance of the first speaker but also reference information corresponding to the attached information, and also uses the reference information when extracting the second data from the data group.

The language model generation device according to claim 3, wherein the extraction means receives conversion candidates for a word or word string produced by speech recognition processing that converts speech into text data, each conversion candidate being associated, as probability information, with at least one of a reliability, a part of speech, an acoustic score, a language score, and an N-gram hit rate obtained by the speech recognition processing, and
when the extraction means receives a plurality of conversion candidates from the speech recognition processing of the utterance of the first speaker, the extraction means selects one of the plurality of conversion candidates using the probability information associated with them, and extracts the second data by collating the word or word string of the selected conversion candidate against the first data of the data group.

The language model generation device according to any one of claims 1 to 4, wherein the extraction means extracts not only the second data but also the first data associated with it, and the model generation means generates the language model corresponding to the utterance of the other speaker using the related word pair in which the extracted first data and second data are associated.

The language model generation device according to claim 1, further comprising data generation means for generating the data group,
wherein the data generation means generates a related word pair by associating a word or word string taken from the utterance of the first speaker, as the first data in the data group, with a word or word string that is included in a conversation corpus collecting text data of sample conversations and that is highly related to the word or word string serving as the first data, as the second data in the data group.

A speech recognition apparatus comprising:
recognition means for executing speech recognition processing that converts speech into text data; and
a language model generation device that generates a language model used for the speech recognition processing,
wherein the language model generation device is the language model generation device according to any one of claims 1 to 6, and generates a language model corresponding to the other speaker based on a word or word string that is included in the utterance of the first speaker, among the first speaker and the other speaker who are having the conversation to be recognized, and that has been converted into text data by the recognition means.

The speech recognition apparatus, wherein the recognition means has a function of executing speech recognition processing that converts the speech of the other speaker into text data using the language model generated by the language model generation device.

A language model generation method in which a computer:
collates a word or word string included in an utterance of a first speaker, among the first speaker and another speaker who are having a conversation to be recognized, against the first data in a data group of related word pairs, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is a word or word string included in a second utterance responding to the first utterance and that is related to the first data;
extracts the second data associated with the first data corresponding to that word or word string; and
generates a language model corresponding to the utterance of the other speaker using the extracted second data.

A program storage medium storing a computer program that causes a computer to execute:
a process of collating a word or word string included in an utterance of a first speaker, among the first speaker and another speaker who are having a conversation to be recognized, against the first data in a data group of related word pairs, and extracting the second data associated with the first data corresponding to that word or word string, each related word pair associating first data, which is a word or word string included in a first utterance, with second data, which is a related word that is a word or word string included in a second utterance responding to the first utterance and that is related to the first data; and
a process of generating a language model corresponding to the utterance of the other speaker using the extracted second data.