JP6347938B2

JP6347938B2 - Utterance key word extraction device, key word extraction system using the device, method and program thereof

Info

Publication number: JP6347938B2
Application number: JP2013239471A
Authority: JP
Inventors: 吉田 明弘; 青野 裕司; 石原 晋也; 國田 豊; 神田 義男; 野村 英司; 大石 雄二
Original assignee: Nippon Telegraph and Telephone Corp; Nippon Telegraph and Telephone East Corp; NTT Inc USA
Current assignee: Nippon Telegraph and Telephone East Corp; NTT Inc; NTT Inc USA
Priority date: 2013-11-20
Filing date: 2013-11-20
Publication date: 2018-06-27
Anticipated expiration: 2033-11-20
Also published as: JP2015099289A

Description

本発明は、発話内の重要語を抽出する発話内重要語抽出装置と、当該装置を用いた発話内重要語抽出システムと、それらの方法とプログラムに関する。 The present invention relates to an important word extraction device in an utterance for extracting an important word in an utterance, an important word extraction system in an utterance using the device, a method and a program thereof.

In daily life, we communicate with others by speaking. Speech carries a variety of utterance information, and efficiently extracting the words that appear important from that information would make it possible to create personal activity histories, such as life logs, and memoranda.

From the standpoint of user convenience, it is preferable not to require the user to speak deliberately each time a record is to be kept, but instead to record audio continuously and obtain information from the utterances made unconsciously in everyday life. A personal activity history or memorandum built from the apparently important words obtained from that utterance information is expected to be a useful record.

A known conventional method for extracting important words from utterance content uses a keyword list in which important words are registered in advance, and extracts recognition results that match a keyword as important words (see, for example, Patent Document 1).

Patent Document 1: 特開2008-286921号公報 (JP 2008-286921 A)

However, the words that should count as important differ from user to user, so the conventional method requires a keyword list to be created for each user. Moreover, each user's interests generally change over the course of daily life, so the keywords tied directly to those interests would have to be maintained every day, which is not realistic. In addition, everyday conversational speech is fast and often unclearly pronounced, so a misrecognized word may be erroneously extracted as an important word.

The present invention has been made in view of these problems. Its object is to provide an in-utterance important word extraction device that automatically keeps the keyword list up to date and reduces the risk of extracting misrecognized words as important words, an in-utterance important word extraction system using the device, and methods and programs therefor.

The in-utterance important word extraction device of the present invention comprises a keyword list, a speech recognition unit, a word appearance frequency counting unit, a word registration unit, and an important word extraction unit. Keywords are registered in the keyword list. The speech recognition unit performs speech recognition on an input speech signal and generates and outputs a text sentence consisting of word-string and part-of-speech-string information segmented into morphemes. The word appearance frequency counting unit counts how often each word in the text sentence output by the speech recognition unit appears within a predetermined time, and outputs a recognition result with appearance frequency consisting of the word, its part of speech, and its appearance frequency. The word registration unit takes the recognition result with appearance frequency as input and registers in the keyword list, as a keyword, any word whose recognition result satisfies a predetermined condition. The important word extraction unit takes the recognition result with appearance frequency as input and extracts, as an important word, any word whose appearance frequency is at least a predetermined number and which matches one of the keywords registered in the keyword list.

The in-utterance important word extraction system of the present invention comprises a voice input terminal, an in-utterance important word extraction device, a network, and an appearance-frequency-added recognition result generation server. The voice input terminal outputs the speech signal picked up by a microphone to the in-utterance important word extraction device. The in-utterance important word extraction device records the speech signal as an audio file, cuts out from the recorded file only the portions in which a person is speaking, transmits the speech signal of each such speech segment together with its utterance start time information to the appearance-frequency-added recognition result generation server via the network, and extracts and outputs important words from the recognition result with appearance frequency received from that server. The appearance-frequency-added recognition result generation server receives the speech signal of the speech segment and the utterance start time information from the in-utterance important word extraction device, performs speech recognition on the speech signal to generate a text sentence consisting of word-string and part-of-speech-string information segmented into morphemes, counts how often each word in the text sentence appears within a predetermined time, generates a recognition result with appearance frequency consisting of the word, its part of speech, and its appearance frequency, and transmits it to the in-utterance important word extraction device.

According to the in-utterance important word extraction device of the present invention, only frequently occurring words, namely those that appear at least a predetermined number of times within a predetermined time, are extracted as important words. The keyword list that identifies important words is created automatically from the recognition results with appearance frequency, which saves the labor of creating the list. In addition, a word must be recognized at least the predetermined number of times within the predetermined time before it is extracted as an important word, which reduces the risk of treating a misrecognized word as important.

Furthermore, according to the in-utterance important word extraction system of the present invention, the comparatively heavy speech recognition processing is delegated to the appearance-frequency-added recognition result generation server over the network, so the configuration of the in-utterance important word extraction device can be simplified. As a result, the device can be made both less expensive and smaller.

FIG. 1 is a diagram showing an example of the functional configuration of the in-utterance important word extraction device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the in-utterance important word extraction device 100.
FIG. 3 is a diagram showing an example of a text sentence obtained by speech recognition of a speech signal.
FIG. 4 is a diagram showing the word string, part-of-speech string, and appearance frequencies of the text sentence shown in FIG. 3.
FIG. 5 is a diagram showing an example of the keyword list 140.
FIG. 6 is a diagram showing the system configuration of the in-utterance important word extraction system 200 of the present invention.
FIG. 7 is a diagram showing the operation sequence of the in-utterance important word extraction system 200.
FIG. 8 is a diagram showing the system configuration of the in-utterance important word extraction system 300 of the present invention.
FIG. 9 is a diagram showing the external appearance of the voice input terminal 210 and the in-utterance important word extraction device 320, together with an example of a usage scene.
FIG. 10 is a diagram showing the system configuration of the in-utterance important word extraction system 400 of the present invention.
FIG. 11 is a diagram showing an example of the functional configuration of the appearance-frequency-added recognition result generation server 440 that constitutes the in-utterance important word extraction system 400 of the present invention.
FIG. 12 is a diagram showing an example of the functional configuration of the in-utterance important word extraction device 420 that constitutes the in-utterance important word extraction system 400 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the in-utterance important word extraction device 100 of the present invention, and FIG. 2 shows its operation flow. The in-utterance important word extraction device 100 comprises a speech recognition unit 110, a word appearance frequency counting unit 120, a word registration unit 130, a keyword list 140, an important word extraction unit 150, and a control unit 160. The device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program. The same applies to the other embodiments described below.

The speech recognition unit 110 performs speech recognition on the input speech signal and generates and outputs a text sentence consisting of word-string and part-of-speech-string information segmented into morphemes (step S110). The input speech signal is, for example, a signal that has been converted into a discrete digital signal at a sampling frequency of 16 kHz. For each frame, where one frame is a predetermined number (for example, 320) of samples of the discretized speech signal, the speech recognition unit 110 obtains acoustic features by, for example, mel-frequency cepstral coefficient (MFCC) analysis, and outputs as the speech recognition result a text sentence consisting of the information of the morpheme string with the highest acoustic likelihood and language likelihood. A confidence score for each word may be output together with the text sentence. The confidence may be obtained by a method based on the posterior probability of the word among the N-best candidates, for example the method of Reference 1 (Frank Wessel, Ralf Schluter, Klaus Macherey and Hermann Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, March 2001).
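The framing figures given above can be checked with a short sketch. It assumes consecutive, non-overlapping frames, which the description does not specify (the frame shift is not stated), and the helper name `split_into_frames` is hypothetical. Feature extraction such as MFCC analysis is outside the scope of this sketch.

```python
SAMPLE_RATE_HZ = 16_000   # sampling frequency given in the description
FRAME_LENGTH = 320        # samples per frame given in the description


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Split samples into consecutive frames, dropping a trailing partial frame.

    Non-overlapping frames are an assumption made for illustration only.
    """
    frame_count = len(samples) // frame_length
    return [
        samples[i * frame_length:(i + 1) * frame_length]
        for i in range(frame_count)
    ]


# Duration of one frame in milliseconds: 320 samples at 16 kHz.
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE_HZ

# One second of silent placeholder audio, used only to count frames.
one_second = [0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
```

With these values one frame spans 20 ms, and one second of audio yields 50 frames under the non-overlapping assumption.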

The word appearance frequency counting unit 120 counts how often each word in the text sentence output by the speech recognition unit 110 appears within a predetermined time, and outputs a recognition result with appearance frequency consisting of the word, its part of speech, and its appearance frequency (step S120). This word appearance frequency counting process is described in detail with reference to FIG. 2.

The word appearance frequency counting unit 120 first assigns a time label to each word in the word string output by the speech recognition unit 110 (step S121). The time label represents the time at which the word appeared, and may instead be assigned by the speech recognition unit 110.

Next, the unit determines whether the time-labeled word has already appeared before (step S122). If the word is appearing for the first time (No in step S122), the appearance frequency is set to 1 and a recognition result with appearance frequency consisting of the word, its part of speech, and the appearance frequency (1) is output (step S125). If the word has appeared before (Yes in step S122), the unit determines whether it has appeared within a predetermined time of the time label of its previous occurrence (step S123). If so (Yes in step S123), the appearance frequency is incremented by 1 and a recognition result with appearance frequency consisting of the incremented frequency, the word, and its part of speech is output (step S124). If not (No in step S123), the appearance frequency is reset to 1 and output together with the word and its part of speech (step S125).

Because the word appearance frequency counting unit 120 operates in this way, the appearance frequency increases only for words that appear repeatedly within the predetermined time. The appearance frequency of a word that does not reappear within the predetermined time is reset to 1.
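The counting steps S121 to S125 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and field names are hypothetical, timestamps are assumed to be seconds, and words are assumed to be keyed by their surface form alone.

```python
from dataclasses import dataclass


@dataclass
class CountedWord:
    """One recognition result with appearance frequency."""
    word: str
    pos: str
    frequency: int


class WordFrequencyCounter:
    """Counts repeated words, resetting when the gap exceeds the window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._state = {}  # word -> (time of last occurrence, frequency)

    def observe(self, word, pos, timestamp):
        previous = self._state.get(word)            # S122: seen before?
        if (previous is not None
                and timestamp - previous[0] <= self.window_seconds):  # S123
            frequency = previous[1] + 1             # S124: increment
        else:
            frequency = 1                           # S125: first time or reset
        self._state[word] = (timestamp, frequency)
        return CountedWord(word, pos, frequency)


counter = WordFrequencyCounter(window_seconds=300)  # 5-minute window
first = counter.observe("権利", "名詞", timestamp=0)
second = counter.observe("権利", "名詞", timestamp=120)
late = counter.observe("権利", "名詞", timestamp=1000)
```

Here the second occurrence falls within the window and raises the frequency to 2, while the late occurrence arrives more than 300 seconds after the previous one and resets the frequency to 1.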

The relationship between words and appearance frequencies is explained with reference to the text sentence illustrated in FIG. 3. Suppose, for example, that the speaker recites Article 1 of the Copyright Act from memory; the speech recognition unit 110 then outputs the text sentence shown in FIG. 3. This text sentence consists of the word-string and part-of-speech-string information shown in FIG. 4.

In FIG. 4, the first column is the word, the second column is the part of speech, and the third column is the appearance frequency. These three columns form one set, within which time advances from top to bottom. The sequence continues from 「する」 in the bottom row of column 1 to 「権利」 in the top row of column 4, and from 「を」 in the bottom row of column 4 to 「図り」 in the top row of column 7.

The word of the first morpheme is 「この」 ("this"), its part of speech is adnominal (連体詞), and its appearance frequency is 1. The word of the next morpheme is 「法律」 ("law"), its part of speech is noun (名詞), and its appearance frequency is 1. The fourth morpheme is the punctuation mark 「、」, whose part of speech is "special" (特殊). The same punctuation mark is also the eighth morpheme, so its appearance frequency is incremented to 2 when the eighth morpheme is reached.

If the predetermined time is assumed to be, for example, 5 minutes, the span over which a single topic continues, the whole text sentence shown in FIG. 3 can be expected to be generated within those 5 minutes. The appearance frequency of any word generated more than once therefore keeps increasing without being reset. As a result, the most frequent morphemes are the special character (punctuation mark), with 7 occurrences, and the particle (助詞) 「の」, with 5 occurrences. Restricting attention to nouns, the frequent words are 「権利」 ("right"), with 3 occurrences, and 「著作者」 ("author"), with 2 occurrences. The word appearance frequency counting unit 120 outputs the recognition results with appearance frequency, each consisting of a word, its part of speech, and its appearance frequency, in time series in the order in which they are generated. In other words, the unit outputs the word, part-of-speech, and appearance-frequency triple shown in FIG. 4 for each morpheme.

The word registration unit 130 takes as input the recognition result with appearance frequency output by the word appearance frequency counting unit 120, and registers in the keyword list 140, as a keyword, any word whose recognition result satisfies a predetermined condition (step S130). If the predetermined condition is, for example, that the part of speech is noun and the appearance frequency is 1, the nouns shaded gray in FIG. 4 are registered in the keyword list 140 as 23 keywords, in order from the beginning of the text sentence: 「法律」, 「著作物」, 「実演」, …, 「寄与」 (step S132). Punctuation marks, particles, and conjunctions, which do not satisfy the predetermined condition, are not registered in the keyword list 140 (No in step S131). The confidence output by the speech recognition unit 110 may also be used here, so that only recognized words that satisfy the predetermined condition and have a confidence at or above a certain threshold are extracted as important words.
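The registration condition used in this example (part of speech is noun, appearance frequency is 1, optionally with a confidence threshold) can be sketched as below. The function name, the tuple layout, and the confidence values are assumptions made for illustration, and the keyword list is modeled as a plain Python list.

```python
def register_keywords(results, keyword_list, target_pos="名詞",
                      target_frequency=1, min_confidence=None):
    """Append words that satisfy the registration condition to keyword_list.

    results: iterable of (word, part of speech, frequency, confidence).
    min_confidence: optional threshold; None disables the confidence check.
    """
    for word, pos, frequency, confidence in results:
        if pos != target_pos or frequency != target_frequency:
            continue  # S131 No: condition not met, do not register
        if min_confidence is not None and confidence < min_confidence:
            continue  # optional confidence filter
        if word not in keyword_list:
            keyword_list.append(word)  # S132: register as a keyword
    return keyword_list


# First five morphemes of the example text; confidence values are made up.
results = [
    ("この", "連体詞", 1, 0.9),
    ("法律", "名詞", 1, 0.9),
    ("は", "助詞", 1, 0.9),
    ("、", "特殊", 1, 0.9),
    ("著作物", "名詞", 1, 0.9),
]
keywords = register_keywords(results, [])
```

Only the two nouns are registered, in the order in which they appear; the adnominal, the particle, and the punctuation mark are skipped.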

FIG. 5 shows an example of the keywords registered in the keyword list 140. At a minimum, the keyword list 140 need only hold the keyword words themselves, such as 「渋谷」 (Shibuya) and 「初台」 (Hatsudai). In addition to the words, an identifier (ID) that identifies each keyword may be assigned, as shown in the first column of FIG. 5. The "registration trigger", "registration date and time", and "last use date and time" of each keyword may also be registered. The "registration trigger" is information indicating the type of the keyword. "Initial registration" denotes a keyword that the in-utterance important word extraction device 100 holds in advance and that is never deleted. "Personal utterance frequency" denotes a keyword registered from the utterance information of a particular date and time, which may be deleted after a predetermined time has elapsed. "Common utterance frequency" is a keyword type used in the in-utterance important word extraction system 400 of the present invention (FIGS. 10 to 12), for which a specific example is given later.
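One possible in-memory representation of a keyword list entry with the fields named above is sketched here. The class names, field names, and the sample IDs and dates are hypothetical and are not taken from FIG. 5; only the three registration trigger types and the two sample words come from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class RegistrationTrigger(Enum):
    """Keyword types named in the description."""
    INITIAL = "初期登録"       # held in advance, never deleted
    PERSONAL = "個人発話頻度"  # registered from the user's own utterances
    COMMON = "共通発話頻度"    # used in system 400


@dataclass
class KeywordEntry:
    keyword_id: int
    word: str
    trigger: RegistrationTrigger
    registered_at: datetime
    last_used_at: datetime


# Illustrative entries; the IDs and timestamps are made-up sample values.
keyword_list = [
    KeywordEntry(1, "渋谷", RegistrationTrigger.INITIAL,
                 datetime(2013, 11, 1, 9, 0), datetime(2013, 11, 20, 9, 0)),
    KeywordEntry(2, "初台", RegistrationTrigger.PERSONAL,
                 datetime(2013, 11, 20, 10, 0), datetime(2013, 11, 20, 10, 5)),
]
registered_words = {entry.word for entry in keyword_list}
```

Keeping the trigger and both timestamps on each entry is what allows keywords to be expired individually by type.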

Keywords may be reset after a predetermined time has elapsed, measured from either the "registration date and time" or the "last use date and time". When keywords are to be reset, the in-utterance important word extraction device 100 includes the keyword list reset unit 170 indicated by the broken line in FIG. 1.

The keyword list reset unit 170 resets keywords individually once a predetermined time has elapsed since, for example, the "last use date and time". Several periods are conceivable for this predetermined time, such as 24 hours, 2 days, 3 days, or one week, and the period may be varied according to the keyword type by associating it with, for example, the "registration trigger" information.

Alternatively, the keyword list 140 may be reset by an externally supplied reset signal that is linked to, for example, a user operation instructing the start of recording. In this case the predetermined time is the interval between user operations and is therefore irregular. The keyword list reset unit 170 may also be configured to output a reset signal at a predetermined time period, in which case all keywords registered in the keyword list 140 are erased at once at predetermined intervals. Resetting the keyword list 140 at predetermined times in this way keeps the list composed of the latest keywords.
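The per-keyword reset based on the last use time can be sketched as follows. The retention periods chosen per registration trigger (24 hours and one week) are picked from the example periods in the description but their assignment to types is an assumption, as are the function name and the dictionary layout. Keywords whose trigger has no retention period, such as initial registrations, are kept indefinitely.

```python
from datetime import datetime, timedelta

# Assumed retention period per registration trigger (illustrative values).
RETENTION = {
    "個人発話頻度": timedelta(hours=24),
    "共通発話頻度": timedelta(days=7),
}


def expire_keywords(keywords, now, retention=RETENTION):
    """Return the keywords that have not yet expired.

    keywords: dict mapping word -> (registration trigger, last use time).
    Triggers absent from `retention` never expire.
    """
    kept = {}
    for word, (trigger, last_used) in keywords.items():
        limit = retention.get(trigger)
        if limit is None or now - last_used < limit:
            kept[word] = (trigger, last_used)
    return kept


now = datetime(2013, 11, 20, 12, 0)
keywords = {
    "渋谷": ("初期登録", datetime(2013, 1, 1, 0, 0)),          # never expires
    "権利": ("個人発話頻度", datetime(2013, 11, 19, 11, 0)),   # 25 hours ago
    "著作者": ("個人発話頻度", datetime(2013, 11, 20, 9, 0)),  # 3 hours ago
}
remaining = expire_keywords(keywords, now)
```

In this sample, 「権利」 is removed because it was last used 25 hours ago, while the other two keywords remain.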

The important word extraction unit 150 takes as input the recognition result with appearance frequency output by the word appearance frequency counting unit 120, and extracts as an important word any word whose appearance frequency is at least a predetermined number and which matches a keyword (step S150). Suppose, for example, that the predetermined number is set to 2. When the text sentence has been generated up to 「この法律は、著作物並びに実演、レコード、放送及び有線放送に関し著作者の権利及びこれに隣接する権利」, the word 「権利」 ("right"), which matches a keyword (step S151), has appeared twice in the text sentence. At this point (step S152), the word 「権利」 is extracted as an important word (step S153). Later, when the text sentence has been generated up to 「…を定め、これらの文化的所産の公正な利用に留意しつつ、著作者」, the word 「著作者」 ("author"), which has now appeared twice, is extracted as an important word (step S153). The predetermined number set in the important word extraction unit 150 may differ from the one used in the word registration unit 130. The confidence output by the speech recognition unit 110 may also be used here, so that only recognized words that satisfy the predetermined condition and have a confidence at or above a certain threshold are extracted as important words.
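The extraction rule of step S150 (appearance frequency at or above a threshold, and a match in the keyword list) can be sketched as below. The function name and the abbreviated input stream are hypothetical, and reporting each important word only once is an assumption of this sketch.

```python
def extract_important_words(results, keywords, min_frequency=2):
    """Return words that reach min_frequency and match a registered keyword.

    results: iterable of (word, part of speech, frequency) in time order.
    Each important word is reported once, when it first qualifies.
    """
    important = []
    for word, _pos, frequency in results:
        if frequency >= min_frequency and word in keywords:  # S151, S152
            if word not in important:
                important.append(word)                       # S153
    return important


# Abbreviated stream: only a few morphemes of the example text are shown.
results = [
    ("権利", "名詞", 1),
    ("権利", "名詞", 2),
    ("著作者", "名詞", 1),
    ("の", "助詞", 5),
    ("著作者", "名詞", 2),
    ("権利", "名詞", 3),
]
keywords = {"法律", "権利", "著作者"}
important = extract_important_words(results, keywords)
```

The particle 「の」 has a high frequency but is not a keyword, so it is ignored; 「権利」 and 「著作者」 are extracted when they reach a frequency of 2.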

The speech recognition process of step S110, the word appearance frequency counting process of step S120, the word registration process of step S130, and the important word extraction process of step S150 are repeated until the speech signal ends (No in step S160). In this example, two important words are extracted: 「権利」 and 「著作者」. This repetition is controlled by the control unit 160. The control unit 160 is an ordinary unit that controls the time-series operation of each part of the in-utterance important word extraction device 100 and performs no special processing. The same applies to the other embodiments, and the description of the control unit is omitted hereafter.

The processing order of the word registration unit 130 and the important word extraction unit 150 may be reversed. When the word registration unit 130 runs second, setting the threshold of the important word extraction unit 150 to 1 or more gives the same processing as the example above, in which a word is extracted as an important word when it appears for the second time: on its first appearance the word has not yet been registered as a keyword when the extraction unit runs. In other words, even if the order of the two units is swapped, the same behavior can be obtained by choosing the predetermined condition of the word registration unit 130 and the predetermined number of the important word extraction unit 150 appropriately.

When the in-utterance important word extraction device 100 operates as described above, only frequently occurring words that appear at least a predetermined number of times within a predetermined time are extracted as important words, and the keyword list used to identify important words is created automatically from the recognition results with appearance frequency. In addition, a word must be recognized at least the predetermined number of times within the predetermined time before it is extracted as an important word, which reduces the risk of treating a misrecognized word as important.

The embodiment above was described with an example in which only nouns are registered in the keyword list 140 as keywords, but the invention is not limited to this example. For instance, a word string in which two or more parts of speech are concatenated may be registered as a keyword. Examples include the word string 「文化的所産」 ("cultural product"), in which the adjectival noun (形容動詞) 「文化的」 is followed by the noun 「所産」, and a passage of the text sentence formed by concatenating three or more words, such as 「文化の発展に寄与する」 ("contribute to the development of culture"). Such strings may be registered as keywords and extracted as important words. Increasing the combinations of parts of speech that make up a keyword allows important words to be specified more narrowly.

In the example described above, suppose that the interval between the first and second occurrences of the same word is 2 minutes and that the third occurrence comes 4 minutes after the second. The flowchart of FIG. 2 judges whether the interval since the most recent occurrence of the same word is within the predetermined time, so the appearance frequency of the word becomes 3. The appearance frequency may instead be counted in other ways. For example, the number of occurrences within the most recent predetermined time (for example, 5 minutes) may be counted. In that case the first occurrence lies more than 5 minutes in the past, so the appearance frequency is 2, covering only the two most recent occurrences. The method of counting the appearance frequency of words is thus not limited to the example shown in FIG. 2.
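The two counting methods compared above can be reproduced with the timings given in the text (occurrences at 0, 2, and 6 minutes, with a 5-minute predetermined time). The function names are hypothetical, and treating an interval exactly equal to the predetermined time as "within" it is an assumption of this sketch.

```python
def chained_count(timestamps, window):
    """Count as in FIG. 2: compare each occurrence with the previous one."""
    count = 0
    previous = None
    for t in timestamps:
        if previous is not None and t - previous <= window:
            count += 1   # within the window of the previous occurrence
        else:
            count = 1    # first occurrence, or gap too long: reset
        previous = t
    return count


def sliding_window_count(timestamps, window):
    """Count occurrences within `window` before the latest occurrence."""
    if not timestamps:
        return 0
    latest = timestamps[-1]
    return sum(1 for t in timestamps if latest - t <= window)


occurrences_min = [0, 2, 6]   # first, +2 minutes, then +4 minutes
chained = chained_count(occurrences_min, window=5)
sliding = sliding_window_count(occurrences_min, window=5)
```

The chained method gives 3 because every gap (2 and 4 minutes) is within 5 minutes, while the sliding-window method gives 2 because the first occurrence is 6 minutes before the latest one.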

[In-utterance important word extraction system]
FIG. 6 shows the system configuration of the in-utterance important word extraction system 200, which includes the in-utterance important word extraction device 220 of the present invention. The in-utterance important word extraction system 200 comprises a voice input terminal 210, the in-utterance important word extraction device 220, a network 230, and an appearance-frequency-added recognition result generation server 240. The operation of the in-utterance important word extraction system 200 is described with reference also to FIG. 7.

The voice input terminal 210 outputs a voice signal picked up (step S210) by a microphone (not shown) to the in-utterance important word extraction device 220 (step S211). Because the microphone continuously records the user's speech, it is preferably small enough to be worn, for example, on the user's chest. The microphone and the voice input terminal may be integrated into a single unit, which may take the form of, for example, a tie pin.

As described above, the voice signal has been converted into a discrete digital signal at a sampling frequency of, for example, 16 kHz, and is output to the in-utterance important word extraction device 220 by wire or wirelessly. For wireless transmission, a short-range wireless technology known as a wireless PAN (Personal Area Network), which covers distances of up to about 10 m, such as Bluetooth (registered trademark), or a wireless LAN can be used.

[In-utterance important word extraction device]
The in-utterance important word extraction device 220 records the voice signal output by the voice input terminal 210 as a voice file. It transmits the voice signal of each voice segment, obtained by cutting out only the portions of the recorded voice file in which a person is speaking, together with its utterance start time information, to the recognition result generation server 240 with appearance frequency via the network 230. It then extracts and outputs important words from the recognition result with appearance frequency received from that server 240. The voice segments can be extracted by, for example, the method described in Reference 2 (Japanese Patent Laid-Open No. 2012-48119).

The operation of the in-utterance important word extraction device 220 is described in more detail with reference to FIG. 6. The device 220 comprises a voice recording unit 221, a voice segment extraction unit 222, a voice sending unit 223, a recognition result receiving unit 224 with appearance frequency, a word registration unit 225, an important word extraction unit 226, a keyword list 227, and an important word display unit 228.

The voice recording unit 221 records the voice signal sent from the voice input terminal 210 as a voice file (step S221). At this time, the recording start time, that is, the time at which recording of the voice signal began, is also recorded. The voice signal and the recording start time are stored in a storage unit, such as RAM, of the computer that constitutes the in-utterance important word extraction device 220.

The voice segment extraction unit 222 creates a voice file by cutting out only the portions in which a person is speaking from the voice file recorded by the voice recording unit 221, and outputs that voice file to the voice sending unit 223 (step S222). At the same time, it outputs the utterance start time information of the cut-out voice file to the voice sending unit 223. The utterance start time information is obtained by adding, to the recording start time of the voice signal, the time elapsed from the recording start time to the beginning of the cut-out voice. Methods for cutting out only the voice portions are well known, as noted above.
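The addition described above can be written out as a small sketch. The function name and the choice to express the offset in samples are assumptions for illustration; the 16 kHz sampling frequency is the example value given earlier in the description.

```python
from datetime import datetime, timedelta


def utterance_start_time(
    recording_start: datetime,
    offset_samples: int,
    sample_rate_hz: int = 16_000,
) -> datetime:
    """Recording start time plus the elapsed time to the start of the segment.

    offset_samples is the (hypothetical) sample index at which the cut-out
    voice segment begins within the recorded voice file.
    """
    elapsed = timedelta(seconds=offset_samples / sample_rate_hz)
    return recording_start + elapsed


# A segment that begins 1,448,000 samples (90.5 s at 16 kHz) into the recording.
start = utterance_start_time(datetime(2013, 11, 20, 9, 0, 0), 1_448_000)
```

Here `start` is 09:01:30.5 on the same day, that is, the recording start time advanced by 90.5 seconds.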

The voice sending unit 223 transmits the voice signal of the voice segment extracted by the voice segment extraction unit 222, together with the utterance start time information, to the recognition result generation server 240 with appearance frequency via the network 230 (step S223). The recognition result receiving unit 224 with appearance frequency receives the recognition result with appearance frequency sent from the server 240 via the network 230 (step S224). This recognition result with appearance frequency is the same as that output by the word appearance frequency counting unit 120 (FIG. 1).

The word registration unit 225 registers, as keywords in the keyword list 227, those words in the recognition result with appearance frequency received by the receiving unit 224 that satisfy predetermined conditions (step S225). The predetermined conditions are that the word appears within a predetermined time, that its number of appearances is equal to or greater than a predetermined number, and that it has a predetermined part of speech.

Whether a word falls within the predetermined time can be determined from the time label that the word appearance frequency counting unit 243 of the recognition result generation server 240 with appearance frequency assigns to each word with reference to the utterance start time information.
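The three registration conditions can be sketched as a single filter. Everything concrete here is an assumption for illustration: the `CountedWord` record, the `span_seconds` field (taken to be the time between the first and latest counted occurrence, derived from the time labels), and the default thresholds of 300 seconds, 3 appearances, and nouns only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class CountedWord:
    """One entry of a recognition result with appearance frequency (hypothetical)."""
    word: str
    part_of_speech: str
    count: int
    span_seconds: float  # time covered by the counted occurrences (assumption)


def register_keywords(
    results: Iterable[CountedWord],
    keyword_list: Set[str],
    max_span_seconds: float = 300.0,
    min_count: int = 3,
    allowed_pos: FrozenSet[str] = frozenset({"noun"}),
) -> Set[str]:
    """Add to keyword_list every word that meets all three conditions."""
    for entry in results:
        within_time = entry.span_seconds <= max_span_seconds
        frequent = entry.count >= min_count
        right_pos = entry.part_of_speech in allowed_pos
        if within_time and frequent and right_pos:
            keyword_list.add(entry.word)
    return keyword_list


sample = [
    CountedWord("budget", "noun", 3, 120.0),    # meets all three conditions
    CountedWord("well", "adverb", 5, 60.0),     # wrong part of speech
    CountedWord("deadline", "noun", 2, 30.0),   # too few appearances
    CountedWord("meeting", "noun", 4, 900.0),   # outside the time limit
]
registered = register_keywords(sample, set())
```

Only "budget" is registered in this example, since each of the other entries fails exactly one condition.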

The important word extraction unit 226 takes as input the recognition result with appearance frequency received by the receiving unit 224, and extracts as important words those words whose appearance frequency is equal to or greater than a predetermined number and that match a keyword (step S226). The word registration unit 225, the important word extraction unit 226, and the keyword list 227 carry different reference numerals because they belong to a different device, but they are the same as the functional units of the same names described for the in-utterance important word extraction device 100 (FIG. 1).
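The extraction step combines a frequency check with a lookup in the keyword list, which might be sketched as below. The tuple layout `(word, part_of_speech, count)` and the default threshold of 2 are assumptions chosen for the example.

```python
from typing import Iterable, List, Set, Tuple


def extract_important_words(
    results: Iterable[Tuple[str, str, int]],
    keyword_list: Set[str],
    min_count: int = 2,
) -> List[str]:
    """Return words that appear at least min_count times and match a keyword.

    Each result is a hypothetical (word, part_of_speech, count) tuple.
    """
    return [
        word
        for word, _part_of_speech, count in results
        if count >= min_count and word in keyword_list
    ]


recognized = [("budget", "noun", 3), ("lunch", "noun", 4), ("deadline", "noun", 1)]
important = extract_important_words(recognized, {"budget", "deadline"})
```

In this example "lunch" is frequent but not a registered keyword, and "deadline" is a keyword but appears only once, so only "budget" is extracted.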

The important word display unit 228 displays the important words extracted by the important word extraction unit 226 (step S228). The important words are displayed on a display means (not shown), such as a liquid crystal panel, provided in the in-utterance important word extraction device 220.

Alternatively, the word appearance frequency counting unit 243 need not assign a time label to each word; the utterance start time information may instead be sent unchanged, together with the recognition result with appearance frequency, to the in-utterance important word extraction device. In that case, the utterance start time information output by the voice segment extraction unit 222 of the device 220 is passed in turn through the voice receiving unit 241, the voice recognition unit 242, the word appearance frequency counting unit 243, and the recognition result transmission unit 244 with appearance frequency of the server 240, and then back to the recognition result receiving unit 224 with appearance frequency of the device 220. The word registration unit 225 then uses the utterance start time information to determine whether a word falls within the predetermined time.

[Recognition result generation server with appearance frequency]
As shown in FIG. 6, the recognition result generation server 240 with appearance frequency comprises a voice receiving unit 241, a voice recognition unit 242, a word appearance frequency counting unit 243, and a recognition result transmission unit 244 with appearance frequency. The voice receiving unit 241 receives the voice signal and the utterance start time information transmitted from the in-utterance important word extraction device 220 via the network 230 (step S241).

The voice recognition unit 242 performs voice recognition processing on the voice signal received by the voice receiving unit 241, and generates and outputs a text sentence consisting of a word string divided into morphemes and the corresponding part-of-speech string (step S242). The word appearance frequency counting unit 243 counts how often each word of the text sentence output by the voice recognition unit 242 appears within a predetermined time, and outputs a recognition result with appearance frequency consisting of each word, its part of speech, and its appearance frequency (step S243). Like the word appearance frequency counting unit 120 (FIG. 1), the word appearance frequency counting unit 243 assigns a time label to each word of the word string output by the voice recognition unit 242. The time label is a time measured with reference to the utterance start time information received together with the voice signal to be recognized.

The recognition result transmission unit 244 with appearance frequency transmits the recognition result with appearance frequency output by the word appearance frequency counting unit 243 to the in-utterance important word extraction device 220 via the network 230 (step S244). The voice recognition unit 242 in the recognition result generation server 240 with appearance frequency differs from the voice recognition unit 110 of the in-utterance important word extraction device 100 (FIG. 1) only in that time labels are assigned with reference to the utterance start time information. The word appearance frequency counting unit 243 is identical to the word appearance frequency counting unit 120 of the in-utterance important word extraction device 100 (FIG. 1).

As noted above, the word appearance frequency counting unit 243 need not assign a time label to each word; the utterance start time information may instead be used by the word registration unit 225 of the in-utterance important word extraction device 220 when evaluating the predetermined conditions.

In the in-utterance important word extraction system 200 described above, the comparatively heavy voice recognition processing is delegated to the recognition result generation server 240 with appearance frequency. This simplifies the configuration of the in-utterance important word extraction device 220 while retaining the same effects as the in-utterance important word extraction device 100. In addition, by increasing the CPU power of the computer that implements the functions of the server 240, voice recognition can be performed faster and more accurately than on the in-utterance important word extraction device 100.

FIG. 8 shows the system configuration of an in-utterance important word extraction system 300 of the present invention. The system 300 differs from the in-utterance important word extraction system 200 only in the configuration of its in-utterance important word extraction device 320.

The in-utterance important word extraction device 320 differs from the in-utterance important word extraction device 220 (FIG. 6) in that it includes an action history display unit 321. The action history display unit 321 displays an action history in which each important word extracted by the important word extraction unit 226 is paired with the time zone in which it was extracted. The action history is displayed on a display means (not shown), such as a liquid crystal panel, provided in the device 320.

FIG. 9 shows an example of the external appearance of the voice input terminal 210 and the in-utterance important word extraction device 320, together with a usage scene. The voice input terminal 210 is the tie-pin type mentioned above. The display means of the device 320 shows the important words grouped by the time zone in which each was extracted. Displaying the important words with their time zones in this way lets the user review the day's action history.
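Grouping extracted important words by time zone for display could look like the following sketch. The one-hour slot width, the slot label format, and the function name are assumptions; the text does not specify how a time zone is delimited.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def build_action_history(
    extractions: Iterable[Tuple[datetime, str]],
) -> Dict[str, List[str]]:
    """Group important words by the hour in which they were extracted.

    One-hour slots are an assumption made for this illustration.
    """
    history: Dict[str, List[str]] = defaultdict(list)
    for extracted_at, word in extractions:
        hour = extracted_at.hour
        slot = f"{hour:02d}:00-{(hour + 1) % 24:02d}:00"
        if word not in history[slot]:  # show each word once per slot
            history[slot].append(word)
    return dict(sorted(history.items()))


log = [
    (datetime(2013, 11, 20, 9, 5), "meeting"),
    (datetime(2013, 11, 20, 9, 40), "budget"),
    (datetime(2013, 11, 20, 9, 50), "meeting"),
    (datetime(2013, 11, 20, 14, 10), "deadline"),
]
history = build_action_history(log)
```

For this log the history has two slots: "09:00-10:00" with "meeting" and "budget", and "14:00-15:00" with "deadline".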

When the user taps an important word or a time zone on the display means with a fingertip, the voice recorded by the voice recording unit 221 during that time zone may be played back through a speaker (not shown). To play back the voice corresponding to the action history, the in-utterance important word extraction device 320 further includes an action history selection input unit 322 and a recorded data selection playback unit 323. In response to the user's fingertip tapping the display means to select an entry in the action history, the action history selection input unit 322 outputs a signal corresponding to the selected important word or time zone to the recorded data selection playback unit 323.

The recorded data selection playback unit 323 reads out the voice file recorded by the voice recording unit 221 on the basis of the signal, output by the action history selection input unit 322, that corresponds to the important word or time zone, and plays it back through the speaker. Making the voice file for an important word or time zone audible allows the user to look back on the action history in detail.

FIG. 10 shows the system configuration of an in-utterance important word extraction system 400 of the present invention. The system 400 consists of a plurality of in-utterance important word extraction devices 420_1, 420_2, ..., 420_n connected to the network 230, and a recognition result generation server 440 with appearance frequency. FIG. 11 shows an example of the functional configuration of the server 440. The server 440 differs from the recognition result generation server 240 with appearance frequency (FIG. 6) in two respects: it operates separately on each of the voice signals transmitted from the plurality of in-utterance important word extraction devices, and it extracts common keywords shared among the plurality of text sentences generated from those transmitted signals.

The recognition result generation server 440 with appearance frequency comprises a voice receiving unit 441, a plurality of voice recognition units 442_1 to 442_n, a plurality of word appearance frequency counting units 443_1 to 443_n, a recognition result transmission unit 444 with appearance frequency, a common word appearance frequency counting unit 445, a common word registration unit 446, a common keyword list 447, and a common keyword transmission unit 448. The voice receiving unit 441 differs from the voice receiving unit 241 (FIG. 6), which serves a single in-utterance important word extraction device, in that it receives the voice signals transmitted from the plurality of in-utterance important word extraction devices, attaches an identifier to each so that they can be told apart, and outputs them to the plurality of voice recognition units 442_1 to 442_n.

The voice recognition units 442_* (*: 1 to n) and the word appearance frequency counting units 443_* differ only in that each operates on the signal with its corresponding identifier; each performs the same processing as the voice recognition unit 242 and the word appearance frequency counting unit 243, which serve a single in-utterance important word extraction device. That is, the voice signal received from the first in-utterance important word extraction device 420_1 is recognized by, for example, the voice recognition unit 442_1, and its word appearance frequencies are counted by the word appearance frequency counting unit 443_1. Likewise, the voice signal received from the n-th in-utterance important word extraction device 420_n is recognized by, for example, the voice recognition unit 442_n, and its word appearance frequencies are counted by the word appearance frequency counting unit 443_n.

The recognition result transmission unit 444 with appearance frequency transmits the recognition results with appearance frequency, each consisting of a word counted by the word appearance frequency counting units 443_1 to 443_n, its part of speech, and its appearance frequency, and each tagged with an identifier, to the plurality of in-utterance important word extraction devices 420_* via the network 230.

The common word appearance frequency counting unit 445 counts the common appearance frequency of words that appear in common, within a predetermined time, across the plurality of text sentences output by the voice recognition units 442_*, and outputs a common recognition result with appearance frequency consisting of each word, its part of speech, and its common appearance frequency. Alternatively, a threshold may be set on the average number of appearances per user within a predetermined time in the text sentences obtained by recognizing all users' voice files, and the words that appear at least that many times may be extracted, together with their parts of speech, as the common recognition result with appearance frequency.
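The per-user average variant can be sketched as follows. The input format (one word-count mapping per user, already restricted to the predetermined time), the function name, and the example threshold are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence


def common_words_by_average(
    per_user_counts: Sequence[Mapping[str, int]],
    min_average: float,
) -> Dict[str, float]:
    """Return words whose average appearances per user reach min_average.

    per_user_counts holds one word-count mapping per user (hypothetical
    format), each covering the same predetermined time.
    """
    if not per_user_counts:
        return {}
    totals: Counter = Counter()
    for counts in per_user_counts:
        totals.update(counts)
    users = len(per_user_counts)
    return {
        word: total / users
        for word, total in totals.items()
        if total / users >= min_average
    }


counts_by_user = [
    {"budget": 4, "lunch": 1},
    {"budget": 2},
    {"budget": 3, "lunch": 1},
]
common = common_words_by_average(counts_by_user, min_average=2.0)
```

Here "budget" averages 3 appearances per user and is kept, while "lunch" averages fewer than one and is dropped.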

The common word registration unit 446 takes as input the common recognition result with appearance frequency extracted by the common word appearance frequency counting unit 445, and registers the words in it that satisfy predetermined conditions as common keywords in the common keyword list 447. The common keyword transmission unit 448 outputs the common keywords registered in the common keyword list 447 to all of the in-utterance important word extraction devices 420.

FIG. 12 shows an example of the functional configuration of an in-utterance important word extraction device 420 that forms part of the in-utterance important word extraction system 400. The device 420 differs from the in-utterance important word extraction device 220 (FIG. 6) only in that it includes a common keyword receiving unit 421.

The common keyword receiving unit 421 receives the common keywords transmitted from the recognition result generation server 440 with appearance frequency via the network 230 and registers them in the keyword list 227. The "registration trigger" of a common keyword is recorded as "common utterance frequency". The in-utterance important word extraction device 420 can therefore also extract, as important words, words that match a common keyword shared among the plurality of in-utterance important word extraction devices.

The in-utterance important word extraction device of the present invention described above, and the in-utterance important word extraction system using it, rest on the observation that important words appear frequently in utterances. A word that appears in the voice recognition results a predetermined number of times or more within a predetermined time is registered as a keyword, and a word that matches a keyword is extracted as an important word when it appears in the voice recognition results. Because the keyword list is kept up to date automatically, its maintenance cost can be eliminated. Furthermore, a word must be output as a voice recognition result a predetermined number of times or more within a predetermined time before it is extracted as an important word. Even when speech that is prone to recognition errors, such as conversational speech, happens to be misrecognized as a word in the keyword list, this reduces the risk of inadvertently extracting a word that was never spoken.

With these advantages, the in-utterance important word extraction device of the present invention makes it possible to create personal activity histories such as life logs, as well as memoranda, simply by recording speech continuously, which offers the user a high degree of convenience. In addition to the effects obtained with a single in-utterance important word extraction device, the in-utterance important word extraction system using such devices can also extract important words that appear in common across a plurality of in-utterance important word extraction devices.

When the processing means of the above devices are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer implements the processing means of each device on the computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

A keyword list with registered keywords,
A voice recognition unit that generates and outputs a text sentence composed of information on a word string and a part of speech string obtained by performing voice recognition processing on the input voice signal;
A word appearance frequency counting unit that counts the appearance frequency of words in the text sentence within a predetermined time and outputs a recognition result with the appearance frequency composed of the word, its part of speech, and its appearance frequency;
A word registering unit for registering the word of the recognition result with appearance frequency satisfying a predetermined condition as a keyword in the keyword list, using the recognition result with appearance frequency as an input;
An important word extraction unit that takes the recognition result with appearance frequency as an input and extracts, as important words, words whose appearance frequency is equal to or greater than a predetermined number of times and that match any of the keywords registered in the keyword list;
An apparatus for extracting important words in an utterance.

A voice input terminal,
An apparatus for extracting important words in an utterance connected to the voice input terminal;
Network,
A recognition result generation server with an appearance frequency that communicates with the key word extraction device in the utterance via the network;
An important word extraction system in an utterance comprising
The voice input terminal outputs a voice signal picked up by a microphone to the important word extraction device in the utterance,
The important word extraction device in the utterance records the voice signal as a voice file, transmits the voice signal of a voice section, obtained by cutting out only the voice part uttered by a person from the recorded voice file, together with its utterance start time information to the recognition result generation server with appearance frequency via the network, registers in a keyword list, as keywords, the words that satisfy a predetermined condition among the recognition results with appearance frequency received from the recognition result generation server with appearance frequency, and extracts and outputs words whose appearance frequency is equal to or greater than a predetermined number of times and that match any of the keywords registered in the keyword list,
The recognition result generation server with appearance frequency receives the voice signal of the voice section and the utterance start time information from the important word extraction device in the utterance, performs voice recognition processing on the voice signal of the voice section to generate a text sentence composed of information on a word string divided into morphemes and a part-of-speech string, counts the appearance frequency of each word in the text sentence within a predetermined time, and transmits a recognition result with appearance frequency composed of the word, its part of speech, and its appearance frequency to the important word extraction device in the utterance,
An important word extraction system in the utterance.

An apparatus for extracting important words in an utterance used in the important word extracting system in an utterance according to claim 2,
A keyword list with registered keywords,
An audio recording unit that records audio signals sent from the audio input terminal;
A voice segment extraction unit that extracts a voice signal of a voice segment obtained by cutting out only a voice part uttered by a person from the voice file recorded by the audio recording unit, and attaches to the voice segment utterance start time information indicating its start time;
A voice sending unit that sends the voice signal of the voice segment extracted by the voice segment extraction unit and the utterance start time information to the recognition result generation server with appearance frequency via the network;
A recognition result receiving unit with an appearance frequency for receiving a recognition result with an appearance frequency sent from the recognition result generating server with the appearance frequency via the network;
A word registration unit for registering the word of the recognition result with the appearance frequency satisfying a predetermined condition as a keyword in the keyword list, using the recognition result with the appearance frequency received by the recognition result with appearance frequency receiving unit as an input;
An important word extraction unit that takes the recognition result with appearance frequency as an input and extracts, as important words, words whose appearance frequency is equal to or greater than a predetermined number of times and that match any of the keywords registered in the keyword list;
An apparatus for extracting important words in an utterance.
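
The word registration unit and the important word extraction unit can be sketched as two small functions. This is only an illustration: the function names are hypothetical, and because the claim does not spell out the "predetermined condition", it is passed in as a predicate. The example predicate (nouns appearing at least 3 times) and the threshold of 2 are assumptions chosen for the demonstration.

```python
def register_words(result, keyword_list, should_register):
    """Add every word that satisfies the registration condition to keyword_list."""
    for word, pos, freq in result:
        if should_register(word, pos, freq):
            keyword_list.add(word)


def extract_important_words(result, keyword_list, min_count):
    """Return words that appear at least min_count times and are registered keywords."""
    return [
        word
        for word, pos, freq in result
        if freq >= min_count and word in keyword_list
    ]


# Recognition result with appearance frequency: (word, part of speech, frequency).
result = [("meeting", "noun", 4), ("go", "verb", 5), ("budget", "noun", 1)]
keywords = {"budget"}  # keyword list with one keyword registered in advance

# Assumed condition: register nouns that appeared at least 3 times.
register_words(result, keywords, lambda w, p, f: p == "noun" and f >= 3)
important = extract_important_words(result, keywords, min_count=2)
```

In this example "meeting" is newly registered and then extracted, "go" is frequent but never registered, and "budget" is registered but falls below the frequency threshold.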

The important word extraction device in an utterance according to claim 1 or 3, further comprising an action history display unit that displays an action history including the important word and the time zone in which the important word was extracted.
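
One way to picture the action history is as a mapping from time zones to the important words extracted in them. The sketch below is an assumption-laden illustration: `build_action_history`, `format_action_history`, and the one-hour slot are hypothetical, and the claim does not prescribe a display format.

```python
from collections import defaultdict
from datetime import datetime


def build_action_history(extractions, slot_minutes=60):
    """Group important words by time slot.

    extractions: iterable of (datetime, word) pairs.
    slot_minutes: slot length; assumed to divide 60 evenly.
    """
    history = defaultdict(list)
    for when, word in extractions:
        minute = (when.minute // slot_minutes) * slot_minutes
        slot = when.replace(minute=minute, second=0, microsecond=0)
        if word not in history[slot]:   # list each word once per slot
            history[slot].append(word)
    return dict(sorted(history.items()))


def format_action_history(history):
    """Render one line per time slot: start of the slot, then its words."""
    return "\n".join(
        f"{slot:%Y-%m-%d %H:%M}  {', '.join(words)}"
        for slot, words in history.items()
    )
```

A display unit could then print `format_action_history(build_action_history(extractions))` to show which important words occurred in each time zone.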

A speech recognition process for generating and outputting a text sentence composed of information of a word sequence and a part of speech sequence obtained by performing speech recognition processing on an input speech signal;
A word appearance frequency counting process of counting the appearance frequency of words in the text sentence within a predetermined time and outputting a recognition result with the appearance frequency comprising the word, its part of speech, and its appearance frequency;
A word registration process of taking the recognition result with appearance frequency as input and registering, as a keyword in the keyword list, a word of the recognition result with appearance frequency that satisfies a predetermined condition;
An important word extraction process of taking the recognition result with appearance frequency as input and extracting a word whose appearance frequency is equal to or greater than a predetermined number of times and that matches any of the keywords registered in the keyword list;
A method for extracting important words in an utterance.

A voice signal output process in which a voice input terminal outputs a voice signal picked up by a microphone to a key word extraction device in an utterance;
A voice segment output process in which the utterance key word extraction device records the voice signal, extracts a voice section from the voice file of the recorded voice signal, and transmits the voice signal of the voice section and the utterance start time information via the network to the recognition result generation server with appearance frequency;
A recognition result transmission process with appearance frequency in which the recognition result generation server with appearance frequency performs speech recognition on the voice signal of the voice section received in the voice segment output process, counts the appearance frequency of each word of the resulting text sentence within a predetermined time, generates a recognition result with appearance frequency, and transmits the recognition result with appearance frequency and the utterance start time information to the utterance key word extraction device;
An important word extraction process in which the utterance key word extraction device registers, as keywords in the keyword list, the words satisfying a predetermined condition among the recognition results with appearance frequency received in the recognition result transmission process with appearance frequency, and extracts and outputs a word whose appearance frequency is equal to or greater than a predetermined number of times and that matches any of the keywords registered in the keyword list;
A method for extracting important words in utterances.
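
The exchange between the device and the server in this method can be sketched with two message types. Everything here is hypothetical and simplified: `VoiceSegment`, `FrequencyResult`, `server_process`, and `device_process` are illustrative names, the network is replaced by direct function calls, the recognizer is a caller-supplied stub, and counting is done per segment rather than over a configurable time window.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VoiceSegment:            # sent from the device to the server
    signal: bytes
    utterance_start: float     # utterance start time information


@dataclass
class FrequencyResult:         # sent from the server back to the device
    utterance_start: float
    entries: List[Tuple[str, str, int]]   # (word, part of speech, frequency)


def server_process(segment, recognize):
    """Recognize the segment, then count each (word, part of speech) pair.

    recognize: stand-in for speech recognition; returns (word, pos) pairs.
    """
    counts = Counter(recognize(segment.signal))
    entries = [(w, p, n) for (w, p), n in counts.items()]
    return FrequencyResult(segment.utterance_start, entries)


def device_process(result, keyword_list, should_register, min_count):
    """Register qualifying words, then extract the important words."""
    for w, p, n in result.entries:
        if should_register(w, p, n):
            keyword_list.add(w)
    words = [w for w, p, n in result.entries
             if n >= min_count and w in keyword_list]
    return result.utterance_start, words
```

Passing the start time through both messages lets the device attach a time to each extracted word without keeping per-request state.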

A program for causing a computer to function as the important word extraction device in an utterance according to claim 1.