JP2000231427A

JP2000231427A - Multi-modal information analyzing device

Info

Publication number: JP2000231427A
Application number: JP3067899A
Authority: JP
Inventors: Atsushi Chazono; 篤茶園; Kazuo Kunieda; 和雄國枝
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-02-08
Filing date: 1999-02-08
Publication date: 2000-08-22

Abstract

PROBLEM TO BE SOLVED: To integrate input modalities without semantic contradiction with the actual target world by interpreting each input modality on the basis of knowledge of the world the user is dealing with. SOLUTION: Recognition results such as the user's gestures are input to a space-based attention information detection unit 21, which uses the target-world knowledge in a target position storage unit 23 to identify what the user is attending to. Recognition results such as the user's speech are input to a context-based attention information detection unit 22, which uses the target-world knowledge in a target characteristic storage unit 24 to identify what the user is attending to and in what way. An attention information integration unit 25 integrates these results, taking into account the state of the target world and the state of the system managed by a state management unit 26. The user's attention information is thereby detected.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer input device, and more particularly to a multimodal information analysis device that analyzes, in an integrated manner, direct indications such as a person's gestures and line of sight together with the person's speech, and outputs the analysis result as the user's attention information, that is, what the person is paying attention to and in what way.

[0002]

2. Description of the Related Art

People express their intentions and attention information by pointing at an object of interest, following it with their eyes, or speaking about what they are attending to. Conventionally, multimodal interface technology has been used as a means of analyzing human attention information; it makes integrated use of multiple input modalities, such as direct indication by gesture and indication of content by speech. When recognizing and integrating human gestures, speech, and the like, such multimodal interfaces generally link the individual recognition results using the time information of each recognition and output the linked result as the final analysis result.

[0003] As a general framework for handling multiple input modalities in a multimodal interface, there is the multimodal input analysis device disclosed in Japanese Examined Patent Publication No. Hei 7-122879. In this device, the input data of each modality, tagged with a pair consisting of the actual input start time and input end time, is arranged into a data sequence ordered by input time, and input analysis is performed by applying parsing rules capable of describing temporal constraints.

[0004] A multimodal interface using human gestures and speech is disclosed in Japanese Unexamined Patent Application Publication No. Hei 9-114634. It outputs each gesture and speech recognition result with a recognition time attached, links the recognition results using those recognition times as the reference, and outputs a final analysis result.

[0005]

Problems to Be Solved by the Invention

However, such conventional multimodal interfaces merely manage the recognition time of each input modality and associate the modalities with one another using that recognition time as the reference. They therefore cannot be said to integrate only those inputs whose association is actually meaningful. For example, the multimodal interface disclosed in the above-mentioned Japanese Unexamined Patent Application Publication No. Hei 9-114634 analyzes the semantic structure of each input modality from its recognition result, integrates these structures using a temporal reference, and outputs a semantic structure as the final analysis result. When analyzing the semantic structure of each modality, however, it makes no use of knowledge of the world the user is dealing with; it analyzes only the semantic structure that follows from how gestures are defined and the semantic structure that follows from the linguistic definition of speech. Consequently, when specifying the output destination for video playback, for example, a user who merely points at a destination screen and says "play" cannot obtain the requested operation, because the screen itself has no playback function. This input is consistent both in time and in the semantic structure of each input modality, yet it is semantically contradictory as an operation in the actual target world.

[0006] Furthermore, because the recognition rate of the various recognition units is generally not 100%, the speech recognition stage, for example, may select something that does not actually exist as the target, and the dialogue as a whole then frequently becomes inconsistent. If a framework existed for feeding the current analysis result back to the recognition system, the words to be recognized, the target region to be recognized, and so on could be narrowed down according to the current situation. For example, when the user is looking toward a television, recognition can be stabilized by having the speech recognition unit raise the acceptance rate of utterances that contain words such as "TV" and "television" or sentences containing them, and, if a video device is connected to the television, words such as "video". Likewise, when the system is asked a certain person's age but it is not known who that person is, recognition can be stabilized by having the image recognition unit use the target-world knowledge that an age inquiry concerns a person, narrowing down the recognition targets in advance and thereby reducing the chance of recognition errors.
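The feedback idea in this paragraph can be illustrated with a minimal sketch. It is not taken from the patent: the vocabulary table, the boost factor, the scores, and the function name `rescore` are all hypothetical, and a real recognizer would apply such a bias inside its own search rather than on a finished hypothesis list.

```python
# Hypothetical sketch: bias speech-recognition hypotheses toward words
# related to the object the user is currently looking at.

# Assumed target-world knowledge: words associated with each object.
RELATED_WORDS = {
    # "video" is listed on the assumption that a video device is connected.
    "television": {"tv", "television", "video"},
    "lamp": {"lamp", "light"},
}


def rescore(hypotheses, gazed_object, boost=1.5):
    """Re-rank (text, score) hypotheses, boosting those that mention
    a word related to the gazed object. Higher scores are better."""
    related = RELATED_WORDS.get(gazed_object, set())
    rescored = []
    for text, score in hypotheses:
        words = set(text.lower().split())
        factor = boost if words & related else 1.0
        rescored.append((text, score * factor))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


hypotheses = [("turn on the heavy", 0.40), ("turn on the tv", 0.35)]
best_text, _ = rescore(hypotheses, "television")[0]
print(best_text)
```

With the assumed scores, the hypothesis containing "tv" overtakes the acoustically stronger but irrelevant one once the user's gaze is known; with an unknown object the original ranking is left unchanged.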

[0007] Furthermore, conventional multimodal interface technology has concentrated on how to associate multiple input modalities that are offset in time, and has not gone so far as to weight each input modality according to the current situation of the world the user is dealing with. As a result, information from a modality is used even in situations where that modality is unnecessary, which can make the final analysis result inconsistent. Consider, for example, a case in which, during an interactive operation, the system asks the user what the target of the operation is. In this case the system is set up so that the user can decide the target either by indicating it directly with an arm or the like or by speaking a name specific to the target. When a target is selected and decided without such a question having been asked, however, it is better for direct indication with an arm or the like to allow selection of the target but not its final decision, and for the final decision to be made by an utterance such as "this one". This provides a more natural feel of operation and allows integration that makes use of the meaning of each modality.

[0008] When multiple input modalities are integrated by associating them on a temporal basis alone, or on the basis of time together with the semantic structure of each input modality, a semantic contradiction can arise from the standpoint of knowledge of the world the user is actually dealing with, even though the semantic structure of each input modality is itself consistent. To address this problem, an object of the present invention is to obtain a multimodal information integration device that achieves integration free of semantic contradiction with the actual target world by interpreting each input modality on the basis of knowledge of the world the user is dealing with.

[0009] Another problem is that, because the current analysis result cannot be fed back to the recognition system, the recognition system cannot respond flexibly, for example by narrowing down the recognition targets according to the situation. To address this problem, an object of the present invention is to provide a multimodal information analysis device with which a system that adapts flexibly to each situation can be built, by applying appropriate feedback to each recognition unit on the basis of the analysis result of each input modality and the integrated analysis result. A further problem is that, because the weighting of each input modality cannot be changed, even unnecessary information is integrated and the interpretation becomes contradictory. To address this problem, an object of the present invention is to obtain a multimodal information analysis device that can derive semantically consistent analysis results more flexibly, by changing the importance of each input modality as appropriate according to knowledge of the world the user is dealing with, the current state of each target, the state of the dialogue with the system, and so on.

[0010]

Means for Solving the Problems

To achieve the above objects, the multimodal information analysis device according to the invention of claim 1 comprises: a space-based attention information detection unit that detects the information the user is attending to in space, on the basis of an image recognition signal detected from the user's gestures, line of sight, and the like and the spatial knowledge of the target world stored in a target position storage unit, and outputs it as a space-based attention information signal; a context-based attention information detection unit that detects the information the user is attending to in terms of utterance content, on the basis of a speech recognition signal detected from the user's speech and the like and the knowledge of the functions, characteristics, and the like of each target in the target world stored in a target characteristic storage unit, and outputs it as a context-based attention information signal; an attention information integration unit that receives the space-based attention information signal, the context-based attention information signal, and the current state of each target in the target world and the state of the dialogue with the system, both managed by a state management unit, detects what the user is attending to and in what way, and outputs this as an attention information integration signal; and a feedback generation unit that receives the attention information integration signal and, according to its content, outputs feedback signals to the image recognition unit and the speech recognition unit, which output the image recognition signal and the speech recognition signal, respectively. A dialogue control unit receives the attention information integration signal and manages the dialogue between the user and the system by actually executing the operation in the target world if the operation is feasible and returning a response to the user if it is not. A target operation content storage unit stores and manages target knowledge, such as the commands for each target, that is used when the dialogue control unit actually executes an operation.

[0011] In the multimodal information analysis device according to the invention of claim 2, the space-based attention information detection unit is provided with an image recognition signal management unit and an attention target identification unit. The image recognition signal management unit sets a fixed time interval relative to the time at which an image recognition signal is input; if a new image recognition signal is input within that interval, it outputs these image recognition signals as a single grouped signal, and if no new image recognition signal is input within the interval, it outputs the image recognition signal as is. The attention target identification unit receives the image recognition signal from the image recognition signal management unit and identifies what the user is actually attending to, using the spatial knowledge of the target world stored in the target position storage unit.

[0012] In the multimodal information analysis device according to the invention of claim 3, the context-based attention information detection unit is provided with a language analysis unit that receives the speech recognition signal, performs language processing and analysis, and identifies the semantic content of the text, and with an attention content identification unit that receives the language analysis output from the language analysis unit and identifies what kind of thing the user is actually attending to, using the knowledge of the functions, characteristics, and the like of each target in the target world stored in the target characteristic storage unit.

[0013] In the multimodal information analysis device according to the invention of claim 4, the attention information integration unit is provided with a time consistency verification unit and a knowledge consistency verification unit. The time consistency verification unit receives the space-based attention information signal output from the space-based attention information detection unit and the context-based attention information signal output from the context-based attention information detection unit and, with regard to the valid time interval of each attention information signal, verifies whether the signals should be analyzed in combination or whether either one should be analyzed alone. The knowledge consistency verification unit receives the space-based attention information signal with a synchronization verification flag and the context-based attention information signal with a synchronization verification flag output from the time consistency verification unit, and verifies what in the target world the user is attending to and in what way, taking into account the content of each synchronization verification flag and the current state of the targets making up the target world and the dialogue state of the system, as managed by the state management unit.
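As a rough illustration of the time consistency check, the following sketch sets a synchronization flag on both attention signals when their valid time intervals overlap. The overlap-with-tolerance rule, the field names, and the tolerance value are assumptions made for this example; the text above does not specify how the verification is computed.

```python
from dataclasses import dataclass


@dataclass
class AttentionSignal:
    source: str                 # "space" or "context"
    content: str                # e.g. a region number or an utterance
    start: float                # start of the valid time interval (seconds)
    end: float                  # end of the valid time interval (seconds)
    synchronized: bool = False  # synchronization verification flag


def verify_time_consistency(space, context, tolerance=0.5):
    """Flag both signals as synchronized when their valid intervals
    overlap, allowing a small tolerance (an assumed rule)."""
    overlaps = (space.start <= context.end + tolerance
                and context.start <= space.end + tolerance)
    space.synchronized = overlaps
    context.synchronized = overlaps
    return space, context


pointing = AttentionSignal("space", "region 3", 10.0, 11.2)
speech = AttentionSignal("context", "play this", 11.5, 12.3)
verify_time_consistency(pointing, speech)
print(pointing.synchronized, speech.synchronized)
```

A later stage, corresponding to the knowledge consistency verification unit, would read the flag to decide whether to interpret the two signals together or separately.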

[0014] In the multimodal information analysis device according to the invention of claim 5, the state management unit is provided with a target state management unit that sequentially reads, adds, edits, stores, and manages the current states of the multiple constituent targets that are contained in, or make up, the target world, and with a system state management unit that sequentially reads, adds, edits, stores, and manages the content and state of the dialogue between the user and the system.

[0015] In the multimodal information analysis device according to the invention of claim 6, the feedback signals output by the feedback generation unit are an image recognition feedback signal and a speech recognition feedback signal, each output in accordance with the data stored in the target position storage unit and the target characteristic storage unit, which constitute the knowledge of the target world.

[0016] In the multimodal information analysis device according to the invention of claim 7, the dialogue control unit is provided with a feasibility verification unit, an operation content generation unit, and an operation content execution unit. The feasibility verification unit receives the attention information integration signal output from the attention information integration unit and verifies whether the operation can actually be carried out in the target world. The operation content generation unit receives the feasibility verification signal output from the feasibility verification unit and, in accordance with the target-world knowledge stored in the target operation content storage unit, generates a command sequence for executing the operation on the target the user is attending to. The operation content execution unit interprets the operation content signal output from the operation content generation unit and outputs an operation content signal to the actual target.
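The feasibility check and command generation described here can be sketched as follows, reusing the earlier example of pointing at a screen and saying "play". The knowledge table, the command names, and the function names are hypothetical; the patent does not define a command format.

```python
# Hypothetical target knowledge: the operations each target supports and
# the command sequence that realizes each operation.
TARGET_OPERATIONS = {
    "vcr": {"play": ["VCR_POWER_ON", "VCR_PLAY"], "stop": ["VCR_STOP"]},
    "screen": {"show": ["SCREEN_ON", "SCREEN_SELECT_INPUT"]},
}


def is_feasible(target, operation):
    """Feasibility verification: does the target support the operation?"""
    return operation in TARGET_OPERATIONS.get(target, {})


def handle_attention(target, operation):
    """Return ("execute", commands) when the operation is feasible,
    otherwise ("respond", message) so the user can be answered."""
    if is_feasible(target, operation):
        return "execute", list(TARGET_OPERATIONS[target][operation])
    return "respond", f"'{target}' cannot perform '{operation}'"


print(handle_attention("screen", "play"))
print(handle_attention("vcr", "play"))
```

Because the screen has no playback function in the assumed knowledge table, the first call yields a response for the user instead of a command sequence, which is the behavior the dialogue control unit is meant to provide.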

[0017] The multimodal information analysis device according to the invention of claim 8 is provided with an attention information storage unit that stores and manages the attention information integration signal output from the attention information integration unit.

[0018]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the multimodal information analysis device of the present invention. As shown in the figure, the device comprises an image input unit 11, a speech input unit 12, an image recognition unit 13, a speech recognition unit 14, and a time management unit 15, which are also found in the prior art, together with an attention information detection unit 2 and a dialogue management unit 3 according to the present invention. Of these, the image input unit 11 receives the user's gestures, line of sight, and the like from an image input device such as a camera and outputs them as an image signal 101. The speech input unit 12 receives the user's speech from a speech input device such as a microphone and outputs it as a speech signal 102.

[0019] The image recognition unit 13 detects information such as the user's gestures, line of sight, and position by exploiting differences and changes in the brightness and color of the image carried by the image signal 101 output from the image input unit 11. For example, when the user directly indicates a target in space with an arm or the like, the indicated position in space is determined using widely used image recognition and pattern recognition techniques. Using common image recognition techniques such as the difference method and binarization by threshold selection, the user's position is determined from the difference between a previously captured image of the space and the image currently being captured, and the camera's shooting position, shooting angle, and the like are adjusted according to that position. Then, using a technique commonly used in pattern recognition, such as template matching, the user's torso and arm are identified from a human body shape model prepared in advance as a template. Finally, using, for example, the stereo method, a common way of obtaining three-dimensional information, the correspondence of the identified model between multiple cameras is computed to obtain three-dimensional depth information and the like, which makes it possible to determine which region or position in space the user's arm is pointing at.
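The first step of this pipeline, locating the user by differencing against a previously captured background and binarizing with a threshold, can be sketched in a few lines. This is a simplified illustration on small grayscale grids, not the patent's implementation; the threshold value and the function name `locate_user` are assumptions, and the template matching and stereo steps are not shown.

```python
def locate_user(background, frame, threshold=30):
    """Return the (row, col) centroid of pixels that differ from the
    background by more than the threshold, or None if nothing changed."""
    row_sum = col_sum = count = 0
    for r, (bg_row, fr_row) in enumerate(zip(background, frame)):
        for c, (bg, fr) in enumerate(zip(bg_row, fr_row)):
            if abs(fr - bg) > threshold:  # binarize the difference image
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None
    return row_sum / count, col_sum / count


# 4x4 grayscale images: an empty room, then a bright 2x2 blob (the user).
background = [[0] * 4 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2):
        frame[r][c] = 200

print(locate_user(background, frame))
```

The centroid gives a coarse user position that could then drive the camera adjustment mentioned in the paragraph.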

[0020] Furthermore, the image recognition unit 13 outputs, as an image recognition signal 103, an information type identifier indicating that the signal was processed by the image recognition unit 13, a detection start time and detection end time, and the detected position in space indicated by the user, which is written into the information content as, for example, a region number within a predefined space. This signal is sent to the space-based attention information detection unit 21 in the attention information detection unit 2. This makes it possible to detect which region or position in space the user is attending to, which serves as an important source of information for determining what the user is actually attending to.

[0021] FIG. 2 shows an example of the description format of the image recognition signal 103 output from the image recognition unit 13. The format consists of an information type identifier for identifying whether the signal comes from the image recognition unit, information content giving the image recognition result, such as a spatial region number or indicated position coordinates, and time information giving the recognition start time, recognition end time, and the like. The information content is not limited to a single spatial region number or the like and may contain multiple pieces of information. The image recognition signal is a fixed-length or variable-length signal. The arrangement of the information type identifier, information content, and time information within the image recognition signal is not limited to the form shown in FIG. 2, and other information, such as a user identifier indicating which user the information belongs to, may also be included in the image recognition signal.

[0022] The speech recognition unit 14, for its part, converts, for example, the user's speech into character codes by exploiting the state of and changes in the amplitude, frequency, and the like of the speech signal 102 output from the speech input unit 12. To convert the user's speech into character codes, a widely known continuous speech recognition technique is used: for example, the pronunciation models of the words in a recognition dictionary prepared in advance are matched against the input model derived from the speech, or a model that specifies the probability of words appearing in succession is used, and the user's speech is output as character codes.

[0023] The speech recognition unit 14 also outputs, as a speech recognition signal 104, an information type identifier indicating that the signal was processed by the speech recognition unit 14, a recognition start time and recognition end time, and the character codes converted from the speech, which are written into the information content as, for example, a single block of text. This signal is sent to the context-based attention information detection unit 22 in the attention information detection unit 2. This allows what the user is saying to be processed as text, which serves as an important source of information for determining what the user is attending to in terms of utterance content.

[0024] FIG. 3 shows an example of the description format of the speech recognition signal 104 output from the speech recognition unit 14. The format consists of an information type identifier for identifying whether the signal comes from the speech recognition unit, information content giving the recognition result, such as the text obtained from the speech, and time information giving the recognition start time and recognition end time. The speech recognition signal is a fixed-length or variable-length signal. The arrangement of the information type identifier, information content, and time information within the speech recognition signal is not limited to the form shown in FIG. 3, and other information, such as a user identifier indicating which user the information belongs to, may also be included in the speech recognition signal.
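The recognition signal formats of FIG. 2 and FIG. 3 share the same three parts, so one possible in-memory representation is a single record type. The field names, the string values used for the identifier, and the example data below are assumptions for illustration; the figures themselves are not reproduced in this text and the patent leaves the layout open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognitionSignal:
    kind: str                      # information type identifier: "image" or "speech"
    content: List[str]             # region numbers or coordinates, or recognized text
    start_time: float              # recognition start time
    end_time: float                # recognition end time
    user_id: Optional[str] = None  # optional extra field, e.g. a user identifier


# An image recognition signal may carry several indicated regions.
image_signal = RecognitionSignal("image", ["region-3", "region-5"], 10.0, 11.2)

# A speech recognition signal carries the recognized text.
speech_signal = RecognitionSignal("speech", ["play this"], 11.5, 12.3, user_id="user-1")

print(image_signal)
print(speech_signal)
```

Keeping the content as a list reflects the statement that the information content may hold more than one item, and the optional `user_id` mirrors the remark that further fields may be added.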

[0025] Next, the time management unit 15 centrally manages the time information for the image recognition unit 13 and the speech recognition unit 14, and outputs time information such as the recognition start time and recognition end time of each recognition unit to the recognition units 13 and 14. The time information handled is not limited to absolute time information and may include relative time information. The attention information detection unit 2 comprises a space-based attention information detection unit 21, a context-based attention information detection unit 22, a target position storage unit 23, a target characteristic storage unit 24, an attention information integration unit 25, a state management unit 26, and a feedback generation unit 27.

【００２６】 FIG. 4 shows the detailed configuration of the space-based attention information detection unit 21, which is part of the multimodal information integration device shown in FIG. 1. The space-based attention information detection unit 21 comprises an image recognition signal management unit 211 and an attention target identification unit 212. The image recognition signal management unit 211 receives as input the image recognition signal 103 output from the image recognition unit 13, sets a fixed time interval W relative to the time T at which that signal was received, and, if a further image recognition signal is input within the interval W, outputs these signals together as a single grouped image recognition signal 2101. If no other image recognition signal is input within the interval W, the image recognition signal 103 is output as is.

【００２７】 In general, when a user indicates target regions in the course of a series of operations, the user does not always indicate just one region. Conventionally, when multiple regions are indicated, the indication start time and end time of each region are managed separately, and each recognition signal is treated as entirely unrelated to the others. From the standpoint of the dialogue between user and system, however, it is more effective not to manage the start and end times of each region separately. Instead, if the interval between successive indications is within a certain time, the user is regarded as indicating multiple regions in succession. The time at which the first region was indicated is taken as the indication start time. The indication is deemed to have ended when, for example, the user finally points at no region, or keeps pointing at one region for a certain time or longer, and that time is managed as the indication end time.

【００２８】 For example, if the recognition start time of the first image recognition signal input is T1 and the recognition end time of the last image recognition signal input within the time interval is E2, these are treated as a single grouped image recognition signal and output with recognition start time = T1 and recognition end time = E2. If the region numbers recorded in the image recognition signals are region 1, region 3, and region 4, and they arrived in the order region 4, region 3, region 1, the information content is output as a multi-region designation that preserves the input order, such as (region 4, region 3, region 1).
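The grouping behaviour of the image recognition signal management unit 211 can be illustrated with a minimal Python sketch. The class `ImageSignal`, its field names, and the function `group_signals` are hypothetical and do not appear in the specification. The sketch also assumes that the interval W is measured from the most recently received signal, which is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageSignal:
    """One image recognition signal (field names are illustrative assumptions)."""
    start: float               # recognition start time
    end: float                 # recognition end time
    regions: Tuple[int, ...]   # indicated region numbers, in input order
    received: float            # time T at which the manager received the signal


def group_signals(signals: List[ImageSignal], window: float) -> List[ImageSignal]:
    """Merge signals that arrive within `window` (W) of the previous signal."""
    grouped: List[ImageSignal] = []
    for sig in sorted(signals, key=lambda s: s.received):
        if grouped and sig.received - grouped[-1].received <= window:
            last = grouped[-1]
            grouped[-1] = ImageSignal(
                start=last.start,                    # start of the first signal
                end=sig.end,                         # end of the latest signal
                regions=last.regions + sig.regions,  # input order is preserved
                received=sig.received,               # assumption: W restarts here
            )
        else:
            grouped.append(sig)                      # no neighbour within W
    return grouped


signals = [
    ImageSignal(start=0.0, end=0.4, regions=(4,), received=0.5),
    ImageSignal(start=0.6, end=0.9, regions=(3,), received=1.0),
    ImageSignal(start=1.1, end=1.5, regions=(1,), received=1.6),
    ImageSignal(start=9.0, end=9.3, regions=(2,), received=9.4),
]
merged = group_signals(signals, window=1.0)
```

With these sample values the first three signals are merged into one signal carrying the regions (4, 3, 1), while the fourth, arriving much later, is passed through unchanged.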

【００２９】 As a result, when an utterance such as "from here to here" is input, it suffices to know which regions were pointed at and in what order. From this, an analysis result can be derived such as that the user indicated the span from region 1 to region 5, pointing in order from region 5 to region 1. Thus, even when multiple regions are designated, an integrated analysis result can be derived even if, for example, the number of demonstrative words in the utterance differs from the number of regions indicated in space. In this way, to support multi-region designation, the image recognition signal management unit 211 has the function of setting a fixed time interval W relative to the time at which the image recognition signal 103 is input to it. If a new image recognition signal is input within that interval, the multiple image recognition signals can be handled as a single grouped image recognition signal.

【００３０】 The attention target identification unit 212 receives as input the image recognition signal 2101 output from the image recognition signal management unit 211 and, using the content of that signal and the target world knowledge stored in the target position storage unit 23, identifies what the user is actually paying attention to. Here, target world knowledge means knowledge about the multiple constituent objects that are contained in, or that make up, the target world, such as their absolute and relative positions in space, names, functions, and characteristics.

【００３１】 Of this knowledge, the target position storage unit 23 stores spatial position information such as absolute and relative positions. Suppose, for example, that the image recognition signal 2101 yields the information that the region number indicated by the user is 1. In this case, the target position storage unit 23 is queried for what object exists in region number 1, and suppose the object "television" located in region number 1 is detected. This makes it possible to estimate what in the actual target world the user is paying attention to, and "television" can be written into the target identifier of the attention information content. After identifying what the user is actually paying attention to, the attention target identification unit 212 writes an information type identifier indicating that the information source is the image recognition unit 13, the recognition start and end times, and attention information content containing the identifier of the object of interest, and outputs the result as the space-based attention information signal 201 to the time consistency verification unit 251 in the attention information integration unit 25.
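The lookup performed by the attention target identification unit 212 can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `POSITION_STORE` stands in for the target position storage unit 23, and its contents, the class `SpaceAttentionSignal`, and the function `identify_targets` are hypothetical names that do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical contents of the target position storage unit 23:
# region number -> objects located in that region.
POSITION_STORE: Dict[int, List[str]] = {
    1: ["television"],
    2: ["video"],
    3: ["projector"],
}


@dataclass
class SpaceAttentionSignal:
    """Space-based attention information signal (illustrative layout)."""
    source: str                # information type identifier
    start: float               # recognition start time
    end: float                 # recognition end time
    targets: Tuple[str, ...]   # identifiers of the objects of interest


def identify_targets(
    regions: Sequence[int],
    start: float,
    end: float,
    store: Dict[int, List[str]] = POSITION_STORE,
) -> Optional[SpaceAttentionSignal]:
    """Map indicated region numbers to the objects stored for those regions."""
    targets = tuple(obj for region in regions for obj in store.get(region, []))
    if not targets:
        return None            # nothing is known about the indicated regions
    return SpaceAttentionSignal("image_recognition", start, end, targets)


signal_201 = identify_targets((1,), start=0.0, end=1.0)
```

For region number 1 the sketch yields a signal whose target identifier is "television", matching the example in the text.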

【００３２】 FIG. 5 shows an example of the description formats of the space-based attention information signal 201 output from the space-based attention information detection unit 21 and of the context-based attention information signal 202 output from the attention content identification unit 222 in the context-based attention information detection unit 22. Each attention information signal consists of an information type identifier indicating which recognition unit is the information source, the recognition start and end times in that recognition unit, and attention information content indicating what in the target world the user is actually paying attention to and in what way. Each attention information signal is a fixed-length or variable-length signal. The arrangement of the information type identifier, the recognition start and end times, and the attention information content within each attention information signal is not limited to FIG. 5, and other information, such as a user identifier indicating which user the information belongs to, may also be included in the attention information signal.

【００３３】 FIG. 6 shows an example of the description format of the attention information content that appears in the description format of FIG. 5. The attention information content consists of a target identifier, which identifies the person or object the user is paying attention to; an operation content identifier, which indicates what operation the user is requesting; and a request content identifier, which indicates what is requested by that operation. For example, if the target of attention is person A and the user wants to ask person A's name, the content is written as (person A, inquiry, name). The attention information content is of fixed or variable length. The arrangement of the target identifier, operation content identifier, and request content identifier within the attention information content is not limited to FIG. 6, and other information may be included, such as a user identifier indicating which user the information belongs to, or identifiers of related persons or objects needed for the operation.
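The signal layouts of FIG. 5 and FIG. 6 can be pictured as two small record types. The following Python sketch is an assumption made for illustration: the class names, field names, and the `extra` dictionary for optional items such as a user identifier are hypothetical, and the specification does not prescribe any particular encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AttentionContent:
    """Attention information content, after FIG. 6 (illustrative)."""
    target: Optional[str]      # target identifier, e.g. a person or an object
    operation: Optional[str]   # operation content identifier
    request: Optional[str]     # request content identifier


@dataclass
class AttentionSignal:
    """Attention information signal, after FIG. 5 (illustrative)."""
    source: str                # information type identifier
    start: float               # recognition start time
    end: float                 # recognition end time
    content: AttentionContent
    extra: Dict[str, str] = field(default_factory=dict)  # e.g. a user identifier


# The example from the text: ask for the name of person A.
content = AttentionContent(target="person A", operation="inquiry", request="name")
signal = AttentionSignal("speech_recognition", 2.0, 3.1, content, {"user": "user-1"})
```

Fields left as `None` can represent items that are still unknown, for instance when the user has pointed at an object but has not yet said what to do with it.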

【００３４】 The target position storage unit 23 has functions for sequentially reading, adding, editing, storing, and managing the part of the target world knowledge that concerns the absolute and relative position information of each object. FIG. 7 shows an example of the target world knowledge stored in the target position storage unit 23. In this example, one can see that the television is located in region 1 and, moreover, that it sits on top of a stand. One can likewise learn what objects exist in region 1 and where in the target space they are located. The target world knowledge described in the target position storage unit 23 is not limited to knowledge prepared in advance. For example, the knowledge can also be acquired by photographing the target space with an image input device such as a camera, detecting the constituent elements in the space by image recognition, labeling each element, and managing the position information obtained for each object together with its label as a pair.

【００３５】 FIG. 8 shows the detailed configuration of the context-based attention information detection unit 22, which is part of the multimodal information integration device shown in FIG. 1. The context-based attention information detection unit 22 comprises a language analysis unit 221 and an attention content identification unit 222. The language analysis unit 221 receives as input the speech recognition signal 104 output from the speech recognition unit 14 and applies language processing such as morphological analysis and syntactic analysis to the text, obtained from the uttered speech, that is contained in that signal. By extracting the constituent words, identifying the part of speech of each word, and analyzing the syntax, it identifies the meaning of the text. The language analysis unit 221 outputs a language analysis signal 2201 that describes, for example, an information type identifier indicating that the information source is the speech recognition unit 14, the recognition start and end times, and the language analysis results, for example the input text segmented into words together with the reading and part of speech of each word.

【００３６】 The attention content identification unit 222 receives as input the language analysis signal 2201 output from the language analysis unit 221 and, based on the target world knowledge stored in the target characteristic storage unit 24, identifies what the user is actually paying attention to. For example, when the language analysis result for the text "Turn this on" is input, information such as "this: demonstrative pronoun" and "turn on: verb" is obtained from the language analysis signal 2201. In this case, the target characteristic storage unit 24 is queried for whether any object has the characteristic "turn on", and suppose the object "television", which has that characteristic, is detected. This makes it possible to estimate what in the actual target world the user is paying attention to, and the attention information content can be written in a form such as (television, power, turn on).
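The query against the target characteristic storage unit 24 can be sketched in Python as follows. The dictionary `CHARACTERISTIC_STORE`, its contents, the `(base form, part of speech)` token layout, and the function `identify_content` are all assumptions made for this illustration. A real language analysis signal would come from a morphological analyzer, which is not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the target characteristic storage unit 24:
# object -> characteristic -> characteristic values usable as operations.
CHARACTERISTIC_STORE: Dict[str, Dict[str, List[str]]] = {
    "television": {"power": ["turn on", "turn off"]},
    "video": {"tape": ["play", "rewind", "stop", "record"]},
}

Token = Tuple[str, str]   # (base form, part of speech), an assumed layout


def identify_content(
    tokens: List[Token],
    store: Dict[str, Dict[str, List[str]]] = CHARACTERISTIC_STORE,
) -> List[Tuple[str, str, str]]:
    """Return (object, characteristic, value) for every verb known to the store."""
    matches: List[Tuple[str, str, str]] = []
    for word, pos in tokens:
        if pos != "verb":
            continue                       # only verbs are treated as operations
        for obj, traits in store.items():
            for trait, values in traits.items():
                if word in values:
                    matches.append((obj, trait, word))
    return matches


# Analysis result for "Turn this on", written by hand for the example.
tokens = [("this", "demonstrative pronoun"), ("turn on", "verb")]
candidates = identify_content(tokens)
```

With the sample store, the only object offering "turn on" is the television, so the sketch returns the single candidate (television, power, turn on). If several objects shared the characteristic, the list would contain several candidates and a further step would be needed to choose among them.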

【００３７】 Further, after identifying what content the user is actually paying attention to, the attention content identification unit 222 writes an information type identifier indicating that the information source is the speech recognition unit 14, the recognition start and end times, and attention information content describing the content of interest, and outputs the result as the context-based attention information signal 202 to the time consistency verification unit 251, described later, in the attention information integration unit 25. The target characteristic storage unit 24 has functions for sequentially reading, adding, editing, storing, and managing the part of the target world knowledge that concerns the name, functions, and characteristics of each object. FIG. 9 shows an example of the target world knowledge stored in the target characteristic storage unit 24. In this example, one can see that the video recorder, with respect to its "tape" characteristic, can be operated using characteristic values such as play, rewind, stop, and record. One can also learn that external output destinations such as a television and a projector are available for the output of the video recorder.

【００３８】 The target world knowledge described in the target characteristic storage unit 24 is likewise not limited to knowledge prepared in advance. For example, if each object in a distribution warehouse or the like carries its own management tag capable of wired or wireless communication, information such as the name, functions, and characteristics of each object can be extracted from the wired or wireless signal emitted by the tag, and the knowledge in the target characteristic storage unit 24 can be constructed automatically.

【００３９】 FIG. 10 shows the detailed configuration of the attention information integration unit 25, which is part of the multimodal information integration device shown in FIG. 1. The attention information integration unit 25 comprises a time consistency verification unit 251 and a knowledge consistency verification unit 252. The time consistency verification unit 251 receives as input the space-based attention information signal 201 output from the space-based attention information detection unit 21 and the context-based attention information signal 202 output from the context-based attention information detection unit 22, and verifies which attention information signals should be combined and analyzed together, or whether a given attention information signal is better analyzed on its own.

【００４０】 Suppose, for example, that the recognition start time in the image recognition unit 13 recorded in the space-based attention information signal 201 is T1, the recognition end time is E1, and the time frame set for awaiting other attention information signals is W1. Writing the time interval between time A and time B as [A, B], the time interval including the time frame W1 can be written as [T1, E1 + W1]. The time frame here is set by adding it to the recognition end time recorded in an attention information signal. It serves to eliminate timing offsets caused by differences in recognition processing load among the recognition units, and to absorb differences in the characteristics of the recognition units and in processing time in the attention information detection units. The time frame can be set freely for each attention information signal.

【００４１】 Likewise, suppose the recognition start time in the speech recognition unit 14 recorded in the context-based attention information signal 202 is T2, the recognition end time is E2, and the time frame is W2. If the valid time interval [T1, E1 + W1] of the space-based attention information signal 201 and the valid time interval [T2, E2 + W2] of the context-based attention information signal 202 overlap at any point, the time consistency verification unit 251 judges that the two signals need to be combined and analyzed together. It then attaches a synchronization verification flag, for example (space = 1, context = 1), to both signals to indicate that their results should be analyzed jointly from the temporal standpoint, and outputs them to the knowledge consistency verification unit 252 as the space-based attention information signal 2501 with synchronization verification flag and, at the same time, the context-based attention information signal 2502 with synchronization verification flag. Conversely, if the valid times of the two signals do not overlap, a synchronization verification flag indicating that each signal should be analyzed on its own, for example (space = 1, context = 0) or (space = 0, context = 1), is attached to each attention information signal before output.
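The overlap test on the valid time intervals [T1, E1 + W1] and [T2, E2 + W2] can be written out as a short Python sketch. The function names and the dictionary representation of the synchronization verification flag are hypothetical. The sketch treats the intervals as closed, so intervals that merely touch at an endpoint count as overlapping, which is an assumption the text does not settle.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]
Flags = Dict[str, int]


def valid_interval(start: float, end: float, frame: float) -> Interval:
    """Return [start, end + frame], the valid time interval of a signal."""
    return (start, end + frame)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


def sync_flags(
    space: Tuple[float, float, float],     # (T1, E1, W1)
    context: Tuple[float, float, float],   # (T2, E2, W2)
) -> Tuple[Flags, Flags]:
    """Return the flags attached to the space-based and context-based signals."""
    if overlaps(valid_interval(*space), valid_interval(*context)):
        both = {"space": 1, "context": 1}  # analyze the two signals together
        return dict(both), dict(both)
    # No overlap: each signal is marked for analysis on its own.
    return {"space": 1, "context": 0}, {"space": 0, "context": 1}


joint = sync_flags((0.0, 1.0, 0.5), (1.2, 2.0, 0.5))
separate = sync_flags((0.0, 1.0, 0.5), (3.0, 4.0, 0.5))
```

In the first call the intervals are [0.0, 1.5] and [1.2, 2.5], which overlap, so both signals receive (space = 1, context = 1). In the second call the intervals are [0.0, 1.5] and [3.0, 4.5], so each signal is flagged for standalone analysis.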

【００４２】 FIG. 11 shows an example of the description format of the attention information signal with synchronization verification flag output from the time consistency verification unit 251 in the attention information integration unit 25. The attention information signal with synchronization verification flag consists of an information type identifier indicating which attention information detection unit the signal came from, the information content, and the synchronization verification flag. The attention information signal with synchronization verification flag is a fixed-length or variable-length signal. The arrangement of the information type identifier, the information content, and the synchronization verification flag within the signal is not limited to FIG. 11, and other information, such as a user identifier or the reception time of each attention information signal, may also be included in the attention information signal with synchronization verification flag.

【００４３】 Here, the synchronization verification flag is a flag that indicates, when multiple input modalities are analyzed in an integrated manner, between which input modalities or modules the information should be integrated before analysis, from the standpoint of temporal consistency, or whether analysis should be performed on a single input modality or module. For example, if the set of modules to be integrated is the space-based attention information detection unit and the context-based attention information detection unit, the flag is written as (space = 1, context = 1) as described above, and if only the space-based attention information detection unit is involved, it is written as (space = 1, context = 0). This makes it possible to know, at the time of the final analysis, whether the analysis must integrate multiple modalities or modules or whether it can be performed on a single modality or module.

【００４４】 The knowledge consistency verification unit 252 receives as input the space-based attention information signal 2501 with synchronization verification flag and the context-based attention information signal 2502 with synchronization verification flag output from the time consistency verification unit 251. Taking into account the content of the synchronization verification flag, the current state of the constituent objects of the target world, and the state of the dialogue with the system, both managed by the state management unit 26, it identifies what in the target world the user is paying attention to and in what way. Suppose, for example, that the synchronization verification flag is (space = 1, context = 1), so that the space-based and context-based attention information must be analyzed together, that the target identifier in the attention information content of the signal 2501 is "projector", and that the attention information content in the signal 2502 is (video, playback, tape). Suppose further that the target world provides both a television and a projector as output destinations for the video recorder, so that from the utterance alone it may be unclear to which of them the playback should be output.

【００４５】 However, the space-based attention information signal 2501 with synchronization verification flag reveals that the output should go to the "projector". This makes it possible to identify the user's attention information, namely playing the video and outputting it to the projector, in a form such as ((video, playback, tape) → projector). In response to the user's request to play the video and output it to the projector, the knowledge consistency verification unit 252 also queries the state management unit 26 about, for example, whether the video recorder is powered on. If the video recorder is powered on, the step (video, power, turn on) can be omitted. If the video recorder were not powered on, the actual operation would be carried out in the sequence (video, power, turn on), ((video, playback, tape) → projector). Because the video recorder is known to be powered on, however, the user's request can be satisfied with the information ((video, playback, tape) → projector) alone. This eliminates unnecessary operations. By making use of the current state, the system can respond to the user more efficiently, which improves the feel of operation for the user. The attention information integration unit 25 also outputs, for example, the attention information content with the final output destination appended as the attention information integrated signal 203 to the feedback generation unit 27, and outputs a signal 204 identical to the attention information integrated signal 203 to the dialogue management unit 3.
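The reasoning of the knowledge consistency verification unit 252 in this example can be sketched in Python. Everything named here is a hypothetical stand-in: the function `plan_operations`, the `outputs` mapping for the available output destinations, and the `power_on` mapping for the state held by the state management unit 26. The sketch covers only the video and projector example from the text, not a general integration procedure.

```python
from typing import Dict, List, Optional, Tuple

Content = Tuple[str, str, str]   # e.g. ("video", "playback", "tape")


def plan_operations(
    context_content: Content,
    pointed_target: Optional[str],
    outputs: Dict[str, List[str]],
    power_on: Dict[str, bool],
) -> List[tuple]:
    """Combine spoken content with the pointed-at object and the current state."""
    device = context_content[0]
    candidates = outputs.get(device, [])
    steps: List[tuple] = []
    if not power_on.get(device, False):
        steps.append((device, "power", "turn on"))      # needed only when off
    if pointed_target in candidates:
        steps.append((context_content, pointed_target)) # pointing resolves it
    elif len(candidates) == 1:
        steps.append((context_content, candidates[0]))  # only one destination
    else:
        steps.append((context_content, None))           # still ambiguous
    return steps


outputs = {"video": ["television", "projector"]}
content = ("video", "playback", "tape")
when_on = plan_operations(content, "projector", outputs, {"video": True})
when_off = plan_operations(content, "projector", outputs, {"video": False})
```

When the video recorder is already on, the plan contains only the playback step directed to the projector. When it is off, the power-on step is placed first, mirroring the two sequences described in the text.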

【００４６】 FIG. 12 shows the detailed configuration of the state management unit 26 of the multimodal information integration device shown in FIG. 1. The state management unit 26 comprises a target state management unit 261 and a system state management unit 262. The target state management unit 261 has functions for sequentially reading, adding, editing, storing, and managing the current states of the multiple constituent objects that are contained in, or that make up, the target world. For example, if person B, a television, and a video recorder are present in a room, it manages states such as what operation person B is performing, whether the television is on, and whether the video recorder is playing, and these states can be read out and used as needed, stored, and managed.

【００４７】 The system state management unit 262, on the other hand, has functions for sequentially reading, adding, editing, storing, and managing the content and state of the dialogue between the user and the system. For example, it manages the current system state, such as a state in which, during the dialogue, the system has grasped only the request content "turn on" and does not know what object "turn on" actually applies to, and these states can be read out and used as needed, stored, and managed. As a result, when, for example, the user asks "How old?" and the request content is known but the target is undetermined, the target can be narrowed down either by answering "Mr. A" by voice or by directly pointing at Mr. A with an arm or the like. Conversely, when the state is not undetermined, the system can be configured so that direct pointing with an arm or the like does not perform the final narrowing down and is used only as an accompaniment to the spoken input of operation content and request content. This makes it possible to integrate the input modalities with a weighting of each modality that depends on the current state, and allows the user to operate the system in a way that feels natural.
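The state-dependent use of pointing can be illustrated with a minimal Python sketch. The class `SystemState`, its single field, and the function `handle_pointing` are hypothetical, and the sketch reduces the "weighting" of modalities to a simple on/off rule, which is a simplifying assumption rather than the method of the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SystemState:
    """Dialogue state held by the system state management unit 262 (illustrative)."""
    # (operation, request) known from speech while the target is still undetermined
    pending_request: Optional[Tuple[str, str]] = None


def handle_pointing(state: SystemState, pointed: str) -> Optional[Tuple[str, str, str]]:
    """Use a pointing gesture to fill in the target only when one is awaited."""
    if state.pending_request is None:
        return None                     # pointing alone does not finalize anything
    operation, request = state.pending_request
    state.pending_request = None        # the undetermined state is now resolved
    return (pointed, operation, request)


# The user asked "How old?" without naming anyone, then points at person A.
state = SystemState(pending_request=("inquiry", "age"))
resolved = handle_pointing(state, "person A")
ignored = handle_pointing(state, "person A")
```

The first gesture completes the pending request as (person A, inquiry, age). The second gesture arrives when nothing is pending, so it is not used for final narrowing down.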

【００４８】また、フィードバック生成部２７は、注目
情報統合部２５から出力される注目情報統合信号２０３
を入力として受け付け、この注目情報統合信号２０３の
内容に、対象世界の知識である対象位置蓄積部２３、お
よび対象特性蓄積部２４の内容を照らし合わせて、画像
認識フィードバック信号１０５を画像認識部１３へ出力
し、また、音声認識フィードバック信号１０６を音声認
識部１４へと出力する機能を有する。Further, the feedback generation unit 27 receives as input the attention information integrated signal 203 output from the attention information integration unit 25, checks the content of this signal against the contents of the target position storage unit 23 and the target characteristic storage unit 24, which hold the knowledge of the target world, and outputs an image recognition feedback signal 105 to the image recognition unit 13 and a speech recognition feedback signal 106 to the speech recognition unit 14.

【００４９】例えばユーザがある領域を腕などで直接指
示して、その領域内に含まれる対象がテレビであったと
する。この場合、注目情報統合信号２０３内に記される
注目情報内容としては、例えば（テレビ，操作内容不
明，要求内容不明）などと記されることになる。この注
目情報内容から、音声認識部１４に対して、例えば、テ
レビという単語やテレビという単語を含む文章、またテ
レビに関する特性である電源をつけるなどの単語やそれ
らの単語が含まれる文章に関して、通常の認識時よりも
認識する際の採択率を上げるなどの設定をするように音
声認識フィードバック信号１０６を出力することが可能
となる。また、ビデオがテレビに接続されていれば、ビ
デオという単語やビデオという単語を含む文章、ビデオ
に関する特性であるテープを再生するなどの単語やそれ
らの単語が含まれる文章などの採択率を上げて認識を安
定させるようにも設定することが可能となる。For example, suppose the user points directly at a region with an arm or the like, and the object contained in that region is a television. In this case, the attention information content written in the attention information integrated signal 203 is, for example, (television, operation content unknown, request content unknown). Based on this content, the speech recognition feedback signal 106 can be output to the speech recognition unit 14 so that the acceptance rate during recognition is raised above its normal level for the word "television" and sentences containing it, and for words tied to the television's characteristics, such as "turn on the power", and sentences containing them. Likewise, if a video recorder is connected to the television, the acceptance rate can be raised for the word "video" and sentences containing it, and for words tied to the video recorder's characteristics, such as "play the tape", and sentences containing them, so that recognition is stabilized.
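The television example can be sketched as a lookup that turns an attention triple into per-word weights for the speech recognizer. This is a minimal sketch, assuming a dictionary-based knowledge store; the names `CHARACTERISTIC_WORDS`, `CONNECTED_TO`, `speech_feedback`, and the boost value 1.5 are hypothetical and do not come from the specification.

```python
# Hypothetical target-world knowledge: words tied to each object's
# characteristics, and which objects are connected to which.
CHARACTERISTIC_WORDS = {
    "television": ["television", "turn on the power"],
    "video": ["video", "play the tape"],
}
CONNECTED_TO = {"television": ["video"]}


def speech_feedback(attention, boost=1.5):
    """Build a word -> weight map for the speech recognizer from an attention triple."""
    target, _operation, _request = attention
    weights = {}
    # Boost words for the attended object and for anything connected to it.
    for obj in [target] + CONNECTED_TO.get(target, []):
        for word in CHARACTERISTIC_WORDS.get(obj, []):
            weights[word] = boost
    return weights


# (television, operation content unknown, request content unknown)
feedback = speech_feedback(("television", None, None))
print(sorted(feedback))
```

An object with no stored knowledge simply yields an empty map, so recognition falls back to its normal acceptance rates.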

【００５０】他にも、例えばユーザが「何歳なの？」と
いう発話音声による問い合わせをした場合、注目情報内
容として、例えば（対象不明，問い合わせ，年齢）など
と記されることになる。この注目情報内容から、画像認
識部１３に対して、例えば年齢を問い合わせする対象と
しては人物が有力候補になるという対象特性蓄積部２４
内の知識を利用して、対象位置蓄積部２３内の対象候補
から予め認識対象や認識領域を絞り込むような画像認識
フィードバック信号１０５を出力することで、認識誤り
の可能性を減少させ認識を安定させるといったことも可
能となる。これにより、場面に応じたフィードバックを
認識部にかけることが可能となり、ユーザとシステムと
の対話をより円滑に、かつ、安定に進めることが可能と
なる。対話管理部３は、対話制御部３１と、対話操作内
容蓄積部３２とを備えている。As another example, when the user asks by voice "How old are you?", the attention information content is written as, for example, (target unknown, inquiry, age). Based on this content, and using the knowledge in the target characteristic storage unit 24 that a person is the likely target of an age inquiry, an image recognition feedback signal 105 that narrows down the recognition targets or recognition regions in advance from the target candidates in the target position storage unit 23 can be output to the image recognition unit 13, which reduces the possibility of recognition errors and stabilizes recognition. In this way, feedback suited to the situation can be applied to the recognition units, and the dialogue between the user and the system can proceed more smoothly and stably. The dialogue management unit 3 includes a dialogue control unit 31 and a target operation content storage unit 32.

【００５１】図１３は、その対話管理部３内の対話制御
部３１の詳細な構成を示す。対話制御部３１は、実現可
能性検証部３１１と、操作内容生成部３１２と、操作内
容実行部３１３と、応答生成部３１４とを備えており、
これらのうち実現可能性検証部３１１は、前記注目情報
統合部２５から出力される注目情報統合信号２０４を入
力として受け付け、対象世界に対して実際に操作を実現
することが可能であるかどうかを検証する機能を有す
る。FIG. 13 shows the detailed configuration of the dialogue control unit 31 in the dialogue management unit 3. The dialogue control unit 31 includes a feasibility verification unit 311, an operation content generation unit 312, an operation content execution unit 313, and a response generation unit 314. Of these, the feasibility verification unit 311 receives as input the attention information integrated signal 204 output from the attention information integration unit 25 and verifies whether the operation can actually be carried out on the target world.

【００５２】例えば注目情報統合信号２０４内に記載さ
れる注目情報内容が、（テレビ，電源，つける）などと
記されている場合、システムとしてはテレビという対象
の電源という特性をつけるという属性に設定すればよい
ということが理解できる。よって、実現可能性検証部３
１１は、システムの操作を実現できるか否かを示すため
の実現可否識別子、この場合は実現可能であることを示
す識別子を注目情報統合信号２０４に付加した形で、実
現可能性検証信号３１０１として記述したものを操作内
容生成部３１２へ、また、同時に実現可能性検証信号３
１０１と同一の信号３１０２を応答生成部３１４へと出
力する。For example, when the attention information content in the attention information integrated signal 204 is (television, power, turn on), the system can understand that the "power" characteristic of the target "television" should be set to the attribute "on". The feasibility verification unit 311 therefore adds to the attention information integrated signal 204 a feasibility identifier indicating whether the system can carry out the operation, in this case an identifier indicating that it can, outputs the result to the operation content generation unit 312 as the feasibility verification signal 3101, and at the same time outputs a signal 3102 identical to the feasibility verification signal 3101 to the response generation unit 314.

【００５３】また、例えば（人物Ａ，操作内容不明，要
求内容不明）などと記されている場合、人物Ａに関する
操作であるということだということは、システムは理解
できるが、人物Ａに関してどのような操作をするのかと
いうことはシステムには理解できない。つまり、この場
合はシステムとして操作を実現することができないとい
うことになる。この時、実現可能性検証部３１１は、シ
ステムの操作として実現不可能であることを知らせる実
現可否識別子を注目情報統合信号２０４に付加した形
で、実現可能性検証信号３１０２として応答生成部３１
４へと出力する。この場合、実現可能性検証信号３１０
１を操作内容生成部３１２へと出力することも可能では
あるが、敢えて出力する必要はない。Further, when the content is, for example, (person A, operation content unknown, request content unknown), the system can understand that the operation concerns person A but cannot understand what operation is to be performed on person A. In other words, the system cannot carry out an operation in this case. The feasibility verification unit 311 then adds to the attention information integrated signal 204 a feasibility identifier indicating that the operation cannot be carried out, and outputs the result to the response generation unit 314 as the feasibility verification signal 3102. The feasibility verification signal 3101 could also be output to the operation content generation unit 312 in this case, but there is no need to do so.
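The two cases above can be sketched as a function that attaches a feasibility identifier and routes the result. This is a sketch under an explicit assumption: it treats an operation as feasible exactly when all three fields of the attention triple are known, which is inferred from the two examples and is not stated as the general rule in the specification. The dictionary layout and the name `verify_feasibility` are hypothetical.

```python
UNKNOWN = None  # stands for "operation content unknown" / "request content unknown"


def verify_feasibility(attention):
    """Attach a feasibility identifier to an attention triple.

    Returns (signal_3101, signal_3102): the copy for the operation content
    generation unit (None when the operation is infeasible) and the copy
    for the response generation unit (always produced).
    """
    # Assumed rule: feasible only when no field of the triple is unknown.
    feasible = all(field is not UNKNOWN for field in attention)
    verified = {"attention": tuple(attention), "feasible": feasible}
    signal_3101 = verified if feasible else None
    signal_3102 = verified
    return signal_3101, signal_3102


print(verify_feasibility(("television", "power", "turn on")))
print(verify_feasibility(("person A", UNKNOWN, UNKNOWN)))
```

Returning `None` for the first signal mirrors the remark that the infeasible case need not be forwarded to the operation content generation unit.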

【００５４】次に、操作内容生成部３１２は、実現可能
性検証部３１１から出力される実現可能性を検証した後
の注目情報統合信号である実現可能性検証信号３１０１
を入力として受け付け、実現可能性検証信号３１０１内
の注目情報内容と、対象操作内容蓄積部３２に蓄積され
ている対象世界知識の内容に応じて、ユーザの注目対象
への操作を実行するためのコマンド列などを生成する機
能を有する。例えば実現可能性検証信号３１０１が、
（ビデオ，再生，テープ）という情報内容を含んでいる
場合には、実際に操作する対象であるビデオに関して、
テープを再生するために必要な操作内容を示すコマンド
列などを対象操作内容蓄積部３２から読み出すことで、
実際の操作に必要なコマンド列を生成することができ
る。そして、生成したコマンド列を操作内容信号３１０
３として、操作内容実行部３１３へと出力する機能を有
している。Next, the operation content generation unit 312 receives as input the feasibility verification signal 3101, that is, the attention information integrated signal after feasibility has been verified, output from the feasibility verification unit 311. According to the attention information content in the feasibility verification signal 3101 and the target world knowledge stored in the target operation content storage unit 32, it generates a command sequence or the like for executing the operation on the user's target of attention. For example, when the feasibility verification signal 3101 contains the information content (video, playback, tape), the command sequence needed for the actual operation can be generated by reading from the target operation content storage unit 32 the command sequence describing the operations needed to play a tape on the video recorder that is to be operated. The unit then outputs the generated command sequence to the operation content execution unit 313 as the operation content signal 3103.

【００５５】操作内容実行部３１３は、前記操作内容生
成部３１２から出力される操作内容信号３１０３の内容
を解釈し、実際の対象へ操作内容信号を直接出力する、
また、実際の対象に直接出力するのではなく、システム
を介して画像や音声などを出力するなどの機能を有す
る。例えば操作内容信号３１０３内に、対象人物Ａへ年
齢を問い合わせる操作を実現するために必要となるコマ
ンド列、“実行（人物Ａ，問い合わせ，年齢）”が記述
されている場合には、このコマンド列を実行するために
必要な出力先を特定し、例えばシステム側にその操作実
行信号を出力することでユーザの要求に応えるという役
割を担っている。これにより、ユーザの注目情報に応
じ、実際の対象世界において操作を実現することが可能
となる。The operation content execution unit 313 interprets the content of the operation content signal 3103 output from the operation content generation unit 312 and either outputs an operation signal directly to the actual target or, instead of outputting directly to the actual target, outputs images, sounds, and the like via the system. For example, when the operation content signal 3103 contains the command sequence "execute (person A, inquiry, age)", which is needed to carry out the operation of inquiring about the age of target person A, the unit identifies the output destination needed to execute this command sequence and responds to the user's request by outputting the operation execution signal to, for example, the system side. In this way, operations can be carried out in the actual target world according to the user's attention information.

【００５６】この操作実行信号の送信先としては、テレ
ビなどであれば実際の対象機器そのものであったり、人
物の年齢を問い合わせる場合ではシステムを構成するコ
ンピュータであったりするものであり、特に送信先が限
定されるものではない。また、実際の対象機器そのもの
に送信された場合には、その機器の特性に応じて画像や
音声などが、対象機器そのものや接続されている出力先
などに出力され、システムに含まれるコンピュータなど
に送信された場合には、コンピュータ自身からコンピュ
ータグラフィックスやビデオ映像などの画像や各種の効
果音や合成音などが出力されもしくはコンピュータを経
由して他の機器などから出力されることになる。The destination of this operation execution signal may be the actual target device itself, as in the case of a television, or a computer that forms part of the system, as in the case of inquiring about a person's age; the destination is not particularly limited. When the signal is sent to the actual target device, images, sounds, and the like are output from the target device itself or from a connected output destination according to the characteristics of the device. When it is sent to a computer included in the system, images such as computer graphics and video, and various sound effects and synthesized speech, are output from the computer itself or from other equipment via the computer.

【００５７】応答生成部３１４は、実現可能性検証部３
１１から実現可能性検証信号３１０２を入力として受け
付け、システムが操作を実現できるか否か、および対象
への操作内容はどのようなものかという情報に従って、
ユーザに応答文や応答画像などを出力するための制御機
能を有する。例えば実現可能性検証信号３１０２の内容
として、システムが操作を実現できないという情報と不
完全な注目情報内容（人物Ａ，操作内容不明，要求内容
不明）が得られたとする。この場合、ユーザに対して
“Ａさんに対してどのようにしたいのですか？”や“Ａ
さんに対する操作を実現できません”などの応答文を生
成し、スピーカなどの音声出力装置などを利用して出力
する。応答生成部３１４の出力送信先は特に限定される
ものではなく、システムに含まれるコンピュータのディ
スプレイや音声出力装置などを介して、各種の画像や効
果音や合成音を出力する、もしくはスクリーンやスピー
カなどを利用して出力させることになる。The response generation unit 314 receives as input the feasibility verification signal 3102 from the feasibility verification unit 311 and, according to the information on whether the system can carry out the operation and what operation is to be performed on the target, controls the output of response sentences, response images, and the like to the user. For example, suppose the feasibility verification signal 3102 contains the information that the system cannot carry out the operation together with the incomplete attention information content (person A, operation content unknown, request content unknown). In this case, the unit generates a response sentence for the user such as "What would you like to do with Mr. A?" or "The operation on Mr. A cannot be carried out" and outputs it through an audio output device such as a speaker. The output destination of the response generation unit 314 is not particularly limited: various images, sound effects, and synthesized speech may be output via the display or audio output device of a computer included in the system, or via a screen, speakers, or the like.
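The response generation step can be sketched as a function that maps a feasibility verification signal to a sentence. This is a minimal sketch, assuming the signal is a dictionary with `attention` and `feasible` keys; that layout, the function name `generate_response`, and the wording of the feasible and no-target responses are hypothetical. Only the question about the target is modelled on the example in the text.

```python
def generate_response(signal_3102):
    """Produce a response sentence from a feasibility verification signal."""
    target, operation, request = signal_3102["attention"]
    if signal_3102["feasible"]:
        # Hypothetical confirmation wording for the feasible case.
        return f"Executing: {target}, {operation}, {request}."
    if target is not None:
        # Target known but the operation is not: ask what to do with it.
        return f"What would you like to do with {target}?"
    # Hypothetical wording for the case where even the target is unknown.
    return "Which object do you mean?"


incomplete = {"attention": ("Mr. A", None, None), "feasible": False}
print(generate_response(incomplete))  # What would you like to do with Mr. A?
```

The returned string could then be passed to a speech synthesizer or shown on a display, since the output destination is left open.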

【００５８】対象操作内容蓄積部３２は、対象世界知識
の中でも各対象の操作を実現するために必要となる一連
の操作内容、コマンド列などに関する知識の逐次読み出
し，追加，編集，蓄積，管理などの機能を有する。図１
４は、対象操作内容蓄積部３２内に蓄積されている対象
世界知識の一例を示す。この例では、例えばビデオテー
プを再生するには、まず、ビデオの電源をつけるため
に、“実行（ビデオ，電源，オン）”というコマンドを
実行し、次に、システムとの対話の中で特定できるテレ
ビやプロジェクタなどの出力先の電源をつけるために、
“実行（出力先，電源，オン）”というコマンドを実行
する。そして、ビデオの出力をテレビなどの出力先に設
定し、テープを再生するために、“接続（ビデオ，出力
先）”ならびに、“実行（ビデオ，テープ，再生）”と
いう一連のコマンドを実行する必要があるということを
知ることができる。なお、対象操作内容蓄積部３２に記
述されている対象世界の知識は、予め作成しておいたも
のを利用すると限定されるものではない。例えば新たな
操作内容が追加されたり、ドライバファイルなどの追加
により機能の更新をしたりするなどの場合に、インター
ネットなどに接続していることで対象操作内容蓄積部３
２内の知識が自動的に更新されることなども考えられ
る。The target operation content storage unit 32 has functions for sequentially reading, adding, editing, storing, and managing the part of the target world knowledge that concerns the series of operations, command sequences, and the like needed to carry out operations on each target. FIG. 14 shows an example of the target world knowledge stored in the target operation content storage unit 32. In this example, it can be seen that playing a videotape requires the following series of commands. First, the command "execute (video, power, on)" is executed to switch on the video recorder. Next, the command "execute (output destination, power, on)" is executed to switch on the output destination, such as a television or projector, that can be identified in the dialogue with the system. Then, to route the video output to that output destination and play the tape, the commands "connect (video, output destination)" and "execute (video, tape, play)" are executed. Note that the target world knowledge described in the target operation content storage unit 32 is not limited to knowledge prepared in advance. For example, when new operation content is added or functions are updated by adding a driver file or the like, the knowledge in the target operation content storage unit 32 could be updated automatically through a connection to the Internet or the like.
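The videotape example can be sketched as a table lookup followed by substitution of the output destination. This is an illustration only: the dictionary `OPERATION_KNOWLEDGE`, the `{output}` placeholder, and the function `generate_commands` are hypothetical, and the actual format of FIG. 14 is not reproduced here.

```python
# Hypothetical contents of the target operation content storage unit 32,
# modelled on the videotape example; "{output}" is filled in from the dialogue.
OPERATION_KNOWLEDGE = {
    ("video", "playback", "tape"): [
        "execute (video, power, on)",
        "execute ({output}, power, on)",
        "connect (video, {output})",
        "execute (video, tape, play)",
    ],
}


def generate_commands(attention, output="television"):
    """Return the stored command sequence for an attention triple, or [] if none."""
    template = OPERATION_KNOWLEDGE.get(tuple(attention), [])
    # Substitute the output destination identified in the dialogue.
    return [step.format(output=output) for step in template]


for command in generate_commands(("video", "playback", "tape"), output="projector"):
    print(command)
```

Keeping the sequences in a data table rather than in code matches the remark that the stored knowledge may be added to or updated later.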

【００５９】ところで、上述した第１の実施の形態にお
いて、ユーザの注目情報を検出するために、カメラなど
の画像入力装置から入力される画像信号、マイクなどの
音声入力装置から入力される音声信号を利用している
が、ユーザの注目情報を検出するために利用する情報は
画像情報、音声情報に限られるものではない。例えばユ
ーザの空間的な注目情報を検出する手段として、ユーザ
が腕を用いて直接指示などする以外にも、マウスやレー
ザポインタなどのポインティングデバイスから入力され
る情報を、ユーザの注目情報を検出するための情報とし
て利用してもよい。In the first embodiment described above, the image signal input from an image input device such as a camera and the audio signal input from an audio input device such as a microphone are used to detect the user's attention information, but the information used for this purpose is not limited to image and audio information. For example, as a means of detecting the user's spatial attention information, information input from a pointing device such as a mouse or laser pointer may be used in addition to direct pointing by the user's arm.

【００６０】図１５は、この発明の実施の他の形態であ
るマルチモーダル情報解析装置を示すブロック図であ
る。この実施の形態では、図１に示すマルチモーダル情
報解析装置と同様の機能，構成を有しているが、図４に
示す空間ベース注目情報検出部２１内の画像認識信号管
理部２１１と注目対象特定部２１２とを分離し、また、
注目情報統合部２５内の時間整合性検証部２５１と知識
整合性検証部２５２とを分離して設けている点が異な
る。FIG. 15 is a block diagram showing a multimodal information analysis device according to another embodiment of the present invention. This embodiment has the same functions and configuration as the multimodal information analysis device shown in FIG. 1, but differs in that the image recognition signal management unit 211 and the attention target identification unit 212 in the space-based attention information detection unit 21 shown in FIG. 4 are separated, and the time consistency verification unit 251 and the knowledge consistency verification unit 252 in the attention information integration unit 25 are provided separately.

【００６１】注目情報検出部４内の画像認識信号管理部
４１は、前記画像認識部１３から出力される画像認識信
号１０３を入力として受け付け、単一の画像認識信号の
みで出力してもよいか、それとも、複数の画像認識信号
をひとまとまりとして注目情報内容、および時間情報を
改めて記述し直した画像認識信号を出力するかを管理す
る機能を有している。例えばある時間Ｔに最初の画像認
識信号が入力されたとする。この入力された画像認識信
号の情報内容は（領域３）で、時間情報は［Ｔ１，Ｅ
１］であるとする。このとき、ある時間Ｔに対して時間
間隔Ｗを設定し、次の画像認識信号が時間区間［Ｔ，Ｔ
＋Ｗ］内に入力されれば、最初の画像認識信号とひとま
とまりにして新たな画像認識信号として記述し直す。こ
のとき、次に入力された画像認識信号の情報内容が（領
域２）で、時間情報が［Ｔ２，Ｅ２］であったとする。
この場合、新たに生成される画像認識信号の情報内容
は、入力された順序を保存した形で（領域３，領域２）
となり、時間情報は［Ｔ１，Ｅ２］となる。時間間隔Ｗ
は新たな入力があるたびに設定し直され、時間間隔内に
入力がない場合に処理を終了する。The image recognition signal management unit 41 in the attention information detection unit 4 receives as input the image recognition signal 103 output from the image recognition unit 13 and manages whether a single image recognition signal may be output as it is or whether several image recognition signals should be grouped and output as one image recognition signal whose attention information content and time information have been rewritten. For example, suppose the first image recognition signal is input at a time T, with information content (region 3) and time information [T1, E1]. A time interval W is set from time T, and if the next image recognition signal is input within the time interval [T, T+W], it is grouped with the first signal and rewritten as a new image recognition signal. Suppose the information content of this next signal is (region 2) and its time information is [T2, E2]. The information content of the newly generated image recognition signal is then (region 3, region 2), preserving the input order, and its time information is [T1, E2]. The time interval W is set anew each time a new input arrives, and processing ends when no input arrives within the interval.
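The grouping rule above can be sketched as follows. This is a minimal sketch, assuming each signal is a tuple of arrival time, region list, recognition start time, and recognition end time; that tuple layout and the function name `group_signals` are hypothetical.

```python
def group_signals(signals, w):
    """Group image recognition signals whose arrival gaps are at most w.

    Each signal is (arrival_time, regions, start, end). The window restarts
    at every arrival, and a group closes when no signal arrives within w.
    Each group is returned as (regions, start, end).
    """
    groups = []
    current = None
    last_arrival = None
    for arrival, regions, start, end in sorted(signals, key=lambda s: s[0]):
        if current is not None and arrival <= last_arrival + w:
            # Merge: keep the input order of regions, the first start time,
            # and the end time of the newest signal.
            current = (current[0] + list(regions), current[1], end)
        else:
            if current is not None:
                groups.append(current)
            current = (list(regions), start, end)
        last_arrival = arrival
    if current is not None:
        groups.append(current)
    return groups


# (region 3) with [T1, E1] followed within W by (region 2) with [T2, E2]
signals = [(10.0, ["region 3"], 9.0, 9.8), (10.4, ["region 2"], 10.1, 10.3)]
print(group_signals(signals, w=0.5))
```

With a window of 0.5 the two signals merge into (region 3, region 2) with time information [T1, E2]; with a shorter window they would be output separately.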

【００６２】また、この画像認識信号管理部４１は、複
数の情報内容を含む場合であれば、例えば領域番号の順
番を保存した状態で情報内容として記述し直し、時間情
報も全ての画像認識信号に対応する新たな認識開始時
間、認識終了時間として記述し直して、画像認識信号４
０１として時間整合性検証部４２へと出力する。時間整
合性検証部４２は、画像認識信号管理部４１から出力さ
れる画像認識信号４０１、および音声認識部１４から出
力される音声認識信号１０４を入力として受け付け、両
認識信号の時間的な観点からの整合性を検証し、認識信
号を統合的に組み合わせて解析すればよいのか、もしく
はある認識信号を単独で解析した方がよいのかなどを検
証する。例えば画像認識信号４０１内に記述されている
画像認識部１３での認識開始時間がＴ１１、認識終了時
間がＥ１１であり、他のモダリティからの認識信号を待
ち受けるために設定する時間枠がＷ１１であるとする。When there are multiple pieces of information content, the image recognition signal management unit 41 rewrites them as a single information content that preserves, for example, the order of the region numbers, rewrites the time information as a new recognition start time and recognition end time covering all of the image recognition signals, and outputs the result to the time consistency verification unit 42 as the image recognition signal 401. The time consistency verification unit 42 receives as input the image recognition signal 401 output from the image recognition signal management unit 41 and the speech recognition signal 104 output from the speech recognition unit 14, verifies the consistency of the two recognition signals from a temporal standpoint, and determines whether the recognition signals should be combined and analyzed together or whether one of them should be analyzed on its own. For example, suppose the recognition start time at the image recognition unit 13 written in the image recognition signal 401 is T11, the recognition end time is E11, and the time frame set for waiting for recognition signals from other modalities is W11.

【００６３】ここで、時間枠Ｗ１１を含んだ時間区間は
［Ｔ１１，Ｅ１１＋Ｗ１１］と表現することができる。
ここでいう時間枠とは、ある認識信号に記されている認
識終了時間に加算する形で設定されるものであり、各認
識部での認識処理の負荷の違いによる時間ずれを解消し
たり、各認識部の特性の差などを吸収する役割を持って
いる。この時間枠は、各認識信号に応じて自由に設定す
ることが可能である。Here, the time interval including the time frame W11 can be expressed as [T11, E11+W11]. The time frame referred to here is added to the recognition end time written in a recognition signal; it serves to cancel the time lag caused by differences in processing load among the recognition units and to absorb differences in their characteristics. The time frame can be set freely for each recognition signal.

【００６４】また、音声認識信号１０４内に記されてい
る音声認識部１４での認識開始時間がＴ２２、認識終了
時間がＥ２２であり、時間枠がＷ２２である場合には、
画像認識信号４０１の有効時間区間［Ｔ１１、Ｅ１１＋
Ｗ１１］と、音声認識信号１０４の有効時間区間［Ｔ２
２、Ｅ２２＋Ｗ２２］との間で、互いに重複している時
間があれば、画像認識信号４０１と音声認識信号１０４
とを組み合わせて統合的に解析する必要があると判断
し、両認識信号の結果が時間的な観点から統合的に解析
されるべきであることを示すための、同期検証フラグ、
例えば（画像＝１、音声＝１）を画像認識信号４０１な
らびに音声認識信号１０４に付加して、同期検証フラグ
付き画像認識信号４０２として注目対象特定部４３へ、
同時に、同期検証フラグ付き音声認識信号４０３として
コンテキストベース注目情報検出部４４へと出力する機
能を有する。逆に、両認識信号の各有効時間が重複し
ない場合には、それぞれ単独で解析すべきであることを
示す同期検証フラグ、例えば（画像＝１，音声＝０）、
もしくは（画像＝０，音声＝１）を各認識信号に付加し
て出力すればよい。Similarly, when the recognition start time at the speech recognition unit 14 written in the speech recognition signal 104 is T22, the recognition end time is E22, and the time frame is W22, the unit checks whether the valid time interval [T11, E11+W11] of the image recognition signal 401 and the valid time interval [T22, E22+W22] of the speech recognition signal 104 overlap. If they overlap, the unit judges that the image recognition signal 401 and the speech recognition signal 104 need to be combined and analyzed together, and adds to both signals a synchronization verification flag, for example (image=1, speech=1), indicating that the results of the two recognition signals should be analyzed together from a temporal standpoint. It then outputs the image recognition signal 402 with synchronization verification flag to the attention target identification unit 43 and, at the same time, the speech recognition signal 403 with synchronization verification flag to the context-based attention information detection unit 44. Conversely, when the valid times of the two recognition signals do not overlap, a synchronization verification flag indicating that each should be analyzed on its own, for example (image=1, speech=0) or (image=0, speech=1), is added to each recognition signal before output.
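The overlap test and flag assignment can be sketched as follows. This is a sketch under stated assumptions: each signal is represented as a (start, end, frame) triple, intervals are treated as closed so that touching endpoints count as overlapping, and the names `valid_interval` and `sync_flags` are hypothetical.

```python
def valid_interval(start, end, frame):
    """Valid time interval [start, end + frame] of a recognition signal."""
    return (start, end + frame)


def sync_flags(image, speech):
    """Return (image_signal_flag, speech_signal_flag).

    image and speech are (start, end, frame) triples such as
    (T11, E11, W11) and (T22, E22, W22).
    """
    i_start, i_end = valid_interval(*image)
    s_start, s_end = valid_interval(*speech)
    # Two closed intervals overlap when each starts before the other ends.
    if i_start <= s_end and s_start <= i_end:
        both = {"image": 1, "speech": 1}
        return both, dict(both)
    # No overlap: each signal is flagged for analysis on its own.
    return {"image": 1, "speech": 0}, {"image": 0, "speech": 1}


# Image valid over [0.0, 1.5], speech valid over [1.2, 2.3]: they overlap.
print(sync_flags((0.0, 1.0, 0.5), (1.2, 2.0, 0.3)))
```

Without the time frame W11 the image interval would end at 1.0 and the same pair would not overlap, which shows how the frame absorbs timing differences between the recognition units.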

【００６５】このように、時間整合性検証部４２を設け
ることにより、注目対象特定部４３およびコンテキスト
ベース注目情報検出部４４における処理時間の違い、メ
ッセージ送信時間の違いなどに影響されることなく、認
識時点での時間的な整合性を検証することが可能とな
る。つまり、第１の実施の形態の場合よりも、さらにユ
ーザの入力に近い時点での時間的な整合性を検証するこ
とが可能となり、より厳密な時間整合性の検証を行った
後に、対象世界での知識にもとづいた整合性検証を実現
することが可能となる。Providing the time consistency verification unit 42 in this way makes it possible to verify temporal consistency as of the time of recognition, unaffected by differences in processing time or message transmission time in the attention target identification unit 43 and the context-based attention information detection unit 44. In other words, temporal consistency can be verified at a point closer to the user's input than in the first embodiment, and consistency verification based on knowledge of the target world can be performed after this stricter verification of temporal consistency.

【００６６】図１６は、この発明の実施のさらに他の形
態であるマルチモーダル情報解析装置を示すブロック図
であり、この実施の形態では、注目情報統合部２５から
出力される注目情報統合信号を蓄積し管理する機能を有
する注目情報蓄積部５を有している点が図１に示した形
態と異なる。この注目情報蓄積部５は、注目情報統合部
２５から出力される注目情報統合信号を逐次読み出し，
追加，編集，蓄積，管理などの機能を有しており、例え
ば蓄積している注目情報統合信号をある一定の時間枠に
おいて時間順に管理したり、ある注目対象毎の注目時
間、注目頻度などを管理したりすることにより、時間指
定や注目対象指定などにより任意の注目情報統合信号を
逐次読み出すことが可能となる。FIG. 16 is a block diagram showing a multimodal information analysis device according to still another embodiment of the present invention. This embodiment differs from the embodiment shown in FIG. 1 in that it has an attention information storage unit 5 that stores and manages the attention information integrated signals output from the attention information integration unit 25. The attention information storage unit 5 has functions for sequentially reading, adding, editing, storing, and managing these signals. For example, by managing the stored signals in chronological order within a given time frame, or by managing the attention time and attention frequency for each target of attention, any attention information integrated signal can be read out by specifying a time or a target of attention.
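The storage and retrieval functions described above can be sketched as a small in-memory store. This is only an illustration: the class `AttentionStore`, its method names, and the representation of a signal as a timestamp plus an attention triple are hypothetical.

```python
from collections import Counter


class AttentionStore:
    """Hypothetical store for attention information integrated signals."""

    def __init__(self):
        self._records = []  # (timestamp, (target, operation, request))

    def add(self, timestamp, attention):
        """Store a signal and keep the records in chronological order."""
        self._records.append((timestamp, tuple(attention)))
        self._records.sort(key=lambda record: record[0])

    def in_window(self, start, end):
        """Signals recorded within [start, end], in chronological order."""
        return [a for t, a in self._records if start <= t <= end]

    def by_target(self, target):
        """Signals whose target of attention matches, in chronological order."""
        return [a for _, a in self._records if a[0] == target]

    def frequency(self):
        """How often each target has been attended to."""
        return Counter(a[0] for _, a in self._records)


store = AttentionStore()
store.add(2.0, ("video", "playback", "tape"))
store.add(1.0, ("television", "power", "turn on"))
store.add(3.0, ("television", "channel", "change"))
print(store.frequency().most_common(1))
```

Queries by time window and by target correspond to reading out signals by time designation and by attention target designation, and the frequency count gives the most frequently attended target.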

【００６７】これにより、ユーザの注目対象や注目内容
の遷移を把握することが可能となったり、注目頻度の高
い対象や内容を把握することが可能となり、現在の対象
世界やシステムの状態だけでなく、過去の注目情報を参
照した上でより効率的に安定して注目情報を検出するこ
とができ、よりユーザの感覚に沿った自然な操作環境を
構築することで、さらに円滑に操作を実現することが可
能となる。また、図１に示すマルチモーダル情報統合装
置の、空間ベース注目情報検出部２１，コンテキストベ
ース注目情報検出部２２に関して、各注目情報信号を蓄
積し管理する部分を付加する形態もあり得る。This makes it possible to grasp transitions in the user's targets and content of attention and to identify targets and content that receive attention frequently. Attention information can then be detected more efficiently and stably by referring to past attention information as well as the current state of the target world and the system, and building an operating environment that feels more natural to the user allows operations to be carried out even more smoothly. A configuration is also possible in which a part that stores and manages each attention information signal is added to the space-based attention information detection unit 21 and the context-based attention information detection unit 22 of the multimodal information integration device shown in FIG. 1.

【００６８】また、前記の各実施の形態においても、図
１に示す実施の形態の場合と同様に、ユーザの注目情報
の検出をするために、画像情報，音声情報のみの使用に
限られるものではない。例えばユーザの空間的な注目情
報を検出する手段として、ユーザが腕を用いて直接指示
する以外にも、マウスやレーザポインタなどのポインテ
ィングデバイスから入力される情報を、ユーザの注目情
報を検出するための情報として利用してもよい。In each of the embodiments described above, as in the embodiment shown in FIG. 1, detection of the user's attention information is not limited to the use of image and audio information alone. For example, as a means of detecting the user's spatial attention information, information input from a pointing device such as a mouse or laser pointer may be used in addition to direct pointing by the user's arm.

【００６９】[0069]

【発明の効果】以上説明したように、この発明によれ
ば、画像認識信号と音声認識信号とにもとづき、入力モ
ダリティ間での時間的な整合性を検証した上で、同期検
証フラグ付き画像認識信号と同期検証フラグ付き音声認
識信号とを出力し、空間ベース注目情報検出部は前記同
期検証フラグ付き画像認識信号から、対象世界の知識を
利用してユーザが空間的に何に注目しているのかを検出
して、これを空間ベース注目情報信号として出力し、コ
ンテキストベース注目情報検出部は前記同期検証フラグ
付き音声認識信号から、対象世界の知識を利用してユー
ザが発話内容的にどのようなことに注目しているのかを
検出して、これをコンテキストベース注目情報信号とし
て出力し、これらの空間ベース注目情報信号とコンテキ
ストベース注目情報信号とから、注目情報統合部におい
て、対象世界の知識だけでなく現在の状態を利用して、
ユーザが実際に何に対してどのように注目しているのか
を検出して、これを注目情報統合信号として出力するよ
うに構成したので、単に入力モダリティの入力時間が近
いから統合するというのではなく、実際の対象世界に整
合するように統合するマルチモーダル情報統合装置を実
現することが可能となり、この結果、人間の腕などによ
る直接指示のジェスチャと発話音声とにもとづいて、人
間の複雑な複数の入力情報を統合的に解析して判断する
ことが可能となるという効果が得られる。As described above, according to the present invention, temporal consistency between the input modalities is verified on the basis of the image recognition signal and the speech recognition signal, and an image recognition signal with synchronization verification flag and a speech recognition signal with synchronization verification flag are output. The space-based attention information detection unit uses knowledge of the target world to detect, from the image recognition signal with synchronization verification flag, what the user is attending to spatially, and outputs this as a space-based attention information signal. The context-based attention information detection unit uses knowledge of the target world to detect, from the speech recognition signal with synchronization verification flag, what the user is attending to in terms of utterance content, and outputs this as a context-based attention information signal. From these two signals, the attention information integration unit uses not only knowledge of the target world but also the current state to detect what the user is actually attending to and in what way, and outputs this as an attention information integrated signal. Because of this configuration, a multimodal information integration device can be realized that integrates the modalities so as to be consistent with the actual target world rather than simply because their input times are close. As a result, a person's multiple, complex inputs can be analyzed and judged together on the basis of direct pointing gestures made with the arm or the like and uttered speech.

[Brief description of the drawings]

【図１】この発明の実施の一形態によるマルチモーダ
ル情報解析装置を示すブロック図である。FIG. 1 is a block diagram illustrating a multimodal information analysis device according to an embodiment of the present invention.

【図２】図１における画像認識部から出力される画像
認識信号の記述フォーマットを示すフォーマット図であ
る。FIG. 2 is a format diagram showing a description format of an image recognition signal output from an image recognition unit in FIG. 1;

【図３】図１における音声認識部から出力される音声
認識信号の記述フォーマットを示すフォーマット図であ
る。FIG. 3 is a format diagram showing a description format of a speech recognition signal output from a speech recognition unit in FIG. 1;

【図４】図１における空間ベース注目情報検出部の詳
細を示すブロック図である。FIG. 4 is a block diagram illustrating details of a space-based attention information detection unit in FIG. 1;

【図５】図１における空間ベース注目情報検出部およ
びコンテキストベース注目情報検出部から出力される各
注目情報信号の記述フォーマットを示すフォーマット図
である。FIG. 5 is a format diagram showing a description format of each attention information signal output from a space-based attention information detection unit and a context-based attention information detection unit in FIG. 1;

【図６】図５の記述フォーマットにおける注目情報内
容の記述フォーマットの具体例を示すフォーマット図で
ある。FIG. 6 is a format diagram showing a specific example of a description format of attention information content in the description format of FIG. 5;

【図７】図１における対象位置蓄積部に蓄積される対
象世界知識の記述内容の具体例を示す説明図である。FIG. 7 is an explanatory diagram showing a specific example of description contents of target world knowledge stored in a target position storage unit in FIG. 1;

【図８】図１におけるコンテキストベース注目情報検
出部の詳細を示すブロック図である。FIG. 8 is a block diagram illustrating details of a context-based attention information detection unit in FIG. 1;

【図９】図１における対象特性蓄積部に蓄積される対
象世界知識の記述内容の具体例を示す説明図である。FIG. 9 is an explanatory diagram showing a specific example of description contents of target world knowledge stored in a target characteristic storage unit in FIG. 1;

【図１０】図１における注目情報統合部の詳細を示す
ブロック図である。FIG. 10 is a block diagram showing details of an attention information integrating unit in FIG. 1;

【図１１】図１における時間整合性検証部から出力さ
れる同期検証フラグ付き各注目情報信号の記述フォーマ
ットを示すフォーマット図である。FIG. 11 is a format diagram illustrating the description format of each attention information signal with a synchronization verification flag output from the time consistency verification unit in FIG. 1;

【図１２】図１における状態管理部の詳細を示すブロ
ック図である。FIG. 12 is a block diagram illustrating details of a state management unit in FIG. 1;

【図１３】図１における対話管理部内の対話制御部の
詳細を示すブロック図である。FIG. 13 is a block diagram showing details of a dialog control unit in the dialog management unit in FIG. 1;

【図１４】図１における対象操作内容蓄積部に蓄積さ
れる対象世界知識の記述内容の具体例を示す説明図であ
る。FIG. 14 is an explanatory diagram showing a specific example of description contents of target world knowledge stored in a target operation content storage unit in FIG. 1;

【図１５】この発明の実施の他の形態によるマルチモ
ーダル情報統合装置を示すブロック図である。FIG. 15 is a block diagram showing a multimodal information integration device according to another embodiment of the present invention.

【図１６】この発明の実施の他の形態によるマルチモ
ーダル情報統合装置を示すブロック図である。FIG. 16 is a block diagram showing a multimodal information integration device according to another embodiment of the present invention.

[Explanation of symbols]

２１空間ベース注目情報検出部２１１画像認識信号管理部２１２注目対象特定部２２コンテキストベース注目情報検出部２２１言語解析部２２２注目内容特定部２３対象位置蓄積部２４対象特性蓄積部２５注目情報統合部２５１時間整合性検証部２５２知識整合性検証部２６状態管理部２６１対象状態管理部２６２システム状態管理部２７フィードバック生成部３１対話制御部３２対象操作内容蓄積部３１１実現可能性検証部３１２操作内容生成部３１３操作内容実行部３１４応答生成部 Reference Signs List: 21 space-based attention information detection unit; 211 image recognition signal management unit; 212 attention target identification unit; 22 context-based attention information detection unit; 221 language analysis unit; 222 attention content identification unit; 23 target position storage unit; 24 target characteristic storage unit; 25 attention information integration unit; 251 time consistency verification unit; 252 knowledge consistency verification unit; 26 state management unit; 261 target state management unit; 262 system state management unit; 27 feedback generation unit; 31 dialogue control unit; 32 target operation content storage unit; 311 feasibility verification unit; 312 operation content generation unit; 313 operation content execution unit; 314 response generation unit

Claims

1. A multi-modal information analysis device that receives a plurality of modalities used in ordinary human communication, such as gestures, line of sight, facial expressions, and voice, integrates the input modalities, and outputs an analysis result, the device comprising:
a space-based attention information detection unit that, based on image recognition signals detected from the user's gestures and line of sight and on spatial knowledge of the target world stored in a target position storage unit, detects the information the user is attending to spatially and outputs it as a space-based attention information signal;
a context-based attention information detection unit that, based on a speech recognition signal detected from the user's uttered speech and on knowledge of the functions and characteristics of each object in the target world stored in a target characteristic storage unit, detects the information the user is attending to in terms of utterance content and outputs it as a context-based attention information signal;
an attention information integration unit that receives as inputs the space-based attention information signal, the context-based attention information signal, and the current state of each object in the target world and the dialogue state with the system, both managed by a state management unit, detects what the user is paying attention to and how, and outputs this as an attention information integration signal;
a feedback generation unit that receives the attention information integration signal as an input and, according to its content, outputs feedback signals to the image recognition unit and the speech recognition unit that output the image recognition signal and the speech recognition signal, respectively;
a dialogue control unit that receives the attention information integration signal as an input, performs the operation on the target world if the operation is actually feasible, returns a response to the user, and manages part of the dialogue between the user and the system; and
a target operation content storage unit that stores and manages target knowledge, such as the commands for each target, used when the dialogue control unit actually performs an operation.

2. The multi-modal information analysis device according to claim 1, wherein the space-based attention information detection unit comprises: an image recognition signal management unit that, if a new image recognition signal is input within a predetermined time interval set relative to the time at which an image recognition signal was input, outputs those image recognition signals together as one group of signals, and, if no new image recognition signal is input within the predetermined time interval, outputs the image recognition signal as it is; and an attention target identification unit that receives the image recognition signal from the image recognition signal management unit as an input and identifies what the user is actually paying attention to by using the spatial knowledge of the target world stored in the target position storage unit.
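
The grouping behavior recited in claim 2 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `RecognitionSignal` and `group_signals`, the 0.5-second default, and the choice to measure the interval from the most recent signal in a group are all hypothetical; the claim itself specifies only "a predetermined time interval".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognitionSignal:
    """Hypothetical image recognition signal (fields are assumptions)."""
    timestamp: float  # arrival time in seconds
    label: str        # e.g. "pointing" or "gaze"


def group_signals(signals: List[RecognitionSignal],
                  window: float = 0.5) -> List[List[RecognitionSignal]]:
    """Batch signals that arrive within `window` seconds of the previous one.

    Assumption: the interval is measured from the most recent signal in the
    current group. A signal with no neighbor inside the window is passed
    through as a group of one, i.e. "as it is".
    """
    groups: List[List[RecognitionSignal]] = []
    for sig in sorted(signals, key=lambda s: s.timestamp):
        if groups and sig.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(sig)   # new signal arrived inside the interval
        else:
            groups.append([sig])     # interval elapsed: start a new group
    return groups
```

Each returned group would then be handed to the attention target identification unit as one unit of input.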

3. The multi-modal information analysis device according to claim 1, wherein the context-based attention information detection unit comprises: a language analysis unit that receives the speech recognition signal as an input, performs language processing and analysis, and identifies the meaning of the text; and an attention content identification unit that receives the language analysis output from the language analysis unit as an input and identifies what the user is actually paying attention to by using the knowledge of the functions and characteristics of each object in the target world stored in the target characteristic storage unit.

4. The multi-modal information analysis device according to claim 1, wherein the attention information integration unit comprises: a time consistency verification unit that receives as inputs the space-based attention information signal output from the space-based attention information detection unit and the context-based attention information signal output from the context-based attention information detection unit, and verifies, with respect to the effective time interval of each attention information signal, whether the signals should be analyzed in an integrated manner or whether either of them should be analyzed alone; and a knowledge consistency verification unit that receives as inputs the space-based attention information signal with synchronization verification flag and the context-based attention information signal with synchronization verification flag output from the time consistency verification unit, and, taking into account the content of each synchronization verification flag and the current state of the constituent objects in the target world and the dialogue state of the system managed by the state management unit, verifies what the user is paying attention to and how.
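
As a rough illustration of the time consistency verification in claim 4, the Python sketch below attaches a synchronization verification flag to each attention information signal. The overlap test on effective time intervals is an assumed criterion, and `AttentionSignal`, its fields, and `verify_time_consistency` are hypothetical names; the claim does not state how the effective time intervals are compared.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class AttentionSignal:
    """Hypothetical attention information signal (fields are assumptions)."""
    source: str         # "space" or "context"
    start: float        # start of the effective time interval, in seconds
    end: float          # end of the effective time interval, in seconds
    content: str        # what was detected, e.g. "lamp" or "turn on"
    sync: bool = False  # synchronization verification flag


def verify_time_consistency(
    space: Optional[AttentionSignal],
    context: Optional[AttentionSignal],
) -> Tuple[Optional[AttentionSignal], Optional[AttentionSignal]]:
    """Flag both signals for integrated analysis when their intervals overlap.

    Assumed criterion: two closed intervals overlap when each one starts
    before the other ends. If either signal is missing, or the intervals do
    not overlap, the flag stays False and the signal is analyzed alone.
    """
    both = space is not None and context is not None
    overlap = bool(both and space.start <= context.end
                   and context.start <= space.end)

    def flagged(sig: Optional[AttentionSignal]) -> Optional[AttentionSignal]:
        return replace(sig, sync=overlap) if sig is not None else None

    return flagged(space), flagged(context)
```

The flagged signals would then be passed to the knowledge consistency verification unit, which in the claim also consults the state management unit; that step is not modeled here.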

5. The multi-modal information analysis device according to claim 1, wherein the state management unit comprises: a target state management unit that sequentially reads, adds, edits, stores, and manages the current state of the plurality of constituent objects included in or constituting the target world; and a system state management unit that sequentially reads, adds, edits, stores, and manages the content and state of the dialogue between the user and the system.

6. The multi-modal information analysis device according to claim 1, wherein the feedback signals output by the feedback generation unit are an image recognition feedback signal and a speech recognition feedback signal, each output in accordance with the data stored in the target position storage unit and the target characteristic storage unit, which constitute the knowledge of the target world.

7. The multi-modal information analysis device according to claim 1, wherein the dialogue control unit comprises: a feasibility verification unit that receives the attention information integration signal output from the attention information integration unit as an input and verifies whether the operation can actually be performed on the target world; an operation content generation unit that receives the feasibility verification signal output from the feasibility verification unit as an input and, according to the target world knowledge stored in the target operation content storage unit, generates an operation content signal for performing an operation on the target the user is attending to; and an operation content execution unit that interprets the content of the operation content signal output by the operation content generation unit and outputs it to the actual target.
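
The three stages of the dialogue control unit in claim 7 (feasibility verification, operation content generation, operation content execution) can be sketched in Python as follows. Everything concrete here is a hypothetical assumption for illustration: the lookup-table form of the target operation knowledge, the command strings, and the function names do not come from the patent.

```python
from typing import Callable, Dict, Tuple

# Hypothetical target operation knowledge: (target, action) -> command.
Knowledge = Dict[Tuple[str, str], str]

OPERATION_KNOWLEDGE: Knowledge = {
    ("lamp", "turn_on"): "LAMP POWER ON",
    ("lamp", "turn_off"): "LAMP POWER OFF",
}


def verify_feasibility(target: str, action: str, knowledge: Knowledge) -> bool:
    """Assumed criterion: feasible if a command is known for this pair."""
    return (target, action) in knowledge


def generate_operation(target: str, action: str, knowledge: Knowledge) -> str:
    """Look up the command to apply to the target of the user's attention."""
    return knowledge[(target, action)]


def execute_operation(command: str, send: Callable[[str], None]) -> None:
    """Forward the command to the actual target via a caller-supplied sender."""
    send(command)


def handle_attention(target: str, action: str, knowledge: Knowledge,
                     send: Callable[[str], None]) -> bool:
    """Run the three stages in order; return False if the operation is infeasible."""
    if not verify_feasibility(target, action, knowledge):
        return False
    execute_operation(generate_operation(target, action, knowledge), send)
    return True
```

For example, calling `handle_attention("lamp", "turn_on", OPERATION_KNOWLEDGE, print)` sends the lamp command to `print`, while an unknown target such as `"tv"` returns `False` without sending anything.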

8. The multi-modal information analysis device according to claim 1, further comprising an attention information storage unit that accumulates and manages the attention information integration signal output from the attention information integration unit.