CN102339129B

CN102339129B - Multichannel human-computer interaction method based on voice and gestures

Info

Publication number: CN102339129B
Application number: CN 201110278390
Authority: CN
Inventors: 赵沁平; 陈小武; 蒋恺; 许楠
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2011-09-19
Filing date: 2011-09-19
Publication date: 2013-12-25
Anticipated expiration: 2031-09-19
Also published as: CN102339129A

Abstract

The invention discloses a multi-channel human-computer interaction method based on voice and gesture, which extracts voice reference object constraint information from voice information, and extracts gesture reference object constraint information from gesture information, wherein the gesture reference object constraint information includes current pointing The statistics of the distance from any point in the pointing area defined by the gesture to the pointing center of the pointing gesture and the time statistic maintained by the pointing gesture. When analyzing the constraint information of the gesture reference object, the distance statistics and time statistics are obtained to reduce the ambiguity of pointing in 3D interaction. In the process of determining the referent object, the model objects in the virtual environment are divided into four categories, and the referent object is compared with a certain type of model object according to the possibility of the referent object appearing in a certain type. It is also helpful to narrow the search scope of the referring object and reduce the impact of pointing ambiguity.

Description

A Multi-channel Human-Computer Interaction Method Based on Speech and Gesture

技术领域 technical field

本发明涉及人机交互领域，尤其涉及一种基于语音和手势的多通道人机交互方法。The invention relates to the field of human-computer interaction, in particular to a multi-channel human-computer interaction method based on voice and gesture.

背景技术 Background technique

多通道人机交互能够有效地扩大人与计算机之间信息交换的带宽，从而达到提高交互效率的目的；并可发挥人机之间彼此不同的认知潜力，降低用户的认知负荷。用户可以通过各种不同的交互通道以及它们之间的相互组合、协作来完成交互任务，这正好弥补了单一交互模式给用户带来的限制和负担。多通道人机交互中，指称归结定义为：求出多个通道输入信息的共同所指对象。其中，指称主要包括自然语言中的代词、定位副词、指示词和限定名词，例如“它”、“这儿”、“这个”、“那间房屋”等；指称对象是用户所指称的客观实体，例如三维空间中的模型等。在传统的单通道用户界面中，指称技术是单一的，并且通常是精确的，目标与目标之间的边界是清晰的。而在多通道用户界面中，指称技术是复合的并且通常是模糊的，边界是不清晰的。Multi-channel human-computer interaction can effectively expand the bandwidth of information exchange between human and computer, so as to achieve the purpose of improving interaction efficiency; it can also exert the different cognitive potentials between human and computer, and reduce the cognitive load of users. Users can complete interactive tasks through various interaction channels and their mutual combination and cooperation, which just makes up for the limitations and burdens brought to users by a single interaction mode. In multi-channel human-computer interaction, referential resolution is defined as finding out the common referent of input information from multiple channels. Among them, reference mainly includes pronouns, positioning adverbs, demonstratives and qualifiers in natural language, such as "it", "here", "this", "that house" and so on; the object of reference is the objective entity referred to by the user, For example, models in three-dimensional space, etc. In traditional single-channel user interfaces, the referential technique is singular and often precise, and the boundaries between objects and objects are clear. In MUI, however, referential techniques are compounded and often blurred, and boundaries are not clear.

目前多通道的研究已不局限于整合语音和传统鼠标键盘，基于语音和笔，语音和唇动，语音和三维手势的多通道系统得到了较大的关注。其中的典型代表包括基于Agent结构、支持语音和笔的多通道协作系统QuickSet，整合了“魔术棒”(一种新的六自由度设备)和语音的XWand系统等。W3C国际组织已经成立了“多通道交互”工作小组，开发W3C新的一类支持移动设备的多通道协议标准，包括多通道交互框架、多通道交互需求、多通道交互用例、可扩展多通道注释语言需求、数字墨水需求、可扩展多通道注释标记语言等。这些标准的制定反映了多通道技术已开始成熟。At present, multi-channel research is not limited to the integration of voice and traditional mouse and keyboard. Multi-channel systems based on voice and pen, voice and lip movement, voice and three-dimensional gestures have received greater attention. Typical representatives include QuickSet, a multi-channel collaboration system based on Agent structure that supports voice and pen, and XWand system that integrates "magic wand" (a new six-degree-of-freedom device) and voice. The W3C international organization has established a "multi-channel interaction" working group to develop a new W3C multi-channel protocol standard supporting mobile devices, including the multi-channel interaction framework, multi-channel interaction requirements, multi-channel interaction use cases, and scalable multi-channel annotations Language requirements, digital ink requirements, Extensible Multi-Channel Annotation Markup Language, etc. The establishment of these standards reflects the maturity of multi-channel technology.

关于多通道人机交互中指称归结问题的研究，Kehler运用认知科学和计算语言学的相关原理，研究并验证了多通道环境下指称与认知状态的对应关系，提出一种对认知状态编码并结合一组简单判断规则获取指称对象的方法，并在一个基于笔和语音的二维旅游地图应用中达到了很高的准确率。Kehler方法在处理单一指称结合精确指点手势时很有效，但这些规则假设所有对象都能被确定地选中，不能支持模糊的手势。Regarding the research on the problem of referential resolution in multi-channel human-computer interaction, Kehler used the relevant principles of cognitive science and computational linguistics to study and verify the corresponding relationship between referential and cognitive states in a multi-channel environment, and proposed a method for understanding cognitive states. The method of encoding and combining a set of simple judgment rules to obtain the referring object has achieved high accuracy in a pen and voice-based 2D tourist map application. Kehler's method works well when dealing with single-pointing gestures combined with precise pointing gestures, but these rules assume that all objects can be deterministically selected and cannot support ambiguous gestures.

哥伦比亚大学、俄勒冈科学和健康大学等合作研究增强现实与虚拟现实环境下三维多通道交互，提出用感知形状的方法解决指称归结的问题。感知形状是由用户控制的几何体，用户通过它与增强现实或虚拟现实环境交互，在交互过程中感知形状产生各种统计信息辅助目标选择。该方法主要解决了指称归结中指点模糊性问题，但并没有关注未指明信息的推断和多通道对齐。德国比勒费尔德大学的Pfeiffer等提出多通道指称归结应该注意指称类型、语句的复杂性、一致背景、不确定性等方面，并设计了一种面向沉浸式虚拟环境的指称归结引擎。该引擎是一个三层结构的专家系统：核心层、领域层、应用层。核心层是一个约束满足管理器；领域层提供对知识库的访问；应用层是外界程序与指称归结引擎的接口，负责将语音输入中的指称转化为对指称归结引擎的查询。该指称归结引擎将指称归结问题看作约束满足问题，主要关注从复杂的自然语言中提取有效的约束。但该方法对欠约束的情况以及指点模糊性还缺乏相应的处理。Columbia University, Oregon University of Science and Health, etc. collaborated to study three-dimensional multi-channel interaction in augmented reality and virtual reality environments, and proposed to use the method of perceiving shapes to solve the problem of referential resolution. Perceived shapes are user-controlled geometries through which users interact with augmented reality or virtual reality environments. Perceived shapes generate various statistical information during the interaction process to assist target selection. This method mainly solves the problem of pointing ambiguity in referential resolution, but does not pay attention to the inference of unspecified information and multi-channel alignment. Pfeiffer et al. from Bielefeld University in Germany proposed that multi-channel referential resolution should pay attention to referential types, sentence complexity, consistent background, uncertainty, etc., and designed a referential resolution engine for immersive virtual environments. The engine is an expert system with three layers: core layer, domain layer and application layer. The core layer is a constraint satisfaction manager; the domain layer provides access to the knowledge base; the application layer is the interface between external programs and the reference resolution engine, and is responsible for converting references in speech input into queries for the reference resolution engine. The reference resolution engine regards the reference resolution problem as a constraint satisfaction problem, and mainly focuses on extracting effective constraints from complex natural language. However, this method still lacks corresponding treatment for under-constrained situations and pointing ambiguity.

发明内容 Contents of the invention

本发明设计开发了一种基于语音和手势的多通道人机交互方法。The invention designs and develops a multi-channel human-computer interaction method based on voice and gesture.

本发明的一个目的在于，解决基于语音和手势的多通道人机交互方法中的指点模糊性问题。虚拟环境中进行三维交互时，手势(从识别指点开始到指点结束)不仅表达了空间信息，也承载了时间方面的信息。对象在指点区域内停留时间越长，可以认为被选中的可能性越大。因此，在进行手势指称对象约束信息的分析时，不仅要获取距离统计量，而且要获取时间统计量，从而降低三维交互中的指点模糊性。并且，在对指称对象进行确定的过程中，是将虚拟环境中的模型对象划分为四类，并将指称对象与某一类型模型对象进行对比，这种方法也有助于缩小指称对象的寻找范围，降低指点模糊性的影响。An object of the present invention is to solve the pointing ambiguity problem in the multi-channel human-computer interaction method based on voice and gesture. When performing three-dimensional interaction in a virtual environment, gestures (from the beginning of pointing recognition to the end of pointing) not only express spatial information, but also carry temporal information. The longer the object stays in the pointing area, the more likely it is to be selected. Therefore, when analyzing the constraint information of gesture referent objects, not only distance statistics but also time statistics should be obtained, so as to reduce pointing ambiguity in 3D interaction. Moreover, in the process of determining the referent object, the model objects in the virtual environment are divided into four categories, and the referent object is compared with a certain type of model object. This method also helps to narrow the search scope of the referent object , reducing the effect of pointing ambiguity.

本发明的另一个目的在于，解决基于语音和手势的多通道人机交互方法中的未指明信息推断的问题。虚拟环境中的模型对象被划分为四类，其中，聚焦对象为在上一次人机交互过程中所被确定的指称对象，也就是说，如果此次人机交互中语音输入的语句中出现了指示性代词“它”，则可认为此次人机交互的指称对象就是聚焦对象，从而解决了未指明信息推断的问题。Another object of the present invention is to solve the problem of unspecified information inference in the multi-channel human-computer interaction method based on voice and gesture. The model objects in the virtual environment are divided into four categories. Among them, the focus object is the referring object determined during the last human-computer interaction, that is, if there is The indicative pronoun "it" can be considered as the referring object of this human-computer interaction is the focused object, thus solving the problem of unspecified information inference.

本发明的又一个目的在于，提供一种基于语音和手势的多通道人机交互方法。通过构建多通道分层整合模型，在多通道分层整合模型中建立四层：物理层、词法层、语法层和语义层，并最终将人机交互所需的命令信息及指称对象填充入任务槽，上述整合过程的目标以及整合成功与否的判据都是以人机交互的任务结构的完整性为基础，最终目的就是生成可提交系统执行的任务结构，保证人机交互的有效进行。Another object of the present invention is to provide a multi-channel human-computer interaction method based on speech and gestures. By building a multi-channel hierarchical integration model, four layers are established in the multi-channel hierarchical integration model: physical layer, lexical layer, grammatical layer and semantic layer, and finally the command information and referent objects required for human-computer interaction are filled into tasks The goal of the above integration process and the criteria for the success of the integration are based on the integrity of the human-computer interaction task structure. The ultimate goal is to generate a task structure that can be submitted to the system for execution to ensure the effective implementation of human-computer interaction.

本发明提供的技术方案为：The technical scheme provided by the invention is:

一种基于语音和手势的多通道人机交互方法，其特征在于，包括以下步骤：A multi-channel human-computer interaction method based on voice and gesture, is characterized in that, comprises the following steps:

步骤一、构建语音通道和手势通道，并分别通过语音通道和手势通道对人机交互的指称对象进行语音信息和手势信息的输入；Step 1. Build a voice channel and a gesture channel, and input voice information and gesture information to the referent of the human-computer interaction through the voice channel and the gesture channel respectively;

步骤二、从上述语音信息中提取语音指称对象约束信息，从上述手势信息中提取手势指称对象约束信息，其中，所述手势指称对象约束信息包括当前指点手势所限定的指点区域内的任一点到达指点手势的指点中心的距离统计量以及上述指点手势所维持的时间统计量；Step 2. Extract the speech reference object constraint information from the above voice information, and extract the gesture reference object constraint information from the above gesture information, wherein the gesture reference object constraint information includes the arrival of any point in the pointing area defined by the current pointing gesture The distance statistics of the pointing center of the pointing gesture and the time statistics maintained by the above-mentioned pointing gesture;

步骤三、将上述语音指称对象约束信息及手势指称对象约束信息与虚拟环境中模型对象的特征信息进行对比，确定出人机交互的指称对象，从上述语音指称对象约束信息中提取对指称对象的命令信息，将命令信息作用于指称对象，完成一次人机交互。Step 3: Compare the above-mentioned speech referent object constraint information and gesture referent object constraint information with the feature information of the model object in the virtual environment, determine the referent object of human-computer interaction, and extract the referent object from the above-mentioned speech referent object constraint information Command information, which acts on the referent object to complete a human-computer interaction.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述虚拟环境中的模型对象被划分为指点对象、聚焦对象、激活对象以及沉寂对象四类，所述指点对象为位于当前指点手势所限定的指点区域内的对象，所述聚焦对象为在上一次人机交互过程中所被确定的指称对象，所述激活对象为位于可视范围内的除指点对象和激活对象以外的模型对象，所述沉寂对象为位于不可视范围内的除指点对象和激活对象以外的模型对象，在步骤三中，将上述语音指称对象约束信息及手势指称对象约束信息按顺序逐一与上述指点对象、聚焦对象、激活对象、沉寂对象的特征信息进行对比，确定出人机交互的指称对象。Preferably, in the multi-channel human-computer interaction method based on voice and gestures, the model objects in the virtual environment are divided into four categories: pointing objects, focusing objects, activation objects and silent objects, and the pointing objects are The object located in the pointing area defined by the current pointing gesture, the focus object is the referent object determined during the last human-computer interaction, and the activation object is the pointing object and activation object located within the visible range The silent object is a model object other than the pointing object and the activation object located in the invisible range. In step 3, the above-mentioned voice referent object constraint information and gesture referent object constraint information are sequentially combined with the above-mentioned The feature information of the pointing object, focusing object, active object, and silent object is compared to determine the referent object of human-computer interaction.

优选的是，所述的基于语音和手势的多通道人机交互方法中，在所述步骤二中，Preferably, in the multi-channel human-computer interaction method based on voice and gesture, in the second step,

从上述语音信息中提取语音指称对象约束信息和从上述手势信息中提取手势指称对象约束信息是通过以下方式实现的：Extracting the speech reference object constraint information from the above voice information and the gesture reference object constraint information from the above gesture information are realized in the following ways:

构建多通道分层整合模型，所述多通道分层整合模型包括有四层，分别为物理层、词法层、语法层和语义层，其中，所述物理层接收分别由语音通道和手势通道输入的语音信息和手势信息，所述词法层包括有语音识别解析模块和手势识别解析模块，所述语音识别解析模块将物理层的语音信息解析为语音指称对象约束信息，所述手势识别解析模块将物理层的手势信息解析为手势指称对象约束信息。Build a multi-channel layered integration model, the multi-channel layered integration model includes four layers, respectively physical layer, lexical layer, syntax layer and semantic layer, wherein, the physical layer receives input from voice channel and gesture channel respectively speech information and gesture information, the lexical layer includes a speech recognition and analysis module and a gesture recognition and analysis module, the speech recognition and analysis module parses the speech information of the physical layer into speech reference object constraint information, and the gesture recognition and analysis module will The gesture information at the physical layer is parsed into constraint information of the gesture referent.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述步骤三中，Preferably, in the described multi-channel human-computer interaction method based on voice and gesture, in the third step,

将上述语音指称对象约束信息及手势指称对象约束信息与虚拟环境中模型对象的特征信息进行对比，确定出人机交互的指称对象，所述指称对象的确定是在所述语法层上实现的，Comparing the constraint information of the voice reference object and the gesture reference object constraint information with the characteristic information of the model object in the virtual environment, and determining the reference object of human-computer interaction, the determination of the reference object is realized on the grammar layer,

从上述语音指称对象约束信息中提取对指称对象的命令信息是通过以下方式实现的：The command information for the referent is extracted from the constraint information of the above speech referent object in the following ways:

所述语法层从语音指称对象约束信息中提取命令信息，The grammar layer extracts command information from the phonetic reference object constraint information,

将命令信息作用于指称对象是通过以下方式实现的：Applying command information to referents is accomplished in the following ways:

所述语义层将语法层所提取的命令信息作用于指称对象。The semantic layer applies the command information extracted by the syntax layer to the referring object.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述多通道分层整合模型还包括有任务槽，所述任务槽包括命令表项以及指称对象表项，Preferably, in the multi-channel human-computer interaction method based on voice and gesture, the multi-channel layered integration model further includes a task slot, and the task slot includes a command entry and a referent entry,

其中所述语义层将语法层所提取的命令信息作用于指称对象是通过以下方式进行的：Wherein, the semantic layer applies the command information extracted by the syntax layer to the referent object in the following manner:

所述语义层将语法层所提取的命令信息填入命令表项，将指称对象填入指称对象表项，所述任务槽被填充完整，所述多通道分层整合模型生产系统可执行命令。The semantic layer fills the command information extracted by the syntax layer into the command table item, and fills the referent object into the referent object table item, the task slot is completely filled, and the multi-channel layered integrated model production system can execute the command.

优选的是，所述的基于语音和手势的多通道人机交互方法中，在所述任务槽未填充完整的情况下，设置等待时间，所述任务槽在等待时间内被填充完整，则继续此次人机交互，所述任务槽在等待时间内未被填充完整，则放弃此次人机交互。Preferably, in the multi-channel human-computer interaction method based on voice and gesture, when the task slot is not filled completely, a waiting time is set, and the task slot is filled completely within the waiting time, then continue In this human-computer interaction, if the task slot is not completely filled within the waiting time, the human-computer interaction is abandoned.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述命令表项包括有动作表项和参数表项，所述语音指称对象约束信息中提取对指称对象的命令信息时，所述命令信息包括动作信息和参数信息。Preferably, in the multi-channel human-computer interaction method based on speech and gestures, the command entry includes an action entry and a parameter entry, and the command information for the referent is extracted from the speech reference object constraint information When , the command information includes action information and parameter information.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述步骤一中，在语音通道接收到第一个语句时，开始一次人机交互过程。Preferably, in the multi-channel human-computer interaction method based on voice and gesture, in the first step, when the voice channel receives the first sentence, a human-computer interaction process is started.

优选的是，所述的基于语音和手势的多通道人机交互方法中，所述步骤一中，在语音通道接收到一个语句时，设置超时时间以接收手势通道的手势信息的输入，如手势信息的输入超出所设超时时间，则放弃此次人机交互过程。Preferably, in the multi-channel human-computer interaction method based on voice and gesture, in the first step, when the voice channel receives a sentence, set the timeout time to receive the input of gesture information of the gesture channel, such as gesture If the input of information exceeds the set timeout period, the human-computer interaction process will be abandoned.

本发明所述的基于语音和手势的多通道人机交互方法，具有以下有益效果：The voice and gesture-based multi-channel human-computer interaction method of the present invention has the following beneficial effects:

(1)解决基于语音和手势的多通道人机交互方法中的指点模糊性问题。虚拟环境中进行三维交互时，手势(从识别指点开始到指点结束)不仅表达了空间信息，也承载了时间方面的信息。对象在指点区域内停留时间越长，可以认为被选中的可能性越大。因此，在进行手势指称对象约束信息的分析时，不仅要获取距离统计量，而且要获取时间统计量，从而降低三维交互中的指点模糊性。并且，在对指称对象进行确定的过程中，是将虚拟环境中的模型对象划分为四类，并将指称对象与某一类型模型对象进行对比，这种方法也有助于缩小指称对象的寻找范围，降低指点模糊性的影响。(1) Solve the problem of pointing ambiguity in multi-channel human-computer interaction methods based on speech and gestures. When performing three-dimensional interaction in a virtual environment, gestures (from the beginning of pointing recognition to the end of pointing) not only express spatial information, but also carry temporal information. The longer the object stays in the pointing area, the more likely it is to be selected. Therefore, when analyzing the constraint information of gesture referent objects, not only distance statistics but also time statistics should be obtained, so as to reduce pointing ambiguity in 3D interaction. Moreover, in the process of determining the referent object, the model objects in the virtual environment are divided into four categories, and the referent object is compared with a certain type of model object. This method also helps to narrow the search scope of the referent object , reducing the effect of pointing ambiguity.

(2)解决基于语音和手势的多通道人机交互方法中的未指明信息推断的问题。虚拟环境中的模型对象被划分为四类，其中，聚焦对象为在上一次人机交互过程中所被确定的指称对象，也就是说，如果此次人机交互中语音输入的语句中出现了指示性代词“它”，则可认为此次人机交互的指称对象就是聚焦对象，从而解决了未指明信息推断的问题。(2) Solve the problem of unspecified information inference in the multi-channel human-computer interaction method based on speech and gesture. The model objects in the virtual environment are divided into four categories. Among them, the focus object is the referring object determined during the last human-computer interaction, that is, if there is The indicative pronoun "it" can be considered as the referring object of this human-computer interaction is the focused object, thus solving the problem of unspecified information inference.

(3)提供一种基于语音和手势的多通道人机交互方法。通过构建多通道分层整合模型，在多通道分层整合模型中建立四层：物理层、词法层、语法层和语义层，并最终将人机交互所需的命令信息及指称对象填充入任务槽，上述整合过程的目标以及整合成功与否的判据都是以人机交互的任务结构的完整性为基础，最终目的就是生成可提交系统执行的任务结构，保证人机交互的有效进行，提高了人机交互的可靠性。(3) Provide a multi-channel human-computer interaction method based on voice and gesture. By building a multi-channel hierarchical integration model, four layers are established in the multi-channel hierarchical integration model: physical layer, lexical layer, grammatical layer and semantic layer, and finally the command information and referent objects required for human-computer interaction are filled into tasks The goal of the above-mentioned integration process and the criteria for the success of the integration are based on the integrity of the human-computer interaction task structure. reliability of human-computer interaction.

附图说明 Description of drawings

图1为本发明所述的基于语音和手势的多通道人机交互方法的人机交互过程的示意图。FIG. 1 is a schematic diagram of the human-computer interaction process of the multi-channel human-computer interaction method based on voice and gesture according to the present invention.

图2为本发明所述的基于语音和手势的多通道人机交互方法的指称归结的总体架构图。FIG. 2 is an overall architecture diagram of the reference resolution of the multi-channel human-computer interaction method based on voice and gesture according to the present invention.

图3为本发明所述的基于语音和手势的多通道人机交互方法的总体流程图。FIG. 3 is an overall flow chart of the multi-channel human-computer interaction method based on voice and gesture according to the present invention.

具体实施方式 Detailed ways

下面结合附图对本发明做进一步的详细说明，以令本领域技术人员参照说明书文字能够据以实施。The present invention will be further described in detail below in conjunction with the accompanying drawings, so that those skilled in the art can implement it with reference to the description.

如图1、图2和图3所示，本发明提供一种基于语音和手势的多通道人机交互方法，包括以下步骤：As shown in Figure 1, Figure 2 and Figure 3, the present invention provides a multi-channel human-computer interaction method based on voice and gesture, comprising the following steps:

如图1所示，上述基于语音和手势的多通道人机交互方法，首先支持语音和手势两个交互通道。其中语音识别模块采用微软语音识别引擎，将用户的语音命令映射为带时间戳的文本信息，由语音解析模块从中提取出语音指称对象约束信息。手势通道使用数据手套获取关节及位置信息以供手势识别，手势解析模块接受指点手势，并产生指点对象向量。多通道整合模块整合来自语音和手势通道的信息，在整合过程中实现对指称的归结，最后产生系统可执行命令或相应提示。As shown in FIG. 1 , the above multi-channel human-computer interaction method based on voice and gesture firstly supports two interaction channels of voice and gesture. The voice recognition module adopts the Microsoft voice recognition engine to map the user's voice commands into text information with time stamps, and the voice analysis module extracts the voice referent object constraint information from it. The gesture channel uses data gloves to obtain joint and position information for gesture recognition. The gesture analysis module accepts pointing gestures and generates pointing object vectors. The multi-channel integration module integrates the information from the voice and gesture channels, realizes the affiliation of the reference during the integration process, and finally generates system executable commands or corresponding prompts.

本发明采用多通道分层整合模型实现多通道整合。整合过程是任务引导的，整合的目标以及整合成功与否的判据都是以交互任务结构的完整性为基础，最终目的就是生成可提交系统执行的任务结构，其中包括任务的动作、任务作用的对象以及相应参数等信息。因此，本发明中定义了任务槽，任务槽属于多通道分层整合模型的一部分。任务槽的结构分为三个部分，分别是动作表项、指称对象表项和参数表项，也可以称之为动作槽、指称对象槽和参数槽。实际上，动作表项和参数表项都属于命令表项。其中指称对象槽中的指称对象可以不止一个，目前参数槽只能填充位置信息。不同的命令会对应有具有不同结构的任务槽，例如选择命令的任务槽只有动作和指称对象两个表项。整合的过程就变成了对任务槽的填充过程，一旦任务槽填满，就形成了系统可执行的完整任务。The present invention adopts a multi-channel layered integration model to realize multi-channel integration. The integration process is task-guided. The goal of integration and the criteria for the success of integration are based on the integrity of the interactive task structure. The ultimate goal is to generate a task structure that can be submitted to the system for execution, including task actions and task functions. Objects and corresponding parameters and other information. Therefore, task slots are defined in the present invention, and task slots are part of the multi-channel hierarchical integration model. The structure of the task slot is divided into three parts, which are the action table item, referent object table item and parameter table item, which can also be called action slot, referent object slot and parameter slot. In fact, both action table items and parameter table items belong to command table items. There can be more than one referent in the referent slot. Currently, the parameter slot can only be filled with position information. Different commands correspond to task slots with different structures. For example, the task slot of the select command has only two entries: action and referent. The process of integration becomes the process of filling task slots. Once the task slots are filled, a complete task that can be executed by the system is formed.

举例来说，如仅进行了语音输入“旋转它”，而未作出指点手势，也就是无法确定指称对象。则任务槽在填充时，将在动作槽内填入“旋转”，而指称对象槽为空。此时，由于设置有等待时间，如果任务槽在等待时间内被填充完整，也就是在等待时间内作出了指点手势，从而确定了指称对象，则继续进行此次人机交互。多通道分层整合模型会生成系统可执行命令，如果任务槽在等待时间内未被填充完整，则放弃此次人机交互。For example, if only the voice input "rotate it" is made without a pointing gesture, the referent cannot be determined. Then when the task slot is filled, the action slot will be filled with "rotation", while the referent slot is empty. At this time, since the waiting time is set, if the task slot is completely filled within the waiting time, that is, a pointing gesture is made within the waiting time, thereby determining the referent, the human-computer interaction will continue. The multi-channel layered integration model will generate system executable commands, and if the task slot is not fully filled within the waiting time, the human-computer interaction will be abandoned.

本发明定义的多通道分层整合模型，顾名思义，是基于分层的思想，将通道信息从具体的设备信息到最终要填充至任务槽的语义抽象成物理层、词法层、语法层和语义层等四层。物理层信息是从交互设备输入的原始信息，它的形式具有多样性，与具体的交互设备直接相关。比如从语音输入的是字符串信息，而从数据手套输入的是传感器信息。词法层是关键的一层，它对来自设备层的原始信息进行统一化处理，把意义相同而形式不同的输入统一为相同的信息表示，从而向语法层提供与设备无关的信息。在词法层中，语音通道的语音信息经过语音识别模块和语音解析模块进行抽象，生成语音指称对象约束信息；同时，手势通道的手势信息经过手势识别模块和手势解析模块的抽象后，生成手势指称对象约束信息。语法层主要将来自词法层的信息按照人机交互的语法规范进行分解，分解为符合任务槽各个表项的形式，为后续的语义融合做准备。指称归结主要在语法层进行。并且，语法层还从语音指称对象约束信息中提取命令信息。在语义层，就是利用任务引导机制，进行任务槽的填充和完善，虽然任务与具体的应用有关，但任务槽的填充和完善却独立于应用。The multi-channel layered integration model defined in the present invention, as the name suggests, is based on the idea of layering, abstracting channel information from specific device information to semantics to be filled into task slots into physical layer, lexical layer, syntax layer and semantic layer Wait for four floors. The physical layer information is the original information input from the interactive device, and its form is diverse and directly related to the specific interactive device. For example, character string information is input from voice, while sensor information is input from data gloves. The lexical layer is the key layer. It unifies the original information from the device layer, and unifies the input with the same meaning but different forms into the same information representation, thus providing device-independent information to the grammar layer. In the lexical layer, the voice information of the voice channel is abstracted by the voice recognition module and the voice analysis module to generate voice reference object constraint information; at the same time, the gesture information of the gesture channel is abstracted by the gesture recognition module and the gesture analysis module to generate gesture references Object constraint information. The grammatical layer mainly decomposes the information from the lexical layer according to the grammatical norms of human-computer interaction, and decomposes it into forms that conform to each table item of the task slot, so as to prepare for the subsequent semantic fusion. Referential resolution is mainly carried out at the grammatical level. Moreover, the syntax layer also extracts command information from the phonetic referent constraint information. In the semantic layer, the task guidance mechanism is used to fill and complete the task slot. Although the task is related to the specific application, the filling and completion of the task slot is independent of the application.

实际上，人机交互过程可以分为两种策略，“急性子”和“慢性子”两种。急性子整合只要多通道输入支持一定程度的整合就开始处理，此过程可以看作是事件驱动的。而慢性子的整合则要到具有了全部输入或者比较完整的输入之后才开始处理。举例而言，在进行人机交互时，急性子的策略是，语音输入“旋转它”，多通道分层整合模型就开始工作，开始进行信息的处理。而慢性子的策略是，语音输入“旋转它”，同时指点手势做出指点某个物体，以使得模型可以确定指称对象，此时模型才启动。也就是，慢性子是在一次性提供一次人机交互的全部信息。由于用户的语音输入经常出现不连续的情况，一个完整的移动物体的命令中间出现较大的时间间隔。同时受到语音识别引擎的限制，本发明使用“急性子”策略，采用语音驱动，在语音通道接收到第一语句时，就开始一次人机交互过程。In fact, the human-computer interaction process can be divided into two strategies, "immediate" and "slow". Acute sub-integration processing begins as soon as the multi-channel input supports some level of integration, and this process can be viewed as event-driven. And the integration of the slow child will not start to process until it has all the inputs or relatively complete inputs. For example, when conducting human-computer interaction, the strategy of the impatient is that the voice input "rotate it", and the multi-channel hierarchical integration model starts to work and begins to process information. The strategy of the slow child is that the voice input "rotate it", and at the same time the pointing gesture is made to point to an object, so that the model can determine the referring object, and the model starts at this time. That is to say, the slow child is providing all the information of a human-computer interaction at one time. Since the user's voice input often appears discontinuous, there is a large time interval in the middle of a complete command of a moving object. Simultaneously limited by the speech recognition engine, the present invention uses the "impatient" strategy, adopts speech drive, and starts a human-computer interaction process when the speech channel receives the first sentence.

指称对象确认的过程也就是指称归结的过程。在本发明中，指称归结要同时以语音指称对象约束信息和手势指称对象约束信息为依据。本发明基于以下两条假设：(1)语音输入中的语义是清晰的，本发明主要关注于解决多通道指称归结中的指点模糊性，因此假设语音输入中的语义是清晰的，不存在“左上角”、“中间”、“以前”等模糊词汇；(2)以“自我为中心”的指称，指称可以划分为三种类型：以自我为中心、以参照物为中心、以他人为中心。本发明中的所有指称均是以自我为中心，不存在“选择他左边的物体”这种以其他视点为中心的情况。The process of identifying the referent is also the process of referential resolution. In the present invention, the reference resolution is based on both speech reference object constraint information and gesture reference object constraint information. The present invention is based on the following two assumptions: (1) the semantics in the speech input is clear, and the present invention mainly focuses on solving the ambiguity of pointing in the multi-channel reference resolution, so it is assumed that the semantics in the speech input is clear and there is no " upper left corner", "middle", "before" and other vague words; (2) "self-centered" references, which can be divided into three types: self-centered, reference-centered, and others-centered . All references in the present invention are self-centered, and there is no such situation as "choose the object on his left" that is centered on other viewpoints.

本发明采用语音驱动的整合策略，一个语句被识别后，触发多通道整合过程。多通道分层整合模型中，首先，语音指称对象约束信息被填充入语音约束集。根据手势指称对象约束信息则可以为虚拟环境中的所有模型对象分配身份，将所有模型对象划分为指点对象、聚焦对象、激活对象以及沉寂对象四类。所述指点对象为位于当前指点手势所限定的指点区域内的对象，所述聚焦对象为在上一次人机交互过程中所被确定的指称对象，所述激活对象为位于可视范围内的除指点对象和激活对象以外的模型对象，所述沉寂对象为位于不可视范围内的除指点对象和激活对象以外的模型对象。每一类型模型对象对应一个初始化匹配矩阵，分别为指点矩阵、聚焦矩阵、激活矩阵和沉寂矩阵。The present invention adopts a voice-driven integration strategy, and after a sentence is recognized, a multi-channel integration process is triggered. In the multi-channel hierarchical integration model, firstly, the phonetic reference object constraint information is filled into the phonetic constraint set. All model objects in the virtual environment can be assigned identities according to the constraint information of gesture referent objects, and all model objects can be divided into four categories: pointing objects, focusing objects, active objects and silent objects. The pointing object is an object located in the pointing area defined by the current pointing gesture, the focusing object is the referring object determined during the last human-computer interaction, and the activation object is an object located in the visible range except A model object other than the pointing object and the activation object, the silent object is a model object other than the pointing object and the activation object located in an invisible range. Each type of model object corresponds to an initialization matching matrix, which are pointing matrix, focusing matrix, activation matrix and silence matrix.

本发明在指称归结过程中采用感知形状的方法，感知形状是由用户控制并能提供交互对象有关信息的几何体。当系统识别当前手势为指点手势时，生成附着在虚拟手食指指尖上的圆锥体(也就是由指点手势所限定的指点区域)，通过碰撞检测记录模型对象和圆锥体交互过程，生成各种统计量数据。然后对统计量加权平均生成指点优先级。一次指点交互完成以后，得到与该指点手势对应的二元组向量，该二元组的第一个元素为指点对象向量，第二个元素为指点优先级。The present invention adopts the method of perception shape in the process of denotation resolution, and the perception shape is a geometry controlled by the user and can provide information about interactive objects. When the system recognizes that the current gesture is a pointing gesture, it generates a cone attached to the tip of the index finger of the virtual hand (that is, the pointing area defined by the pointing gesture), records the interaction process between the model object and the cone through collision detection, and generates various Statistics data. The weighted average of the statistics is then used to generate pointing priorities. After a pointing interaction is completed, a 2-tuple vector corresponding to the pointing gesture is obtained. The first element of the 2-tuple is the pointing object vector, and the second element is the pointing priority.

本发明定义了时间序列T_rank和距离序列D_rank两种统计量。在感知形状内的时间越长，距离指点中心(虚拟手食指指尖)越近，则该模型对象的优先级越高。The present invention defines two statistics of time series T _rank and distance series D _rank . The longer the time in the perceived shape and the closer the distance to the pointing center (the virtual hand index fingertip), the higher the priority of the model object.

T_rank的计算过程如下式所示，其中T_object表示某模型对象在圆锥体中的时间，T_period表示某次交互过程中圆锥体的存在时间(即为指点手势的持续时间)。The calculation process of T _rank is shown in the following formula, where T _object represents the time of a certain model object in the cone, and T _period represents the existence time of the cone during a certain interaction process (that is, the duration of the pointing gesture).

$T_{rank} = \frac{T_{object}}{T_{period}},$ 0＜T_rank≤1 $T_{rank} = \frac{T_{object}}{T_{period}},$ 0<T _rank ≤1

D_rank的计算过程如下式所示，其中D_object表示某模型对象中心到指点中心的距离，D_max是在圆锥体中的模型对象到指点中心的最远距离。The calculation process of D _rank is shown in the following formula, where D _object represents the distance from the center of a model object to the pointing center, and D _max is the farthest distance from the model object in the cone to the pointing center.

$D_{rank} = 1 - \frac{D_{object}}{D_{\max}},$ 0＜D_rank≤1 ${D.}_{rank} = 1 - \frac{{D.}_{object}}{{D.}_{\max}},$ 0<D _rank ≤1

指点优先级P_rank由上述两种统计量加权平均得到，其计算方法如下：The pointing priority P _rank is obtained by the weighted average of the above two statistics, and its calculation method is as follows:

P_rank＝T_rank*λ+D_rank*(1-λ)，0≤λ≤1P _rank =T _rank *λ+D _rank *(1-λ), 0≤λ≤1

由于交互设备并没有被设计来以协作的方式工作，进行跨通道的整合就必须依靠时间相关性。因此通过感知形状计算得到指点优先级P_rank后，应该记录当前时间以供后阶段的多通道整合使用。由于任务槽对于进一步的信息输入具有等待时间的设置，这个等待时间的数值则要考虑到，进一步的手势信息输入并与语音信息一起完成指称归结过程所需要的时间。Since interactive devices are not designed to work in a cooperative manner, integration across channels must rely on time correlation. Therefore, after obtaining the pointing priority P _rank through perceptual shape calculation, the current time should be recorded for the later stage of multi-channel integration. Since the task slot has a waiting time setting for further information input, the value of this waiting time should take into account the time required for further input of gesture information and completion of the referral resolution process together with voice information.

上述得到指点优先级和指点对象向量后，将在指点矩阵、聚焦矩阵、激活矩阵、沉寂矩阵中逐一进行比对寻找，处于四个矩阵中的模型对象具有对应的状态。在每一阶段，对于位于同一矩阵中的模型对象进行指称归结时，则是通过匹配函数Match(o，e)量化模型对象所处状态。After the pointing priority and pointing object vector are obtained above, the pointing matrix, focusing matrix, activation matrix, and silence matrix will be compared and searched one by one, and the model objects in the four matrices have corresponding states. At each stage, when referring to the model objects located in the same matrix, the state of the model objects is quantified by the matching function Match(o, e).

匹配函数的构造如下：The matching function is constructed as follows:

$Match match ((o o,, e e)) = = [[\underset{S S &Element; &Element; {{P P,, F f,, A A,, E E.}}}{Σ Σ} P P ((o o | | S S)) * * P P ((S S | | e e))]] * * Semantic Semantic ((o o,, e e)) * * Temp Temp ((o o,, e e))$

其中，o表示模型对象，e表示指称。P表示指点状态，F表示聚焦状态，A表示激活状态，E表示沉寂状态，S表示当前对象的状态。下面是Match(o，e)的各个组成部分：Among them, o represents the model object and e represents the reference. P means the pointing state, F means the focusing state, A means the active state, E means the quiet state, and S means the state of the current object. The following are the various components of Match(o, e):

(1)P(o|S)与P(S|e)(1) P(o|S) and P(S|e)

P(o|S)表示给定认知状态S时对象o被选中概率，用于衡量手势通道对指称归结的影响。具体计算方法是：P(o|P)＝P_rank；

(M为聚焦对象的个数)，

(N为激活对象的个数)，

(L表示虚拟环境中所有模型对象的个数)。P(S|e)是当指称为e时指称对象状态为S的概率。P(o|S) represents the probability that an object o is selected given a cognitive state S, and is used to measure the impact of gesture channels on referential resolution. The specific calculation method is: P(o|P)=P _rank ;

(M is the number of focus objects),

(N is the number of activated objects),

(L represents the number of all model objects in the virtual environment). P(S|e) is the probability that the referent state is S when the referent is e.

(2)Semantic(o，e)(2) Semantic (o, e)

Semantic(o，e)表示模型对象o与指称e之间的语义兼容性，用于衡量语音通道对指称归结的影响，其构造如下：Semantic(o, e) represents the semantic compatibility between the model object o and the reference e, which is used to measure the influence of the speech channel on the reference resolution, and its construction is as follows:

$Semantic Semantic ((o o,, e e)) = = \underset{k k}{Σ Σ} \frac{{Attr Attr}_{k k} ((o o,, e e))}{K K}$

本发明将标志符和语义类型均划入属性Attr_k中，当o和e均有属性k并且两者值不等时Attr_k(o，e)为0，其余情况为1。K为指称对象的属性总数。The present invention classifies both identifiers and semantic types into the attribute Attr _k , and Attr _k (o, e) is 0 when both o and e have attribute k and the two values are not equal, and is 1 in other cases. K is the total number of attributes referring to the object.

(3)Temp(o，e)(3)Temp(o,e)

Temp(o，e)表示模型对象o和指称e之间的时间兼容性，用于衡量时间对指称归结的影响。它是一个分段函数：Temp(o, e) represents the temporal compatibility between the model object o and the reference e, and is used to measure the impact of time on reference resolution. It is a piecewise function:

当o和e在同一次交互中时，Temp(o，e)的计算过程如下：When o and e are in the same interaction, the calculation process of Temp(o, e) is as follows:

Temp(o，e)＝exp(-|Time(o)-Time(e)|)Temp(o,e)=exp(-|Time(o)-Time(e)|)

当o和e在不同交互中时，Temp(o，e)的计算过程如下：When o and e are in different interactions, the calculation process of Temp(o, e) is as follows:

Temp(o，e)＝exp(-|OrderIndex(o)-OrderIndex(e)|)Temp(o,e)=exp(-|OrderIndex(o)-OrderIndex(e)|)

其中Time(o)为指点手势发生时间，Time(e)为指称发生时间，单位为秒；OrderIndex(o)表示o在指点手势序列中的先后次序，OrderIndex(e)表示e在指称序列中的先后次序。处于聚焦、激活或沉寂状态对象的Temp(o，e)＝1。Among them, Time(o) is the time when the pointing gesture occurs, Time(e) is the time when the reference occurs, and the unit is second; OrderIndex(o) indicates the order of o in the sequence of pointing gestures, and OrderIndex(e) indicates the order of e in the sequence of references priority. Temp(o,e)=1 for objects in focus, activation or silence.

当指称与处于某一状态(即位于某一矩阵内)的模型对象经对比匹配后，指称对象即得到确认。When the reference and the model object in a certain state (that is, in a certain matrix) are compared and matched, the reference object is confirmed.

所述的基于语音和手势的多通道人机交互方法中，所述虚拟环境中的模型对象被划分为指点对象、聚焦对象、激活对象以及沉寂对象四类，所述指点对象为位于当前指点手势所限定的指点区域内的对象，所述聚焦对象为在上一次人机交互过程中所被确定的指称对象，所述激活对象为位于可视范围内的除指点对象和激活对象以外的模型对象，所述沉寂对象为位于不可视范围内的除指点对象和激活对象以外的模型对象，在步骤三中，将上述语音指称对象约束信息及手势指称对象约束信息按顺序逐一与上述指点对象、聚焦对象、激活对象、沉寂对象的特征信息进行对比，确定出人机交互的指称对象。In the multi-channel human-computer interaction method based on voice and gestures, the model objects in the virtual environment are divided into four categories: pointing objects, focusing objects, activation objects and silent objects, and the pointing objects are located in the current pointing gesture Objects within the defined pointing area, the focus object is the reference object determined during the last human-computer interaction, and the activation object is a model object located within the visible range other than the pointing object and the activation object , the silent object is a model object located in an invisible range except for the pointing object and the activation object. The feature information of the active object, the active object, and the silent object are compared to determine the referent object of the human-computer interaction.

所述的基于语音和手势的多通道人机交互方法中，在所述步骤二中，从上述语音信息中提取语音指称对象约束信息和从上述手势信息中提取手势指称对象约束信息是通过以下方式实现的：构建多通道分层整合模型，所述多通道分层整合模型包括有四层，分别为物理层、词法层、语法层和语义层，其中，所述物理层接收分别由语音通道和手势通道输入的语音信息和手势信息，所述词法层包括有语音识别解析模块和手势识别解析模块，所述语音识别解析模块将物理层的语音信息解析为语音指称对象约束信息，所述手势识别解析模块将物理层的手势信息解析为手势指称对象约束信息。In the multi-channel human-computer interaction method based on voice and gesture, in the second step, extracting the voice reference object constraint information from the above voice information and extracting the gesture reference object constraint information from the above gesture information in the following manner Achieved: build a multi-channel layered integration model, the multi-channel layered integration model includes four layers, respectively physical layer, lexical layer, syntax layer and semantic layer, wherein, the physical layer receives voice channel and The voice information and gesture information input by the gesture channel, the lexical layer includes a voice recognition and analysis module and a gesture recognition and analysis module, the voice recognition and analysis module analyzes the voice information of the physical layer into voice reference object constraint information, and the gesture recognition The parsing module parses the gesture information of the physical layer into constraint information of the gesture referent.

所述的基于语音和手势的多通道人机交互方法中，所述步骤三中，将上述语音指称对象约束信息及手势指称对象约束信息与虚拟环境中模型对象的特征信息进行对比，确定出人机交互的指称对象，所述指称对象的确定是在所述语法层上实现的，从上述语音指称对象约束信息中提取对指称对象的命令信息是通过以下方式实现的：所述语法层从语音指称对象约束信息中提取命令信息，将命令信息作用于指称对象是通过以下方式实现的：所述语义层将语法层所提取的命令信息作用于指称对象。In the multi-channel human-computer interaction method based on voice and gesture, in the third step, the above-mentioned voice reference object constraint information and gesture reference object constraint information are compared with the feature information of the model object in the virtual environment to determine the human The referent object of machine-computer interaction, the determination of the referent object is realized on the grammar layer, and the command information for the referent object is extracted from the above-mentioned phonetic referent object constraint information in the following manner: the grammar layer extracts the command information from the speech Extracting command information from the constraint information of the referent object, and applying the command information to the referent object is realized in the following manner: the semantic layer applies the command information extracted by the syntax layer to the referent object.

所述的基于语音和手势的多通道人机交互方法中，所述多通道分层整合模型还包括有任务槽，所述任务槽包括命令表项以及指称对象表项，其中所述语义层将语法层所提取的命令信息作用于指称对象是通过以下方式进行的：所述语义层将语法层所提取的命令信息填入命令表项，将指称对象填入指称对象表项，所述任务槽被填充完整，所述多通道分层整合模型生产系统可执行命令。In the multi-channel human-computer interaction method based on speech and gestures, the multi-channel layered integration model further includes a task slot, and the task slot includes a command entry and a referent entry, wherein the semantic layer will The command information extracted by the grammatical layer acts on the referring object in the following manner: the semantic layer fills the command information extracted by the grammatical layer into the command table item, fills the referring object into the referring object table item, and the task slot Completely populated, the multi-channel layered integrated model production system can execute commands.

所述的基于语音和手势的多通道人机交互方法中，在所述任务槽未填充完整的情况下，设置等待时间，所述任务槽在等待时间内被填充完整，则继续此次人机交互，所述任务槽在等待时间内未被填充完整，则放弃此次人机交互。In the multi-channel human-computer interaction method based on voice and gestures, when the task slot is not filled completely, a waiting time is set, and the task slot is filled completely within the waiting time, then the human-machine interaction is continued. If the task slot is not completely filled within the waiting time, the human-computer interaction is abandoned.

所述的基于语音和手势的多通道人机交互方法中，所述命令表项包括有动作表项和参数表项，所述语音指称对象约束信息中提取对指称对象的命令信息时，所述命令信息包括动作信息和参数信息。In the multi-channel human-computer interaction method based on speech and gestures, the command entry includes an action entry and a parameter entry, and when the command information for the referent is extracted from the speech reference object constraint information, the Command information includes action information and parameter information.

所述的基于语音和手势的多通道人机交互方法中，所述步骤一中，在语音通道接收到第一个语句时，开始一次人机交互过程。In the multi-channel human-computer interaction method based on voice and gesture, in the first step, when the voice channel receives the first sentence, a human-computer interaction process is started.

所述的基于语音和手势的多通道人机交互方法中，所述步骤一中，在语音通道接收到一个语句时，设置超时时间以接收手势通道的手势信息的输入，如手势信息的输入超出所设超时时间，则放弃此次人机交互过程。In the described multi-channel human-computer interaction method based on voice and gesture, in the step 1, when the voice channel receives a sentence, the timeout period is set to receive the input of the gesture information of the gesture channel, such as the input of the gesture information exceeds If the set timeout time is exceeded, the human-computer interaction process is abandoned.

尽管本发明的实施方案已公开如上，但其并不仅仅限于说明书和实施方式中所列运用，它完全可以被适用于各种适合本发明的领域，对于熟悉本领域的人员而言，可容易地实现另外的修改，因此在不背离权利要求及等同范围所限定的一般概念下，本发明并不限于特定的细节和这里示出与描述的图例。Although the embodiment of the present invention has been disclosed as above, it is not limited to the use listed in the specification and implementation, it can be applied to various fields suitable for the present invention, and it can be easily understood by those skilled in the art Therefore, the invention is not limited to the specific details and examples shown and described herein without departing from the general concept defined by the claims and their equivalents.

Claims

1. the multimodal human-computer interaction method based on voice and gesture, is characterized in that, comprises the following steps:

Step 1, structure voice channel and gesture passage, and the referent of man-machine interaction is carried out to the input of voice messaging and gesture information by voice channel and gesture passage respectively;

When voice channel receives first statement, start interactive process one time;

When voice channel receives a statement, time-out time is set, as the input of gesture information exceeds set time-out time, abandon this interactive process;

Step 2, extract voice referent constraint information from above-mentioned voice messaging; extract gesture referent constraint information from above-mentioned gesture information; wherein, described gesture referent constraint information comprises that any point in the indication zone that current indication gesture limits arrives the distance statistics amount at the indication center of giving directions gesture and the time statistic that above-mentioned indication gesture maintains;

Extract voice referent constraint information and extract gesture referent constraint information from above-mentioned voice messaging from above-mentioned gesture information and be achieved in the following ways:

Build hyperchannel layering Integrated Models; described hyperchannel layering Integrated Models includes four layers; be respectively Physical layer, morphology layer, grammer layer and semantic layer; wherein; described Physical layer receives voice messaging and the gesture information of being inputted by voice channel and gesture passage respectively; described morphology layer includes speech recognition parsing module and gesture identification parsing module; described speech recognition parsing module resolves to voice referent constraint information by the voice messaging of Physical layer, and described gesture identification parsing module resolves to gesture referent constraint information by the gesture information of Physical layer;

Step 3, model object in virtual environment is divided into the indication object, focal object, activate object and quiet object four classes, described indication object is the object that is positioned at the indication zone that current indication gesture limits, the referent of described focal object for being determined in upper once interactive process, described activation object is the model object except giving directions object and activation object that is positioned at visual range, described quiet object is the model object except giving directions object and activation object that is positioned at not visible scope, by above-mentioned voice referent constraint information and gesture referent constraint information in order one by one with above-mentioned indication object, focal object, activate object, the characteristic information of quiet object is contrasted, determine the referent of man-machine interaction, extract the command information to referent from above-mentioned voice referent constraint information, command information is acted on to referent, complete a man-machine interaction,

The characteristic information of model object in above-mentioned voice referent constraint information and gesture referent constraint information and virtual environment is contrasted, determine the referent of man-machine interaction, the definite of described referent realize on described grammer layer,

The command information of extracting referent from above-mentioned voice referent constraint information is achieved in the following ways:

Described grammer layer extracts command information from voice referent constraint information,

Command information is acted on to referent to be achieved in the following ways:

The command information that described semantic layer extracts the grammer layer acts on referent;

Described hyperchannel layering Integrated Models also includes the task groove, and described task groove comprises order list item and referent list item,

The command information that wherein said semantic layer extracts the grammer layer acts on referent and carries out in the following manner:

The command information that described semantic layer extracts the grammer layer is inserted the order list item, and referent is inserted to the referent list item, and described task groove is filled complete, described hyperchannel layering Integrated Models production system executable command;

In the situation that described task groove is not filled is complete, the stand-by period is set, described task grain is filled complete within the stand-by period, continues this man-machine interaction, and described task groove is not filled complete within the stand-by period, abandons this man-machine interaction;

Described order list item includes action list item and parameter list item, and while extracting the command information to referent in described voice referent constraint information, described command information comprises action message and parameter information.