CN102262624A

CN102262624A - System and method for realizing cross-language communication based on multi-mode assistance

Info

Publication number: CN102262624A
Application number: CN201110225342XA
Authority: CN
Inventors: 徐常胜; 程健; 梁超; 张歆明
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2011-08-08
Filing date: 2011-08-08
Publication date: 2011-11-30

Abstract

The present invention proposes a cross-language communication system and method based on multimodal assistance. The method uses the front-end interaction module, data management module and semantic association module of the cross-language communication system to analyze conversation content, using natural language processing tools to automatically extract the central topic and keywords of the conversation. The semantic association module then automatically searches for relevant pictures and video clips based on the detected central topic and keywords and presents them to both parties in an appropriate way, so as to promote mutual understanding and communication. The pictures and videos used as aids to understanding can be crawled automatically from the Internet by search, or obtained directly from a pre-annotated multimedia library. Finally, the system generates a multimodal conversation summary based on the text chat information of the two parties and the corresponding picture and video content.

Description

System and method for realizing cross-language communication based on multimodal assistance

Technical field

The invention belongs to the fields of multimedia analysis and network communication, and relates to a method for realizing cross-language communication based on multimodal assistance.

Background art

With the rapid development of communication and Internet technology, network instant messaging systems such as MSN and QQ have emerged, which differ completely from traditional means of communication such as mail, telephone and telegram. Traditional mail and telegrams are mainly text-based and telephone calls are mainly voice-based, whereas instant messaging can use not only text and voice but also rich multimedia such as video and pictures. Through instant messaging, people separated by oceans can talk in real time as if face to face. The whole earth has become a true global village.

For interlocutors who speak different languages, language remains a barrier in instant messaging that is hard to overcome. In recent years, great progress in machine translation has to some extent solved the language problems that arise when users of different languages communicate. However, machine translation has two obvious shortcomings. The first is translation accuracy: machine translation can still only translate relatively simple conversations automatically. Even between English and Chinese, the two most widely used languages in the world, the accuracy of automatic translation still cannot fully meet the needs of daily use. Considering the many minority languages in the world, accurate automatic translation between arbitrary languages still has a long way to go. The second is polysemy: words with multiple meanings pose another challenging problem for machine translation.

The prior art includes a text-to-picture synthesis system for augmenting communication, which presents the main content of the input text in the form of pictures. It converts text to pictures through three optimizations: maximizing the probability of the keywords given the input text; maximizing the probability of the corresponding pictures given the input text and the selected keywords; and optimizing the spatial layout of text and pictures given the input text, the selected keywords and the corresponding pictures. Together, these three optimizations complete the conversion from text to pictures. However, this system has the following three disadvantages:

1). The system is slow. Because it must compute these optimizations, the text-to-picture conversion is slowed down;

2). The system's interface is unfriendly. The input text and the selected pictures must be optimized jointly to obtain a spatial layout, which is then presented to the user. Applying such a layout of intermixed text and pictures to a conversation between users is bound to feel unfriendly.

3). The system is not easy to use. Because it is client software, users have to download it themselves. A web-based implementation can overcome this shortcoming.

Summary of the invention

The purpose of the present invention is to overcome the slow processing and poor usability of the prior art, and to use multimodal information to help people who speak different languages communicate smoothly online. Multimodal information such as images and videos reduces the ambiguity and polysemy introduced by traditional automatic translation and assists the semantic understanding of the users' dialogue. To this end, the present invention provides a method for realizing cross-language communication based on multimodal assistance.

To achieve this purpose, a first aspect of the present invention provides a cross-language communication system based on multimodal assistance. The system comprises a front-end interaction module, a data management module and a semantic association module, wherein:

The input of the front-end interaction module accepts the text chat content entered by the user and preprocesses it to obtain the text information of the chat, and the output of its front-end/back-end interaction submodule transmits the processed text chat content. The chat page of the front-end interaction module displays to the user the text of the conversation between the two parties and the multimedia pictures that the system recommends based on the content of that conversation;

The input of the semantic association module is connected to the output of the front-end interaction module. The semantic association module receives and analyzes the user's text chat content, uses natural language processing tools to extract the main content of the conversation, obtains and outputs the text information together with its translation and the corresponding multimedia information, and generates a multimodal summary from the text chat content, the translated content and the corresponding multimedia information;

The input of the data management module is connected to the output of the semantic association module. The data management module stores the newly input text chat content, the translated content and the corresponding multimedia information. It also merges the historical user information with the new user information, and generates and displays the full text of the conversation between the two parties together with the multimedia pictures that the system recommends based on the content of that conversation.

In a preferred embodiment, after the back-end semantic association module receives the text information sent by the user, it integrates the result of Google Translate so that chat users of different languages can understand what the other party means in their own language. In addition to the original chat message, a Google Translate rendering of that message is therefore attached.

In a preferred embodiment, the semantic association module uses the extracted main content of the conversation as keywords and retrieves a corresponding set of candidate pictures from the image database by text-based image retrieval.

To achieve this purpose, a second aspect of the present invention provides a method for realizing cross-language communication using the cross-language communication system based on multimodal assistance. The method takes the users' chat dialogue as its basis and, according to the results of analyzing the conversation content with text analysis techniques, provides users with multimedia elements that assist semantic understanding between users who face language barriers or differ in cultural background. The method comprises the following steps:

Step S1: The user first sends the text they want to say to the other party through the front-end interface of the semantic chat. The front-end interface passes the text of the chat to the back-end semantic association module through the front-end/back-end interaction module built with Ajax. The user's conversation content is analyzed with a topic-based cross-modal analysis method, and natural language processing tools automatically extract the central topic and keywords of the conversation;

Step S2: Based on the central topic and keywords of the conversation, the semantic association module uses text-based image retrieval to automatically retrieve picture sets and video clips related to the conversation topic from a database or the Internet, and provides them to both parties;

Step S3: Based on the text chat information of the two parties and the corresponding pictures and video clips, the system generates a multimodal conversation summary, so that smooth semantic communication between users of different languages is ultimately achieved in multimedia form. The system can likewise generate a multimodal conversation summary for the two parties from their text chat history and the corresponding picture and video content.
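The three steps above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the in-memory `MEDIA_LIBRARY`, and the function names `extract_keywords`, `retrieve_media`, `build_summary` and `run_pipeline` are all hypothetical, and simple word counting stands in for the natural language processing tools that the text leaves unspecified.

```python
import re
from collections import Counter

# Hypothetical stop-word list and pre-annotated media library (illustration only).
STOP_WORDS = {"i", "a", "the", "to", "and", "with", "would", "like", "please",
              "of", "you", "do", "want", "what", "kind"}
MEDIA_LIBRARY = {
    "pizza": ["pizza_margherita.jpg", "pizza_pepperoni.jpg"],
    "cola": ["cola.jpg"],
}


def extract_keywords(messages, top_n=3):
    """Step S1 (simplified): keep the most frequent non-stop-words."""
    words = []
    for text in messages:
        tokens = re.findall(r"[a-z]+", text.lower())
        words.extend(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in Counter(words).most_common(top_n)]


def retrieve_media(keywords):
    """Step S2 (simplified): text-based lookup in the annotated library."""
    return {k: MEDIA_LIBRARY[k] for k in keywords if k in MEDIA_LIBRARY}


def build_summary(messages, keywords, media):
    """Step S3 (simplified): bundle chat text, keywords and media together."""
    return {"messages": list(messages), "keywords": keywords, "media": media}


def run_pipeline(messages):
    """Run S1 to S3 over the chat messages collected so far."""
    keywords = extract_keywords(messages)
    media = retrieve_media(keywords)
    return build_summary(messages, keywords, media)
```

Calling `run_pipeline` on a short pizza-ordering dialogue would return the messages together with the extracted keywords and whichever library pictures match them; keywords without a library entry simply contribute no media.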

In a preferred embodiment, the multimodal conversation summary contains text, audio, image and video information, providing users with multimedia elements that assist semantic understanding between users who face language barriers or differ in cultural background.

In a preferred embodiment, the pictures and video clips are crawled automatically from the network by search, or obtained directly from a pre-annotated multimedia library.

In a preferred embodiment, the multimodal conversation summary is a topic-based summary; the topic is detected using a relational network and statistics of how frequently the words that appeared in the previous conversation co-occur in a predefined corpus.

Beneficial effects of the present invention: the core of the present invention is how to describe text information through multimedia information (images or video). The proposed cross-language communication system based on multimodal assistance provides a friendly and convenient environment for online instant messaging and has three main features. First, friendliness: topic-related image or video search is used to assist the understanding of text content, which greatly reduces the polysemy and ambiguity of translation. Second, interactivity: the system can better meet users' individual needs. Third, ease of use: the system can automatically generate a multimedia summary from the conversation record.

To assist communication and understanding between users, the system of the present invention adopts a topic-based cross-modal analysis method. Based on the text chat information of the two parties and the corresponding picture and video content, the system generates a multimodal conversation summary. Because this multimodal conversation contains rich, intuitive and easily understood auxiliary information such as images, videos and text, it effectively eliminates the ambiguity that arises in automatic translation between plain texts, improves the efficiency and quality of language communication, and enables smooth semantic communication between users of different languages.

Description of the drawings

Fig. 1 is an interface diagram of the cross-language communication system based on multimodal assistance of the present invention;

Fig. 2 is a structural block diagram of the cross-language communication system based on multimodal assistance of the present invention;

Figs. 3a and 3b show example results for ordering a pizza;

Fig. 4 is an example of a multimedia summary of conversation content.

Detailed description

To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

The present invention proposes a cross-language communication system based on multimodal assistance and a method for realizing cross-language communication. The method uses a front-end interaction module 1, a data management module 2 and a semantic association module 3. By analyzing the conversation content with natural language processing tools, it automatically extracts the central topic and keywords of the conversation. Based on the detected central topic and keywords, the semantic association module 3 automatically searches for relevant pictures and video clips and presents them to both parties in an appropriate way, so as to promote mutual understanding and communication. The pictures and videos used as aids to understanding can be crawled automatically from the network by search, or obtained directly from a pre-annotated multimedia library. Finally, the system generates a multimodal conversation summary based on the text chat information of the two parties and the corresponding picture and video content.

Fig. 1 shows the user interface of the multimedia chat system proposed by the present invention for assisting cross-language communication. It provides a friendly, interactive instant messaging environment for users of different languages. It offers three main functions: text communication based on instant translation, picture or video retrieval based on the conversation topic, and a multimedia summary of the conversation content (shown in Fig. 4). The top part of Fig. 1 mainly displays the name of the system and the topic of the users' chat. Below it is the main display area of the interface, which shows the text dialogue and the auxiliary multimedia information, for scenarios such as asking for directions, buying a car or booking a hotel. The right part of Fig. 1 is the user text chat area for text communication based on instant translation: it presents the users' basic text chat messages and the associated Google Translate output. The left part of Fig. 1 is the multimedia content display area for topic-based picture or video retrieval and for the multimedia summary of the conversation: it presents multimedia information related to the content of the conversation to assist the users' semantic understanding.

Fig. 2 shows a structural block diagram of the cross-language communication system based on multimodal assistance of the present invention. The framework of the system is divided into three components: the front-end interaction module 1, the data management module 2 and the semantic association module 3. The front-end design comprises two parts, the chat interface and the front-end/back-end interaction. The front-end interaction module 1 accepts the text chat content entered by the user and preprocesses it to obtain the text information of the chat. The output of the front-end/back-end interaction submodule of the front-end interaction module 1 transmits the processed text chat content to the semantic association module 3. The chat page of the front-end interaction module 1 displays to the user the text of the conversation between the two parties and the multimedia pictures that the system recommends based on the content of that conversation.

The input of the semantic association module 3 is connected to the output of the front-end interaction module 1. After receiving and analyzing the user's text chat content, the semantic association module 3 uses natural language processing tools to extract the main content of the conversation, obtains and outputs the text information together with its translation and the corresponding multimedia information, and generates a multimodal summary from the text chat content, the translated content and the corresponding multimedia information. The semantic association module 3 outputs the text chat content, the translated content and the corresponding multimedia information together to the data management module 2.

The input of the data management module 2 is connected to the output of the semantic association module 3. The data management module 2 stores the newly input text chat content, the translated content and the corresponding multimedia information. It also merges the historical user information with the new user information, generates the full text of the conversation between the two parties together with the multimedia pictures that the system recommends based on the content of that conversation, and returns all of this to the front-end interaction module 1. Finally, the chat page of the front-end interaction module 1 displays all of the information to the user. The workflow of the modules is described in detail below.

The user first sends chat content to the front-end interaction module 1 through the chat interface. Referring again to Fig. 1, the user's semantic chat interface is divided into two main parts: one displays the text of the conversation between the two parties, as in a conventional chat, and the other displays the list of multimedia pictures that the system recommends based on the content of the conversation. The front-end interface then passes the text entered by the user to the back end through the front-end/back-end interaction module built with Ajax. The back-end framework is divided into two parts, the data management module 2 and the semantic association module 3. After the back end receives the text information sent by the user, the semantic association module 3 integrates the result of Google Translate so that chat users of different languages can understand what the other party means in their own language. In addition to the original chat message, a Google Translate rendering of that message is therefore attached. The semantic association module 3 then uses natural language processing tools to extract the main content of the conversation from the text. It takes this main content as keywords and retrieves a corresponding set of candidate pictures from the image database by text-based image retrieval. Finally, all of the user's dialogue and the corresponding multimedia information can be used to generate a multimodal summary. Fig. 4 illustrates the generated multimedia summary with the example of ordering a pizza. The multimodal summary in Fig. 4 shows that, in the conversation with the clerk at the pizzeria, the user chose the type of pizza, the drink and the payment method. With the pictures of the pizzeria's pizzas returned by the chat system, the user can choose more easily according to their own preferences. The multimodal summary is also useful if the user wants to order pizza again later, because the multimedia information it provides helps the user recall the earlier order.

The semantic association mechanism in Fig. 2 is described below. It is divided into three main parts: text communication based on instant translation, topic-based picture and video retrieval, and the multimodal summary generated from the users' text chat content and the corresponding multimedia information.

(1). Text communication based on instant translation

Like most instant messaging systems, the system proposed by the present invention supports basic text communication. However, the two parties in a conversation may have different language backgrounds. For example, when an English-speaking American chats online with a Chinese-speaking Chinese person, and the American does not understand Chinese while the Chinese person does not understand English, ordinary text chat cannot give them barrier-free communication. For this reason, the system of the present invention integrates a simple machine translation function. During a chat, the speaker's language is automatically translated into the recipient's language before it is displayed, so that both parties can roughly understand each other's intentions.
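The translate-before-display behaviour described above can be sketched as follows. This is a minimal illustration under assumptions: the `ChatMessage` record, the `prepare_message` helper and the dictionary-based `toy_translate` are hypothetical stand-ins, and no real translation service is called. A deployed system would pass in a callable that wraps whatever translation backend it uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy dictionary standing in for a real translation backend.
TOY_DICTIONARY = {
    ("en", "zh"): {"hello": "你好"},
    ("zh", "en"): {"你好": "hello"},
}


@dataclass
class ChatMessage:
    sender: str
    source_lang: str
    target_lang: str
    original: str    # text exactly as typed by the sender
    translated: str  # text shown to the recipient in their own language


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Look the text up in the toy dictionary; fall back to the input."""
    table = TOY_DICTIONARY.get((source_lang, target_lang), {})
    return table.get(text.lower(), text)


def prepare_message(sender: str, text: str, source_lang: str, target_lang: str,
                    translate: Callable[[str, str, str], str] = toy_translate) -> ChatMessage:
    """Attach a translation so both the original and its rendering can be shown."""
    if source_lang == target_lang:
        translated = text
    else:
        translated = translate(text, source_lang, target_lang)
    return ChatMessage(sender, source_lang, target_lang, text, translated)
```

Keeping the original text alongside the translation mirrors the description, in which the translated version is attached to the original chat message rather than replacing it.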

(2). Topic-based picture and video retrieval

Even with machine translation as a bridge, cross-language communication is still not fully satisfactory. The main reason is that the accuracy of machine translation (how understandable the translated target-language text is) remains low. Translation between major languages, such as English and Chinese, still falls short of a practical standard. In addition, polysemous words and sentences in everyday language make it hard for machine translation to meet real needs. In Fig. 3a, food includes seafood, fruit and meat, and fruit includes bananas, apples and oranges. The word "apple", for example, can denote either a fruit or the company Apple (Fig. 3a). To create an easy-to-understand, immersive online communication environment, we designed a topic-based picture/video retrieval submodule that helps users with different language backgrounds communicate with each other. Its three main functions are topic detection, picture retrieval and relevance feedback.

Topic detection is implemented in two ways. In the first, the user selects a topic from a predefined list of topics. Different topics are associated with different annotated picture/video databases (annotated manually or by learning methods). In the second, topic keywords are extracted by text analysis. Many entity words that represent the content of a conversation can be extracted from it. From these entity words, we first build a WordNet-like semantic relation tree that captures the semantic inheritance relations between words. As shown in Fig. 3a, the words "apple", "banana" and "orange" all belong to the fruit subclass of the food class, while, as shown in Fig. 3b, the word "apple" may at the same time belong to the computer brand class together with "Dell" and "Lenovo". The examples for the "Apple" computer brand in Fig. 3b include the Mac desktop computer, the iPad tablet and the iPhone smartphone. These semantic relations can be extracted from WordNet, or obtained from the term frequency-inverse document frequency (TF-IDF) weights of the words computed over a predefined corpus. Once keywords have been extracted from the conversation, the system can automatically infer the underlying topic by analyzing the semantic relations between them.
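The keyword-based route can be illustrated with a toy semantic relation tree built from the "apple" example in the text. This is a sketch under assumptions: the flat `SEMANTIC_TREE` dictionary, the overlap-count scoring and the names `score_topics` and `infer_topic` are hypothetical simplifications; the text itself only says that relations come from WordNet or TF-IDF statistics and does not specify a scoring rule.

```python
# Hypothetical class-to-members mapping, mirroring the examples of Figs. 3a and 3b.
SEMANTIC_TREE = {
    "fruit": {"apple", "banana", "orange"},
    "computer brand": {"apple", "dell", "lenovo"},
}


def score_topics(keywords):
    """Count how many conversation keywords fall under each class."""
    kw = {k.lower() for k in keywords}
    return {topic: len(kw & members) for topic, members in SEMANTIC_TREE.items()}


def infer_topic(keywords):
    """Return the best-matching classes, sorted by name.

    More than one class is returned when the keywords are still ambiguous,
    and an empty list when no keyword is covered by the tree.
    """
    scores = score_topics(keywords)
    best = max(scores.values(), default=0)
    if best == 0:
        return []
    return sorted(topic for topic, score in scores.items() if score == best)
```

With this toy tree, "apple" together with "banana" resolves to the fruit class and "apple" together with "dell" resolves to the computer brand class, while "apple" alone stays ambiguous and yields both candidates.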

Based on the topic extracted from the dialogue, the system automatically retrieves the corresponding pictures from the network or the back-end database. With text-based retrieval, annotated pictures related to the conversation topic are easy to find. However, most pictures on the network are unannotated. We therefore use the retrieved pictures that are associated with annotated text as a training set to learn a topic model, and use this topic model to retrieve the large number of unannotated pictures. Topic-based picture retrieval thus first requires building a topic model, whose goal is to automatically find a latent semantic space that models the document information in the retrieval process more accurately. Here, the semantic structure of a document comprises a number of latent concepts or topics, which often correspond to stable and characteristic co-occurrence patterns between words. A document can be represented as a weighted combination of latent topics, and the weighting coefficients can be regarded as a feature representation of the document. This representation has several advantages. First, the semantic space usually has a lower dimensionality than the word space, which saves storage and speeds up search. Second, converting from word space to semantic space not only reduces the noise in the word vectors but also addresses the polysemy and ambiguity problems described above, thereby improving retrieval performance. For example, the word "apple" can denote either a fruit or a computer brand (Fig. 3b). Its exact meaning can be inferred from the other related keywords on the same topic.
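One common way to write down such a latent topic representation is the mixture form below. The notation is an assumption for illustration, since the text does not name a specific topic model: $d$ is a document (here, the text associated with a picture), $w$ a word from a vocabulary of size $V$, and $z_1, \dots, z_K$ are the latent topics.

```latex
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z_k)\, p(z_k \mid d),
\qquad
\theta_d = \bigl( p(z_1 \mid d), \dots, p(z_K \mid d) \bigr),
\qquad K \ll V .
```

Under this reading, the vector $\theta_d$ holds the weighting coefficients that serve as the document's feature representation, and $K \ll V$ expresses the lower dimensionality of the semantic space. The sense of an ambiguous word such as "apple" in a given document then follows from the posterior $p(z_k \mid w, d) \propto p(w \mid z_k)\, p(z_k \mid d)$, which depends on the other words of the document through $p(z_k \mid d)$.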

Feedback is a popular human-computer interaction technique that is widely used in the analysis of text and visual information. Based on the user's feedback on its output, the system can adaptively correct itself. The supervision obtained from user feedback has proven effective in practice. In our system, the user can select the correct topic from the candidate list produced by the automatic topic extraction algorithm. The selected topic is then used in the next round of topic extraction by modeling the temporal relationship between the current topic and the next one. For image retrieval, our system lists a number of retrieved sample images and invites the user to rate their relevance to the conversation topic.
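One simple way to let a confirmed topic influence the next extraction round is sketched below in Python. The description does not specify how the temporal topic relationship is modeled, so the transition-count scheme, the `TopicFeedback` class, its `smoothing` parameter and the example topics are all assumptions made for illustration.

```python
from collections import defaultdict


class TopicFeedback:
    """Re-rank candidate topics using transitions the user has confirmed so far.

    Assumption: the temporal relation is modeled as simple transition counts
    between consecutive confirmed topics; the source does not fix the model.
    """

    def __init__(self, smoothing=1.0):
        # transitions[previous][next] = how often the user confirmed that step
        self.transitions = defaultdict(lambda: defaultdict(float))
        self.previous = None
        self.smoothing = smoothing

    def rerank(self, candidates):
        """candidates maps topic -> extraction score; returns topics, best first."""
        def weight(topic):
            prior = self.smoothing
            if self.previous is not None:
                prior += self.transitions[self.previous][topic]
            return candidates[topic] * prior

        return sorted(candidates, key=lambda t: (-weight(t), t))

    def confirm(self, topic):
        """Record the topic the user selected from the candidate list."""
        if self.previous is not None:
            self.transitions[self.previous][topic] += 1.0
        self.previous = topic


candidates = {"food": 0.4, "sports": 0.5}

# Without any feedback the raw extraction scores decide the order.
no_feedback = TopicFeedback().rerank(candidates)

# After the user has confirmed travel -> food once, "food" is boosted
# whenever the conversation is currently on "travel".
feedback = TopicFeedback()
for chosen in ["travel", "food", "travel"]:
    feedback.confirm(chosen)
with_feedback = feedback.rerank(candidates)
```

The multiplicative prior keeps the extraction score in play, so feedback can reorder close candidates without overriding a strongly supported topic.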

(3). Multimodal Summary

Traditional instant messaging usually stores chat records as plain text. In our system, users can express their intentions multimodally, through images, videos and text. Saving the chat in multimodal form rather than as text alone yields a more vivid record than before.

Summarization of text, images and video is an active research topic in natural language processing and multimedia. A summary conveys the information of the original text (image set or video) through a shorter and more concise text (image set or video). Most existing techniques build the summary from salient features, repeated modalities, or keywords (key frames). Because our system handles a large amount of image and video information in addition to text, we adopt a topic-driven summarization method that analyzes the conversation between users and generates a summary for a specific topic. The summary contains the text, images and video related to that topic.
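A topic-driven multimodal summary can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Message` record, the `summarize_topic` function and the selection heuristic (prefer messages that carry media, then longer messages) are hypothetical and are not taken from the description, which does not state how summary sentences are chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """One chat turn, already labeled with its topic by an earlier step."""
    speaker: str
    text: str
    topic: str
    media: list = field(default_factory=list)  # image or video identifiers


def summarize_topic(messages, topic, max_sentences=2):
    """Collect the text and media of one topic into a compact summary."""
    related = [m for m in messages if m.topic == topic]
    # Assumed heuristic: messages with more media, then longer text, come first.
    chosen = sorted(related, key=lambda m: (-len(m.media), -len(m.text)))
    chosen_ids = {id(m) for m in chosen[:max_sentences]}
    # Emit the chosen messages in their original chat order.
    text = [f"{m.speaker}: {m.text}" for m in related if id(m) in chosen_ids]
    # Gather every media item used for the topic, without duplicates.
    media = []
    for m in related:
        for item in m.media:
            if item not in media:
                media.append(item)
    return {"topic": topic, "text": text, "media": media}


chat = [
    Message("A", "I love Peking duck", "food", ["duck.jpg"]),
    Message("B", "Me too", "food"),
    Message("A", "Let's watch football", "sports", ["match.mp4"]),
    Message("B", "Roast duck is crispy", "food", ["duck.jpg", "roast.mp4"]),
]
food_summary = summarize_topic(chat, "food")
```

Keeping the media list alongside the selected sentences is what makes the summary multimodal: the record for a topic carries both what was said and what was shown.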

The above describes only specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto. Any modification or substitution that a person skilled in the art could conceive within the technical scope disclosed herein shall fall within the scope of protection of the claims of the present invention.

Claims

1. A cross-language communication system based on multimodal assistance, characterized in that the system includes: a front-end interaction module, a data management module and a semantic association module, wherein:

The input of the front-end interaction module accepts the text chat content entered by the user and preprocesses it to obtain the text information of the user's chat, and the processed text chat content is transmitted through the output of the front-end interaction module; the chat page of the front-end interaction module displays to the user the text of the conversation between the two parties and the multimedia pictures recommended by the system according to that conversation;

The input of the semantic association module is connected to the output of the front-end interaction module; the semantic association module receives and analyzes the user's text chat content, uses natural language processing tools to extract the main content of the conversation between the two parties, obtains and outputs the text information associated with the translated text together with the corresponding multimedia information, and generates a multimodal summary based on the text chat content, the translated content and the corresponding multimedia information;

The input of the data management module is connected to the output of the semantic association module; the data management module stores the newly entered text chat content, the translated content and the corresponding multimedia information, integrates the historical user information with the new user information, and generates and displays the full text of the conversation between the two chatting parties together with the multimedia picture information recommended by the system according to that conversation.

2. The cross-language communication system based on multimodal assistance as claimed in claim 1, characterized in that, after the back-end semantic association module receives the text information sent by a user, it integrates Google Translate results to help chat users of different languages understand the other party's meaning from the perspective of their own language; in this way, a Google Translate rendering of the chat content is attached alongside the original chat message.

3. The cross-language communication system based on multimodal assistance as claimed in claim 1, wherein the semantic association module extracts the main content of the conversation between the two parties, uses that main content as keywords, and retrieves the corresponding candidate image set from the image database by text-based image retrieval.

4. A method for realizing cross-language communication using the cross-language communication system based on multimodal assistance described in claim 1, characterized in that, while the users chat, the method analyzes the conversation content with text analysis techniques and, based on the result, provides multimedia elements to assist semantic understanding between users who face a language barrier or differ in cultural background, the method comprising the following steps:

Step S1: The user first sends the text intended for the other party through the front-end interface of the semantic chat; a modal analysis method then analyzes the content of the user's conversation, and natural language processing tools automatically extract the central topic and keywords of the conversation;

Step S2: Based on the central topic and keyword information of the conversation, the semantic association module uses text-based image retrieval to automatically retrieve the relevant image sets and video clips from the database or the Internet according to the conversation topic and provides them to both parties;

Step S3: The system generates a multimodal conversation summary based on the text chat information of the two parties and the corresponding images and video clips, so that smooth semantic communication between users of different languages is ultimately achieved in multimedia form; at the same time, the system can generate a multimodal conversation summary for the two parties based on their text chat history and the corresponding image and video content.

5. The method for realizing cross-language communication as claimed in claim 4, characterized in that the multimodal conversation summary includes text, audio, image and video information, and provides users with multimedia elements to assist semantic understanding between users who face a language barrier or differ in cultural background.

6. The method for realizing cross-language communication as claimed in claim 4, characterized in that the images and video clips are either collected automatically from the Internet by search or obtained directly from a pre-annotated multimedia library.

7. The method for realizing cross-language communication as claimed in claim 4, characterized in that the multimodal conversation summary is a topic-based summary, in which topics are detected using a relational network and the co-occurrence frequency of words in a predefined corpus.