CN107315742A

CN107315742A - The Interpreter's method and system that personalize with good in interactive function

Info

Publication number: CN107315742A
Application number: CN201710535661.8A
Authority: CN
Inventors: 陈炜; 王峰; 徐爽; 徐波
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Beijing Zidong Cognitive Technology Co Ltd
Priority date: 2017-07-03
Filing date: 2017-07-03
Publication date: 2017-11-03

Abstract

The invention provides an anthropomorphic spoken language translation method with a man-machine dialogue function, which comprises the following steps: performing intelligent speech recognition on the source language voice to obtain the source language text; processing the source language text and the dialogue scene to perform anthropomorphic Man-machine dialogue and communication; perform machine translation and get translation results. The invention also provides an anthropomorphic spoken language translation system with man-machine dialogue function. According to the needs of the translation task, the present invention conducts man-machine dialogue with the user when necessary, can significantly improve the user's translation experience in complex application scenarios, and improve the accuracy of translation semantics.

Description

Anthropomorphic spoken language translation method and system with man-machine dialogue function

技术领域technical field

本发明涉及计算机和人工智能领域，尤其涉及一种把拟人化的人机对话机制加入翻译过程中的口语翻译方法及相应的系统。The invention relates to the field of computer and artificial intelligence, in particular to a spoken language translation method and a corresponding system which adds an anthropomorphic man-machine dialogue mechanism into the translation process.

背景技术Background technique

随着互联网的普及应用和全球化的快速推进，口语翻译作为对人工翻译高成本、高门槛、供需失衡等问题的有效解决方案，在日常生活、商务洽谈、国际交流等多个场景下具有旺盛的市场需求。With the popularization and application of the Internet and the rapid advancement of globalization, oral translation, as an effective solution to the problems of high cost, high threshold, and imbalance between supply and demand of human translation, has a strong role in many scenarios such as daily life, business negotiation, and international communication. market demand.

两种语言的口语翻译技术由图1所示构成，包括源语言和目标语言的语音识别、语音合成和双向翻译技术。其中双向语音识别和双向翻译是必须包含的技术，而语音合成则视翻译应用场景和设备而可选。The oral translation technology of two languages is composed as shown in Figure 1, including speech recognition, speech synthesis and two-way translation technology of source language and target language. Among them, two-way speech recognition and two-way translation are technologies that must be included, while speech synthesis is optional depending on the translation application scenario and equipment.

传统的口语自动翻译方法，一般由用户输入待翻译的源语言语音，自动识别并翻译后直接将目标语言的自然语音呈现给对方用户，口语识别或者翻译从用户视角仅仅是一种端到端的软件(如图2所示)。In the traditional automatic spoken language translation method, the user generally inputs the source language speech to be translated, automatically recognizes and translates it, and directly presents the natural speech of the target language to the other user. Spoken language recognition or translation is only an end-to-end software from the user's perspective. (as shown in picture 2).

受困于人类沟通语言的复杂性和多变性，即使人类翻译员也会通过各种方式与对话者进行沟通，以求获得对所需要翻译语音的准确内涵。而目前机器口语翻译方法，是一种不对实际场景复杂性和语义复杂性情况进行处理的端到端呈现翻译方法，显然难以满足准确度要求。同时由于翻译作为一个软件服务缺乏与用户的人机沟通，在实际应用场景中也难以满足场景友好性的要求。如何提高实际复杂场景下口语翻译准确率和用户体验是当前需要解决的问题。Constrained by the complexity and variability of human communication language, even human translators will communicate with interlocutors in various ways in order to obtain the accurate connotation of the speech to be translated. However, the current machine spoken language translation method is an end-to-end rendering translation method that does not deal with the complexity of the actual scene and the semantic complexity, which is obviously difficult to meet the accuracy requirements. At the same time, because translation as a software service lacks human-computer communication with users, it is also difficult to meet the requirements of scene friendliness in actual application scenarios. How to improve the accuracy of oral translation and user experience in actual complex scenarios is a problem that needs to be solved at present.

发明内容Contents of the invention

(一)要解决的技术问题(1) Technical problems to be solved

鉴于上述技术问题，本发明提供了一种具有人机对话功能的拟人化口语翻译方法及系统。本发明的核心点是在原有语音识别和翻译的基础上，加入一个人机对话模块，该模块捕捉、处理和识别当时的声学场景、话者场景、韵律场景、语言场景等，根据翻译任务需要在必要时跟用户进行人机对话，确切地获得能够显著提升复杂应用场景下用户的翻译体验，并提高翻译语义的准确度。In view of the above technical problems, the present invention provides an anthropomorphic spoken language translation method and system with man-machine dialogue function. The core point of the present invention is to add a man-machine dialogue module on the basis of the original speech recognition and translation, which captures, processes and recognizes the acoustic scene, speaker scene, prosodic scene, language scene, etc. Conduct man-machine dialogues with users when necessary to obtain accurate translation experience that can significantly improve users' translation experience in complex application scenarios and improve the accuracy of translation semantics.

(二)技术方案(2) Technical solution

根据本发明的一个方面，提供了一种具有人机对话功能的拟人化口语翻译方法，其包括以下步骤：对源语言语音进行智能语音识别，得到源语言文本；对源语言文本以及对话场景进行处理，进行拟人化人机对话沟通；进行机器翻译，得到翻译结果。According to one aspect of the present invention, there is provided an anthropomorphic spoken language translation method with a man-machine dialogue function, which includes the following steps: performing intelligent speech recognition on the source language speech to obtain the source language text; Processing, anthropomorphic man-machine dialogue and communication; machine translation, translation results.

根据本发明的另一个方面，还提供了一种具有人机对话功能的拟人化口语翻译系统，其包括：语音识别模块、人机对话管理模块、机器翻译模块，语音识别模块用于对源语言语音进行智能语音识别，得到源语言文本；人机对话管理模块用于对源语言文本以及对话场景进行处理，进行拟人化人机对话沟通；机器翻译模块用于进行机器翻译，得到翻译结果。According to another aspect of the present invention, there is also provided an anthropomorphic spoken language translation system with man-machine dialogue function, which includes: a speech recognition module, a man-machine dialogue management module, a machine translation module, and the speech recognition module is used to interpret the source language Intelligent voice recognition is performed on the voice to obtain the source language text; the man-machine dialogue management module is used to process the source language text and dialogue scenes to perform anthropomorphic man-machine dialogue communication; the machine translation module is used to perform machine translation to obtain the translation result.

(三)有益效果(3) Beneficial effects

从上述技术方案可以看出，本发明具有人机对话功能的拟人化口语翻译方法及系统至少具有以下有益效果其中之一：It can be seen from the above technical solutions that the anthropomorphic spoken language translation method and system with man-machine dialogue function of the present invention have at least one of the following beneficial effects:

(1)本发明能够显著提升复杂应用场景下翻译性能的准确性；(1) The present invention can significantly improve the accuracy of translation performance in complex application scenarios;

(2)本发明使得用户使用更方便，交谈过程中不需再做其他任何冗余操作；(2) The present invention makes it more convenient for users to use, and does not need to do any other redundant operations during the conversation;

(3)本发明使得用户翻译及交互体验更智能、更人性化。(3) The present invention makes user translation and interactive experience more intelligent and humanized.

附图说明Description of drawings

图1为现有技术两种语言的口语翻译技术示意图。FIG. 1 is a schematic diagram of oral translation technology of two languages in the prior art.

图2为现有技术口语自动翻译系统示意图。Fig. 2 is a schematic diagram of an automatic spoken language translation system in the prior art.

图3为本发明一种具有人机对话功能的拟人化口语翻译系统示意图。Fig. 3 is a schematic diagram of an anthropomorphic spoken language translation system with man-machine dialogue function according to the present invention.

图4为本发明一种具有人机对话功能的拟人化口语翻译系统的语音识别模块的结构示意图。FIG. 4 is a structural schematic diagram of a speech recognition module of an anthropomorphic spoken language translation system with man-machine dialogue function according to the present invention.

图5为本发明一种具有人机对话功能的拟人化口语翻译系统的详细示意图。Fig. 5 is a detailed schematic diagram of an anthropomorphic spoken language translation system with man-machine dialogue function according to the present invention.

图6为本发明第一实施例中获取说话者输入的源语言语音方法的示意图。Fig. 6 is a schematic diagram of a method for acquiring source language speech input by a speaker in the first embodiment of the present invention.

图7为本发明第一实施例中跟说话者进行人机对话方法的示意图。Fig. 7 is a schematic diagram of a method for man-machine dialogue with a speaker in the first embodiment of the present invention.

图8为本发明第一实施例中可视化向说话者展示当前系统状态的方法的示意图。Fig. 8 is a schematic diagram of a method for visually presenting the current system state to the speaker in the first embodiment of the present invention.

图9为本发明第一实施例中向对话另一方智能输出翻译结果的方法的示意图。FIG. 9 is a schematic diagram of a method for intelligently outputting translation results to the other party in the dialogue in the first embodiment of the present invention.

图10为本发明第二实施例中获取会议信息并创建会议的方法的示意图。Fig. 10 is a schematic diagram of a method for acquiring meeting information and creating a meeting in the second embodiment of the present invention.

图11为本发明第二实施例中智能主持会议进程的方法的示意图。Fig. 11 is a schematic diagram of a method for intelligently hosting a conference process in the second embodiment of the present invention.

图12为本发明第二实施例中可视化向参会者展示当前会议状态的方法的示意图。Fig. 12 is a schematic diagram of a method for visually displaying the current conference status to participants in the second embodiment of the present invention.

图13为本发明第三实施例一种基于无屏幕显示的拟人化口语翻译系统的翻译方法示意图。Fig. 13 is a schematic diagram of a translation method based on an anthropomorphic spoken language translation system without screen display according to the third embodiment of the present invention.

具体实施方式detailed description

为使本发明的目的、技术方案和优点更加清楚明白，以下结合具体实施例，并参照附图，对本发明进一步详细说明。显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings. Apparently, the described embodiments are some, but not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

本发明提供了一种具有人机对话功能的拟人化口语翻译方法，如图3所示，其包括以下步骤：获取源语言语音(即用户A的语种)；对源语言语音进行智能语音识别，得到源语言文本；对源语言文本以及对话场景进行处理，进行拟人化人机对话沟通；进行机器翻译，得到目标语言文本；进行语音合成，得到目标语言语音(即用户B的语种)；输出目标语言语音。The present invention provides a kind of anthropomorphic spoken language translation method with man-machine dialogue function, as shown in Figure 3, it comprises the following steps: obtain source language speech (being the language of user A); Carry out intelligent speech recognition to source language speech, Get the source language text; process the source language text and the dialogue scene, and carry out anthropomorphic human-machine dialogue communication; perform machine translation to obtain the target language text; perform speech synthesis to obtain the target language voice (that is, the language of user B); output the target language voice.

特别说明的是，对于图3中的源语言语音和目标语言语音，源语言语音作为翻译前的语种，目标语言语音作为翻译后的语种，两者是相对而言的，对于同一用户，发出的源语言语音和得到的目标语言语音为同一种语种，例如，用户A的语种为中文，用户B的语种为英文，用户A发出中文(源语言语音)，经过翻译，用户B得到英文(目标语言语音)；用户B发出英文(源语言语音)，经过翻译，用户A得到中文(目标语言语音)。In particular, for the source language speech and the target language speech in Figure 3, the source language speech is the language before translation, and the target language speech is the language after translation. The two are relative. For the same user, the The source language voice and the obtained target language voice are the same language. For example, the language of user A is Chinese, and the language of user B is English. User A sends Chinese (source language voice), and after translation, user B gets English (target language voice). voice); user B sends out English (voice in the source language), and after translation, user A gets Chinese (voice in the target language).

对源语言语音进行智能语音识别时，在翻译界面上利用智能语音检测手段和语种识别手段，自动进行对话者语音以及语种信息的判别，使得使用翻译系统的人无需进行基于两种语言的输入按键而进行对话；具体地，分别给出了每帧语音/非语音的概率得分，每帧源语种/目标语种的概率得分，并同步对语音进行双语解码，在这基础上利用信息综合输出有意义的识别结果。如图4所示，源语言语音和目标语言语音是两种不同的语言语音，都是指翻译前的语言语音。When intelligent voice recognition is performed on the source language voice, intelligent voice detection means and language recognition means are used on the translation interface to automatically discriminate the interlocutor's voice and language information, so that people using the translation system do not need to input keys based on two languages To conduct a dialogue; specifically, the probability score of each frame of speech/non-speech and the probability score of each frame of source language/target language are given, and the speech is simultaneously decoded bilingually. On this basis, it is meaningful to use the comprehensive output of information recognition results. As shown in Figure 4, the source language speech and the target language speech are two different language speeches, both of which refer to the language speech before translation.

如图5所示，对源语言文本以及对话场景进行处理时，包括对声学场景、话者场景、韵律场景、语言场景进行处理，通过拟人化人机对话沟通，确切地获得能够显著提升复杂应用场景下用户的翻译体验的信息。具体地，(1)对翻译系统所处的场景进行声学背景感知，待感知信息包括但不限于检测动态背景噪声信息(如信噪比、噪声类别等)并把声学背景感知的信息及智能处理结果进行综合处理；(2)对话者场景进行智能感知，待感知信息包括但不限于说话人信息、语种信息、是否有多人说话以及多人说话语音分离信息及其他说话者场景信息等，并把这些结果(话者场景感知信息及智能处理结果)进行综合处理；(3)对韵律场景进行智能感知，感知信息包括但不限于输入语音的中间停顿、断句边界、语速、基频、共振峰等超音段韵律特征及韵律分析置信度以及翻译回合边界的智能处理等，并把这些结果(韵律场景感知信息及智能处理结果)进行综合处理；(4)对语言场景进行智能感知，即对源语言文本进行上下文智能处理，包括但不限于：待翻译文本中抽取的人名、地名和机构名；待翻译文本中可能包含的识别错误的词汇；待翻译文本中包含的口语化语言碎片和重复；待翻译文本中包含的明显成分缺失；待翻译文本中包含的时序颠倒；待翻译文本中包含的时间数字短语语言信息；待翻译文本中包含的行业术语、日常缩略语、网络新词、古诗词、成语、俗语、歇后语等；并把这些结果(上下文智能处理结果)进行综合处理。As shown in Figure 5, when processing the source language text and the dialogue scene, it includes the processing of the acoustic scene, the speaker scene, the prosodic scene, and the language scene. Information about the user's translation experience in the scenario. Specifically, (1) perform acoustic background perception on the scene where the translation system is located. The information to be perceived includes but is not limited to detecting dynamic background noise information (such as signal-to-noise ratio, noise category, etc.) and intelligently processing the acoustic background perception information The results are comprehensively processed; (2) The scene of the interlocutor is intelligently sensed, and the information to be sensed includes but is not limited to speaker information, language information, whether there are multiple people speaking, voice separation information of multiple people speaking, and other speaker scene information, etc., and Comprehensively process these results (speaker scene perception information and intelligent processing results); (3) carry out intelligent perception of prosodic scenes, perception information includes but not limited to the middle pause of input speech, sentence boundary, speech rate, fundamental frequency, resonance peak and other supersegmental prosodic features, prosody analysis confidence and intelligent processing of translation round boundaries, etc., and comprehensively process these results (prosodic scene perception information and intelligent processing results); (4) perform intelligent perception of language scenes, namely Contextual intelligent processing of the source language text, including but not limited to: the names of people, places and organizations extracted from the text to be translated; possibly misrecognized words contained in the text to be translated; fragments of colloquial language contained in the text to be translated and Repetition; the absence of obvious components contained in the text to be translated; the reversal of the time sequence contained in the text to be translated; the time and number phrase language information contained in the text to be translated; industry terms, daily abbreviations, new words on the Internet, Ancient poems, idioms, common sayings, allegorical sayings, etc.; and these results (context intelligent processing results) are comprehensively processed.

人机对话沟通的内容包括但不限于：如果上面特征抽取后需要对用户进行友好提示，这包括提示用户改善使用环境、话者环境、提示用户正确使用方法等；或者对所述自然语言语音的判定结果需要进一步语义说明，则启动人机对话以获取用户的语义说明信息以便进行正确翻译。在上述进行对话时在拟人化翻译界面中具有清晰的提示，这种提示包括但不限于声音、图形、图符等方式。The content of man-machine dialogue communication includes but is not limited to: if the above feature extraction needs to give friendly prompts to users, this includes prompting users to improve the use environment, speaker environment, prompting users to use correctly, etc.; If the judgment result requires further semantic explanation, start a man-machine dialogue to obtain the user's semantic explanation information for correct translation. There are clear prompts in the anthropomorphic translation interface during the above-mentioned dialogue, and such prompts include but are not limited to sound, graphics, icons, etc.

另外，可以通过可视或非可视的形式向用户展示使用过程中翻译系统的不同状态，这些形式包括但不限于：通过可视化拟人化形象向用户展示翻译系统的不同状态；通过非可视化的声音媒介向用户展示翻译系统的不同状态。In addition, different states of the translation system during use can be displayed to users in visual or non-visual forms, including but not limited to: displaying different states of the translation system to users through visual anthropomorphic images; Mediums show users different states of the translation system.

在本发明中，通过接触或非接触的形式获取用户对人机对话内容的反馈，这些形式包括但不限于：通过点击、触摸本发明实施例中所涉及的硬件化设备来实现对人机对话内容的反馈或确认；通过语音交互来实现人机对话内容的反馈或确认。In the present invention, the user's feedback on the content of the man-machine dialogue is acquired through contact or non-contact forms, including but not limited to: realizing the man-machine dialogue by clicking and touching the hardware equipment involved in the embodiment of the present invention Feedback or confirmation of content; Feedback or confirmation of man-machine dialogue content is realized through voice interaction.

本发明还提供了一种具有人机对话功能的拟人化口语翻译系统，其包括：输入模块，用于获取源语言语音(即用户A的语种)；语音识别模块，用于对源语言语音进行智能语音识别，得到源语言文本；人机对话管理模块，用于对源语言文本以及对话场景进行处理，进行拟人化人机对话沟通；机器翻译模块，用于进行机器翻译，得到目标语言文本；语音合成模块，用于进行语音合成，得到目标语言语音(即用户B的语种)；输出模块，用于输出目标语言语音。The present invention also provides a kind of anthropomorphic spoken language translation system with man-machine dialogue function, it comprises: input module, is used for obtaining source language speech (namely the language of user A); Speech recognition module, is used for source language speech Intelligent speech recognition to obtain source language texts; man-machine dialogue management module, used to process source language texts and dialogue scenes, and perform anthropomorphic man-machine dialogue communication; machine translation module, used to perform machine translation to obtain target language texts; The speech synthesis module is used for speech synthesis to obtain the target language speech (that is, the language of user B); the output module is used for outputting the target language speech.

第一实施例：手机上双方口语翻译对话系统The first embodiment: the dialogue system of oral translation between the two parties on the mobile phone

在本实施例中，提供一种手机上双方口语翻译对话系统，该系统向对话双方提供端到端口语翻译对话功能，并在必要时向用户发起人机对话以提升用户的翻译体验。In this embodiment, a two-party oral translation dialogue system on a mobile phone is provided. The system provides an end-to-speak translation dialogue function for both parties in the dialogue, and initiates a man-machine dialogue to the user when necessary to improve the user's translation experience.

(1)获取说话者输入的源语言语音，同时输入方式根据说话者的使用环境、使用习惯等可选(如图6所示)。(1) Obtain the source language speech input by the speaker, and the input method is optional according to the speaker's use environment and usage habits (as shown in Figure 6).

如果说话者当前所处环境不利于直接使用语音输入，则本系统提供直接输入源语言文字的备选方案；If the speaker's current environment is not conducive to direct use of voice input, the system provides alternatives for direct input of source language text;

如果说话者习惯于手动指定对话双方语言的语种，则本系统提供语种手动指定语种的按钮，同时允许说话者改变时进行语种的手动切换；If the speaker is used to manually specifying the language of the two parties in the conversation, the system provides a button for manually specifying the language, and at the same time allows manual switching of the language when the speaker changes;

如果说话者习惯于利用该系统自动识别当前语种，则本系统在双方对话时提供语种自动切换的功能，从而无需说话者手动指定当前输入语音的语种；If the speaker is accustomed to using the system to automatically identify the current language, the system provides the function of automatic language switching when the two parties are talking, so that the speaker does not need to manually specify the language of the current input voice;

如果说话者习惯于手动点击来确定语音输入的边界，则本系统提供语音输入按钮，利用说话者操作该按钮的状态来获取说话者输入语音的边界，同时，根据说话者是否选择系统自动识别语种来确定语音输入按钮的数量，如果说话者选择手动指定语种，则语音输入按钮为两个，对话双方分别操作各自的语音输入按钮，如果说话者选择系统自动识别语种，则对话双方分享同一个语音输入按钮；If the speaker is used to manually clicking to determine the boundary of voice input, the system provides a voice input button, and the state of the speaker's operation of the button is used to obtain the boundary of the speaker's input voice. At the same time, the system automatically recognizes the language according to whether the speaker chooses To determine the number of voice input buttons, if the speaker chooses to manually specify the language, there are two voice input buttons, and the two parties in the dialogue operate their own voice input buttons respectively. If the speaker chooses the system to automatically identify the language, the two dialogue parties share the same voice input button;

如果说话者习惯于利用该系统自动识别语音输入的边界，则本系统提供语音输入断点自动检测的功能，从而在说话者暂停或停止语音输入时，自动识别语音输入的边界，将当前已获取的语音输入交于后续处理流程；If the speaker is accustomed to using the system to automatically identify the boundary of voice input, the system provides the function of automatic detection of voice input breakpoints, so that when the speaker pauses or stops voice input, the boundary of voice input is automatically recognized, and the currently acquired The voice input of the phone is handed over to the subsequent processing flow;

(2)在必要时跟说话者进行人机对话，人机对话的内容根据说话者不同的声学场景、话者场景、语言场景等可选(如图7所示)。(2) Conduct a man-machine dialogue with the speaker when necessary, and the content of the man-machine dialogue can be selected according to the speaker's different acoustic scenes, speaker scenes, language scenes, etc. (as shown in Figure 7).

如果拟人化Mediator在获取说话者语音输入时动态计算背景噪声强度超过所设阈值，则该系统将建议说话者重新输入语音或者更改输入方式；If the anthropomorphic Mediator dynamically calculates that the background noise intensity exceeds the set threshold when acquiring the speaker's voice input, the system will suggest the speaker to re-input the voice or change the input method;

如果拟人化Mediator通过处理输入语音，判断有超过对话双方所设语种类别的语音输入，则该系统将建议说话者重新设置语种选项；If the anthropomorphic Mediator judges that there is a voice input that exceeds the language category set by the dialogue parties through processing the input voice, the system will suggest the speaker to reset the language option;

如果拟人化Mediator通过处理输入语音，判断有多个说话者同时输入语音，则该系统将建议说话者依次输入语音以获得更好的翻译体验；If the anthropomorphic Mediator judges that there are multiple speakers inputting voice at the same time by processing the input voice, the system will suggest that the speakers input voice in turn to obtain a better translation experience;

如果拟人化Mediator通过处理输入语音经自动语音识别后得到的待翻译文本，同时分析待翻译文本的语言场景和语义混淆度，当语言场景的复杂性或语义混淆度超过预设阈值时，则该系统将启动人机对话以获取说话者对复杂语言场景和语义混淆部分的进一步说明，这里的复杂场景和语义混淆部分包括但不限于：If the anthropomorphic Mediator analyzes the language scene and semantic confusion of the text to be translated by processing the text to be translated after automatic speech recognition of the input voice, when the complexity of the language scene or the semantic confusion exceeds the preset threshold, the The system will initiate a man-machine dialogue to obtain further explanations from the speaker on complex language scenarios and semantic confusion. The complex scenarios and semantic confusion here include but are not limited to:

如果待翻译文本中包含人名、地名和机构名，且其自身或者与其上下文存在歧义，则该系统将建议说话者确认人名、地名和机构名的词语边界及结构；If the text to be translated contains names of people, places and institutions, and there is ambiguity between them or their context, the system will suggest the speaker to confirm the word boundaries and structures of names of people, places and institutions;

如果待翻译文本中可能包含识别错误的词汇，则该系统将建议说话者确认该词汇与其真实输入是否一致；If the text to be translated may contain misrecognized words, the system will advise the speaker to confirm whether the words are consistent with their real input;

如果待翻译文本中包含较多的口语化语言碎片和重复，则该系统将自动对待翻译文本进行语言解析和重构，同时将重构的语言表达提交说话者确认，若确认通过，则将重构的语言表达交于后续流程处理，如果说话者否决，则该系统将建议说话者重新组织语言并以更流畅的方式重新表达其语义；If the text to be translated contains many colloquial language fragments and repetitions, the system will automatically analyze and reconstruct the text to be translated, and submit the reconstructed language expression to the speaker for confirmation. The structured language expression is handed over to the subsequent process. If the speaker vetoes it, the system will suggest that the speaker restructure the language and re-express its semantics in a more fluent manner;

如果待翻译文本中包含明显的成分缺失，则该系统将自动对待翻译文本进行成分补全，同时将补全的语言表达提交说话者确认，若确认通过，则将补全的语言表达交于后续流程处理，如果说话者否决，则该系统将建议说话者以更完整的语言结构表达重新进行输入；If the text to be translated contains obvious missing components, the system will automatically complete the components of the text to be translated, and submit the completed language expression to the speaker for confirmation. If the confirmation is passed, the completed language expression will be submitted to the follow-up Process processing, if the speaker vetoes, the system will suggest that the speaker re-enter with a more complete language structure expression;

如果待翻译文本中包含时序颠倒，则该系统将自动调整待翻译文本的时序，同时将调整的语言表达提交说话者确认，若确认通过，则将调整的语言表达交于后续流程处理，如果说话者否决，则该系统将建议说话者以正常的时序重新进行输入；If the time sequence of the text to be translated is reversed, the system will automatically adjust the time sequence of the text to be translated, and submit the adjusted language expression to the speaker for confirmation. If the confirmation is passed, the adjusted language expression will be submitted to the subsequent process for processing. If the speaker rejects it, the system will suggest that the speaker re-enter the input with the normal timing;

如果待翻译文本中包含时间数字短语，且短语自身或与其上下文间存在歧义，则该系统将建议说话者确认时间数字短语的短语边界及结构，所述时间数字短语包括但不限于：基数、序数、小数、分数、概率词、倍数词、约数、个体量词、度量词、复合量词、不定量词、动量词、时量词、名量词、时间、时长、季度、月份、星期、节气、节日、纪年；If the text to be translated contains temporal numeral phrases, and there is ambiguity between the phrase itself or its context, the system will advise the speaker to confirm the phrase boundary and structure of temporal numeral phrases, including but not limited to: cardinal numbers, ordinal numbers , decimals, fractions, probability words, multiple words, approximate numbers, individual quantifiers, measure words, compound quantifiers, infinitive quantifiers, verb quantifiers, time quantifiers, noun quantifiers, time, duration, quarters, months, weeks, solar terms, festivals, Chronicle;

如果待翻译文本中包含专有短语，且短语自身或与其上下文间存在歧义，则该系统将建议说话者确认专有短语的短语边界及结构，所述专有名词包括但不限于：行业术语、日常缩略语、网络新词、古诗词、成语、俗语、歇后语。If the text to be translated contains proprietary phrases, and there is ambiguity between the phrase itself or its context, the system will suggest the speaker to confirm the phrase boundary and structure of the proprietary phrases. The proper nouns include but are not limited to: industry terms, Daily abbreviations, new words on the Internet, ancient poems, idioms, sayings, allegory.

同时，在上述人机对话过程中，人机对话的沟通交互方式包括但不限于：通过语音问答实现人机沟通交互；通过文本显示对说话者的提示和需求；通过屏幕触摸、点击等获取说话者的确认和回复。At the same time, in the process of the above-mentioned human-computer dialogue, the communication and interaction methods of the human-computer dialogue include but are not limited to: realizing human-computer communication and interaction through voice question and answer; displaying prompts and requirements for the speaker through text; acknowledgment and reply.

如果拟人化Mediator通过处理和识别输入语音，获取输入语音的韵律场景信息，则该系统利用韵律场景信息帮助提升说话者的翻译服务体验，这里的韵律场景包括但不限于：If the anthropomorphic Mediator obtains the prosodic scene information of the input speech by processing and recognizing the input speech, then the system uses the prosodic scene information to help improve the speaker's translation service experience. The prosodic scenes here include but are not limited to:

输入语音的中间停顿信息，如果拟人化Mediator通过处理和识别输入语音，采集获得输入语音的中间停顿信息，则该系统将根据所述信息智能判断说话者口语语音碎片间的语义关联及语义重心，并以此作为后续流程处理的优化依据；The middle pause information of the input speech, if the anthropomorphic Mediator collects and obtains the middle pause information of the input speech by processing and recognizing the input speech, then the system will intelligently judge the semantic correlation and semantic center of gravity between the speaker's spoken speech fragments according to the information, And use it as the optimization basis for subsequent process processing;

输入语音的断句边界，如果拟人化Mediator通过处理和识别输入语音，采集获得输入语音的断句边界信息，则该系统将根据所述信息智能判断待翻译文本的篇章切分，并以此为后续流程处理提供切分信息；Sentence boundary of the input voice, if the anthropomorphic Mediator collects and obtains the sentence boundary information of the input voice by processing and recognizing the input voice, then the system will intelligently judge the chapter segmentation of the text to be translated according to the information, and use this as a follow-up process Processing and providing segmentation information;

输入语音的语调情感，如果拟人化Mediator通过处理和识别输入语音，采集获得输入语音的语调情感信息，则该系统将根据所述信息智能判断待翻译文本的语义重心及句中与句尾标点，并以此为后续流程提供语义情感信息；The intonation and emotion of the input voice. If the anthropomorphic Mediator collects and acquires the intonation and emotion information of the input voice by processing and recognizing the input voice, then the system will intelligently judge the semantic center of gravity and mid-sentence and sentence-end punctuation of the text to be translated according to the information. And provide semantic and emotional information for the subsequent process;

输入语音的翻译回合边界，如果拟人化Mediator通过处理和识别输入语音，采集获得输入语音的翻译回合边界信息，则该系统将根据所述信息智能重置记忆模块，并以此作为开启新回合翻译对话的依据。The translation round boundary of the input speech. If the anthropomorphic Mediator acquires the translation round boundary information of the input speech by processing and recognizing the input speech, the system will intelligently reset the memory module according to the information and use it as a new round of translation Basis for dialogue.

(3)可视化向说话者展示当前系统状态(如图8所示)。可视化界面中设置专用拟人化形象向说话者展示当前系统状态，专用拟人化形象包括但不限于：卡通造型、明星、动物、机器人等；在该系统提供翻译服务的不同阶段，专用拟人化形象的状态包括但不限于：(3) Visually display the current system state to the speaker (as shown in Figure 8). A special anthropomorphic image is set in the visual interface to show the current system status to the speaker. The special anthropomorphic image includes but is not limited to: cartoons, stars, animals, robots, etc.; at different stages of the translation service provided by the system, the special anthropomorphic image Status includes but is not limited to:

当对话双方之一作为说话者输入语音时，专用拟人化形象以输入语音的语种或说话人为依据，以聆听输入语音的状态面向说话者方向；When one of the dialogue parties acts as a speaker to input voice, the special anthropomorphic image is based on the language or speaker of the input voice, and faces the speaker in the state of listening to the input voice;

当向说话者发起人机对话时，专用拟人化形象根据人机对话的实际场景，以请求解答、友好提示、智能判断等状态面向说话者方向；When initiating a man-machine dialogue to the speaker, the dedicated anthropomorphic image faces the speaker in the state of requesting answers, friendly prompts, intelligent judgments, etc. according to the actual scene of the man-machine dialogue;

当获取说话者对人机对话内容的答复时，专用拟人化形态根据说话者的答复内容，以聆听答复、理解、感谢等状态面向说话者方向；When obtaining the speaker's reply to the content of the man-machine dialogue, the special anthropomorphic form faces the speaker in the state of listening to the reply, understanding, thanking, etc. according to the speaker's reply content;

当向另一方输出翻译结果时，专用拟人化形象以开口说话、沟通交流的状态面向另一方向。When outputting the translation results to the other party, the dedicated anthropomorphic image faces the other direction in a state of speaking and communicating.

(4)向对话另一方智能输出翻译结果(如图9所示)。拟人化Mediator通过人机对话、智能处理获取说话者的上述声学、语义、韵律等信息，在必要时附加于翻译结果中同步输出给对话另一方，输出方式包括但不限于：利用输出文本标红、加粗等方式标注输出翻译结果的重点部分；利用输出语音的重音、重复等方式显示说话者的情感及语义重心；利用附加自动说明的方式对输出文本中的生僻词、专业概念加以解释。(4) Intelligently output translation results to the other party in the dialogue (as shown in FIG. 9 ). Anthropomorphic Mediator obtains the above-mentioned acoustics, semantics, prosody and other information of the speaker through man-machine dialogue and intelligent processing, and when necessary, adds it to the translation result and outputs it to the other party in the dialogue simultaneously. The output methods include but are not limited to: use the output text to mark red Mark the key parts of the output translation results by means of , bold, etc.; display the speaker's emotion and semantic focus by means of the accent and repetition of the output voice; use the method of adding automatic explanation to explain the rare words and professional concepts in the output text.

第二实施例：手机上多方口语翻译会议系统The second embodiment: multi-party oral translation conference system on the mobile phone

在本实施例中，提供了一种手机上多方口语翻译会议系统，该系统向参会者提供端到端的多方口语会议翻译功能，提供智能会议主持功能，并在必要时向参会者发起人机对话以提升会议的翻译体验。In this embodiment, a multi-party spoken language translation conference system on a mobile phone is provided. The system provides participants with an end-to-end multi-party spoken language conference translation function, provides an intelligent conference moderator function, and provides a conference call to the participants when necessary. Machine dialogue to improve the translation experience of the meeting.

(1)获取会议信息并创建会议(如图10所示)。由会议创建者指定会议标识码，会议标识码为会议的唯一性识别依据，其他会议参与者通过输入会议标识码参与指定会议；由会议创建者指定会议名称，会议名称为会议的内容概括或参会者的信息体现；由会议创建者指定会议所有语种，参会者只能在会议创建者选定的语种中选择自身语种；参会者分别输入本人姓名，本人姓名作为会议中参会者的识别依据在会议界面及拟人化Mediator交谈中得以体现。(1) Obtain meeting information and create a meeting (as shown in Figure 10). The meeting creator specifies the meeting identification code, which is the basis for the unique identification of the meeting, and other meeting participants participate in the specified meeting by entering the meeting identification code; the meeting creator specifies the meeting name, and the meeting name is the content summary or participation of the meeting The information of the participants is reflected; the creator of the meeting specifies all the languages of the meeting, and the participants can only choose their own language among the languages selected by the creator of the meeting; the participants enter their names respectively, and their names are used as the names of the participants in the meeting The identification basis is reflected in the meeting interface and the anthropomorphic Mediator conversation.

(2)拟人化Mediator启动多方口语翻译会议并智能主持会议进程(如图11所示)。如果会议创建者选择由参会者自行主持会议，则拟人化Mediator在启动多方口语翻译会议后，将主持会议的功能交由会议创建者及参会者自行控制，这些功能包括但不限于：(2) Anthropomorphic Mediator initiates a multi-party oral translation conference and intelligently hosts the conference process (as shown in Figure 11). If the meeting creator chooses the participants to host the meeting, the anthropomorphic Mediator will hand over the function of hosting the meeting to the meeting creator and the participants after starting the multi-party oral translation conference. These functions include but are not limited to:

如果会议创建者选择参会者发言模式为麦克风争抢模式，则发言顺序由参会者自行决定，当一位参会者发言时，该系统将拒绝其他参会者申请发言的请求，直至该参会者发言结束为止，当该参会者发言结束后，其他参会者可以申请发言，如果出现多人同时申请发言，则该系统按申请请求到达的先后决定下一位发言者；If the conference creator selects the speaking mode of the participants as the microphone competition mode, the speaking order is determined by the participants themselves. When one participant speaks, the system will reject other participants' requests to speak until the When the participant finishes speaking, other participants can apply to speak. If there are multiple people applying to speak at the same time, the system will determine the next speaker according to the order in which the application requests arrive;

如果会议创建者选择参会者发言模式为麦克风指定模式，则发言顺序由会议创建者指定，当一位参会者向会议创建者申请发言时，会议创建者可以将发言权授予该参会者，同时其他参会者在该参会者发言过程中，无法申请发言权；If the conference creator selects the speaking mode of the participants as the microphone designation mode, the speaking order is specified by the conference creator. When a participant applies to the conference creator to speak, the conference creator can grant the speaking right to the participant , and other participants cannot apply for the right to speak while the participant is speaking;

如果会议创建者手动指定发言时间长度限制，则该系统在参会者发言到达时间长度限制时对参会者作出提醒；If the meeting creator manually specifies the speaking time limit, the system will remind the participants when the speaking time limit is reached;

如果会议创建者不指定发言时间长度限制，则该系统不会对参会者发言时间做出限制，由参会者自行掌控；If the meeting creator does not specify a limit on the speaking time, the system will not limit the speaking time of the participants, which is controlled by the participants themselves;

如果会议创建者不设置参会者发言语音的声学场景、话者场景、语言场景及韵律场景的自动监测，则拟人化Mediator将不对参会者发言语音及其翻译结果做任何智能化处理，直接将翻译结果分发给其他参会者；If the conference creator does not set the automatic monitoring of the acoustic scene, speaker scene, language scene and prosodic scene of the participant's voice, the anthropomorphic Mediator will not perform any intelligent processing on the participant's speech and its translation results, and directly Distribute translation results to other participants;

如果会议创建者设置参会者发言语音的声学场景、话者场景、语言场景及韵律场景的自动监测，则拟人化Mediator将对参会者发言语音及其翻译结果做智能化处理，并将处理结果返回给会议创建者，由会议创建者决定是否与发言者进行进一步的沟通或确认，其中声学场景、话者场景、语言场景及韵律场景的自动监测包括但不限于第一实施例中所涉及的对输入语音、待翻译文本或翻译结果的各项识别及处理。If the conference creator sets the automatic monitoring of the acoustic scene, speaker scene, language scene and prosodic scene of the participant's voice, the anthropomorphic Mediator will intelligently process the participant's speech and its translation results, and will process The result is returned to the conference creator, who decides whether to communicate or confirm further with the speaker, wherein the automatic monitoring of the acoustic scene, speaker scene, language scene and prosodic scene includes but is not limited to those involved in the first embodiment Various recognition and processing of input speech, text to be translated or translation results.

如果会议创建者选择由拟人化Mediator智能主持会议，则拟人化Mediator在启动多方口语翻译会议后，将开启主持会议的功能，这些功能包括但不限于：If the meeting creator chooses the anthropomorphic Mediator to host the meeting intelligently, the anthropomorphic Mediator will enable the function of hosting the meeting after starting the multi-party oral translation conference. These functions include but are not limited to:

拟人化Mediator通过对会议进程和会议记录的智能识别及处理，同时根据参会者申请发言的顺序信息，智能确定发言顺序；The anthropomorphic Mediator can intelligently identify and process the meeting process and meeting records, and at the same time, intelligently determine the speaking order according to the order information of the participants' application for speaking;

拟人化Mediator通过对当前发言者的发言内容、语速及会议进程的智能判断，对当前发言者的发言时间长度、内容长度、语速等进行智能提醒，这些智能提醒包括：Anthropomorphic Mediator intelligently judges the current speaker's speech content, speech speed and conference progress, and intelligently reminds the current speaker's speech length, content length, speech speed, etc. These smart reminders include:

如果发言者的发言时间过长，则拟人化Mediator自动提醒发言者注意发言时间；If the speaker's speaking time is too long, the anthropomorphic Mediator will automatically remind the speaker to pay attention to the speaking time;

如果发言者的发言内容过长，则拟人化Mediator自动提醒发言者对发言内容进行切分以达到更好的翻译性能；If the speaker's speech is too long, the anthropomorphic Mediator will automatically remind the speaker to segment the speech to achieve better translation performance;

如果发言者的语速过快，则拟人化Mediator自动提醒发言者降低语速，以更和缓的节奏进行表达。If the speaker speaks too fast, the anthropomorphic Mediator automatically reminds the speaker to slow down and express with a more gentle rhythm.

拟人化Mediator通过对当前发言者输入语音的声学场景、话者场景、语言场景及韵律场景的智能监测，及对会议进程和会议记录的智能识别及处理，动态决策与发言者进行人机对话的必要性，其中，拟人化Mediator动态决策的参照因素包括但不限于：Anthropomorphic Mediator intelligently monitors the acoustic scene, speaker scene, language scene and prosodic scene of the current speaker's input voice, and intelligently recognizes and processes the meeting process and meeting records, and dynamically decides the human-machine dialogue with the speaker. Necessity, among them, the reference factors for the dynamic decision-making of the anthropomorphic Mediator include but are not limited to:

如果拟人化Mediator智能感知参会者对会议主题熟悉度较高，则该系统将调高进行人机对话的阈值或门限，如果拟人化Mediator智能感知参会者对会议主题熟悉度较低，则该系统将降低进行人机对话的阈值或门限；If the anthropomorphic Mediator intelligent perception participants are more familiar with the conference theme, the system will increase the threshold or threshold for man-machine dialogue; if the anthropomorphic Mediator intelligent perception participants are less familiar with the conference theme, then The system will lower the threshold or threshold for engaging in human-computer dialogue;

如果拟人化Mediator智能感知会议进程紧迫性较高，则该系统将调高进行人机对话的阈值或门限，如果拟人化Mediator智能感知会议进程紧迫性较低，则该系统将降低进行人机对话的阈值或门限。If the anthropomorphic Mediator intelligently perceives the urgency of the meeting process is high, the system will increase the threshold or threshold for man-machine dialogue; if the anthropomorphic Mediator intelligently senses the urgency of the meeting process is low, the system will reduce the man-machine dialogue threshold or threshold.

拟人化Mediator对当前发言者输入语音的声学场景、话者场景、语言场景及韵律场景的智能监测包括但不限于第一实施例中所涉及的对输入语音、待翻译文本或翻译结果的各项识别及处理。Anthropomorphic Mediator's intelligent monitoring of the acoustic scene, speaker scene, language scene and prosodic scene of the current speaker's input voice includes but is not limited to the input speech, text to be translated or translation results involved in the first embodiment identification and processing.

(3)可视化向参会者展示当前会议状态(如图12所示)。可视化界面中设置专用拟人化形象向参会者展示当前会议状态，专用拟人化形象包括但不限于：卡通造型、明星、动物、机器人等；在会议的不同状态下，专用拟人化形象的状态同时随之改变，包括但不限于：(3) Visually display the current conference status to the participants (as shown in FIG. 12 ). Set up a special anthropomorphic image in the visual interface to show the current meeting status to the participants. The special anthropomorphic image includes but not limited to: cartoon shape, star, animal, robot, etc.; in different states of the meeting, the state of the special anthropomorphic image is simultaneously changes, including but not limited to:

当参会者申请发言时，专用拟人化形象以等待聆听的状态面向申请者，同时将申请发言的参会者信息通知其他参会者；When a participant applies for a speech, the dedicated anthropomorphic image faces the applicant in a state of waiting to listen, and at the same time informs other participants of the participant's information who applies for a speech;

当参会者进行发言时，专用拟人化形象以聆听的状态面向发言者，同时以发言的状态面向其他参会者；When a participant speaks, the dedicated anthropomorphic image faces the speaker in a state of listening, and at the same time faces other participants in a state of speaking;

当拟人化Mediator与发言者进行人机对话时，专用拟人化形象根据人机对话的实际场景，以请求解答、友好提示、智能判断等状态面向发言者方向，同时以等待的状态面向其他参会者；When the anthropomorphic Mediator conducts man-machine dialogue with the speaker, the dedicated anthropomorphic image faces the speaker in the state of requesting answers, friendly prompts, intelligent judgments, etc. according to the actual scene of the man-machine dialogue, and at the same time faces other participants in a waiting state By;

当会议创建者修改会议相关设置时，专用拟人化形象将修改内容通知其他参会者，修改内容包括但不限于：会议主题、会议标识码、会议语种、会议主持方、发言时间长度限制、发言模式、会议参会人员。When the meeting creator modifies the relevant settings of the meeting, the special anthropomorphic image will notify other participants of the modification, including but not limited to: meeting theme, meeting identification code, meeting language, meeting host, speech time limit, speech mode, meeting participants.

(4)拟人化Mediator智能判定结束会议(4) Anthropomorphic Mediator intelligently determines the end of the meeting

如果会议创建者选择手动结束会议，则通过点击结束会议的功能按钮来终止会议，同时其他参会者将被强制结束会议；If the meeting creator chooses to end the meeting manually, the meeting can be terminated by clicking the function button of ending the meeting, and other participants will be forced to end the meeting;

如果会议创建者选择由拟人化Mediator智能判定结束会议，则拟人化Mediator通过对会议进程和会议记录的识别及处理，智能判定会议进程的边界，在会议进程结束后终止会议。If the meeting creator chooses to have the anthropomorphic Mediator intelligently decide to end the meeting, the anthropomorphic Mediator can intelligently determine the boundary of the meeting process by identifying and processing the meeting process and meeting records, and terminate the meeting after the meeting process ends.

此外，会议结束后，该系统向参会者提供会议相关信息，包括但不限于：会议记录、会议统计信息、会议人员名单、会议纪要。In addition, after the meeting is over, the system will provide participants with meeting-related information, including but not limited to: meeting minutes, meeting statistical information, meeting personnel list, and meeting minutes.

第三实施例：基于无屏幕显示的拟人化口语翻译系统The third embodiment: an anthropomorphic spoken language translation system based on no-screen display

在本实施例中提供一种基于无屏幕显示的拟人化口语翻译系统，该系统向使用者提供无屏幕情况下端到端的拟人翻译服务，该系统采用如下技术方案(如图13所示)：In this embodiment, an anthropomorphic spoken language translation system based on no-screen display is provided. The system provides end-to-end anthropomorphic translation services to users without a screen. The system adopts the following technical solution (as shown in FIG. 13 ):

(1)获取说话者的相关信息(1) Obtain information about the speaker

在无屏幕显示的情况下，该系统通过对说话者输入语音的智能处理，获取说话者的相关信息，所述相关信息包括但不限于：In the case of no screen display, the system acquires relevant information about the speaker through intelligent processing of the speaker's input voice, and the relevant information includes but is not limited to:

可选的，该系统在启动时请求所有说话人依次说一句常用语，从而获取对话参与方的语种信息；Optionally, the system requests all speakers to speak a common phrase in turn when starting up, so as to obtain the language information of the dialogue participants;

该系统自动对说话者进行说话人识别，并将识别结果作为语种识别、人机对话及区分不同说话人对话记录的重要依据。The system automatically identifies the speaker, and uses the identification result as an important basis for language identification, man-machine dialogue, and dialogue records for different speakers.

(2)获取说话者输入的源语言语音(2) Obtain the source language voice input by the speaker

对话开始后，该系统通过对说话者输入语音的智能处理，获取完整的源语言语音，所述对输入语音的智能处理包括但不限于：After the dialogue starts, the system obtains the complete source language voice through the intelligent processing of the speaker's input voice. The intelligent processing of the input voice includes but is not limited to:

该系统自动对说话者输入的语音进行断点检测，从而智能识别输入语音的边界，获取完整的语音片段。The system automatically detects the breakpoints of the speech input by the speaker, so as to intelligently identify the boundaries of the input speech and obtain complete speech fragments.

(3)在必要时跟说话者进行人机对话，人机对话的内容根据说话者不同的声学场景、话者场景、语言场景等可选(3) Conduct man-machine dialogue with the speaker when necessary. The content of the man-machine dialogue can be selected according to the speaker's different acoustic scenes, speaker scenes, language scenes, etc.

拟人化Mediator通过对当前输入语音的声学场景、话者场景、语言场景及韵律场景的智能监测，在必要时开启人机对话，所述智能检测及人机对话包括但不限于第一实施例中所涉及的内容。The anthropomorphic Mediator starts the man-machine dialogue when necessary through the intelligent monitoring of the acoustic scene, speaker scene, language scene and prosody scene of the current input voice. The intelligent detection and man-machine dialogue include but are not limited to the first embodiment the content involved.

(4)在无屏幕情况下，拟人化Mediator通过声音的不同状态向用户展示当前对话的状态(4) In the case of no screen, the anthropomorphic Mediator shows the user the status of the current dialogue through different states of the voice

拟人化Mediator以声音为媒介，向用户展示当前对话的状态，所述以声音为媒介的方式包括但不限于：The anthropomorphic Mediator uses sound as a medium to display the current state of the conversation to the user. The ways of using sound as a medium include but are not limited to:

拟人化Mediator支持通过声音的性别来区分人机对话和翻译结果输出等不同状态；Anthropomorphic Mediator supports different states such as man-machine dialogue and translation result output through the gender of the voice;

拟人化Mediator支持通过声音的语气来区分人机对话和翻译结果输出等不同状态，在进行人机对话时，拟人化Mediator使用轻声的商谈、请求语气，而在输出翻译结果时，拟人化Mediator则使用客观、严肃的表述语气；Anthropomorphic Mediator supports different states such as man-machine dialogue and translation result output through the tone of the voice. In man-machine dialogue, anthropomorphic Mediator uses soft negotiating and requesting tone; when outputting translation results, anthropomorphic Mediator uses Use an objective, serious tone of expression;

拟人化Mediator支持通过声音的前缀背景音乐来区分人机对话和翻译结果输出等不同状态，在进行人机对话时，拟人化Mediator在对话前将插入简短的启发意味的轻快音乐，而在输出翻译结果时，拟人化Mediator则在输出前将插入简短的告知意味的厚重音乐。Anthropomorphic Mediator supports different states such as man-machine dialogue and translation result output through the prefix background music of the sound. When the result is generated, the anthropomorphic Mediator will insert a short informative thick music before outputting.

(5)向对话另一方智能输出翻译结果语音(5) Intelligently output the voice of the translation result to the other party in the dialogue

拟人化Mediator通过人机对话、智能处理获取说话者的上述声学、语义、韵律等信息，在必要时附加于翻译结果中同步输出给对话另一方，输出方式包括但不限于：利用输出语音的重音、重复等方式显示说话者的情感及语义重心。Anthropomorphic Mediator obtains the above-mentioned acoustics, semantics, prosody and other information of the speaker through human-computer dialogue and intelligent processing, and when necessary, adds it to the translation result and outputs it to the other party in the dialogue simultaneously. The output methods include but are not limited to: using the accent of the output voice , repetition and other ways to show the speaker's emotional and semantic focus.

需要说明的是，在附图或说明书正文中，未绘示或描述的实现方式，均为所属技术领域中普通技术人员所知的形式，并未进行详细说明。此外，上述对各元件和方法的定义并不仅限于实施例中提到的各种具体结构、形状或方式，本领域普通技术人员可对其进行简单地更改或替换。It should be noted that, in the accompanying drawings or in the text of the specification, implementations that are not shown or described are forms known to those of ordinary skill in the art, and are not described in detail. In addition, the above definitions of each element and method are not limited to the various specific structures, shapes or methods mentioned in the embodiments, and those skilled in the art can easily modify or replace them.

综上所述，本发明提供一种具有人机对话功能的拟人化口语翻译方法及系统。本发明的核心点是在原有语音识别和翻译的基础上，加入一个人机对话模块，该模块捕捉、处理和识别当时的声学场景、话者场景、韵律场景、语言场景等，根据翻译任务需要在必要时跟用户进行人机对话，确切地获得能够显著提升复杂应用场景下用户的翻译体验，并提高翻译语义的准确度。To sum up, the present invention provides an anthropomorphic spoken language translation method and system with man-machine dialogue function. The core point of the present invention is to add a man-machine dialogue module on the basis of the original speech recognition and translation, which captures, processes and recognizes the acoustic scene, speaker scene, prosodic scene, language scene, etc. Conduct man-machine dialogues with users when necessary to obtain accurate translation experience that can significantly improve users' translation experience in complex application scenarios and improve the accuracy of translation semantics.

以上所述的具体实施例，对本发明的目的、技术方案和有益效果进行了进一步详细说明，所应理解的是，以上所述仅为本发明的具体实施例而已，并不用于限制本发明，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The specific embodiments described above have further described the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above descriptions are only specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

1. a kind of anthropomorphic spoken language translation method with man-machine dialogue function, it comprises the following steps:

Perform intelligent speech recognition on the source language speech to obtain the source language text;

Process the source language text and dialogue scenes, and carry out anthropomorphic man-machine dialogue communication;

Perform machine translation and get translation results.

2. The anthropomorphic spoken language translation method according to claim 1, wherein, when carrying out intelligent speech recognition to source language speech, specifically comprise the following steps:

Give the probability score of each frame of speech/non-speech, and the probability score of each frame of source language/target language;

Simultaneously perform bilingual decoding on speech;

Use information to synthesize and output meaningful recognition results.

3. The anthropomorphic spoken language translation method according to claim 1, wherein, when processing the source language text and the dialogue scene, it includes perceiving the acoustic scene, the speaker scene, the prosodic scene, and the language scene.

4. The anthropomorphic spoken language translation method according to claim 3, wherein when perceiving an acoustic scene, the information to be perceived includes: dynamic background noise information.

5. The anthropomorphic spoken language translation method according to claim 3, wherein, when the scene of the interlocutor is perceived, the information to be sensed includes: speaker information, language information, whether there are many people speaking, voice separation information of many people speaking, and others Speaker scene information.

6. The anthropomorphic spoken language translation method according to claim 3, wherein, when the prosody scene is perceived, the information to be sensed includes: a pause in the input speech, sentence boundary, speech rate, fundamental frequency, formant and prosody analysis confidence degrees and translation turn boundaries.

7. The anthropomorphic spoken language translation method according to claim 3, wherein, when the language scene is perceived, the information to be perceived includes:

Names of people, places and institutions extracted from the text to be translated;

Misidentified words that may be included in the text to be translated;

Fragments and repetitions of colloquial language contained in the text to be translated;

missing obvious elements contained in the text to be translated;

Chronological inversions contained in the text to be translated;

Language information of time, number, phrases contained in the text to be translated;

Industry terminology, daily abbreviations, new words on the Internet, ancient poems, idioms, common sayings, allegories contained in the text to be translated.

8. The anthropomorphic spoken language translation method according to claim 1, wherein, when carrying out anthropomorphic man-machine dialogue communication, its communication content includes:

Prompts on correct usage methods, tips on improving the usage environment or the speaker's environment;

Obtain semantic description information when further semantic description is required.

9. An anthropomorphic spoken language translation system with man-machine dialogue function, comprising:

The speech recognition module is used for performing intelligent speech recognition on the source language speech to obtain the source language text;

The man-machine dialogue management module is used to process source language texts and dialogue scenes, and carry out anthropomorphic man-machine dialogue communication;

The machine translation module is used for performing machine translation and obtaining translation results.

10. The anthropomorphic spoken language translation system according to claim 9, wherein the different states of the translation system are shown through visualized anthropomorphic images, or different states of the translation system are shown through non-visualized sound media.