CN106409283A - Audio frequency-based man-machine mixed interaction system and method - Google Patents
- Publication number
- CN106409283A CN106409283A CN201610791966.0A CN201610791966A CN106409283A CN 106409283 A CN106409283 A CN 106409283A CN 201610791966 A CN201610791966 A CN 201610791966A CN 106409283 A CN106409283 A CN 106409283A
- Authority
- CN
- China
- Prior art keywords
- unit
- message
- intervention
- module
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Telephonic Communication Services (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses an audio-based human-machine hybrid interaction system. A speech recognition module is connected to a semantic recognition module and transmits the text corresponding to the speech; an exception handling module is connected to both the speech recognition module and the semantic recognition module, receiving the text from the former and the semantic parsing result from the latter; the exception handling module is connected to a speech synthesis module and transmits intervention information. The invention also discloses an audio-based human-machine hybrid interaction method: the speech recognition module converts the speech into text and outputs it to the semantic recognition module; the semantic recognition module extracts the user's goal and the corresponding key information from the text; the exception handling module judges, from the recognized text and the semantic information, whether the current human-machine dialogue is abnormal and handles the reply for the abnormal case. The technical solution of the invention provides a unified human-machine dialogue experience.
Description
Technical Field
The present invention relates to the field of information processing, and in particular to an audio-based human-machine hybrid interaction system and method.
Background Art
As shown in Figure 1, current audio-based human-machine dialogue systems all present the machine's reply to the user as the final reply. When the machine decision system cannot determine the user's intention, most dialogue systems respond with something like "please say that again" to make the user re-enter the input; some dialogue systems have introduced a manual intervention method based on a call center.
At present, exception handling in human-machine dialogue is mainly implemented through a call center: when the machine cannot process the user's audio input, or the user explicitly asks for human service, a human call-center operator is requested to intervene. A one-to-one call is then established between the user and the operator, who communicates with the user directly, learns the user's needs, and issues the corresponding instructions through the call platform.
The main problems with the manual intervention of existing call centers are: low efficiency, since the interventionist and the user must hold a one-to-one voice conversation and the interventionist cannot serve anyone else while waiting for the user's input; high cost, since a large-scale call center needs a series of telecom equipment and corresponding service integration, and the low efficiency requires more interventionists, indirectly raising labor cost; and strong dependence on the network environment, since transmitting audio directly over the network requires a stable connection, and network fluctuation degrades audio quality, harms the dialogue experience, and can even interrupt the dialogue flow.
Therefore, those skilled in the art are working to develop an audio-based human-machine hybrid interaction system and method that combines manual intervention replies with machine replies, so as to unify the dialogue flow and improve the user experience.
Summary of the Invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is how to improve the efficiency and user experience of human-machine dialogue in customer service.
To achieve the above object, the present invention provides an audio-based human-machine hybrid interaction system comprising a speech recognition module, a speech synthesis module, a semantic recognition module, and an exception handling module. The speech recognition module is connected to the semantic recognition module and transmits the text corresponding to the speech; the exception handling module is connected to the speech recognition module and the semantic recognition module, the speech recognition module transmitting text to it and the semantic recognition module transmitting the semantic parsing result to it; the exception handling module is connected to the speech synthesis module and transmits intervention information.
Further, the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model, and a decoder. The signal processing and feature extraction unit is connected to the acoustic model and transmits acoustic feature information; the decoder is connected to the acoustic model and the language model and outputs the recognition result.
Further, the speech synthesis module comprises a text analysis unit, a prosody control unit, and a speech synthesis unit. The text analysis unit receives text, processes it, and transmits the result to the prosody control unit and the synthesis unit; the prosody control unit is connected to the synthesis unit and transmits pitch, duration, intensity, pause, and intonation information; the synthesis unit combines the analysis result of the text analysis unit with the control parameters of the prosody control unit to synthesize the output speech.
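The dataflow of the synthesis module described above can be sketched as follows. This is an illustrative toy only: the function names and all prosody values are assumptions, not from the patent, and the "synthesis" step merely pairs text units with parameters where a real unit would generate a waveform.

```python
def text_analysis(text):
    # Stand-in for real text normalization and linguistic analysis:
    # simply split the input into word units.
    return text.split()

def prosody_control(units):
    # Attach pitch, duration, intensity, and pause parameters to each unit.
    # A real prosody controller derives these from context; flat defaults here.
    return [{"pitch": 1.0, "duration": 0.3, "intensity": 0.8, "pause": 0.05}
            for _ in units]

def synthesize(units, prosody):
    # Stand-in for waveform generation: combine each analyzed unit with its
    # prosody control parameters, mirroring the module's two inputs.
    return list(zip(units, prosody))

units = text_analysis("hello world")
frames = synthesize(units, prosody_control(units))
```

The point of the sketch is the wiring: the synthesis unit takes both the text analysis result and the prosody parameters, matching the two connections the patent describes.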
Further, the semantic recognition module comprises a domain labeling unit, an intention judgment unit, and an information extraction unit. The domain labeling unit is connected to the intention judgment unit and transmits domain information; the intention judgment unit is connected to the information extraction unit and transmits user intention information; the information extraction unit outputs the result of the semantic analysis.
Further, the exception handling module comprises an anomaly detection unit, a database query unit, and an interventionist unit. The anomaly detection unit receives the output of the speech recognition module and the semantic recognition module and decides whether to intervene; the database query unit receives the intervention signal from the anomaly detection unit and the semantic information from the semantic recognition module, then queries for and outputs an intervention message; the interventionist unit lets a human interventionist select among and, where necessary, modify the intervention messages output by the database query unit, and finally outputs the reply message to the user.
The present invention also provides an audio-based human-machine hybrid interaction method, comprising the following steps:
Step 1: provide a speech recognition module, a speech synthesis module, a semantic recognition module, and an exception handling module;
Step 2: the speech recognition module converts the speech into text and outputs it to the semantic recognition module;
Step 3: the semantic recognition module extracts the user's goal and the corresponding key information from the text;
Step 4: the exception handling module judges, from the text produced by the speech recognition module and the semantic information produced by the semantic recognition module, whether the current human-machine dialogue is abnormal, and handles the reply for the abnormal case.
Further, step 2 specifically comprises the following steps:
Step 2.1: extract features from the input audio stream for the acoustic model, while reducing the influence of environmental noise, channel, and speaker factors on the features;
Step 2.2: based on the acoustic model's output, the language model, and the dictionary, the decoder searches for the word string that outputs the audio stream with maximum probability, which becomes the speech recognition result.
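The search in step 2.2 amounts to choosing the word string W that maximizes P(O|W)·P(W), the product of acoustic-model and language-model probabilities. A minimal sketch in log space follows; the candidate strings and every score are invented for illustration, and a real decoder searches a lattice rather than a fixed candidate list.

```python
import math

def decode(candidates, acoustic_score, lm_score, lm_weight=1.0):
    """Return the candidate word string with the best combined log score:
    log P(O|W) + lm_weight * log P(W)."""
    best, best_score = None, -math.inf
    for words in candidates:
        score = acoustic_score(words) + lm_weight * lm_score(words)
        if score > best_score:
            best, best_score = words, score
    return best

# Hand-made log scores for three candidate transcriptions of one utterance.
acoustic = {"I want to go": -5.0, "I want to know": -4.8, "eye wander": -9.0}
lm = {"I want to go": -2.0, "I want to know": -3.5, "eye wander": -8.0}

result = decode(acoustic.keys(), acoustic.get, lm.get)
```

Here "I want to know" has the best acoustic score, but the language model tips the combined score toward "I want to go", which is exactly the interplay the patent's decoder relies on.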
Further, step 3 specifically comprises the following steps:
Step 3.1: use marker keywords in the text to label the domain of the current dialogue;
Step 3.2: judge the user's intention within that domain based on rules;
Step 3.3: extract the specific key information according to the domain and the user's intention, in combination with rules.
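Steps 3.1 to 3.3 can be sketched as a small keyword-and-rule pipeline. The keyword tables, intent names, and slot-extraction rule below are all invented to show the shape of the approach; the patent specifies only that domains come from marker keywords and intents and slots from rules or templates.

```python
# Illustrative rule tables; a real system would have far richer rules.
DOMAIN_KEYWORDS = {"navigation": ["go to", "navigate", "directions"],
                   "music": ["play", "song"]}
INTENT_RULES = {"navigation": [("go to", "navigate")],
                "music": [("play", "play_music")]}

def parse(text):
    """Steps 3.1-3.3: label the domain, judge the intent, extract key info."""
    text = text.lower()
    # 3.1 domain labeling via marker keywords
    domain = next((d for d, kws in DOMAIN_KEYWORDS.items()
                   if any(k in text for k in kws)), None)
    if domain is None:
        return None
    # 3.2 rule-based intent judgment within that domain
    intent = next((i for kw, i in INTENT_RULES[domain] if kw in text), None)
    # 3.3 template-style slot extraction: take what follows the trigger keyword
    slot = None
    for kw, _ in INTENT_RULES[domain]:
        if kw in text:
            slot = text.split(kw, 1)[1].strip()
    return {"domain": domain, "intent": intent, "slot": slot}

result = parse("Please go to a fun place")
```

On the patent's own example utterance this yields domain "navigation" with the remainder of the sentence as the slot, which is the kind of key information step 3.3 produces.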
Further, step 4 specifically comprises the following steps:
Step 4.1: the anomaly detection unit judges, from the recognized text and the semantic information, whether the current human-machine dialogue is abnormal; if so, the interventionist unit takes over the dialogue;
Step 4.2: the database query unit queries the database using the semantic information and obtains intervention messages with recommendation scores; if an intervention message's score is high, it is used directly for intervention; if the score is low, a human interventionist is requested to step in;
Step 4.3: when the machine algorithm cannot find an intervention message with a high recommendation score, the interventionist steps in to select and modify an intervention message, which is then sent to the client.
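The routing logic of steps 4.2 and 4.3 reduces to a threshold test on the best candidate's recommendation score. The sketch below assumes a numeric score in [0, 1] and a threshold of 0.8; the patent gives no concrete scale or threshold, and the candidate messages mimic its worked example.

```python
def handle_exception(semantic_info, db_query, threshold=0.8):
    """Step 4.2/4.3: query candidate intervention messages and route by
    recommendation score; below the threshold, escalate to a human."""
    candidates = db_query(semantic_info)  # list of (message, score) pairs
    if not candidates:
        return ("escalate_to_interventionist", None)
    message, score = max(candidates, key=lambda c: c[1])
    if score >= threshold:
        return ("auto_reply", message)        # high score: intervene directly
    return ("escalate_to_interventionist", message)  # low score: human steps in

# Invented low-score candidates, echoing the patent's "fun place" example.
def fake_db_query(info):
    return [("Would you like fun snacks in Suzhou?", 0.3),
            ("Found 5 fun-related places for you", 0.4)]

action, msg = handle_exception({"intent": "navigation", "tag": "fun"}, fake_db_query)
```

With both candidates scoring low, the call escalates to the interventionist, carrying the best candidate along so the human can modify rather than write from scratch, as step 4.3 describes.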
Further, the key information includes the dialogue domain and dialogue keywords, the dialogue keywords comprising content keywords and emotion keywords.
Compared with the prior art, the technical effects of the present invention include:
1. Higher efficiency: the time an interventionist spends waiting for user input is fully used, so one interventionist can serve several users at once, improving the efficiency of intervention.
2. Lower cost: no call-center telecom equipment needs to be purchased; the intervention platform can be built with existing computers and servers.
3. Richer work scenarios: since the interventionist interface uses a Browser/Server (B/S) architecture, an interventionist only needs to open a browser and log in to the corresponding website to intervene; there is no need to answer phones at a workstation, and intervention can be performed on mobile terminals such as tablets, smartphones, and laptops.
4. Low network requirements: the amount of data in text transmission is small, which lowers the demands on the network; meanwhile the speech the user hears is synthesized locally and is unaffected by network conditions.
5. A unified human-machine dialogue experience: the interventionist is transparent to the user, whose experience is that of talking to a sufficiently intelligent "machine"; this connects seamlessly with current human-machine dialogue.
The idea, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its purpose, features, and effects can be fully understood.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the intervention mode of an existing traditional call center;
Figure 2 is a schematic diagram of the system modules of the present invention;
Figure 3 is a schematic flow diagram of the system in a preferred embodiment of the present invention;
Figure 4 is a schematic diagram of the role dialogue flow in a preferred embodiment of the present invention.
Detailed Description
The present invention is realized through the following technical solutions:
As shown in Figure 2, the present invention relates to an audio-based system for handling exceptions in human-machine dialogue, comprising a speech recognition module, a speech synthesis module, a semantic recognition module, and an exception handling module. The speech recognition module is connected to the semantic recognition module and transmits the text corresponding to the speech; both the speech recognition module and the semantic recognition module are connected to the exception handling module, transmitting the text and the semantic parsing result respectively; the exception handling module is connected to the speech synthesis module and transmits intervention information.
The speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model, and a decoder. The signal processing and feature extraction unit is connected to the acoustic model and transmits acoustic feature information; the decoder is connected to the acoustic model and the language model and outputs the recognition result.
The speech synthesis module comprises a text analysis unit, a prosody control unit, and a speech synthesis unit. The text analysis unit receives text, processes it, and transmits the result to the prosody control unit and the synthesis unit; the prosody control unit is connected to the synthesis unit and transmits the target's pitch, duration, intensity, pause, and intonation information; the synthesis unit receives the analysis result of the text analysis unit and the control parameters of the prosody control unit and outputs the synthesized speech.
The semantic recognition module comprises a domain labeling unit, an intention judgment unit, and an information extraction unit. The domain labeling unit is connected to the intention judgment unit and transmits domain information; the intention judgment unit is connected to the information extraction unit and transmits user intention information; the information extraction unit outputs the semantic analysis information.
The exception handling module comprises an anomaly detection unit, a database query unit, and an interventionist unit. The anomaly detection unit receives the output of the speech recognition module and the semantic recognition module and decides whether to intervene; the database query unit receives the intervention signal from the anomaly detection unit and the semantic information from the semantic recognition module, then queries for and outputs an intervention message; the interventionist unit lets the interventionist select among and modify the intervention messages output by the database query unit as necessary, finally outputting the reply message to the user.
The present invention also relates to a method for handling exceptions in human-machine dialogue using the above system, which specifically comprises the following steps:
Step 1: provide a speech recognition module, a speech synthesis module, a semantic recognition module, and an exception handling module.
Step 2: the speech recognition module converts the speech into text and outputs it to the semantic recognition module; the specific steps include:
2.1 The front end processes the audio stream and extracts features from the input signal for the acoustic model, while reducing as far as possible the influence of environmental noise, channel, speaker, and other factors on the features.
2.2 For the input signal, the decoder searches, according to the acoustic and linguistic models and the dictionary, for the word string that outputs the signal with maximum probability, which becomes the speech recognition result.
Step 3: the semantic recognition module extracts the user's goal and the corresponding key information from the text; the specific steps include:
3.1 Use marker keywords in the text to label the domain of the current dialogue.
3.2 Judge the user's intention within the specific domain based on rules.
3.3 Extract the specific key information according to the domain and the user's intention, combined with rules such as preset templates.
Step 4: the exception handling module judges, from the recognized text and the semantic information, whether the current human-machine dialogue is abnormal, and performs exception handling and the message reply; the specific steps include:
4.1 The anomaly detection unit judges, from the recognized text and the semantic information, whether the current human-machine dialogue is abnormal. If not, the local client handles it; if so, the intervention server takes over the dialogue.
4.2 The database query unit queries the database using the semantic information and obtains recommended intervention messages. If an intervention message's recommendation score is high, it is used directly for intervention; if the score is low, a human interventionist is requested to step in.
4.3 When the machine algorithm cannot find an intervention message with a high recommendation score, the interventionist steps in to select and modify an intervention message, which is then sent to the client.
During exception handling, after the user's speech input passes through the machine's speech recognition and semantic parsing, the recognition result and the semantic parsing result are sent to the interventionist as text. On receiving them, the interventionist can choose to send either a dialogue message or a command message. A dialogue message is transmitted to the machine as text, then synthesized by a text-to-speech (TTS) system and played to the user; a command message is executed directly by the machine.
This embodiment, as shown in Figures 3 and 4, comprises three steps, introduced in turn: user input, intervention message generation, and client push of the intervention message.
1) User input
While the user speaks, the speech recognition system converts the user's audio into text and performs semantic analysis on the sentence (the analysis results include the user's current dialogue domain, the key information of the requested service, and so on); finally, the text and the semantic analysis results are transmitted as text to the exception handling module via the POST method of the HTTP protocol.
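The handoff described above (recognized text plus semantic-analysis results sent as text over HTTP POST) might carry a body like the following. The patent specifies only "text form via HTTP POST", not a schema, so every field name here is an assumption; the sketch only builds and round-trips the JSON body rather than performing a network call.

```python
import json

# Hypothetical request body for the POST to the exception handling module.
payload = {
    "asr_text": "I want to go to a fun place",          # speech recognition result
    "semantics": {                                       # semantic analysis result
        "domain": "navigation",
        "keywords": {"content": ["fun place"], "emotion": []},
    },
}
body = json.dumps(payload)     # the text actually sent in the POST body
decoded = json.loads(body)     # what the exception handling module would parse
```

Because only short text travels over the wire, the bandwidth needed is tiny, which is the basis of the "low network requirements" effect claimed earlier.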
2) Intervention message generation
Under abnormal conditions, the exception handling module queries the database with the recognized text and the semantic slots and obtains candidate intervention messages. If a candidate's recommendation score is high, it is used directly for intervention; if the score is low, a human interventionist is requested. On the interface the interventionist sees auxiliary data provided by the exception handling module, such as the recognition result of the user's input and the semantic analysis results, and with this information can screen and modify candidate intervention messages more accurately and quickly. Intervention messages are divided into dialogue messages and command messages; both are transmitted as text over a unified Websocket protocol and differ only in their content and in how the machine processes them.
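The two message kinds above share one text channel (the patent names Websocket) and diverge only at the client: a dialogue message goes to TTS, a command message is executed. A sketch of that dispatch follows; the JSON schema and field names are invented for illustration, and the `tts`/`executor` callables stand in for the real subsystems.

```python
import json

def make_dialogue_message(text):
    # Text the machine should speak to the user via TTS.
    return json.dumps({"type": "dialogue", "text": text})

def make_command_message(command, params):
    # An instruction the machine should execute directly.
    return json.dumps({"type": "command", "command": command, "params": params})

def client_handle(raw, tts, executor):
    """Dispatch one received intervention message by its type field."""
    msg = json.loads(raw)
    if msg["type"] == "dialogue":
        return tts(msg["text"])                          # synthesize and play
    return executor(msg["command"], msg["params"])       # run the command

spoken = client_handle(make_dialogue_message("What kind of entertainment would you like?"),
                       tts=lambda t: f"TTS:{t}",
                       executor=lambda c, p: f"RUN:{c}")
```

This mirrors the patent's worked example: the clarifying question travels as a dialogue message, while the final "navigation" instruction with POI information travels as a command message over the same channel.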
3) Client push of the intervention message
On receiving an intervention message, the client immediately returns a "message received" confirmation to the interventionist and caches the message in a message queue. The client monitors the current dialogue state and, under certain conditions, tries to take a message from the queue and push it to the user. The push is attempted (1) when an intervention message arrives and (2) when the playback of a TTS-synthesized voice message finishes; the conditions to be met are (1) the message queue is not empty and (2) the client's audio player is currently idle. If the push succeeds, an "intervention message pushed" confirmation is returned to the interventionist.
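The queue-and-push logic above (two triggers, two conditions, two acknowledgements) can be captured in a few lines. The class and method names are assumptions; playback is modeled simply as the player becoming busy when a message is pushed and idle again when `on_playback_finished` fires.

```python
from collections import deque

class InterventionClient:
    """Sketch of the client push logic: queue incoming intervention messages
    and push only when the queue is non-empty and the audio player is idle."""
    def __init__(self):
        self.queue = deque()
        self.player_idle = True
        self.acks = []                          # confirmations sent back

    def on_message(self, msg):
        self.acks.append("message received")    # immediate ack on arrival
        self.queue.append(msg)
        return self.try_push()                  # trigger 1: message arrival

    def on_playback_finished(self):
        self.player_idle = True
        return self.try_push()                  # trigger 2: TTS playback done

    def try_push(self):
        # Both conditions: queue non-empty AND player idle.
        if self.queue and self.player_idle:
            self.player_idle = False            # player starts speaking
            self.acks.append("message pushed")
            return self.queue.popleft()
        return None

c = InterventionClient()
first = c.on_message("What kind of entertainment would you like?")
second_attempt = c.on_message("Recommended: xxx. Go there?")  # player busy
after_playback = c.on_playback_finished()
```

A second message arriving while the first is being played is queued rather than pushed, and goes out only once playback finishes, which keeps the interventionist's messages from talking over each other.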
例如:For example:
1、用户A发出语音指令“我要去一个好玩的地方”。1. User A issues a voice command "I am going to a fun place".
2、语音识别模块将语音输入转换为文字。2. The voice recognition module converts voice input into text.
3、语义分析模块处理后得到用户意图为“导航”,导航的目标地的标签为“好玩”。3. After processing by the semantic analysis module, it is obtained that the user's intention is "navigation", and the label of the navigation destination is "fun".
4、异常处理模块中的异常检测单元收到用户A的服务请求,包含完整的语音识别结果“我要去一个好玩的地方”,和语义分析的结果“导航”、"好玩",同时检测到当前的对话状态出现异常。4. The abnormal detection unit in the abnormal processing module receives the service request from user A, including the complete voice recognition result "I am going to a fun place", and the semantic analysis results "navigation" and "fun", and simultaneously detects There is an exception in the current dialog state.
5、异常处理模块中的数据库查询单元根据”导航“、”好玩“进行数据库查询,得到一些备选消息比如”请问您要去苏州的好玩小吃吗?“、”为您找个5个与好玩相关的地点“,这两条消息的推荐度都比较低,故请求干预师单元的人工介入。干预师利用异常处理模块得到的数据库查询结果以及语义分析结果和语音识别的文字结果进行干预消息的选择和修改,将干预消息改为”请问您想要怎样的娱乐方式?“,向用户发送该文本消息。5. The database query unit in the exception handling module performs database query according to "navigation" and "fun", and gets some alternative messages such as "do you want to go to Suzhou's fun snacks?", "find 5 for you and fun" Relevant locations", the recommendation of these two messages is relatively low, so manual intervention from the interventionist unit is requested. The interventionist selects and modifies the intervention message by using the database query results obtained by the exception handling module, the semantic analysis results, and the text results of speech recognition, and changes the intervention message to "What kind of entertainment do you want?", and sends the message to the user. text message.
6、客户机收到干预消息后将其存入消息队列,向异常处理模块发送“消息已收到”的反馈,并尝试进行推送。6. After receiving the intervention message, the client computer stores it in the message queue, sends a feedback of "the message has been received" to the exception handling module, and tries to push it.
7、条件满足后进行干预消息的语音合成系统合成以及播报,用户听到音频“请问您想要怎样的娱乐方式”,客户机向异常处理模块发送“消息已推送”反馈。7. After the conditions are met, the speech synthesis system synthesizes and broadcasts the intervention message. The user hears the audio "What kind of entertainment do you want?", and the client computer sends a "message has been pushed" feedback to the exception handling module.
8、客户进行进一步的语音输入“我要去唱歌”8. The customer makes further voice input "I'm going to sing"
9、ASR系统将语音输入转换为文字9. ASR system converts speech input into text
10、语义分析得到用户意图为“导航”,导航的目标为“KTV”10. Semantic analysis shows that the user's intention is "navigation", and the navigation target is "KTV"
11、异常检测单元得到用户A的具体服务需求,包含完整的语音识别结果“我要去唱歌”,和语义分析的结果”导航“、”KTV“。11. The anomaly detection unit obtains the specific service requirements of user A, including the complete voice recognition result "I'm going to sing", and the semantic analysis results "navigation" and "KTV".
12. The database query unit searches the database using "navigation", "KTV", and the user's profile information, and obtains the candidate intervention message "We recommend xxx. Would you like to go?". Because this message's recommendation score is high, the interventionist unit is bypassed and the text message "We recommend xxx. Would you like to go?" is sent directly to the client.
13. The user confirms.
14. The exception handling system pushes a command-type intervention message to the user, containing the command type "navigation" and the destination's POI information.
15. The client takes the command-type "navigation" message and the corresponding POI information out of the message queue and performs the navigation operation. The client sends a "message pushed" acknowledgment to the exception handling module, and the interaction ends.
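The routing logic in the steps above — candidate replies below a recommendation threshold are escalated to the human interventionist (step 5), high-scoring replies bypass the interventionist unit (step 12), and the client queues, acknowledges, and pushes each text or command message (steps 6-7, 15) — can be sketched as follows. This is an illustrative sketch only, not the patented implementation; all names (`Candidate`, `Client`, `RECOMMENDATION_THRESHOLD`, the `0.8` cutoff) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

RECOMMENDATION_THRESHOLD = 0.8  # assumed cutoff for bypassing the interventionist


@dataclass
class Candidate:
    text: str
    score: float  # recommendation score from the database query unit


@dataclass
class Message:
    kind: str      # "text" or "command"
    payload: dict


def exception_handling(intent: str, target: str, candidates: list[Candidate],
                       ask_interventionist) -> Message:
    """Pick the best candidate reply, or escalate to a human when all scores are low."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is not None and best.score >= RECOMMENDATION_THRESHOLD:
        # Step 12: high recommendation score -> bypass the interventionist unit.
        return Message("text", {"text": best.text})
    # Step 5: low recommendation scores -> the interventionist selects/edits the reply.
    edited = ask_interventionist(intent, target, candidates)
    return Message("text", {"text": edited})


@dataclass
class Client:
    """Steps 6-7 and 15: queue each message, acknowledge receipt, then push it."""
    queue: deque = field(default_factory=deque)
    acks: list = field(default_factory=list)

    def receive(self, msg: Message) -> None:
        self.queue.append(msg)
        self.acks.append("message received")  # feedback to the exception handling module

    def push(self) -> str:
        msg = self.queue.popleft()
        if msg.kind == "command" and msg.payload.get("command") == "navigation":
            action = f"navigate to {msg.payload['poi']}"   # step 15: command message
        else:
            action = f"speak: {msg.payload['text']}"       # step 7: TTS broadcast
        self.acks.append("message pushed")
        return action
```

In this sketch the interventionist is modeled as a callback so the same `exception_handling` function covers both the escalated path and the bypass path; the real system described above instead routes the candidates, semantic analysis results, and recognized text to an interventionist workstation.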
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations based on the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain, based on the concept of the present invention and on the basis of the prior art, through logical analysis, reasoning, or limited experimentation shall fall within the scope of protection defined by the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610791966.0A CN106409283B (en) | 2016-08-31 | 2016-08-31 | Man-machine mixed interaction system and method based on audio |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106409283A true CN106409283A (en) | 2017-02-15 |
| CN106409283B CN106409283B (en) | 2020-01-10 |
Family
ID=58001464
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610791966.0A Active CN106409283B (en) | 2016-08-31 | 2016-08-31 | Man-machine mixed interaction system and method based on audio |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106409283B (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1920948A (en) * | 2005-08-24 | 2007-02-28 | 富士通株式会社 | Voice recognition system and voice processing system |
| CN101276584A (en) * | 2007-03-28 | 2008-10-01 | 株式会社东芝 | Prosodic Pattern Generating Device, Speech Synthesizing Device and Method |
| CN102509483A (en) * | 2011-10-31 | 2012-06-20 | 苏州思必驰信息科技有限公司 | Distributive automatic grading system for spoken language test and method thereof |
| CN102982799A (en) * | 2012-12-20 | 2013-03-20 | 中国科学院自动化研究所 | Speech recognition optimization decoding method integrating guide probability |
| CN104678868A (en) * | 2015-01-23 | 2015-06-03 | 贾新勇 | Business and equipment operation and maintenance monitoring system |
| CN105227790A (en) * | 2015-09-24 | 2016-01-06 | 北京车音网科技有限公司 | A kind of voice answer method, electronic equipment and system |
| CN105723362A (en) * | 2013-10-28 | 2016-06-29 | 余自立 | Natural expression processing method, processing and response method, device, and system |
Cited By (24)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107204185B (en) * | 2017-05-03 | 2021-05-25 | 深圳车盒子科技有限公司 | Vehicle-mounted voice interaction method and system and computer readable storage medium |
| CN107122807A (en) * | 2017-05-24 | 2017-09-01 | 努比亚技术有限公司 | A kind of family's monitoring method, service end and computer-readable recording medium |
| CN107122807B (en) * | 2017-05-24 | 2021-05-21 | 努比亚技术有限公司 | Home monitoring method, server and computer readable storage medium |
| CN107733780A (en) * | 2017-09-18 | 2018-02-23 | 上海量明科技发展有限公司 | Task smart allocation method, apparatus and JICQ |
| CN107733780B (en) * | 2017-09-18 | 2020-07-03 | 上海量明科技发展有限公司 | Intelligent task allocation method and device and instant messaging tool |
| CN109697226A (en) * | 2017-10-24 | 2019-04-30 | 上海易谷网络科技股份有限公司 | Text silence seat monitoring robot interactive method |
| CN107992587A (en) * | 2017-12-08 | 2018-05-04 | 北京百度网讯科技有限公司 | A kind of voice interactive method of browser, device, terminal and storage medium |
| CN110069607A (en) * | 2017-12-14 | 2019-07-30 | 株式会社日立制作所 | For the method, apparatus of customer service, electronic equipment, computer readable storage medium |
| CN110069607B (en) * | 2017-12-14 | 2024-03-05 | 株式会社日立制作所 | Methods, devices, electronic devices, and computer-readable storage media for customer service |
| US10983526B2 (en) | 2018-09-17 | 2021-04-20 | Huawei Technologies Co., Ltd. | Method and system for generating a semantic point cloud map |
| WO2020057446A1 (en) * | 2018-09-17 | 2020-03-26 | Huawei Technologies Co., Ltd. | Method and system for generating a semantic point cloud map |
| CN110970017A (en) * | 2018-09-27 | 2020-04-07 | 北京京东尚科信息技术有限公司 | Human-computer interaction method and system, computer system |
| CN111125384B (en) * | 2018-11-01 | 2023-04-07 | 阿里巴巴集团控股有限公司 | Multimedia answer generation method and device, terminal equipment and storage medium |
| CN111125384A (en) * | 2018-11-01 | 2020-05-08 | 阿里巴巴集团控股有限公司 | Multimedia answer generation method and device, terminal equipment and storage medium |
| CN110602334A (en) * | 2019-09-03 | 2019-12-20 | 上海航动科技有限公司 | Intelligent outbound method and system based on man-machine cooperation |
| CN110926493A (en) * | 2019-12-10 | 2020-03-27 | 广州小鹏汽车科技有限公司 | Navigation method, navigation device, vehicle and computer readable storage medium |
| CN111540353A (en) * | 2020-04-16 | 2020-08-14 | 重庆农村商业银行股份有限公司 | Semantic understanding method, device, equipment and storage medium |
| CN112509575A (en) * | 2020-11-26 | 2021-03-16 | 上海济邦投资咨询有限公司 | Financial consultation intelligent guiding system based on big data |
| CN112735427A (en) * | 2020-12-25 | 2021-04-30 | 平安普惠企业管理有限公司 | Radio reception control method and device, electronic equipment and storage medium |
| CN112735410A (en) * | 2020-12-25 | 2021-04-30 | 中国人民解放军63892部队 | Automatic voice interactive force model control method and system |
| CN112735427B (en) * | 2020-12-25 | 2023-12-05 | 海菲曼(天津)科技有限公司 | Radio reception control method and device, electronic equipment and storage medium |
| CN112735410B (en) * | 2020-12-25 | 2024-06-07 | 中国人民解放军63892部队 | Automatic voice interactive force model control method and system |
| CN116453540A (en) * | 2023-06-15 | 2023-07-18 | 山东贝宁电子科技开发有限公司 | Underwater frogman voice communication quality enhancement processing method |
| CN116453540B (en) * | 2023-06-15 | 2023-08-29 | 山东贝宁电子科技开发有限公司 | A method for enhancing the quality of underwater frogman voice communication |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106409283A (en) | Audio frequency-based man-machine mixed interaction system and method | |
| US12332948B2 (en) | In-conversation search | |
| CN107895578B (en) | Voice interaction method and device | |
| CN106777018B (en) | Method and device for optimizing input sentences in intelligent chat robot | |
| KR102108500B1 (en) | Supporting Method And System For communication Service, and Electronic Device supporting the same | |
| CN106354835A (en) | Artificial dialogue auxiliary system based on context semantic understanding | |
| CN107609092A (en) | Intelligent response method and apparatus | |
| CN108777751A (en) | A kind of call center system and its voice interactive method, device and equipment | |
| CN113782022B (en) | Communication method, device, equipment and storage medium based on intention recognition model | |
| WO2017128775A1 (en) | Voice control system, voice processing method and terminal device | |
| CN110956955A (en) | A method and device for voice interaction | |
| KR20240046508A (en) | Decision and visual display of voice menu for calls | |
| KR20230163649A (en) | Intelligent response recommendation system and method for supporting customer consultation based on speech signal in realtime | |
| CN112669842A (en) | Man-machine conversation control method, device, computer equipment and storage medium | |
| CN112866086A (en) | Information pushing method, device, equipment and storage medium for intelligent outbound | |
| CN113076397A (en) | Intention recognition method and device, electronic equipment and storage medium | |
| CN114064943A (en) | Conference management method, conference management device, storage medium and electronic equipment | |
| CN109887490A (en) | Method and apparatus for recognizing speech | |
| CN105279168A (en) | Data query method supporting natural language, open platform, and user terminal | |
| CN110740212B (en) | Call answering method and device based on intelligent voice technology and electronic equipment | |
| CN117809641A (en) | Terminal equipment and voice interaction method based on query text rewriting | |
| CN109360565A (en) | A method of precision of identifying speech is improved by establishing resources bank | |
| US20140067401A1 (en) | Provide services using unified communication content | |
| JP6625772B2 (en) | Search method and electronic device using the same | |
| CN109545203A (en) | Audio recognition method, device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| TR01 | Transfer of patent right |
Effective date of registration: 20200619 Address after: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120 Patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd. Address before: No. 800 Dongchuan Road, Shanghai, 200240 Patentee before: SHANGHAI JIAO TONG University |
|
| TR01 | Transfer of patent right | ||
| TR01 | Transfer of patent right |
Effective date of registration: 20201105 Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu. Patentee after: AI SPEECH Ltd. Address before: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120 Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd. |
|
| TR01 | Transfer of patent right | ||
| CP01 | Change in the name or title of a patent holder |
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province Patentee after: Sipic Technology Co.,Ltd. Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province Patentee before: AI SPEECH Ltd. |
|
| CP01 | Change in the name or title of a patent holder |