CN106409283B

CN106409283B - Man-machine mixed interaction system and method based on audio

Info

Publication number: CN106409283B
Application number: CN201610791966.0A
Authority: CN
Inventors: 俞凯; 石开宇; 郑达; 陈露; 常成; 曹迪
Original assignee: Shanghai Jiao Tong University
Current assignee: Sipic Technology Co Ltd
Priority date: 2016-08-31
Filing date: 2016-08-31
Publication date: 2020-01-10
Anticipated expiration: 2036-08-31
Also published as: CN106409283A

Abstract

The invention discloses a human-computer hybrid interaction system based on audio. A speech recognition module is connected with a semantic recognition module and transmits text information corresponding to the speech. An exception processing module is connected with the speech recognition module and the semantic recognition module, and the speech recognition module transmits the text information. To the exception processing module, the semantic recognition module transmits the semantic analysis result to the exception processing module; the exception processing module is connected with the speech synthesis module and transmits the intervention information. The invention also discloses a human-computer hybrid interaction method based on audio. The speech recognition module converts the speech information into text information and outputs it to the semantic recognition unit; the semantic recognition unit extracts the user purpose and corresponding key information from the text information; abnormality The processing module judges whether the current human-machine dialogue is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module, and processes the reply to the message according to the abnormality. The technical solution of the present invention provides a unified human-machine dialogue experience.

Description

Audio-based human-computer hybrid interaction system and method

技术领域technical field

本发明涉及信息处理技术领域，尤其涉及一种基于音频的人机混合交互系统及方法。The present invention relates to the technical field of information processing, and in particular, to an audio-based human-computer hybrid interaction system and method.

背景技术Background technique

如图1所示，目前基于音频的人机对话系统均使用机器回复作为最终回复呈现给用户，当机器决策系统不能明确用户意图时，大部分对话系统选择呈现“请再说一遍”之类的回复以让用户进行重新的输入，其中部分人机对话系统引入了基于话务中心的人工干预方法。As shown in Figure 1, the current audio-based human-machine dialogue systems all use the machine reply as the final reply to present to the user. When the machine decision-making system cannot clarify the user's intention, most dialogue systems choose to present a reply such as "please say it again". In order to allow users to re-enter, some of the man-machine dialogue systems have introduced manual intervention methods based on call centers.

目前现有人机对话异常处理主要通过话务中心形式实现，在机器无法处理用户输入音频或者在用户明确表示需要人工服务时，请求人工的话务中心介入，此时用户与话务员之间建立一对一的通话连接，话务员与用户进行直接交流，获知用户的需求并通过话务平台下发相应的指令。At present, the existing abnormal handling of human-machine dialogue is mainly implemented in the form of a call center. When the machine cannot process the user's input audio or when the user clearly indicates that manual service is required, the manual call center is requested to intervene. At this time, the user and the operator establish a pair of Once the call is connected, the operator communicates directly with the user, learns the user's needs and issues corresponding instructions through the call platform.

现有话务中心的人工干预方式存在的问题主要有：人工效率低，干预师与用户需要建立一对一的语音交流，等待用户输入的时间段内无法服务其他人；成本高，大规模的呼叫中心需要一系列的电信设备以及相应服务集成，同时由于效率低，需要更多干预师进行干预服务，从而间接提高了人力成本；受网络环境影响大：利用网络资源直接传输音频需要稳定的网络连接，网络环境的波动会导致音频质量下降从而影响对话体验，甚至中断人机对话流程。The main problems of the existing manual intervention method of the call center are: low labor efficiency, the interventionist and the user need to establish one-to-one voice communication, and cannot serve other people during the time period waiting for the user's input; high cost, large-scale The call center requires a series of telecommunication equipment and corresponding service integration. At the same time, due to low efficiency, more interventionists are required to provide intervention services, which indirectly increases labor costs; it is greatly affected by the network environment: using network resources to directly transmit audio requires a stable network Connection, fluctuations in the network environment can cause audio quality to degrade, affecting the conversation experience, and even interrupting the human-machine conversation flow.

因此，本领域的技术人员致力于开发一种基于音频的人机混合交互系统及方法，将人工干预回复与机器回复相结合，从而统一人机对话的流程和提升用户体验。Therefore, those skilled in the art are devoted to developing an audio-based human-machine hybrid interaction system and method, which combines human intervention response with machine response, so as to unify the process of human-machine dialogue and improve user experience.

发明内容SUMMARY OF THE INVENTION

有鉴于现有技术的上述缺陷，本发明所要解决的技术问题是如何提高客服过程中人机对话的效率和用户体验。In view of the above-mentioned defects of the prior art, the technical problem to be solved by the present invention is how to improve the efficiency of human-machine dialogue and user experience in the customer service process.

为实现上述目的，本发明提供了一种基于音频的人机混合交互系统，包括语音识别模块、语音合成模块、语义识别模块以及异常处理模块，其中，所述语音识别模块被配置为与所述语义识别模块相连并传输语音对应的文字信息，所述异常处理模块被配置为与所述语音识别模块和所述语义识别模块相连，所述语音识别模块被配置为传输文字信息给所述异常处理模块，所述语义识别模块被配置为传输语义解析结果给所述异常处理模块；所述异常处理模块被配置为与所述语音合成模块相连并传输干预信息。In order to achieve the above object, the present invention provides an audio-based human-computer hybrid interaction system, including a speech recognition module, a speech synthesis module, a semantic recognition module and an exception processing module, wherein the speech recognition module is configured to The semantic recognition module is connected to and transmits text information corresponding to the voice, the exception processing module is configured to be connected to the voice recognition module and the semantic recognition module, and the voice recognition module is configured to transmit text information to the exception processing module, the semantic recognition module is configured to transmit the semantic parsing result to the exception processing module; the exception processing module is configured to be connected with the speech synthesis module and transmit intervention information.

进一步地，所述语音识别模块包括信号处理及特征提取单元、声学模型、语言模型以及解码器，其中，所述信号处理及特征提取单元被配置为与所述声学模型相连并传输声学特征信息，所述解码器被配置为与所述声学模型和所述语言模型相连并输出识别结果。Further, the speech recognition module includes a signal processing and feature extraction unit, an acoustic model, a language model and a decoder, wherein the signal processing and feature extraction unit is configured to be connected with the acoustic model and transmit acoustic feature information, The decoder is configured to connect with the acoustic model and the language model and output a recognition result.

进一步地，所述语音合成模块包括文本分析单元、韵律控制单元以及合成语音单元，其中，所述文本分析单元被配置为接收文本信息并对所述文本信息进行处理，将处理结果传输到所述韵律控制单元与所述合成语音单元，所述韵律控制单元被配置为与所述合成语音单元相连，并传输音高、音长、音强、停顿及语调信息，所述合成语音单元被配置为将所述接收文本分析单元的分析结果与所述韵律控制单元的控制参数合成输出的语音。Further, the speech synthesis module includes a text analysis unit, a prosody control unit and a synthesized speech unit, wherein the text analysis unit is configured to receive text information and process the text information, and transmit the processing result to the a prosody control unit and the synthesized speech unit, the prosody control unit configured to be connected to the synthesized speech unit and to transmit pitch, length, intensity, pause and intonation information, the synthesized speech unit configured to be The output speech is synthesized by combining the analysis result of the received text analysis unit and the control parameters of the prosody control unit.

进一步地，所述语义识别模块包括领域标注单元、意图判断单元、信息提取单元，其中，所述领域标注单元被配置为与所述意图判断单元相连并传输领域信息，所述意图判断单元被配置为与所述信息提取单元相连并传输用户意图信息，所述信息提取单元输出语义分析的结果。Further, the semantic recognition module includes a domain labeling unit, an intention judging unit, and an information extraction unit, wherein the domain labeling unit is configured to be connected to the intention judging unit and transmit domain information, and the intention judging unit is configured In order to connect with the information extraction unit and transmit user intention information, the information extraction unit outputs the result of semantic analysis.

进一步地，所述异常处理模块包括异常检测单元、数据库查询单元以及干预师单元，其中，所述异常检测单元被配置为接收所述语音识别模块和所述语义识别模块的输出，并决定是否采取干预措施，所述数据库查询单元被配置为接收所述异常检测单元的干预信号，并接收所述语义识别模块的语义信息，查询并输出干预消息，所述干预师单元被配置为利用干预师对所述数据库查询单元输出的所述干预消息进行必要的择优以及修改，最终输出给用户的回复消息。Further, the abnormality processing module includes an abnormality detection unit, a database query unit and an interventionist unit, wherein the abnormality detection unit is configured to receive the output of the speech recognition module and the semantic recognition module, and decide whether to take the Intervention measures, the database query unit is configured to receive the intervention signal of the abnormality detection unit, and to receive the semantic information of the semantic recognition module, query and output the intervention message, and the interventionist unit is configured to use the interventionist to The intervention message output by the database query unit is selected and modified as necessary, and finally a reply message is output to the user.

本发明还提供了一种基于音频的人机混合交互方法，包括以下步骤：The present invention also provides an audio-based human-computer hybrid interaction method, comprising the following steps:

步骤1、提供语音识别模块、语音合成模块、语义识别模块以及异常处理模块；Step 1. Provide a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module;

步骤2、所述语音识别模块将语音信息转换为文字信息并输出至所述语义识别单元；Step 2, the speech recognition module converts the speech information into text information and outputs it to the semantic recognition unit;

步骤3、所述语义识别单元从文字信息中提取用户目的以及相应的关键信息；Step 3, the semantic recognition unit extracts the user purpose and the corresponding key information from the text information;

步骤4、所述异常处理模块根据所述语音识别模块的文字信息以及所述语义识别模块的语义信息判断人机对话当前是否出现异常并针对异常处理消息的回复。Step 4: The abnormality processing module judges whether there is currently an abnormality in the human-machine dialogue according to the text information of the speech recognition module and the semantic information of the semantic recognition module, and processes a reply to the abnormality message.

进一步地，在步骤2中，具体包括以下步骤：Further, in step 2, the following steps are specifically included:

步骤2.1、从输入的音频流中提取特征供声学模型处理，同时降低环境噪声、信道和说话人因素对所述特征造成的影响；Step 2.1. Extract features from the input audio stream for processing by the acoustic model, while reducing the impact of environmental noise, channel and speaker factors on the features;

步骤2.2、解码器根据声学、语言学模型及词典，对所述声学模型的处理结果，寻找能够以最大概率输出所述音频流的词串，作为语音的识别结果。Step 2.2: The decoder searches for the word string that can output the audio stream with the greatest probability according to the acoustic and linguistic models and the dictionary and the processing result of the acoustic model, as the speech recognition result.

进一步地，在步骤3中，具体包括以下步骤：Further, in step 3, the following steps are specifically included:

步骤3.1、利用文字信息中标志性的关键词标记当前对话所属的领域；Step 3.1, use the iconic keywords in the text information to mark the field to which the current conversation belongs;

步骤3.2、在所述领域中基于规则对用户意图进行判断；Step 3.2, judge the user's intention based on the rules in the field;

步骤3.3、根据所述领域以及所述用户意图，结合规则，对具体的关键信息进行提取。Step 3.3: Extract specific key information according to the domain and the user's intention and in combination with the rules.

进一步地，在步骤4中，具体包括以下步骤：Further, in step 4, it specifically includes the following steps:

步骤4.1、异常检测单元根据所述语音识别模块的文字信息以及所述语义识别模块的语义信息判断当前的人机对话是否出现异常，若异常则由干预师单元接管人机对话；Step 4.1, the abnormality detection unit judges whether the current man-machine dialogue is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module, and if abnormal, the interventionist unit takes over the man-machine dialogue;

步骤4.2、数据库查询单元根据语义信息进行数据库的查询，得到具有推荐度的干预消息，如果干预消息的推荐度较高，则直接利用该干预消息进行干预，如果推荐度较低，则请求干预师进行人工介入；Step 4.2, the database query unit queries the database according to the semantic information, and obtains the intervention message with the recommendation degree. If the recommendation degree of the intervention message is high, the intervention message is directly used for intervention, and if the recommendation degree is low, the interventionist is requested. carry out manual intervention;

步骤4.3、在机器算法无法找到高推荐度的干预消息时，干预师介入进行干预消息的选择以及修改，随后将修改后的干预消息发送至客户端。Step 4.3. When the machine algorithm cannot find a highly recommended intervention message, the interventionist steps in to select and modify the intervention message, and then sends the modified intervention message to the client.

进一步地，所述关键信息包括对话领域、对话关键词，所述对话关键词包括内容关键词和情绪关键词。Further, the key information includes dialogue fields and dialogue keywords, and the dialogue keywords include content keywords and emotion keywords.

与现有技术相比，本发明的技术效果包括:Compared with the prior art, the technical effects of the present invention include:

1、效率提高：充分利用了干预师等待用户输入的时间，使得干预师可同时对多个用户进行干预服务，提高干预的效率。1. Efficiency improvement: The time spent by the interventionist waiting for user input is fully utilized, so that the interventionist can provide intervention services to multiple users at the same time, thereby improving the efficiency of intervention.

2、成本减少：无需采购话务中心相关的一系列电信设备，利用现有的计算机以及服务器即可搭建干预平台。2. Cost reduction: There is no need to purchase a series of telecommunications equipment related to the call center, and an intervention platform can be built using existing computers and servers.

3、工作场景丰富：由于干预师界面采用了B/S(Browser/Server浏览器/服务器)结构，干预师打开浏览器登录相应的网站即可进行干预操作，不必要在工位上接听电话，可以在PAD、智能手机、个人笔记本等移动终端上进行干预服务。3. Rich work scenarios: Since the interventionist interface adopts the B/S (Browser/Server browser/server) structure, the interventionist can open the browser and log in to the corresponding website to perform intervention operations, and it is not necessary to answer the phone at the workstation. Intervention services can be performed on mobile terminals such as PADs, smartphones, and personal notebooks.

4、网络要求低：文本传输的数据量很小，从而对网络的要求降低，同时用户收听到的语音由本地合成，不受网络情况的影响。4. Low network requirements: The amount of data transmitted by text is very small, so the requirements for the network are reduced. At the same time, the voice heard by the user is synthesized locally and is not affected by the network situation.

5、统一的人机对话体验：对用户来说，干预师是透明的，用户的体验如同与一个充分智能的“机器”在对话，可以无缝衔接目前的人机对话方式。5. Unified human-machine dialogue experience: For the user, the interventionist is transparent, and the user's experience is like a dialogue with a fully intelligent "machine", which can seamlessly connect the current human-machine dialogue mode.

以下将结合附图对本发明的构思、具体结构及产生的技术效果作进一步说明，以充分地了解本发明的目的、特征和效果。The concept, specific structure and technical effects of the present invention will be further described below in conjunction with the accompanying drawings, so as to fully understand the purpose, characteristics and effects of the present invention.

附图说明Description of drawings

图1为现有传统话务中心的干预模式示意图；1 is a schematic diagram of an intervention mode of an existing traditional call center;

图2为本发明的系统模块示意图；2 is a schematic diagram of a system module of the present invention;

图3为本发明一个较佳实施例的系统流程示意图；3 is a schematic flow chart of a system according to a preferred embodiment of the present invention;

图4为本发明一个较佳实施例的角色对话流程示意图。FIG. 4 is a schematic diagram of a character dialogue flow according to a preferred embodiment of the present invention.

具体实施方式Detailed ways

本发明是通过以下技术方案实现的：The present invention is achieved through the following technical solutions:

如图2所示，本发明涉及一种基于音频的人机对话异常处理系统，包括：语音识别模块、语音合成模块、语义识别模块以及异常处理模块，其中：语音识别模块与语义识别模块相连并传输语音对应的文字信息，语音识别模块和语义识别模块均与异常处理模块相连，并分别传输文字信息和语义解析结果，异常处理模块与语音合成模块相连并传输干预信息。As shown in Figure 2, the present invention relates to an audio-based human-machine dialogue exception processing system, including: a speech recognition module, a speech synthesis module, a semantic recognition module and an exception processing module, wherein: the speech recognition module is connected with the semantic recognition module and The text information corresponding to the voice is transmitted. The speech recognition module and the semantic recognition module are connected to the exception processing module, and respectively transmit text information and semantic analysis results. The exception processing module is connected to the speech synthesis module and transmits intervention information.

所述的语音识别模块包括：信号处理及特征提取单元、声学模型、语言模型以及解码器，其中：信号处理及特征提取单元与声学模型相连并传输声学特征信息，解码器与声学模型和语言模型相连，对外界输出识别结果。The speech recognition module includes: a signal processing and feature extraction unit, an acoustic model, a language model and a decoder, wherein: the signal processing and feature extraction unit is connected with the acoustic model and transmits acoustic feature information, and the decoder is connected with the acoustic model and the language model. connected, and output the recognition result to the outside world.

所述的语音合成模块包括：文本分析单元、韵律控制单元以及合成语音单元，，其中：文本分析单元接收文本信息并对其进行处理，将处理结果传输到韵律控制单元与合成语音单元，韵律控制单元与合成语音单元相连，并传输目标的音高、音长、音强、停顿及语调等信息，合成语音单元接收文本分析单元的分析结果与韵律控制单元的控制参数，对外界输出合成的语音。The speech synthesis module includes: a text analysis unit, a rhythm control unit and a synthesized speech unit, wherein: the text analysis unit receives and processes the text information, and transmits the processing result to the rhythm control unit and the synthesized speech unit, and the rhythm control unit The unit is connected with the synthetic speech unit, and transmits information such as the pitch, length, intensity, pause and intonation of the target. The synthetic speech unit receives the analysis results of the text analysis unit and the control parameters of the prosody control unit, and outputs the synthesized speech to the outside world. .

所述的语义识别模块包括：领域标注单元、意图判断单元、信息提取单元，其中：领域标注单元与意图判断单元相连并传输领域信息，意图判断单元与信息提取单元相连并传输用户意图信息，信息单元与外界相连并传输语义分析的信息。The semantic recognition module includes: a domain labeling unit, an intent judging unit, and an information extraction unit, wherein: the domain labeling unit is connected with the intent judging unit and transmits domain information, the intent judging unit is connected with the information extraction unit and transmits user intent information, information Units are connected to the outside world and transmit information for semantic analysis.

所述的异常处理模块包括：异常检测单元、数据库查询单元、干预师单元以，其中：异常检测单元接收语音识别模块和语义识别模块的输出，并决定是否采取干预措施，数据库查询单元接收异常检测单元的干预信号，并接收语义识别模块的语义信息，查询并输出干预消息，干预师单元利用干预师对数据库查询单元输出的干预消息进行必要的择优以及修改，最终输出用户回复消息。The abnormality processing module includes: an abnormality detection unit, a database query unit, and an interventionist unit, wherein: the abnormality detection unit receives the output of the speech recognition module and the semantic recognition module, and decides whether to take intervention measures, and the database query unit receives the abnormality detection unit. The intervention signal of the unit, and receive the semantic information of the semantic recognition module, query and output the intervention message, the interventionist unit uses the interventionist to perform necessary selection and modification of the intervention message output by the database query unit, and finally outputs the user reply message.

本发明涉及上述系统的人机对话异常处理方法，具体包括以下步骤：The present invention relates to the abnormal processing method of the human-machine dialogue of the above-mentioned system, which specifically includes the following steps:

步骤1、提供语音识别模块、语音合成模块、语义识别模块以及异常处理模块。Step 1. Provide a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module.

步骤2、语音识别模块将语音信息转换为文字信息并输出至语义识别单元，具体步骤包括：Step 2, the speech recognition module converts the speech information into text information and outputs it to the semantic recognition unit, and the specific steps include:

2.1前端处理音频流,从输入信号中提取特征，供声学模型处理。同时尽可能降低环境噪声、信道、说话人等因素对特征造成的影响。2.1 The front end processes the audio stream and extracts features from the input signal for processing by the acoustic model. At the same time, the influence of environmental noise, channel, speaker and other factors on the features is minimized.

2.2解码器对输入的信号根据声学、语言学模型及词典，寻找能够以最大概率输出该信号的词串，作为语音的识别结果。2.2 The decoder searches for the word string that can output the signal with the greatest probability according to the acoustic, linguistic model and dictionary for the input signal, as the recognition result of the speech.

步骤3、语义识别单元从文字信息中提取用户目的以及相应的关键信息，具体步骤包括：Step 3, the semantic recognition unit extracts the user purpose and the corresponding key information from the text information, and the specific steps include:

3.1利用文字信息中标志性的关键词标记当前对话所属的领域。3.1 Use the iconic keywords in the text information to mark the field to which the current conversation belongs.

3.2在具体领域中基于规则对用户的意图进行判断。3.2 Judging the user's intention based on rules in specific fields.

3.3根据领域以及用户意图，结合规则，例如预先设定的模板，对具体的关键信息进行提取。3.3 According to the domain and user intent, combined with rules, such as preset templates, extract specific key information.

步骤4、异常处理模块根据语音识别模块的文字信息以及语义识别模块的语义信息判断人机对话当前是否出现异常并进行异常的处理以及消息的回复，具体步骤包括：Step 4. The abnormality processing module judges whether the human-machine dialogue is abnormal at present according to the text information of the speech recognition module and the semantic information of the semantic recognition module, and performs abnormal processing and message reply. The specific steps include:

4.1异常检测单元根据语音识别模块的文字信息以及语义识别模块的语义信息判断当前的人机对话是否出现异常。不异常则由本地客户端进行处理，异常则由干预服务器接管人机对话。4.1 The abnormality detection unit judges whether the current man-machine dialogue is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module. If it is not abnormal, it will be handled by the local client, and if it is abnormal, the intervening server will take over the man-machine dialogue.

4.2数据库查询单元根据语义信息进行数据库的查询，得到推荐的干预消息，如果干预消息的推荐度较高，则直接利用该干预消息进行干预，如果推荐度较低，则请求干预师进行人工介入。4.2 The database query unit queries the database according to the semantic information, and obtains the recommended intervention message. If the recommendation degree of the intervention message is high, the intervention message is directly used for intervention, and if the recommendation degree is low, the interventionist is requested to perform manual intervention.

4.3在机器算法无法找到高推荐度的干预消息时，干预师介入进行干预消息的选择以及修改，随后将修改后的干预消息发送至客户端。4.3 When the machine algorithm cannot find a highly recommended intervention message, the interventionist will intervene to select and modify the intervention message, and then send the modified intervention message to the client.

在人机对话异常处理的过程中，用户的语音输入通过机器的语音识别以及语义解析后，会将语音的识别结果以及语义解析的结果以文本的形式传到干预师端，干预师接受到消息之后可以选择发送对话消息或者下发命令消息。对话消息以文本的形式传输到机器，随后通过语音合成系统(TTS)合成语音并播放给用户，命令消息则是直接通过机器执行命令。In the process of abnormal processing of human-machine dialogue, after the user's voice input is passed through the machine's voice recognition and semantic analysis, the results of speech recognition and semantic analysis will be sent to the interventionist in the form of text, and the interventionist will receive the message. You can then choose to send a conversation message or issue a command message. Conversational messages are transmitted to the machine in the form of text, and then speech is synthesized and played back to the user through a speech synthesis system (TTS), while command messages execute commands directly through the machine.

本实施例包括以下步骤，如图3和图4所示，即用户输入-->干预消息生成-->客户机推送干预消息三个步骤分别进行技术方案的介绍：This embodiment includes the following steps, as shown in FIG. 3 and FIG. 4 , that is, the three steps of user input --> intervention message generation --> client pushes the intervention message respectively introduce technical solutions:

1)用户输入1) User input

用户进行语音输入的过程中，利用的语音识别系统将用户的语音输入音频转换为文字，同时对该句文字进行语义分析(语义分析的结果包括用户当前的对话领域、用户请求服务的关键信息等)，最后将文字以及语义分析的结果以文本形式通过HTTP协议的POST方法传输到异常处理模块。In the process of the user's voice input, the used voice recognition system converts the user's voice input audio into text, and at the same time performs semantic analysis on the text (the results of the semantic analysis include the user's current dialogue field, the key information of the user's request for services, etc. ), and finally the text and semantic analysis results are transmitted to the exception handling module in the form of text through the POST method of the HTTP protocol.

2)干预消息生成2) Intervention message generation

异常处理模块在异常情况下，根据语音识别的文本信息和语义识别的语义槽查询数据库，得到备选的干预消息。如果干预消息的推荐度较高，则直接利用该干预消息进行干预，如果推荐度较低，则请求干预师进行人工介入。干预师在界面上可以看到由异常处理模块提供的辅助数据比如用户输入的识别结果和语义分析的结果等，结合这些信息干预师能够更准确快速地对候选干预消息进行筛选与修改。干预消息分为对话消息与命令消息，均以文本的形式采用统一的Websocket协议进行传输，其区别在与传输内容的不同以及机器的处理方式不同。The exception handling module queries the database according to the text information of the speech recognition and the semantic slot of the semantic recognition under abnormal conditions, and obtains an alternative intervention message. If the recommendation degree of the intervention message is high, the intervention message is directly used for intervention, and if the recommendation degree is low, the interventionist is requested to perform manual intervention. The interventionist can see the auxiliary data provided by the exception handling module on the interface, such as the recognition result input by the user and the result of semantic analysis, etc. The interventionist can screen and modify the candidate intervention messages more accurately and quickly by combining this information. Intervention messages are divided into dialogue messages and command messages, both of which are transmitted in the form of text using the unified Websocket protocol.

3)客户机推送干预消息3) The client pushes the intervention message

客户机收到干预消息后立刻返回干预师“消息已收到”的确认信息，并将干预消息缓存在消息队列中。客户机会监听当前的人机对话状态并在一定条件下尝试从消息队列中取出消息向用户推送，具体的推送时机包括有：1、干预消息到达时，2、TTS合成的语音消息播报完成时；需要满足的条件为1、消息队列不为空，2、客户机的音频播放器当前空闲。如果干预消息成功推送则返回干预师“干预消息已推送”的确认信息。After receiving the intervention message, the client immediately returns to the interventionist "message has been received" confirmation, and caches the intervention message in the message queue. The client will monitor the current man-machine dialogue state and try to push the message from the message queue to the user under certain conditions. The specific push timings include: 1. When the intervention message arrives, 2. When the TTS synthesized voice message broadcast is completed; The conditions that need to be met are 1. The message queue is not empty, and 2. The audio player of the client is currently idle. If the intervention message is successfully pushed, it will return the confirmation message of the interventionist that "the intervention message has been pushed".

例如：E.g:

1、用户A发出语音指令“我要去一个好玩的地方”。1. User A issues a voice command "I'm going to a fun place".

2、语音识别模块将语音输入转换为文字。2. The speech recognition module converts the speech input into text.

3、语义分析模块处理后得到用户意图为“导航”，导航的目标地的标签为“好玩”。3. After processing, the semantic analysis module obtains that the user's intention is "navigation", and the label of the navigation destination is "fun".

4、异常处理模块中的异常检测单元收到用户A的服务请求，包含完整的语音识别结果“我要去一个好玩的地方”，和语义分析的结果“导航”、"好玩"，同时检测到当前的对话状态出现异常。4. The anomaly detection unit in the anomaly processing module receives the service request from user A, including the complete speech recognition result "I'm going to a fun place", and the results of semantic analysis "navigation" and "fun". The current conversation state is abnormal.

5、异常处理模块中的数据库查询单元根据”导航“、”好玩“进行数据库查询，得到一些备选消息比如”请问您要去苏州的好玩小吃吗？“、”为您找个5个与好玩相关的地点“，这两条消息的推荐度都比较低，故请求干预师单元的人工介入。干预师利用异常处理模块得到的数据库查询结果以及语义分析结果和语音识别的文字结果进行干预消息的选择和修改，将干预消息改为”请问您想要怎样的娱乐方式？“，向用户发送该文本消息。5. The database query unit in the exception handling module performs database query according to "navigation" and "fun", and gets some alternative messages such as "do you want to go to Suzhou's fun snacks?", "to find 5 and fun for you. Relevant places", the recommendation degree of these two messages is relatively low, so the manual intervention of the interventionist unit is requested. The interventionist selects and modifies the intervention message by using the database query results obtained by the exception processing module, the results of semantic analysis and the text results of speech recognition, and changes the intervention message to "What kind of entertainment do you want?", and sends the message to the user. text message.

6、客户机收到干预消息后将其存入消息队列，向异常处理模块发送“消息已收到”的反馈，并尝试进行推送。6. After the client receives the intervention message, it stores it in the message queue, sends a "message received" feedback to the exception handling module, and tries to push it.

7、条件满足后进行干预消息的语音合成系统合成以及播报，用户听到音频“请问您想要怎样的娱乐方式”，客户机向异常处理模块发送“消息已推送”反馈。7. After the conditions are met, the speech synthesis system of the intervention message is synthesized and broadcasted. The user hears the audio "What kind of entertainment do you want?", and the client sends a "message has been pushed" feedback to the exception handling module.

8、客户进行进一步的语音输入“我要去唱歌”8. The customer makes further voice input "I'm going to sing"

9、ASR系统将语音输入转换为文字9. ASR system converts speech input into text

10、语义分析得到用户意图为“导航”，导航的目标为“KTV”10. Semantic analysis obtains that the user's intention is "navigation", and the goal of navigation is "KTV"

11、异常检测单元得到用户A的具体服务需求，包含完整的语音识别结果“我要去唱歌”，和语义分析的结果”导航“、”KTV“。11. The abnormality detection unit obtains the specific service requirements of user A, including the complete speech recognition result "I'm going to sing", and the results of semantic analysis "navigation" and "KTV".

12、数据库查询单元根据”导航“、”KTV“、以及用户的相关信息进行数据库的搜索，得到备选干预消息”为您推荐xxx请问是否前往？“，同时由于推荐度很高，故绕过干预师单元，直接向客户机发送文字消息”为您推荐xxx请问是否前往？“12. The database query unit searches the database according to "navigation", "KTV", and the relevant information of the user, and obtains the alternative intervention message "Recommend xxx for you, do you want to go?". At the same time, due to the high degree of recommendation, bypass Interventionist unit, send a text message directly to the client "I recommend xxx for you, do you want to go?"

13、用户确认前往13. The user confirms to go to

14、异常处理系统用户推送命令类型的干预消息，包含命令类型“导航”以及目的地的POI信息。14. The exception handling system pushes the intervention message of the command type by the user, including the command type "navigation" and the POI information of the destination.

15、客户机从消息队列中取出命令类型“导航”的消息以及相应的POI信息，进行导航操作，客户机向异常处理模块发送“消息已推送”反馈,交互结束。15. The client takes out the message of the command type "navigation" and the corresponding POI information from the message queue, and performs the navigation operation. The client sends the "message has been pushed" feedback to the exception handling module, and the interaction ends.

以上详细描述了本发明的较佳具体实施例。应当理解，本领域的普通技术无需创造性劳动就可以根据本发明的构思作出诸多修改和变化。因此，凡本技术领域中技术人员依本发明的构思在现有技术的基础上通过逻辑分析、推理或者有限的实验可以得到的技术方案，皆应在由权利要求书所确定的保护范围内。The preferred embodiments of the present invention have been described in detail above. It should be understood that many modifications and changes can be made according to the concept of the present invention by those skilled in the art without creative efforts. Therefore, all technical solutions that can be obtained by those skilled in the art through logical analysis, reasoning or limited experiments on the basis of the prior art according to the concept of the present invention shall fall within the protection scope determined by the claims.

Claims

1. A man-machine mixed interaction system based on audio is characterized by comprising a voice recognition module, a voice synthesis module, a semantic recognition module and an exception handling module, wherein the voice recognition module is connected with the semantic recognition module and transmits text information corresponding to voice, the exception handling module is connected with the voice recognition module and the semantic recognition module, the voice recognition module is configured to transmit the text information to the exception handling module, and the semantic recognition module is configured to transmit a semantic parsing result to the exception handling module; the exception handling module is configured to connect with the speech synthesis module and transmit intervention information; the voice synthesis module is configured to convert the intervention information transmitted by the exception handling module into voice, send and play the voice to a user, and wait for the user to further feed back;

the abnormality processing module comprises an abnormality detection unit, a database query unit and an intervener unit, wherein the abnormality detection unit is configured to receive the outputs of the voice recognition module and the semantic recognition module and decide whether to take intervention measures, the database query unit is configured to receive an intervention signal of the abnormality detection unit, receive semantic information of the semantic recognition module, query and output the intervention information with high recommendation degree to the voice synthesis module; the intervener unit is configured to perform necessary preference and modification on the intervention information with low recommendation degree output by the database query unit by using an intervener, and then transmit the intervention information to the voice synthesis module to obtain a reply message to be further fed back by the user.

2. The audio-based human-computer hybrid interaction system according to claim 1, wherein the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model, and a decoder, wherein the signal processing and feature extraction unit is configured to be connected to the acoustic model and to transmit acoustic feature information, and the decoder is configured to be connected to the acoustic model and the language model and to output a recognition result.

3. The audio-based human-computer hybrid interactive system according to claim 1, wherein the speech synthesis module comprises a text analysis unit, a prosody control unit and a synthesized speech unit, wherein the text analysis unit is configured to receive and process text information, transmit the processing result to the prosody control unit and the synthesized speech unit, the prosody control unit is configured to be connected to the synthesized speech unit and transmit pitch, duration, intensity, pause and intonation information, and the synthesized speech unit is configured to receive the analysis result of the text analysis unit and the control parameters of the prosody control unit to synthesize the output speech.

4. The audio-based human-computer hybrid interaction system according to claim 1, wherein the semantic recognition module comprises a domain labeling unit, an intention judging unit, and an information extracting unit, wherein the domain labeling unit is configured to be connected with the intention judging unit and transmit domain information, the intention judging unit is configured to be connected with the information extracting unit and transmit user intention information, and the information extracting unit outputs a result of semantic analysis.

5. A man-machine mixed interaction method based on audio is characterized by comprising the following steps:

step 1, providing a voice recognition module, a voice synthesis module, a semantic recognition module and an exception handling module;

step 2, the voice recognition module converts voice information into character information and outputs the character information to the semantic recognition module;

step 3, the semantic recognition module extracts the user purpose and corresponding key information from the character information;

step 4, the exception handling module judges whether the man-machine conversation is abnormal at present according to the character information of the voice recognition module and the semantic information of the semantic recognition module and replies to exception handling information;

wherein, in the step 4, the method specifically comprises the following steps:

step 4.1, the abnormality detection unit judges whether the current man-machine conversation is abnormal according to the character information of the voice recognition module and the semantic information of the semantic recognition module, and if the current man-machine conversation is abnormal, the interventionalist unit takes over the man-machine conversation;

step 4.2, the database query unit queries the database according to the semantic information to obtain intervention information with recommendation degree, if the recommendation degree of the intervention information is higher, intervention is directly performed by using the intervention information, the intervention information is sent to a client, and the step 2 is entered to wait for further feedback of the user; if the recommendation degree is low, requesting an interventionalist to perform manual intervention;

and 4.3, when the intervention information with high recommendation degree cannot be found by the machine algorithm, an intervention teacher intervenes to select and modify the intervention information, then the modified intervention information is sent to a client, and the step 2 is entered to wait for further feedback of the user.

6. The audio-based human-computer hybrid interaction method according to claim 5, wherein in the step 2, the method specifically comprises the following steps:

2.1, extracting features from the input audio stream for processing by an acoustic model, and simultaneously reducing the influence of environmental noise, channels and speaker factors on the features;

and 2.2, searching a word string capable of outputting the audio stream with the maximum probability as a voice recognition result by the decoder according to the acoustic model, the linguistic model and the dictionary.

7. The audio-based human-computer hybrid interaction method according to claim 5, wherein in step 3, the method specifically comprises the following steps:

step 3.1, marking the field to which the current conversation belongs by using the symbolic key words in the character information;

step 3.2, judging the user intention based on rules in the field;

and 3.3, extracting specific key information according to the field and the user intention and by combining rules.

8. An audio-based human-computer hybrid interaction method according to claim 5 or 7, wherein the key information includes a dialogue domain, dialogue keywords including content keywords and emotion keywords.