CN106531158A

CN106531158A - Method and device for recognizing answer voice

Info

Publication number: CN106531158A
Application number: CN201611081923.XA
Authority: CN
Inventors: 谢湘; 唐刚
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2017-03-22

Abstract

The invention relates to the field of computer paralinguistic information, in particular to a method and device for recognizing a response voice, which is used to solve the problem that the current response voice recognition method is not accurate enough in recognizing the response voice. The embodiment of the present invention obtains the response voice to be recognized; the response mode corresponding to the response voice to be recognized is determined using the response mode recognition model; if the response mode is a formal response mode, the response voice to be recognized is input into the first voice recognition system; if the response mode is In the informal response mode, the response voice to be recognized is input into the second voice recognition system. When the embodiment of the present invention recognizes the response voice, it first recognizes the response voice as a formal response mode or an informal response mode, and inputs different voice recognition systems for the formal response mode and the informal response mode for recognition, thereby improving the overall voice recognition performance .

Description

Recognition method and device for voice response

技术领域technical field

本发明涉及计算机副语言领域，特别涉及一种应答语音的识别方法及装置。The invention relates to the field of computer paralanguage, in particular to a method and device for recognizing a response voice.

背景技术Background technique

近年来，计算机副语言学成为语音语言处理领域的研究热点，语音识别技术的发展对推动智能化、人性化的新型人机交互技术的发展和应用具有重要的作用。In recent years, computer paralinguistics has become a research hotspot in the field of speech and language processing. The development of speech recognition technology plays an important role in promoting the development and application of intelligent and humanized new human-computer interaction technology.

语音识别就是利用计算机自动将语音转化成文本的技术，语音一直是人类生活中交互的重要媒介，因此让机器实现对语音的识别是至关重要的一步。目前在很多场合会使用语音记录仪记录语音，并且需要对语音记录仪中记录的语音进行分析。例如，在飞行场景中，使用舱音记录仪记录飞机上的语音，在飞行结束后通过识别舱音记录仪中的语音对飞行质量进行评价。目前，在对语音记录仪中记录的语音信息进行识别时，采用的是机器自动识别的方法，具体的，利用端点识别技术将语音记录仪中记录的语音划分为一句句待识别应答语音，并将待识别应答语音输入到语音识别系统中，通过该语音识别系统识别待识别应答语音。由于待识别应答语音根据不同的说话对象以及环境分为正式应答语音和非正式应答语音，正式应答语音和非正式应答语音对应的语音环境不同，并且说话人语气、语调均不相同；而现有技术直接将获取到的应答语音输入语音识别系统进行识别的方法往往不能准确的识别应答语音。Speech recognition is a technology that uses computers to automatically convert speech into text. Speech has always been an important medium for interaction in human life, so it is a crucial step for machines to recognize speech. Currently, voice recorders are used to record voices in many occasions, and the voices recorded in the voice recorders need to be analyzed. For example, in a flight scene, the cabin voice recorder is used to record the voice on the plane, and the flight quality is evaluated by recognizing the voice in the cabin voice recorder after the flight. At present, when the voice information recorded in the voice recorder is recognized, the method of automatic machine recognition is adopted. Specifically, the voice recorded in the voice recorder is divided into a sentence to be recognized and answered by using the endpoint recognition technology. The speech to be recognized is input into the speech recognition system, and the speech to be recognized is recognized by the speech recognition system. Since the speech to be recognized is divided into formal speech and informal speech according to different speaking objects and environments, the speech environments corresponding to formal speech and informal speech are different, and the tone and intonation of the speaker are not the same; while the existing The method of directly inputting the obtained answering voice into the voice recognition system for recognition often cannot accurately recognize the answering voice.

综上所述，目前的应答语音识别方法在识别应答语音时不够准确。To sum up, the current response speech recognition methods are not accurate enough in recognizing the response speech.

发明内容Contents of the invention

本发明提供一种应答语音的识别方法及装置，用以解决目前的应答语音识别方法在识别应答语音时不够准确的问题。The invention provides a method and device for recognizing a response voice, which is used to solve the problem that the current response voice recognition method is not accurate enough in recognizing the response voice.

基于上述问题，本发明实施例提供一种应答语音的识别方法，包括：Based on the above problems, an embodiment of the present invention provides a method for recognizing a response voice, including:

获取待识别应答语音；Obtain the response voice to be recognized;

使用应答方式识别模型确定所述待识别应答语音对应的应答方式；其中，所述应答方式识别模型为有监督的机器学习模型；Using the response mode recognition model to determine the response mode corresponding to the response voice to be recognized; wherein, the response mode recognition model is a supervised machine learning model;

若所述应答方式为正式应答方式，则将所述待识别应答语音输入第一语音识别系统，以使所述第一语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；If the response mode is a formal response mode, input the response voice to be recognized into the first voice recognition system, so that the first voice recognition system recognizes the response voice to be recognized and output the response voice to be recognized Corresponding text information;

若所述应答方式为非正式应答方式，则将所述待识别应答语音输入第二语音识别系统，以使所述第二语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；If the response mode is an informal response mode, input the response voice to be recognized into the second voice recognition system, so that the second voice recognition system can recognize the response voice to be recognized and output the response to be recognized Text information corresponding to voice;

其中，所述第一语音识别系统和所述第二语音识别系统配置有不同的参数。Wherein, the first speech recognition system and the second speech recognition system are configured with different parameters.

由于本发明实施例在识别应答语音时，获取待识别的应答语音后，使用应答方式识别模型确定待识别应答语音对应的应答方式，针对正式应答方式和非正式应答方式输入不同的语音识别系统进行识别。由于第一语音识别系统用于识别正式应答语音，第二语音识别系统用于识别非正式应答语音，并且第一语音识别系统和第二语音识别系统配置有不同的参数，针对不同的应答方式使用不同的语音识别系统进行识别，从而使得对待识别应答语音的识别更加准确。Since the embodiment of the present invention recognizes the response voice, after obtaining the response voice to be recognized, the response mode recognition model is used to determine the response mode corresponding to the response voice to be recognized, and input different voice recognition systems for the formal response mode and the informal response mode. identify. Since the first voice recognition system is used to recognize formal response voices, the second voice recognition system is used to recognize informal response voices, and the first voice recognition system and the second voice recognition system are configured with different parameters for different response methods Different voice recognition systems perform recognition, so that the recognition of the response voice to be recognized is more accurate.

可选的，所述使用应答方式识别模型确定所述待识别应答语音对应的应答方式，具体包括：Optionally, the use of the response mode recognition model to determine the response mode corresponding to the to-be-recognized response voice specifically includes:

将从所述待识别应答语音提取出的语音特征输入所述应答方式识别模型；Inputting the voice features extracted from the response voice to be recognized into the response mode recognition model;

获取所述应答方式识别模型输出的所述待识别应答语音对应的应答方式。Acquiring a response mode corresponding to the to-be-recognized response voice output by the response mode recognition model.

由于本发明实施例将待识别的应答语音进行特征提取后，将提取到的语音特征输入应答方式识别模型，通过应答方式识别模型确定待识别应答语音对应的应答方式。After the embodiment of the present invention extracts the features of the response voice to be recognized, the extracted voice features are input into the response mode recognition model, and the response mode corresponding to the response voice to be recognized is determined through the response mode recognition model.

可选的，所述语音特征包括帧级特征、片级特征和段级特征；Optionally, the speech features include frame-level features, slice-level features and segment-level features;

根据下列方式从应答语音提取出语音特征：Speech features are extracted from the answering speech in the following way:

使用特征提取工具，根据预设的帧长和帧移，提取所述待识别应答语音的帧级特征；Using a feature extraction tool to extract the frame-level features of the response speech to be recognized according to the preset frame length and frame shift;

将所述帧级特征做平滑滤波处理，并对平滑处理后的帧级特征做差分运算，确定所述待识别应答语音的片级特征；Perform smoothing and filtering processing on the frame-level features, and perform a differential operation on the smoothed frame-level features to determine the slice-level features of the speech to be recognized;

根据预设的统计参数，对所述片级特征进行分析处理，确定所述待识别应答语音的段级特征。According to preset statistical parameters, the slice-level features are analyzed and processed to determine the segment-level features of the response speech to be recognized.

由于本发明实施例从待识别应答语音中提取出帧级、片级、段级语音特征，从而保证应答方式识别模型准确的识别该待识别应答语音对应的应答方式。Since the embodiment of the present invention extracts frame-level, slice-level, and segment-level speech features from the response speech to be recognized, it is ensured that the response method recognition model accurately recognizes the response method corresponding to the response speech to be recognized.

可选的，根据下列方式获得所述应答方式识别模型：Optionally, the response mode identification model is obtained according to the following manner:

确定包含多个应答语音的训练集，以及包含多个应答语音的测试集；其中，所述训练集中的应答语音与所述测试集中的应答语音不同；Determine a training set comprising a plurality of response voices, and a test set comprising a plurality of response voices; wherein, the response voices in the training set are different from the response voices in the test set;

针对所述训练集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练前的应答方式识别模型中进行训练；For any answering voice in the training set, the voice features extracted from the answering voice are input into the response mode recognition model before training for training;

针对所述测试集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练后的应答方式识别模型中，并获取所述应答方式识别模型输出的所述应答语音对应的应答方式；For any response voice in the test set, input the voice features extracted from the response voice into the trained response mode recognition model, and obtain the response corresponding to the response voice output by the response mode recognition model Way;

根据训练后的应答方式识别模型输出的所述测试集中每一个应答语音对应的应答方式，确定所述训练后的应答方式识别模型的正确识别率，若所述正确识别率大于设定阈值，确定所述训练后的应答方式识别模型训练完成，保存所述训练后的应答方式识别模型。According to the response mode corresponding to each response voice in the test set output by the trained response mode recognition model, determine the correct recognition rate of the trained response mode recognition model, if the correct recognition rate is greater than the set threshold, determine The training of the trained response mode recognition model is completed, and the trained response mode recognition model is saved.

由于本发明实施例使用训练集中的多个应答语音对应答方式识别模型进行训练，在训练后使用测试集中的应答语音判断训练后的应答方式识别模型是否满足要求，在应答方式识别模型识别该测试集中的应答语音的正确识别率大于设定阈值时，确定该应答方式识别模型训练完成，保存该训练后的应答方式识别模型；若正确识别率小于设定阈值，则使用训练集中的应答语音再次进行训练，直到应答方式识别模型的正确识别率大于设定阈值，从而保证获得的应答方式识别模型更加准确的识别待识别应答语音对应的应答方式。Because the embodiment of the present invention uses a plurality of response voices in the training set to train the response mode recognition model, use the response voices in the test set to judge whether the trained response mode recognition model meets the requirements after training, and identify the test mode in the response mode recognition model. When the correct recognition rate of the concentrated response speech is greater than the set threshold, it is determined that the response recognition model training is completed, and the trained response recognition model is saved; if the correct recognition rate is less than the preset threshold, then the response speech in the training set is used again The training is performed until the correct recognition rate of the response mode recognition model is greater than the set threshold, so as to ensure that the obtained response mode recognition model can more accurately identify the response mode corresponding to the response speech to be recognized.

可选的，所述应答方式识别模型为支持向量机SVM模型。Optionally, the response mode identification model is a support vector machine (SVM) model.

另一方面，本发明实施例还提供一种应答语音的识别装置，包括：On the other hand, the embodiment of the present invention also provides a recognition device for answering voice, including:

获取模块，用于获取待识别应答语音；An acquisition module, configured to acquire the response voice to be recognized;

识别模块，用于使用应答方式识别模型确定所述待识别应答语音对应的应答方式；其中，所述应答方式识别模型为有监督的机器学习模型；The identification module is used to determine the response mode corresponding to the response voice to be recognized by using the response mode recognition model; wherein, the response mode recognition model is a supervised machine learning model;

判断模块，用于若所述应答方式为正式应答方式，则将所述待识别应答语音输入第一语音识别系统，以使所述第一语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；若所述应答方式为非正式应答方式，则将所述待识别应答语音输入第二语音识别系统，以使所述第二语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；其中，所述第一语音识别系统和所述第二语音识别系统配置有不同的参数。A judging module, configured to input the response voice to be recognized into the first voice recognition system if the response mode is a formal response mode, so that the first voice recognition system can recognize the response voice to be recognized and output the response voice to be recognized. The text information corresponding to the response voice to be recognized; if the response mode is an informal response mode, then input the response voice to be recognized into the second voice recognition system, so that the second voice recognition system recognizes the response voice to be recognized Response voice, and output text information corresponding to the response voice to be recognized; wherein, the first voice recognition system and the second voice recognition system are configured with different parameters.

可选的，所述识别模块，具体用于：Optionally, the identification module is specifically used for:

将从所述待识别应答语音提取出的语音特征输入所述应答方式识别模型；获取所述应答方式识别模型输出的所述待识别应答语音对应的应答方式。Inputting the speech features extracted from the response speech to be recognized into the response mode recognition model; obtaining the response mode corresponding to the response speech to be recognized outputted by the response mode recognition model.

所述识别模块，具体用于：The identification module is specifically used for:

使用特征提取工具，根据预设的帧长和帧移，提取所述待识别应答语音的帧级特征；将所述帧级特征做平滑滤波处理，并对平滑处理后的帧级特征做差分运算，确定所述待识别应答语音的片级特征；根据预设的统计参数，对所述片级特征进行分析处理，确定所述待识别应答语音的段级特征。Use a feature extraction tool to extract the frame-level features of the response speech to be recognized according to the preset frame length and frame shift; perform smoothing and filtering on the frame-level features, and perform a differential operation on the smoothed frame-level features Determining slice-level features of the speech to be recognized; analyzing and processing the slice-level features according to preset statistical parameters to determine segment-level features of the speech to be recognized.

可选的，所述获取模块，还用于：Optionally, the acquisition module is also used for:

根据下列方式获得所述应答方式识别模型：Obtain the response mode recognition model according to the following manner:

确定包含多个应答语音的训练集，以及包含多个应答语音的测试集；其中，所述训练集中的应答语音与所述测试集中的应答语音不同；针对所述训练集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练前的应答方式识别模型中进行训练；针对所述测试集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练后的应答方式识别模型中，并获取所述应答方式识别模型输出的所述应答语音对应的应答方式；根据训练后的应答方式识别模型输出的所述测试集中每一个应答语音对应的应答方式，确定所述训练后的应答方式识别模型的识别正确率，若所述识别正确率大于设定阈值，确定所述训练后的应答方式识别模型训练完成，保存所述训练后的应答方式识别模型。Determine the training set that includes a plurality of answering voices, and the test set that includes a plurality of answering voices; wherein, the answering voices in the training set are different from the answering voices in the test set; for any one answering voice in the training set, the The voice feature extracted from the answering voice is input into the response mode recognition model before training for training; for any answering voice in the test set, the voice feature extracted from the answering voice is input into the post-training In the response mode recognition model, and obtain the response mode corresponding to the said response voice output by the response mode recognition model; according to the response mode corresponding to each response voice in the test set output by the response mode recognition model after training, determine If the recognition accuracy rate of the trained response mode recognition model is greater than a set threshold, it is determined that the training of the trained response mode recognition model is completed, and the trained response mode recognition model is saved.

附图说明Description of drawings

为了更清楚地说明本发明实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简要介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域的普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention. For Those of ordinary skill in the art can also obtain other drawings based on these drawings without any creative effort.

图1为本发明实施例应答语音的识别方法的流程图；Fig. 1 is the flow chart of the recognition method of response speech of the embodiment of the present invention;

图2为本发明实施例提取语音特征的流程图；Fig. 2 is the flow chart of the embodiment of the present invention extracting speech feature;

图3为本发明实施例获取应答方式识别模型的方法流程图；FIG. 3 is a flow chart of a method for obtaining a response mode identification model according to an embodiment of the present invention;

图4为本发明实施例获取应答方式识别模型的方法的整体流程图；FIG. 4 is an overall flowchart of a method for obtaining a response mode recognition model according to an embodiment of the present invention;

图5A为本发明实施例SVM核函数对应的识别结果准确率示意图；Fig. 5A is a schematic diagram of the accuracy rate of the recognition result corresponding to the SVM kernel function according to the embodiment of the present invention;

图5B为本发明实施例SVM核函数性能比较图；Fig. 5B is a performance comparison diagram of the SVM kernel function of the embodiment of the present invention;

图6为本发明实施例应答语音的识别装置的结构示意图。FIG. 6 is a schematic structural diagram of a recognition device for answering speech according to an embodiment of the present invention.

具体实施方式detailed description

本发明实施例获取待识别应答语音；使用应答方式识别模型确定所述待识别应答语音对应的应答方式；其中，所述应答方式识别模型为有监督的机器学习模型；若所述应答方式为正式应答方式，则将所述待识别应答语音输入第一语音识别系统，以使所述第一语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；若所述应答方式为非正式应答方式，则将所述待识别应答语音输入第二语音识别系统，以使所述第二语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；其中，所述第一语音识别系统和所述第二语音识别系统配置有不同的参数。The embodiment of the present invention acquires the response voice to be recognized; uses the response mode recognition model to determine the response mode corresponding to the response voice to be recognized; wherein, the response mode recognition model is a supervised machine learning model; if the response mode is formal response mode, then input the response voice to be recognized into the first voice recognition system, so that the first voice recognition system can recognize the response voice to be recognized, and output the text information corresponding to the response voice to be recognized; if the If the response mode is an informal response mode, the response voice to be recognized is input into the second voice recognition system, so that the second voice recognition system recognizes the response voice to be recognized, and outputs the corresponding voice of the response voice to be recognized. text information; wherein, the first speech recognition system and the second speech recognition system are configured with different parameters.

由于本发明实施例在识别应答语音时，获取待识别的应答语音后，使用应答方式识别模型确定待识别应答语音对应的应答方式，针对正式应答方式和非正式应答方式输入不同的语音识别系统进行识别。由于第一语音识别系统用于识别正式应答语音，第二语音识别系统用于识别非正式应答语音，并且第一语音识别系统和第二语音识别系统配置有不同的参数，本发明实施例首先识别应答语音为正式应答方式或非正式应答方式，针对不同的应答方式使用不同的语音识别系统进行识别，从而提升整体的语音识别性能，对待识别应答语音的识别更加准确。Since the embodiment of the present invention recognizes the response voice, after obtaining the response voice to be recognized, the response mode recognition model is used to determine the response mode corresponding to the response voice to be recognized, and input different voice recognition systems for the formal response mode and the informal response mode. identify. Since the first speech recognition system is used to recognize formal response speech, the second speech recognition system is used to recognize informal reply speech, and the first speech recognition system and the second speech recognition system are configured with different parameters, the embodiment of the present invention first recognizes The answering voice is a formal answering method or an informal answering method, and different speech recognition systems are used to recognize different answering methods, so as to improve the overall speech recognition performance and make the recognition of the answering voice to be recognized more accurate.

需要说明的是，本发明实施例的识别应答语音的应答方式的方法，不仅可以用于提升语音识别系统的效果，还可以应用于其它的高层系统，比如说话人识别系统，异常音监测系统等。It should be noted that the method for identifying the response mode of the response voice in the embodiment of the present invention can not only be used to improve the effect of the voice recognition system, but also can be applied to other high-level systems, such as speaker recognition systems, abnormal sound monitoring systems, etc. .

为了使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本发明作进一步地详细描述，显然，所描述的实施例仅仅是本发明一部份实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本发明保护的范围。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, rather than all embodiments . Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

如图1所示，本发明实施例应答语音的识别方法包括：As shown in Figure 1, the recognition method of answering voice in the embodiment of the present invention comprises:

步骤101、获取待识别应答语音；Step 101, obtaining the response voice to be recognized;

步骤102、使用应答方式识别模型确定所述待识别应答语音对应的应答方式；其中，所述应答方式识别模型为有监督的机器学习模型；Step 102, using the response mode recognition model to determine the response mode corresponding to the response speech to be recognized; wherein, the response mode recognition model is a supervised machine learning model;

步骤103、若所述应答方式为正式应答方式，则将所述待识别应答语音输入第一语音识别系统，以使所述第一语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；若所述应答方式为非正式应答方式，则将所述待识别应答语音输入第二语音识别系统，以使所述第二语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；其中，所述第一语音识别系统和所述第二语音识别系统配置有不同的参数。Step 103: If the answering method is a formal answering method, input the speech to be recognized into the first speech recognition system, so that the first speech recognition system can recognize the speech to be recognized and output the speech to be recognized. Recognize the text information corresponding to the response voice; if the response mode is an informal response mode, input the response voice to be recognized into the second voice recognition system, so that the second voice recognition system can recognize the response voice to be recognized , and output text information corresponding to the response voice to be recognized; wherein, the first voice recognition system and the second voice recognition system are configured with different parameters.

本发明实施例待识别应答语音对应的应答方式包括正式应答方式和非正式应答方式；The response modes corresponding to the to-be-recognized response voice in the embodiment of the present invention include formal response modes and informal response modes;

本发明实施例可以应用于飞行场景中，对飞行场景中的应答语音的应答方式进行识别，识别飞行中的应答语音的应答方式为正式应答方式或非正式应答方式。其中，正式应答方式的识别语音为驾驶员与地面控制中心之间的指示性对话；例如，驾驶员向地面控制中心发出请示，地面控制中心针对驾驶员的请示进行应答，驾驶员向地面控制中心回复确认等。The embodiment of the present invention can be applied in a flight scene to identify the response mode of the response voice in the flight scene, and identify whether the response mode of the response voice in flight is a formal response mode or an informal response mode. Among them, the recognized voice of the formal answering method is an indicative dialogue between the driver and the ground control center; Reply confirmation etc.

非正式应答方式的识别语音为正副驾驶员之间的对话，或驾驶员与地面塔台之间的对话；例如，正副驾驶员之间聊天的语音，正副驾驶员之间关于飞行过程中的指导性语音，驾驶员向地面塔台中心汇报飞机状态等。The recognized voice of the informal response mode is a dialogue between the pilot and the co-pilot, or a dialogue between the pilot and the ground tower; for example, the voice of the chat between the pilot and the co-pilot, the guidance between the pilot and the co-pilot during the flight Voice, the pilot reports the status of the aircraft to the ground tower center.

需要说明的是，本发明实施例并不限于飞行场景中，在任意语言场境中均可利用本发明实施例的应答方式识别方法，并且，在不同的语言场景中，对正式应答方式和非正式应答方式的定义也不尽相同。例如，A、B为足球比赛解说员，在确定A和B之间的对话信息的应答方式时，将A和B之间关于该场足球比赛的对话定义为正式应答方式的对话，将A和B之间与该场足球比赛无关的对话定义为非正式应答方式的对话。It should be noted that the embodiment of the present invention is not limited to the flight scene, and the response method recognition method of the embodiment of the present invention can be used in any language scene, and, in different language scenes, the formal response method and non- Definitions of formal responses also vary. For example, A and B are football game narrators. When determining the response mode of the dialogue information between A and B, the dialogue about the football match between A and B is defined as the dialogue of the formal response mode, and A and B The dialogue between B that has nothing to do with the football game is defined as the dialogue in the form of informal response.

本发明实施例在使用应答方式识别模型确定所述待识别应答语音对应的应答方式时，具体采用下面方法：In the embodiment of the present invention, when using the response mode recognition model to determine the response mode corresponding to the response voice to be recognized, the following methods are specifically adopted:

可选的，将从所述待识别应答语音提取出的语音特征输入所述应答方式识别模型；获取所述应答方式识别模型输出的所述待识别应答语音对应的应答方式。Optionally, input the speech features extracted from the response speech to be recognized into the response mode recognition model; and acquire the response mode corresponding to the response speech to be recognized output by the response mode recognition model.

其中，本发明实施例的应答方式识别模型为有监督的机器学习模型，具体的，本发明实施例的应答方式识别模型为SVM(支持向量机)模型。Wherein, the response mode recognition model in the embodiment of the present invention is a supervised machine learning model, specifically, the response mode recognition model in the embodiment of the present invention is an SVM (Support Vector Machine) model.

本发明实施例在获取到待识别应答语音后，使用特征提取工具，提取所述待识别应答语音中的语音特征。In the embodiment of the present invention, after the speech to be recognized is acquired, a feature extraction tool is used to extract speech features in the speech to be recognized.

实施中，本发明实施例在提取待识别应答语音中的语音特征时，采用分层提取的方式提取待识别应答语音中的语音特征。During implementation, the embodiment of the present invention extracts the speech features in the response speech to be recognized in a layered extraction manner when extracting the speech features in the response speech to be recognized.

本发明实施例的语音特征包括帧级(frame level)特征、片级特征(segmentlevel)和段级(part level)特征。The speech features in the embodiment of the present invention include frame level features, segment level features and segment level features.

具体的，本发明实施例使用openSMILE特征提取工具，对待识别应答语音进行分层提取，提取出待识别应答语音中的语音特征。Specifically, the embodiment of the present invention uses the openSMILE feature extraction tool to perform hierarchical extraction of the response speech to be recognized, and extract the speech features in the response speech to be recognized.

可选的，使用特征提取工具，根据预设的帧长和帧移，提取所述待识别应答语音的帧级特征；将所述帧级特征做平滑滤波处理，并对平滑处理后的帧级特征做差分运算，确定所述待识别应答语音的片级特征；根据预设的统计参数，对所述片级特征进行分析处理，确定所述待识别应答语音的段级特征。。Optionally, use a feature extraction tool to extract the frame-level features of the response speech to be recognized according to the preset frame length and frame shift; perform smoothing filtering on the frame-level features, and smooth the frame-level features after smoothing Perform differential operation on the features to determine slice-level features of the speech to be recognized; analyze and process the slice-level features according to preset statistical parameters to determine segment-level features of the speech to be recognized. .

下面详细介绍本发明实施例从待识别应答语音中提取语音特征的方法。The method for extracting speech features from the response speech to be recognized will be introduced in detail below in the embodiment of the present invention.

第一步，提取待识别应答语音中的帧级特征。The first step is to extract the frame-level features in the speech to be recognized.

其中，帧级特征为待识别应答语音中的第一层语音特征。Among them, the frame-level features are the first-level speech features in the response speech to be recognized.

实施中，使用openSMILE特征提取工具，帧长20ms，帧移10ms，共包含16维特征，具体的帧级特征参数如表1所示，具体包括：In the implementation, using the openSMILE feature extraction tool, the frame length is 20ms, and the frame shift is 10ms, which contains a total of 16-dimensional features. The specific frame-level feature parameters are shown in Table 1, including:

RMSenergy(Root Mean Square energy，能量均方根)、mfcc(Mel-FrequencyCepstral Coefficient，梅尔频率倒谱系数)1-12维、zcr(zero-crossing rate，过零率)、Voice_prob(浊音占比)、F0(根据倒谱计算出的基频)。RMSenergy (Root Mean Square energy, energy root mean square), mfcc (Mel-FrequencyCepstral Coefficient, Mel frequency cepstral coefficient) 1-12 dimensions, zcr (zero-crossing rate, zero-crossing rate), Voice_prob (voiced sound ratio) , F0 (fundamental frequency calculated according to the cepstrum).

表1Table 1

帧级特征的英文简写English abbreviation for frame-level features 帧级特征的中文解释Chinese interpretation of frame-level features RMSenergyRM Energy 能量均方根energy root mean square mfcc(1-12)mfcc(1-12) 梅尔频率倒谱系数1-12维Mel frequency cepstral coefficient 1-12 dimensions zcrzcr 过零率(帧级)Zero-crossing rate (frame level) Voice_probVoice_prob 通过自相关计算浊音占比Calculation of voiced speech proportion by autocorrelation F0F0 根据倒谱计算出的基频Fundamental frequency calculated from the cepstrum

第二步，提取待识别应答语音中的片级特征。The second step is to extract slice-level features in the speech to be recognized.

其中，片级特征为待识别应答语音中的第二层语音特征。Among them, the slice-level feature is the second-level speech feature in the response speech to be recognized.

具体的，将所述帧级特征做平滑滤波处理，并对平滑处理后的帧级特征做差分运算，确定所述所述待识别应答语音中的片级特征。Specifically, smoothing and filtering is performed on the frame-level features, and a difference operation is performed on the smoothed frame-level features to determine slice-level features in the response speech to be recognized.

实施中，对第一步中得到的帧序列进行窗口长度为3帧的平滑滤波sma(smoothedby a moving average filter)；During implementation, the frame sequence obtained in the first step is carried out to a smoothing filter sma (smoothed by a moving average filter) with a window length of 3 frames;

在对帧序列进行平滑滤波后，对平滑后的特征做一阶差分de(deltacoefficient)。After smoothing and filtering the frame sequence, the first-order difference de(deltacoefficient) is made on the smoothed features.

其中，待具体的片级特征分析函数如表2所示，具体包括：Among them, the slice-level feature analysis functions to be specified are shown in Table 2, specifically including:

sma(平滑滤波)和de(一阶差分)。sma (smooth filtering) and de (first difference).

表2Table 2

片级特征分析函数的英文简写English abbreviation for chip-level feature analysis function 片级特征分析函数的中文解释Chinese explanation of chip-level feature analysis function smasma 平滑滤波smoothing filter dede 一阶差分first difference

在经过第一步和第二步之后，共得到16*2＝32维语音特征。After the first step and the second step, a total of 16*2=32 dimensional speech features are obtained.

第三步，提取待识别应答语音中的段级特征。The third step is to extract the segment-level features in the speech to be recognized.

其中，段级特征为待识别应答语音中的第三层语音特征。Among them, the segment-level features are the third-level speech features in the response speech to be recognized.

具体的，根据预设的统计参数，对所述片级特征进行分析处理，确定所述待识别应答语音中的段级特征。Specifically, the slice-level features are analyzed and processed according to preset statistical parameters to determine segment-level features in the response speech to be recognized.

实施中，对第二步输出的特征做统计分析，主要包括12个统计参数，根据12个统计参数对第二步输出的特征片级特征进行分析处理，得到待识别应答语音中的段级特征。During implementation, statistical analysis is performed on the features output in the second step, which mainly includes 12 statistical parameters. According to the 12 statistical parameters, the feature slice-level features output in the second step are analyzed and processed to obtain the segment-level features in the speech to be recognized. .

具体的预设的12个统计参数如表3所示，包括：The specific preset 12 statistical parameters are shown in Table 3, including:

max(maximum，包络取最大值)、min(minute，包络取最小值)、range(包络变化范围)、maxpos(maximum position，最大值位置)、minpos(minute position，包络最小值绝对位置)、amean(Arithmetic mean，包络算数均值)、linregc1(包络的线性近似斜率)、linregc2(包络的线性近似偏移)、linregerrQ(包络的线性预测值与实际值的均方根)、stddev(标准差)、skewness(三阶偏斜度)、kurtosis(四阶峭度)。max(maximum, the maximum value of the envelope), min(minute, the minimum value of the envelope), range (the range of the envelope change), maxpos(maximum position, the maximum position), minpos(minute position, the absolute minimum value of the envelope Position), amean (Arithmetic mean, envelope arithmetic mean), linregc1 (linear approximation slope of the envelope), linregc2 (linear approximation offset of the envelope), linregerrQ (the root mean square of the linear prediction value and the actual value of the envelope ), stddev (standard deviation), skewness (third-order skewness), kurtosis (fourth-order kurtosis).

表3table 3

段级特征统计参数的英文简写English abbreviation for segment-level feature statistics parameters 段级特征统计参数的中文解释Chinese interpretation of segment-level characteristic statistical parameters maxmax 包络取最大值Envelope takes the maximum value minmin 包络取最小值Envelope takes minimum value rangerange 包络变化范围Envelope variation range maxposmax pos 最大值位置Maximum position minposminpos 包络最小值绝对位置Envelope Minimum Absolute Position ameanamean 包络算数均值Envelope Arithmetic Mean linregc1linregc1 包络的线性近似斜率linear approximation slope of the envelope linregc2linregc2 包络的线性近似偏移Linear approximation offset of the envelope linregerrQlinregerrQ 包络的线性预测值与实际值的均方根Envelope's linear predicted value and the root mean square of the actual value stddevstddev 标准差standard deviation skewnessskewness 三阶偏斜度third order skewness kurtosiskurtosis 四阶峭度fourth order kurtosis

如图2所示，本发明实施例在第三步中提取待识别应答语音中的段级特征时，是针对第二步中得到的片级特征进行统计分析，并且包括预设的12个统计参数，则经过第三步段级特征提取后，共得到16*2*12＝384维语音特征。As shown in Figure 2, when the embodiment of the present invention extracts segment-level features in the speech to be recognized in the third step, statistical analysis is performed on the slice-level features obtained in the second step, and includes 12 preset statistics parameters, after the third step of segment-level feature extraction, a total of 16*2*12=384-dimensional speech features are obtained.

本发明实施例通过特征提取工具提取出待识别应答语音中的语音特征之后，将提取出的语音特征输入到应答方式识别模型中，以使所述应答方式识别模型根据所述语音特征识别所述待识别应答语音对应的应答方式；并获取该应答方式识别模型根据输入的语音特征，输出的该待识别应答语音对应的应答方式。In the embodiment of the present invention, after the speech features in the response speech to be recognized are extracted by a feature extraction tool, the extracted speech features are input into the response mode recognition model, so that the response mode recognition model can recognize the speech features according to the speech features. The response mode corresponding to the response voice to be recognized; and the response mode corresponding to the response voice to be recognized outputted by the response mode recognition model according to the input voice features.

需要说明的是，本发明实施例的应答方式识别模型为经过预先训练的、用于识别应答方式的模型。It should be noted that the response mode identification model in the embodiment of the present invention is a pre-trained model for identifying response modes.

由于本发明实施例对待识别应答语音对应的应答方式的识别，主要借助于应答方式识别模型，并且该应答方式识别模型为经过预先训练的模型，因此，本发明实施例还包括一个重要的组成部分，即训练应答方式识别模型。Since the embodiment of the present invention recognizes the response mode corresponding to the response voice to be recognized, mainly by means of the response mode recognition model, and the response mode recognition model is a pre-trained model, therefore, the embodiment of the present invention also includes an important component , that is, to train the response pattern recognition model.

下面详细说明本发明实施例训练应答方式识别模型的过程。The process of training the response mode recognition model in the embodiment of the present invention will be described in detail below.

如图3所示，本发明实施例获得应答方式识别模型的方法包括：As shown in Figure 3, the method for obtaining the response mode recognition model in the embodiment of the present invention includes:

步骤301、确定包含多个应答语音的训练集，以及包含多个应答语音的测试集；其中，所述训练集中的应答语音与所述测试集中的应答语音不同；Step 301, determining a training set comprising a plurality of response voices, and a test set comprising a plurality of response voices; wherein, the response voices in the training set are different from the response voices in the test set;

步骤302、针对所述训练集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练前的应答方式识别模型中进行训练；Step 302, for any one of the response voices in the training set, input the voice features extracted from the response voice into the response mode recognition model before training for training;

步骤303、针对所述测试集中任意一个应答语音，将从所述应答语音中提取出的语音特征输入到训练后的应答方式识别模型中，并获取所述应答方式识别模型输出的所述应答语音对应的应答方式；Step 303, for any one of the response voices in the test set, input the voice features extracted from the response voice into the trained response pattern recognition model, and obtain the response speech output by the response pattern recognition model The corresponding response method;

步骤304、根据训练后的应答方式识别模型输出的所述测试集中每一个应答语音对应的应答方式，确定所述训练后的应答方式识别模型的识别正确率，若所述识别正确率大于设定阈值，确定所述训练后的应答方式识别模型训练完成，保存所述训练后的应答方式识别模型。Step 304, according to the response mode corresponding to each response voice in the test set output by the trained response mode recognition model, determine the recognition accuracy rate of the trained response mode recognition model, if the recognition accuracy rate is greater than the set The threshold value is used to determine that the training of the trained response mode recognition model is completed, and to save the trained response mode recognition model.

步骤301中，本发明实施例在确定训练集和测试集时，从语料库中选取多个应答语音，将选取出的多个应答语音组成训练集或测试集。In step 301, the embodiment of the present invention selects a plurality of response voices from the corpus when determining the training set and the test set, and forms the training set or the test set from the selected multiple response voices.

本发明实施例的语料库为预先录制的语音，该预先录制的语音中包括多个正式应答方式和非正式应答方式的应答语音。The corpus in the embodiment of the present invention is a pre-recorded voice, and the pre-recorded voice includes multiple response voices in formal and informal response modes.

例如，语料库可以为在执行实际飞行过程中录制的17.5小时的语音，在录制好之后，对该17.5小时的语音进行标注，假设标注确定该17.5小时的语音中共包括18个说话人，其中包含了4668个正式应答方式的应答语音，以及2257个非正式应答方式的应答语音，则正式应答方式的应答语音与非正式应答方式的应答语音的比例为2.07:1，并且所有应答语音的语音采样频率都为16KHz，量化精度为16bit。For example, the corpus can be 17.5 hours of speech recorded during the actual flight. After the recording, the 17.5 hours of speech are marked, assuming that the marking determines that the 17.5 hours of speech includes 18 speakers, including There are 4668 response voices in the formal response mode and 2257 response voices in the informal response mode, then the ratio of the response voices in the formal response mode to the response voices in the informal response mode is 2.07:1, and the voice sampling frequency of all the response voices Both are 16KHz, and the quantization precision is 16bit.

从语料库中的所有应答语音中选取出多个应答语音，组成训练集；较佳的，训练集中正式应答方式的应答语音与非正式应答方式的应答语音的比例，接近语料库中正式应答方式的应答语音与非正式应答方式的应答语音的比例。Select a plurality of response voices from all the response voices in the corpus to form a training set; preferably, the ratio of the response voices in the formal response mode to the response voices in the informal response mode in the training set is close to the response in the formal response mode in the corpus Ratio of speech to speech in response to informal responses.

例如，确定两个训练集，分别为训练集A和训练集B，以及确定一个测试集C，其中，训练集A、B和测试集C中正式应答方式的应答语音与非正式应答方式的应答语音的数量及比例如表4所示：For example, determine two training sets, which are respectively training set A and training set B, and determine a test set C, wherein, in the training set A, B and test set C, the response voice of the formal answering mode and the answering voice of the informal answering mode The number and proportion of voices are shown in Table 4:

从语料库中选取1580个正式应答方式的应答语音，以及1580个非正式应答方式的应答语音组成训练集A，训练集A中正式应答方式的应答语音与非正式应答方式的应答语音的比例为1:1；从语料库中选取3270个正式应答方式的应答语音，以及1580个非正式应答方式的应答语音组成训练集B，训练集B中正式应答方式的应答语音与非正式应答方式的应答语音的比例为2.07:1；从语料库中选取1400个正式应答方式的应答语音，以及677个非正式应答方式的应答语音组成测试集C，测试集C中正式应答方式的应答语音与非正式应答方式的应答语音的比例为2.07:1。Select 1580 formal response voices and 1580 informal response voices from the corpus to form training set A. The ratio of formal response voices to informal response voices in training set A is 1 : 1; from the corpus, select 3270 response voices of formal response mode and 1580 response voices of informal response mode to form training set B, the difference between the response voice of formal response mode and the response voice of informal response mode in training set B The ratio is 2.07:1; 1400 formal response voices and 677 informal response voices are selected from the corpus to form test set C. The ratio of voice to answer is 2.07:1.

表4Table 4

下面以表4所示的训练集A、B和测试集C为例，说明训练应答方式识别模型的方法。The following takes the training sets A, B and test set C shown in Table 4 as examples to illustrate the method of training the response mode recognition model.

具体的，本发明实施例是通过训练集A和训练集B中每一个应答语音，对应答方式识别模型进行训练，在训练完成后，将测试集C中的每一个应答语音输入训练后的应答方式识别模型，若应答方式识别模型输出的测试集C中应答语音对应的应答方式的正确识别率大于设定阈值时，确定该应答方式识别模型训练完成，并保存训练完成的应答方式识别模型。Specifically, in the embodiment of the present invention, the response mode recognition model is trained through each response voice in the training set A and the training set B, and after the training is completed, each response voice in the test set C is input into the trained response For the mode recognition model, if the correct recognition rate of the response mode corresponding to the response voice in the test set C output by the response mode recognition model is greater than the set threshold, it is determined that the training of the response mode recognition model is completed, and the trained response mode recognition model is saved.

下面针对训练集A中任意一个应答语音，说明训练应答方式识别模型的过程：The following describes the process of training the response mode recognition model for any response voice in the training set A:

1、使用特征提取工具，提取该应答语音的语音特征。1. Use a feature extraction tool to extract the speech features of the answering speech.

具体提取应答语音的语音特征的方法采用上述方法，在此不再详细赘述。The specific method for extracting the speech features of the response speech adopts the above-mentioned method, which will not be described in detail here.

2、将该应答语音对应的语音特征输入应答方式识别模型中进行训练。2. Input the speech features corresponding to the response speech into the response mode recognition model for training.

具体的，在将应答语音对应的应答语音输入应答方式识别模型，并将所述应答语音对应的应答方式输入应答方式识别模型，以使应答方式识别模型学习到该语音特征对应的应答方式。Specifically, the answering voice corresponding to the answering voice is input into the response mode recognition model, and the answering mode corresponding to the response voice is input into the response mode recognition model, so that the response mode recognition model learns the response mode corresponding to the voice feature.

本发明实施例采用上述的方式，使用训练集中的应答语音对应答方式识别模型进行训练，在经过训练集A和训练级B中的多个应答语音进行多次训练后，使用测试集C中的应答语音，判断该应答方式识别模型是否训练完成。The embodiment of the present invention adopts the above-mentioned method, and uses the response voice in the training set to train the response mode recognition model. Respond to the voice, and judge whether the training of the recognition model of the response method is completed.

具体的，在采用测试集C判断应答方式识别模型是否训练完成时，针对测试集C中的任意一个应答语音，执行下列操作：Specifically, when the test set C is used to judge whether the response mode recognition model has been trained, for any answer voice in the test set C, the following operations are performed:

1、使用特征提取工具，提取该应答语音的语音特征；1. Use the feature extraction tool to extract the speech features of the response speech;

2、将该应答语音对应的语音特征输入训练后的应答方式识别模型；2. Input the speech features corresponding to the response speech into the trained response mode recognition model;

3、获取训练后的应答方式识别模型输出的该应答语音对应的应答方式。3. Acquiring the response mode corresponding to the response voice output by the trained response mode recognition model.

具体的，预先设定应答方式识别模型在确定应答语音对应的应答方式为正式应答方式时，应答方式识别模型输出“1”；在确定应答语音对应的应答方式为非式应答方式时，应答方式识别模型输出“0”。Specifically, when the pre-set response mode recognition model determines that the response mode corresponding to the response voice is a formal response mode, the response mode recognition model outputs "1"; when determining that the response mode corresponding to the response voice is an informal response mode, the response mode The recognition model outputs "0".

本发明实施例在使用训练后的应答方式识别模型对测试集C中的每一个应答语音进行判断后，确定测试集C中每一个应答语音对应的识别结果；将应答方式识别模型确定的测试集C中每一个应答语音对应的识别结果，与每一个应答语音对应的应答方式进行比较，确定测试集C对应的识别结果的正确识别率，若该正确识别率大于设定阈值，则确定该应答方式识别模型训练完成，保存训练后的应答方式识别模型；若该正确识别率不大于设定阈值，则重新选择训练集和测试集，对该应答方式识别模型继续训练，直到该应答方式识别模型对测试集中应答语音的识别结果对应的正确识别率大于设定阈值。In the embodiment of the present invention, after using the trained response mode recognition model to judge each response voice in the test set C, determine the corresponding recognition result of each response voice in the test set C; the test set determined by the response mode recognition model The recognition result corresponding to each response voice in C is compared with the response method corresponding to each response voice to determine the correct recognition rate of the recognition result corresponding to the test set C. If the correct recognition rate is greater than the set threshold, the response is determined. After the mode recognition model training is completed, save the trained response mode recognition model; if the correct recognition rate is not greater than the set threshold, then re-select the training set and test set, and continue training the response mode recognition model until the response mode recognition model The correct recognition rate corresponding to the recognition result of the response speech in the test set is greater than the set threshold.

如图4所示，本发明实施例获得应答方式识别模型的方法的整体流程图。As shown in FIG. 4 , an overall flow chart of a method for obtaining a response mode recognition model according to an embodiment of the present invention.

步骤401、确定包含多个应答语音的训练集，以及包含多个应答语音的测试集；其中，所述训练集中的应答语音与所述测试集中的应答语音不同；Step 401, determining a training set comprising a plurality of response voices, and a test set comprising a plurality of response voices; wherein, the response voices in the training set are different from the response voices in the test set;

下列步骤402、403为针对训练集中的任意一个应答语音。The following steps 402 and 403 are for any answering speech in the training set.

步骤402、使用特征提取工具，提取所述应答语音中的语音特征；Step 402, using a feature extraction tool to extract speech features in the response speech;

步骤403、将提取出的语音特征，以及所述应答语音对应的应答方式输入到应答方式识别模型中进行训练；Step 403, input the extracted speech features and the corresponding response mode of the response voice into the response mode recognition model for training;

下列步骤404、405为针对训练集中的任意一个应答语音。The following steps 404 and 405 are for any answering speech in the training set.

步骤404、使用特征提取工具，提取所述应答语音中的语音特征；Step 404, using a feature extraction tool to extract speech features in the response speech;

步骤405、将提取出的语音特征输入到应答方式识别模型中进行识别；Step 405, input the extracted speech features into the response mode recognition model for recognition;

步骤406、确定所述测试集中每一个应答语音的识别结果；Step 406, determining the recognition result of each response voice in the test set;

步骤407、将所述测试集中每一个应答语音的识别结果，与测试集中每一个应答语音对应的应答方式进行比较，确定所述测试集对应的识别结果的正确识别率；Step 407: Comparing the recognition result of each response voice in the test set with the corresponding answer mode of each response voice in the test set, and determining the correct recognition rate of the recognition result corresponding to the test set;

步骤408、判断正确识别率是否大于设定阈值，若是，执行步骤409，若否，返回步骤401；Step 408, judging whether the correct recognition rate is greater than the set threshold, if so, execute step 409, if not, return to step 401;

步骤409、确定所述应答方式识别模型训练完成后，保存训练后的应答方式识别模型。Step 409, after determining that the training of the response mode recognition model is completed, save the trained response mode recognition model.

本发明实施例在识别应答方式的二分类问题中，采用了适用于小数据量的支持向量机SVM分类器作为应答方式识别模型，并且对比了如下核函数：线性核函数、多项式核函数、高斯径向基核函数以及反正切核函数。In the embodiment of the present invention, in the binary classification problem of identifying the response mode, the support vector machine SVM classifier suitable for small data volumes is used as the response mode identification model, and the following kernel functions are compared: linear kernel function, polynomial kernel function, Gaussian Radial basis kernel function and arctangent kernel function.

本发明实施例基于如表4所示的训练集，分别采用线性核函数、多项式核函数、高斯径向基核函数以及反正切核函数进行实验，得到的识别结果的准确率如图5A所示，其中，SVM核函数为线性核函数时，训练集A对应的识别结果的准确率为80.30，训练集B对应的识别结果的准确率为81.02；SVM核函数为多项式核函数，并且d＝2时，训练集A对应的识别结果的准确率为77.95，训练集B对应的识别结果的准确率为79.25；SVM核函数为多项式核函数，并且d＝3时，训练集A对应的识别结果的准确率为76.17，训练集B对应的识别结果的准确率为81.13；SVM核函数为多项式核函数，并且d＝4时，训练集A对应的识别结果的准确率为63.79，训练集B对应的识别结果的准确率为63.94；SVM核函数为高斯径向基核函数时，训练集A对应的识别结果的准确率为90.71，训练集B对应的识别结果的准确率为91.62；SVM核函数为反正切核函数时，训练集A对应的识别结果的准确率为84.45，训练集B对应的识别结果的准确率为89.56；The embodiment of the present invention is based on the training set shown in Table 4, respectively using linear kernel function, polynomial kernel function, Gaussian radial basis kernel function and arctangent kernel function to conduct experiments, and the accuracy of the obtained recognition results is shown in Figure 5A , wherein, when the SVM kernel function is a linear kernel function, the accuracy rate of the recognition result corresponding to the training set A is 80.30, and the accuracy rate of the recognition result corresponding to the training set B is 81.02; the SVM kernel function is a polynomial kernel function, and d=2 , the accuracy rate of the recognition result corresponding to the training set A is 77.95, and the accuracy rate of the recognition result corresponding to the training set B is 79.25; the SVM kernel function is a polynomial kernel function, and when d=3, the recognition result corresponding to the training set A is The accuracy rate is 76.17, the accuracy rate of the recognition result corresponding to the training set B is 81.13; the SVM kernel function is a polynomial kernel function, and when d=4, the accuracy rate of the recognition result corresponding to the training set A is 63.79, and the accuracy rate of the recognition result corresponding to the training set B is 63.79. The accuracy rate of the recognition result is 63.94; when the SVM kernel function is a Gaussian radial basis kernel function, the accuracy rate of the recognition result corresponding to the training set A is 90.71, and the accuracy rate of the recognition result corresponding to the training set B is 91.62; the SVM kernel function is When the arctangent kernel function is used, the accuracy rate of the recognition result corresponding to the training set A is 84.45, and the accuracy rate of the recognition result corresponding to the training set B is 89.56;

并且，SVM模型分别采用线性核函数、多项式核函数、高斯径向基核函数以及反正切核函数的性能比较如图5B所示。Moreover, the performance comparison of the SVM model using linear kernel function, polynomial kernel function, Gaussian radial basis kernel function and arctangent kernel function is shown in Fig. 5B.

基于同一发明构思，本发明实施例中还提供了一种应答方式的识别装置，由于该装置解决问题的原理与本发明实施例应答方式的识别的方法相似，因此该装置的实施可以参见方法的实施，重复之处不再赘述。Based on the same inventive concept, an identification device of a response mode is also provided in the embodiment of the present invention. Since the problem-solving principle of the device is similar to the identification method of the response mode in the embodiment of the present invention, the implementation of the device can refer to the method implementation, the repetition will not be repeated.

如图6所示，本发明实施例应答语音的识别装置，包括：As shown in Figure 6, the recognition device of the response voice in the embodiment of the present invention includes:

获取模块601，获取模块，用于获取待识别应答语音；Obtaining module 601, an obtaining module, used to obtain the response voice to be recognized;

识别模块602，用于使用应答方式识别模型确定所述待识别应答语音对应的应答方式；其中，所述应答方式识别模型为有监督的机器学习模型；The identification module 602 is configured to use a response mode recognition model to determine the response mode corresponding to the response speech to be recognized; wherein, the response mode recognition model is a supervised machine learning model;

判断模块603，用于若所述应答方式为正式应答方式，则将所述待识别应答语音输入第一语音识别系统，以使所述第一语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；若所述应答方式为非正式应答方式，则将所述待识别应答语音输入第二语音识别系统，以使所述第二语音识别系统识别所述待识别应答语音，并输出所述待识别应答语音对应的文本信息；其中，所述第一语音识别系统和所述第二语音识别系统配置有不同的参数。Judging module 603, configured to input the response voice to be recognized into the first voice recognition system if the response mode is a formal response mode, so that the first voice recognition system can recognize the response voice to be recognized and output The text information corresponding to the response voice to be recognized; if the response mode is an informal response mode, input the response voice to be recognized into the second voice recognition system, so that the second voice recognition system can recognize the response voice to be recognized Recognizing a response voice, and outputting text information corresponding to the response voice to be recognized; wherein, the first voice recognition system and the second voice recognition system are configured with different parameters.

可选的，所述识别模块602，具体用于：Optionally, the identification module 602 is specifically configured to:

所述识别模块602，具体用于：The identification module 602 is specifically used for:

可选的，所述获取模块601，还用于：Optionally, the obtaining module 601 is also used for:

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器，使得通过该计算机或其他可编程数据处理设备的处理器执行的指令可实现流程图中的一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special purpose computer, an embedded processor, or other programmable data processing equipment, so that the instructions executed by the processor of the computer or other programmable data processing equipment can realize the A process or processes and/or a function specified in a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instructions The device realizes the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图的一个流程或多个流程和/或方框图的一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, thereby The instructions provide steps for implementing the functions specified in the flow or flows of the flowcharts and/or the block or blocks of the block diagrams.

尽管已描述了本发明的优选实施例，但本领域内的技术人员一旦得知了基本创造性概念，则可对这些实施例做出另外的变更和修改。所以，所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。While preferred embodiments of the present invention have been described, additional changes and modifications can be made to these embodiments by those skilled in the art once the basic inventive concept is appreciated. Therefore, it is intended that the appended claims be construed to cover the preferred embodiment as well as all changes and modifications which fall within the scope of the invention.

显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention also intends to include these modifications and variations.

Claims

1. a kind of recognition methodss of response voice, it is characterised in that the method includes：

Obtain response voice to be identified；

The corresponding response mode of the response voice to be identified is determined using response mode identification model；Wherein, the answer party Formula identification model is the machine learning model for having supervision；

If the response mode is formal response mode, by first speech recognition system of response phonetic entry to be identified, So that first speech recognition system recognizes the response voice to be identified, and it is corresponding to export the response voice to be identified Text message；

If the response mode is unofficial response mode, by the second speech recognition of response phonetic entry system to be identified System, so that second speech recognition system recognizes the response voice to be identified, and exports the response voice pair to be identified The text message answered；

Wherein, first speech recognition system and second speech recognition system are configured with different parameters.

2. the method for claim 1, it is characterised in that the use response mode identification model determines described to be identified The corresponding response mode of response voice, specifically includes：

The phonetic feature extracted from the response voice to be identified is input into into the response mode identification model；

Obtain the corresponding response mode of response voice described to be identified of the response mode identification model output.

3. method as claimed in claim 2, it is characterised in that the phonetic feature includes frame level feature, chip level feature and section Level feature；

Phonetic feature is extracted from response voice according to following manner：

Using feature extraction tools, moved according to default frame length and frame, extract the frame level feature of the response voice to be identified；

The frame level feature is done into the disposal of gentle filter, and calculus of differences is done to the frame level feature after smoothing processing, it is determined that described The chip level feature of response voice to be identified；

According to default statistical parameter, process is analyzed to the chip level feature, determines the section of the response voice to be identified Level feature.

4. the method for claim 1, it is characterised in that the response mode identification model is obtained according to following manner：

It is determined that the training set comprising multiple response voices, and the test set comprising multiple response voices；Wherein, the training set In response voice it is different from the response voice in the test set；

For any one response voice in the training set, the phonetic feature extracted from the response voice is input to It is trained in response mode identification model before training；

For any one response voice in the test set, the phonetic feature extracted from the response voice is input to In response mode identification model after training, and obtain the response language of the output of the response mode identification model after the training The corresponding response mode of sound；

In the test set according to the response mode identification model output after the training, each response voice is corresponding should Mode is answered, the recognition correct rate of the response mode identification model after the training is determined, if the recognition correct rate is more than setting Threshold value, determines that the training of the response mode identification model after the training is completed, and preserves the response mode identification mould after the training Type.

5. the method as described in Claims 1 to 4 is arbitrary, it is characterised in that the response mode identification model is supporting vector Machine SVM models.

6. a kind of identifying device of response voice, it is characterised in that include：

Acquisition module, for obtaining response voice to be identified；

Identification module, for determining the corresponding response mode of the response voice to be identified using response mode identification model；Its In, the response mode identification model is the machine learning model for having supervision；

Judge module, if being formal response mode for the response mode, by the response phonetic entry to be identified first Speech recognition system, so that first speech recognition system recognizes the response voice to be identified, and exports described to be identified The corresponding text message of response voice；If the response mode is unofficial response mode, by the response voice to be identified The second speech recognition system is input into, so that second speech recognition system recognizes the response voice to be identified, and institute is exported State the corresponding text message of response voice to be identified；Wherein, first speech recognition system and the second speech recognition system It is under unified central planning to be equipped with different parameters.

7. device as claimed in claim 6, it is characterised in that the identification module, specifically for：

The phonetic feature extracted from the response voice to be identified is input into into the response mode identification model；Obtain described answering Answer the corresponding response mode of response voice described to be identified of mode identification model output.

8. device as claimed in claim 7, it is characterised in that the phonetic feature includes frame level feature, chip level feature and section Level feature；

The identification module, specifically for：

Using feature extraction tools, moved according to default frame length and frame, extract the frame level feature of the response voice to be identified；Will The frame level feature does the disposal of gentle filter, and does calculus of differences to the frame level feature after smoothing processing, determines described to be identified The chip level feature of response voice；According to default statistical parameter, process is analyzed to the chip level feature, it is determined that described wait to know The Utterance level feature of voice is not replied.

9. device as claimed in claim 6, it is characterised in that the acquisition module, is additionally operable to：

The response mode identification model is obtained according to following manner：

It is determined that the training set comprising multiple response voices, and the test set comprising multiple response voices；Wherein, the training set In response voice it is different from the response voice in the test set；For any one response voice in the training set, will The phonetic feature extracted from the response voice is input in the response mode identification model before training and is trained；For Any one response voice in the test set, the phonetic feature extracted from the response voice is input to after training In response mode identification model, and it is corresponding to obtain the response voice of the output of the response mode identification model after the training Response mode；In the test set according to the response mode identification model output after training, each response voice is corresponding should Mode is answered, the recognition correct rate of the response mode identification model after the training is determined, if the recognition correct rate is more than setting Threshold value, determines that the training of the response mode identification model after the training is completed, and preserves the response mode identification mould after the training Type.

10. the device as described in claim 6～9 is arbitrary, it is characterised in that the response mode identification model is supporting vector Machine SVM models.