CN115457938A - Method, device, storage medium and electronic device for identifying wake-up words - Google Patents
- Publication number: CN115457938A (application CN202211145889.3A)
- Authority
- CN
- China
- Prior art keywords: target, wake, word, decoding, acoustic feature
- Prior art date
- Legal status: Pending (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit (under G10L15/00—Speech recognition)
- G10L15/16—Speech classification or search using artificial neural networks (under G10L15/08—Speech classification or search)
- G10L15/26—Speech to text systems
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing (under G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction)
Abstract
Embodiments of the present invention provide a method, an apparatus, a storage medium, and an electronic device for recognizing wake-up words. The method includes: performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; processing the multi-frame acoustic feature vectors with a deep neural network to obtain a target processing result; decoding the multi-frame acoustic feature vectors with a decoding graph to obtain a target decoding result; and determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result. The invention addresses the problem in the related art that crosstalk between wake-up words lowers the accuracy of wake-up word recognition.
Description
Technical Field
Embodiments of the present invention relate to the field of voice wake-up, and in particular to a method, an apparatus, a storage medium, and an electronic device for recognizing wake-up words.
Background
In recent years, with the rapid development of information technology, speech-recognition technologies have greatly facilitated and enriched people's lives. Smart home devices, video-conferencing equipment, household appliances, and similar products now ship with mature voice wake-up functions: a user speaks a wake-up word to wake the device and then begins human-computer voice interaction with it. Voice wake-up is therefore an essential part of voice interaction.
Current wake-up word recognition often requires defining multiple wake-up words, training them together, and performing the classification task within a single model. This causes crosstalk between the wake-up words, lowers the accuracy of wake-up word recognition, and thereby increases the device's false wake-up rate. The prior art thus suffers from low wake-up word recognition accuracy caused by crosstalk between wake-up words.
No effective solution has yet been proposed for this problem in the related art.
Summary of the Invention
Embodiments of the present invention provide a method, an apparatus, a storage medium, and an electronic device for recognizing wake-up words, so as to at least solve the problem in the related art that crosstalk between wake-up words lowers the accuracy of wake-up word recognition.
According to one embodiment of the present invention, a method for recognizing wake-up words is provided, including: performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; processing the multi-frame acoustic feature vectors with a deep neural network to obtain a target processing result; decoding the multi-frame acoustic feature vectors with a decoding graph to obtain a target decoding result; and determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
In an exemplary embodiment, processing the multi-frame acoustic feature vectors with a deep neural network to obtain a target processing result includes: inputting the multi-frame acoustic feature vectors into the deep neural network, and classifying each frame's acoustic feature vector with the network to obtain the phoneme posterior feature vector corresponding to each frame, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames.
In an exemplary embodiment, decoding the multi-frame acoustic feature vectors with the decoding graph to obtain a target decoding result includes: decoding the multi-frame acoustic feature vectors over the multiple paths of the decoding graph to obtain a target path, and determining the target path as the target decoding result.
In an exemplary embodiment, decoding the multi-frame acoustic feature vectors over the multiple paths of the decoding graph to obtain the target path includes: determining the target path among the multiple paths of the decoding graph with a token-passing algorithm.
In an exemplary embodiment, recognizing the wake-up word to be recognized in the target speech signal according to the target processing result and the target decoding result includes: when the target path contains a wake-up word to be recognized, determining, among the phoneme posterior feature vectors corresponding to the frames, the target phoneme posterior feature vectors corresponding to that wake-up word, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames and the target decoding result includes the target path; and recognizing the wake-up word to be recognized with the target phoneme posterior feature vectors.
In an exemplary embodiment, recognizing the wake-up word to be recognized with the target phoneme posterior feature vectors includes: determining a target distance between the target phoneme posterior feature vectors and a preset standard template, and recognizing the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In an exemplary embodiment, recognizing the wake-up word to be recognized according to the relationship between the target distance and the preset standard distance includes: when the difference between the target distance and the preset standard distance is less than or equal to a preset threshold, determining the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
According to another embodiment of the present invention, an apparatus for recognizing wake-up words is provided, including: an extraction module configured to perform feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; a processing module configured to process the multi-frame acoustic feature vectors with a deep neural network to obtain a target processing result; a decoding module configured to decode the multi-frame acoustic feature vectors with a decoding graph to obtain a target decoding result; and a determination module configured to determine the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
According to another embodiment of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, where the computer program is configured to perform, when run, the steps of any one of the method embodiments above.
According to another embodiment of the present invention, an electronic device is provided, including a memory and a processor, where the memory stores a computer program and the processor is configured to run the computer program to perform the steps of any one of the method embodiments above.
With the present invention, the acoustic feature vectors are decoded to obtain a target decoding result, which is then verified against the multi-frame posterior feature vectors corresponding to the acoustic feature vectors. Instead of taking the target decoding result directly as the recognition result, the recognition result of the wake-up word in the target speech signal is determined only after the target decoding result has been verified against the target processing result. This prevents crosstalk between wake-up words from causing a wake-up word in the target speech signal to be recognized as a different wake-up word, and thus improves the accuracy of wake-up word recognition.
Brief Description of the Drawings
Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the method for recognizing wake-up words according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for recognizing wake-up words according to an embodiment of the present invention;
Fig. 3 is a flowchart of the method for recognizing wake-up words according to a specific embodiment of the present invention;
Fig. 4 is a structural block diagram of the apparatus for recognizing wake-up words according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and in combination with the embodiments.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the accompanying drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence.
The method embodiments provided in this application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the method for recognizing wake-up words according to an embodiment of the present invention. As shown in Fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The mobile terminal may further include a transmission device 106 for communication and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal may include more or fewer components than shown in Fig. 1, or have a different configuration.
The memory 104 may store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the method for recognizing wake-up words in the embodiments of the present invention. By running the computer programs stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the method described above. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102 and connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication carrier of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can connect to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module used to communicate with the Internet wirelessly.
This embodiment provides a method for recognizing wake-up words. Fig. 2 is a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:
Step S202: perform feature extraction on the target speech signal to obtain multi-frame acoustic feature vectors;
Step S204: process the multi-frame acoustic feature vectors with a deep neural network to obtain a target processing result;
Step S206: decode the multi-frame acoustic feature vectors with a decoding graph to obtain a target decoding result;
Step S208: determine the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
In the technical solution of step S202, the target speech signal is a speech signal collected by a speech acquisition device. Before wake-up words in the target speech signal are recognized, effective acoustic features must be extracted from the waveform of the speech signal; feature extraction is critical to the accuracy of the subsequent wake-up word recognition system.
The features used for the acoustic feature vectors may include at least one of the following: MFCC (Mel-Frequency Cepstral Coefficients), FBANK (filter bank features), PLP (Perceptual Linear Prediction), PCEN (Per-Channel Energy Normalization), and so on.
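The patent does not fix a particular front end, so as an illustration only, the following is a minimal numpy sketch of FBANK-style (log-mel filterbank) feature extraction; the sample rate, frame length, hop, and number of mel bands are assumed values, not ones specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    """Split the waveform into overlapping frames and compute log-mel energies,
    yielding one acoustic feature vector per frame."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i*hop : i*hop+frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # per-frame power spectrum
    # Triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m-1], bins[m], bins[m+1]
        for k in range(l, c):
            fb[m-1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m-1, k] = (r - k) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)                   # (n_frames, n_mels)

# One second of audio at 16 kHz -> one 40-dim feature vector per 10 ms hop
feats = fbank_features(np.random.randn(16000))
```

Each row of `feats` is one frame's acoustic feature vector, the input unit consumed by the deep neural network in step S204.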
In the technical solution of step S204, the deep neural network may be the deep neural network of a DNN-HMM (Deep Neural Network-Hidden Markov Model) architecture. The multi-frame acoustic feature vectors are input into the deep neural network to obtain, for each frame, the posterior feature vector over K predefined phoneme classes; these posterior feature vectors constitute the target processing result.
In the technical solution of step S206, the decoding graph is an HCLG decoding graph, a large resource graph built from a language model, a lexicon, context-dependent phonemes, and HMMs. The multi-frame acoustic feature vectors are decoded in the HCLG graph to obtain the target decoding result.
The HCLG decoding graph contains multiple paths. During decoding, one or more optimal paths are selected from them as the target decoding result, and each path carries a word-level result. For example, if the optimal path decoded on the HCLG graph is path 1, the word-level result corresponding to path 1 might be "please turn on the device".
In the technical solution of step S208, it is first checked whether a wake-up word appears in the target decoding result. If no wake-up word is found, the target speech signal is determined not to contain a wake-up word and the device is not woken. If a wake-up word is found, the target processing result and the target decoding result are used together to determine whether the wake-up word decoded from the decoding graph is actually contained in the target speech signal. For example, if the decoding result on the HCLG graph is "please turn on the device", it contains the wake-up word "turn on"; the target processing result and target decoding result are then combined to confirm that the target speech signal indeed contains the wake-up word "turn on", after which "turn on" is taken as the wake-up word recognized in the target speech signal.
Through the above steps, the acoustic feature vectors are decoded to obtain a target decoding result, which is verified against the multi-frame posterior feature vectors corresponding to the acoustic feature vectors. Rather than taking the decoding result directly as the recognition result, the recognition result of the wake-up word in the target speech signal is determined only after this verification. Because processing the acoustic feature vectors yields phoneme posterior features, the decoding result obtained from the multi-frame acoustic feature vectors can be matched against those posterior features to judge its accuracy, reducing errors introduced into the decoding result by crosstalk between wake-up words. This prevents a wake-up word in the target speech signal from being recognized as a different wake-up word and thus improves the accuracy of wake-up word recognition.
In an optional embodiment, processing the multi-frame acoustic feature vectors with a deep neural network to obtain the target processing result includes: inputting the multi-frame acoustic feature vectors into the deep neural network, and classifying each frame's acoustic feature vector with the network to obtain the phoneme posterior feature vector corresponding to each frame, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames.
In this embodiment, K phonemes are predefined and phoneme modeling is applied to both wake-up words and non-wake-up words. Wake-up words are modeled finely: one wake-up word contains multiple phonemes. Non-wake-up words are modeled coarsely, at the whole-word level: one non-wake-up word corresponds to a single phoneme unit.
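The two modeling granularities can be pictured as a lexicon mapping words to unit sequences. The entries below are made-up examples, not the patent's actual wake-up words or phoneme inventory:

```python
# Hypothetical lexicon: wake-up words are broken into phoneme sequences
# (fine modeling), while each non-wake-up "filler" word maps to a single
# coarse whole-word unit.
lexicon = {
    "kaiqi":  ["k", "ai", "q", "i"],     # wake-up word -> multiple phonemes
    "guanbi": ["g", "uan", "b", "i"],    # wake-up word -> multiple phonemes
    "qing":   ["FILLER_qing"],           # non-wake-up word -> one unit
    "shebei": ["FILLER_shebei"],         # non-wake-up word -> one unit
}

# The K classes the DNN classifies each frame over are the union of all units.
phoneme_set = sorted({p for units in lexicon.values() for p in units})
K = len(phoneme_set)
```

Fine units give the decoder enough resolution to distinguish similar wake-up words, while coarse filler units keep the model small for everything else.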
The input to the deep neural network is one frame's acoustic feature vector, and the output is the phoneme posterior feature vector for that frame. For example, let the multi-frame acoustic feature vectors extracted from the target speech signal be u_o = {o_1, o_2, ..., o_n}, where n is the number of frames. Each frame's acoustic feature vector is input into the deep neural network in turn to obtain the phoneme posterior feature vector of the corresponding frame, so the n acoustic feature vectors yield n phoneme posterior feature vectors: the vector for the first frame o_1 is PG_o1, the vector for the second frame o_2 is PG_o2, and so on, up to PG_on for the n-th frame o_n.
A phoneme posterior feature vector PG_o represents the posterior probability distribution of the corresponding speech feature vector o over the K predefined phonemes {C_1, C_2, ..., C_K}:

PG_o = {p(C_1|o), p(C_2|o), ..., p(C_K|o)}

where p(C_1|o) is the posterior probability of feature vector o on the first phoneme class, p(C_2|o) that on the second phoneme class, and so on, up to p(C_K|o), the posterior probability of feature vector o on the K-th phoneme class.
The target processing result contains the multi-frame phoneme posterior feature vectors, and each frame's phoneme posterior feature vector corresponds one-to-one to that frame's acoustic feature vector.
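The per-frame posterior PG_o above can be sketched as follows. The patent does not specify the network's architecture, so a single linear layer plus softmax stands in for the DNN here; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phoneme_posteriors(acoustic_feats, weights, bias):
    """Map each acoustic frame o_t to PG_o = {p(C_1|o), ..., p(C_K|o)}.
    A toy stand-in for the DNN: one linear layer followed by softmax."""
    logits = acoustic_feats @ weights + bias     # (n_frames, K)
    return softmax(logits)

rng = np.random.default_rng(0)
n_frames, feat_dim, K = 10, 40, 9
feats = rng.standard_normal((n_frames, feat_dim))    # stand-in acoustic features
W = rng.standard_normal((feat_dim, K)) * 0.1         # stand-in DNN weights
b = np.zeros(K)
PG = phoneme_posteriors(feats, W, b)                 # one posterior vector per frame
```

Each row of `PG` is a valid probability distribution over the K phoneme classes, matching the definition of PG_o given above.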
In an optional embodiment, decoding the multi-frame acoustic feature vectors with the decoding graph to obtain the target decoding result includes: decoding the multi-frame acoustic feature vectors over the multiple paths of the decoding graph to obtain a target path, and determining the target path as the target decoding result.
In this embodiment, the HCLG decoding graph contains multiple paths, and the concatenated outputs of all nodes on a path form the output sentence or word. After the HCLG decoding graph is built, the optimal path is found in it according to the multi-frame acoustic feature vectors; the cost of the optimal path's output label sequence against the speech to be recognized should be as small as possible, and the output label sequence extracted from the optimal path is the word-level recognition result. This process is decoding. Multiple optimal paths can also be found; their recognition results are called an N-best list.
In an optional embodiment, decoding the multi-frame acoustic feature vectors over the multiple paths of the decoding graph to obtain the target path includes: determining the target path among the multiple paths of the decoding graph with a token-passing algorithm.
In this embodiment, the token-passing algorithm places a token on each start node, one token per start node (if there are several start nodes, each gets a token). The multi-frame acoustic feature vectors are decoded frame by frame. After the first frame (the first acoustic feature vector) is decoded, the token on the start node is passed to the next node according to the decoded information and the transition cost is computed; the second frame is then decoded, the token is passed from the current node to the next node, and the transition cost accumulates. All acoustic feature vectors are decoded in turn. After the last frame is decoded, the state node where each token sits is inspected and the path the token traversed is traced back; at each pass the transition cost is computed and accumulated. If a state node has multiple outgoing transitions, the token is duplicated and passed along each of them. After the last frame, the accumulated costs of all tokens in the decoding graph are examined and the optimal path or paths are selected according to those costs; the lower the cost, the better the corresponding path.
An optimal path is globally optimal, and a globally optimal path is necessarily locally optimal: if a path is globally optimal, it must be the locally optimal path through every state it passes. Therefore, when multiple tokens arrive at the same state node, only the best token (the one with the smallest accumulated cost) needs to be kept.
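The token-passing procedure described above can be sketched on a toy graph. This is not the patent's HCLG graph: the graph, arc labels, and per-frame acoustic costs below are invented for illustration, and only the core rule is kept — propagate tokens frame by frame, accumulate cost, and retain one best token per state.

```python
import math

def token_passing(graph, start, finals, frame_scores):
    """Token passing over a small decoding graph.
    graph: {state: [(next_state, arc_label, output_label_or_None), ...]}
    frame_scores: per frame, {arc_label: acoustic cost} (lower is better)."""
    tokens = {start: (0.0, [])}                 # state -> (accumulated cost, output)
    for scores in frame_scores:
        new_tokens = {}
        for state, (cost, out) in tokens.items():
            for nxt, arc, emit in graph.get(state, []):
                c = cost + scores.get(arc, math.inf)
                o = out + ([emit] if emit else [])
                # keep only the best (lowest-cost) token per state
                if nxt not in new_tokens or c < new_tokens[nxt][0]:
                    new_tokens[nxt] = (c, o)
        tokens = new_tokens
    return min((tokens[s] for s in finals if s in tokens), default=None)

# Toy graph: a wake-word path emitting "kai" competes with a filler path.
graph = {
    0: [(1, "k", None), (3, "f", None)],
    1: [(2, "ai", "kai")],
    3: [(2, "f", "noise")],
}
frames = [{"k": 0.1, "f": 1.0}, {"ai": 0.2, "f": 1.0}]
best = token_passing(graph, 0, finals={2}, frame_scores=frames)
```

Here the wake-word path accumulates cost 0.3 versus 2.0 for the filler path, so the surviving token at the final state carries the output "kai".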
In an optional embodiment, recognizing the wake-up word to be recognized in the target speech signal according to the target processing result and the target decoding result includes: when the target path contains a wake-up word to be recognized, determining, among the phoneme posterior feature vectors corresponding to the frames, the target phoneme posterior feature vectors corresponding to that wake-up word, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames and the target decoding result includes the target path; and recognizing the wake-up word to be recognized with the target phoneme posterior feature vectors.
In this embodiment, it is checked whether the sentence or word output on the target path contains a wake-up word. If it does (the wake-up word to be recognized), the one or more frames of phoneme posterior feature vectors corresponding to that wake-up word are located among the frames' phoneme posterior feature vectors and taken as the target phoneme posterior feature vectors, and whether the wake-up word is taken as the result of this recognition is determined from the target phoneme posterior feature vectors.
The target phoneme posterior feature vectors contain one or more frames of phoneme posterior feature vectors.
For example, suppose the sentence output on the target path is "please turn on the device", where "please" and "device" are non-wake-up words and "turn on" is a wake-up word. In this case the target path contains a wake-up word to be recognized, so the phoneme posterior feature vectors corresponding to "turn on" are found among the multi-frame phoneme posterior feature vectors in the target processing result and determined to be the target phoneme posterior feature vectors. Suppose classifying the acoustic feature vectors yields 10 frames of phoneme posterior feature vectors {PG_o1, PG_o2, ..., PG_o10}, where "please" corresponds to the first frame PG_o1, "turn on" corresponds to frames 2 through 8, and "device" corresponds to frames 9 and 10 (PG_o9, PG_o10). Then {PG_o2, ..., PG_o8} are determined as the target phoneme posterior feature vectors.
In an optional embodiment, recognizing the wake-up word to be recognized through the target phoneme posterior feature vectors includes: determining the target distance between the target phoneme posterior feature vectors and a preset standard template, and recognizing the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In this embodiment, a corresponding phoneme posterior feature vector sequence (i.e., a standard template) is preset for each wake-up word. The target distance between the standard template corresponding to the wake-up word to be recognized and the target phoneme posterior feature vector sequence is computed, and the relationship between the target distance and a preset standard distance determines whether the wake-up word to be recognized is taken as the recognition result of this wake-up detection.
It should be noted that a dynamic time warping (DTW) algorithm is used to compute the target distance between the standard template and the target phoneme posterior feature vector sequence.
Let the phoneme posterior feature vector sequences be u_x = {x_1, x_2, ..., x_n} (the standard template) and u_y = {y_1, y_2, ..., y_m} (the target sequence), where n and m are the frame counts of the two sequences. A distance matrix D is built in which element D(i, j) is the distance between the i-th frame vector of the standard template and the j-th frame vector of the target phoneme posterior feature vector sequence, measured by the negative log inner product: D(i, j) = -lg(x_i · y_j).
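The per-frame distance defined above can be written directly; "lg" is read here as the base-10 logarithm, and the inputs are assumed to be posterior vectors whose inner product is positive:

```python
import math

def neg_log_inner_product(x, y):
    """D(i, j) = -lg(x_i . y_j), the distance between two posterior vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return -math.log10(dot)

# Two identical uniform posteriors over 2 classes: dot = 0.5, distance = -lg(0.5).
print(round(neg_log_inner_product([0.5, 0.5], [0.5, 0.5]), 3))  # 0.301
```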
A possible correspondence φ between u_x and u_y is denoted by
φ(k) = (i_k, j_k), k = 1, 2, ..., T,
where T is the length of the warping path and k indexes its steps; at step k, the i_k-th frame vector of the u_x sequence corresponds to the j_k-th frame vector of the u_y sequence.
An optimal correspondence φ' is found in the matrix D as the one minimizing the cumulative distortion Dist_φ(u_x, u_y) = Σ_{k=1}^{T} D(i_k, j_k). The minimum cumulative distortion is determined as the target distance between the target phoneme posterior feature vector sequence and the preset standard template.
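A minimal dynamic-programming sketch of this minimisation is given below. The unit step pattern (insertion, deletion, diagonal) and the small clamp on the inner product are assumptions, since the patent does not spell out the step constraints:

```python
import math

def dtw_distance(ux, uy):
    """Minimum cumulative distortion between two posterior-vector sequences,
    with D(i, j) = -lg(x_i . y_j) as the frame distance."""
    n, m = len(ux), len(uy)
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dot = sum(a * b for a, b in zip(ux[i - 1], uy[j - 1]))
            d = -math.log10(max(dot, 1e-12))  # clamp to avoid log of zero
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

# A sequence matched against itself accumulates zero distortion on the diagonal.
template = [[1.0, 0.0], [0.0, 1.0]]
print(dtw_distance(template, template))  # 0.0
```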
In an optional embodiment, recognizing the wake-up word to be recognized according to the relationship between the target distance and the preset standard distance includes: when the difference between the target distance and the preset standard distance is less than or equal to a preset threshold, determining the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
In this embodiment, the preset standard distance is computed from the positive samples of a wake-up word test set and the wake-up word's standard template. The test set contains multiple test samples, which may differ in intonation and speaking rate; therefore, computing the distance between the posterior feature vector sequence of each test sample and the standard template yields multiple matching distances, from which the standard distance is derived. The standard distance may be obtained by averaging the multiple matching distances, or determined by other methods, which is not limited here.
Taking the wake-up word "turn on" as an example, the test set contains 10 test samples of "turn on", which may be spoken with different accents and at different speeds. The posterior feature vector sequence of each test sample is obtained; the standard template is the posterior feature vector sequence obtained from the word spoken in standard Mandarin at a standard rate. Matching each test sample's posterior feature vector sequence against the standard template by distance yields the standard distance.
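For instance, the averaging option mentioned above amounts to the following; the matching distances are made-up numbers for illustration:

```python
# Hypothetical DTW matching distances of the 10 positive test samples
# against the standard template of "turn on".
matching_distances = [1.42, 1.55, 1.38, 1.61, 1.47, 1.50, 1.44, 1.58, 1.40, 1.53]

# One option the text allows: take the mean as the standard distance.
standard_distance = sum(matching_distances) / len(matching_distances)
print(round(standard_distance, 3))  # 1.488
```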
When the difference between the computed target distance and the standard distance is less than or equal to the preset threshold, the wake-up word to be recognized in the target decoding result is determined as the final recognition result, i.e., the recognition result of the wake-up word in the target speech signal. For example, the wake-up word to be recognized in the target decoding result "Please turn on the device" is "turn on"; if the difference between the computed target distance and the standard distance is less than or equal to the preset threshold, "turn on" is determined as the recognition result, i.e., the target speech signal is determined to contain the wake-up word "turn on", and the device then performs the corresponding operation, namely the turn-on operation.
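The acceptance test itself is a one-line comparison. The sketch below reads "difference" as the absolute difference, which is one plausible interpretation of the text, and the numeric values are illustrative:

```python
def accept_wake_word(target_distance, standard_distance, threshold):
    """Accept the candidate when its DTW distance is within `threshold`
    of the calibrated standard distance (absolute-difference reading)."""
    return abs(target_distance - standard_distance) <= threshold

# A candidate close to the standard distance is accepted; a distant one is not.
print(accept_wake_word(1.6, 1.488, 0.3))   # True
print(accept_wake_word(3.2, 1.488, 0.3))   # False
```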
Apparently, the embodiments described above are only some embodiments of the present invention, not all of them.
The present invention is described in detail below with reference to a specific embodiment:
Fig. 3 is a flowchart of a method for recognizing a wake-up word according to a specific embodiment of the present invention. As shown in Fig. 3:
S301: Acquire a speech signal and extract multi-frame acoustic feature vectors;
The speech signal is collected by a speech acquisition device on the apparatus, and acoustic features are extracted from the speech signal;
S302: Classify the multi-frame acoustic feature vectors with a deep neural network to obtain multi-frame phoneme posterior feature vectors;
A trained model is used to classify the multi-frame acoustic feature frames. The trained model is a Deep Neural Network-Hidden Markov Model (DNN-HMM) architecture whose modeling unit is the phoneme.
The input to the deep neural network is one frame of acoustic features, and the output is a phoneme posterior feature vector. Given an input acoustic feature vector o, the network outputs the posterior probability distribution of o over k predefined classes {C_1, C_2, ..., C_k}:
PG_o = (p(C_1|o), p(C_2|o), ..., p(C_k|o))
where p(C_i|o) is the posterior probability of feature vector o belonging to the i-th class. A class here may be defined as any kind of speech unit, such as a phoneme; phoneme-level posterior features are used in this patent.
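The posterior vector PG_o is typically produced by a softmax output layer over the k classes; the sketch below is a generic softmax, with the logit values invented for illustration:

```python
import math

def softmax(logits):
    """Map raw DNN output scores to a posterior distribution p(C_i | o)."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One frame's hypothetical logits over k = 4 phoneme classes.
pg_o = softmax([2.0, 0.5, 0.1, -1.0])
print(round(sum(pg_o), 6))  # 1.0 -- a valid posterior feature vector
```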
With the trained DNN-HMM model, the DNN part produces phoneme posterior features frame by frame.
S303: Construct the HCLG decoding graph, perform time-sequential decoding, and obtain the optimal path;
In the DNN-HMM model, the acoustic feature sequence is decoded over the HCLG decoding graph, and the best several paths found in the graph yield a list of recognition results (an N-best list). The decoding graph is built from a large resource graph composed of the language model, the lexicon, context-dependent phonemes, and HMMs. The token-passing algorithm proceeds frame by frame; when the last frame has been processed, token passing ends, the tokens in the terminal states are examined, the best one or more tokens are selected, and the paths they encode are read out or traced back, giving the recognition results. Each such path accumulates the likelihoods of both the acoustic model and the language model; suppose the cumulative likelihood of the frames decoding the wake-up word on the highest-likelihood path is P_h.
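To make the token-passing idea concrete, the toy sketch below propagates log-likelihood tokens through a small left-to-right state graph and backtraces the best path. The graph, transition structure, and emission scores are invented for illustration and are far simpler than a real HCLG graph, which composes the HMM, context, lexicon, and language-model transducers:

```python
NEG_INF = float("-inf")
n_states = 3
transitions = {0: [0, 1], 1: [1, 2], 2: [2]}  # allowed left-to-right moves
emission = [                                   # log p(frame | state), one row per frame
    [-0.1, -2.0, -3.0],
    [-2.0, -0.2, -2.0],
    [-3.0, -2.0, -0.1],
]

tokens = [0.0] + [NEG_INF] * (n_states - 1)    # a single start token in state 0
backpointers = []
for frame_scores in emission:                  # token passing, frame by frame
    new_tokens = [NEG_INF] * n_states
    choice = [None] * n_states
    for s, score in enumerate(tokens):
        if score == NEG_INF:
            continue
        for nxt in transitions[s]:
            cand = score + frame_scores[nxt]
            if cand > new_tokens[nxt]:
                new_tokens[nxt], choice[nxt] = cand, s
    backpointers.append(choice)
    tokens = new_tokens

# Take the best terminal token and trace back its path.
state = max(range(n_states), key=lambda s: tokens[s])
path = [state]
for choice in reversed(backpointers):
    state = choice[state]
    path.append(state)
path.reverse()
print(path[1:])  # best state sequence over the 3 frames: [0, 1, 2]
```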
S304: Depending on whether the wake-up word is decoded on the optimal path, perform phoneme posterior feature matching and compute the negative log inner product;
Whether phoneme posterior probability matching is performed is decided from the result of the HMM time-sequential decoding: once the wake-up word is decoded in step S303, posterior matching is carried out on the frames corresponding to the wake-up word on the path.
The phoneme posterior probabilities are the phoneme classification probabilities output by the DNN. For the temporal matching, a dynamic time warping algorithm computes the distance between the phoneme posterior probability sequence corresponding to the wake-up word and the standard template, using the inner product metric to measure the distance between the two sequences.
The sequence matching of phoneme feature vectors uses the following dynamic time warping algorithm. Given the feature vector sequences of two continuous speech segments, u_x = {x_1, x_2, ..., x_n} and u_y = {y_1, y_2, ..., y_m}, where n and m are the frame counts of the two sequences, a distance matrix D is built by defining a distance between the feature vectors of speech frames. A possible correspondence between u_x and u_y is denoted φ(k) = (i_k, j_k), k = 1, 2, ..., T, where T is the length of the warping path and k indexes its steps; at step k, the i_k-th frame vector of the u_x sequence corresponds to the j_k-th frame vector of the u_y sequence.
An optimal correspondence φ′ is found in the matrix D so as to minimize the cumulative distortion Dist_φ(u_x, u_y) = Σ_{k=1}^{T} D(i_k, j_k).
In this embodiment, a speech frame is represented by its phoneme posterior features, and the negative log inner product metric is used, i.e., D(i, j) = -lg(x_i · y_j).
The distance between the sequence to be matched and the template sequence is finally denoted P_d, taken as the minimized cumulative distortion Dist_φ′(u_x, u_y).
S305: Match the positive samples of the wake-up word test set against the wake-up word's standard template to obtain a distance threshold;
The distance threshold (i.e., the standard distance) is obtained by DTW distance matching between the positive samples of the wake-up word test set and the standard template of the wake-up word.
Existing techniques may be used to compute the DTW distance.
S306: Compare the difference between the distance from the sequence to be matched to the template sequence and the distance threshold, and obtain the wake-up result;
The distance between the current sequence to be matched and the standard template sequence is compared with the threshold; if the computed difference falls below a manually set threshold, the current candidate is judged to be the recognized wake-up word and the wake-up result is output.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing over the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present invention.
This embodiment also provides an apparatus for recognizing a wake-up word. Fig. 4 is a structural block diagram of the apparatus for recognizing a wake-up word according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes:
an extraction module 402, configured to perform feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors;
a processing module 404, configured to process the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result;
a decoding module 406, configured to decode the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result;
a determining module 408, configured to determine, according to the target processing result and the target decoding result, the recognition result of the wake-up word in the target speech signal.
In an optional embodiment, the above apparatus is further configured to input the multi-frame acoustic feature vectors into a deep neural network and classify each frame of acoustic feature vector through the deep neural network to obtain the phoneme posterior feature vector corresponding to each frame of acoustic feature vector, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames of acoustic feature vectors.
In an optional embodiment, the above apparatus is further configured to decode the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path, and to determine the target path as the target decoding result.
In an optional embodiment, the above apparatus is further configured to determine the target path among the multiple paths of the decoding graph by a token-passing algorithm.
In an optional embodiment, the above apparatus is further configured to, when the target path contains the wake-up word to be recognized, determine, among the phoneme posterior feature vectors corresponding to the frames of acoustic feature vectors, the target phoneme posterior feature vectors corresponding to the wake-up word to be recognized, where the target processing result includes the phoneme posterior feature vectors corresponding to the frames of acoustic feature vectors and the target decoding result includes the target path, and to recognize the wake-up word to be recognized through the target phoneme posterior feature vectors.
In an optional embodiment, the above apparatus is further configured to determine the target distance between the target phoneme posterior feature vectors and a preset standard template, and to recognize the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In an optional embodiment, the above apparatus is further configured to, when the difference between the target distance and the preset standard distance is less than or equal to a preset threshold, determine the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
It should be noted that the above modules may be implemented by software or hardware. For the latter, this may be achieved, without limitation, in the following ways: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, where the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
In an exemplary embodiment, the above computer-readable storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Embodiments of the present invention also provide an electronic device including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any one of the above method embodiments.
In an exemplary embodiment, the above electronic device may further include a transmission device and an input/output device, where the transmission device and the input/output device are each connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network of multiple computing devices; they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be executed in an order different from that given here, or they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211145889.3A CN115457938A (en) | 2022-09-20 | 2022-09-20 | Method, device, storage medium and electronic device for identifying wake-up words |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457938A true CN115457938A (en) | 2022-12-09 |
Family
ID=84305295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211145889.3A Pending CN115457938A (en) | 2022-09-20 | 2022-09-20 | Method, device, storage medium and electronic device for identifying wake-up words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457938A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116168703A (en) * | 2023-04-24 | 2023-05-26 | 北京探境科技有限公司 | Voice recognition method, device, system, computer equipment and storage medium |
CN116168687A (en) * | 2023-04-24 | 2023-05-26 | 北京探境科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767231A (en) * | 2019-09-19 | 2020-02-07 | 平安科技(深圳)有限公司 | A wake-up word recognition method and device for voice-controlled equipment based on time-delay neural network |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Awakening word detection method, device, equipment and medium based on artificial intelligence |
CN110890093A (en) * | 2019-11-22 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Intelligent device awakening method and device based on artificial intelligence |
CN111883121A (en) * | 2020-07-20 | 2020-11-03 | 北京声智科技有限公司 | Awakening method and device and electronic equipment |
CN112652306A (en) * | 2020-12-29 | 2021-04-13 | 珠海市杰理科技股份有限公司 | Voice wake-up method and device, computer equipment and storage medium |
CN112669818A (en) * | 2020-12-08 | 2021-04-16 | 北京地平线机器人技术研发有限公司 | Voice wake-up method and device, readable storage medium and electronic equipment |
CN113920997A (en) * | 2021-10-22 | 2022-01-11 | 盛景智能科技(嘉兴)有限公司 | Voice awakening method and device, electronic equipment and operating machine |
CN114171009A (en) * | 2021-12-15 | 2022-03-11 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium for target equipment |
CN114242065A (en) * | 2021-12-31 | 2022-03-25 | 科大讯飞股份有限公司 | Voice wake-up method and device, and voice wake-up module training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||