CN111583939A - Method and device for specific target wake-up by voice recognition

Info

Publication number: CN111583939A
Application number: CN201910124945.7A
Authority: CN
Inventors: 李政; 吴国扬; 陈心章
Original assignee: Foxlink Electronics Dongguan Co Ltd; Cheng Uei Precision Industry Co Ltd
Current assignee: Foxlink Electronics Dongguan Co Ltd; Cheng Uei Precision Industry Co Ltd
Priority date: 2019-02-19
Filing date: 2019-02-19
Publication date: 2020-08-25

Abstract

The invention discloses a method and a device of speech recognition for specific target wake-up. The method comprises the following steps: receiving a voice message of a specific target and extracting speech features from it; using the speech features of the specific target as input data of a discriminatively trained hidden vector state (HVS) model and training it to obtain a specific target acoustic model, which is then stored; receiving a voice message of a target to be tested and extracting speech features from it; using the speech features of the target to be tested as input data of the discriminatively trained hidden vector state model and training it to obtain an acoustic model of the target to be tested; and comparing the acoustic model of the target to be tested with the specific target acoustic model and, if the two are related, performing language decoding on the speech features of the target to be tested using a language model and judging whether to wake up according to the language decoding result. Because a discriminatively trained HVS model is used as the acoustic model, the target can be judged accurately and quickly, thereby achieving the wake-up function.

Description

Method and device for specific target wake-up by speech recognition

Technical Field

The present invention relates to the field of speech recognition, and in particular to a method and device for speech recognition.

Background

In recent years, smart speakers have gradually changed the way people live. As voice assistants, smart speakers can help users with everyday tasks such as calling a car, shopping, setting reminders, and recording information. Although smart speakers make life more convenient, they still carry many security risks. A smart speaker sometimes cannot effectively determine whether the person speaking is the originally configured user, so goods may be ordered on a credit card by someone else. Therefore, to prevent misuse by ill-intentioned persons, many smart speakers currently on the market use voice recognition as a protective measure.

A typical smart speaker is woken by voice before it performs subsequent tasks. Voice wake-up usually means automatically picking out, from a stretch of continuous speech, voice commands (wake-up words) that the user registered in advance. Traditionally this used the Hidden Markov Model (HMM) technique, which compares feature vectors of individual phonemes and syllables to find the most probable word. Later, HMMs were combined with the Gaussian Mixture Model (GMM) to form the classic GMM-HMM model. Existing GMM-HMM models are often trained by maximum likelihood. Under certain conditions, however, this training method tends to give a competing answer a higher probability than the correct answer, which lowers the accuracy, so there is still room for improvement.

Summary of the Invention

The purpose of the present invention is to address the defects and deficiencies of the prior art described above by providing a method of speech recognition for specific target wake-up. The method combines the wake-up word of a specific target with a discriminatively trained hidden vector state model (HVS model) to identify and monitor the identity of the specific target, thereby achieving voice wake-up for the specific target.

To achieve the above purpose, one aspect of the embodiments of the present invention provides a method of speech recognition for specific target wake-up, comprising the following steps:

S1: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a speech feature of the specific target;

S2: using the speech feature of the specific target as input data of a discriminatively trained hidden vector state model (HVS model) and training it to obtain a specific target acoustic model, and storing the specific target acoustic model;

S3: receiving a voice message of a target to be tested, preprocessing the voice message of the target to be tested, and extracting a speech feature of the target to be tested;

S4: using the speech feature of the target to be tested as input data of the discriminatively trained hidden vector state model and training it to obtain an acoustic model of the target to be tested;

S5: comparing the acoustic model of the target to be tested with the acoustic model of the specific target to determine whether they are related and, if they are, performing language decoding on the speech feature of the target to be tested using at least one language model, and judging whether to wake up according to the language decoding result.

Specifically, the voice message of the specific target and the voice message of the target to be tested each include at least one wake-up word.

Specifically, the preprocessing includes performing noise suppression and echo cancellation on the voice message.

Specifically, the speech feature is obtained as Mel-frequency cepstral coefficients (MFCC).

Specifically, the discriminative training uses the maximum mutual information (MMI) method.

Specifically, the language model includes a vocabulary model, a grammar model, or a combination thereof.

Specifically, judging whether speech recognition wake-up is achieved according to the language decoding result comprises: performing language decoding on the speech feature of the target to be tested; determining whether the voice message of the target to be tested contains the wake-up word; and, if it contains the wake-up word, activating speech recognition wake-up, or, if it does not, leaving speech recognition wake-up inactive.

Another aspect of the embodiments of the present invention provides a device of speech recognition for specific target wake-up, comprising:

an acquisition module, including a plurality of microphone arrays, for receiving voice messages of a specific target and of a target to be tested, wherein the voice messages include a wake-up word;

an extraction module, connected to the acquisition module, for extracting MFCC speech features from the voice messages of the specific target and the target to be tested;

a training module, connected to the extraction module, for using the MFCC speech features from the voice messages of the specific target and the target to be tested as input data of a hidden vector state model trained by the maximum mutual information method, and for obtaining the trained acoustic model of the specific target and the trained acoustic model of the target to be tested;

a storage module, connected to the training module, for saving the trained acoustic model of the specific target;

a decoding module, connected to the extraction module, for performing language decoding on the voice message of the target to be tested; and

a processor module, connected to the training module, the storage module, and the decoding module, for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be tested, for judging from the comparison result whether to activate the decoding module to perform language decoding on the voice message of the target to be tested, and for confirming from the decoded voice message of the target to be tested whether it contains the wake-up word so as to wake up the device.

Specifically, the device further includes a registration module connected to the acquisition module and the storage module. The registration module is used to initiate saving the acoustic model of the specific target to the storage module.

Specifically, the device further includes a wireless communication module used to establish an external communication connection.

Compared with the prior art, the method and device of speech recognition for specific target wake-up of the present invention use a discriminatively trained hidden vector state model as the acoustic model. Besides maximizing the probability of the correct answer, discriminative training also lowers the probability of competitors, which improves the discrimination between the correct answer and its competitors. The invention can therefore judge quickly and accurately whether the target to be tested is the specific target, and thereby achieve the wake-up function.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a method of speech recognition for specific target wake-up according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a device of speech recognition for specific target wake-up according to an embodiment of the present invention.

The reference numerals in the figures are as follows:

100        Speech recognition device      11    Acquisition module

12         Extraction module              13    Training module

14         Storage module                 15    Decoding module

16         Processor module               17    Registration module

18         Wireless communication module

S101-S105  Process steps

Detailed Description

To describe in detail the technical content, structural features, objectives, and effects of the present invention, embodiments are given below and explained with reference to the drawings.

Please refer to FIG. 1, a schematic flowchart of a method of speech recognition for specific target wake-up disclosed in an embodiment of the present invention. The method includes the following steps:

Step S101: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a speech feature of the specific target;

Specifically, the specific target in this step is a registered user who satisfies the wake-up condition in speech recognition, and the voice message is based on a text prepared in advance that contains a preset wake-up word. The specific target first reads the text aloud, and an acquisition module 11 of a speech recognition device 100 according to an embodiment of the present invention collects the voice message of the specific target.

Specifically, the voice message collected in this step is an analog speech signal, which must be converted into a digital speech signal before subsequent speech recognition processing. In addition, the voice message may contain ambient noise, so it must be preprocessed to filter out the unwanted noise and obtain a valid speech signal. The preprocessing includes noise suppression and echo cancellation on the digital speech signal, for which existing noise reduction techniques may be used.

Specifically, the speech feature of the specific target must be extracted from the preprocessed speech signal. In this embodiment, Mel-frequency cepstral coefficients (MFCC) are used to extract the speech feature of the specific target: the preprocessed speech signal is cut into multiple frames (frame blocking), the parts of the speech signal that need emphasis are pre-emphasized (pre-emphasis), windowing is applied, and so on, yielding a clearer and more distinct speech feature.
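The three front-end operations named above (pre-emphasis, frame blocking, windowing) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the pre-emphasis coefficient 0.97, the Hamming window, and the 400-sample frame with a 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate) are all assumptions chosen because they are common defaults. The sketch applies pre-emphasis to the whole signal before framing, which is the usual ordering, and it stops before the FFT, mel filterbank, and DCT stages that complete an MFCC computation.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed default)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocks(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window coefficients for one frame (window type is an assumption)."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def front_end(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, then frame blocking, then windowing of every frame."""
    window = hamming(frame_len)
    emphasized = pre_emphasize(signal, alpha)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_blocks(emphasized, frame_len, hop)]
```

Calling `front_end(samples)` on a list of float samples returns a list of windowed frames, each `frame_len` samples long, ready for a spectral analysis stage.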

Step S102: using the speech feature of the specific target as input data of a discriminatively trained hidden vector state model (HVS model) and training it to obtain a specific target acoustic model, and storing the specific target acoustic model;

Specifically, in this step the speech feature of the specific target is used as input data to train the acoustic model. In this embodiment, a hidden vector state model is used and is trained discriminatively. Discriminative training does not aim to maximize the likelihood of the acoustic training corpus; instead it aims to minimize classification (or recognition) errors, thereby improving the recognition rate.

The discriminative training uses maximum mutual information (MMI) as its training criterion, which raises the probability of the correct answer, effectively lowers the probability of competitors, and increases the discrimination between the correct answer and its competitors.
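The patent does not state the MMI criterion as a formula. As a hedged illustration, the standard textbook form is given below next to the maximum likelihood objective it replaces. All symbols are assumptions introduced here: $O_r$ is the $r$-th training utterance, $w_r$ its correct transcription, $M_w$ the model for hypothesis $w$, $P(w)$ a prior over hypotheses, and $\lambda$ the acoustic model parameters.

```latex
% Maximum likelihood: only the correct hypothesis appears.
\mathcal{F}_{\mathrm{ML}}(\lambda) = \sum_{r=1}^{R} \log p_{\lambda}\!\left(O_r \mid M_{w_r}\right)

% Maximum mutual information: the correct hypothesis is normalized
% by the sum over all competing hypotheses w.
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log
  \frac{p_{\lambda}\!\left(O_r \mid M_{w_r}\right) P(w_r)}
       {\sum_{w} p_{\lambda}\!\left(O_r \mid M_{w}\right) P(w)}
```

Raising $\mathcal{F}_{\mathrm{MMI}}$ requires increasing the numerator (the correct answer) relative to the denominator (the correct answer plus every competitor), which is the behavior the paragraph above describes.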

Specifically, storing the specific target acoustic model in this step means storing it in a storage module 14 of the speech recognition device 100 according to the embodiment of the present invention.

Step S103: receiving a voice message of a target to be tested, preprocessing the voice message of the target to be tested, and extracting a speech feature of the target to be tested;

Specifically, the target to be tested in this step is a user whose speech is to be recognized and compared. The target to be tested utters a voice message, which is collected by the acquisition module 11 of the speech recognition device 100 according to the embodiment of the present invention.

Specifically, in this step the voice message of the target to be tested is preprocessed and the speech feature of the target to be tested is extracted, using the same procedure as described above for preprocessing the voice message of the specific target and extracting its speech feature.

Step S104: using the speech feature of the target to be tested as input data of the discriminatively trained hidden vector state model and training it to obtain an acoustic model of the target to be tested;

Specifically, in this step the speech feature of the target to be tested is used as input data to train the acoustic model. In this embodiment, a hidden vector state model is used and is trained discriminatively, with maximum mutual information (MMI) as the training criterion.

Step S105: comparing the acoustic model of the target to be tested with the acoustic model of the specific target to determine whether they are related and, if they are, performing language decoding on the speech feature of the target to be tested using at least one language model, and judging whether to wake up according to the language decoding result.

Specifically, in this step language decoding is performed when the acoustic model of the target to be tested matches the acoustic model of the specific target; if it does not match, no action is taken. The language decoding uses the speech feature of the target to be tested as input data for training of the language model. In this embodiment, the language model includes a vocabulary model and a grammar model.

When the acoustic model of the target to be tested is judged to be the acoustic model of the specific target, the target to be tested is the specific target, so language decoding is performed to confirm whether the voice message of the target to be tested contains the wake-up word. The speech feature of the target to be tested is put through training of the vocabulary model and the grammar model and parsed to obtain the content of the voice message. It is then determined whether that content contains the wake-up word: if it does, speech recognition wake-up is activated; if it does not, speech recognition wake-up is not activated.
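The two-stage decision in step S105 can be sketched as a small gate function. This is an illustration under stated assumptions, not the patent's implementation: the patent does not specify how two acoustic models are compared or what the decoder returns, so `models_match` and `decode` are hypothetical callables supplied by the caller, acoustic models are represented as plain sequences of floats, and the wake-word check is a simple substring test on the decoded text.

```python
from typing import Callable, Sequence

Features = Sequence[Sequence[float]]

def should_wake(
    test_model: Sequence[float],
    enrolled_model: Sequence[float],
    test_features: Features,
    models_match: Callable[[Sequence[float], Sequence[float]], bool],
    decode: Callable[[Features], str],
    wake_word: str,
) -> bool:
    """Return True only if the speaker matches AND the decoded text contains the wake word."""
    # Stage 1: compare the acoustic model of the target to be tested
    # with the stored acoustic model of the specific target.
    if not models_match(test_model, enrolled_model):
        return False  # no match: take no action, the decoder is never run
    # Stage 2: language decoding of the speech features, then the wake-word check.
    decoded_text = decode(test_features)
    return wake_word in decoded_text
```

Because the decoder runs only after the speaker check passes, speech from a non-registered speaker never reaches the language decoding stage, which mirrors the "no action is taken" branch described above.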

Please refer to FIG. 2, which shows a device of speech recognition for specific target wake-up according to an embodiment of the present invention. A speech recognition device 100 includes an acquisition module 11, an extraction module 12, a training module 13, a storage module 14, a decoding module 15, a processor module 16, a registration module 17, and a wireless communication module 18.

The acquisition module 11 is connected to the extraction module 12 and the registration module 17. The acquisition module 11 has a plurality of microphone arrays for receiving the voice messages of the specific target and of the target to be tested. The collected voice message is an analog speech signal that must be converted into a digital speech signal. Noise suppression and echo cancellation are performed on the digital speech signal, and the processed digital voice message is then transmitted to the extraction module 12.

The specific target is defined as the object for which the speech recognition of the present invention performs specific target wake-up, and the target to be tested is defined as the object on which the speech recognition device 100 performs speech recognition.

The voice message of the specific target contains a preset wake-up word.

The extraction module 12 is connected to the acquisition module 11, the training module 13, and the decoding module 15. The extraction module 12 receives the voice message processed by the acquisition module 11, extracts the speech features of the specific target and of the target to be tested from it, and transmits them either to the training module 13 for acoustic model training or to the decoding module 15 for decoding.

The speech features of the specific target and of the target to be tested are extracted from their voice messages as Mel-frequency cepstral coefficients (MFCC).

The training module 13 is connected to the extraction module 12, the storage module 14, and the processor module 16. The training module 13 receives the speech features of the specific target and of the target to be tested extracted by the extraction module 12, uses them as input data of the hidden vector state model trained by the maximum mutual information method, and obtains the trained acoustic model. The next step depends on whose model it is: the acoustic model of the specific target is transmitted to the storage module 14, and the acoustic model of the target to be tested is transmitted to the processor module 16.

The storage module 14 is connected to the training module 13, the processor module 16, and the registration module 17. The storage module 14 saves the acoustic model of the specific target trained by the training module 13. In this embodiment, when the specific target operates the registration module 17, the acoustic model of the specific target trained by the training module 13 is transmitted to the storage module 14 and saved. In addition, when the processor module 16 compares the acoustic models of the target to be tested and the specific target, the storage module 14 transmits the saved acoustic model of the specific target to the processor module 16.

The decoding module 15 is connected to the extraction module 12 and the processor module 16. The decoding module 15 performs language decoding on the voice message of the target to be tested. More specifically, the extraction module 12 supplies the speech feature of the target to be tested as input data to the vocabulary model and the grammar model for training, and the result is transmitted to the processor module 16.

The processor module 16 is connected to the training module 13, the storage module 14, the decoding module 15, and the wireless communication module 18. The processor module 16 compares the acoustic model of the specific target with the acoustic model of the target to be tested, and judges from the comparison result whether to activate the decoding module 15 to perform language decoding. More specifically, when the training module 13 transmits the acoustic model of the target to be tested, the processor module 16 at the same time retrieves the acoustic model of the specific target from the storage module 14 and compares the two acoustic models.

When the acoustic model of the specific target is confirmed to be related to the acoustic model of the target to be tested, the target to be tested is the specific target. The voice message of the target to be tested must then be decoded to determine whether it contains the wake-up word, so the processor module 16 activates the decoding module 15, which performs the language decoding.

The decoding module 15 obtains the speech feature of the target to be tested from the extraction module 12 and returns the language decoding result to the processor module 16. The processor module 16 then determines, based on the acoustic model of the target to be tested and the language decoding result, whether the voice message of the target to be tested contains the wake-up word.

If the processor module 16 finds that the voice message of the target to be tested contains the wake-up word, it wakes up the speech recognition device 100; otherwise it does not.

The registration module 17 is connected to the acquisition module 11 and the storage module 14. The registration module 17 lets the specific target register with the speech recognition device 100. The registration module 17 includes an activation element and a display element. When the specific target touches the activation element, the storage module 14 is activated at the same time, indicating that the acoustic model obtained by the training module 13 from the voice message collected this time by the acquisition module 11 is to be saved to the storage module 14. In addition, when the specific target touches the activation element, the display element is turned on so that the specific target can confirm whether the device is currently in the registration stage.

In this embodiment, the activation element is a button and the display element is a light-emitting diode.
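The routing between registration and normal use can be sketched as a small state holder. This is a hypothetical illustration only: the class and method names are invented here, acoustic models are represented as lists of floats, and the rule that the LED turns off and registration ends once one model has been saved is an assumption, since the patent does not say when the registration stage ends.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RegistrationState:
    """Hypothetical model of the button (activation element) and LED (display element)."""
    led_on: bool = False                               # LED lit means registration stage
    stored_model: Optional[List[float]] = None         # contents of the storage module
    model_for_processor: Optional[List[float]] = None  # model handed to the processor module

    def press_button(self) -> None:
        """Touching the activation element starts the registration stage and lights the LED."""
        self.led_on = True

    def route_trained_model(self, model: List[float]) -> str:
        """Send a freshly trained acoustic model to storage or to the processor."""
        if self.led_on:
            # Registration stage: this model belongs to the specific target.
            self.stored_model = list(model)
            self.led_on = False  # assumption: registration ends after one saved model
            return "storage"
        # Normal use: this model belongs to a target to be tested.
        self.model_for_processor = list(model)
        return "processor"
```

A typical sequence would be `press_button()`, then `route_trained_model(...)` for the enrollment utterance, after which every later model is routed to the processor for comparison with the stored one.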

The wireless communication module 18 is connected to the processor module 16. The wireless communication module 18 establishes an external communication connection after the processor module 16 confirms that the speech recognition device 100 has been woken up successfully.

In this embodiment, the wireless communication module 18 includes a Wi-Fi module or a Bluetooth module.

In summary, the method and device of speech recognition for specific target wake-up of the present invention use a discriminatively trained hidden vector state model as the acoustic model. Besides maximizing the probability of the correct answer, discriminative training with the maximum mutual information method also lowers the probability of competitors, which improves the discrimination between the correct answer and its competitors. The invention can therefore judge quickly and accurately whether the target to be tested is the specific target, and thereby achieve the wake-up function.

Claims

1. A method of speech recognition for specific target wake-up, characterized by comprising the following steps:

S1: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a speech feature of the specific target;

S2: using the speech feature of the specific target as input data of a discriminatively trained hidden vector state model (HVS model) and training it to obtain a specific target acoustic model, and storing the specific target acoustic model;

S3: receiving a voice message of a target to be tested, preprocessing the voice message of the target to be tested, and extracting a speech feature of the target to be tested;

S4: using the speech feature of the target to be tested as input data of the discriminatively trained hidden vector state model and training it to obtain an acoustic model of the target to be tested;

S5: comparing the acoustic model of the target to be tested with the acoustic model of the specific target to determine whether they are related and, if they are, performing language decoding on the speech feature of the target to be tested using at least one language model, and judging whether to wake up according to the language decoding result.

2. The method according to claim 1, wherein the voice message of the specific target and the voice message of the target to be tested include at least one wake-up word.

3. The method according to claim 1, wherein the preprocessing comprises: performing noise suppression processing and echo cancellation processing on the voice message.

4. The method according to claim 1, wherein the speech features are obtained as Mel-frequency cepstral coefficients (MFCC).

5. The method according to claim 1, wherein the discriminative training adopts the maximum mutual information method (MMI) for training.

6. The method according to claim 1, wherein the language model comprises a vocabulary model or a grammar model or a combination thereof.

7. The method according to claim 2, wherein the step of judging whether the wake-up of speech recognition is achieved according to the language decoding result, comprises:

performing language decoding on the speech feature of the target to be tested;

determining whether the voice message of the target to be tested contains the wake-up word;

if the wake-up word is included, activating the speech recognition wake-up, and if the wake-up word is not included, not activating the speech recognition wake-up.

8. A voice recognition device for waking up a specific target, wherein the device comprises:

an acquisition module including a plurality of microphone arrays for receiving voice messages of a specific target and a target to be tested, wherein the voice messages include a wake-up word;

an extraction module, connected to the acquisition module, for extracting the MFCC voice features in the voice messages of the specific target and the target to be tested;

a training module, connected to the extraction module, for using the MFCC voice features in the voice messages of the specific target and the target to be tested as the input data of the hidden vector state model trained by the maximum mutual information method, and for obtaining the trained acoustic model of the specific target and the trained acoustic model of the target to be tested;

a storage module, connected to the training module, for saving the acoustic model of the specific target that has been trained;

a decoding module, connected to the extraction module, for performing language decoding on the voice message of the target to be tested; and

a processor module, connected to the training module, the storage module and the decoding module, for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be tested, for determining according to the comparison result whether to activate the decoding module to perform language decoding of the voice message of the target to be tested, and for confirming whether the decoded voice message of the target to be tested contains a wake-up word so as to wake up the device.

9. The device according to claim 8, further comprising a registration module, wherein the registration module is connected to the acquisition module and the storage module, and the registration module is used to initiate saving the acoustic model of the specific target to the storage module.

10. The device according to claim 8, further comprising a wireless communication module, wherein the wireless communication module is used for external communication connection.