CN111583939A - Method and device for specific target wake-up by voice recognition - Google Patents

Method and device for specific target wake-up by voice recognition

Info

Publication number
CN111583939A
CN111583939A CN201910124945.7A CN201910124945A CN111583939A CN 111583939 A CN111583939 A CN 111583939A CN 201910124945 A CN201910124945 A CN 201910124945A CN 111583939 A CN111583939 A CN 111583939A
Authority
CN
China
Prior art keywords
target
module
specific target
voice
tested
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910124945.7A
Other languages
Chinese (zh)
Inventor
李政
吴国扬
陈心章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foxlink Electronics Dongguan Co Ltd
Cheng Uei Precision Industry Co Ltd
Original Assignee
Foxlink Electronics Dongguan Co Ltd
Cheng Uei Precision Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foxlink Electronics Dongguan Co Ltd, Cheng Uei Precision Industry Co Ltd
Priority to CN201910124945.7A
Publication of CN111583939A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a device for specific target wake-up by speech recognition. The method comprises the following steps: receiving a voice message of a specific target and extracting a speech feature from it; using the speech feature of the specific target as input data of a discriminatively trained hidden vector state (HVS) model, performing training to obtain a specific target acoustic model, and storing that model; receiving a voice message of a target to be tested and extracting a speech feature from it; using the speech feature of the target to be tested as input data of the discriminatively trained hidden vector state model and performing training to obtain an acoustic model of the target to be tested; and comparing the acoustic model of the target to be tested with the acoustic model of the specific target and, if the two are correlated, performing language decoding on the speech feature of the target to be tested using a language model and determining whether to wake up according to the language decoding result. Because the invention uses a discriminatively trained HVS model as the acoustic model, it can identify the target accurately and quickly and thereby perform the wake-up function.

Description

Method and device for specific target wake-up by speech recognition

Technical field

The present invention relates to the field of speech recognition, and in particular to a speech recognition method and device.

Background

In recent years, smart speakers have gradually changed the way people live. Acting as voice assistants, they help users with everyday tasks such as hailing a ride, shopping, setting reminders, and recording information. Although smart speakers make life more convenient, they still carry many security risks: a smart speaker sometimes cannot reliably determine whether the person speaking is the originally configured user, which leaves open the possibility of goods being ordered with a credit card by someone else. To prevent such misuse, many smart speakers on the market now use speech recognition as a protective measure.

A typical smart speaker is woken by voice before it carries out subsequent tasks. Voice wake-up usually means automatically picking out, from a stretch of continuous speech, voice commands (wake-up words) that the user has registered in advance. Traditionally, Hidden Markov Model (HMM) techniques were used: the feature vectors of individual phonemes and syllables are compared to find the most probable word. Later, the HMM was combined with the Gaussian Mixture Model (GMM) to form the classic GMM-HMM model. Existing GMM-HMM models are usually trained by maximum likelihood (ML). Under some conditions, however, this training method lets the probability of a competing answer exceed that of the correct answer, which lowers accuracy, so there is still room for improvement.

Summary of the invention

The object of the present invention is to address the defects and shortcomings of the prior art described above by providing a method for specific target wake-up by speech recognition. The method combines the wake-up word of a specific target with a discriminatively trained Hidden Vector State Model (HVS Model) to identify and monitor the identity of the specific target, thereby achieving voice wake-up for that specific target.

To achieve the above object, one aspect of the embodiments of the present invention provides a method for specific target wake-up by speech recognition, comprising the following steps:

S1: Receive a voice message of a specific target, preprocess the voice message of the specific target, and extract a speech feature of the specific target;

S2: Use the speech feature of the specific target as input data of a discriminatively trained Hidden Vector State Model (HVS Model) and perform training to obtain a specific target acoustic model, and store the specific target acoustic model;

S3: Receive a voice message of a target to be tested, preprocess the voice message of the target to be tested, and extract a speech feature of the target to be tested;

S4: Use the speech feature of the target to be tested as input data of the discriminatively trained Hidden Vector State Model and perform training to obtain an acoustic model of the target to be tested;

S5: Compare the acoustic model of the target to be tested with the acoustic model of the specific target for correlation; if the two are correlated, perform language decoding on the speech feature of the target to be tested using at least one language model, and determine whether to wake up according to the language decoding result.

Specifically, the voice message of the specific target and the voice message of the target to be tested include at least one wake-up word.

Specifically, the preprocessing includes performing noise suppression and echo cancellation on the voice message.

Specifically, the speech feature is obtained using Mel-frequency cepstral coefficients (MFCC).

Specifically, the discriminative training uses the maximum mutual information (MMI) method.

Specifically, the language model includes a lexicon model, a grammar model, or a combination thereof.

Specifically, determining whether speech recognition wake-up is achieved according to the language decoding result comprises: performing language decoding on the speech feature of the target to be tested; determining whether the voice message of the target to be tested contains the wake-up word; and starting the speech recognition wake-up if the wake-up word is contained, and not starting it otherwise.

Another aspect of the embodiments of the present invention provides a device for specific target wake-up by speech recognition, comprising:

an acquisition module, comprising a plurality of microphone arrays, for receiving voice messages of a specific target and of a target to be tested, wherein the voice messages contain a wake-up word;

an extraction module, connected to the acquisition module, for extracting MFCC speech features from the voice messages of the specific target and the target to be tested;

a training module, connected to the extraction module, for using the MFCC speech features from the voice messages of the specific target and the target to be tested as input data of a Hidden Vector State Model trained by the maximum mutual information method, and for obtaining the trained acoustic model of the specific target and the trained acoustic model of the target to be tested;

a storage module, connected to the training module, for saving the trained acoustic model of the specific target;

a decoding module, connected to the extraction module, for performing language decoding on the voice message of the target to be tested; and

a processor module, connected to the training module, the storage module, and the decoding module, for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be tested, for determining from the comparison result whether to start the decoding module to perform language decoding of the voice message of the target to be tested, and for confirming from the language-decoded voice message of the target to be tested whether it contains the wake-up word, so as to wake up the device.

Specifically, the device further comprises a registration module connected to the acquisition module and the storage module; the registration module is used to start saving the acoustic model of the specific target to the storage module.

Specifically, the device further comprises a wireless communication module, which is used for external communication connections.

Compared with the prior art, the method and device of the present invention for specific target wake-up by speech recognition use a discriminatively trained Hidden Vector State Model as the acoustic model. Besides maximizing the probability of the correct answer, discriminative training also lowers the probability of competitors, which increases the discrimination between the correct answer and its competitors. The invention can therefore determine quickly and accurately whether the target to be tested is the specific target, and thereby perform the wake-up function.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for specific target wake-up by speech recognition according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a device for specific target wake-up by speech recognition according to an embodiment of the present invention.

The reference numerals in the figures are as follows:

100        Speech recognition device        11   Acquisition module
12         Extraction module                13   Training module
14         Storage module                   15   Decoding module
16         Processor module                 17   Registration module
18         Wireless communication module
S101~S105  Process steps

Detailed description

To explain in detail the technical content, structural features, objects achieved, and effects of the present invention, embodiments are described below with reference to the drawings.

Referring to FIG. 1, which is a schematic flowchart of a method for specific target wake-up by speech recognition disclosed in an embodiment of the present invention, the method includes the following steps:

Step S101: Receive a voice message of a specific target, preprocess the voice message of the specific target, and extract a speech feature of the specific target;

Specifically, the specific target in this step is the registered user who satisfies the wake-up condition in speech recognition, and the voice message is a text prepared in advance that contains a preset wake-up word. The specific target reads the text aloud, and an acquisition module 11 of a speech recognition device 100 according to an embodiment of the present invention collects the voice message of the specific target.

Specifically, the voice message collected in this step is an analog speech signal, which must be converted into a digital speech signal before subsequent speech recognition processing. The voice message may also contain ambient noise, so it must be preprocessed to filter out the unwanted noise and obtain a usable speech signal. The preprocessing includes noise suppression and echo cancellation on the digital speech signal, and may follow existing noise reduction techniques.

Specifically, the speech feature of the specific target must be extracted from the preprocessed speech signal. In this embodiment of the present invention, the speech feature is extracted using Mel-frequency cepstral coefficients (MFCC): the preprocessed speech signal is cut into multiple frames (frame blocking), pre-emphasis is applied to the parts of the speech signal that need to be boosted, and windowing is applied, among other operations, to obtain a clearer and better-defined speech feature.
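The three front-end operations named above can be illustrated with a minimal sketch. The source does not give parameter values, an operation order, or a window type, so everything concrete here is an assumption: the 0.97 pre-emphasis coefficient, the 400-sample frame with a 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate), the Hamming window, and applying pre-emphasis before framing. The function names are hypothetical. The later MFCC stages (FFT, Mel filter bank, logarithm, DCT) are not shown.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def frame_signal(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hamming(frame_len):
    """Hamming window coefficients (an assumed, commonly used window choice)."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def windowed_frames(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasize, frame, and window a digital speech signal."""
    emphasized = pre_emphasize(signal, alpha)
    window = hamming(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_signal(emphasized, frame_len, hop)]
```

Each returned frame would then go through the remaining MFCC stages to produce one feature vector per frame.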

Step S102: Use the speech feature of the specific target as input data of a discriminatively trained Hidden Vector State Model (HVS Model) and perform training to obtain a specific target acoustic model, and store the specific target acoustic model;

Specifically, in this step the speech feature of the specific target is used as input data to train the acoustic model. In this embodiment of the present invention, a Hidden Vector State Model is used and is trained discriminatively. Discriminative training does not aim to maximize the likelihood of the acoustic training corpus; it aims to minimize classification (or recognition) errors, which improves the recognition rate.

The discriminative training uses the maximum mutual information (MMI) criterion, which raises the probability of the correct answer, effectively lowers the probability of competitors, and increases the discrimination between the correct answer and its competitors.
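The difference between the two training criteria can be made concrete with a small numeric sketch. The source does not state a formula; the code below uses the standard textbook form of the MMI objective, the sum over training utterances of the log posterior of the correct class, and compares it with the maximum likelihood objective. The class labels, priors, and log-likelihood values are invented for illustration, and the function names are hypothetical.

```python
import math


def ml_objective(utterances):
    """Maximum likelihood: sum of log-likelihoods of the correct class only."""
    return sum(u["log_likelihoods"][u["correct"]] for u in utterances)


def mmi_objective(utterances, log_priors):
    """MMI: sum of log posteriors of the correct class.

    For each utterance this is
        log p(O | correct) + log P(correct) - log sum_c p(O | c) P(c),
    so raising a competitor's likelihood lowers the objective.
    """
    total = 0.0
    for u in utterances:
        joint = {c: ll + log_priors[c] for c, ll in u["log_likelihoods"].items()}
        peak = max(joint.values())
        # log-sum-exp for numerical stability
        log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
        total += joint[u["correct"]] - log_evidence
    return total


# Toy data (assumed values): same correct-class score, stronger competitor in case B.
priors = {"target": math.log(0.5), "other": math.log(0.5)}
case_a = [{"correct": "target", "log_likelihoods": {"target": -10.0, "other": -10.0}}]
case_b = [{"correct": "target", "log_likelihoods": {"target": -10.0, "other": -9.0}}]

print(ml_objective(case_a), ml_objective(case_b))
print(mmi_objective(case_a, priors), mmi_objective(case_b, priors))
```

In this toy data the maximum likelihood objective is identical for both cases, because it ignores the competitor, while the MMI objective is lower in case B, where the competitor scores higher. That penalty on competitors is the property the text attributes to discriminative training.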

Specifically, storing the specific target acoustic model in this step means storing it in a storage module 14 of the speech recognition device 100 according to the embodiment of the present invention.

Step S103: Receive a voice message of a target to be tested, preprocess the voice message of the target to be tested, and extract a speech feature of the target to be tested;

Specifically, the target to be tested in this step is the user whose speech is to be compared by speech recognition. The target to be tested utters a voice message, which is collected by the acquisition module 11 of the speech recognition device 100 according to the embodiment of the present invention.

Specifically, in this step the voice message of the target to be tested is preprocessed and the speech feature of the target to be tested is extracted. The processing is the same as the procedure described above for preprocessing the voice message of the specific target and extracting its speech feature.

Step S104: Use the speech feature of the target to be tested as input data of the discriminatively trained Hidden Vector State Model and perform training to obtain an acoustic model of the target to be tested;

Specifically, in this step the speech feature of the target to be tested is used as input data to train the acoustic model. In this embodiment of the present invention, a Hidden Vector State Model is used and is trained discriminatively, with the maximum mutual information (MMI) criterion.

Step S105: Compare the acoustic model of the target to be tested with the acoustic model of the specific target for correlation; if the two are correlated, perform language decoding on the speech feature of the target to be tested using at least one language model, and determine whether to wake up according to the language decoding result.

Specifically, in this step language decoding is performed when the acoustic model of the target to be tested matches the acoustic model of the specific target; if it does not match, no action is taken. The language decoding uses the speech feature of the target to be tested as input data for training the language model. In this embodiment of the present invention, the language model includes a lexicon model and a grammar model.

When the acoustic model of the target to be tested is judged to be the acoustic model of the specific target, the target to be tested is the specific target, so language decoding is performed to confirm whether the voice message of the target to be tested contains the wake-up word. The speech feature of the target to be tested is used to train the lexicon model and the grammar model, the content of the voice message is obtained by parsing, and it is then determined whether that content contains the wake-up word. If it does, the speech recognition wake-up is started; if it does not, the wake-up is not started.
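The two-stage decision in step S105 can be sketched as a small gate: language decoding runs only when the acoustic models are judged related, and wake-up requires the decoded text to contain the wake-up word. This is a sketch under assumptions, not the patent's implementation. The `decode` callable stands in for the lexicon and grammar models, whose internals the source does not describe, and the substring test for the wake-up word is an assumed matching rule.

```python
from typing import Callable, Sequence


def decide_wake(models_related: bool,
                features: Sequence[float],
                decode: Callable[[Sequence[float]], str],
                wake_word: str) -> bool:
    """Return True only if the speaker matched AND the decoded text has the wake-up word.

    `decode` is a hypothetical stand-in for language decoding; it is not
    called at all when the acoustic models are unrelated, mirroring the
    "no action is taken" branch of step S105.
    """
    if not models_related:
        return False
    decoded_text = decode(features)
    return wake_word in decoded_text
```

Passing the decoder as a callable keeps the costlier language decoding off the path taken by unrecognized speakers, which is the ordering the method describes.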

Referring to FIG. 2, which shows a device for specific target wake-up by speech recognition according to an embodiment of the present invention, a speech recognition device 100 includes an acquisition module 11, an extraction module 12, a training module 13, a storage module 14, a decoding module 15, a processor module 16, a registration module 17, and a wireless communication module 18.

The acquisition module 11 is connected to the extraction module 12 and the registration module 17. The acquisition module 11 has a plurality of microphone arrays for receiving the voice messages of the specific target and the target to be tested. The collected voice messages are analog speech signals and must be converted into digital speech signals; noise suppression and echo cancellation are applied to the digital speech signals, and the processed digital voice messages are then sent to the extraction module 12.

The specific target is defined as the subject for which specific target wake-up by speech recognition according to the present invention is performed; the target to be tested is defined as the subject on which the speech recognition device 100 performs speech recognition.

The voice message of the specific target contains a preset wake-up word.

The extraction module 12 is connected to the acquisition module 11, the training module 13, and the decoding module 15. The extraction module 12 receives the voice messages processed by the acquisition module 11, extracts the speech features of the specific target and the target to be tested, and sends them to the training module 13 for acoustic model training or to the decoding module 15 for decoding.

The speech features of the specific target and the target to be tested are extracted from their voice messages using Mel-frequency cepstral coefficients (MFCC).

The training module 13 is connected to the extraction module 12, the storage module 14, and the processor module 16. It receives the speech features of the specific target and the target to be tested extracted by the extraction module 12, uses them as input data of the Hidden Vector State Model trained by the maximum mutual information method, and obtains the trained acoustic model. What happens next depends on whose model it is: the acoustic model of a specific target is sent to the storage module 14, and the acoustic model of a target to be tested is sent to the processor module 16.

The storage module 14 is connected to the training module 13, the processor module 16, and the registration module 17. It saves the acoustic model of the specific target trained by the training module 13. In this embodiment of the present invention, when the specific target operates the registration module 17, the acoustic model of the specific target trained by the training module 13 is sent to the storage module 14 and saved. When the processor module 16 compares the acoustic models of the target to be tested and the specific target, the storage module 14 sends the saved acoustic model of the specific target to the processor module 16.

The decoding module 15 is connected to the extraction module 12 and the processor module 16. It performs language decoding on the voice message of the target to be tested. More specifically, the extraction module 12 supplies the speech feature of the target to be tested as input data for training with the lexicon model and the grammar model, and the result is sent to the processor module 16.

The processor module 16 is connected to the training module 13, the storage module 14, the decoding module 15, and the wireless communication module 18. It compares the acoustic model of the specific target with the acoustic model of the target to be tested and, based on the result, determines whether to start the decoding module 15 for language decoding. More specifically, when the training module 13 sends the acoustic model of the target to be tested, the processor module 16 at the same time retrieves the acoustic model of the specific target from the storage module 14 and compares the two acoustic models.
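The source does not say how the processor module decides that two acoustic models are related. Purely as an illustration, the sketch below assumes that each acoustic model can be summarized as a fixed-length parameter vector and that relatedness means cosine similarity at or above a threshold. The vector representation, the similarity measure, the 0.8 threshold, and the function names are all assumptions and are not taken from the patent.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("model vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def models_related(stored_model: Sequence[float],
                   tested_model: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Hypothetical comparison rule: related if similarity reaches the assumed threshold."""
    return cosine_similarity(stored_model, tested_model) >= threshold
```

In a real device the threshold would trade false accepts against false rejects and would need tuning on enrollment data; the value here is a placeholder.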

当确认特定目标的声学模型与待测目标的声学模型有关连,即代表待测目标为特定目标,因此进行待测目标的语音讯息语言解码判断其中是否包含唤醒词,故处理器模组16会启动解码模组15,并由解码模组15进行语言解码。When it is confirmed that the acoustic model of the specific target is related to the acoustic model of the target to be tested, it means that the target to be tested is a specific target. Therefore, the speech message language decoding of the target to be tested is performed to determine whether it contains a wake-up word. Therefore, the processor module 16 will The decoding module 15 is activated, and the decoding module 15 performs language decoding.

所述解码模组15从提取模组12中获取待测目标的语音特征,并将语言解码的运算结果回传给处理器模组16,处理器模组16会根据待测目标的声学模型以及语言解码后结果判断待测目标的语音讯息中是否包含唤醒词。The decoding module 15 obtains the speech features of the target to be tested from the extraction module 12, and returns the operation result of language decoding to the processor module 16, and the processor module 16 will determine the target according to the acoustic model of the target to be tested and After language decoding, it is determined whether the voice message of the target to be tested contains wake words.

当处理器模组16得到待测目标的语音讯息中包含唤醒词则执行语音识别装置100的唤醒,反之则不执行。When the processor module 16 obtains that the voice message of the target to be tested contains a wake-up word, the wake-up of the voice recognition device 100 is executed, otherwise, it is not executed.

所述注册模组17与采集模组11以及存储模组14连接。所述注册模组17用于提供特定目标进行语音识别装置100的注册,其中注册模组17包含一启动元件以及一显示元件,当特定目标碰触启动元件则同时启动存储模组14,表示采集模组11此次收集到的语音讯息经过训练模组13训练后的声学模型需要保存到存储模组14,另外,当特定目标碰触启动元件则显示元件启动提供特定目标确认目前是否为注册阶段。The registration module 17 is connected with the acquisition module 11 and the storage module 14 . The registration module 17 is used to provide a specific target for registration of the speech recognition device 100, wherein the registration module 17 includes an activation element and a display element. When the specific target touches the activation element, the storage module 14 is activated at the same time, indicating that the acquisition is performed. The acoustic model of the voice message collected by the module 11 this time after being trained by the training module 13 needs to be saved to the storage module 14. In addition, when a specific target touches the activation element, the display element is activated to provide a specific target to confirm whether it is currently in the registration stage. .

In the embodiment of the present invention, the activation element is a button and the display element is a light-emitting diode.
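The registration behaviour can be illustrated with a small state sketch. It assumes the button puts the device into a registration stage in which the next trained acoustic model is saved, and that the LED is lit for the duration of that stage. The class and method names are hypothetical, and ending registration after a single save is an assumption the source does not state.

```python
from typing import List, Optional


class RegistrationModule:
    """Hypothetical model of the button (activation element) and LED (display element)."""

    def __init__(self) -> None:
        self.led_on = False  # LED lit means "registration stage"
        self.stored_model: Optional[List[float]] = None  # stands in for storage module 14

    def press_button(self) -> None:
        # Touching the activation element enables storage and lights the LED.
        self.led_on = True

    def on_model_trained(self, model: List[float]) -> bool:
        """Save the model only during registration; return True if it was saved."""
        if not self.led_on:
            return False  # normal operation: the model is compared, not stored
        self.stored_model = list(model)
        self.led_on = False  # assumption: registration ends after one save
        return True
```

Outside the registration stage, a newly trained model is treated as belonging to a target to be tested and is never written to storage.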

The wireless communication module 18 is connected to the processor module 16. The wireless communication module 18 is used to establish an external communication connection after the processor module 16 confirms that the voice recognition device 100 has been woken up successfully.

In the embodiment of the present invention, the wireless communication module 18 includes a Wi-Fi module or a Bluetooth module.

In summary, the method and device for specific target wake-up by voice recognition of the present invention use a discriminatively trained hidden vector state model as the acoustic model. Discriminative training with the maximum mutual information method not only maximizes the probability of the correct answer but also lowers the probability of the competitors, which increases the ability to discriminate between the correct answer and the competitors. The method and device can therefore quickly and accurately determine whether the target to be tested is the specific target, and thereby achieve the wake-up function.
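For reference, the maximum mutual information criterion is commonly written in the speech recognition literature in the form below. The notation is an assumption for illustration and does not come from this application: $O_r$ is the $r$-th training utterance, $W_r$ its correct label sequence, $W'$ ranges over all competing hypotheses (including the correct one), and $\lambda$ denotes the model parameters.

```latex
F_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R}
    \log
    \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}
         {\sum_{W'} p_{\lambda}(O_r \mid W')\, P(W')}
```

Increasing this objective raises the numerator (the likelihood of the correct answer) relative to the denominator (the total likelihood over all hypotheses), which is why the training both favours the correct answer and suppresses its competitors.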

Claims (10)

1. A method for specific target wake-up by voice recognition, characterized by comprising the following steps:
S1: receiving a voice message of a specific target, preprocessing the voice message of the specific target, and extracting a voice feature of the specific target;
S2: using the voice feature of the specific target as input data of a discriminatively trained hidden vector state model (HVS Model) and performing training to obtain an acoustic model of the specific target, and storing the acoustic model of the specific target;
S3: receiving a voice message of a target to be tested, preprocessing the voice message of the target to be tested, and extracting a voice feature of the target to be tested;
S4: using the voice feature of the target to be tested as input data of the discriminatively trained hidden vector state model and performing training to obtain an acoustic model of the target to be tested;
S5: comparing the correlation between the acoustic model of the target to be tested and the acoustic model of the specific target; if the two are correlated, performing language decoding on the voice feature of the target to be tested using at least one language model, and determining whether to wake up according to the language decoding result.

2. The method for specific target wake-up by voice recognition according to claim 1, characterized in that the voice message of the specific target and the voice message of the target to be tested include at least one wake-up word.

3. The method for specific target wake-up by voice recognition according to claim 1, characterized in that the preprocessing comprises performing noise suppression processing and echo cancellation processing on the voice message.

4. The method for specific target wake-up by voice recognition according to claim 1, characterized in that the voice feature is obtained by means of Mel-frequency cepstral coefficients (MFCC).

5. The method for specific target wake-up by voice recognition according to claim 1, characterized in that the discriminative training is performed using the maximum mutual information method (MMI).

6. The method for specific target wake-up by voice recognition according to claim 1, characterized in that the language model comprises a lexicon model, a grammar model, or a combination thereof.

7. The method for specific target wake-up by voice recognition according to claim 2, characterized in that determining, according to the language decoding result, whether the voice recognition wake-up is achieved comprises:
performing language decoding on the voice feature of the target to be tested;
determining whether the voice message of the target to be tested contains the wake-up word;
if the wake-up word is contained, the voice recognition wake-up is activated; if the wake-up word is not contained, the voice recognition wake-up is not activated.

8. A device for specific target wake-up by voice recognition, characterized in that the device comprises:
an acquisition module, including a plurality of microphone arrays, for receiving voice messages of a specific target and of a target to be tested, wherein the voice messages contain a wake-up word;
an extraction module, connected to the acquisition module, for extracting MFCC voice features from the voice messages of the specific target and the target to be tested;
a training module, connected to the extraction module, for using the MFCC voice features from the voice messages of the specific target and the target to be tested as input data of a hidden vector state model trained by the maximum mutual information method, and obtaining the trained acoustic model of the specific target and the trained acoustic model of the target to be tested;
a storage module, connected to the training module, for saving the trained acoustic model of the specific target;
a decoding module, connected to the extraction module, for performing language decoding on the voice message of the target to be tested; and
a processor module, connected to the training module, the storage module and the decoding module, for comparing the acoustic model of the specific target in the storage module with the acoustic model of the target to be tested, determining according to the comparison result whether to activate the decoding module to perform language decoding on the voice message of the target to be tested, and confirming, from the language-decoded voice message of the target to be tested, whether it contains a wake-up word so as to wake up the device.

9. The device for specific target wake-up by voice recognition according to claim 8, characterized by further comprising a registration module, wherein the registration module is connected to the acquisition module and the storage module, and the registration module is used to initiate saving of the acoustic model of the specific target to the storage module.

10. The device for specific target wake-up by voice recognition according to claim 8, characterized by further comprising a wireless communication module, wherein the wireless communication module is used for external communication connection.
CN201910124945.7A 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition Withdrawn CN111583939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910124945.7A CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910124945.7A CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Publications (1)

Publication Number Publication Date
CN111583939A true CN111583939A (en) 2020-08-25

Family

ID=72122523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910124945.7A Withdrawn CN111583939A (en) 2019-02-19 2019-02-19 Method and device for specific target wake-up by voice recognition

Country Status (1)

Country Link
CN (1) CN111583939A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN109243446A (en) * 2018-10-01 2019-01-18 厦门快商通信息技术有限公司 A kind of voice awakening method based on RNN network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DEYU ZHOU AND YULAN HE: "《Discriminative Training of the Hidden Vector State Model for Semantic Parsing》", 《IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING》 *

Similar Documents

Publication Publication Date Title
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
US9646610B2 (en) Method and apparatus for activating a particular wireless communication device to accept speech and/or voice commands using identification data consisting of speech, voice, image recognition
WO2017071182A1 (en) Voice wakeup method, apparatus and system
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
US9354687B2 (en) Methods and apparatus for unsupervised wakeup with time-correlated acoustic events
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
CN107767861B (en) Voice awakening method and system and intelligent terminal
US6618702B1 (en) Method of and device for phone-based speaker recognition
EP3210205B1 (en) Sound sample verification for generating sound detection model
CN111210829B (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
CN109979438A (en) Voice wake-up method and electronic equipment
CN109192224B (en) Voice evaluation method, device and equipment and readable storage medium
US20140337024A1 (en) Method and system for speech command detection, and information processing system
US9335966B2 (en) Methods and apparatus for unsupervised wakeup
CN101772015A (en) Method for starting up mobile terminal through voice password
CN110610707A (en) Voice keyword recognition method and device, electronic equipment and storage medium
CN103943105A (en) Voice interaction method and system
JPH0962293A (en) Speech recognition dialogue device and speech recognition dialogue processing method
CN103021409A (en) Voice activating photographing system
TW202029181A (en) Method and apparatus for specific user to wake up by speech recognition
CN110689887B (en) Audio verification method and device, storage medium and electronic equipment
CN108052195A (en) Control method of microphone equipment and terminal equipment
JP2019124952A (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200825
