
CN110648659B - Voice recognition and keyword detection device and method based on multitask model - Google Patents


Info

Publication number
CN110648659B
CN110648659B (application CN201910906552.1A)
Authority
CN
China
Prior art keywords
neural network
keyword
decoder
speech recognition
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910906552.1A
Other languages
Chinese (zh)
Other versions
CN110648659A (en)
Inventor
赖家豪
郑达
李索恒
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co ltd
Original Assignee
Shanghai Yitu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Information Technology Co ltd filed Critical Shanghai Yitu Information Technology Co ltd
Priority to CN201910906552.1A priority Critical patent/CN110648659B/en
Publication of CN110648659A publication Critical patent/CN110648659A/en
Priority to PCT/CN2020/090285 priority patent/WO2021057038A1/en
Application granted granted Critical
Publication of CN110648659B publication Critical patent/CN110648659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a voice recognition and keyword detection device based on a multitask model, comprising a neural network, a speech recognition decoder, a keyword decoder, and a training module. In the training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the same first input audio data, a second text label, and a second CTC loss function; during training, the output of each CTC loss function is back-propagated to train the neural network, the speech recognition decoder, and the keyword decoder. The invention also discloses a voice recognition and keyword detection method based on the multitask model. The device and method can effectively use speech recognition training data to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection.

Description

Voice recognition and keyword detection device and method based on multitask model
Technical Field
The invention relates to voice recognition, in particular to a voice recognition and keyword detection device and method based on a multitask model.
Background
Speech recognition, also known as automatic speech recognition (ASR), is a technology that converts an input speech signal, i.e., an audio signal, into corresponding text for output, and has important applications in artificial intelligence (AI).
An existing speech recognition device usually includes a neural network (NN), which forms a corresponding model through training. After a speech signal, i.e., an audio signal, is processed by feature extraction and input to the neural network, the network selects an optimal output path according to the trained model and produces a corresponding text signal for output. The neural network is typically a recurrent neural network (RNN), which is commonly trained under the Connectionist Temporal Classification (CTC) criterion. In CTC-based training, each training sample consists of an input audio signal and the label of the true output, and each node in the RNN has an initial weight. After the input audio signal is fed to the RNN, the RNN generates output data according to the weights of its internal nodes; the difference between this output and the true output label is computed by a CTC loss function, and the CTC loss is back-propagated to adjust the weight of each node in the RNN. Training finishes when the difference between the output data and the true output label falls below the required value, or when the change in that difference becomes small; each node of the trained RNN then holds its final weight and is applied to actual speech recognition. In actual recognition, the feature-extracted audio signal is input to the RNN, the RNN selects the output path with the highest score, i.e., the path for which the product of the per-node probabilities along the path is largest, and the corresponding text information is finally obtained by text decoding.
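The best-path decoding described above, taking the highest-probability symbol at each frame and then collapsing repeats and removing blanks, can be sketched in a few lines of Python. This is an illustrative toy, not code from the patent; the alphabet, probabilities, and function names are invented for the example.

```python
BLANK = 0  # index of the CTC blank symbol

def best_path_decode(frame_probs, id_to_char):
    """frame_probs: list of per-frame probability lists over the alphabet."""
    # 1. Greedy: take the argmax symbol at every time step.
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Collapse consecutive repeats, then drop blanks.
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(id_to_char[s])
        prev = s
    return "".join(out)

# Alphabet: 0 = blank, 1 = 'a', 2 = 'b'
probs = [
    [0.1, 0.8, 0.1],  # 'a'
    [0.1, 0.7, 0.2],  # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
    [0.1, 0.2, 0.7],  # 'b'
]
print(best_path_decode(probs, {1: "a", 2: "b"}))  # -> ab
```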
Speech recognition technology can recognize all text appearing in the speech. In some applications, however, keyword detection is also required: detected keywords can serve as commands in automatic control, or be used to monitor sensitive information appearing in communication voice, and so on. Common prior-art keyword detection either searches the top-n recognition results directly for keywords, or adds a bonus score to keywords in the decoder so that keyword hypotheses survive more easily in a beam search. These methods neither exploit the learning and modeling ability of deep neural networks nor make full use of the keyword information present in large amounts of training data.
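The prior-art score-boosting trick mentioned above can be sketched as follows. This is a hypothetical illustration (the function name, bonus value, and hypotheses are invented): it simply adds a flat bonus to any hypothesis containing a keyword before re-ranking, without using any learned model of the keywords.

```python
def rescore(hypotheses, keywords, bonus=2.0):
    """hypotheses: list of (text, log_score); returns a re-scored, re-ranked list."""
    boosted = []
    for text, score in hypotheses:
        if any(kw in text for kw in keywords):
            score += bonus  # flat additive bonus for keyword-bearing hypotheses
        boosted.append((text, score))
    return sorted(boosted, key=lambda h: h[1], reverse=True)

hyps = [("turn off the light", -3.0), ("turn of the light", -2.5)]
print(rescore(hyps, keywords=["turn off"]))
```

With the bonus, the keyword-bearing hypothesis outranks the otherwise higher-scoring one, which is exactly how such boosting keeps keywords alive in a beam search.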
Disclosure of Invention
The invention aims to provide a voice recognition and keyword detection device based on a multitask model that can effectively use speech recognition training data to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection. The invention also provides a corresponding voice recognition and keyword detection method based on the multitask model.
In order to solve the above technical problems, the present invention provides a voice recognition and keyword detection apparatus based on a multitask model, comprising: a neural network, a speech recognition decoder, and a keyword decoder.
The input end of the neural network is connected to input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module;
In the training phase, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. During training, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder. The final weight of each node of the neural network is obtained after training finishes.
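As a rough illustration of this two-loss scheme, the sketch below implements the CTC forward algorithm in plain Python and applies it twice to the same made-up network output, once with the full-transcript label and once with the keyword-only label; the sum of the two losses is what would be back-propagated into the shared network. All numbers, label ids, and function names here are invented for the example, not taken from the patent.

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """log(exp(a) + exp(b)), numerically stable."""
    if a == NEG_INF: return b
    if b == NEG_INF: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_loss(log_probs, labels, blank=0):
    """CTC negative log-likelihood of `labels` given per-frame log-probs
    (forward algorithm over the blank-extended label sequence)."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a = logadd(a, alpha[s - 1])
            # May skip the intervening blank only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    total = alpha[S - 1]
    if S > 1:
        total = logadd(total, alpha[S - 2])
    return -total

# Shared network output: 4 frames over a 3-symbol alphabet (0 = blank).
shared = [[math.log(p) for p in frame] for frame in [
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]]
full_labels = [1, 2]   # first text label: every word in the audio
kw_labels = [2]        # second text label: keywords only
total_loss = ctc_loss(shared, full_labels) + ctc_loss(shared, kw_labels)
print(round(total_loss, 3))
```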
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
In a further refinement, the first CTC loss function outputs the difference between the output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between the output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module.
In the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes the output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
In a further refinement, the multitask model based speech recognition and keyword detection means outputs the prediction result data based on a sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
In order to solve the above technical problems, the voice recognition and keyword detection method based on a multitask model according to the present invention uses a voice recognition and keyword detection apparatus based on a multitask model to perform voice recognition and keyword detection, wherein the apparatus comprises: a neural network, a speech recognition decoder, and a keyword decoder.
The input end of the neural network is connected to input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
In the training phase, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. During training, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder. The final weight of each node of the neural network is obtained after training finishes.
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
In a further refinement, the first CTC loss function outputs the difference between the output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between the output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module, and the inference method comprises the following steps:
in the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes the output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data for predicting the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
In a further improvement, the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
The voice recognition and keyword detection device based on the multitask model is formed by adding a keyword decoder to a voice recognition device. The keyword decoder and the speech recognition decoder share the same neural network, which is trained on both tasks in parallel in the training stage, so the model formed by training the neural network is a multitask model: it has speech recognition and keyword detection capabilities at the same time. Speech recognition training data can thus be effectively used to simultaneously train the keyword detection capability of the model, so the accuracy and recall rate of keyword detection can be significantly improved.
Drawings
The invention is described in further detail below with reference to the following figures and embodiments:
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitasking model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a training phase of the apparatus for speech recognition and keyword detection based on multitasking model according to the embodiment of the present invention;
FIG. 3 is a flowchart of an inference phase of the apparatus for speech recognition and keyword detection based on multitask model according to the embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitask model according to an embodiment of the present invention. The voice recognition and keyword detection device based on the multitask model comprises: a neural network 102; a speech recognition decoder 103 and a keyword decoder 104; and further comprises: a speech feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected to input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data is feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by short-time Fourier transform. The neural network 102 is a recurrent neural network.
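A minimal sketch of such STFT-based feature extraction, using NumPy. The frame length, hop size, and FFT size below are assumed typical values for 16 kHz audio, not parameters from the patent.

```python
import numpy as np

def stft_features(audio, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram via short-time Fourier transform."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)          # one-sided spectrum
        frames.append(np.log(np.abs(spec) + 1e-8))  # log-magnitude, floored
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)

# 1 s of a 440 Hz tone at an assumed 16 kHz sample rate
sr = 16000
t = np.arange(sr) / sr
feats = stft_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 257)
```

The resulting (frames, bins) matrix is the kind of spectral feature sequence that would be fed to the neural network 102.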
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, in the training phase, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training. The output of the first CTC loss function 105 is back-propagated to update the weights of the nodes of the neural network 102 and to train the speech recognition decoder 103, and the output of the second CTC loss function 106 is back-propagated to update the weights of the nodes of the neural network 102 and to train the keyword decoder 104; back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training finishes.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
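The relationship between the two label sets can be illustrated with a hypothetical helper; the transcript and keyword list here are invented for the example.

```python
def make_labels(transcript, keyword_list):
    """Build the two training labels: all words, and keyword-only."""
    first = transcript.split()                        # first text label: every word
    second = [w for w in first if w in keyword_list]  # second text label: keywords only
    return first, second

full, kw = make_labels("please turn off the light", {"turn", "off", "light"})
print(full, kw)
```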
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text label 107, finally achieving CTC-criterion training of the neural network and the speech recognition decoder 103.
The second CTC loss function 106 outputs the difference between the output signal of the keyword decoder 104 and the second text label 108, finally achieving CTC-criterion training of the neural network and the keyword decoder 104.
As shown in fig. 3, in the inference phase, second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 and obtains a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 and obtains a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection apparatus outputs prediction result data that predicts the content of the second input audio data 109b. Preferably, the apparatus outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
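Summing the two decoder scores to pick the final output can be sketched as follows; this is a toy illustration with invented candidate texts and log-scores.

```python
def fuse(asr_log_score, kw_log_score):
    """Combined score: sum of the two decoder log-scores (preferred embodiment)."""
    return asr_log_score + kw_log_score

def pick_best(candidates):
    """candidates: list of (text, asr_log_score, kw_log_score)."""
    return max(candidates, key=lambda c: fuse(c[1], c[2]))[0]

cands = [("open the door", -4.0, -1.0), ("open the drawer", -3.5, -2.0)]
print(pick_best(cands))  # -> open the door
```

Here the keyword decoder's strong score for "open the door" outweighs the speech recognition decoder's slight preference for the other candidate.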
The voice recognition and keyword detection device based on the multitask model is formed by adding the keyword decoder 104 to a voice recognition device. The keyword decoder 104 and the speech recognition decoder 103 share the same neural network 102, which is trained on both tasks in parallel in the training stage, so the model formed by training the neural network 102 is a multitask model: it has speech recognition and keyword detection capabilities at the same time. Speech recognition training data can thus be effectively used to simultaneously train the keyword detection capability of the model, so the keyword detection accuracy and recall rate can be significantly improved.
In the voice recognition and keyword detection method based on the multitask model according to the embodiment of the present invention, a voice recognition and keyword detection device based on the multitask model is used to perform voice recognition and keyword detection. The device comprises: a neural network 102; a speech recognition decoder 103 and a keyword decoder 104; and further comprises: a speech feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected to input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data is feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by short-time Fourier transform. The neural network 102 is a recurrent neural network.
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, the training method includes:
in the training stage, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training. The output of the first CTC loss function 105 is back-propagated to update the weights of the nodes of the neural network 102 and to train the speech recognition decoder 103, and the output of the second CTC loss function 106 is back-propagated to update the weights of the nodes of the neural network 102 and to train the keyword decoder 104; back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training finishes.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text label 107, finally achieving CTC-criterion training of the neural network and the speech recognition decoder 103.
The second CTC loss function 106 outputs the difference between the output signal of the keyword decoder 104 and the second text label 108, finally achieving CTC-criterion training of the neural network and the keyword decoder 104.
As shown in fig. 3, the inference method includes:
in the inference stage, the second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 to obtain a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 to obtain a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection device outputs prediction result data predicting the content of the second input audio data 109b. Preferably, the device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
The present invention has been described in detail with reference to the specific embodiments, but these should not be construed as limitations of the present invention. Many variations and modifications may be made by one of ordinary skill in the art without departing from the principles of the present invention, which should also be considered as within the scope of the present invention.

Claims (11)

1. A voice recognition and keyword detection device based on a multitask model is characterized by comprising: a neural network; a speech recognition decoder, a keyword decoder;
the input end of the neural network is connected with input audio data, the neural network is provided with a plurality of nodes, and each node of the neural network is provided with a weight;
the output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module;
in a training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and the training module trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function; in the training process, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder; the final weight of each node of the neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by keywords appearing in the first input audio data;
the first CTC loss function outputs a difference between an output signal of the speech recognition decoder and the first text label;
the second CTC loss function outputs a difference between an output signal of the keyword decoder and the second text label.
2. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the voice recognition and keyword detection device based on the multitask model further comprises an inference module;
in an inference stage, second input audio data with unknown content is input into the neural network, the speech recognition decoder decodes output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
3. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the input audio data is subjected to feature processing before being input to the neural network.
4. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
5. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: the voice recognition and keyword detection device based on the multitask model outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
6. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the neural network is a recurrent neural network.
7. A voice recognition and keyword detection method based on a multitask model is characterized in that: the voice recognition and keyword detection device based on the multitask model is adopted to perform voice recognition and keyword detection, and comprises: a recurrent neural network; a speech recognition decoder, a keyword decoder;
the input end of the recurrent neural network is connected with input audio data, the recurrent neural network is provided with a plurality of nodes, and each node of the recurrent neural network is provided with a weight;
the output data formed by the output end of the recurrent neural network are respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
in a training phase, the training module trains the speech recognition decoder and the recurrent neural network using first input audio data, a first text label, and a first CTC loss function, and the training module trains the keyword decoder and the recurrent neural network using the first input audio data, a second text label, and a second CTC loss function; in the training process, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the recurrent neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the recurrent neural network and to train the keyword decoder; the final weight of each node of the recurrent neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by key words appearing in the first input audio data;
the first CTC loss function outputs the difference between an output signal of the speech recognition decoder and the first text label;
the second CTC loss function outputs the difference between an output signal of the keyword decoder and the second text label.
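The training scheme of claim 7 — one shared recurrent network feeding two decoder heads, each trained with its own CTC loss whose gradient is back-propagated into the shared weights — can be sketched as follows. This is an illustrative sketch assuming PyTorch; all layer sizes, class counts, and names (`MultitaskModel`, `asr_head`, `kws_head`) are hypothetical choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class MultitaskModel(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_chars=5000, n_keywords=100):
        super().__init__()
        # Shared recurrent neural network (claim 6: the network is recurrent)
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        # Speech recognition decoder head (+1 class for the CTC blank symbol)
        self.asr_head = nn.Linear(hidden, n_chars + 1)
        # Keyword decoder head
        self.kws_head = nn.Linear(hidden, n_keywords + 1)

    def forward(self, feats):
        enc, _ = self.encoder(feats)              # (batch, time, hidden)
        return self.asr_head(enc), self.kws_head(enc)

model = MultitaskModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 2 utterances, 120 frames of 80-dim features each
feats = torch.randn(2, 120, 80)
in_lens = torch.full((2,), 120, dtype=torch.long)
full_labels = torch.randint(1, 5001, (2, 20))     # first text label: all words
kw_labels = torch.randint(1, 101, (2, 3))         # second text label: keywords only

asr_logits, kws_logits = model(feats)
# nn.CTCLoss expects (time, batch, classes) log-probabilities
asr_lp = asr_logits.log_softmax(-1).transpose(0, 1)
kws_lp = kws_logits.log_softmax(-1).transpose(0, 1)
loss = (ctc(asr_lp, full_labels, in_lens, torch.full((2,), 20, dtype=torch.long)) +
        ctc(kws_lp, kw_labels, in_lens, torch.full((2,), 3, dtype=torch.long)))
loss.backward()  # back-propagation updates both decoder heads and the shared RNN
opt.step()
```

Because both losses flow through the same encoder, one backward pass updates the shared node weights from both tasks, which is the essence of the multitask arrangement the claim describes.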
8. The multitask model based speech recognition and keyword detection method of claim 7, wherein: the multitask model based speech recognition and keyword detection device further comprises an inference module, and the inference method comprises the following steps:
in an inference stage, second input audio data with unknown content is input into the recurrent neural network, the speech recognition decoder decodes output data of the recurrent neural network and obtains a speech recognition score, and the keyword decoder decodes output data of the recurrent neural network and obtains a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
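Claims 8 and 11 describe combining the two decoders' scores by summation at inference time. The toy sketch below illustrates that combination using greedy CTC decoding over random dummy posteriors; greedy decoding and the name `greedy_ctc_score` are assumptions for illustration, since the claims do not fix a decoding algorithm.

```python
import numpy as np

def greedy_ctc_score(log_probs):
    """Greedy CTC decode over (time, classes) log-probabilities.

    Returns the collapsed label sequence and a score equal to the
    summed per-frame log-probabilities of the best path.
    """
    best = log_probs.argmax(axis=1)
    score = float(log_probs[np.arange(len(best)), best].sum())
    # Collapse repeats and drop blanks (class 0) to obtain the labels
    seq = [int(l) for i, l in enumerate(best)
           if l != 0 and (i == 0 or l != best[i - 1])]
    return seq, score

rng = np.random.default_rng(0)
asr_lp = np.log(rng.dirichlet(np.ones(6), size=10))   # 10 frames, 6 classes
kws_lp = np.log(rng.dirichlet(np.ones(4), size=10))   # keyword decoder posteriors

asr_seq, asr_score = greedy_ctc_score(asr_lp)
kws_seq, kws_score = greedy_ctc_score(kws_lp)
combined = asr_score + kws_score   # sum of speech recognition and keyword scores
```

The prediction result would then be chosen using `combined`, so a hypothesis supported by both the full-vocabulary decoder and the keyword decoder outranks one supported by only one of them.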
9. The multitask model based speech recognition and keyword detection method of claim 7, wherein: the input audio data is subjected to feature processing before being input to the recurrent neural network.
10. The multitask model based speech recognition and keyword detection method of claim 9, wherein: the feature processing is to extract spectral features of the input audio data by a short-time Fourier transform.
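The feature processing of claim 10 — spectral features via a short-time Fourier transform — can be sketched with NumPy alone. The window type, 25 ms frame length, and 10 ms hop at 16 kHz are illustrative choices not fixed by the claim.

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram via the short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame: (n_frames, n_fft // 2 + 1)
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spec + 1e-8)   # small floor avoids log(0)

# 1 second of a 100 Hz tone sampled at 16 kHz
sig = np.sin(2 * np.pi * 100 * np.arange(16000) / 16000)
feats = stft_features(sig)
print(feats.shape)  # (98, 257)
```

The resulting (frames, frequency-bins) matrix is the kind of input the recurrent neural network in the claims would consume, one frame per time step.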
11. The multitask model based speech recognition and keyword detection method of claim 8, wherein: the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
CN201910906552.1A 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model Active CN110648659B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910906552.1A CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model
PCT/CN2020/090285 WO2021057038A1 (en) 2019-09-24 2020-05-14 Apparatus and method for speech recognition and keyword detection based on multi-task model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910906552.1A CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model

Publications (2)

Publication Number Publication Date
CN110648659A CN110648659A (en) 2020-01-03
CN110648659B true CN110648659B (en) 2022-07-01

Family

ID=69011144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910906552.1A Active CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model

Country Status (2)

Country Link
CN (1) CN110648659B (en)
WO (1) WO2021057038A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648659B (en) * 2019-09-24 2022-07-01 上海依图信息技术有限公司 Voice recognition and keyword detection device and method based on multitask model
CN111261146B (en) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN112233655B (en) * 2020-09-28 2024-07-16 上海声瀚信息科技有限公司 Neural network training method for improving recognition performance of voice command words
CN114420105A (en) * 2020-10-13 2022-04-29 腾讯科技(深圳)有限公司 Training method, device, server and storage medium for speech recognition model
CN115206296A (en) * 2021-04-09 2022-10-18 京东科技控股股份有限公司 Method and device for speech recognition
CN113221555B (en) * 2021-05-07 2023-11-14 支付宝(杭州)信息技术有限公司 Keyword recognition method, device and equipment based on multitasking model
CN113314119B (en) * 2021-07-27 2021-12-03 深圳百昱达科技有限公司 Voice recognition intelligent household control method and device
CN113823274B (en) * 2021-08-16 2023-10-27 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN113703579B (en) * 2021-08-31 2023-05-30 北京字跳网络技术有限公司 Data processing method, device, electronic device and storage medium
CN117275461B (en) * 2023-11-23 2024-03-15 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632487A (en) * 2015-12-31 2016-06-01 北京奇艺世纪科技有限公司 Voice recognition method and device
CN108735202A (en) * 2017-03-13 2018-11-02 百度(美国)有限责任公司 Convolution recurrent neural network for small occupancy resource keyword retrieval
CN109145281A (en) * 2017-06-15 2019-01-04 北京嘀嘀无限科技发展有限公司 Audio recognition method, device and storage medium
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN109599093A * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, device and readable storage medium for intelligent quality inspection

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646634B2 (en) * 2014-09-30 2017-05-09 Google Inc. Low-rank hidden input layer for speech recognition neural network
US9508340B2 (en) * 2014-12-22 2016-11-29 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN108538285B (en) * 2018-03-05 2021-05-04 清华大学 Multi-instance keyword detection method based on multitask neural network
CN108922521B (en) * 2018-08-15 2021-07-06 合肥讯飞数码科技有限公司 Voice keyword retrieval method, device, equipment and storage medium
CN109616102B (en) * 2019-01-09 2021-08-31 百度在线网络技术(北京)有限公司 Acoustic model training method and device and storage medium
CN109840287B (en) * 2019-01-31 2021-02-19 中科人工智能创新技术研究院(青岛)有限公司 Cross-modal information retrieval method and device based on neural network
CN110648659B (en) * 2019-09-24 2022-07-01 上海依图信息技术有限公司 Voice recognition and keyword detection device and method based on multitask model

Also Published As

Publication number Publication date
WO2021057038A1 (en) 2021-04-01
CN110648659A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
CN110648659B (en) Voice recognition and keyword detection device and method based on multitask model
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
CN106098059B (en) Customizable voice wake-up method and system
CN113987179B (en) Dialogue emotion recognition network model, construction method, electronic device and storage medium based on knowledge enhancement and retroactive loss
CN110517664B (en) Multi-party identification method, device, equipment and readable storage medium
CN112017645B (en) Voice recognition method and device
JP7070894B2 (en) Time series information learning system, method and neural network model
CN110197279B (en) Transformation model training method, device, equipment and storage medium
CN111653275B (en) Construction method and device of speech recognition model based on LSTM-CTC tail convolution, and speech recognition method
CN114490950B (en) Method and storage medium for training encoder model, and method and system for predicting similarity
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
WO2022222056A1 (en) Synthetic speech detection
CN114333790A (en) Data processing method, device, equipment, storage medium and program product
CN116450839A (en) Knowledge injection and training method and system for knowledge enhancement pre-training language model
CN112750469B (en) Method for detecting music in speech, method for optimizing speech communication and corresponding device
CN115132196B (en) Voice instruction recognition method and device, electronic equipment and storage medium
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN110648668A (en) Keyword detection device and method
CN114242045A (en) Deep learning method for natural language dialogue system intention
Bovbjerg et al. Self-supervised pretraining for robust personalized voice activity detection in adverse conditions
CN114400006A (en) Speech recognition method and device
KR20220153852A (en) Natural language processing apparatus for intent analysis and processing of multi-intent speech, program and its control method
Hentschel et al. Feature-based learning hidden unit contributions for domain adaptation of RNN-LMs
CN118553249B (en) Audio identification method, system and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant