CN110648659B - Voice recognition and keyword detection device and method based on multitask model - Google Patents
- Publication number
- CN110648659B (application number CN201910906552.1A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- keyword
- decoder
- speech recognition
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L2015/225—Feedback of the input speech
Abstract
The invention discloses a voice recognition and keyword detection device based on a multitask model, comprising: a neural network; a speech recognition decoder; a keyword decoder; and a training module. In the training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the same first input audio data, a second text label, and a second CTC loss function. During training, the output of the corresponding CTC loss function is back-propagated to train the neural network, the speech recognition decoder, and the keyword decoder. The invention also discloses a voice recognition and keyword detection method based on the multitask model. The invention can effectively reuse speech recognition training data to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection.
Description
Technical Field
The invention relates to voice recognition, in particular to a voice recognition and keyword detection device and method based on a multitask model.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), is a technology that converts an input speech signal, i.e., an audio signal, into corresponding text for output, and has important applications in artificial intelligence (AI).
An existing speech recognition device usually includes a neural network (NN), which forms a corresponding model through training. A speech signal, i.e., an audio signal, is processed by feature extraction and input to the neural network; according to the trained model, the neural network selects an optimal output path and produces the corresponding text output. Such neural networks include recurrent neural networks (RNNs), which are typically trained under the Connectionist Temporal Classification (CTC) criterion. CTC-based training uses training samples, each comprising an input audio signal and a label describing the true output. Each node in the RNN has an initial weight. When the input audio signal is fed into the RNN, the RNN generates output data according to the current weights of its internal nodes; the difference between this output data and the true output label is computed and output by a CTC loss function. The CTC loss is back-propagated to adjust the weight of each node in the RNN. Training ends when the difference between the output data and the true output label falls to the required value, or when this difference no longer changes appreciably; each node in the trained RNN then holds a corresponding final weight, which is applied in actual speech recognition. In actual speech recognition, the feature-extracted audio signal is input into the RNN, the RNN selects the output path with the largest score (the path for which the product of the probabilities of the RNN nodes along the path is largest), and text decoding finally yields the corresponding text information.
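The CTC criterion described above can be made concrete with a small numerical sketch. The following is a minimal NumPy implementation of the standard CTC forward algorithm (an illustration of the general criterion, not code from the patent): it computes the negative log-likelihood of a label sequence given the per-frame output distributions of the network, which is exactly the quantity that is back-propagated to adjust the node weights.

```python
import numpy as np

def ctc_loss(probs, labels, blank=0):
    """CTC negative log-likelihood of `labels` given `probs`.

    probs: array of shape (T, V), per-frame posteriors (rows sum to 1).
    labels: target label sequence without blanks.
    """
    # Interleave blanks around the labels: b, y1, b, y2, ..., b
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    S, T = len(ext), probs.shape[0]

    alpha = np.zeros((T, S))          # forward variables
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]       # stay on the same extended symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]   # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]   # skip the intermediate blank
            alpha[t, s] = a * probs[t, ext[s]]

    # Valid paths end on the last label or the trailing blank.
    total = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(total)
```

The forward recursion sums, in polynomial time, the probabilities of all frame-level paths that collapse (repeats merged, blanks removed) to the target label sequence.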
Speech recognition technology can recognize all of the text appearing in the speech. In some applications, however, keyword detection is also required: keyword detection can obtain commands needed for automatic control, monitor sensitive information appearing in communication speech, and so on. Common prior-art keyword detection either searches the top-n (TopN) recognition results directly for keywords, or adds a certain bonus score to keywords in the decoder so that keyword hypotheses survive the beam search more easily. These methods neither exploit the learning and modeling ability of deep neural networks nor make full use of the keyword information contained in large amounts of training data.
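The second prior-art approach can be sketched as follows (a hypothetical minimal illustration, not code from the patent): each beam-search hypothesis receives a fixed log-score bonus per keyword it contains, so keyword hypotheses are less likely to be pruned.

```python
def boost_keywords(hypotheses, keywords, bonus=2.0):
    """Re-score beam-search hypotheses with a per-keyword bonus.

    hypotheses: list of (word_list, log_score) pairs from a beam search.
    keywords: iterable of keyword strings; bonus: log-score added per hit.
    Returns the hypotheses re-scored and sorted best-first.
    """
    kw = set(keywords)
    rescored = [(words, score + bonus * sum(1 for w in words if w in kw))
                for words, score in hypotheses]
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

Note that this heuristic only re-weights an existing decoder; unlike the multitask model of the invention, it learns nothing from the keyword information in the training data.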
Disclosure of Invention
The invention aims to provide a voice recognition and keyword detection device based on a multitask model that can effectively reuse training data for speech recognition to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection. To this end, the invention also provides a voice recognition and keyword detection method based on the multitask model.
In order to solve the above technical problems, the present invention provides a voice recognition and keyword detection apparatus based on a multitask model, comprising: a neural network; a speech recognition decoder; and a keyword decoder.
The input end of the neural network is connected with input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module;
In the training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. In the training process, the output of the first CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the speech recognition decoder, and the output of the second CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the keyword decoder. The final weight of each node of the neural network is obtained after training is finished.
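The essential point of this parallel training scheme is that gradients from both task heads are back-propagated into, and accumulated in, the shared network. The following toy NumPy sketch illustrates this under simplifying assumptions: a one-layer encoder stands in for the RNN, and per-sample cross-entropy losses stand in for the two CTC losses (the dimensions and learning rate are illustrative choices, not values from the patent).

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, V_asr, V_kw = 5, 4, 6, 3          # toy dimensions
W = rng.normal(size=(D, H)) * 0.1        # shared encoder (stand-in for the RNN)
U_asr = rng.normal(size=(H, V_asr)) * 0.1  # speech-recognition head
U_kw = rng.normal(size=(H, V_kw)) * 0.1    # keyword-detection head

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def step(x, y_asr, y_kw, lr=0.1):
    """One joint training step on a single sample (x, y_asr, y_kw)."""
    global W, U_asr, U_kw
    h = np.tanh(x @ W)                         # shared representation
    p1, p2 = softmax(h @ U_asr), softmax(h @ U_kw)
    loss = -np.log(p1[y_asr]) - np.log(p2[y_kw])   # sum of the two task losses
    g1 = p1.copy(); g1[y_asr] -= 1             # dLoss/dlogits, ASR head
    g2 = p2.copy(); g2[y_kw] -= 1              # dLoss/dlogits, keyword head
    gh = g1 @ U_asr.T + g2 @ U_kw.T            # gradients from BOTH heads add up
    gW = np.outer(x, gh * (1 - h ** 2))        # back-prop through tanh encoder
    U_asr -= lr * np.outer(h, g1)
    U_kw -= lr * np.outer(h, g2)
    W -= lr * gW                               # shared weights see both tasks
    return loss
```

The line computing `gh` is where the multitask structure shows: the shared weights `W` receive the summed back-propagated error of the speech recognition head and the keyword head, just as the neural network in the device receives back-propagation from both CTC losses.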
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
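The two label types can thus be derived from a single transcript. The helper below is a hypothetical illustration (not defined in the patent): the first label is the full word sequence, and the second keeps only the words found in a keyword list, in their order of appearance.

```python
def make_labels(transcript, keywords):
    """Build the two text labels for one training utterance.

    transcript: full word sequence spoken in the audio.
    keywords: the keyword vocabulary to detect.
    Returns (first_label, second_label).
    """
    kw = set(keywords)
    first_label = list(transcript)                       # all words
    second_label = [w for w in transcript if w in kw]    # keywords only
    return first_label, second_label
```

Because both labels come from the same utterance, every speech recognition training sample automatically yields a keyword detection training sample as well.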
In a further refinement, the first CTC loss function outputs the difference between an output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between an output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module.
In the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network and obtains a speech recognition score, and the keyword decoder decodes the output data of the neural network and obtains a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
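A minimal sketch of such feature processing (assuming a mono signal and a Hann window; the frame and hop sizes are illustrative choices, not values from the patent):

```python
import numpy as np

def stft_features(signal, frame_len=256, hop=128):
    """Magnitude spectrogram of a 1-D signal via a short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequency bins: frame_len // 2 + 1 of them
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each row of the result is the spectral feature vector for one frame, which is what the neural network consumes in place of the raw waveform.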
In a further refinement, the multitask model based speech recognition and keyword detection means outputs the prediction result data based on a sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
In order to solve the above technical problems, the voice recognition and keyword detection method based on a multitask model according to the present invention uses a voice recognition and keyword detection apparatus based on a multitask model to perform voice recognition and keyword detection, wherein the apparatus comprises: a neural network; a speech recognition decoder; and a keyword decoder.
The input end of the neural network is connected with input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
In the training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. In the training process, the output of the first CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the speech recognition decoder, and the output of the second CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the keyword decoder. The final weight of each node of the neural network is obtained after training is finished.
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
In a further refinement, the first CTC loss function outputs the difference between an output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between an output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module, and the inference method comprises the following steps:
in the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network and obtains a speech recognition score, and the keyword decoder decodes the output data of the neural network and obtains a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data for predicting the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform.
In a further improvement, the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
The voice recognition and keyword detection device based on the multitask model is formed by adding a keyword decoder to a speech recognition device. The keyword decoder and the speech recognition decoder share the same neural network, which is trained in parallel for both tasks in the training stage, so the model formed by training the neural network is a multitask model: it has speech recognition and keyword detection capabilities at the same time. In other words, the training data for speech recognition can be effectively reused to simultaneously train the keyword detection capability of the model, so the accuracy and recall of keyword detection can be significantly improved.
Drawings
The invention is described in further detail below with reference to the following figures and embodiments:
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitasking model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a training phase of the apparatus for speech recognition and keyword detection based on multitasking model according to the embodiment of the present invention;
FIG. 3 is a flowchart of an inference phase of the apparatus for speech recognition and keyword detection based on multitask model according to the embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitask model according to an embodiment of the present invention. The voice recognition and keyword detection device based on the multitask model comprises: a neural network 102; a speech recognition decoder 103; a keyword decoder 104; and further comprises: a voice feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected to input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data needs to be feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform. The neural network 102 is a recurrent neural network.
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, in the training phase, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training: the output of the first CTC loss function 105 is propagated backwards to implement the weight update training of each node of the neural network 102 and the training of the speech recognition decoder 103, and the output of the second CTC loss function 106 is propagated backwards to implement the weight update training of each node of the neural network 102 and the training of the keyword decoder 104. Back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training is finished.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109 a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text tag 107. Finally, training of the neural network and the speech recognition decoder 103 based on CTC rules is achieved.
The second CTC loss function 106 outputs a difference between the output signal of the keyword decoder 104 and the second text tag 108. Finally, the neural network and the keyword decoder 104 are trained based on the CTC rules.
As shown in fig. 3, in the inference phase, second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 and obtains a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 and obtains a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection apparatus outputs prediction result data that predicts the content of the second input audio data 109 b. Preferably, the voice recognition and keyword detection apparatus based on multitask model outputs the prediction result data based on the sum of the voice recognition score and the keyword decoding score.
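The preferred combination rule of this embodiment, the sum of the two decoder scores, can be sketched as follows (a minimal illustration assuming log-domain scores; the hypothesis format is hypothetical):

```python
def combined_score(asr_score, kw_score):
    """Preferred embodiment: combine the two decoder scores by addition."""
    return asr_score + kw_score

def best_prediction(hypotheses):
    """Select the prediction result with the highest combined score.

    hypotheses: list of (text, asr_score, kw_score) tuples produced by the
    speech recognition decoder and the keyword decoder for one utterance.
    """
    return max(hypotheses, key=lambda h: combined_score(h[1], h[2]))[0]
```

Because the keyword decoding score enters the ranking additively, hypotheses supported by the keyword decoder are favored without discarding the evidence from the speech recognition decoder.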
The voice recognition and keyword detection device based on the multitask model is formed by adding the keyword decoder 104 to a speech recognition device. The keyword decoder 104 and the speech recognition decoder 103 share the same neural network 102, which is trained in parallel for both tasks in the training stage, so the model formed by training the neural network 102 is a multitask model: it has speech recognition and keyword detection capabilities at the same time. In other words, the training data for speech recognition can be effectively reused to simultaneously train the keyword detection capability of the model, so the accuracy and recall of keyword detection can be significantly improved.
In the voice recognition and keyword detection method based on the multitask model according to the embodiment of the present invention, a voice recognition and keyword detection device based on the multitask model is used to perform voice recognition and keyword detection. The device comprises: a neural network 102; a speech recognition decoder 103; a keyword decoder 104; and further comprises: a speech feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected with input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data needs to be feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by a short-time Fourier transform. The neural network 102 is a recurrent neural network.
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, the training method includes:
in the training stage, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training: the output of the first CTC loss function 105 is propagated backwards to implement the weight update training of each node of the neural network 102 and the training of the speech recognition decoder 103, and the output of the second CTC loss function 106 is propagated backwards to implement the weight update training of each node of the neural network 102 and the training of the keyword decoder 104. Back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training is finished.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109 a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text tag 107. Finally, training of the neural network and the speech recognition decoder 103 based on CTC rules is achieved.
The second CTC loss function 106 outputs a difference between the output signal of the keyword decoder 104 and the second text tag 108. Finally, the neural network and the keyword decoder 104 are trained based on the CTC rules.
As shown in fig. 3, the inference method includes:
in the inference stage, the second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 to obtain a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 to obtain a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection device outputs prediction result data predicting the content of the second input audio data 109 b. Preferably, the multitask-model-based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
The present invention has been described in detail through the above specific embodiments, but these embodiments should not be construed as limiting the invention. Many variations and modifications may be made by one of ordinary skill in the art without departing from the principles of the invention, and these should also be considered to fall within the scope of protection of the present invention.
Claims (11)
1. A voice recognition and keyword detection device based on a multitask model is characterized by comprising: a neural network; a speech recognition decoder, a keyword decoder;
the input end of the neural network is connected with input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight;
the output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module;
in a training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label and a first CTC loss function, and the training module trains the keyword decoder and the neural network using the first input audio data, a second text label and a second CTC loss function; in the training process, the output of the first CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the speech recognition decoder, and the output of the second CTC loss function is back-propagated to realize weight update training of each node of the neural network and training of the keyword decoder; the final weight of each node of the neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by the keywords appearing in the first input audio data;
the first CTC loss function outputs the difference between an output signal of the speech recognition decoder and the first text label;
the second CTC loss function outputs the difference between an output signal of the keyword decoder and the second text label.
2. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the voice recognition and keyword detection device based on the multitask model further comprises an inference module;
in an inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes output data of the neural network and obtains a speech recognition score, and the keyword decoder decodes output data of the neural network and obtains a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
3. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the input audio data is subjected to feature processing before being input to the neural network.
4. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: the feature processing is to extract spectral features of the input audio data by short-time fourier transform.
5. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: and outputting prediction result data by the voice recognition and keyword detection device based on the multitask model based on the sum of the voice recognition score and the keyword decoding score.
6. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the neural network is a recurrent neural network.
7. A voice recognition and keyword detection method based on a multitask model is characterized in that: the voice recognition and keyword detection device based on the multitask model is adopted to perform voice recognition and keyword detection, and comprises: a recurrent neural network; a speech recognition decoder, a keyword decoder;
the input end of the recurrent neural network is connected with input audio data; the recurrent neural network has a plurality of nodes, and each node of the recurrent neural network has a weight;
the output data formed by the output end of the recurrent neural network are respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
in a training phase, the training module trains the speech recognition decoder and the recurrent neural network with first input audio data, a first text label, and a first CTC loss function, the training module trains the keyword decoder and the recurrent neural network using the first input audio data, a second text label, and a second CTC loss function, in the training process, the output of the first CTC loss function is subjected to back propagation to realize weight updating training of each node of the recurrent neural network and training of the voice recognition decoder, the output of the second CTC loss function is subjected to back propagation to realize weight updating training of each node of the recurrent neural network and training of the keyword decoder, and the final weight of each node of the recurrent neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by key words appearing in the first input audio data;
the first CTC loss function outputs a difference between an output signal of the speech recognition decoder and the first text tag;
the second CTC loss function outputs a difference between an output signal of the keyword decoder and the second text tag.
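The training arrangement of claim 7 — two CTC losses evaluated against the same shared network output, one over the full transcript and one over the keywords only — can be sketched as follows. This is an illustrative pure-Python rendering of the CTC forward algorithm; the blank index 0 and the toy probability table are assumptions, not part of the claim, and a production system would compute these losses in an automatic-differentiation framework so that back-propagating their sum updates the shared recurrent weights.

```python
import math

BLANK = 0  # assumption: class index 0 is the CTC blank symbol

def ctc_loss(probs, labels):
    """Negative log-likelihood of `labels` under the CTC forward algorithm.

    probs  -- per-frame probability distributions, shape T x C
    labels -- target label indices (without blanks)
    """
    # Extend the label sequence with blanks: b, l1, b, l2, ..., b
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of all alignments ending at ext[s] after frame t
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping over a blank is allowed only between distinct labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    total = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    return -math.log(total)

def multitask_loss(probs, full_transcript, keywords):
    # Both CTC losses share the same network output (`probs`);
    # back-propagating their sum trains the shared recurrent weights
    # together with both decoders.
    return ctc_loss(probs, full_transcript) + ctc_loss(probs, keywords)
```

With two frames of a uniform distribution over {blank, label 1}, the three alignments `1·1`, `1·blank`, and `blank·1` each carry probability 0.25, so `ctc_loss(probs, [1])` evaluates to `-log(0.75)`.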
8. The multitask model based speech recognition and keyword detection method of claim 7, wherein the multitask model based speech recognition and keyword detection device further comprises an inference module, and the inference method comprises the following steps:
in an inference stage, second input audio data of unknown content is input into the recurrent neural network, the speech recognition decoder decodes the output data of the recurrent neural network to obtain a speech recognition score, and the keyword decoder decodes the output data of the recurrent neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection device outputs prediction result data predicting the content of the second input audio data.
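The score combination of claim 8 (made explicit in claims 5 and 11 as a sum of the two decoder scores) can be sketched as follows. The hypothesis tuple format and the example values are illustrative assumptions; the scores are taken in the log domain, so adding them corresponds to multiplying the two decoders' probabilities.

```python
def fuse_scores(asr_score, kw_score):
    """Combined decision score: the sum of the speech recognition score
    and the keyword decoding score (log-domain)."""
    return asr_score + kw_score

def predict(hypotheses):
    """Select the hypothesis with the highest combined score.

    hypotheses -- list of (text, asr_log_score, kw_log_score) tuples
    (this hypothesis format is an illustrative assumption)
    """
    return max(hypotheses, key=lambda h: fuse_scores(h[1], h[2]))[0]
```

A hypothesis with a middling recognition score but a strong keyword score can thus outrank one the recognizer alone would prefer.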
9. The multitask model based speech recognition and keyword detection method of claim 7, wherein the input audio data is subjected to feature processing before being input into the recurrent neural network.
10. The multitask model based speech recognition and keyword detection method of claim 9, wherein the feature processing extracts spectral features of the input audio data by means of a short-time Fourier transform.
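The feature processing of claim 10 can be sketched as a short-time Fourier transform over successive frames of the audio signal. The frame length, hop size, and rectangular window below are illustrative assumptions chosen to keep the example tiny; practical front ends typically use roughly 25 ms Hann-windowed frames with a 10 ms hop, often followed by log-mel compression.

```python
import cmath

def stft_features(signal, frame_len=8, hop=4):
    """Magnitude spectrogram of `signal` via a short-time Fourier transform.

    Returns one list of |DFT| magnitudes (bins 0..frame_len//2, since the
    spectrum of a real signal is conjugate-symmetric) per frame.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # DFT bin k of the (rectangular-windowed) frame
            s = sum(x * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                    for n, x in enumerate(frame))
            spectrum.append(abs(s))
        frames.append(spectrum)
    return frames
```

For a pure cosine whose period is four samples, every frame's energy concentrates in DFT bin 2, which is what the recurrent network would see as a spectral feature.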
11. The multitask model based speech recognition and keyword detection method of claim 8, wherein the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910906552.1A CN110648659B (en) | 2019-09-24 | 2019-09-24 | Voice recognition and keyword detection device and method based on multitask model |
PCT/CN2020/090285 WO2021057038A1 (en) | 2019-09-24 | 2020-05-14 | Apparatus and method for speech recognition and keyword detection based on multi-task model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910906552.1A CN110648659B (en) | 2019-09-24 | 2019-09-24 | Voice recognition and keyword detection device and method based on multitask model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110648659A CN110648659A (en) | 2020-01-03 |
CN110648659B true CN110648659B (en) | 2022-07-01 |
Family
ID=69011144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910906552.1A Active CN110648659B (en) | 2019-09-24 | 2019-09-24 | Voice recognition and keyword detection device and method based on multitask model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110648659B (en) |
WO (1) | WO2021057038A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110648659B (en) * | 2019-09-24 | 2022-07-01 | 上海依图信息技术有限公司 | Voice recognition and keyword detection device and method based on multitask model |
CN111261146B (en) * | 2020-01-16 | 2022-09-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN112233655B (en) * | 2020-09-28 | 2024-07-16 | 上海声瀚信息科技有限公司 | Neural network training method for improving recognition performance of voice command words |
CN114420105A (en) * | 2020-10-13 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Training method, device, server and storage medium for speech recognition model |
CN115206296A (en) * | 2021-04-09 | 2022-10-18 | 京东科技控股股份有限公司 | Method and device for speech recognition |
CN113221555B (en) * | 2021-05-07 | 2023-11-14 | 支付宝(杭州)信息技术有限公司 | Keyword recognition method, device and equipment based on multitasking model |
CN113314119B (en) * | 2021-07-27 | 2021-12-03 | 深圳百昱达科技有限公司 | Voice recognition intelligent household control method and device |
CN113823274B (en) * | 2021-08-16 | 2023-10-27 | 华南理工大学 | Voice keyword sample screening method based on detection error weighted editing distance |
CN113703579B (en) * | 2021-08-31 | 2023-05-30 | 北京字跳网络技术有限公司 | Data processing method, device, electronic device and storage medium |
CN117275461B (en) * | 2023-11-23 | 2024-03-15 | 上海蜜度科技股份有限公司 | Multitasking audio processing method, system, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105632487A (en) * | 2015-12-31 | 2016-06-01 | 北京奇艺世纪科技有限公司 | Voice recognition method and device |
CN108735202A (en) * | 2017-03-13 | 2018-11-02 | 百度(美国)有限责任公司 | Convolution recurrent neural network for small occupancy resource keyword retrieval |
CN109145281A (en) * | 2017-06-15 | 2019-01-04 | 北京嘀嘀无限科技发展有限公司 | Audio recognition method, device and storage medium |
CN109215662A (en) * | 2018-09-18 | 2019-01-15 | 平安科技(深圳)有限公司 | End-to-end audio recognition method, electronic device and computer readable storage medium |
CN109599093A (en) * | 2018-10-26 | 2019-04-09 | 北京中关村科金技术有限公司 | Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646634B2 (en) * | 2014-09-30 | 2017-05-09 | Google Inc. | Low-rank hidden input layer for speech recognition neural network |
US9508340B2 (en) * | 2014-12-22 | 2016-11-29 | Google Inc. | User specified keyword spotting using long short term memory neural network feature extractor |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN108538285B (en) * | 2018-03-05 | 2021-05-04 | 清华大学 | Multi-instance keyword detection method based on multitask neural network |
CN108922521B (en) * | 2018-08-15 | 2021-07-06 | 合肥讯飞数码科技有限公司 | Voice keyword retrieval method, device, equipment and storage medium |
CN109616102B (en) * | 2019-01-09 | 2021-08-31 | 百度在线网络技术(北京)有限公司 | Acoustic model training method and device and storage medium |
CN109840287B (en) * | 2019-01-31 | 2021-02-19 | 中科人工智能创新技术研究院(青岛)有限公司 | Cross-modal information retrieval method and device based on neural network |
CN110648659B (en) * | 2019-09-24 | 2022-07-01 | 上海依图信息技术有限公司 | Voice recognition and keyword detection device and method based on multitask model |
Also Published As
Publication number | Publication date |
---|---|
WO2021057038A1 (en) | 2021-04-01 |
CN110648659A (en) | 2020-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110648659B (en) | Voice recognition and keyword detection device and method based on multitask model | |
US11503155B2 (en) | Interactive voice-control method and apparatus, device and medium | |
Zeng et al. | Effective combination of DenseNet and BiLSTM for keyword spotting | |
CN106098059B (en) | Customizable voice wake-up method and system | |
CN113987179B (en) | Dialogue emotion recognition network model, construction method, electronic device and storage medium based on knowledge enhancement and retroactive loss | |
CN110517664B (en) | Multi-party identification method, device, equipment and readable storage medium | |
CN112017645B (en) | Voice recognition method and device | |
JP7070894B2 (en) | Time series information learning system, method and neural network model | |
CN110197279B (en) | Transformation model training method, device, equipment and storage medium | |
CN111653275B (en) | Construction method and device of speech recognition model based on LSTM-CTC tail convolution, and speech recognition method | |
CN114490950B (en) | Method and storage medium for training encoder model, and method and system for predicting similarity | |
CN114596844A (en) | Acoustic model training method, voice recognition method and related equipment | |
WO2022222056A1 (en) | Synthetic speech detection | |
CN114333790A (en) | Data processing method, device, equipment, storage medium and program product | |
CN116450839A (en) | Knowledge injection and training method and system for knowledge enhancement pre-training language model | |
CN112750469B (en) | Method for detecting music in speech, method for optimizing speech communication and corresponding device | |
CN115132196B (en) | Voice instruction recognition method and device, electronic equipment and storage medium | |
CN113793599B (en) | Training method of voice recognition model, voice recognition method and device | |
CN110648668A (en) | Keyword detection device and method | |
CN114242045A (en) | Deep learning method for natural language dialogue system intention | |
Bovbjerg et al. | Self-supervised pretraining for robust personalized voice activity detection in adverse conditions | |
CN114400006A (en) | Speech recognition method and device | |
KR20220153852A (en) | Natural language processing apparatus for intent analysis and processing of multi-intent speech, program and its control method | |
Hentschel et al. | Feature-based learning hidden unit contributions for domain adaptation of RNN-LMs | |
CN118553249B (en) | Audio identification method, system and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||