
CN110648659B - Voice recognition and keyword detection device and method based on multitask model - Google Patents


Info

Publication number
CN110648659B
CN110648659B (application CN201910906552.1A)
Authority
CN
China
Prior art keywords
neural network
keyword
decoder
speech recognition
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910906552.1A
Other languages
Chinese (zh)
Other versions
CN110648659A (en)
Inventor
赖家豪
郑达
李索恒
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co ltd
Original Assignee
Shanghai Yitu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Information Technology Co ltd filed Critical Shanghai Yitu Information Technology Co ltd
Priority to CN201910906552.1A priority Critical patent/CN110648659B/en
Publication of CN110648659A publication Critical patent/CN110648659A/en
Priority to PCT/CN2020/090285 priority patent/WO2021057038A1/en
Application granted granted Critical
Publication of CN110648659B publication Critical patent/CN110648659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a voice recognition and keyword detection device based on a multitask model, comprising a neural network, a speech recognition decoder, a keyword decoder, and a training module. In the training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the same first input audio data, a second text label, and a second CTC loss function; during training, the output of each CTC loss function is back-propagated to train the neural network, the speech recognition decoder, and the keyword decoder. The invention also discloses a voice recognition and keyword detection method based on the multitask model. The device and method can effectively use speech recognition training data to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection.

Description

Voice recognition and keyword detection device and method based on multitask model
Technical Field
The invention relates to voice recognition, in particular to a voice recognition and keyword detection device and method based on a multitask model.
Background
Speech recognition, also known as automatic speech recognition (ASR), is a technology that converts an input speech signal, i.e., an audio signal, into corresponding text for output, and has important applications in artificial intelligence (AI).
An existing speech recognition device usually includes a neural network (NN), which forms a corresponding model through training. After a speech signal, i.e., an audio signal, is processed by feature extraction and input to the neural network, the network selects an optimal output path according to the trained model and produces a corresponding text signal for output. The neural network is typically a recurrent neural network (RNN), which is commonly trained under the Connectionist Temporal Classification (CTC) criterion. In CTC-based training, each training sample consists of an input audio signal and the label of the true output, and each node in the RNN has an initial weight. After the input audio signal is fed to the RNN, the RNN generates output data according to the weights of its internal nodes; the difference between this output and the true output label is computed by a CTC loss function, and the CTC loss is back-propagated to adjust the weight of each node in the RNN. Training finishes when the difference between the output data and the true output label falls below the required value, or when the change in that difference becomes small; each node of the trained RNN then holds its final weight and is applied to actual speech recognition. In actual recognition, the feature-extracted audio signal is input to the RNN, the RNN selects the output path with the highest score, i.e., the path for which the product of the per-node probabilities along the path is largest, and the corresponding text information is finally obtained by text decoding.
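The best-path decoding described above, taking the highest-probability symbol at each frame and then collapsing repeats and removing blanks, can be sketched in a few lines of Python. This is an illustrative toy, not code from the patent; the alphabet, probabilities, and function names are invented for the example.

```python
BLANK = 0  # index of the CTC blank symbol

def best_path_decode(frame_probs, id_to_char):
    """frame_probs: list of per-frame probability lists over the alphabet."""
    # 1. Greedy: take the argmax symbol at every time step.
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Collapse consecutive repeats, then drop blanks.
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(id_to_char[s])
        prev = s
    return "".join(out)

# Alphabet: 0 = blank, 1 = 'a', 2 = 'b'
probs = [
    [0.1, 0.8, 0.1],  # 'a'
    [0.1, 0.7, 0.2],  # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
    [0.1, 0.2, 0.7],  # 'b'
]
print(best_path_decode(probs, {1: "a", 2: "b"}))  # -> ab
```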
Speech recognition technology can recognize all text appearing in the speech. In some applications, however, keyword detection is also required: detected keywords can serve as commands in automatic control, or be used to monitor sensitive information appearing in communication voice, and so on. Common prior-art keyword detection either searches the top-n recognition results directly for keywords, or adds a bonus score to keywords in the decoder so that keyword hypotheses survive more easily in a beam search. These methods neither exploit the learning and modeling ability of deep neural networks nor make full use of the keyword information present in large amounts of training data.
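The prior-art score-boosting trick mentioned above can be sketched as follows. This is a hypothetical illustration (the function name, bonus value, and hypotheses are invented): it simply adds a flat bonus to any hypothesis containing a keyword before re-ranking, without using any learned model of the keywords.

```python
def rescore(hypotheses, keywords, bonus=2.0):
    """hypotheses: list of (text, log_score); returns a re-scored, re-ranked list."""
    boosted = []
    for text, score in hypotheses:
        if any(kw in text for kw in keywords):
            score += bonus  # flat additive bonus for keyword-bearing hypotheses
        boosted.append((text, score))
    return sorted(boosted, key=lambda h: h[1], reverse=True)

hyps = [("turn off the light", -3.0), ("turn of the light", -2.5)]
print(rescore(hyps, keywords=["turn off"]))
```

With the bonus, the keyword-bearing hypothesis outranks the otherwise higher-scoring one, which is exactly how such boosting keeps keywords alive in a beam search.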
Disclosure of Invention
The invention aims to provide a voice recognition and keyword detection device based on a multitask model that can effectively use speech recognition training data to simultaneously train the keyword detection capability of the model, thereby significantly improving the accuracy and recall of keyword detection. The invention also provides a corresponding voice recognition and keyword detection method based on the multitask model.
In order to solve the above technical problems, the present invention provides a voice recognition and keyword detection apparatus based on a multitask model, comprising: a neural network, a speech recognition decoder, and a keyword decoder.
The input end of the neural network is connected to input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module;
In the training phase, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. During training, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder. The final weight of each node of the neural network is obtained after training finishes.
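As a rough illustration of this two-loss scheme, the sketch below implements the CTC forward algorithm in plain Python and applies it twice to the same made-up network output, once with the full-transcript label and once with the keyword-only label; the sum of the two losses is what would be back-propagated into the shared network. All numbers, label ids, and function names here are invented for the example, not taken from the patent.

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """log(exp(a) + exp(b)), numerically stable."""
    if a == NEG_INF: return b
    if b == NEG_INF: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_loss(log_probs, labels, blank=0):
    """CTC negative log-likelihood of `labels` given per-frame log-probs
    (forward algorithm over the blank-extended label sequence)."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a = logadd(a, alpha[s - 1])
            # May skip the intervening blank only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    total = alpha[S - 1]
    if S > 1:
        total = logadd(total, alpha[S - 2])
    return -total

# Shared network output: 4 frames over a 3-symbol alphabet (0 = blank).
shared = [[math.log(p) for p in frame] for frame in [
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]]
full_labels = [1, 2]   # first text label: every word in the audio
kw_labels = [2]        # second text label: keywords only
total_loss = ctc_loss(shared, full_labels) + ctc_loss(shared, kw_labels)
print(round(total_loss, 3))
```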
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
In a further refinement, the first CTC loss function outputs the difference between the output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between the output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module.
In the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes the output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
In a further refinement, the multitask model based speech recognition and keyword detection means outputs the prediction result data based on a sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
In order to solve the above technical problems, the voice recognition and keyword detection method based on a multitask model according to the present invention uses a voice recognition and keyword detection apparatus based on a multitask model to perform voice recognition and keyword detection, wherein the apparatus comprises: a neural network, a speech recognition decoder, and a keyword decoder.
The input end of the neural network is connected to input audio data; the neural network has a plurality of nodes, and each node of the neural network has a weight.
The output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder.
The voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
In the training phase, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function. During training, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder. The final weight of each node of the neural network is obtained after training finishes.
In a further improvement, the first text label is a text label composed of all words appearing in the first input audio data, and the second text label is a text label composed of keywords appearing in the first input audio data.
In a further refinement, the first CTC loss function outputs the difference between the output signal of the speech recognition decoder and the first text label.
The second CTC loss function outputs the difference between the output signal of the keyword decoder and the second text label.
In a further improvement, the voice recognition and keyword detection device based on the multitask model further comprises an inference module, and the inference method comprises the following steps:
in the inference stage, second input audio data with unknown content is input into the neural network; the speech recognition decoder decodes the output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes the output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data for predicting the content of the second input audio data.
In a further refinement, the input audio data is subjected to feature processing before being input to the neural network.
In a further refinement, the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
In a further improvement, the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
In a further refinement, the neural network is a recurrent neural network.
The voice recognition and keyword detection device based on the multitask model is formed by adding a keyword decoder to a voice recognition device. The keyword decoder and the speech recognition decoder share the same neural network, which is trained on both tasks in parallel in the training stage, so the model formed by training the neural network is a multitask model: it has speech recognition and keyword detection capabilities at the same time. Speech recognition training data can thus be effectively used to simultaneously train the keyword detection capability of the model, so the accuracy and recall rate of keyword detection can be significantly improved.
Drawings
The invention is described in further detail below with reference to the following figures and embodiments:
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitasking model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a training phase of the apparatus for speech recognition and keyword detection based on multitasking model according to the embodiment of the present invention;
FIG. 3 is a flowchart of an inference phase of the apparatus for speech recognition and keyword detection based on multitask model according to the embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic structural diagram of a voice recognition and keyword detection apparatus based on a multitask model according to an embodiment of the present invention. The voice recognition and keyword detection device based on the multitask model comprises: a neural network 102; a speech recognition decoder 103 and a keyword decoder 104; and further comprises: a speech feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected to input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data is feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by short-time Fourier transform. The neural network 102 is a recurrent neural network.
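A minimal sketch of such STFT-based feature extraction, using NumPy. The frame length, hop size, and FFT size below are assumed typical values for 16 kHz audio, not parameters from the patent.

```python
import numpy as np

def stft_features(audio, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram via short-time Fourier transform."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)          # one-sided spectrum
        frames.append(np.log(np.abs(spec) + 1e-8))  # log-magnitude, floored
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)

# 1 s of a 440 Hz tone at an assumed 16 kHz sample rate
sr = 16000
t = np.arange(sr) / sr
feats = stft_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 257)
```

The resulting (frames, bins) matrix is the kind of spectral feature sequence that would be fed to the neural network 102.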
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, in the training phase, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training. The output of the first CTC loss function 105 is back-propagated to update the weights of the nodes of the neural network 102 and to train the speech recognition decoder 103, and the output of the second CTC loss function 106 is back-propagated to update the weights of the nodes of the neural network 102 and to train the keyword decoder 104; back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training finishes.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
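The relationship between the two label sets can be illustrated with a hypothetical helper; the transcript and keyword list here are invented for the example.

```python
def make_labels(transcript, keyword_list):
    """Build the two training labels: all words, and keyword-only."""
    first = transcript.split()                        # first text label: every word
    second = [w for w in first if w in keyword_list]  # second text label: keywords only
    return first, second

full, kw = make_labels("please turn off the light", {"turn", "off", "light"})
print(full, kw)
```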
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text label 107, finally achieving CTC-criterion training of the neural network and the speech recognition decoder 103.
The second CTC loss function 106 outputs the difference between the output signal of the keyword decoder 104 and the second text label 108, finally achieving CTC-criterion training of the neural network and the keyword decoder 104.
As shown in fig. 3, in the inference phase, second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 and obtains a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 and obtains a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection apparatus outputs prediction result data that predicts the content of the second input audio data 109b. Preferably, the apparatus outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
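Summing the two decoder scores to pick the final output can be sketched as follows; this is a toy illustration with invented candidate texts and log-scores.

```python
def fuse(asr_log_score, kw_log_score):
    """Combined score: sum of the two decoder log-scores (preferred embodiment)."""
    return asr_log_score + kw_log_score

def pick_best(candidates):
    """candidates: list of (text, asr_log_score, kw_log_score)."""
    return max(candidates, key=lambda c: fuse(c[1], c[2]))[0]

cands = [("open the door", -4.0, -1.0), ("open the drawer", -3.5, -2.0)]
print(pick_best(cands))  # -> open the door
```

Here the keyword decoder's strong score for "open the door" outweighs the speech recognition decoder's slight preference for the other candidate.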
The voice recognition and keyword detection device based on the multitask model is formed by adding the keyword decoder 104 to a voice recognition device. The keyword decoder 104 and the speech recognition decoder 103 share the same neural network 102, which is trained on both tasks in parallel in the training stage, so the model formed by training the neural network 102 is a multitask model: it has speech recognition and keyword detection capabilities at the same time. Speech recognition training data can thus be effectively used to simultaneously train the keyword detection capability of the model, so the keyword detection accuracy and recall rate can be significantly improved.
In the voice recognition and keyword detection method based on the multitask model according to the embodiment of the present invention, a voice recognition and keyword detection device based on the multitask model is used to perform voice recognition and keyword detection. The device comprises: a neural network 102; a speech recognition decoder 103 and a keyword decoder 104; and further comprises: a speech feature processing module 101, a training module, and an inference module. Fig. 2 is a flowchart of the training phase corresponding to the training module, and fig. 3 is a flowchart of the inference phase corresponding to the inference module.
The input end of the neural network 102 is connected to input audio data; the neural network 102 has a plurality of nodes, and each node of the neural network 102 has a weight.
Typically, the input audio data is feature-processed by the speech feature processing module 101 before being input to the neural network 102. Preferably, the feature processing extracts spectral features of the input audio data by short-time Fourier transform. The neural network 102 is a recurrent neural network.
The output data formed by the output of the neural network 102 is connected to the speech recognition decoder 103 and the keyword decoder 104, respectively.
As shown in fig. 2, the training method includes:
in the training stage, the training module trains the speech recognition decoder 103 and the neural network 102 using the first input audio data 109a, the first text label 107, and the first CTC loss function 105, and trains the keyword decoder 104 and the neural network 102 using the first input audio data 109a, the second text label 108, and the second CTC loss function 106. The speech recognition decoder 103 and the keyword decoder 104 run in parallel during training. The output of the first CTC loss function 105 is back-propagated to update the weights of the nodes of the neural network 102 and to train the speech recognition decoder 103, and the output of the second CTC loss function 106 is back-propagated to update the weights of the nodes of the neural network 102 and to train the keyword decoder 104; back propagation is indicated by dashed lines in fig. 2. The final weight of each node of the neural network 102 is obtained after training finishes.
The first text label 107 is a text label composed of all words appearing in the first input audio data 109a, and the second text label 108 is a text label composed of keywords appearing in the first input audio data 109a. The first input audio data 109a and the first text label 107 form a pair of samples corresponding to the speech recognition decoder 103; the first input audio data 109a and the second text label 108 form a pair of samples corresponding to the keyword decoder 104.
The first CTC loss function 105 outputs the difference between the output signal of the speech recognition decoder 103 and the first text label 107, finally achieving CTC-criterion training of the neural network and the speech recognition decoder 103.
The second CTC loss function 106 outputs the difference between the output signal of the keyword decoder 104 and the second text label 108, finally achieving CTC-criterion training of the neural network and the keyword decoder 104.
As shown in fig. 3, the inference method includes:
in the inference stage, the second input audio data 109b with unknown content is input to the neural network 102, the speech recognition decoder 103 decodes the output data of the neural network 102 to obtain a speech recognition score, and the keyword decoder 104 decodes the output data of the neural network 102 to obtain a keyword decoding score.
Based on the speech recognition score and the keyword decoding score, the multitask model based speech recognition and keyword detection device outputs prediction result data predicting the content of the second input audio data 109b. Preferably, the device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
The present invention has been described in detail with reference to the specific embodiments, but these should not be construed as limitations of the present invention. Many variations and modifications may be made by one of ordinary skill in the art without departing from the principles of the present invention, which should also be considered as within the scope of the present invention.

Claims (11)

1. A voice recognition and keyword detection device based on a multitask model is characterized by comprising: a neural network; a speech recognition decoder, a keyword decoder;
the input end of the neural network is connected with input audio data, the neural network is provided with a plurality of nodes, and each node of the neural network is provided with a weight;
the output data formed by the output end of the neural network is respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module;
in a training stage, the training module trains the speech recognition decoder and the neural network using first input audio data, a first text label, and a first CTC loss function, and the training module trains the keyword decoder and the neural network using the first input audio data, a second text label, and a second CTC loss function; in the training process, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the neural network and to train the keyword decoder; the final weight of each node of the neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by keywords appearing in the first input audio data;
the first CTC loss function outputs a difference between an output signal of the speech recognition decoder and the first text label;
the second CTC loss function outputs a difference between an output signal of the keyword decoder and the second text label.
2. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the voice recognition and keyword detection device based on the multitask model further comprises an inference module;
in an inference stage, second input audio data with unknown content is input into the neural network, the speech recognition decoder decodes output data of the neural network to obtain a speech recognition score, and the keyword decoder decodes output data of the neural network to obtain a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
3. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the input audio data is subjected to feature processing before being input to the neural network.
4. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: the feature processing is to extract spectral features of the input audio data by short-time Fourier transform.
5. The multitask model based speech recognition and keyword detection device according to claim 3, wherein: the voice recognition and keyword detection device based on the multitask model outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
6. The multitask model based speech recognition and keyword detection device according to claim 1, wherein: the neural network is a recurrent neural network.
7. A voice recognition and keyword detection method based on a multitask model is characterized in that: the voice recognition and keyword detection device based on the multitask model is adopted to perform voice recognition and keyword detection, and comprises: a recurrent neural network; a speech recognition decoder, a keyword decoder;
the input end of the recurrent neural network is connected with input audio data, the recurrent neural network is provided with a plurality of nodes, and each node of the recurrent neural network is provided with a weight;
the output data formed by the output end of the recurrent neural network are respectively connected to the speech recognition decoder and the keyword decoder;
the voice recognition and keyword detection device based on the multitask model further comprises a training module, and the training method comprises the following steps:
in a training phase, the training module trains the speech recognition decoder and the recurrent neural network using first input audio data, a first text label, and a first CTC loss function, and the training module trains the keyword decoder and the recurrent neural network using the first input audio data, a second text label, and a second CTC loss function; in the training process, the output of the first CTC loss function is back-propagated to update the weights of the nodes of the recurrent neural network and to train the speech recognition decoder, and the output of the second CTC loss function is back-propagated to update the weights of the nodes of the recurrent neural network and to train the keyword decoder; the final weight of each node of the recurrent neural network is obtained after training is finished;
the first text label is a text label formed by all words appearing in the first input audio data, and the second text label is a text label formed by key words appearing in the first input audio data;
the first CTC loss function outputs the difference between an output signal of the speech recognition decoder and the first text label;
the second CTC loss function outputs the difference between an output signal of the keyword decoder and the second text label.
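The training scheme of claim 7 — one shared recurrent network feeding two decoder heads, each trained with its own CTC loss whose gradient is back-propagated into the shared weights — can be sketched as follows. This is an illustrative sketch assuming PyTorch; all layer sizes, class counts, and names (`MultitaskModel`, `asr_head`, `kws_head`) are hypothetical choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class MultitaskModel(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_chars=5000, n_keywords=100):
        super().__init__()
        # Shared recurrent neural network (claim 6: the network is recurrent)
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        # Speech recognition decoder head (+1 class for the CTC blank symbol)
        self.asr_head = nn.Linear(hidden, n_chars + 1)
        # Keyword decoder head
        self.kws_head = nn.Linear(hidden, n_keywords + 1)

    def forward(self, feats):
        enc, _ = self.encoder(feats)              # (batch, time, hidden)
        return self.asr_head(enc), self.kws_head(enc)

model = MultitaskModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 2 utterances, 120 frames of 80-dim features each
feats = torch.randn(2, 120, 80)
in_lens = torch.full((2,), 120, dtype=torch.long)
full_labels = torch.randint(1, 5001, (2, 20))     # first text label: all words
kw_labels = torch.randint(1, 101, (2, 3))         # second text label: keywords only

asr_logits, kws_logits = model(feats)
# nn.CTCLoss expects (time, batch, classes) log-probabilities
asr_lp = asr_logits.log_softmax(-1).transpose(0, 1)
kws_lp = kws_logits.log_softmax(-1).transpose(0, 1)
loss = (ctc(asr_lp, full_labels, in_lens, torch.full((2,), 20, dtype=torch.long)) +
        ctc(kws_lp, kw_labels, in_lens, torch.full((2,), 3, dtype=torch.long)))
loss.backward()  # back-propagation updates both decoder heads and the shared RNN
opt.step()
```

Because both losses flow through the same encoder, one backward pass updates the shared node weights from both tasks, which is the essence of the multitask arrangement the claim describes.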
8. The multitask model based speech recognition and keyword detection method of claim 7, wherein: the multitask model based speech recognition and keyword detection device further comprises an inference module, and the inference method comprises the following steps:
in an inference stage, second input audio data with unknown content is input into the recurrent neural network, the speech recognition decoder decodes output data of the recurrent neural network and obtains a speech recognition score, and the keyword decoder decodes output data of the recurrent neural network and obtains a keyword decoding score;
based on the speech recognition score and the keyword decoding score, the multitask model-based speech recognition and keyword detection device outputs prediction result data that predicts the content of the second input audio data.
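Claims 8 and 11 describe combining the two decoders' scores by summation at inference time. The toy sketch below illustrates that combination using greedy CTC decoding over random dummy posteriors; greedy decoding and the name `greedy_ctc_score` are assumptions for illustration, since the claims do not fix a decoding algorithm.

```python
import numpy as np

def greedy_ctc_score(log_probs):
    """Greedy CTC decode over (time, classes) log-probabilities.

    Returns the collapsed label sequence and a score equal to the
    summed per-frame log-probabilities of the best path.
    """
    best = log_probs.argmax(axis=1)
    score = float(log_probs[np.arange(len(best)), best].sum())
    # Collapse repeats and drop blanks (class 0) to obtain the labels
    seq = [int(l) for i, l in enumerate(best)
           if l != 0 and (i == 0 or l != best[i - 1])]
    return seq, score

rng = np.random.default_rng(0)
asr_lp = np.log(rng.dirichlet(np.ones(6), size=10))   # 10 frames, 6 classes
kws_lp = np.log(rng.dirichlet(np.ones(4), size=10))   # keyword decoder posteriors

asr_seq, asr_score = greedy_ctc_score(asr_lp)
kws_seq, kws_score = greedy_ctc_score(kws_lp)
combined = asr_score + kws_score   # sum of speech recognition and keyword scores
```

The prediction result would then be chosen using `combined`, so a hypothesis supported by both the full-vocabulary decoder and the keyword decoder outranks one supported by only one of them.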
9. The multitask model based speech recognition and keyword detection method of claim 7, wherein: the input audio data is subjected to feature processing before being input to the recurrent neural network.
10. The multitask model based speech recognition and keyword detection method of claim 9, wherein: the feature processing is to extract spectral features of the input audio data by a short-time Fourier transform.
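The feature processing of claim 10 — spectral features via a short-time Fourier transform — can be sketched with NumPy alone. The window type, 25 ms frame length, and 10 ms hop at 16 kHz are illustrative choices not fixed by the claim.

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram via the short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame: (n_frames, n_fft // 2 + 1)
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spec + 1e-8)   # small floor avoids log(0)

# 1 second of a 100 Hz tone sampled at 16 kHz
sig = np.sin(2 * np.pi * 100 * np.arange(16000) / 16000)
feats = stft_features(sig)
print(feats.shape)  # (98, 257)
```

The resulting (frames, frequency-bins) matrix is the kind of input the recurrent neural network in the claims would consume, one frame per time step.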
11. The multitask model based speech recognition and keyword detection method of claim 8, wherein: the multitask model based speech recognition and keyword detection device outputs the prediction result data based on the sum of the speech recognition score and the keyword decoding score.
CN201910906552.1A 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model Active CN110648659B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910906552.1A CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model
PCT/CN2020/090285 WO2021057038A1 (en) 2019-09-24 2020-05-14 Apparatus and method for speech recognition and keyword detection based on multi-task model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910906552.1A CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model

Publications (2)

Publication Number Publication Date
CN110648659A CN110648659A (en) 2020-01-03
CN110648659B true CN110648659B (en) 2022-07-01

Family

ID=69011144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910906552.1A Active CN110648659B (en) 2019-09-24 2019-09-24 Voice recognition and keyword detection device and method based on multitask model

Country Status (2)

Country Link
CN (1) CN110648659B (en)
WO (1) WO2021057038A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648659B (en) * 2019-09-24 2022-07-01 上海依图信息技术有限公司 Voice recognition and keyword detection device and method based on multitask model
CN111261146B (en) * 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN112233655B (en) * 2020-09-28 2024-07-16 上海声瀚信息科技有限公司 Neural network training method for improving recognition performance of voice command words
CN114420105A (en) * 2020-10-13 2022-04-29 腾讯科技(深圳)有限公司 Training method, device, server and storage medium for speech recognition model
CN115206296A (en) * 2021-04-09 2022-10-18 京东科技控股股份有限公司 Method and device for speech recognition
CN113221555B (en) * 2021-05-07 2023-11-14 支付宝(杭州)信息技术有限公司 Keyword recognition method, device and equipment based on multitasking model
CN113314119B (en) * 2021-07-27 2021-12-03 深圳百昱达科技有限公司 Voice recognition intelligent household control method and device
CN113823274B (en) * 2021-08-16 2023-10-27 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance
CN113703579B (en) * 2021-08-31 2023-05-30 北京字跳网络技术有限公司 Data processing method, device, electronic device and storage medium
CN117275461B (en) * 2023-11-23 2024-03-15 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632487A (en) * 2015-12-31 2016-06-01 北京奇艺世纪科技有限公司 Voice recognition method and device
CN108735202A (en) * 2017-03-13 2018-11-02 百度(美国)有限责任公司 Convolution recurrent neural network for small occupancy resource keyword retrieval
CN109145281A (en) * 2017-06-15 2019-01-04 北京嘀嘀无限科技发展有限公司 Audio recognition method, device and storage medium
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN109599093A * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, device and readable storage medium for intelligent quality inspection

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646634B2 (en) * 2014-09-30 2017-05-09 Google Inc. Low-rank hidden input layer for speech recognition neural network
US9508340B2 (en) * 2014-12-22 2016-11-29 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN108538285B (en) * 2018-03-05 2021-05-04 清华大学 Multi-instance keyword detection method based on multitask neural network
CN108922521B (en) * 2018-08-15 2021-07-06 合肥讯飞数码科技有限公司 Voice keyword retrieval method, device, equipment and storage medium
CN109616102B (en) * 2019-01-09 2021-08-31 百度在线网络技术(北京)有限公司 Acoustic model training method and device and storage medium
CN109840287B (en) * 2019-01-31 2021-02-19 中科人工智能创新技术研究院(青岛)有限公司 Cross-modal information retrieval method and device based on neural network
CN110648659B (en) * 2019-09-24 2022-07-01 上海依图信息技术有限公司 Voice recognition and keyword detection device and method based on multitask model

Also Published As

Publication number Publication date
WO2021057038A1 (en) 2021-04-01
CN110648659A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
CN110648659B (en) Voice recognition and keyword detection device and method based on multitask model
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
CN106098059B (en) Customizable voice wake-up method and system
CN113987179B (en) Dialogue emotion recognition network model, construction method, electronic device and storage medium based on knowledge enhancement and retroactive loss
CN110517664B (en) Multi-party identification method, device, equipment and readable storage medium
CN112017645B (en) Voice recognition method and device
JP7070894B2 (en) Time series information learning system, method and neural network model
CN110197279B (en) Transformation model training method, device, equipment and storage medium
CN111653275B (en) Construction method and device of speech recognition model based on LSTM-CTC tail convolution, and speech recognition method
CN114490950B (en) Method and storage medium for training encoder model, and method and system for predicting similarity
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
WO2022222056A1 (en) Synthetic speech detection
CN114333790A (en) Data processing method, device, equipment, storage medium and program product
CN116450839A (en) Knowledge injection and training method and system for knowledge enhancement pre-training language model
CN112750469B (en) Method for detecting music in speech, method for optimizing speech communication and corresponding device
CN115132196B (en) Voice instruction recognition method and device, electronic equipment and storage medium
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN110648668A (en) Keyword detection device and method
CN114242045A (en) Deep learning method for natural language dialogue system intention
Bovbjerg et al. Self-supervised pretraining for robust personalized voice activity detection in adverse conditions
CN114400006A (en) Speech recognition method and device
KR20220153852A (en) Natural language processing apparatus for intent analysis and processing of multi-intent speech, program and its control method
Hentschel et al. Feature-based learning hidden unit contributions for domain adaptation of RNN-LMs
CN118553249B (en) Audio identification method, system and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant