CN111833844A - Training method and system of mixed model for speech recognition and language classification - Google Patents
- Publication number: CN111833844A (application CN202010739233.9A)
- Authority: CN (China)
- Prior art keywords: training, language, layer, speech recognition, language classification
- Prior art date: 2020-07-28
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L15/005: Speech recognition; Language recognition
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063: Training of speech recognition systems
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/26: Speech to text systems
- G10L25/18: Speech or voice analysis; the extracted parameters being spectral information of each sub-band
- G10L25/24: Speech or voice analysis; the extracted parameters being the cepstrum
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
Abstract
The embodiment of the invention provides a training method for a hybrid model for speech recognition and language classification. The method comprises the following steps: performing feature extraction and data alignment on mixed training audio data carrying text labels and language labels to determine input data for training; inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer; and after the speech recognition training is completed, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training. The embodiment of the invention also provides a training system for the hybrid model for speech recognition and language classification. By combining speech recognition and language classification, the embodiments of the invention simplify the system structure, save training cost, and improve the overall system performance of the hybrid model.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to a training method and a training system for a hybrid model for speech recognition and language classification.
Background
For multi-lingual speech recognition, separate speech recognition (ASR) modules for dialects and Mandarin and a separate language identification module are typically trained with neural networks on existing dialect and Mandarin audio. Audio fed into the system must first pass through the language identification module to determine which language it belongs to; the corresponding ASR module is then called to convert the speech into text, which is finally passed to other modules (such as semantic understanding and speech synthesis).
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
(1) High training and deployment costs
Multiple ASR modules for the dialects and for Mandarin, plus the language identification module, must each be prepared independently. Training multiple models is time-consuming, and multiple ASR resources must be deployed when going online, occupying considerable resources, so training and deployment costs are high.
(2) Interdependence between modules, mutual influence of performance
The accuracy of language identification affects the performance of the subsequent speech recognition, which places high demands on the language module. When the language is misidentified, the speech recognition performance is very likely to be poor, which in turn degrades the accuracy of the downstream modules.
(3) Poor integrability
An ASR module or a language module alone is hardly a truly usable product; in most cases it must be combined with other models (such as semantic understanding, speech synthesis, and a dialogue system) to form one. These models often need the recognized text and the language information at the same time, but a serial structure cannot output both simultaneously and therefore cannot meet this requirement, so integrability is poor.
Disclosure of Invention
Embodiments of the invention at least solve the problems in the prior art that training and deployment costs are high, that the modules are interdependent and affect each other's performance, and that integrability is poor.
In a first aspect, an embodiment of the present invention provides a training method for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network structure having N intermediate layers, the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputs a speech recognition result, and the language classification layer outputs a language classification result, the training method comprising:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby finishing the language classification training.
In a second aspect, an embodiment of the present invention provides a training system for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network structure having N intermediate layers, the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputs a speech recognition result, and the language classification layer outputs a language classification result, the training system comprising:
an input data determination program module, configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and a training program module, configured to train only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label after the speech recognition training is finished, thereby finishing the language classification training.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the training method for a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the model can additionally output language classification information while the speech recognition performance remains unchanged. This achieves the effect of combining the two tasks, simplifies the system structure, saves training cost, and improves the overall system performance of the hybrid model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training phase of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 3 is a network structure diagram of a training method for a hybrid model of speech recognition and language classification according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a training system for a hybrid model of speech recognition and language classification according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a training method for a hybrid model for speech recognition and language classification according to an embodiment of the present invention, where the hybrid model is a deep neural network structure having N intermediate layers, the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputs a speech recognition result, and the language classification layer outputs a language classification result. The method comprises the following steps:
S11: performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
S12: inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
S13: after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby finishing the language classification training.
In this embodiment, the speech recognition module and the language classification module are combined, and the combined hybrid module can output the recognized text and the language information at the same time. The focus is mainly on the training of the language part; the training of the speech recognition model is not restricted. As shown in Fig. 2, the training comprises four stages: data preparation, feature extraction, data alignment, and model training.
For step S11, training the hybrid module requires data preparation. Audio with text labels and language labels must be prepared. The audio may be labeled manually or in other ways; the correct transcript of the audio and the dialect or Mandarin to which it belongs must be determined. The higher the labeling accuracy, the better, since accurate labels benefit model training at a later stage. A purely illustrative sketch of such labeled data follows.
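For illustration only, labeled training data of the kind described above could be organized as a simple manifest in which each utterance records its audio file, its transcript, and its language tag; the field names, file paths, and language tags below are assumptions, not something specified by the patent.

```python
# Hypothetical manifest entries for the labeled training audio; all values are
# illustrative placeholders, not data from the patent.
training_manifest = [
    {"wav": "data/utt0001.wav", "text": "今天天气怎么样", "language": "mandarin"},
    {"wav": "data/utt0002.wav", "text": "今朝天气哪能", "language": "dialect"},
]
```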
After a large amount of labeled audio data has been collected, the WAV files and the corresponding text labels are organized, and feature extraction is performed on the audio using FBANK features. More specifically, the feature extraction of the mixed training audio data with text labels and language labels includes:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and extracting the m-dimensional FBANK features and Mel-frequency cepstral coefficient features of each frame in the mixed training audio data (the parameters given here are well-accepted values in the speech recognition field and may also be varied; for example, the features may be FBANK (filter bank), MFCC (Mel-frequency cepstral coefficients), or PLP (perceptual linear prediction cepstral coefficients), the frame length may be 20-40 ms, and the frame shift may be 10-20 ms). The data alignment is then determined from the Mel-frequency cepstral coefficient features. An illustrative feature-extraction sketch is given after this paragraph.
As an embodiment, the performing feature extraction and data alignment on mixed training audio data with text labels and language labels, and determining input data for training includes:
performing feature extraction on mixed training audio with text labels and language labels, and determining m-dimensional FBANK features and Mel cepstrum coefficient features of each frame in the mixed training audio, wherein the mixed training audio comprises multi-language audio, and the languages comprise Mandarin and dialects;
and carrying out supervised training on the mixed training audio and the m-dimensional Mel cepstrum coefficient characteristic of each frame, and determining the data alignment of each frame.
In this embodiment, supervised training requires knowing the phoneme and language information of each frame of each audio clip. This step may train a Gaussian mixture model (GMM), although a neural network model may also be trained to generate the alignment; the method is not limited. Mel-frequency cepstral coefficient (MFCC) features are extracted from the audio: the audio is framed with a window of 25 ms frame length and 10 ms frame shift, n-dimensional MFCC features are extracted for each frame, and a pronunciation dictionary is prepared (the pronunciation dictionary is a pre-prepared union of the phoneme sets of the dialects and of Mandarin). The corresponding GMM model is trained from the MFCC features and the pronunciation dictionary, and the corresponding frame-level data alignment is then generated. For speech recognition, the alignment of each frame is a phoneme; for language classification, the alignment of each frame is a language class or silence. A sketch of deriving the frame-level language targets from such an alignment is given below.
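As an illustration of the frame-level targets described above, the sketch below derives language-classification labels from a per-frame phoneme alignment, mapping silence frames to a silence class and all other frames to the utterance's language; the class inventory and silence-phone names are assumptions made for the example.

```python
# Assumed language-class inventory: silence plus one class per language.
LANG_CLASSES = {"sil": 0, "mandarin": 1, "dialect": 2}

def language_targets(phone_alignment, utterance_language,
                     silence_phones=frozenset({"sil", "sp"})):
    """Map a per-frame phoneme alignment to per-frame language targets."""
    return [
        LANG_CLASSES["sil"] if phone in silence_phones
        else LANG_CLASSES[utterance_language]
        for phone in phone_alignment
    ]

# Example: a 5-frame Mandarin utterance with leading/trailing silence.
print(language_targets(["sil", "t", "i", "an", "sil"], "mandarin"))  # [0, 1, 1, 1, 0]
```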
For step S12, the m-dimensional FBANK feature of each frame and the corresponding frame-level alignment prepared in step S11 are input to the N intermediate layers. The structure of the N intermediate layers is shown in Fig. 3; the intermediate layers of the neural network may adopt multiple layers of DNN (deep neural network), LSTM (long short-term memory network), FSMN (feedforward sequential memory network), and so on. The model has two outputs: one is the ASR output and the other is the language output, so the speech recognition result output by the speech recognition layer can be obtained. Because the data preparation provides the text label corresponding to the speech, the text label can be used as the training target, and the speech recognition result is learned from it, which improves the recognition effect. This speech recognition method is only an example; the training mode is not limited. In this way, the parameters used for speech recognition in the N intermediate layers are trained continuously. A minimal sketch of the branched network structure follows.
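A minimal sketch of such a branched structure, written here in PyTorch with an LSTM trunk and assumed layer sizes, is shown below; the patent does not prescribe this particular implementation, and a DNN or FSMN trunk would fit the same pattern.

```python
import torch.nn as nn

class HybridModel(nn.Module):
    """Shared intermediate layers with two output branches (ASR and language)."""
    def __init__(self, feat_dim=40, hidden=512, n_layers=4,
                 num_phones=200, num_langs=3):
        super().__init__()
        # N shared intermediate layers (here an LSTM stack; sizes are assumptions).
        self.trunk = nn.LSTM(feat_dim, hidden, num_layers=n_layers, batch_first=True)
        self.asr_head = nn.Linear(hidden, num_phones)   # speech recognition layer
        self.lang_head = nn.Linear(hidden, num_langs)   # language classification layer

    def forward(self, feats):                           # feats: (batch, frames, feat_dim)
        shared, _ = self.trunk(feats)
        return self.asr_head(shared), self.lang_head(shared)
```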
For step S13, after the speech recognition part has been trained, in one implementation, training based on the language classification result output by the language classification layer and the language label includes: based on a cross-entropy training criterion, performing classification optimization on the data alignment of each frame using maximum likelihood estimation, and fitting the language classification result to the language label.
After the speech recognition performance meets the requirements, an additional language output is added. When the language neural network is trained, the cross-entropy training criterion is adopted, and MLE (maximum likelihood estimation) is used to perform classification optimization on each frame so that the per-frame classification error rate is minimized. The gradient only propagates to and updates the parameters of the branched language NN layer; the neural network layers of the speech recognition part do not receive gradient updates, i.e. the parameters of that part of the network are unchanged, and only the neural network parameters of the language part are trained. In this way, language information can be output while the speech recognition performance remains unchanged, as sketched below.
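The second training stage described above can be sketched as follows, reusing the HybridModel sketch from earlier: the shared layers and the ASR branch are frozen, and only the language branch is updated with a per-frame cross-entropy loss. The data loader, optimizer, and hyperparameters are assumptions for illustration.

```python
import torch

model = HybridModel()
# Freeze the shared intermediate layers and the ASR branch; only the language
# classification layer will receive gradient updates.
for p in model.trunk.parameters():
    p.requires_grad = False
for p in model.asr_head.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.lang_head.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# lang_loader is an assumed data loader yielding (feats, per-frame language targets).
for feats, lang_targets in lang_loader:          # lang_targets: (batch, frames)
    _, lang_logits = model(feats)                # (batch, frames, num_langs)
    loss = criterion(lang_logits.reshape(-1, lang_logits.size(-1)),
                     lang_targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```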
According to this embodiment, speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the model can additionally output language classification information while the speech recognition performance remains unchanged. This achieves the effect of combining the two tasks, simplifies the system structure, saves training cost during training, and improves the overall system performance of the hybrid model.
Fig. 4 is a schematic structural diagram of a training system for a hybrid model for speech recognition and language classification according to an embodiment of the present invention, which can execute the training method for a hybrid model for speech recognition and language classification according to any of the above embodiments and is configured in a terminal.
The embodiment provides a training system of a hybrid model for speech recognition and language classification, which includes: an input data determination program module 11, an output program module 12 and a training program module 13.
The input data determining program module 11 is configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels, and determine input data for training; the output program module 12 is configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer; and the training program module 13 is configured to train only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label after the speech recognition training is completed, thereby completing the language classification training.
Further, the input data determination program module is for:
performing feature extraction on mixed training audio with text labels and language labels, and determining m-dimensional FBANK features and Mel cepstrum coefficient features of each frame in the mixed training audio, wherein the mixed training audio comprises multi-language audio, and the languages comprise Mandarin and dialects;
and carrying out supervised training on the mixed training audio and the m-dimensional Mel cepstrum coefficient characteristic of each frame, and determining the data alignment of each frame.
Further, the training program module is to:
based on a cross entropy training criterion, carrying out classification optimization on the data alignment of each frame by utilizing maximum likelihood estimation, and updating the language classification result to the language label.
Further, the input data determination program module is for:
and framing the mixed training audio data by using a window with the frame length of 25ms and the frame shift of 10ms, and determining m-dimensional FBANK characteristics and Mel cepstrum coefficient characteristics of each frame in the mixed training audio data.
Further, the structure of the N intermediate layers at least comprises: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the training method of the mixed model for speech recognition and language classification in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and after the speech recognition training is finished, only training the neural network parameters of the language classification layer based on the language classification result and the language label output by the language classification layer, and finishing the language classification training.
As a non-volatile computer-readable storage medium, the storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method of the hybrid model for speech recognition and language classification in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily intended to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, for example tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A training method for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network structure having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer outputting a speech recognition result and a language classification layer outputting a language classification result, the training method comprising:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and after the speech recognition training is finished, only training the neural network parameters of the language classification layer based on the language classification result and the language label output by the language classification layer, and finishing the language classification training.
2. The method of claim 1, wherein the performing feature extraction and data alignment on the mixed training audio data with text labels and language labels comprises:
performing feature extraction on mixed training audio with text labels and language labels, and determining m-dimensional FBANK features and Mel cepstrum coefficient features of each frame in the mixed training audio, wherein the mixed training audio comprises multi-language audio, and the languages comprise Mandarin and dialects;
and carrying out supervised training on the mixed training audio and the m-dimensional Mel cepstrum coefficient characteristic of each frame, and determining the data alignment of each frame.
3. The method of claim 1, wherein training only neural network parameters of the language classification layer based on the language classification result and the language label output by the language classification layer comprises:
based on a cross entropy training criterion, carrying out classification optimization on the data alignment of each frame by utilizing maximum likelihood estimation, and updating the language classification result to the language label.
4. The method of claim 1, wherein the feature extraction of the mixed training audio data with text labels and language labels comprises:
and framing the mixed training audio data by using a window with the frame length of 25ms and the frame shift of 10ms, and determining m-dimensional FBANK characteristics and Mel cepstrum coefficient characteristics of each frame in the mixed training audio data.
5. The method of claim 1, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
6. A training system for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network structure having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result, the training system comprising:
the input data determining program module is used for performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and the training program module is used for training only the neural network parameters of the language classification layer based on the language classification result and the language label output by the language classification layer after the speech recognition training is finished, and finishing the language classification training.
7. The system of claim 6, wherein the input data determination program module is to:
performing feature extraction on mixed training audio with text labels and language labels, and determining m-dimensional FBANK features and Mel cepstrum coefficient features of each frame in the mixed training audio, wherein the mixed training audio comprises multi-language audio, and the languages comprise Mandarin and dialects;
and carrying out supervised training on the mixed training audio and the m-dimensional Mel cepstrum coefficient characteristic of each frame, and determining the data alignment of each frame.
8. The system of claim 6, wherein the training program module is to:
based on a cross entropy training criterion, carrying out classification optimization on the data alignment of each frame by utilizing maximum likelihood estimation, and updating the language classification result to the language label.
9. The system of claim 6, wherein the input data determination program module is to:
and framing the mixed training audio data by using a window with the frame length of 25ms and the frame shift of 10ms, and determining m-dimensional FBANK characteristics and Mel cepstrum coefficient characteristics of each frame in the mixed training audio data.
10. The system of claim 6, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010739233.9A CN111833844A (en) | 2020-07-28 | 2020-07-28 | Training method and system of mixed model for speech recognition and language classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010739233.9A CN111833844A (en) | 2020-07-28 | 2020-07-28 | Training method and system of mixed model for speech recognition and language classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111833844A true CN111833844A (en) | 2020-10-27 |
Family
ID=72919152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010739233.9A Withdrawn CN111833844A (en) | 2020-07-28 | 2020-07-28 | Training method and system of mixed model for speech recognition and language classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111833844A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143327A (en) * | 2013-07-10 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Acoustic model training method and device |
CN109326277A (en) * | 2018-12-05 | 2019-02-12 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forces alignment model method for building up and system |
CN110033760A (en) * | 2019-04-15 | 2019-07-19 | 北京百度网讯科技有限公司 | Modeling method, device and the equipment of speech recognition |
CN110349564A (en) * | 2019-07-22 | 2019-10-18 | 苏州思必驰信息科技有限公司 | Across the language voice recognition methods of one kind and device |
CN110517664A (en) * | 2019-09-10 | 2019-11-29 | 科大讯飞股份有限公司 | Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing |
CN110930980A (en) * | 2019-12-12 | 2020-03-27 | 苏州思必驰信息科技有限公司 | Acoustic recognition model, method and system for Chinese and English mixed speech |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129925A (en) * | 2021-04-20 | 2021-07-16 | 深圳追一科技有限公司 | Mouth action driving model training method and assembly based on VC model |
CN113077781A (en) * | 2021-06-04 | 2021-07-06 | 北京世纪好未来教育科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN113077781B (en) * | 2021-06-04 | 2021-09-07 | 北京世纪好未来教育科技有限公司 | Speech recognition method, device, electronic device and storage medium |
CN113327596A (en) * | 2021-06-17 | 2021-08-31 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice recognition method and device |
CN114596845A (en) * | 2022-04-13 | 2022-06-07 | 马上消费金融股份有限公司 | Training method of voice recognition model, voice recognition method and device |
WO2023231576A1 (en) * | 2022-05-30 | 2023-12-07 | 京东科技信息技术有限公司 | Generation method and apparatus for mixed language speech recognition model |
CN115064157A (en) * | 2022-07-21 | 2022-09-16 | 北京达佳互联信息技术有限公司 | Training method and device of voice recognition model and voice recognition method and device |
CN115240632A (en) * | 2022-07-21 | 2022-10-25 | 中国平安人寿保险股份有限公司 | AI outbound assessment method, device, electronic equipment and storage medium |
CN115457942A (en) * | 2022-09-16 | 2022-12-09 | 中国科学院空天信息创新研究院 | An End-to-End Multilingual Speech Recognition Method Based on Hybrid Expert Model |
CN115457942B (en) * | 2022-09-16 | 2025-03-04 | 中国科学院空天信息创新研究院 | End-to-end multilingual speech recognition method based on hybrid expert model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111833844A (en) | Training method and system of mixed model for speech recognition and language classification | |
US12230250B2 (en) | Speech recognition method and apparatus, device, and storage medium | |
EP3857543B1 (en) | Conversational agent pipeline trained on synthetic data | |
US11664020B2 (en) | Speech recognition method and apparatus | |
CN113439301B (en) | Method and system for machine learning | |
US11062699B2 (en) | Speech recognition with trained GMM-HMM and LSTM models | |
CN107195296B (en) | Voice recognition method, device, terminal and system | |
CN106796787B (en) | Context interpretation using previous dialog behavior in natural language processing | |
CN111862942B (en) | Method and system for training mixed speech recognition model of Mandarin and Sichuan | |
CN112771607A (en) | Electronic device and control method thereof | |
WO2022252904A1 (en) | Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
WO2017184387A1 (en) | Hierarchical speech recognition decoder | |
Rasipuram et al. | Acoustic and lexical resource constrained ASR using language-independent acoustic model and language-dependent probabilistic lexical model | |
CN112216270B (en) | Speech phoneme recognition method and system, electronic equipment and storage medium | |
CN114171002A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN113793599A (en) | Training method of voice recognition model and voice recognition method and device | |
CN113555016A (en) | Voice interaction method, electronic equipment and readable storage medium | |
CN114267334A (en) | Speech recognition model training method and speech recognition method | |
CN113724690A (en) | PPG feature output method, target audio output method and device | |
CN117456999B (en) | Audio identification method, audio identification device, vehicle, computer device, and medium | |
JP4163207B2 (en) | Multilingual speaker adaptation method, apparatus and program | |
CN115359808A (en) | Method for processing voice data, model generation method, model generation device and electronic equipment | |
CN114255736A (en) | Rhythm labeling method and system | |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 215123 Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant after: Sipic Technology Co.,Ltd. Address before: 215123 Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: AI SPEECH Co.,Ltd. |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20201027 |