CN117953896A - Speech recognition method, device, electronic equipment and storage medium
- Publication number: CN117953896A
- Application number: CN202311007598.2A
- Authority
- CN
- China
- Legal status: Pending
Classifications
- G10L15/26 — Speech to text systems
- G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1822 — Parsing for meaning understanding
Abstract
The embodiments of the disclosure provide a speech recognition method and apparatus, an electronic device, and a storage medium, wherein the method comprises the following steps: acquiring target voice data and performing encoding processing to obtain semantic feature information; performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result; under the condition that the number of characters of the candidate text sequence is larger than a preset character number threshold, inputting the connection time sequence classification result into a first decoding model to perform fusion processing of an acoustic prediction scoring value and a language prediction scoring value, so as to obtain a speech recognition result, wherein the language prediction scoring value is obtained through conversion according to the acoustic prediction scoring value; and under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, so as to obtain the speech recognition result. In this way, both real-time performance and accuracy can be achieved during speech recognition.
Description
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of electronic technology, speech recognition is used more and more widely. In an outbound call scenario that supports barge-in (the user interrupting the broadcast), the speech recognition model continuously receives the voice signal of the user side in real time and judges in real time whether the broadcast needs to be interrupted, which places a high requirement on the real-time performance of decoding by the speech recognition model.
Decoding in a re-scoring mode can obtain a decoding result with higher accuracy, but this decoding mode has to wait until the entire voice stream input into the speech recognition model has been encoded before decoding and re-scoring are performed in a unified manner. If the previous voice stream has not finished decoding when a new voice stream is received, thread conflicts may occur and cause bugs, which reduces the user experience in the outbound scenario.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method and apparatus, an electronic device, and a storage medium, so as to balance the real-time performance and the accuracy of speech recognition.
In a first aspect, an embodiment of the present application provides a method for voice recognition, including:
Acquiring target voice data to be recognized;
Coding the target voice data to obtain semantic feature information of the target voice data;
Performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
Inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
and under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and obtaining a voice recognition result of the target voice data.
In a second aspect, an embodiment of the present application provides a voice recognition apparatus, including:
The acquisition unit is used for acquiring target voice data to be recognized;
The coding unit is used for coding the target voice data to obtain semantic feature information of the target voice data;
The decoding search unit is used for carrying out decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
The fusion unit is used for inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
And the re-scoring unit is used for inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value under the condition that the character number of the candidate text sequence is smaller than or equal to the preset character number threshold value, so as to obtain a voice recognition result of the target voice data.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to perform the speech recognition method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the speech recognition method according to the first aspect.
It can be seen that, in the embodiment of the present application, target voice data to be recognized is first acquired; secondly, the target voice data is encoded to obtain semantic feature information of the target voice data; then, decoding search processing is performed on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information, where the connection time sequence classification result includes a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence; when the number of characters of the candidate text sequence is greater than a preset character number threshold, the connection time sequence classification result is input into a first decoding model for fusion processing of the acoustic prediction scoring value and a language prediction scoring value to obtain a speech recognition result of the target voice data, the language prediction scoring value being obtained through conversion according to the acoustic prediction scoring value; and when the number of characters of the candidate text sequence is less than or equal to the preset character number threshold, the connection time sequence classification result is input into a second decoding model for re-scoring processing of the acoustic prediction scoring value to obtain the speech recognition result of the target voice data. In this way, the connection time sequence classification result can be obtained quickly by decoding and searching the semantic feature information; it can be regarded as an initial recognition result of lower accuracy that needs to be further decoded by other decoding modes. The character count of this initial recognition result is usually reliable, and it can therefore serve as a reference for selecting the decoding mode. When the number of characters is large, decoding in the re-scoring mode, which is more accurate but slower, cannot meet the real-time requirements of the service; instead, the connection time sequence classification result is input into the first decoding model for fusion of the acoustic prediction scoring value and the language prediction scoring value, which decodes faster than the re-scoring mode, so the real-time performance of speech recognition can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present specification, and other drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a process flow diagram of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an acoustic model training process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a decoding process of a speech recognition method according to an embodiment of the present application;
fig. 5 is a functional framework diagram of an outbound system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the embodiments of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
The voice recognition method provided in one or more embodiments of the present disclosure may be applied to an implementation environment of voice recognition, as shown in fig. 1, where the implementation environment includes at least a server 101 for performing voice recognition.
The server 101 may be a server, or a server cluster formed by a plurality of servers, or one or more cloud servers in a cloud computing platform, which are used for performing voice recognition processing on target voice data.
In this implementation environment, during the process of performing speech recognition, the server 101 first obtains target speech data to be recognized; secondly, coding the target voice data to obtain semantic feature information of the target voice data; then, carrying out decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information, wherein the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence; under the condition that the number of characters of the candidate text sequence is larger than a preset character number threshold value, inputting a connection time sequence classification result into a first decoding model to perform fusion processing of an acoustic prediction scoring value and a language prediction scoring value, and obtaining a voice recognition result of target voice data, wherein the language prediction scoring value is obtained by converting the acoustic prediction scoring value; under the condition that the number of characters of the candidate text sequence is smaller than or equal to a preset character number threshold, inputting a connection time sequence classification result into a second decoding model to perform the re-scoring of an acoustic prediction scoring value, so as to obtain a voice recognition result of target voice data.
The embodiment of a voice recognition method is provided in the specification:
The goal of speech recognition is to convert the lexical content in human speech into text. End-to-end speech recognition uses a pure neural network to replace the traditional approach of separately training and then combining an alignment model, an acoustic model, a language model and the like. In speech recognition, the process of converting an audio signal sequence into text may be referred to as decoding. Decoding in a re-scoring mode has high accuracy, but all encoding results corresponding to the voice stream input into the model must be available before decoding and re-scoring are performed in a unified manner, so the decoding time may be excessively long.
In an outbound call scenario that supports barge-in, the speech recognition model continuously receives the voice signal of the user side in real time and judges in real time whether the broadcast needs to be interrupted, which places a high requirement on the real-time performance of decoding by the speech recognition model. If a re-scoring decoding mode is adopted and a new voice stream arrives before the previous batch of voice streams has finished decoding, thread conflicts and bugs may occur. In order to solve the above problems, an embodiment of the present application provides a speech recognition method.
Fig. 2 is a process flow diagram of a voice recognition method according to an embodiment of the present application. Referring to fig. 2, the voice recognition method provided in the present embodiment specifically includes steps S202 to S210.
Step S202, target voice data to be recognized is obtained.
The target voice data to be recognized may be voice data for which there is a voice recognition demand.
The speech recognition method provided by this embodiment can be applied to an outbound scenario. For example, when a robot agent converses with a client during a call, the call system needs to collect the voice data of the client as the target voice data to be recognized and perform speech recognition processing on the target voice data to obtain a speech recognition result, so that the response voice of the robot agent to the client is generated according to the speech recognition result.
The robot agent and the client have high requirements on the real-time performance and accuracy of speech recognition when conversing during a call: if the robot agent takes a long time to reply to the client, or misunderstands the client's voice data and gives an irrelevant answer, the client may become dissatisfied.
The voice recognition method provided by the embodiment can also be applied to intelligent home scenes, for example, a user instructs the intelligent home robot to open a specified household appliance through voice. The intelligent home robot collects voice data of a user as target voice data, performs voice recognition processing according to the target voice data to obtain a voice recognition result, and further determines and executes an operation instruction corresponding to the voice recognition result.
The voice recognition method provided in this embodiment may also be used in other not-shown scenes where there is a voice recognition requirement.
In a specific implementation, obtaining the target voice data to be recognized includes: in the process of the robot agent broadcasting voice to the call user, receiving the interrupting voice of the call user, and determining the interrupting voice as the target voice data.
The talking user may be a user in a call with a robot agent.
The robot agent may serve the call user during the call.
Services that may be provided by the robot agents include, but are not limited to: business on/off services, advisory services, complaint services, transaction services, and the like.
The interrupting voice may be voice uttered by the call user before the robot agent has finished broadcasting voice to the call user.
For an outbound system that supports barge-in, when the interrupting voice of a call user is received, the robot agent stops broadcasting and responds to the interrupting voice as soon as possible. When a call user utters interrupting voice, the user may have low tolerance for delay; if the robot agent responds too slowly, the call user is likely to become dissatisfied and the call experience is reduced. Therefore, the real-time performance of speech recognition is required to be relatively high in this case, which in turn places requirements on the decoding efficiency in the speech recognition process.
When the interrupting voice of the call user is received, the interrupting voice may be determined as the target voice data to be recognized.
Step S204, coding the target voice data to obtain semantic feature information of the target voice data.
Encoding is the process of converting information from one form or format to another.
The semantic feature information may be a vector, where each element of the vector reflects a semantic feature corresponding to the target voice data.
In a specific implementation manner, encoding processing is performed on target voice data to obtain semantic feature information of the target voice data, including: performing audio feature extraction processing according to the target voice data to obtain audio feature information; and carrying out coding processing according to the audio feature information to obtain semantic feature information.
The audio feature information may be Fbank features in vector form.
Fbank is short for FilterBank. The response of the human ear to the sound spectrum is nonlinear, and Fbank is a front-end processing algorithm that processes audio in a manner similar to the human ear to improve speech recognition performance. The steps for obtaining the Fbank features of a speech signal may include: pre-emphasis, framing, windowing, STFT (short-time Fourier transform), Mel filtering, mean subtraction, and so on.
The audio feature extraction processing is performed on the target voice data, and the Fbank features can be obtained by performing a Fourier transform on the target voice data. The Fourier transform converts the time domain into the frequency domain; specifically, the conversion may be performed by means of time-domain multiplication and frequency-domain convolution. Fbank features include both time-domain features and frequency-domain features.
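As an illustrative sketch only (the patent does not prescribe a specific toolkit), the Fbank extraction described above could be implemented with torchaudio; the file path and parameter values below are assumptions.

```python
import torchaudio

# Load the target voice data (the path is a hypothetical example).
waveform, sample_rate = torchaudio.load("target_voice.wav")

# Compute Fbank (log Mel filterbank) features: pre-emphasis, framing,
# windowing, STFT and Mel filtering are handled inside this call.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,          # assumed feature dimension
    frame_length=25.0,        # ms
    frame_shift=10.0,         # ms
    sample_frequency=sample_rate,
)

# Mean subtraction (de-averaging) over the time axis, as mentioned above.
fbank = fbank - fbank.mean(dim=0, keepdim=True)
print(fbank.shape)  # (num_frames, num_mel_bins)
```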
Coding processing is performed according to the audio feature information to obtain semantic feature information corresponding to the audio feature information, and mapping processing can be performed on Fbank features to obtain semantic feature information.
Illustratively, the encoding processing in this embodiment may be implemented by a Conformer model.
The Conformer model combines a Transformer with a CNN (Convolutional Neural Network): the Transformer is good at capturing content-based global interactions, while the CNN effectively exploits local features. Combining the two enables the Conformer model to better model long-term global interaction information as well as local features.
The Transformer is a sequence model based on the self-attention mechanism. Its encoder part can effectively encode temporal information, its processing capability is much better than that of an LSTM (Long Short-Term Memory network), and it is fast. The Transformer is widely applied in natural language processing, computer vision, machine translation, speech recognition and other fields.
In a specific implementation, the audio feature information can be input into the Conformer model for encoding processing to obtain the corresponding semantic feature vector.
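A minimal sketch of this encoding step, assuming the torchaudio Conformer implementation and the hyperparameter values shown (the patent does not specify the encoder configuration):

```python
import torch
from torchaudio.models import Conformer

# Assumed hyperparameters; the patent does not fix these values.
encoder = Conformer(
    input_dim=80,                  # matches the Fbank dimension assumed above
    num_heads=4,
    ffn_dim=256,
    num_layers=6,
    depthwise_conv_kernel_size=31,
)

# Stand-in for the Fbank features of one utterance: (num_frames, 80).
fbank = torch.randn(200, 80)
features = fbank.unsqueeze(0)                 # add batch dimension: (1, 200, 80)
lengths = torch.tensor([features.shape[1]])   # valid frames per utterance

# semantic_features plays the role of the "semantic feature information".
semantic_features, out_lengths = encoder(features, lengths)
```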
Step S206, decoding and searching the semantic feature information according to a preset decoding and searching mode to obtain a connection time sequence classification result of the semantic feature information; the concatenated temporal classification result includes an acoustic predictive scoring value for a first character in the candidate text sequence.
The decoding search can be understood as: for an input voice signal, a recognition network is established according to the trained acoustic model and dictionary, a search algorithm is used for searching an optimal path in the network, and characters under the condition of maximum probability are output.
The preset decoding search modes include, but are not limited to: the ctc_prefix_beam_search (connectionist temporal classification prefix beam search) decoding search mode, the greedy_search (greedy search) decoding search mode, the beam search decoding search mode, and so on.
In this embodiment, the ctc_prefix_beam_search decoding search mode is taken as an example for illustration; other decoding search modes can refer to the description of the ctc_prefix_beam_search decoding search mode and will not be described in detail.
The connection timing classification result, namely the CTC (Connectionist Temporal Classification) result, may include the candidate text sequence and an acoustic prediction scoring value of the first character in the candidate text sequence.
In the implementation, the semantic feature information is decoded and searched in a ctc_prefix_beam_search decoding and searching mode, so that a candidate text sequence corresponding to the semantic feature information can be obtained.
The candidate text sequence may include one or more first characters. In the case that the candidate text sequence includes one first character, the acoustic prediction scoring value of that first character can be obtained by performing decoding search processing on the semantic feature information in the ctc_prefix_beam_search decoding search mode. In the case that the candidate text sequence includes a plurality of first characters, the acoustic prediction scoring value of each first character in the candidate text sequence can be obtained in the same way.
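For illustration only, the sketch below implements the simpler greedy_search alternative listed above rather than the patent's ctc_prefix_beam_search; tensor shapes and the blank index are assumptions.

```python
import torch

def ctc_greedy_search(log_probs: torch.Tensor, blank: int = 0):
    """Greedy CTC decoding over per-frame log-probabilities.

    log_probs: (num_frames, vocab_size) acoustic posteriors in the log domain.
    Returns candidate token ids and a per-token acoustic score
    (log posterior of the frame at which the token was emitted).
    """
    best_scores, best_ids = log_probs.max(dim=-1)   # best token per frame
    tokens, scores = [], []
    prev = blank
    for frame_id, frame_score in zip(best_ids.tolist(), best_scores.tolist()):
        # CTC collapse rule: drop blanks and repeated symbols.
        if frame_id != blank and frame_id != prev:
            tokens.append(frame_id)
            scores.append(frame_score)
        prev = frame_id
    return tokens, scores

# Random numbers stand in for real acoustic posteriors.
log_probs = torch.randn(100, 5000).log_softmax(dim=-1)
candidate, acoustic_scores = ctc_greedy_search(log_probs)
```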
The acoustic prediction scoring value may be used to represent a posterior probability of the corresponding first character, which may be predicted by an acoustic model.
The posterior probability refers to the probability re-estimated after the information about the "result" has been obtained; it corresponds to inferring the "cause" from the observed "effect". The prior probability and the posterior probability are inseparably linked, and the posterior probability is calculated on the basis of the prior probability.
For any first character, if the posterior probability of the first character is larger, the first character is an easily-identified character, and the decoding accuracy is higher; if the posterior probability of the first character is smaller, the first character is a character which is not easy to recognize, and the decoding accuracy is lower.
The posterior probability is smaller or larger, and can be determined by comparing the posterior probability with the numerical value of the preset probability threshold.
After the semantic feature information is decoded and searched according to a preset decoding and searching mode to obtain a corresponding CTC result, the number of first characters included in the candidate text sequence in the CTC result can be counted, and the number is determined as the number of characters of the candidate text sequence.
For example, the CTC result includes a text sequence 1, text sequence 1 being "the weather is good today", and the number of characters (of the original Chinese text) corresponding to text sequence 1 is 6.
After the number of characters of the candidate text sequence is obtained, it is compared with the preset character number threshold to obtain a comparison result.
The comparison result may include a first comparison result and a second comparison result.
The first comparison result indicates that the number of characters of the candidate text sequence is larger than a preset character number threshold; the second comparison result indicates that the number of characters of the candidate text sequence is smaller than or equal to a preset character number threshold.
The preset character number threshold may be a user-defined number of characters. For example, if the preset character number threshold is 8 and the number of characters of the candidate text sequence is 10, comparing 10 with 8 yields the first comparison result; if the number of characters of the candidate text sequence is 6, comparing 6 with 8 yields the second comparison result.
It should be noted that the number of candidate text sequences corresponding to the same target voice data may be one or more. Typically, the number of characters in each of a plurality of candidate text sequences corresponding to the same target speech data is the same.
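A sketch of the branching logic described in steps S208 and S210 below; the threshold value and both helper functions are illustrative assumptions rather than part of the patent.

```python
PRESET_CHAR_THRESHOLD = 8  # assumed value; the patent leaves this configurable

def decode_ctc_result(ctc_result):
    """Route the CTC result to shallow fusion or attention re-scoring
    depending on the character count of its candidate text sequences."""
    num_chars = len(ctc_result["candidates"][0]["tokens"])
    if num_chars > PRESET_CHAR_THRESHOLD:
        # First comparison result: faster shallow-fusion decoding (step S208).
        return shallow_fusion_decode(ctc_result)     # hypothetical helper
    # Second comparison result: slower but more accurate re-scoring (step S210).
    return attention_rescore_decode(ctc_result)      # hypothetical helper
```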
Step S208, under the condition that the number of characters of the candidate text sequence is larger than a preset character number threshold, inputting the connection time sequence classification result into a first decoding model to perform fusion processing of an acoustic prediction scoring value and a language prediction scoring value, and obtaining a voice recognition result of target voice data; the language prediction scoring value is converted according to the acoustic prediction scoring value.
And under the condition that the comparison result is the first comparison result, inputting the CTC result into a first decoding model for fusion processing of the acoustic prediction scoring value and the language prediction scoring value, and obtaining a voice recognition result of the target voice data.
The fusion processing of the acoustic prediction scoring value and the language prediction scoring value can be realized by shallow fusion. Specifically, shallow fusion refers to using a pre-trained LM (Language Model) together with an LAS decoder (Listen, Attend and Spell decoder, a speech recognition decoder) and integrating the outputs of the two.
Illustratively, the LM may be an n-gram model; in particular, n=3, i.e. the LM may be a 3-gram model.
The n-gram model is a language model commonly used in large-vocabulary continuous speech recognition. In the case that the language corresponding to the n-gram model is Chinese, the n-gram model may be a Chinese language model (CLM, Chinese Language Model). Through the n-gram model, the matching information between adjacent words in the context can be used to realize the automatic conversion from speech to Chinese characters.
The n-gram model is based on the assumption that the occurrence of the Nth word is related only to the preceding N-1 words and not to any other words, and that the probability of the whole sentence is the product of the occurrence probabilities of the individual words. N is a natural number greater than 0.
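A toy sketch of the 3-gram assumption just stated: the sentence probability factorizes into a product of conditional probabilities. The probability table below is invented purely for illustration.

```python
import math

# Toy 3-gram log-probability table: P(w3 | w1, w2). Values are invented.
trigram_logp = {
    ("<s>", "<s>", "今天"): math.log(0.20),
    ("<s>", "今天", "天气"): math.log(0.30),
    ("今天", "天气", "很好"): math.log(0.50),
}

def sentence_logprob(words):
    """log P(sentence) = sum over i of log P(w_i | w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + list(words)
    total = 0.0
    for i in range(2, len(padded)):
        total += trigram_logp.get(tuple(padded[i - 2 : i + 1]), math.log(1e-9))
    return total

print(sentence_logprob(["今天", "天气", "很好"]))
```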
In this embodiment, the output of the LAS decoder may be the CTC result, i.e., the output of the LAS decoder may be the candidate text sequence and the acoustic prediction scoring value of the first character in the candidate text sequence.
In the implementation process, the language prediction scoring value can be obtained by converting the acoustic prediction scoring value through the n-gram model, and the language prediction scoring value is the output of the LM. The fusion processing of the acoustic prediction scoring value and the language prediction scoring value is therefore equivalent to fusing the output of the LAS decoder with the output of the LM.
The speech recognition result of the target speech data may be a target text sequence corresponding to the target speech data.
In the case where the target speech data corresponds to a plurality of candidate text sequences, the target text sequence may be one of the plurality of candidate text sequences.
In a specific implementation manner, inputting a connection time sequence classification result into a first decoding model to perform fusion processing of an acoustic prediction scoring value and a language prediction scoring value to obtain a voice recognition result of target voice data, including: mapping the acoustic predictive scoring value of the first character to obtain a language predictive scoring value of the first character; obtaining a first target scoring value of the first character according to the first preset weight value, the acoustic prediction scoring value and the language prediction scoring value; and determining a voice recognition result of the target voice data according to the first target scoring value.
The language model may be trained in advance before the step S208 is performed. And inputting the acoustic predictive scoring value of the first character into a language model for mapping processing to obtain the language predictive scoring value of the first character.
The first preset weight value is a pre-configured weight parameter.
The obtaining a first target scoring value of the first character according to the first preset weight value, the acoustic predictive scoring value and the language predictive scoring value may be performing weighted summation processing according to the first preset weight value, the acoustic predictive scoring value of each first character and the language predictive scoring value of the first character to obtain the first target scoring value of the first character.
For example, the calculation process for generating the first target scoring value may refer to the following formula (1):
Y = argmax_y [ log p_TM(y|x) + λ · log p_LM(y) ]    (1)
wherein Y is used to represent the first target scoring value.
p_TM(y|x) is the posterior probability of the acoustic model, in this embodiment the acoustic prediction scoring value of the first character, and p_LM(y) is the posterior probability of the language model, in this embodiment the language prediction scoring value of the first character.
λ is a hyperparameter that can be tuned according to actual data; in this embodiment it refers to the first preset weight value.
The log function is the logarithmic function.
The argmax function returns the argument that maximizes the expression in brackets, i.e., it can be used to find the candidate with the largest score.
The first target scoring value may reflect the decoding accuracy of the first character, where a higher value of the first target scoring value indicates a higher likelihood that the first character is an accurate decoding result. Therefore, according to the first target scoring value of the first character in the CTC result, the speech recognition result corresponding to the CTC result, that is, the speech recognition result of the target speech data, can be determined.
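A minimal sketch of how formula (1) could be applied per first character, assuming log-domain scores; the weight value is an assumption.

```python
LAMBDA = 0.3  # first preset weight value; the actual value is tuned on real data

def first_target_score(acoustic_logp: float, lm_logp: float) -> float:
    """First target scoring value of one first character:
    log p_TM(y|x) + lambda * log p_LM(y), per formula (1)."""
    return acoustic_logp + LAMBDA * lm_logp
```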
In a specific implementation, determining a speech recognition result of the target speech data according to the first target scoring value includes: determining a first comprehensive evaluation score of the candidate text sequence according to the first target score value; and determining the candidate text sequence with the maximum first comprehensive evaluation score as a voice recognition result of the target voice data.
The number of candidate text sequences included in the CTC result may be plural.
For example, CTC results obtained after the decoding search are preset to retain only top-N candidate text sequences. N is a natural number greater than 1.
The top-N candidate text sequences are obtained as follows: a plurality of candidate text sequences are obtained through the decoding search; the total score of each candidate text sequence is determined according to the acoustic prediction scoring values of the first characters in that candidate text sequence; the candidate text sequences are sorted by total score from large to small; and the top-N candidate text sequences are then selected from the sorted results.
The total score of a candidate text sequence may reflect the decoding accuracy of the candidate text sequence, the higher the total score, i.e., the higher the decoding accuracy of the candidate text sequence.
In the same CTC result, each candidate text sequence may include M first characters, M being a natural number greater than 1.
And determining a first comprehensive evaluation score of the candidate text sequence according to the first target scoring values, which may be that the first target scoring values of the M first characters are summed to obtain the first comprehensive evaluation score of the candidate text sequence.
And determining a first comprehensive evaluation score of the candidate text sequence according to the first target scoring values, or performing mean value calculation processing on the first target scoring values of the M first characters to obtain the first comprehensive evaluation score of the candidate text sequence.
After the first comprehensive evaluation score of each candidate text sequence is obtained, each candidate text sequence may be ranked from large to small according to the value of the first comprehensive evaluation score, so as to determine a candidate text sequence with the largest first comprehensive evaluation score, that is, a target text sequence, and determine the target candidate text sequence as a speech recognition result of the target speech data.
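Putting the above together, a sketch of scoring and selecting the candidate text sequence; summation is used here, although averaging, as mentioned above, is equally valid, and all data-structure and helper names are assumptions.

```python
def pick_speech_recognition_result(candidates, lm_logprob, lam=0.3):
    """candidates: list of dicts, each with 'tokens' and 'acoustic_logps'.
    lm_logprob: callable mapping a token to a language prediction scoring value.
    Returns the candidate with the largest first comprehensive evaluation score."""
    best_text, best_score = None, float("-inf")
    for cand in candidates:
        per_char = [
            a + lam * lm_logprob(tok)                       # formula (1) per character
            for tok, a in zip(cand["tokens"], cand["acoustic_logps"])
        ]
        total = sum(per_char)          # first comprehensive evaluation score
        if total > best_score:
            best_text, best_score = cand["tokens"], total
    return best_text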
Step S210, under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and obtaining a voice recognition result of the target voice data.
Under the condition that the comparison result is the second comparison result, the CTC result is input into the second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and the voice recognition result of the target voice data is obtained.
The re-scoring processing may be attention re-scoring (attention rescoring decoding).
The effect of the re-scoring is to re-score the posterior probability of the first character in the candidate text sequence, introduce new language information, and improve the recognition effect.
The second decoding model may include a decoding model having a re-scoring function.
Illustratively, the decoding model with the re-scoring function may be a Transformer Decoder, i.e. the decoding module in the Transformer model described above.
The acoustic prediction scoring value is obtained through the decoding search, and the decoding search is fast but has low accuracy; therefore, the acoustic prediction scoring value needs to be re-scored in the attention re-scoring mode to improve accuracy.
In a specific implementation manner, inputting the connection time sequence classification result into the second decoding model to perform re-scoring processing of the acoustic prediction scoring value to obtain the voice recognition result of the target voice data includes: performing decoding processing according to the connection time sequence classification result to obtain a re-scoring value of the first character; determining a second target scoring value of the first character according to a second preset weight value, the acoustic prediction scoring value and the re-scoring value; and determining the voice recognition result of the target voice data according to the second target scoring value.
In the implementation, the candidate text sequence and the acoustic prediction scoring value of the first character in the candidate text sequence are input into the Transformer Decoder for decoding processing, so that the second target scoring value of the first character is obtained.
The second preset weight value may be a preset weight parameter.
The second target scoring value of the first character is determined according to the second preset weight value, the acoustic prediction scoring value and the re-scoring value; this may be done by performing weighted summation according to the second preset weight value, the acoustic prediction scoring value of each first character, and the re-scoring value of that first character, to obtain the second target scoring value of the first character.
The second target scoring value may reflect the decoding accuracy of the first character, with a higher value of the second target scoring value indicating a higher likelihood that the first character is an accurate decoding result. Therefore, according to the second target scoring value of the first character in the CTC result, the voice recognition result corresponding to the CTC result, that is, the voice recognition result of the target voice data, can be determined.
It should be noted that, the first target scoring value and the second target scoring value have the same function, but different acquisition modes. The "first" and "second" in this specification are presented to facilitate distinguishing between two similar features, and have no actual meaning, and are not described in detail below.
Determining a voice recognition result of the target voice data according to the second target scoring value, which may be determining a second comprehensive evaluation score of the candidate text sequence according to the second target scoring value; and determining the candidate text sequence with the largest second comprehensive evaluation score as a voice recognition result of the target voice data.
The number of candidate text sequences included in the CTC result may be plural. In the same CTC result, each candidate text sequence may include M first characters, M being a natural number greater than 1.
And determining a second comprehensive evaluation score of the candidate text sequence according to the second target scoring values, wherein the second target scoring values of the M first characters are summed to obtain the second comprehensive evaluation score of the candidate text sequence.
And determining a second comprehensive evaluation score of the candidate text sequence according to the second target scoring values, or performing mean value calculation processing on the second target scoring values of the M first characters to obtain the second comprehensive evaluation score of the candidate text sequence.
After the second comprehensive evaluation score of each candidate text sequence is obtained, each candidate text sequence may be ranked from large to small according to the magnitude of the value of the second comprehensive evaluation score, so as to determine a candidate text sequence with the largest second comprehensive evaluation score, that is, a target text sequence, and determine the target candidate text sequence as a speech recognition result of the target speech data.
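A sketch of the re-scoring branch under the same assumptions as the shallow-fusion sketch above; `rescore_logprob` stands in for the attention decoder's output and is a hypothetical callable.

```python
def pick_result_by_rescoring(candidates, rescore_logprob, weight=0.5):
    """weight: second preset weight value (assumed).
    rescore_logprob(tokens) -> list of per-character re-scoring values
    produced by the attention decoder (e.g. a Transformer Decoder)."""
    best_text, best_score = None, float("-inf")
    for cand in candidates:
        rescored = rescore_logprob(cand["tokens"])
        per_char = [
            a + weight * r
            for a, r in zip(cand["acoustic_logps"], rescored)
        ]
        total = sum(per_char)          # second comprehensive evaluation score
        if total > best_score:
            best_text, best_score = cand["tokens"], total
    return best_text
```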
On the one hand, the second decoding model adopts the re-scoring technique of the acoustic model and includes a decoder with the re-scoring function; in the process of generating the speech recognition result, the neural network in the decoder has to be invoked once for each first character in the candidate text sequence, and the neural network has a complex structure and a long data processing time. On the other hand, the time complexity of the decoder is the square of n, where n is determined by the number of characters of the candidate text sequence and is a natural number greater than 1. Therefore, as the number of characters of the candidate text sequence increases, the total data processing duration corresponding to the CTC result increases greatly and is strongly affected by the number of characters of the candidate text sequence.
The first decoding model adopts the shallow fusion technique, which takes both the acoustic model and the language model into account. The structure of the language model is far less complex than that of the neural network in the acoustic model, and the data processing time required for the mapping processing with the language model is significantly shorter than that required for invoking the neural network in the acoustic model; even if the candidate text sequence contains many characters, the total data processing duration corresponding to the CTC result remains relatively small.
In this embodiment, decoding is performed with the shallow fusion technique when the number of characters of the candidate text sequence is greater than the preset character number threshold, and with the re-scoring technique when the number of characters is less than or equal to the preset character number threshold. This takes into account that the re-scoring technique is more accurate than the shallow fusion technique, but when the candidate text sequence contains many characters the decoding time required by the re-scoring technique is long and may fail to meet the real-time requirements of the actual application scenario, so a faster decoding mode has to be used instead.
By decoding with the re-scoring technique when the number of characters of the candidate text sequence is small and with the shallow fusion technique when the number of characters is large, the decoding accuracy can be improved as much as possible on the premise that timeliness meets the business requirements, regardless of whether the target voice data to be recognized is long or short, thereby achieving both accuracy and real-time performance of speech recognition.
In a specific implementation, the speech recognition method further includes: performing intention recognition processing according to the voice recognition result of the target voice data to obtain a corresponding intention recognition result; determining a target response text of the intention recognition result according to the intention recognition result and the correspondence between preset intents and response texts; and generating, according to the target response text, the response voice of the robot agent for the interrupting voice.
The speech recognition result may be a target text sequence.
In specific implementation, a plurality of preset intents may be configured in advance, and a corresponding response text is configured for each preset intention, so as to form a correspondence between the intention and the response text.
And carrying out intention recognition processing on the target text sequence to obtain an intention recognition result of the target text sequence.
Determining the target response text of the intention recognition result according to the intention recognition result and the correspondence between preset intents and response texts may be performed by querying the correspondence with the intention recognition result to obtain the target response text corresponding to the intention recognition result.
Generating, according to the target response text, the response voice of the robot agent for the interrupting voice may be performed by applying text-to-speech processing to the target response text to obtain the response voice of the robot agent for the interrupting voice.
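A toy sketch of this response flow; the intent labels, the mapping table, and the `classify_intent` and `text_to_speech` helpers are all assumptions for illustration.

```python
# Pre-configured correspondence between preset intents and response texts.
INTENT_TO_RESPONSE = {
    "ask_balance": "您的账户余额已发送到您的手机，请注意查收。",
    "complain":    "非常抱歉给您带来不便，已为您转接人工坐席。",
}

def respond_to_interrupting_voice(recognition_result, classify_intent, text_to_speech):
    """classify_intent and text_to_speech are hypothetical helpers:
    intent classification and TTS are outside the scope of this excerpt."""
    intent = classify_intent(recognition_result)              # intention recognition result
    response_text = INTENT_TO_RESPONSE.get(intent, "抱歉，我没有听清，请再说一遍。")
    return text_to_speech(response_text)                      # response voice of the robot agent
```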
In addition, the above steps S204 to S210 may be replaced with:
Inputting the target voice data into a speech recognition model for speech recognition processing to obtain a speech recognition result corresponding to the target voice data. The speech recognition model includes: an encoding module, a decoding search module, a shallow fusion module and a re-scoring module. The encoding module is used for encoding the target voice data to obtain semantic feature information of the target voice data. The decoding search module is used for performing decoding search processing on the semantic feature information according to the preset decoding search mode to obtain the connection time sequence classification result of the semantic feature information; the connection time sequence classification result includes the candidate text sequence and the acoustic prediction scoring value of the first character in the candidate text sequence. The shallow fusion module is used for performing fusion processing of the acoustic prediction scoring value and the language prediction scoring value according to the connection time sequence classification result when the number of characters of the candidate text sequence is greater than the preset character number threshold, to obtain the speech recognition result of the target voice data; the language prediction scoring value is obtained through conversion according to the acoustic prediction scoring value. The re-scoring module is used for performing re-scoring processing of the acoustic prediction scoring value according to the connection time sequence classification result when the number of characters of the candidate text sequence is less than or equal to the preset character number threshold, to obtain the speech recognition result of the target voice data. These steps form a new implementation together with the other processing steps in this embodiment.
The shallow fusion module corresponds to the first decoding model, and the re-scoring module corresponds to the second decoding model.
The speech recognition model may be a model for converting lexical content in human speech into textual content.
In the speech recognition model, the encoding module is connected to the decoding search module, and the output of the encoding module is the input of the decoding search module; the decoding search module is connected to the shallow fusion module and to the re-scoring module, and the output of the decoding search module is the input of the shallow fusion module as well as the input of the re-scoring module.
Under the condition that the number of characters of the candidate text sequence is greater than the preset character number threshold, the CTC result output by the decoding search module is transmitted to the shallow fusion module; under the condition that the number of characters of the candidate text sequence is less than or equal to the preset character number threshold, the CTC result output by the decoding search module is transmitted to the re-scoring module.
The shallow fusion module comprises a language sub-module, and the language sub-module is used for carrying out mapping processing on the acoustic prediction scoring value of the first character to obtain the language prediction scoring value of the first character. The language submodule corresponds to the LM described above.
The re-scoring module comprises a decoding sub-module, and the decoding sub-module is used for performing decoding processing according to the CTC result to obtain the re-scoring value of the first character. The decoding sub-module corresponds to the decoding model with the re-scoring function described above, for example the Transformer Decoder.
The speech recognition model may be obtained as follows: firstly, training an acoustic model and training a language model respectively, and constructing a voice recognition model based on the trained acoustic model, the trained language model and other modules which do not involve parameter updating.
Wherein the training of the acoustic model involves the encoding module and the decoding sub-module in the re-scoring module, and the training of the language model involves the language sub-module in the shallow fusion module. The decoding search module, the part of the re-scoring module other than the decoding sub-module, and the part of the shallow fusion module other than the language sub-module do not participate in model training and do not involve updating of model parameters, so they can be put into use directly after the training of the acoustic model and the training of the language model is finished.
The acoustic model to be trained includes an encoding module, a linear layer, and a decoding submodule. The output of the encoding module may be the input of the linear layer, or the output of the encoding module may be the input of the decoding submodule.
The training flow of the acoustic model is as follows:
(a1) Acquiring a voice sample;
(a2) Inputting the voice sample into a coding module for coding processing to obtain sample semantic feature information of the voice sample;
(a3) Inputting the sample semantic feature information into a linear layer for mapping processing to obtain text sequence information;
(a4) Inputting the sample semantic feature information into a decoding sub-module for decoding processing to obtain a voice recognition result of a voice sample;
(a5) And generating training loss of the acoustic model to be trained according to the text sequence information and the voice recognition result, and training based on the training loss to obtain the trained acoustic model.
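To make the order of steps (a1)-(a5) concrete, the following minimal, self-contained sketch is given (PyTorch assumed); the stand-in modules, shapes, dictionary size and dummy data are illustrative assumptions rather than the actual acoustic model.

```python
import torch
import torch.nn as nn

vocab_size, feat_dim, hidden = 1000, 80, 256
encoder = nn.GRU(feat_dim, hidden, batch_first=True)   # stand-in for the encoding module
linear_layer = nn.Linear(hidden, vocab_size)            # maps features onto the text dictionary
decoder = nn.Linear(hidden, vocab_size)                 # greatly simplified stand-in for the decoding submodule
ctc_loss = nn.CTCLoss(blank=0)
att_loss = nn.CrossEntropyLoss()

# (a1) a dummy voice sample (1 utterance, 50 feature frames) and its 5-character text label
speech = torch.randn(1, 50, feat_dim)
label = torch.randint(1, vocab_size, (1, 5))

# (a2) encoding processing -> sample semantic feature information
semantic, _ = encoder(speech)
# (a3) linear-layer mapping -> text sequence information (per-frame character probabilities)
text_seq = linear_layer(semantic).log_softmax(-1)        # shape (1, 50, vocab_size)
# (a4) decoding processing -> speech recognition prediction result (toy version)
prediction = decoder(semantic[:, :5, :])                 # shape (1, 5, vocab_size)

# (a5) training loss as in formula (2): weighted sum of CTC loss and attention loss
k = 0.3
l_ctc = ctc_loss(text_seq.transpose(0, 1), label,
                 torch.tensor([50]), torch.tensor([5]))
l_att = att_loss(prediction.reshape(-1, vocab_size), label.reshape(-1))
loss = k * l_ctc + (1 - k) * l_att
loss.backward()
```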
The voice sample may include voice data whose duration is a preset time length, and may further include a text label corresponding to the voice data.
For each voice data, a text label corresponding to the voice data can be generated by performing a voice recognition process and an indexing process on the voice data.
The voice data included in the voice sample may be obtained through a segmentation process and/or a concatenation process.
For example, Mandarin Chinese speech data with a total duration of ten thousand hours is subjected to segmentation processing and/or concatenation processing in advance according to the preset time length, so as to obtain a plurality of voice samples.
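As a hedged illustration of this splitting step, the sketch below cuts a long recording into fixed-length samples; the 10-second preset length, the 16 kHz rate and the list-based waveform are assumptions.

```python
# Illustrative sketch: split long recordings into voice samples of a preset duration.
# The 10 s preset length, 16 kHz sample rate and list-based audio are assumptions.
PRESET_SECONDS = 10
SAMPLE_RATE = 16000

def split_into_samples(waveform):
    """Cut a long waveform (list of samples) into fixed-length voice samples."""
    chunk = PRESET_SECONDS * SAMPLE_RATE
    pieces = [waveform[i:i + chunk] for i in range(0, len(waveform), chunk)]
    # Short trailing pieces could instead be concatenated with the next recording.
    return [p for p in pieces if len(p) == chunk]

long_recording = [0.0] * (SAMPLE_RATE * 25)       # a dummy 25-second recording
print(len(split_into_samples(long_recording)))    # -> 2 full 10-second samples
```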
And the linear layer is used for carrying out mapping processing on the sample semantic feature information to obtain text sequence information.
The linear layer may be a linear network. The text sequence information may include, for each entry in a pre-configured fixed-capacity text dictionary, the prediction probability of that character given the semantic feature information; that is, it is obtained by mapping the semantic feature information onto the text dictionary.
For example, if the text dictionary includes 10000 characters, the text sequence information may reflect, for each of the 10000 characters, the prediction probability that the character belongs to the text sequence corresponding to the semantic feature information, so that an index sequence can be obtained based on these prediction probabilities and the fixed index corresponding to each character.
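The mapping can be pictured with the following minimal sketch (PyTorch assumed); the feature dimension, dictionary size and random input are illustrative assumptions.

```python
# Sketch of the linear-layer mapping: per-frame semantic features are projected onto a
# fixed-capacity text dictionary and turned into prediction probabilities and indices.
import torch
import torch.nn as nn

dict_size, feat_dim = 10000, 256
linear_layer = nn.Linear(feat_dim, dict_size)

semantic_features = torch.randn(1, 20, feat_dim)          # (batch, frames, features)
probs = linear_layer(semantic_features).softmax(dim=-1)   # per-character prediction probabilities
index_sequence = probs.argmax(dim=-1)                     # fixed index of the best character per frame
print(index_sequence.shape)                                # torch.Size([1, 20])
```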
The decoding submodule may be a Transformer Decoder.
The voice sample includes a text label; generating the training loss of the acoustic model to be trained according to the text sequence information and the speech recognition prediction result includes: generating a first function value of a first loss sub-function according to the text sequence information and the text label, and generating a second function value of a second loss sub-function according to the speech recognition prediction result and the text label; and generating the training loss of the acoustic model to be trained according to the first function value and the second function value.
The first loss sub-function may be CTC loss.
The input of the CTC loss may include the text label and the output of the linear layer, i.e., the aforementioned text sequence information; the output of the CTC loss is the loss function value of the first loss sub-function. By calculating the CTC loss, the sequences can be aligned automatically and the sequence similarity between the input vector and the preset label text is computed, thereby achieving the goal of modeling the text content. The CTC loss is used to drive the speech recognition model in training to improve the prediction accuracy of the vector sequence, and to improve the accuracy with which the encoding module in the speech recognition model converts acoustics into text vectors.
The second loss sub-function may be an Attention loss, referred to as ATT loss.
The input of the ATT loss may include the text label and the output of the decoding submodule, i.e., the speech recognition prediction result; the output of the ATT loss is the loss function value of the second loss sub-function. By calculating the ATT loss, the decoding submodule can be driven to learn how to improve the accuracy of re-scoring the acoustic prediction scoring value.
Generating the loss function value from the first function value and the second function value can refer to the following formula (2):

L = k * L_ctc + (1 - k) * L_att (2)

Where L is used to represent the total loss function value in this training, k is used to represent the weight of the CTC loss, (1 - k) is used to represent the weight of the ATT loss, L_ctc is used to represent the CTC loss value, and L_att is used to represent the ATT loss value.
The value of k can be custom set, for example, to 0.3; k may be a real number greater than 0 and less than 1.
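Purely as a numeric illustration of formula (2) (the loss values here are assumed, not measured): with k = 0.3, L_ctc = 2.0 and L_att = 1.0, the total loss is L = 0.3 × 2.0 + 0.7 × 1.0 = 1.3.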
The trained language model can be obtained by training based on the chain rule and maximum likelihood estimation. The text samples used for training the language model may correspond to the voice samples in the foregoing (a1).
After the trained acoustic model and the trained language model are obtained, a speech recognition model can be built from the trained acoustic model, the trained language model, the decoding search module that does not participate in model training, the part of the re-scoring module other than the decoding submodule, and the part of the shallow fusion module other than the language submodule; the speech recognition model can be used to implement the speech recognition method provided by this embodiment.
In the embodiment shown in fig. 2, first, target voice data to be recognized is acquired; secondly, the target voice data is encoded to obtain semantic feature information of the target voice data; then, decoding search processing is performed on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information, where the connection time sequence classification result includes a candidate text sequence and an acoustic prediction scoring value of the first character in the candidate text sequence; when the number of characters of the candidate text sequence is greater than a preset character number threshold, the connection time sequence classification result is input into a first decoding model for fusion processing of the acoustic prediction scoring value and a language prediction scoring value to obtain a voice recognition result of the target voice data, the language prediction scoring value being obtained by conversion from the acoustic prediction scoring value; and when the number of characters of the candidate text sequence is less than or equal to the preset character number threshold, the connection time sequence classification result is input into a second decoding model for re-scoring processing of the acoustic prediction scoring value to obtain the voice recognition result of the target voice data.

In this way, the connection time sequence classification result can be obtained quickly through decoding search processing of the semantic feature information; this result can be regarded as an initial recognition result with relatively low accuracy that needs to be further decoded in another decoding mode. In addition, the number of characters of the initial recognition result is usually accurate and can serve as a reference for selecting the decoding mode. When the number of characters is large, decoding in the re-scoring mode, which is more accurate but slower, cannot meet the real-time requirement of the service; in that case, inputting the connection time sequence classification result into the first decoding model for fusion processing of the acoustic prediction scoring value and the language prediction scoring value yields the voice recognition result at a higher decoding speed than the re-scoring mode, thereby improving the real-time performance of voice recognition.
Fig. 3 is a schematic diagram of a training process of an acoustic model according to an embodiment of the present application.
As shown in fig. 3, the encoding module 302 is configured to perform encoding processing according to the voice sample, so as to obtain corresponding sample semantic feature information. The sample semantic feature information may be used to generate connection timing classification loss 308.
The connection timing classification penalty 308 may be a CTC penalty.
The decoding module 304 is configured to perform decoding processing according to the sample semantic feature information to obtain a speech recognition prediction result, where the speech recognition prediction result may be used to generate the attention loss 306.
The attention loss 306 may be an ATT loss.
Based on the connection timing classification loss 308 and the attention loss 306, a total loss 310 may be generated.
Since the technical conception is the same, the description in this embodiment is relatively simple, and the relevant parts only need to refer to the corresponding descriptions of the method embodiments provided above.
Fig. 4 is a schematic diagram of a decoding process of a speech recognition method according to the present embodiment.
As shown in fig. 4, the encoding module 402 is configured to encode target voice data to obtain semantic feature information of the target voice data.
The encoding module 402 may be a Conformer encoder, i.e., the encoding module in the Conformer model.
In implementation, for example, the data stream is sequentially input to the Conformer encoder in a streaming decoding manner with a chunk size of 16, i.e., a duration of 0.64 s per chunk.
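As an informal sketch of the chunk-wise streaming input, the feature frames can be fed to the encoder chunk by chunk; the 40 ms stride per subsampled frame, which makes a 16-frame chunk correspond to 0.64 s, is an assumption of this sketch.

```python
# Sketch of chunk-wise streaming input to the encoder; the chunk size of 16 frames and
# the 40 ms stride per (subsampled) frame, giving 0.64 s per chunk, are assumptions.
CHUNK_FRAMES = 16
FRAME_STRIDE_MS = 40     # assumed stride after feature subsampling

def stream_chunks(feature_frames):
    """Yield successive chunks of feature frames for streaming decoding."""
    for start in range(0, len(feature_frames), CHUNK_FRAMES):
        yield feature_frames[start:start + CHUNK_FRAMES]

frames = list(range(80))                       # a dummy 80-frame utterance (about 3.2 s)
for i, chunk in enumerate(stream_chunks(frames)):
    print(f"chunk {i}: {len(chunk)} frames = {len(chunk) * FRAME_STRIDE_MS / 1000:.2f} s")
```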
The decoding search module 404 is configured to perform decoding search processing on the semantic feature information according to a preset decoding search manner, so as to obtain CTC results of the semantic feature information; the CTC result includes an acoustic predictive scoring value for a first character in the candidate text sequence.
After the CTC result is obtained, it is determined whether the number of characters of the candidate text sequence is greater than 8 characters.
If the number of characters in the candidate text sequence is greater than 8 characters, the decoding and searching module 404 inputs the acoustic predictive scoring value of the first character in the CTC result into the preset language model 406 for mapping processing, so as to obtain the language predictive scoring value of the first character.
Shallow fusion is performed according to the acoustic predictive scoring values and the language predictive scoring values to obtain decoding results 410. The decoding result 410 is a speech recognition result of the target speech data.
If the number of characters in the candidate text sequence is less than or equal to 8 characters, the decoding search module 404 transmits the CTC result to the decoding module 408 for attention re-scoring, resulting in a re-scoring value for the first character. The decoding result 410 is determined based on the acoustic predictive scoring value and the re-scoring value of the first character.
The decoding module 408 may be a decoding model with a re-scoring function, such as a Transformer Decoder.
Since the technical conception is the same, the description in this embodiment is relatively simple, and the relevant parts only need to refer to the corresponding descriptions of the method embodiments provided above.
Fig. 5 is a functional framework diagram of an outbound call system according to an embodiment of the present application.
As shown in fig. 5, the outbound system includes a speech acquisition module 502, a speech recognition module 504, an intent understanding module 506, a text generation module 508, and a speech synthesis module 510.
The steps of the speech recognition method provided by the foregoing speech recognition method embodiments may be performed by the speech recognition module 504.
In practical application, first, the voice acquisition module 502 receives a voice stream signal transmitted in real time by a phone user.
Then, the voice acquisition module 502 inputs the voice stream signal to the voice recognition module 504, and detects whether an intermediate result is generated or not through the end point detection function sub-module and the intermediate result generation sub-module in the voice recognition module 504.
When a new sound of the customer is detected, the robot agent is interrupted, the interrupt voice of the customer is received, and the interrupt voice is determined as an intermediate result, that is, the detection result is an intermediate result. In the case where no new sound of the customer is detected, the detection result is that no intermediate result is generated.
If the detection result is an intermediate result, the intermediate result is transmitted to the intent understanding module 506.
The intent understanding module 506 performs intention recognition processing on the intermediate result to obtain a corresponding intention recognition result.
The text generation module 508 queries the correspondence between pre-configured intentions and response texts according to the intention recognition result, so as to obtain a target response text corresponding to the intention recognition result.
The speech synthesis module 510 generates, according to the target response text, the response voice of the robot agent for the interrupt voice.
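As a hedged, illustrative summary of this call flow, the sketch below chains placeholder functions standing in for modules 502-510; none of the names or the intent-to-text mapping come from an actual system.

```python
# Placeholder sketch of the outbound-call flow; every function stands in for the
# corresponding module (502-510) and the mapping table is an invented example.
INTENT_TO_RESPONSE = {"ask_balance": "您好，马上为您查询余额。", "other": "好的，请您稍等。"}

def recognize(voice_stream: str) -> str:        # speech recognition module 504 (intermediate result)
    return "帮我查一下余额"

def understand(text: str) -> str:               # intent understanding module 506
    return "ask_balance" if "余额" in text else "other"

def generate_text(intent: str) -> str:          # text generation module 508 (query the mapping)
    return INTENT_TO_RESPONSE.get(intent, INTENT_TO_RESPONSE["other"])

def synthesize(text: str) -> str:               # speech synthesis module 510
    return f"<tts audio for: {text}>"

interrupt_voice = "<real-time customer speech stream>"   # from voice acquisition module 502
response = synthesize(generate_text(understand(recognize(interrupt_voice))))
print(response)
```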
Since the technical conception is the same, the description in this embodiment is relatively simple, and the relevant parts only need to refer to the corresponding descriptions of the method embodiments provided above.
In the foregoing embodiments, a voice recognition method is provided, and correspondingly, based on the same technical concept, the embodiments of the present application further provide a voice recognition device, which is described below with reference to the accompanying drawings.
Fig. 6 is a schematic diagram of a voice recognition device according to an embodiment of the present application.
The present embodiment provides a voice recognition apparatus 600, including:
An obtaining unit 602, configured to obtain target voice data to be identified;
the encoding unit 604 is configured to perform encoding processing on the target voice data to obtain semantic feature information of the target voice data;
The decoding search unit 606 is configured to perform decoding search processing on the semantic feature information according to a preset decoding search manner, so as to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
The fusion unit 608 is configured to, when the number of characters in the candidate text sequence is greater than a preset number of characters threshold, input the connection timing classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value, so as to obtain a speech recognition result of the target speech data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
and a re-scoring unit 610, configured to input the connection timing classification result into a second decoding model to perform re-scoring of the acoustic prediction scoring value when the number of characters in the candidate text sequence is less than or equal to the preset number of characters threshold, so as to obtain a speech recognition result of the target speech data.
Optionally, the fusing unit 608 is specifically configured to:
Mapping the acoustic predictive scoring value of the first character to obtain the language predictive scoring value of the first character;
obtaining a first target scoring value of the first character according to a first preset weight value, the acoustic prediction scoring value and the language prediction scoring value;
and determining a voice recognition result of the target voice data according to the first target scoring value.
Optionally, the fusing unit 608 is further configured to:
determining a first comprehensive evaluation score of the candidate text sequence according to the first target score value;
And determining the candidate text sequence with the maximum first comprehensive evaluation score as a voice recognition result of the target voice data.
Optionally, the re-scoring unit 610 is specifically configured to:
Decoding according to the connection time sequence classification result to obtain a re-scoring value of the first character;
determining a second target scoring value of the first character according to a second preset weight value, the acoustic prediction scoring value and the re-scoring value;
and determining a voice recognition result of the target voice data according to the second target scoring value.
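The exact combination used for the first and second target scoring values is not fixed here; the sketch below assumes simple weighted sums and a summed comprehensive evaluation score, purely as an illustration of how the units above could combine the scores.

```python
# Assumed weighted-sum forms of the two target scoring values and of the comprehensive
# evaluation score; the weights and the per-character scores below are illustrative only.
def first_target_score(acoustic: float, language: float, first_weight: float = 0.3) -> float:
    # shallow fusion branch: acoustic prediction score plus weighted language prediction score
    return acoustic + first_weight * language

def second_target_score(acoustic: float, rescore: float, second_weight: float = 0.5) -> float:
    # re-scoring branch: acoustic prediction score plus weighted attention re-scoring value
    return acoustic + second_weight * rescore

def comprehensive_score(per_character_scores):
    # candidate-level score; the candidate text sequence with the maximum score is selected
    return sum(per_character_scores)

candidate_scores = [first_target_score(-1.2, -0.8), first_target_score(-0.5, -1.0)]
print(comprehensive_score(candidate_scores))   # -1.44 + -0.80 = -2.24
```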
Optionally, the encoding unit 604 is specifically configured to:
performing audio feature extraction processing according to the target voice data to obtain audio feature information;
And carrying out coding processing according to the audio feature information to obtain the semantic feature information.
Optionally, the acquiring unit 602 is specifically configured to:
And in the process of broadcasting voice to the call user by the robot agent, receiving the interrupt voice of the call user, and determining the interrupt voice as the target voice data.
Optionally, the voice recognition apparatus 600 further includes:
the intention recognition unit is used for carrying out intention recognition processing according to the voice recognition result of the target voice data to obtain a corresponding intention recognition result;
a determining unit, configured to determine a target response text of the intention recognition result according to the intention recognition result and a correspondence between a preconfigured intention and the response text;
And the generation unit is used for generating the response voice of the robot agent for the interrupt voice according to the target response text.
The voice recognition device provided by the embodiment of the application comprises: the acquisition unit, used for acquiring target voice data to be identified; the encoding unit, used for encoding the target voice data to obtain semantic feature information of the target voice data; the decoding search unit, used for performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information, the connection time sequence classification result including a candidate text sequence and an acoustic prediction scoring value of the first character in the candidate text sequence; the fusion unit, used for inputting the connection time sequence classification result into the first decoding model for fusion processing of the acoustic prediction scoring value and a language prediction scoring value when the number of characters of the candidate text sequence is greater than a preset character number threshold, so as to obtain a voice recognition result of the target voice data, the language prediction scoring value being obtained by conversion from the acoustic prediction scoring value; and the re-scoring unit, used for inputting the connection time sequence classification result into the second decoding model for re-scoring processing of the acoustic prediction scoring value when the number of characters of the candidate text sequence is less than or equal to the preset character number threshold, so as to obtain a voice recognition result of the target voice data.

In this way, the connection time sequence classification result can be obtained quickly through decoding search processing of the semantic feature information; this result can be regarded as an initial recognition result with relatively low accuracy that needs to be further decoded in another decoding mode. In addition, the number of characters of the initial recognition result is usually accurate and can serve as a reference for selecting the decoding mode. When the number of characters is large, decoding in the re-scoring mode, which is more accurate but slower, cannot meet the real-time requirement of the service; in that case, inputting the connection time sequence classification result into the first decoding model for fusion processing of the acoustic prediction scoring value and the language prediction scoring value yields the voice recognition result at a higher decoding speed than the re-scoring mode, thereby improving the real-time performance of voice recognition.
Corresponding to the above-described voice recognition method, based on the same technical concept, the embodiment of the present application further provides an electronic device configured to execute the voice recognition method provided above; fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
As shown in fig. 7, the electronic device may vary considerably depending on configuration or performance, and may include one or more processors 701 and a memory 702, where the memory 702 may store one or more application programs or data. The memory 702 may be transient storage or persistent storage. The application programs stored in the memory 702 may include one or more modules (not shown), and each module may include a series of computer-executable instructions in the electronic device. Still further, the processor 701 may be arranged to communicate with the memory 702 and execute the series of computer-executable instructions in the memory 702 on the electronic device. The electronic device may also include one or more power supplies 703, one or more wired or wireless network interfaces 704, one or more input/output interfaces 705, one or more keyboards 706, and the like.
In one particular embodiment, an electronic device includes a memory, and one or more programs, where the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the electronic device, and execution of the one or more programs by one or more processors includes instructions for:
Acquiring target voice data to be identified;
Coding the target voice data to obtain semantic feature information of the target voice data;
Performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
Inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
and under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and obtaining a voice recognition result of the target voice data.
Corresponding to the above-described voice recognition method, the embodiment of the application further provides a computer readable storage medium based on the same technical concept.
The computer readable storage medium provided in this embodiment is configured to store computer executable instructions, where the computer executable instructions when executed by a processor implement the following procedures:
Acquiring target voice data to be identified;
Coding the target voice data to obtain semantic feature information of the target voice data;
Performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
Inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
and under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and obtaining a voice recognition result of the target voice data.
It should be noted that, in the present specification, the embodiments related to the computer readable storage medium and the embodiments related to the voice recognition method in the present specification are based on the same inventive concept, so that the specific implementation of the embodiments may refer to the implementation of the corresponding method, and the repetition is omitted.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-readable storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, Random Access Memory (RAM) and/or non-volatile memory in a computer-readable medium, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
Embodiments of the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.
Claims (10)
1. A method of speech recognition, comprising:
Acquiring target voice data to be identified;
Coding the target voice data to obtain semantic feature information of the target voice data;
Performing decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
Inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
and under the condition that the number of characters of the candidate text sequence is smaller than or equal to the preset character number threshold, inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value, and obtaining a voice recognition result of the target voice data.
2. The method according to claim 1, wherein inputting the connection timing classification result into a first decoding model for fusion processing of the acoustic prediction scoring value and the language prediction scoring value to obtain a speech recognition result of the target speech data comprises:
Mapping the acoustic predictive scoring value of the first character to obtain the language predictive scoring value of the first character;
obtaining a first target scoring value of the first character according to a first preset weight value, the acoustic prediction scoring value and the language prediction scoring value;
and determining a voice recognition result of the target voice data according to the first target scoring value.
3. The method of claim 2, wherein determining the speech recognition result of the target speech data based on the first target scoring value comprises:
determining a first comprehensive evaluation score of the candidate text sequence according to the first target score value;
And determining the candidate text sequence with the maximum first comprehensive evaluation score as a voice recognition result of the target voice data.
5. The method of claim 1, wherein inputting the connection timing classification result into a second decoding model for re-scoring processing of the acoustic prediction scoring value to obtain a speech recognition result of the target speech data, comprises:
decoding according to the connection time sequence classification result to obtain a re-scoring value of the first character;
determining a second target scoring value of the first character according to a second preset weight value, the acoustic prediction scoring value and the re-scoring value;
and determining a voice recognition result of the target voice data according to the second target scoring value.
5. The method according to claim 1, wherein the encoding the target voice data to obtain semantic feature information of the target voice data includes:
performing audio feature extraction processing according to the target voice data to obtain audio feature information;
And carrying out coding processing according to the audio feature information to obtain the semantic feature information.
6. The method of claim 1, wherein the obtaining target voice data to be identified comprises:
And in the process of broadcasting voice to the call user by the robot agent, receiving the interrupt voice of the call user, and determining the interrupt voice as the target voice data.
7. The method of claim 6, wherein the method further comprises:
performing intention recognition processing according to the voice recognition result of the target voice data to obtain a corresponding intention recognition result;
Determining a target response text of the intention recognition result according to the intention recognition result and the corresponding relation between the preconfigured intention and the response text;
and generating response voice of the robot seat aiming at the interrupt voice according to the target response text.
8. A speech recognition apparatus, comprising:
The acquisition unit is used for acquiring target voice data to be identified;
The coding unit is used for coding the target voice data to obtain semantic feature information of the target voice data;
The decoding search unit is used for carrying out decoding search processing on the semantic feature information according to a preset decoding search mode to obtain a connection time sequence classification result of the semantic feature information; the connection time sequence classification result comprises a candidate text sequence and an acoustic prediction scoring value of a first character in the candidate text sequence;
The fusion unit is used for inputting the connection time sequence classification result into a first decoding model to perform fusion processing of the acoustic prediction scoring value and the language prediction scoring value under the condition that the character number of the candidate text sequence is larger than a preset character number threshold value, so as to obtain a voice recognition result of the target voice data; the language prediction scoring value is obtained by conversion according to the acoustic prediction scoring value;
And the re-scoring unit is used for inputting the connection time sequence classification result into a second decoding model to perform re-scoring processing of the acoustic prediction scoring value under the condition that the character number of the candidate text sequence is smaller than or equal to the preset character number threshold value, so as to obtain a voice recognition result of the target voice data.
9. An electronic device, the device comprising:
A processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to perform the speech recognition method of any of claims 1-7.
10. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the speech recognition method of any one of claims 1-7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311007598.2A | 2023-08-10 | 2023-08-10 | Speech recognition method, device, electronic equipment and storage medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311007598.2A | 2023-08-10 | 2023-08-10 | Speech recognition method, device, electronic equipment and storage medium |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN117953896A (en) | 2024-04-30 |

Family

ID=90795062

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311007598.2A | Speech recognition method, device, electronic equipment and storage medium | 2023-08-10 | 2023-08-10 |

2023-08-10: CN CN202311007598.2A patent/CN117953896A/en, active, Pending

Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119132347A | 2024-09-09 | 2024-12-13 | 美的集团(上海)有限公司 | Method, device, medium, program product and system for responding to speech termination point |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |