
CN117133287A - Speech recognition method and device and robot dialogue method and system


Info

Publication number
CN117133287A
CN117133287A (application number CN202310574724.6A)
Authority
CN
China
Prior art keywords
recognition result
subject
word
character
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310574724.6A
Other languages
Chinese (zh)
Inventor
陈文轩
邓博文
廖益木
陈志樑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou On Bright Electronics Co Ltd
Original Assignee
Guangzhou On Bright Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou On Bright Electronics Co Ltd filed Critical Guangzhou On Bright Electronics Co Ltd
Priority: CN202310574724.6A
Publication: CN117133287A
Legal status: Pending


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/0005Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method and device and a robot dialogue method and system. The voice recognition method comprises the following steps: constructing a finite state automaton for the received speech based on the corpus of a speech recognition model; starting from the initial state of the finite state automaton, using a search method to select, according to the current state and the weights of its subsequent edges, the edge with the largest weight and transitioning along it until the termination state of the finite state automaton is reached, thereby determining at least one recognition result sequence; determining a probability of each of the at least one recognition result sequence; taking the recognition result sequence with the highest probability among the at least one recognition result sequence as a preliminary recognition result; and matching the preliminary recognition result to questions in a preset question bank and taking the matched question as the final recognition result for the speech.

Description

Speech recognition method and device and robot dialogue method and system
Technical Field
The present application relates to the field of speech recognition, and more particularly to a speech recognition method and apparatus, and a robot dialogue method and system.
Background
With the development and popularization of voice robots, they are being applied in more and more fields. For example, many shops and restaurants are equipped with auxiliary voice robots that deliver meals, welcome guests, answer questions, and so on. These application scenarios share the following characteristics: the application scene is fixed and the required corpus is small, but new vocabulary is plentiful (such as restaurant names and menu item names), so the voice robot must recognize the new vocabulary accurately. Speech recognition methods at the current research frontier pursue universality and robustness above all, so their corpus data size and corpus coverage keep growing. For the application scenarios above, existing speech recognition methods have a high misrecognition rate and cannot be directly deployed on a voice robot for these scenes, so they have low practicability and are disconnected from the market.
Early robot dialogue systems on the market were based primarily on rules: when the user's question text hit a predefined rule, the system could accurately and quickly return a search result, thereby enabling a dialogue. However, due to the complexity of natural language, rule bases are difficult to extend to cover the many cases arising in real scenes.
With the development of deep learning, many robot dialogue systems currently on the market discard the explicit "search" step, because a trained deep learning model itself acts as a database with built-in search and matching: a complete question-answering system can be built on the trained model alone. Such a question-answering system first requires a training sample set of "question: answer" pairs to train a Seq2Seq network model; questions are converted into word vectors and fed into the model to obtain fixed answers. However, this kind of question-answering system has the following problems. (1) Retrieval precision depends heavily on the number of training samples; when the number of training samples is small, precision drops noticeably. (2) Answering a single question requires similarity prediction between the user's question and a large number of candidate questions in the text library one by one, i.e., a large amount of neural network inference, so the retrieval speed struggles to meet requirements; overly long sentences easily lead to overly long search times. The question-answering system therefore has to address retrieval accuracy and retrieval speed at the same time. On the other hand, many questions are themselves difficult to generalize. For example, rephrasing "what skills do you have" as "what skills have you learned" leaves the meaning of the sentence unchanged, yet the word change rate is already as high as 30%, which can cause the model to collapse. And when the user wants to add the question "what skills have you learned", the sample set must be reconstructed and the model retrained. This makes the question-answering system harder to deploy and lowers its versatility and accuracy.
Disclosure of Invention
One aspect of the present application provides a voice recognition method, including: constructing a finite state automaton for the received speech based on a corpus of a speech recognition model, wherein each pronunciation unit of the speech acts as a state sequence of the finite state automaton, jumps from one state sequence to another state sequence act as edges of the finite state automaton, and each edge has a weight representing the probability of the jump between the state sequences; starting from the initial state of the finite state automaton, using a search method to select, according to the current state and the weights of its subsequent edges, the edge with the largest weight and transitioning along it until the termination state of the finite state automaton is reached, thereby determining at least one recognition result sequence; determining a probability of each of the at least one recognition result sequence; taking the recognition result sequence with the highest probability among the at least one recognition result sequence as a preliminary recognition result; and matching the preliminary recognition result to questions in a preset question bank and taking the matched question as the final recognition result for the speech.
Another aspect of the present application provides a voice recognition apparatus, comprising: a finite state automaton construction module configured to construct a finite state automaton for the received speech based on a corpus of a speech recognition model, wherein each pronunciation unit of the speech acts as a state sequence of the finite state automaton, jumps from one state sequence to another state sequence act as edges of the finite state automaton, and each edge has a weight representing the probability of the jump between the state sequences; a search module configured to start from the initial state of the finite state automaton and, using a search method, select according to the current state and the weights of its subsequent edges the edge with the largest weight for transition, until the termination state of the finite state automaton is reached, thereby determining at least one recognition result sequence; a probability determination module configured to determine a probability for each of the at least one recognition result sequence; a preliminary recognition result determination module configured to take the recognition result sequence with the highest probability among the at least one recognition result sequence as a preliminary recognition result; and a matching module configured to match the preliminary recognition result to questions in a preset question bank and to take the matched question as the final recognition result for the speech.
Another aspect of the present application provides a voice recognition apparatus, comprising: a processor; and a memory storing computer readable instructions, wherein the processor is configured to execute the computer readable instructions to perform the speech recognition method according to an embodiment of the application.
Another aspect of the present application provides a computer-readable storage medium comprising computer readable instructions which, when executed by a processor, cause the processor to perform the speech recognition method according to an embodiment of the application.
In another aspect of the present application, a robot dialogue method is provided, including: recognizing speech according to the voice recognition method of an embodiment of the application; and determining and outputting an answer in response to the final recognition result.
In another aspect of the present application, there is provided a robot dialog system comprising: the voice recognition device according to an embodiment of the application; and an answer output module configured to determine and output an answer in response to the final recognition result.
The voice recognition method and device and the robot dialogue method and system have the characteristics of simple operation, quick response and high accuracy, and have high universality and high portability.
Drawings
The aspects of the application are best understood from the following detailed description when read with the accompanying drawing figures. Note that the various features are not necessarily drawn to scale according to industry standard practices. Like reference numerals describe like components throughout the several views. Like numerals having different letter suffixes may represent different instances of similar components. In the drawings:
FIG. 1 shows a schematic diagram of the basic principle of a weighted finite state transducer (WFST);
FIG. 2 illustrates an example flow chart of a speech recognition method according to an embodiment of the application;
FIG. 3 shows a schematic diagram of the principle of the beam search method;
FIG. 4 shows a schematic diagram of topic classification employed by a matching mechanism of a speech recognition method in accordance with an embodiment of the present application;
FIG. 5 illustrates an example flow chart of a matching mechanism of a speech recognition method according to an embodiment of the application;
FIG. 6 shows an example block diagram of a speech recognition apparatus according to an embodiment of the application;
FIG. 7 shows an example block diagram of a matching module according to an embodiment of the application;
FIG. 8 shows an example block diagram of a robot dialog method in accordance with an embodiment of the present application;
fig. 9 shows a schematic diagram of an application example of a robot conversation method according to an embodiment of the present application;
FIG. 10 illustrates an example block diagram of a robotic dialog system in accordance with an embodiment of the application; and
fig. 11 illustrates an example block diagram of a computer system that may implement the voice recognition method shown in fig. 2 and the robot dialog method shown in fig. 8, in accordance with an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the application are described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the application. It will be apparent, however, to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the application by showing examples of the application. The present application is in no way limited to any particular configuration and algorithm set forth below, but rather covers any modification, substitution, and improvement of elements, components, and algorithms without departing from the spirit of the application. In the drawings and the following description, well-known structures and techniques have not been shown in order to avoid unnecessarily obscuring the present application.
The embodiment of the application provides a voice recognition method and device, and a robot dialogue method and system, based on the wenet voice recognition model combined with an n-gram language model and WER (word error rate) matching; they have the characteristics of simple operation, quick response and high accuracy, as well as high universality and high portability. A detailed description of the voice recognition method and apparatus and the robot dialog method and system of the present application is provided below by way of specific embodiments and with reference to the accompanying drawings. It should be understood that while the present application is described taking the wenet speech recognition model as an example, the present application is not limited thereto, and other models in the field of speech recognition may be employed.
The wenet speech recognition model is a Chinese text recognition model for Automatic Speech Recognition (ASR) trained on the ten-thousand-hour WenetSpeech dataset, which covers a wide variety of domains. Due to noise and other factors in speech recognition, recognition based entirely on phonemes and context searches struggles to satisfy actual usage scenarios. For example, when the wenet model is applied directly as part of a question-answering system in a professional field, new words that appear in that field cannot be identified by the wenet model, because the ten thousand hours of corpus does not contain these words or their frequency is too low. For another example, when a company needs a voice robot to introduce the names of the company's responsible persons, the wenet model cannot directly recognize the person names. Notably, these terms typically appear individually and can be accumulated into a dictionary; that is, accurate recognition of these terms is possible for the wenet speech recognition model.
To cope with the above situation, the following two methods are generally adopted. The first is to add the new words to the hot-word list and set weights so that these words obtain the highest matching scores, after which the answer can be obtained. The second is to introduce pronunciations for these new words into the corpus and retrain the wenet model. Evidently, both methods are too costly, since on the one hand an appropriate weight has to be set for each word, and on the other hand the data set has to be reconstructed with attention to the data proportioning.
The present application proposes a third method. It builds on the original wenet model: no corpus needs to be replaced and no hot-word list needs to be updated; only the pre-trained n-gram language model provided by the application needs to be added, and the original prefix search method of the wenet model is replaced by the n-gram-based WfstBeamSearch search method provided by the application, so that the requirements of voice recognition in different scenes can be met. This method applies the wenet model to various professional fields at minimal cost, with high portability.
WfstBeamSearch is a Weighted Finite State Transducer (WFST) based search algorithm that implements result correction by constructing a finite state automaton capable of automatic error correction. The state nodes of this automaton represent possible result sequences, an edge from one state node to another represents a transition from one sequence to another, and the weight of the edge represents the probability of that transition. A schematic diagram of the basic principle of a weighted finite state transducer WFST is shown in fig. 1. In fig. 1, there are state nodes 0-5, with 0 representing the initial state and 5 the termination state. Each edge carries a pair of <input label : output label> and a corresponding weight. For example, when the input is "abcd", the output is "zyxw/0.252".
Implementing the WfstBeamSearch algorithm requires constructing, in advance, a decoding graph of the finite state automaton using the wenet model. The decoding graph integrates information from three layers, the modeling unit T, the dictionary L, and the language model G, into a decoding graph TLG. The modeling unit T is the pronunciation unit in speech, corresponding to characters in text. L is the dictionary; abandoning manually designed pronunciation sequences, it contains the pronunciations of individual characters, and the corpus of the wenet model can be used as the dictionary. G is the language model (in the present application, the n-gram language model) converted into a WFST-form representation; concretely, it provides the main weights of the edges between the different state nodes in the decoding graph. Briefly, the decoding graph lists all possible permutations of word sequences and constrains the final result of the wenet model.
The process of speech recognition can be understood as searching the decoding graph for an optimal path. Specifically, a search method starts from the initial state of the decoding graph and, according to the weights of the subsequent edges of the current state, selects the edge with the highest probability for transition at each step, until the termination state of the decoding graph is reached. All possible recognition result sequences and their corresponding probabilities are recorded during the search, and finally the result sequence with the highest probability is selected as the final recognition result.
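As an illustration of this path search, below is a minimal sketch over a toy decoding graph. The graph contents, the token names, and the adjacency-dictionary representation are assumptions made for illustration only, not the patent's actual TLG decoding graph.

```python
# Each state maps to a list of (next_state, output_token, weight) edges,
# where the weight is the transition probability supplied by the language model.
graph = {
    "start": [("s1", "ni", 0.7), ("s2", "li", 0.3)],
    "s1":    [("end", "hao", 0.9), ("end", "hao3", 0.1)],
    "s2":    [("end", "hao", 0.8), ("end", "hao3", 0.2)],
}

def greedy_decode(graph, start="start", end="end"):
    """Follow the highest-weight successor edge until the termination state."""
    state, tokens, prob = start, [], 1.0
    while state != end:
        next_state, token, weight = max(graph[state], key=lambda e: e[2])
        tokens.append(token)
        prob *= weight          # sequence probability is the product of edge weights
        state = next_state
    return tokens, prob

print(greedy_decode(graph))     # (['ni', 'hao'], ~0.63)
```

A real decoder records several candidate paths rather than a single greedy one; that refinement is exactly the beam search described later.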
The n-gram language model is described below. The N-gram language model is a statistical-based language model. The basic idea is to perform a sliding window operation of size N on the content in the text according to units, forming a sequence of fragments of length N. Each segment is called a gram, and the occurrence probability of all the grams is counted. Preferably, filtering can be performed according to a preset threshold value to form a key gram list. The Gram list is the vector feature space of the text, and each Gram in the list is a feature vector dimension.
The model is based on the assumption that the occurrence of the Nth word is related only to the preceding N-1 words and to no other words, and that the probability of the whole sentence is the product of the occurrence probabilities of the individual words. These probabilities can be obtained by directly counting the number of co-occurrences of the N-word sequences in the corpus. Commonly used are the binary Bi-Gram (i.e., N=2) and the ternary Tri-Gram (i.e., N=3).
Bi-Gram can be expressed as:

P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})

where C(w_{i-1} w_i) is the number of times the i-th word appears in the training set (text library) together with its preceding N-1 word(s) (since N=2, N-1=1), and C(w_{i-1}) is the number of times the (i-1)-th word appears in the training set.
Similarly, tri-Gram can be expressed as:
or by counting the number of times each word appears in the sample set, and will not be described in detail herein.
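To make the count-based estimation concrete, below is a minimal sketch of Bi-Gram probability estimation on a toy corpus; the corpus contents and the tokenization are illustrative assumptions. The same ratio of occurrence frequencies is what step S202 below uses as the weight of an edge between two states.

```python
from collections import Counter

corpus = ["ni hao", "ni hao ma", "hao de"]      # toy "text library"
tokens = [s.split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("ni", "hao"))   # C("ni hao") / C("ni") = 2/2 = 1.0
print(bigram_prob("hao", "ma"))   # C("hao ma") / C("hao") = 1/3
```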
Based on this, an n-gram may be utilized to provide the main weights of the edges between the different state nodes in the decoding graph. In some implementations, the weights of the edges between the different state nodes in the decoding graph may also take into account the phoneme sequence and the transition probabilities between phonemes.
Therefore, combining the wenet speech recognition model with the n-gram language model makes it possible to recognize terms from a professional field at only a small cost. Take the term "chain relation" as an example. With Tri-Gram, if "chain relation" is divided as a single Chinese word, it carries a single probability; if it is divided into the four characters "link", "connect", "close" and "series", it becomes four probabilities, and the frequency of occurrence of each of the four characters is one gram. Experiments show that the word frequency of "connection" in the wenet corpus is far higher than that of "link", so in training the n-gram language model the word frequency of "link" needs to be raised, which raises the probability of "link" and "connection" occurring together; the recognition result of wenet speech recognition is thereby corrected by the n-gram language model. The modeling unit has been described above taking characters as an example, but it should be understood that the modeling unit is not limited to characters; it may be words, or a combination of characters and words, set with reference to the corpus of the wenet model. The segmentation used by the n-gram language model can be adapted accordingly: if desired, the same term may instead be divided into "chain", "link", and "relationship", in which case it becomes three probabilities, and the frequency of occurrence of each of the three words is one gram.
Based on the foregoing, an aspect of the present application provides a speech recognition method. Fig. 2 shows an example flowchart of a speech recognition method 200 according to an embodiment of the application. As shown, the speech recognition method 200 according to an embodiment of the present application includes S202-S210.
At S202, a finite state automaton is constructed for the received speech based on a corpus of speech recognition models, wherein each pronunciation unit of the speech acts as a state sequence of the finite state automaton, jumps from one state sequence to another state sequence act as edges of the finite state automaton, each edge having a weight representing a probability of a jump of the state sequence.
In some embodiments, the speech recognition model may be a wenet speech recognition model.
In some embodiments, the weight of the edge is determined as a probability of a jump from one state to another.
In some embodiments, the weight of each edge is determined by: determining the occurrence frequency of the text corresponding to the other state in the text library; determining the occurrence frequency of the text corresponding to the other state and the text corresponding to the precursor state in a text library; and taking the ratio of the occurrence frequency of the text corresponding to the other state and the text corresponding to the precursor state in the text library and the occurrence frequency of the text corresponding to the other state in the text library as the weight for jumping to the other state. It should be appreciated that the precursor state depends on the n-gram language model used. In the case of Bi-Gram, the precursor state is 1 state before the other state; in Tri-Gram, the precursor state is 2 states before the other state.
At S204, starting from the initial state of the finite state automaton, selecting, by using a search method, an edge with the largest weight among the subsequent edges according to the weights of the current state and the subsequent edges, and transferring until the termination state of the finite state automaton is searched, thereby determining at least one recognition result sequence.
The search method may be implemented in various ways. In some embodiments, the search method may be the beam search method (BeamSearch). Beam search is an improvement on greedy search. Greedy search takes, at each time step, the single output with the highest conditional probability, then feeds the result from the beginning up to the current step back as input to obtain the output of the next time step, until an end-of-generation mark is produced. Beam search differs in that at each time step it does not keep only the single output with the highest conditional probability, but retains the several sequences with the best conditional probabilities up to the current step.
Fig. 3 shows a schematic diagram of the basic principle of the beam search method. As shown in FIG. 3, in the first time step, A and C are the best two, yielding the two results [A], [C], with the other three discarded; in the second step, generation continues from these two results, giving 5 candidates [AA], [AB], [AC], [AD], [AE] on the A branch and 5 more in the same way on the C branch; the 10 are ranked together and the best two candidates are retained, namely [AB] and [CE] in the figure; in the third step, the best two candidates are again retained from the new 10 candidates, finally yielding the two results [ABD] and [CED].
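Below is a minimal sketch of this beam search, assuming a scoring function that returns a per-token probability distribution at each step; the toy distribution, the function names and the beam width of 2 are illustrative assumptions.

```python
import heapq

def beam_search(score_fn, max_len, beam_width=2):
    """Keep the beam_width most probable partial sequences at every time step."""
    beams = [((), 1.0)]                      # (sequence, probability)
    for _ in range(max_len):
        candidates = []
        for seq, prob in beams:
            for token, p in score_fn(seq).items():
                candidates.append((seq + (token,), prob * p))
        # keep only the beam_width most probable expansions
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams

# Toy scorer: the same fixed distribution regardless of the prefix.
dist = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.07, "E": 0.03}
print(beam_search(lambda seq: dist, max_len=3))
# [(('A', 'A', 'A'), 0.064...), (('A', 'A', 'B'), 0.048...)]
```

With a prefix-dependent scorer, the retained beams diverge as in FIG. 3 instead of all following the A branch.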
At S206, a probability of each of the at least one recognition result sequence is determined. In some embodiments, determining the probability of each recognition result sequence comprises: taking the product of the weights of each edge involved in the recognition result sequence as the probability of the recognition result sequence.
At S208, the recognition result sequence with the highest probability among the at least one recognition result sequence is taken as the preliminary recognition result.
At S210, the preliminary recognition result is matched to questions in a preset question bank, and the matched question is taken as the final recognition result for the speech.
In the voice recognition method according to the present application, the n-gram language model provides the weights of the WFST edges in the wenet voice recognition model, and the corpus of the wenet voice recognition model does not need to be changed; the method therefore has the characteristics of simple operation, quick response and high accuracy, as well as high universality and high portability.
In some embodiments, the matching may use a matching mechanism based on WER (word error rate) fused with topic classification. Specifically, the Word Error Rate (WER) is an important index for evaluating ASR performance, used to evaluate the error rate between the predicted text and the standard text; the smaller the word error rate, the better. WER is commonly used in speech-to-text or speech recognition tasks for languages such as English and Arabic to measure the quality of ASR. Because the minimum unit of an English sentence is the word while the minimum unit of a Chinese sentence is the Chinese character, the Character Error Rate (CER) is used in Chinese speech-to-text or Chinese speech recognition tasks to measure the quality of Chinese ASR. The two are computed in the same way, so for uniformity of presentation, WER is used below to denote this metric.
Assume a reference sentence Ref and a predicted text Hyp generated after the ASR system transcribes the speech. The word error rate can be expressed as:

WER = (S + D + I) / N

where S is the number of unit substitutions needed to convert Hyp to Ref, D is the number of unit deletions, I is the number of unit insertions, and N is the total number of units in the Ref sentence. With C denoting the number of correctly recognized units in the Hyp sentence, the total number of units of the original Ref sentence is N = S + D + C. It should be understood that in an English sentence the unit is the word; in a Chinese sentence the unit is the character.
For example, ref sentences you eat, and transcribed sentences Hyp you eat. In this example, an error substitution occurs, i.e. Hyp sentence substitutes "not" for "but" i.e. s=1, d= 0,I =0, ref text word number n=4, so this transcription result wer=1/4=25%.
The matching mechanism of the speech recognition method of the present application is built on the simple WER calculation above.
Assume that a question in the question bank is "who is your dad". If, for example, wenet voice recognition were required to recognize the whole sentence "who is your dad" and match it word for word, the probability of success would be small, and two problems easily arise. First, the whole question easily fails to match because of a single changed word: a user may ask "who is your father" as well as "who is your dad", so the question sentence itself carries a word error rate. Second, when the question bank grows by orders of magnitude and the WER match is not 100%, questions on the same topic are very easily confused with one another, so the determined recognition result is wrong.
One countermeasure for problem one is to correct during speech recognition based on the n-gram language model, which carries its own correction mechanism: if the user reads out the string "who is your dad", each word is searched and corrected against the word stock using gram probability matching. That is, the speech recognition technique based on the n-gram language model according to the present application already has a certain correction capability of its own. The second countermeasure is to introduce hyponyms and segmentation matching; for example, the segmented word "father" matches the hyponym "dad", and both can point to the same recognition result. Segmentation matching divides a long subject into proper-noun parts. For example, "heavyweight policy support" can be divided into "heavyweight policy", "policy support", and so on. The granularity of this division needs to be adjusted dynamically according to the number of questions, because the divided parts can greatly increase the confusion of the matching system; many divided parts easily end up duplicated. Thus, for this case and for problem two, a matching mechanism fusing topic classification is proposed according to an embodiment of the present application.
Fig. 4 shows a schematic diagram of the topic classification employed by the matching mechanism of the speech recognition method according to an embodiment of the present application. As shown in fig. 4, topic classification groups all questions of the same type into one class. For example, if the question bank contains questions about "Huangpu", then all "Huangpu" questions are grouped together, and under the subject word "Huangpu" there are sub-subject words such as "industry aggregation", "talent policy system", "heavyweight policy support", and so on. There may be further segmented words under each sub-subject word, such as "heavyweight policy" and "policy support" under "heavyweight policy support". The classification into subject words, sub-subject words and segmented words is predefined as a preset knowledge base for use in matching the recognition result against the questions in the question bank. In some embodiments, the word length of one or more of the subject words, sub-subject words and segmented words may also be recorded, which greatly reduces the time required for word segmentation during matching. Although the description above refers to sub-subject words under subject words and segmented words under sub-subject words, it should be understood that the segmented words shown in the figure may have one or more levels of classification, and segmented words may also be referred to as sub-subject words; that is, one or more levels of sub-subject words may be provided under a subject word.
Fig. 5 shows an example flow chart of a matching mechanism of a speech recognition method according to an embodiment of the application. As shown in fig. 5, the matching mechanism 500 of the voice recognition method according to the embodiment of the present application includes S502-S510.
In S502, a subject term of the preliminary recognition result is determined. In some embodiments, determining the subject term may include: s502-1, dividing the primary recognition result according to one or more subject word lengths; and S502-2, determining word error rate of each segmented word relative to one or more preset subject words, and determining the word with the word error rate larger than a first preset threshold value as the subject word.
In some embodiments, determining the word error rate of each segmented word relative to one or more preset subject words may include: comparing each segmented word with one or more preset subject words to determine the character replacement number, the character insertion number and the character deletion number of each segmented word relative to the one or more preset subject words; and determining a ratio of a sum of the number of character substitutions, the number of character insertions, and the number of character deletions to the total number of characters of each segmented word as the word error rate.
In the case where the subject word matching is successful, the flow proceeds to S504. At S504, the divided characters in the portions other than the subject word in the preliminary recognition result are determined. In some embodiments, determining the segmented character comprises: s504-1, dividing the part except the words serving as the subject words in the primary recognition result according to the character length; and S504-2, determining the matching rate of each segmented character relative to the preset characters, and determining the characters with the matching rate larger than a second preset threshold value as segmented characters.
In the case where the character matching is successful, the flow proceeds to S508. At S508, a matching question is determined from the question bank based on the subject word and the segmentation characters, as the final recognition result for the speech. That is, by integrating word segmentation and word error rate matching, the preliminary recognition result produced by the n-gram-based speech recognition can be matched to a question in the question bank. In the case where the character matching fails, the flow proceeds to S510, and recognition fails due to the matching failure.
In some embodiments, as shown in fig. 5, the matching mechanism of the speech recognition method according to the present application may further include: s506, determining one or more levels of sub-subject words in the part except the subject word in the preliminary recognition result. In some embodiments, determining one or more levels of sub-subject terms may include: s506-1, performing word segmentation on the part except the words serving as the subject words in the preliminary recognition result step by step according to the lengths of one or more sub-subject words; and S506-2, determining word error rate of each segmented word relative to one or more preset subtopic words, and determining the word with the word error rate greater than a third preset threshold as one-level or multi-level subtopic words.
In the case where the sub subject word matching is successful, the flow proceeds to S504. Accordingly, the segmentation character is determined from the portion of the preliminary recognition result other than the subject word and the one-level or multi-level sub-subject word. Subsequently, at S508, a matching question is determined from the question bank as a final recognition result for the speech based on the subject word, the one or more sub-subject words, and the segmentation character. In the case where the sub-subject word matching fails, the flow proceeds to S510, where the recognition fails.
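Pulling S502-S508 together, below is a minimal sketch of the matching flow against a tiny hand-built knowledge base. The knowledge-base contents, the thresholds, the helper names, and the use of a smaller-is-better WER comparison for a match are all illustrative assumptions; wer() here is a compact form of the edit-distance function sketched earlier.

```python
def wer(ref, hyp):
    """Edit distance (as in the earlier sketch) normalized by len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)

def windows(text, size):
    """All contiguous segments of the given length (division by word length)."""
    return [text[i:i + size] for i in range(max(len(text) - size + 1, 1))]

# subject word -> sub-subject words (further segmented words omitted here)
knowledge_base = {
    "Huangpu": ["talent policy", "industry aggregation", "policy support"],
}

def match_question(text, kb, threshold=0.25):
    """Return the (subject word, sub-subject word) matched within the threshold."""
    for subject, subs in kb.items():
        if min(wer(subject, w) for w in windows(text, len(subject))) <= threshold:
            rest = text.replace(subject, "", 1)      # S504/S506: match the remainder
            for sub in subs:
                if min(wer(sub, w) for w in windows(rest, len(sub))) <= threshold:
                    return subject, sub
            return subject, None                     # subject matched, no sub-subject
    return None, None                                # S510: matching failed

print(match_question("what is the Huangpu talent policy", knowledge_base))
# ('Huangpu', 'talent policy')
```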
By matching one or more sub-subject terms, the matching operation can be made faster, thereby determining the final recognition result more quickly.
In some embodiments, the matching mechanism of the speech recognition method according to the present application may further comprise: determining hyponyms of the subject word, the sub-subject words and the segmented characters, in which case the determination of the final recognition result is also based on the determined hyponyms. In this way, the matching probability can be increased, thereby increasing the recognition probability.
In the matching mechanism of the speech recognition method according to the embodiment of the present application, preset thresholds for word error rate matching of the subject word, the one-level or multi-level sub-subject word, and the segmentation character may be set independently, and may be the same or different. In general, for the subject words and sub-subject words having fewer than 4 characters, the preset threshold for word error rate matching may be set to 100%.
It should also be appreciated that a higher-level class has priority over a lower-level class, e.g., subject word > sub-subject word > segmented word. That is, only when no subject word or higher-level classification word is found is a matching search performed on the sub-subject words or lower-level classification words. For example, for the question "what is the talent policy of Huangpu", the knowledge base presets "Huangpu" as the subject word and "talent policy" as a sub-subject word; if the question is simply "what is the talent policy", the corresponding match can still be found from the lower-level classification even though no subject word is matched. Although talent policies may differ between areas such as Huangpu District and Yuexiu District, the speech recognition method according to the present application focuses on the applicability of a topic-specific dialogue system and therefore preferentially matches the first entered word.
After fusing the above algorithms, the voice recognition method and question matching mechanism according to the present application allow a dialogue robot with rich applicable scenes and high accuracy in recognizing professional-domain knowledge to be fully ported to development boards such as rk3588 and rk3568. The portability is high in that it can run on different systems with little computing power; the usage scenes are rich in that professional or specific vocabularies in different professional fields can be recognized well; and the high accuracy rests on the fast and accurate matching mechanism for the preset questions.
Fig. 6 shows an example block diagram of a speech recognition apparatus 600 according to an embodiment of the application. As shown in the figure, the voice recognition apparatus 600 according to the embodiment of the present application includes a finite state automaton construction module 601, a search module 602, a probability determination module 603, a preliminary recognition result determination module 604, and a matching module 605.
The finite state automaton construction module 601 is configured to construct a finite state automaton for the received speech based on a corpus of speech recognition models, wherein each pronunciation unit of the speech acts as a state sequence of the finite state automaton, jumps from one state sequence to another state sequence act as edges of the finite state automaton, each edge having a weight representing a probability of a jump of the state sequence.
In some embodiments, the finite state automaton construction module 601 is configured to: the probability of a jump from one state to another is determined as the weight of the edge.
In some embodiments, the finite state automaton construction module 601 is configured to: determining the occurrence frequency of the text corresponding to the other state in the text library; determining the occurrence frequency of the text corresponding to the other state and the text corresponding to the precursor state in a text library; and taking the ratio of the occurrence frequency of the text corresponding to the other state and the text corresponding to the precursor state in the text library and the occurrence frequency of the text corresponding to the other state in the text library as the weight for jumping to the other state.
The search module 602 is configured to select, from an initial state of the finite state automaton, an edge with a largest weight among the subsequent edges for transition by adopting a search method according to the weights of the current state and the subsequent edges until a termination state of the finite state automaton is searched, thereby determining at least one recognition result sequence.
The probability determination module 603 is configured to determine a probability for each of the at least one recognition result sequence. In some embodiments, the probability determination module 603 is configured to take, for each recognition result sequence, as the probability of that recognition result sequence, the product of the weights of each edge to which the recognition result sequence relates.
The preliminary recognition result determination module 604 is configured to take a recognition result sequence with the highest probability of the at least one recognition result sequence as a preliminary recognition result.
The matching module 605 is configured to match the preliminary recognition result to questions in a pre-set question bank and take the matched questions as the final recognition result for the speech.
Fig. 7 shows an example block diagram of a matching module according to an embodiment of the application. As shown, the matching module 605 includes: a subject term determination module 605-1, a subtopic determination module 605-2, a character determination module 605-3, and a final recognition result determination module 605-4.
The subject term determination module 605-1 is configured to determine the subject words of the preliminary recognition result. In some embodiments, the subject term determination module 605-1 is configured to: divide the preliminary recognition result according to one or more subject word lengths; and determine the word error rate of each segmented word relative to one or more preset subject words, determining the words with a word error rate greater than a first preset threshold as subject words. In some embodiments, the subject term determination module 605-1 is configured to: compare each segmented word with the one or more preset subject words to determine the number of character substitutions, character insertions and character deletions of each segmented word relative to the one or more preset subject words; and determine the ratio of the sum of the number of character substitutions, character insertions and character deletions to the total number of characters of each segmented word as the word error rate.
The character determination module 605-3 is configured to: and determining segmentation characters in the part except the subject word in the preliminary recognition result. In some embodiments, the character determination module 605-3 is configured to: dividing the part except the words serving as the subject words in the primary recognition result according to the character length; and determining the matching rate of each segmented character relative to the preset character, and determining the character with the matching rate larger than a second preset threshold value as the segmented character.
The final recognition result determination module 605-4 is configured to determine a matching question from the question bank, based on the subject word and the segmentation characters, as the final recognition result for the speech.
In some embodiments, as shown in fig. 7, the matching module 605 according to the present application may further include the subtopic determination module 605-2 (illustrated in the figure as a dashed box), configured to determine one or more levels of sub-subject words in the portions of the preliminary recognition result other than the subject word. In some embodiments, the subtopic determination module 605-2 is configured to: segment the portion of the preliminary recognition result other than the words serving as subject words, step by step, according to the lengths of one or more sub-subject words; and determine the word error rate of each segmented word relative to one or more preset sub-subject words, determining the words with a word error rate greater than a third preset threshold as one or more levels of sub-subject words. In implementations that utilize one or more levels of sub-subject words, the character determination module 605-3 is configured to determine segmented characters from the portions of the preliminary recognition result other than the subject word and the one or more levels of sub-subject words, and the final recognition result determination module 605-4 is configured to determine the final recognition result based also on the one or more levels of sub-subject words.
Likewise, after fusing the above algorithms, the voice recognition apparatus according to the present application allows a dialogue robot with rich applicable scenes and high accuracy in recognizing professional-domain knowledge to be fully ported to development boards such as rk3588 and rk3568, with high portability across systems, little computing power required, and accuracy resting on the fast matching mechanism for the preset questions.
The voice recognition method and apparatus provided by the application do not require replacing the voice recognition model or its corpus; only one external language model needs to be added, and the original search method of the speech recognition model is replaced by the search method described in the application. Modifications can thus be implemented on the basis of existing models at minimal cost. Compared with traditional voice recognition methods and apparatus, combining a pre-trained voice recognition model with different language models allows wide application in various fields: when the application scenario changes, only the training set of the language model needs to be changed and the voice recognition model needs no additional training, which reduces the deployment cost of voice recognition and improves the applicability and accuracy of the system.
The voice recognition method and the voice recognition device can be operated in different systems, and can be applied to a voice robot for realizing a voice robot dialogue method and a voice robot dialogue system.
Fig. 8 shows an example block diagram of a robot dialog method according to an embodiment of the application. As shown in fig. 8, the robot conversation method 800 according to the embodiment of the present application includes S802-S812. Steps S802 to S810 in the robot conversation method 800 are the same as steps S202 to S210 of the speech recognition method 200 shown in fig. 2, and are not described here again. The robot conversation method 800 according to the embodiment of the present application further includes S812, which determines and outputs an answer in response to the final recognition result. It should be understood that the manner of output is not limited, e.g., may be displayed via a display screen, may be played via a speaker, etc., or a combination of both, etc.
Fig. 9 shows a schematic diagram of an application example of the robot conversation method according to the embodiment of the present application. As shown in fig. 9, in this example, the user utters speech. For the received voice, wenet-based voice recognition 901 is performed. In 901, a finite state automaton is constructed based on the corpus of the wenet speech recognition model, wherein each pronunciation unit of the speech serves as a state sequence of the finite state automaton, jumps from one state sequence to another state sequence serve as edges of the finite state automaton, and each edge has a weight representing the probability of the jump between the state sequences; starting from the initial state of the finite state automaton, a search method selects, according to the current state and the weights of its subsequent edges, the edge with the largest weight and transitions along it until the termination state of the finite state automaton is reached, thereby determining at least one recognition result sequence; a probability of each of the at least one recognition result sequence is determined; and the recognition result sequence with the highest probability among the at least one recognition result sequence is taken as the preliminary recognition result.
In FIG. 9, the wenet speech recognition 901 is connected to an n-gram language model 902. It should be appreciated that, in practice, the n-gram language model 902 is incorporated into the wenet speech recognition 901. As described above, the n-gram language model 902 provides the weights for the edges along which the finite state automaton in the wenet speech recognition 901 transitions from one state sequence to another.
Next, flow proceeds to the matching mechanism 903. At 903, the preliminary recognition result determined by the wenet speech recognition fused with the n-gram language model 902 is matched to the questions in the question bank to determine the final recognition result for the speech. For the specific matching procedure, refer to the illustrations and descriptions of the previous embodiments.
After the final recognition result for the voice is determined, flow may proceed to 904 to search the database of the question-and-answer system to determine the answer corresponding to the final recognition result, and the answer is finally output as voice at 905, thereby implementing the robot dialogue.
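To illustrate how the blocks of FIG. 9 chain together, here is a minimal end-to-end sketch that reuses knowledge_base and match_question from the matching sketch above; recognize() is a hypothetical stand-in for blocks 901-902, and the answer table is an illustrative assumption, not the patent's actual database.

```python
# Hypothetical stand-in for wenet + n-gram speech recognition (901/902):
# a real system would run WfstBeamSearch over the TLG decoding graph here.
def recognize(audio):
    return "what is the Huangpu talent policy"

# Illustrative answer database for block 904.
answers = {
    ("Huangpu", "talent policy"): "Huangpu offers subsidies and housing support.",
}

def dialogue(audio):
    text = recognize(audio)                               # 901/902: preliminary result
    subject, sub = match_question(text, knowledge_base)   # 903: matching mechanism
    if subject is None:
        return "Sorry, I did not understand the question."   # matching failed
    # 904: look up the answer; block 905 would synthesize it as speech
    return answers.get((subject, sub), "No answer recorded for this question.")

print(dialogue(audio=None))   # Huangpu offers subsidies and housing support.
```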
Fig. 10 shows an example block diagram of a robot dialog system according to an embodiment of the application. As shown in fig. 10, the robot dialog system 1000 according to an embodiment of the present application includes a finite state automaton construction module 1001, a search module 1002, a probability determination module 1003, a preliminary recognition result determination module 1004, a matching module 1005, and an answer output module 1006. The finite state automaton construction module 1001, the search module 1002, the probability determination module 1003, the preliminary recognition result determination module 1004, and the matching module 1005 of the robot dialog system 1000 are the same as the finite state automaton construction module 601, the search module 602, the probability determination module 603, the preliminary recognition result determination module 604, and the matching module 605 of the speech recognition apparatus 600 shown in fig. 6, and are not described here again. The robot dialogue system 1000 according to an embodiment of the application further comprises the answer output module 1006, configured to determine and output an answer in response to the final recognition result. It should be understood that the manner of output is not limited; for example, the answer may be displayed via a display screen, played via a speaker, or both.
Through the fusion of the above algorithms, the robot dialogue method and system according to the present application have rich applicable scenes and high recognition accuracy, and can conduct voice question-and-answer dialogues involving specific vocabularies in different fields quickly and accurately.
FIG. 11 illustrates an example block diagram of a computer system that may implement the voice recognition method shown in FIG. 2 and the robot dialog method shown in FIG. 8, in accordance with an embodiment of the application. It should be appreciated that the computer system 1100 illustrated in fig. 11 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the computer system 1100 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1101 that may perform various suitable actions and processes in accordance with programs stored in a Read Only Memory (ROM) 1102 or loaded from a storage device 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the computer system 1100 are also stored. The processing device 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
In general, the following devices may be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, touchpad, camera, accelerometer, gyroscope, and sensors; output devices 1107 including, for example, a liquid crystal display (LCD), speakers, vibrators, motors, and electronic speed regulators; storage devices 1108 including, for example, flash memory (Flash Card); and a communication device 1109. The communication device 1109 may allow the computer system 1100 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 11 illustrates a computer system 1100 having various devices, it should be understood that not all illustrated devices are required to be implemented or provided; more or fewer devices may be implemented or provided instead. Each block shown in fig. 11 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application provide a computer-readable storage medium storing a computer program containing program code for executing the process S108 shown in Fig. 1. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1109, installed from the storage device 1108, or installed from the ROM 1102. When the computer program is executed by the processing device 1101, the process S108 shown in Fig. 1 is implemented.
It should be noted that the computer readable medium according to the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium according to an embodiment of the present application may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium according to an embodiment of the present application may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wires, fiber optic cables, radio frequency (RF), and the like, or any suitable combination thereof.
Computer program code for carrying out operations according to embodiments of the present application may be written in one or more programming languages or combinations thereof, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be appreciated that although some examples are given above, they are provided only to illustrate the basic ideas of the speech recognition method and apparatus and the robot dialogue method and system according to embodiments of the present application; the present application is not limited to these examples and can also be applied in a wide variety of other application scenarios or fields.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The present application may be embodied in other specific forms without departing from its spirit or essential characteristics. For example, the algorithms described in particular embodiments may be modified without departing from the basic spirit of the application. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (24)

1. A method of speech recognition, comprising:
constructing a finite state automaton for the received speech based on a corpus of a speech recognition model, wherein each pronunciation unit of the speech serves as a state sequence of the finite state automaton, a jump from one state sequence to another serves as an edge of the finite state automaton, and each edge has a weight representing the probability of the corresponding state-sequence jump;
starting from an initial state of the finite state automaton, selecting, by a search method and according to the current state and the weights of its subsequent edges, the subsequent edge with the largest weight and transferring along it, until a termination state of the finite state automaton is reached, so as to determine at least one recognition result sequence;
determining a probability of each of the at least one recognition result sequence;
taking the recognition result sequence with the highest probability among the at least one recognition result sequence as a preliminary recognition result; and
matching the preliminary recognition result to questions in a preset question bank, and taking the matched question as a final recognition result for the speech.
2. The speech recognition method of claim 1, further comprising: determining the probability of a jump from one state to another as the weight of the corresponding edge.
3. The speech recognition method of claim 1, further comprising: for each recognition result sequence, taking the product of the weights of the edges involved in that recognition result sequence as the probability of that recognition result sequence.
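As a reading aid for claims 1-3 (not part of the claims themselves), the following sketch walks a weighted graph greedily from the initial state, always taking the largest-weight outgoing edge until a termination state is reached, and accumulates the product of the traversed edge weights as the sequence probability. The dictionary-of-edges representation and all names are assumptions made for illustration.

```python
# Hedged illustration of claims 1-3; the graph encoding is an assumption.

def greedy_decode(edges, initial_state, terminal_states):
    """edges: dict mapping state -> list of (next_state, label, weight) tuples."""
    state, labels, probability = initial_state, [], 1.0
    while state not in terminal_states:
        successors = edges.get(state)
        if not successors:  # dead end before reaching a termination state
            return None, 0.0
        # claim 1: select the subsequent edge with the largest weight
        next_state, label, weight = max(successors, key=lambda edge: edge[2])
        labels.append(label)
        probability *= weight  # claim 3: product of the weights of the edges involved
        state = next_state
    return labels, probability
```

A beam-style variant keeping the k best partial paths instead of one would yield the "at least one" recognition result sequences of claim 1; the single-path version above is merely the simplest case.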
4. The speech recognition method according to any one of claims 1-3, further comprising:
determining the frequency of occurrence, in a text library, of the text corresponding to the other state;
determining the frequency of occurrence, in the text library, of the text corresponding to the other state together with the text corresponding to its predecessor state; and
taking, as the weight for jumping to the other state, the ratio of the frequency of occurrence, in the text library, of the text corresponding to the other state together with the text corresponding to the predecessor state to the frequency of occurrence, in the text library, of the text corresponding to the other state.
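Purely as an illustration, the sketch below computes the weight of claim 4 literally as stated: the co-occurrence frequency of the predecessor state's text followed by the other state's text, divided by the other state's own frequency in the text library. The tokenized-sentence representation of the text library is an assumption.

```python
from collections import Counter

def jump_weight(text_library, predecessor_text, other_text):
    """text_library: iterable of token lists, e.g. [["how", "are", "you"], ...]."""
    unigrams = Counter(token for sentence in text_library for token in sentence)
    bigrams = Counter(pair for sentence in text_library
                      for pair in zip(sentence, sentence[1:]))
    if unigrams[other_text] == 0:
        return 0.0
    # ratio of the pair's frequency to the other state's own frequency,
    # following the wording of claim 4
    return bigrams[(predecessor_text, other_text)] / unigrams[other_text]
```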
5. The speech recognition method of claim 1, further comprising:
determining a subject word of the preliminary recognition result;
determining segmentation characters in the portion of the preliminary recognition result other than the subject word; and
determining a matching question from the question bank, based on the subject word and the segmentation characters, as the final recognition result for the speech.
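Claim 5 does not fix a particular matching rule; the sketch below is one plausible reading, offered only as an assumption: filter the question bank by the subject word, then rank the survivors by how many segmentation characters they contain.

```python
# One hypothetical matching rule consistent with claim 5; the scoring is an
# assumption, since the claim only requires matching based on the subject
# word and the segmentation characters.

def match_question(question_bank, subject_word, segmentation_characters):
    candidates = [q for q in question_bank if subject_word in q]
    if not candidates:
        return None
    return max(candidates,
               key=lambda q: sum(ch in q for ch in segmentation_characters))
```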
6. The speech recognition method of claim 5, further comprising:
determining one or more levels of sub-subject words in the portion of the preliminary recognition result other than the subject word;
wherein the segmentation characters are determined from the portion of the preliminary recognition result other than the subject word and the one or more levels of sub-subject words, and
wherein the determination of the final recognition result is further based on the one or more levels of sub-subject words.
7. The speech recognition method of claim 5, wherein
determining the subject word comprises: segmenting the preliminary recognition result according to the lengths of one or more subject words, determining a word error rate of each segmented word relative to one or more preset subject words, and determining a word whose word error rate is larger than a first preset threshold as the subject word; and
determining the segmentation characters comprises: dividing the portion of the preliminary recognition result other than the word serving as the subject word according to character length, determining a matching rate of each divided character relative to preset characters, and determining characters whose matching rate is larger than a second preset threshold as the segmentation characters.
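For claim 7's first step, an illustrative sketch is given below: slide a window whose length equals each candidate subject word over the preliminary recognition result and keep windows whose word error rate clears the first preset threshold. The comparison direction ("larger than") follows the claim text verbatim, and word_error_rate is the helper sketched after claim 9; all other names are assumptions.

```python
# Illustrative only; word_error_rate is defined in the sketch after claim 9.

def find_subject_words(preliminary_result, preset_subject_words, first_threshold):
    matches = []
    for preset in preset_subject_words:
        n = len(preset)
        # divide the preliminary recognition result according to the
        # length of the preset subject word
        for i in range(len(preliminary_result) - n + 1):
            segment = preliminary_result[i:i + n]
            if word_error_rate(segment, preset) > first_threshold:
                matches.append((i, segment, preset))
    return matches
```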
8. The speech recognition method of claim 6, wherein determining the one or more levels of sub-subject words comprises:
segmenting, level by level, the portion of the preliminary recognition result other than the word serving as the subject word according to the lengths of one or more sub-subject words, determining a word error rate of each segmented word relative to one or more preset sub-subject words, and determining words whose word error rate is larger than a third preset threshold as the one or more levels of sub-subject words.
9. The speech recognition method of claim 7, wherein determining the word error rate of each segmented word relative to the one or more preset subject words comprises:
comparing each segmented word with the one or more preset subject words to determine the number of character substitutions, the number of character insertions, and the number of character deletions of each segmented word relative to the one or more preset subject words; and
determining, as the word error rate, the ratio of the sum of the number of character substitutions, the number of character insertions, and the number of character deletions to the total number of characters of each segmented word.
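Claims 9 and 10 describe a classic character-level edit distance. The following self-contained sketch counts substitutions, insertions, and deletions with a standard Levenshtein table and divides their sum by the segmented word's character count; only the function name is an assumption.

```python
# Word error rate per claims 9/10: (substitutions + insertions + deletions)
# divided by the total number of characters of the segmented word.

def word_error_rate(segmented_word, preset_word):
    m, n = len(segmented_word), len(preset_word)
    # dp[i][j] = minimum edits turning segmented_word[:i] into preset_word[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deletions only
    for j in range(n + 1):
        dp[0][j] = j  # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if segmented_word[i - 1] == preset_word[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + substitution)
    return dp[m][n] / max(m, 1)
```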
10. The speech recognition method of claim 8, wherein determining the word error rate of each segmented word relative to the one or more preset sub-subject words comprises:
comparing each segmented word with the one or more preset sub-subject words to determine the number of character substitutions, the number of character insertions, and the number of character deletions of each segmented word relative to the one or more preset sub-subject words; and
determining, as the word error rate, the ratio of the sum of the number of character substitutions, the number of character insertions, and the number of character deletions to the total number of characters of each segmented word.
11. A speech recognition apparatus comprising:
a finite state automaton construction module configured to construct a finite state automaton for the received speech based on a corpus of a speech recognition model, wherein each pronunciation unit of the speech serves as a state sequence of the finite state automaton, a jump from one state sequence to another serves as an edge of the finite state automaton, and each edge has a weight representing the probability of the corresponding state-sequence jump;
a search module configured to, starting from an initial state of the finite state automaton and according to the current state and the weights of its subsequent edges, select by a search method the subsequent edge with the largest weight and transfer along it, until a termination state of the finite state automaton is reached, so as to determine at least one recognition result sequence;
a probability determination module configured to determine a probability for each of the at least one recognition result sequence;
a preliminary recognition result determination module configured to take the recognition result sequence with the highest probability among the at least one recognition result sequence as a preliminary recognition result; and
a matching module configured to match the preliminary recognition result to questions in a preset question bank and to take the matched question as a final recognition result for the speech.
12. The speech recognition apparatus of claim 11, wherein the finite state automaton construction module is configured to determine the probability of a jump from one state to another as the weight of the corresponding edge.
13. The speech recognition apparatus of claim 11, wherein the probability determination module is configured to take, for each recognition result sequence, the product of the weights of the edges involved in that recognition result sequence as the probability of that recognition result sequence.
14. The speech recognition apparatus of any one of claims 11-13, wherein the finite state automaton construction module is configured to:
determine the frequency of occurrence, in a text library, of the text corresponding to the other state;
determine the frequency of occurrence, in the text library, of the text corresponding to the other state together with the text corresponding to its predecessor state; and
take, as the weight for jumping to the other state, the ratio of the frequency of occurrence, in the text library, of the text corresponding to the other state together with the text corresponding to the predecessor state to the frequency of occurrence, in the text library, of the text corresponding to the other state.
15. The speech recognition apparatus of claim 11, wherein the matching module comprises:
a subject word determination module configured to determine a subject word of the preliminary recognition result;
a character determination module configured to determine segmentation characters in the portion of the preliminary recognition result other than the subject word; and
a final recognition result determination module configured to determine a matching question from the question bank, based on the subject word and the segmentation characters, as the final recognition result for the speech.
16. The speech recognition apparatus of claim 15, further comprising: a sub-subject word determination module configured to determine one or more levels of sub-subject words in the portion of the preliminary recognition result other than the subject word;
wherein the character determination module is configured to determine the segmentation characters from the portion of the preliminary recognition result other than the subject word and the one or more levels of sub-subject words, and
wherein the final recognition result determination module is configured to determine the final recognition result further based on the one or more levels of sub-subject words.
17. The speech recognition apparatus of claim 15, wherein
the subject word determination module is configured to:
segment the preliminary recognition result according to the lengths of one or more subject words, determine a word error rate of each segmented word relative to one or more preset subject words, and determine a word whose word error rate is larger than a first preset threshold as the subject word; and
the character determination module is configured to:
divide the portion of the preliminary recognition result other than the word serving as the subject word according to character length, determine a matching rate of each divided character relative to preset characters, and determine characters whose matching rate is larger than a second preset threshold as the segmentation characters.
18. The speech recognition apparatus of claim 16, wherein the sub-subject word determination module is configured to:
segment, level by level, the portion of the preliminary recognition result other than the word serving as the subject word according to the lengths of one or more sub-subject words, determine a word error rate of each segmented word relative to one or more preset sub-subject words, and determine words whose word error rate is larger than a third preset threshold as the one or more levels of sub-subject words.
19. The speech recognition apparatus of claim 17, wherein the subject word determination module is configured to:
compare each segmented word with the one or more preset subject words to determine the number of character substitutions, the number of character insertions, and the number of character deletions of each segmented word relative to the one or more preset subject words; and
determine, as the word error rate, the ratio of the sum of the number of character substitutions, the number of character insertions, and the number of character deletions to the total number of characters of each segmented word.
20. The speech recognition apparatus of claim 18, wherein the sub-subject word determination module is configured to:
compare each segmented word with the one or more preset sub-subject words to determine the number of character substitutions, the number of character insertions, and the number of character deletions of each segmented word relative to the one or more preset sub-subject words; and
determine, as the word error rate, the ratio of the sum of the number of character substitutions, the number of character insertions, and the number of character deletions to the total number of characters of each segmented word.
21. A speech recognition apparatus comprising:
a processor; and
a memory storing computer readable instructions that,
wherein the processor is configured to execute the computer readable instructions to perform the speech recognition method according to any one of claims 1 to 10.
22. A computer readable storage medium comprising computer readable instructions which, when executed by a processor, cause the processor to perform the speech recognition method according to any one of claims 1 to 10.
23. A robot dialogue method, comprising:
performing the speech recognition method according to any one of claims 1 to 10; and
determining and outputting an answer in response to the final recognition result.
24. A robot dialogue system comprising:
the speech recognition apparatus according to any one of claims 11 to 20; and
an answer output module configured to determine and output an answer in response to the final recognition result.
CN202310574724.6A 2023-05-19 2023-05-19 Speech recognition method and device and robot dialogue method and system Pending CN117133287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310574724.6A CN117133287A (en) 2023-05-19 2023-05-19 Speech recognition method and device and robot dialogue method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310574724.6A CN117133287A (en) 2023-05-19 2023-05-19 Speech recognition method and device and robot dialogue method and system

Publications (1)

Publication Number Publication Date
CN117133287A true CN117133287A (en) 2023-11-28

Family

ID=88855388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310574724.6A Pending CN117133287A (en) 2023-05-19 2023-05-19 Speech recognition method and device and robot dialogue method and system

Country Status (1)

Country Link
CN (1) CN117133287A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119132347A (en) * 2024-09-09 2024-12-13 美的集团(上海)有限公司 Method, device, medium, program product and system for responding to speech termination point

Similar Documents

Publication Publication Date Title
US12254865B2 (en) Multi-dialect and multilingual speech recognition
KR102390940B1 (en) Context biasing for speech recognition
US11776531B2 (en) Encoder-decoder models for sequence to sequence mapping
US11074909B2 (en) Device for recognizing speech input from user and operating method thereof
KR102375115B1 (en) Phoneme-Based Contextualization for Cross-Language Speech Recognition in End-to-End Models
US20180137109A1 (en) Methodology for automatic multilingual speech recognition
US10621975B2 (en) Machine training for native language and fluency identification
Czech A System for Recognizing Natural Spelling of English Words
Sainath et al. No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models
CN112397056B (en) Voice evaluation method and computer storage medium
KR20210001937A (en) 2021-01-06 The device for recognizing the user's speech input and the method for operating the same
JP7678227B2 (en) Joint Unsupervised and Supervised Training (JUST) for Multilingual Automatic Speech Recognition
CN118043885A (en) Contrast twin network for semi-supervised speech recognition
CN113990293B (en) Voice recognition method and device, storage medium, and electronic device
Xiong Fundamentals of speech recognition
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
CN113421587B (en) Voice evaluation method, device, computing equipment and storage medium
CN117133287A (en) Speech recognition method and device and robot dialogue method and system
US20240177706A1 (en) Monte Carlo Self-Training for Speech Recognition
CN114333848A (en) Voiceprint recognition method, device, electronic device and storage medium
CN114333760A (en) Information prediction module construction method, information prediction method and related equipment
Qiu et al. Context-aware neural confidence estimation for rare word speech recognition
Sharan et al. ASR for Speech based Search in Hindi using Attention based Model
JP2017167378A (en) Word score calculation device, word score calculation method, and program
US20240185844A1 (en) Context-aware end-to-end asr fusion of context, acoustic and text presentations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination