Disclosure of Invention
The present application provides an artificial intelligence based speech recognition method, system, device, and medium, which can improve the accuracy of speech recognition under homophone confusion.
In a first aspect, the present application provides an artificial intelligence based speech recognition method, comprising the steps of:
acquiring a voice signal to be recognized;
determining a pitch period of the voice signal based on amplitude deviations between adjacent voice amplitudes in the voice signal, and extracting keywords from an initial voice text corresponding to the voice signal by means of the pitch period and a voice spectrogram of the voice signal, so as to obtain a plurality of voice keywords corresponding to the voice signal;
performing homonym expansion on each voice keyword to obtain homonym groups of each voice keyword;
acquiring context semantic information of each voice keyword in the initial voice text, and carrying out recognition verification on each voice keyword based on all the context semantic information and semantic features of homophones in each homophone group to obtain recognition ambiguity of each voice keyword in a voice recognition process;
and determining a text probability distribution of the voice signal according to the recognition ambiguity and homonym group corresponding to each voice keyword, and parsing the text probability distribution through a pre-trained voice recognition model to obtain a text recognition result of the voice signal.
In some embodiments, determining the pitch period of the speech signal based on the amplitude deviation between adjacent speech amplitudes in the speech signal specifically comprises:
performing framing processing on the voice signal to obtain a plurality of short-time signal frames corresponding to the voice signal;
carrying out amplitude calculation on each short-time signal frame to obtain an amplitude sequence of each short-time signal frame;
calculating amplitude deviations between adjacent voice amplitudes in each amplitude sequence;
and determining the pitch period of the voice signal according to the deviation change characteristics of all the amplitude deviations.
In some embodiments, extracting keywords from an initial voice text corresponding to the voice signal through the pitch period and a voice spectrogram of the voice signal, and obtaining a plurality of voice keywords corresponding to the voice signal specifically includes:
dividing the speech signal into a plurality of time periods based on the pitch period, wherein each time period contains one pitch period;
performing Fourier transform on the voice signals in each time period to obtain a voice spectrogram of each time period;
determining the frequency band regions and peak information in each voice spectrogram;
and extracting a plurality of voice keywords corresponding to the voice signal from the initial voice text corresponding to the voice signal based on the frequency band regions and the peak information in each voice spectrogram.
In some embodiments, performing homonym expansion on each voice keyword to obtain homonym groups of each voice keyword specifically includes:
performing phonetic analysis on each voice keyword, and identifying a plurality of candidate homophones corresponding to each voice keyword;
selecting a voice keyword as the selected voice keyword;
screening the plurality of candidate homophones corresponding to the selected voice keyword according to phonological features, and combining the screened candidate homophones to obtain the homophone group of the selected voice keyword;
and continuing to determine the homophone groups of the remaining voice keywords.
In some embodiments, performing recognition verification on each voice keyword based on all context semantic information and semantic features of homophones in each homophone group, so as to obtain recognition ambiguity of each voice keyword in a voice recognition process specifically includes:
performing semantic analysis on each voice keyword based on the context semantic information of each voice keyword to obtain the key semantic features of each voice keyword;
selecting a voice keyword as the selected voice keyword;
performing semantic association analysis on each homophone in the homophone group corresponding to the selected voice keyword according to the key semantic features of the selected voice keyword, to obtain the semantic relevance between the semantic features of each homophone in the homophone group and the key semantic features;
determining the recognition ambiguity of the selected voice keyword in the voice recognition process according to all the semantic relevance values;
and continuing to determine the recognition ambiguity of the remaining voice keywords in the voice recognition process.
In some embodiments, analyzing the text probability distribution using a pre-trained speech recognition model to obtain the text recognition result of the speech signal specifically includes:
acquiring a pre-trained voice recognition model;
inputting the text probability distribution into a pre-trained voice recognition model for analysis, and outputting a text recognition result of the voice signal by the voice recognition model.
In some embodiments, the speech signal to be recognized is obtained through an audio file.
In a second aspect, the present application provides an artificial intelligence based speech recognition system comprising:
The acquisition module is used for acquiring the voice signal to be recognized;
the processing module is used for determining the pitch period of the voice signal based on the amplitude deviation between adjacent voice amplitudes in the voice signal, and extracting keywords from the initial voice text corresponding to the voice signal through the pitch period and a voice spectrogram of the voice signal to obtain a plurality of voice keywords corresponding to the voice signal;
the processing module is further used for carrying out homonym expansion on each voice keyword to obtain homonym groups of each voice keyword;
The processing module is further used for acquiring context semantic information of each voice keyword in the initial voice text, and carrying out recognition verification on each voice keyword based on all the context semantic information and semantic features of homophones in homophone groups to obtain recognition ambiguity of each voice keyword in a voice recognition process;
The execution module is used for determining the text probability distribution of the voice signal according to the recognition ambiguity and homonym group corresponding to each voice keyword, and parsing the text probability distribution through a pre-trained voice recognition model to obtain the text recognition result of the voice signal.
In a third aspect, the present application provides a computer device comprising a memory storing code and a processor configured to obtain the code and to perform the artificial intelligence based speech recognition method described above.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the artificial intelligence based speech recognition method described above.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
According to the artificial intelligence based voice recognition method, system, device, and medium, a voice signal to be recognized is first acquired. Next, the pitch period of the voice signal is determined based on the amplitude deviations between adjacent voice amplitudes in the voice signal, and keywords are extracted from the initial voice text corresponding to the voice signal by means of the pitch period and the voice spectrogram of the voice signal, so as to obtain a plurality of voice keywords corresponding to the voice signal. Homonym expansion is then performed on each voice keyword to obtain the homonym group of each voice keyword. Context semantic information of each voice keyword in the initial voice text is acquired, and recognition verification is performed on each voice keyword based on all the context semantic information and the semantic features of the homophones in each homonym group, so as to obtain the recognition ambiguity of each voice keyword in the voice recognition process. Finally, the text probability distribution of the voice signal is determined according to the recognition ambiguity and homonym group corresponding to each voice keyword, and the text probability distribution is parsed through a pre-trained voice recognition model to obtain the text recognition result of the voice signal.
Therefore, the present application can improve the accuracy of speech recognition under homonym confusion. First, acquiring the voice signal to be recognized provides a data basis for subsequent voice recognition. Second, extracting keywords from the initial voice text corresponding to the voice signal by means of the voice spectrogram of the voice signal and the pitch period, so as to obtain a plurality of voice keywords corresponding to the voice signal, enables the voice recognition system to efficiently filter redundant information, accurately match the user's intention, and improve recognition accuracy. Further, performing homonym expansion on each voice keyword to obtain the homonym group of each voice keyword helps the voice recognition system make more reasonable judgments on uncertain pronunciations and thus recognize the true intention of the target voice. Furthermore, performing recognition verification on each voice keyword based on the context semantic information of each voice keyword in the initial voice text and the semantic features of the homophones in each homonym group, so as to obtain the recognition ambiguity of each voice keyword in the voice recognition process, allows the system to evaluate the similarity and uncertainty of the voice recognition results and to optimize the results accordingly, thereby providing candidates that better match the user's intention. Finally, determining the text probability distribution of the voice signal according to the recognition ambiguity and homonym group corresponding to each voice keyword, and parsing the text probability distribution through the pre-trained voice recognition model, effectively improves the system's understanding of the language, so that the voice signal can be converted into text more accurately. In summary, the technical solution provided by the present application can improve the accuracy of speech recognition under homonym confusion.
Detailed Description
In order to better understand the technical solution of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, which is an exemplary flowchart of an artificial intelligence based speech recognition method 100 according to some embodiments of the present application, the artificial intelligence based speech recognition method 100 generally includes the steps of:
in step 101, a speech signal to be recognized is acquired.
In a specific implementation, the voice signal to be recognized is obtained from an audio file, where the audio file stores the voice signal and the voice signal represents the audio data to be processed by voice recognition.
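For illustration only, the following is a minimal Python sketch of how step 101 might be realized, assuming the speech signal is stored in a WAV file and the soundfile package is available; the file name is hypothetical and any audio I/O library could be substituted.

```python
# Hypothetical sketch of step 101: loading a speech signal from an audio file.
# Assumes the soundfile package; the file path below is illustrative only.
import soundfile as sf

def acquire_speech_signal(path):
    """Read an audio file and return the samples and sampling rate."""
    signal, sample_rate = sf.read(path)   # signal: float array, 1-D for mono audio
    if signal.ndim > 1:                   # down-mix multi-channel audio to mono
        signal = signal.mean(axis=1)
    return signal, sample_rate

# Usage (hypothetical path):
# signal, sr = acquire_speech_signal("utterance.wav")
```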
In step 102, a pitch period of the speech signal is determined based on the amplitude deviations between adjacent speech amplitudes in the speech signal, and keywords are extracted from the initial speech text corresponding to the speech signal by means of the pitch period and the speech spectrogram of the speech signal, so as to obtain a plurality of speech keywords corresponding to the speech signal.
In some embodiments, reference is made to FIG. 2, which is an exemplary flow chart for determining a pitch period according to some embodiments of the application, where determining a pitch period of the speech signal based on amplitude deviations between adjacent speech amplitudes in the speech signal may be accomplished by:
firstly, in step 1021, framing the voice signal to obtain a plurality of short-time signal frames corresponding to the voice signal;
next, in step 1022, performing amplitude calculation on each short-time signal frame to obtain the amplitude sequence of each short-time signal frame;
then, in step 1023, calculating the amplitude deviations between adjacent voice amplitudes in each amplitude sequence;
finally, in step 1024, determining the pitch period of the voice signal from the deviation change characteristics of all the amplitude deviations.
In a specific implementation, the voice signal is first subjected to framing processing to obtain a plurality of short-time signal frames corresponding to the voice signal; that is, the voice signal is divided according to a preset sliding window and sliding step to obtain the plurality of short-time signal frames. For example, the sliding window may be set to 20 ms and the sliding step to 10 ms, and the sliding window is a Hamming window; in other embodiments, the sliding window and the sliding step may be set according to actual requirements, which is not limited here. Next, amplitude calculation is performed on each short-time signal frame using an existing short-time amplitude calculation method to obtain the amplitude sequence of each short-time signal frame; in other embodiments, other calculation methods may be used, which is not limited here. Then, the absolute difference between adjacent voice amplitudes in each amplitude sequence is calculated, and the absolute difference is taken as the amplitude deviation, so that all the amplitude deviations are obtained. Finally, the pitch period of the voice signal is determined according to the deviation change characteristics of all the amplitude deviations; that is, the mean of the amplitude deviations is calculated frame by frame in time order, the calculated mean amplitude deviation is taken as the deviation change characteristic, so that the deviation change characteristics at different time points are obtained, and the time corresponding to the first local minimum among all the deviation change characteristics is extracted as the pitch period of the voice signal, where the deviation change characteristic represents the variation trend of the amplitude deviation over time.
It should be noted that, in this embodiment, a short-time signal frame represents a voice signal within a short period of time, and the amplitude sequence represents a set containing a plurality of signal amplitudes. The pitch period represents the basic repetition period produced by the periodic vibration of the voice signal, and is generally expressed as the time interval between adjacent periodic signal peaks (or signal valleys). By determining the pitch period, the robustness of voice keyword extraction can be effectively improved.
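For illustration, the following Python sketch follows one reading of steps 1021-1024: Hamming-windowed 20 ms frames with a 10 ms step, per-frame absolute-amplitude sequences, absolute deviations between adjacent amplitudes, and the time of the first local minimum of the per-frame mean deviation taken as the pitch period. It is a sketch under these assumptions, not a definitive implementation.

```python
# Sketch of steps 1021-1024 under the assumptions stated above (20 ms Hamming
# window, 10 ms step, first local minimum of the mean amplitude deviation).
import numpy as np

def estimate_pitch_period(signal, sample_rate, win_ms=20, hop_ms=10):
    signal = np.asarray(signal, dtype=float)
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    hamming = np.hamming(win)

    # Step 1021: framing into short-time signal frames
    frames = [signal[i:i + win] * hamming
              for i in range(0, len(signal) - win + 1, hop)]

    deviation_features = []
    for frame in frames:
        amplitudes = np.abs(frame)                    # step 1022: amplitude sequence
        deviations = np.abs(np.diff(amplitudes))      # step 1023: adjacent amplitude deviations
        deviation_features.append(deviations.mean())  # step 1024: deviation change characteristic

    # Time of the first local minimum of the deviation change characteristic
    for k in range(1, len(deviation_features) - 1):
        if (deviation_features[k] < deviation_features[k - 1]
                and deviation_features[k] < deviation_features[k + 1]):
            return k * hop / sample_rate              # pitch period in seconds
    return None                                       # no local minimum found
```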
In some embodiments, extracting keywords from the initial voice text corresponding to the voice signal by means of the pitch period and the voice spectrogram of the voice signal, so as to obtain a plurality of voice keywords corresponding to the voice signal, may be implemented by the following steps:
dividing the speech signal into a plurality of time periods based on the pitch period, wherein each time period contains one pitch period;
performing Fourier transform on the voice signal in each time period to obtain the voice spectrogram of each time period;
determining the frequency band regions and peak information in each voice spectrogram;
and extracting a plurality of voice keywords corresponding to the voice signal from the initial voice text corresponding to the voice signal based on the frequency band regions and the peak information in each voice spectrogram.
In a specific implementation, the speech signal is first divided into a plurality of time periods in time order based on the length of the pitch period, wherein each time period contains one pitch period. The speech signal within each time period is acquired and Fourier transformed to obtain the speech spectrogram of each time period. Then, the frequency band regions and peak information in each speech spectrogram are determined; that is, the frequency band regions in each speech spectrogram are extracted using Mel-frequency cepstral coefficients, and the peak information in each speech spectrogram is identified using a first-derivative method from among peak-detection algorithms; in other embodiments, other methods may be used to determine the frequency band regions and peak information, which is not limited here. Finally, a plurality of speech keywords corresponding to the speech signal are extracted from the initial speech text corresponding to the speech signal based on the frequency band regions and peak information in each speech spectrogram; that is, the frequency band regions and peak information in each speech spectrogram are first matched against an existing phoneme-spectrum mapping model to obtain the phoneme sequence corresponding to the speech signal, and then an n-gram language model is used to match the phoneme sequence against the word library of the initial speech text to obtain the plurality of speech keywords corresponding to the speech signal. In this embodiment, the initial speech text is obtained by rough recognition of the speech signal through a Hidden Markov Model (HMM) in the speech recognition model, which is not described in detail here.
It should be noted that, in this embodiment, the voice spectrogram represents the distribution of the voice signal in the frequency domain; specifically, it is the time-frequency-energy distribution obtained after the voice signal is converted from the time domain to the frequency domain. The frequency band region represents an energy distribution region within a specific frequency range of the voice spectrogram, and the peak information represents the local maxima of the amplitude over frequency in the voice spectrogram. A voice keyword represents a word carrying important information that is extracted from the voice signal. By determining the voice keywords, the voice recognition system can efficiently filter redundant information, accurately match the user's intention, and improve recognition accuracy and response speed.
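As an illustrative sketch of the spectrogram analysis described above, the following Python code cuts the signal into segments of one pitch period, applies a Fourier transform to each segment, and locates spectral peaks with a simple first-derivative sign-change test; the fixed frequency bands used to summarize the band regions are an assumption for the example, and Mel-frequency cepstral features could be used instead as described.

```python
# Sketch of the per-pitch-period spectrogram analysis; the three frequency bands
# are illustrative assumptions, not values taken from the description.
import numpy as np

def analyze_segments(signal, sample_rate, pitch_period_s):
    signal = np.asarray(signal, dtype=float)
    seg_len = max(1, int(pitch_period_s * sample_rate))
    results = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        segment = signal[start:start + seg_len]
        spectrum = np.abs(np.fft.rfft(segment))               # spectrogram slice for this period
        freqs = np.fft.rfftfreq(seg_len, d=1.0 / sample_rate)

        # Peak information: first-derivative sign change from positive to non-positive
        diff = np.diff(spectrum)
        peak_idx = [i for i in range(1, len(diff))
                    if diff[i - 1] > 0 and diff[i] <= 0]
        peaks = [(float(freqs[i]), float(spectrum[i])) for i in peak_idx]

        # Band regions: coarse energy per fixed frequency band (illustrative choice)
        bands = [(0, 500), (500, 2000), (2000, 4000)]
        band_energy = {band: float(spectrum[(freqs >= band[0]) & (freqs < band[1])].sum())
                       for band in bands}
        results.append({"peaks": peaks, "band_energy": band_energy})
    return results
```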
In step 103, homonym expansion is performed on each voice keyword, so as to obtain homonym groups of each voice keyword.
In some embodiments, referring to fig. 3, which is an exemplary flowchart for determining homonym groups according to some embodiments of the present application, performing homonym expansion on each voice keyword to obtain the homonym group of each voice keyword may be implemented by the following steps:
first, in step 1031, performing phonetic analysis on each voice keyword and identifying a plurality of candidate homophones corresponding to each voice keyword;
next, in step 1032, selecting a voice keyword as the selected voice keyword;
then, in step 1033, screening the plurality of candidate homophones corresponding to the selected voice keyword according to phonological features, and combining the screened candidate homophones to obtain the homophone group of the selected voice keyword;
finally, in step 1034, continuing to determine the homophone groups of the remaining voice keywords.
In a specific implementation, phonetic analysis is first performed on each voice keyword to identify a plurality of candidate homophones corresponding to each voice keyword; that is, a phoneme segmentation tool is used to convert each voice keyword into a standard phoneme sequence, where the standard phoneme sequence represents the voice keyword converted into a series of phonemes, and for each voice keyword, the standard phoneme sequence is mapped to words with similar pronunciation in the phoneme mapping library of a speech synthesis system, so as to obtain the plurality of candidate homophones corresponding to each voice keyword. The phoneme segmentation tool may be the Pypinyin tool; in other embodiments, other phoneme segmentation tools may be used, which is not limited here. Next, a voice keyword is selected as the selected voice keyword. Then, the plurality of candidate homophones corresponding to the selected voice keyword are screened according to phonetic features, and the screened candidate homophones are combined; that is, a dynamic time warping method from phonetic feature analysis is used to calculate the pronunciation similarity between the selected voice keyword and each candidate homophone, candidate homophones with sufficiently high similarity are retained, and the retained candidate homophones are combined to obtain the homophone group of the selected voice keyword. Finally, the homophone groups of the remaining voice keywords are determined in the same manner.
It should be noted that, in this embodiment, the candidate homophones represent a group of words with pronunciation similar to that of the target voice keyword but with different meanings, and a homophone group represents a combination of a plurality of homophones, that is, a homophone group is composed of a plurality of homophones. In voice recognition, noise, accent, and speaking-rate variation cause pronunciation deviations; therefore, determining the homophone groups helps the system make more reasonable judgments on uncertain pronunciations and thus recognize the true intention of the target voice.
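A minimal sketch of the homophone expansion is given below, assuming Chinese keywords, the Pypinyin tool mentioned above for phoneme (pinyin) conversion, and a hypothetical candidate vocabulary; exact pinyin matching is used here in place of the dynamic-time-warping similarity screening.

```python
# Sketch of homophone expansion with the Pypinyin tool; the vocabulary passed in
# is hypothetical, and exact pinyin equality replaces the DTW-based screening.
from pypinyin import lazy_pinyin

def homophone_group(keyword, vocabulary):
    key_pinyin = lazy_pinyin(keyword)          # standard phoneme (pinyin) sequence
    group = []
    for word in vocabulary:
        if word != keyword and lazy_pinyin(word) == key_pinyin:
            group.append(word)                 # candidate homophone with matching pronunciation
    return group

# Usage with a hypothetical vocabulary:
# homophone_group("其实", ["其实", "启事", "骑士", "歧视"])
```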
In step 104, obtaining context semantic information of each voice keyword in the initial voice text, and performing recognition verification on each voice keyword based on all the context semantic information and semantic features of homophones in each homophone group to obtain recognition ambiguity of each voice keyword in a voice recognition process.
In a specific implementation, the context semantic information of each voice keyword in the initial voice text is obtained through semantic analysis based on a text corpus; for example, the topic to which a voice keyword belongs is inferred from the corpus through a latent Dirichlet allocation (LDA) topic model, so that the context semantic information of the voice keyword in the initial voice text is obtained, which is not described in detail here. In other embodiments, the context semantic information of each voice keyword in the initial voice text may be obtained in other ways, for example, a method based on a language model or a method based on a context window, which is not limited here.
It should be noted that, in the present application, the context semantic information represents the association between the current sentence and the sentences before and after it; determining the context semantic information helps the voice recognition system understand the voice input more accurately, thereby reducing ambiguity and improving the accuracy of voice recognition.
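As a hedged example of the LDA-based option mentioned above, the following sketch infers a topic mixture for the initial voice text with scikit-learn's latent Dirichlet allocation and uses it as context semantic information; the corpus, the number of topics, and whitespace-tokenized input are assumptions of the example.

```python
# Sketch of LDA-based context extraction with scikit-learn; the corpus, topic
# count, and whitespace-tokenized text are assumptions of this example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_context(initial_text, corpus, n_topics=5):
    vectorizer = CountVectorizer()
    doc_term = vectorizer.fit_transform(corpus + [initial_text])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term[:-1])                     # learn topics from the corpus only
    topic_dist = lda.transform(doc_term[-1])   # topic mixture of the initial voice text
    return topic_dist[0]                       # vector usable as context semantic information
```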
In some embodiments, performing recognition verification on each voice keyword based on all the context semantic information and the semantic features of the homophones in each homophone group, so as to obtain the recognition ambiguity of each voice keyword in the voice recognition process, may be implemented by the following steps:
performing semantic analysis on each voice keyword based on the context semantic information of each voice keyword to obtain the key semantic features of each voice keyword;
selecting a voice keyword as the selected voice keyword;
performing semantic association analysis on each homophone in the homophone group corresponding to the selected voice keyword according to the key semantic features of the selected voice keyword, to obtain the semantic relevance between the semantic features of each homophone in the homophone group and the key semantic features;
determining the recognition ambiguity of the selected voice keyword in the voice recognition process according to all the semantic relevance values;
and continuing to determine the recognition ambiguity of the remaining voice keywords in the voice recognition process.
In a specific implementation, semantic analysis is first performed on each voice keyword based on the context semantic information of each voice keyword to obtain the key semantic features of each voice keyword; that is, a pre-trained large language model (LLM) is obtained, and for each voice keyword, the context semantic information of the voice keyword is input into the large language model as an input parameter, and the large language model outputs the key semantic features of the voice keyword, so that the key semantic features of each voice keyword are obtained. The large language model in this embodiment adopts an audio-conditioned speech recognition model (Seed-ASR), which can generate the semantic features of a voice keyword according to the input context information; in other embodiments, other methods may be used to perform semantic analysis on each voice keyword, which is not limited here. Next, a voice keyword is selected as the selected voice keyword, and semantic association analysis is performed on each homophone in the homophone group corresponding to the selected voice keyword according to the key semantic features of the selected voice keyword; that is, the semantic features of each homophone in the homophone group are obtained, and the similarity between the semantic features of each homophone and the key semantic features is calculated as the semantic relevance; in other embodiments, other association-analysis methods may be used, which is not limited here. Further, the recognition ambiguity of the selected voice keyword in the voice recognition process is determined according to all the semantic relevance values; that is, all the semantic relevance values are weighted and summed, and the weighted sum is taken as the recognition ambiguity of the selected voice keyword in the voice recognition process, where the weight of each semantic relevance may be set between 0 and 1 according to the frequency of occurrence of the corresponding homophone in the corpus: the higher the frequency, the larger the weight, and vice versa. Finally, the recognition ambiguity of the remaining voice keywords in the voice recognition process is determined in the same manner.
It should be noted that, in this embodiment, the key semantic features are information representing the content of a voice keyword, for example, word vector features; by determining the key semantic features, the voice recognition system can better handle homophones and make correct judgments in combination with the context. The semantic relevance represents the semantic association between words and measures how close their meanings are or how strongly they are related. The recognition ambiguity characterizes the degree of recognition confusion of a voice keyword in the voice recognition process; in this embodiment, the greater the recognition ambiguity, the smaller the degree of recognition confusion of the voice keyword. The recognition ambiguity measures the severity of the ambiguity the system faces when parsing the voice signal and reflects the difficulty of recognizing it correctly. By determining the recognition ambiguity, the voice recognition system can evaluate the similarity and uncertainty of the voice recognition results and optimize them accordingly, so as to provide results that better match the user's intention.
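The following Python sketch illustrates one way the verification step could be realized under the description above: cosine similarity between word-vector features of each homophone and the key semantic features gives the semantic relevance, and a frequency-weighted sum of the relevance values gives the recognition ambiguity. The embeddings and corpus frequencies are assumed inputs, and the cosine measure is an illustrative choice.

```python
# Sketch of the verification step: cosine similarity as semantic relevance and a
# frequency-weighted sum as recognition ambiguity; embeddings and corpus
# frequencies are assumed inputs provided by the caller.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognition_ambiguity(key_feature, homophone_features, corpus_freq):
    """key_feature: vector; homophone_features / corpus_freq: dicts keyed by homophone."""
    relevance = {w: cosine(key_feature, v) for w, v in homophone_features.items()}
    total_freq = sum(corpus_freq.get(w, 0) for w in relevance) or 1
    # Weights in [0, 1]: more frequent homophones contribute more, per the description
    weights = {w: corpus_freq.get(w, 0) / total_freq for w in relevance}
    ambiguity = sum(weights[w] * relevance[w] for w in relevance)
    return ambiguity, relevance
```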
It should also be noted that, in the present application, the recognition verification of each voice keyword refers to the process described above, namely: performing semantic analysis on each voice keyword based on its context semantic information to obtain the key semantic features of each voice keyword; selecting a voice keyword as the selected voice keyword; performing semantic association analysis on each homophone in the homophone group corresponding to the selected voice keyword according to the key semantic features of the selected voice keyword, to obtain the semantic relevance between the semantic features of each homophone in the homophone group and the key semantic features; determining the recognition ambiguity of the selected voice keyword in the voice recognition process according to all the semantic relevance values; and continuing to determine the recognition ambiguity of the remaining voice keywords in the voice recognition process, thereby completing the recognition verification of each voice keyword.
In step 105, the text probability distribution of the voice signal is determined according to the recognition ambiguity and homonym group corresponding to each voice keyword, and the text probability distribution is then parsed through a pre-trained voice recognition model to obtain the text recognition result of the voice signal.
In some embodiments, determining the text probability distribution of the speech signal according to the recognition ambiguity and homonym group corresponding to each speech keyword may be implemented by the following steps:
acquiring the recognition ambiguity and homonyms corresponding to each voice keyword;
extracting the statistical probability of each voice keyword and each homonym in homonym groups in a given corpus;
weighting the statistical probability of each voice keyword based on all the recognition ambiguities to obtain the weighted statistical probability of each voice keyword;
selecting a voice keyword as the selected voice keyword, and comparing the weighted statistical probability of the selected voice keyword with the statistical probability of each homonym in the corresponding homonym group; when the weighted statistical probability is greater than the statistical probability of every homonym in the corresponding homonym group, taking the weighted statistical probability as the selected statistical probability for the selected voice keyword, and when the weighted statistical probability is smaller than the statistical probability of any homonym in the corresponding homonym group, extracting the largest statistical probability from the corresponding homonym group as the selected statistical probability for the selected voice keyword;
continuing to determine the selected statistical probabilities for the remaining voice keywords;
and deriving the text probability distribution of the speech signal based on all the selected statistical probabilities.
In a specific implementation, the recognition ambiguity and homonym group corresponding to each voice keyword are first acquired. Next, the statistical probability of each voice keyword and of each homonym in its homonym group in a given corpus is determined; for example, the statistical probabilities can be extracted using Python-based data-processing tools, where the given corpus is the corpus most relevant to the voice text. Further, the statistical probability of each voice keyword is weighted based on its recognition ambiguity; that is, the recognition ambiguity corresponding to each voice keyword is multiplied by its statistical probability, and the product is taken as the weighted statistical probability of the voice keyword. Then, a voice keyword is selected as the selected voice keyword, and its weighted statistical probability is compared with the statistical probability of each homonym in the corresponding homonym group: when the weighted statistical probability is greater than the statistical probability of every homonym in the group, the weighted statistical probability is taken as the selected statistical probability for the selected voice keyword; when the weighted statistical probability is smaller than the statistical probability of any homonym in the group, the largest statistical probability in the group is taken as the selected statistical probability for the selected voice keyword. The selected statistical probabilities for the remaining voice keywords are determined in the same manner. Finally, all the selected statistical probabilities are combined to obtain the text probability distribution of the voice signal.
It should be noted that, in this embodiment, the statistical probability represents the probability of occurrence of a voice keyword (or homonym) in the given corpus, the weighted statistical probability represents the statistical probability after weighting by the recognition ambiguity, and the selected statistical probability represents the probability retained for a voice keyword after the comparison described above. The text probability distribution represents the set of all selected statistical probabilities and measures the probability that each voice keyword in the voice signal appears in the final voice text. By determining the text probability distribution, the language understanding capability of the voice recognition system can be effectively improved, so that text can be generated from the voice signal more accurately.
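For illustration, the following sketch assembles the text probability distribution as described: corpus counts give the statistical probabilities, each keyword's probability is weighted by its recognition ambiguity, and the larger of the weighted keyword probability and the best homonym probability is kept as the selected probability. The corpus counts are an assumed input.

```python
# Sketch of the text probability distribution assembly; corpus_counts (word ->
# occurrence count in the given corpus) is an assumed input.
def text_probability_distribution(keywords, homonym_groups, ambiguity, corpus_counts):
    total = sum(corpus_counts.values()) or 1
    def prob(word):                                       # statistical probability in the corpus
        return corpus_counts.get(word, 0) / total

    distribution = {}
    for kw in keywords:
        weighted = ambiguity[kw] * prob(kw)               # weighted statistical probability
        homonym_probs = [prob(h) for h in homonym_groups.get(kw, [])]
        best_homonym = max(homonym_probs, default=0.0)
        # Keep the keyword's weighted probability only if it beats every homonym
        distribution[kw] = weighted if weighted > best_homonym else best_homonym
    return distribution
```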
In some embodiments, the text probability distribution is parsed by a pre-trained speech recognition model, and the text recognition result of the speech signal is obtained by:
acquiring a pre-trained voice recognition model;
inputting the text probability distribution into the pre-trained speech recognition model for parsing, and obtaining the text recognition result of the speech signal based on the output of the speech recognition model.
In a specific implementation, a pre-trained speech recognition model is first obtained; the pre-trained speech recognition model is a speech recognition model trained on the basis of a convolutional neural network, for example Kaldi, DeepSpeech, or Google's Speech-to-Text API. The text probability distribution is then input into the pre-trained speech recognition model for parsing, and the text recognition result of the speech signal is obtained based on the output of the model. Here, the text probability distribution is usually expressed as a set of probability estimates for different possible text results; when the pre-trained speech recognition model receives the text probability distribution as input, it compares the distribution with the language and speech patterns learned during training and generates an output. The output of the model is the text recognition result, which represents the text content most likely to correspond to the input speech signal. In other words, the input text probability distribution provides the model with the different possibilities of the text that the speech signal may correspond to, and the model resolves these possibilities into the most likely text recognition result through the speech and language patterns it has learned.
It should be noted that, in the present application, the text recognition result indicates the recognition result of the voice signal.
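As a simplified stand-in for the final parsing step (not the pre-trained model itself), the following sketch shows the input/output shape only: given a text probability distribution expressed as candidate words with probabilities for each position, it picks the most probable candidate per position, whereas the pre-trained speech recognition model would rescore the candidates against its learned language and speech patterns.

```python
# Simplified stand-in for the model-based parsing: choose the most probable
# candidate at each position. The candidate dictionaries below are hypothetical.
def parse_distribution(position_candidates):
    """position_candidates: list of dicts mapping candidate word -> probability."""
    return "".join(max(cands, key=cands.get) for cands in position_candidates)

# Usage (hypothetical candidates):
# parse_distribution([{"其实": 0.7, "骑士": 0.3}, {"很好": 0.9, "狠好": 0.1}])
```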
In addition, in some embodiments, the present application provides an artificial intelligence-based speech recognition system, referring to fig. 4, which is a schematic structural diagram of the artificial intelligence-based speech recognition system according to some embodiments of the present application, the artificial intelligence-based speech recognition system 200 includes an acquisition module 201, a processing module 202, and an execution module 203, which are respectively described as follows:
The acquisition module 201 is mainly used for acquiring the voice signal to be recognized in the present application;
The processing module 202 is mainly configured to determine a pitch period of the speech signal based on an amplitude deviation between adjacent speech amplitudes in the speech signal, and extract keywords from an initial speech text corresponding to the speech signal by using the pitch period and a speech spectrogram of the speech signal to obtain a plurality of speech keywords corresponding to the speech signal;
The processing module 202 is further configured to perform homonym expansion on each of the voice keywords to obtain homonym groups of each of the voice keywords;
In addition, the processing module 202 is further configured to obtain the context semantic information of each voice keyword in the initial voice text, and perform recognition verification on each voice keyword based on all the context semantic information and the semantic features of the homophones in each homophone group, so as to obtain the recognition ambiguity of each voice keyword in the voice recognition process;
The execution module 203 in the present application is mainly configured to determine the text probability distribution of the voice signal according to the recognition ambiguity and homonym group corresponding to each voice keyword, and then parse the text probability distribution through a pre-trained voice recognition model to obtain the text recognition result of the voice signal.
In addition, the application also provides a computer device, which comprises a memory and a processor, wherein the memory stores codes, and the processor is configured to acquire the codes and execute the artificial intelligence-based voice recognition method.
In some embodiments, reference is made to FIG. 5, which is a schematic diagram of a computer device implementing an artificial intelligence based speech recognition method, according to some embodiments of the application. The artificial intelligence based speech recognition method of the above embodiments may be implemented by a computer device as shown in fig. 5, the computer device 300 comprising at least one processor 301, a communication bus 302, a memory 303 and at least one communication interface 304.
The processor 301 may be a general purpose central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs implementing the artificial intelligence based speech recognition method of the present application.
Communication bus 302 may be used to transfer information between the above-described components.
The memory 303 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 303 may be stand-alone and connected to the processor 301 via the communication bus 302. The memory 303 may also be integrated with the processor 301.
The memory 303 is used to store the program code for executing the solution of the present application, and the processor 301 controls its execution. The processor 301 is configured to execute the program code stored in the memory 303. One or more software modules may be included in the program code. The artificial intelligence based speech recognition method in the above embodiments may be implemented by one or more software modules in the program code stored in the memory 303 and executed by the processor 301.
The communication interface 304 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
In a specific implementation, as an embodiment, a computer device may include a plurality of processors, where each of the processors may be a single-core (single-CPU) processor or may be a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The computer device may be a general purpose computer device or a special purpose computer device. In a specific implementation, the computer device may be a desktop computer, a laptop computer, a web server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. Embodiments of the present application do not limit the type of the computer device.
In addition, the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the artificial intelligence-based voice recognition method when being executed by a processor.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.