CN114708860B - Voice command recognition method, device, computer equipment and computer readable medium
- Publication number: CN114708860B
- Application number: CN202210505217.2A
- Authority: CN (China)
- Prior art keywords: voice command, quantum, predicted, keywords, voice
- Legal status: Active
Classifications
All under G10L (Physics; Acoustics; speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
- G10L15/1822: Speech recognition; speech classification or search using natural language modelling; parsing for meaning understanding
- G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26: Speech recognition; speech to text systems
Abstract
The application is applicable to the technical field of artificial intelligence, and particularly relates to a voice command recognition method, a voice command recognition device, computer equipment, and a computer readable medium. When it is detected that a collected real voice command does not meet the pronunciation standard, a preset non-standard voice recognition model is used to recognize the real voice command and obtain keywords. The word sense of each keyword is extracted, and an auxiliary word is filled between any two adjacent keywords to obtain a predicted text. The predicted text is converted into a first predicted voice command, and a trained quantum deep learning model generates a corresponding control instruction from the first predicted voice command, thereby realizing voice command recognition. By judging whether the voice command meets the standard, and predicting and recognizing the command when it does not, the accuracy of command recognition can be improved; the control instruction can be obtained rapidly through quantum deep learning recognition, so that accuracy and timeliness of recognition can be guaranteed in a complex voice command environment.
Description
Technical Field
The application is applicable to the technical field of artificial intelligence, and particularly relates to a voice command recognition method, a voice command recognition device, computer equipment and a computer readable medium.
Background
At present, robots are increasingly used in daily life and can improve human quality of life: humans can direct robots to complete corresponding work through voice instructions, a process based on human-computer interaction technology. During human-computer interaction, humans and machines communicate by voice, so the machine needs to understand the true intention expressed in the human voice. Current voice recognition technology enables the machine to convert voice signals into corresponding text or commands through a process of recognition and understanding. For example, voice recognition technology is applied in the internet of vehicles, where a driver can set a destination and enter navigation by speaking with a robot customer service. However, as living standards develop, the requirements on the timeliness and accuracy of voice command recognition become ever higher, and the variety of voice commands also grows. Therefore, how to recognize commands in a timely and accurate manner when voice command types are numerous and complex is a problem to be solved.
Disclosure of Invention
In view of the above, embodiments of the present application provide a voice command recognition method, apparatus, computer device, and computer readable medium, so as to solve the problem of recognizing commands in a timely and accurate manner when voice command types are numerous and complex.
In a first aspect, an embodiment of the present application provides a voice command recognition method, where the voice command recognition method includes:
Collecting a real voice command;
When it is detected that the real voice command does not meet the pronunciation standard, recognizing the real voice command using a preset non-standard voice recognition model to obtain M keywords, wherein M is an integer greater than zero;
extracting word senses of all keywords when M is greater than 1, and filling an auxiliary word between any two adjacent keywords by combining the front-to-back sequence relation of all keywords in the real voice command, to obtain a predicted text;
Performing voice synthesis on the predicted text using a first pronunciation mode to obtain a first predicted voice command;
and inputting the first predicted voice command into a trained quantum deep learning model to generate a corresponding control instruction.
In one embodiment, after the capturing of the real voice command, further comprising:
When it is detected that the real voice command meets the pronunciation standard, inputting the real voice command into the trained quantum deep learning model to generate a corresponding control instruction.
In one embodiment, before performing voice synthesis on the predicted text using the first pronunciation mode to obtain the first predicted voice command, the method further includes:
Performing voice feature recognition on the real voice command to obtain target voice features;
according to the target sound characteristics, matching corresponding target users in a preset sound characteristic library, wherein the sound characteristic library stores the mapping relation between the users and the sound characteristics;
Determining a pronunciation mode matched with the target user in a preset standard voice library as a target pronunciation mode, wherein the standard voice library stores a mapping relation between the user and the pronunciation mode.
In one embodiment, after obtaining the M keywords, the method further includes:
When M is 1, performing voice synthesis on the keyword by using a second pronunciation mode to obtain a second predicted voice command;
and inputting the second predicted voice command into a trained quantum deep learning model to generate a corresponding control instruction.
In an embodiment, the trained quantum deep learning model comprises N one-dimensional convolution pooling layers and a full-connection layer, the N one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises a one-dimensional convolution layer and a one-dimensional pooling layer, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
Inputting the first predicted voice command into a trained quantum deep learning model, and generating a corresponding control instruction comprises:
inputting the first predicted voice command into a first one-dimensional convolution pooling layer, and outputting pooling results to a next one-dimensional convolution pooling layer until reaching a last one-dimensional convolution pooling layer;
And inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction.
In one embodiment, the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variational circuit and a measurer;
Inputting the first predicted voice command into a first one-dimensional convolution pooling layer comprises:
inputting the first predicted voice command into the quantum encoder to perform quantum encoding to obtain quantum encoding characteristics;
Classifying the quantum coding characteristics through the quantum variational circuit, and sending the classification result to the measurer for measurement of the authenticity of the classification result, to obtain a measurement result;
inputting the measurement result into a one-dimensional pooling layer, and outputting the pooling result.
In an embodiment, extracting word senses of all keywords, and filling an auxiliary word between any two adjacent keywords by combining the front-to-back sequence relation of all keywords in the real voice command, to obtain the predicted text, includes:
For any keyword, performing similarity analysis between the part of speech of the keyword and that of the previous keyword and/or the next keyword, and determining the word sense of the corresponding keyword according to the similarity analysis result;
And sequentially arranging all the keywords according to the front-to-back sequence relation in the real voice command, screening the auxiliary words according to the word senses of any two adjacent keywords, and filling the screened auxiliary words between the two corresponding adjacent keywords to obtain the predicted text.
In a second aspect, an embodiment of the present application provides a voice command recognition apparatus, including:
the command acquisition module is used for acquiring real voice commands;
The first recognition module is used for recognizing the real voice command using a preset non-standard voice recognition model to obtain M keywords when it is detected that the real voice command does not meet the pronunciation standard, wherein M is an integer greater than zero;
The text prediction module is used for extracting word senses of all keywords when M is greater than 1, and filling auxiliary words between any two adjacent keywords by combining the front-to-back sequence relation of all keywords in the real voice command, to obtain a predicted text;
the first voice synthesis module is used for performing voice synthesis on the predicted text using a first pronunciation mode to obtain a first predicted voice command;
The first instruction generation module is used for inputting the first predicted voice command into a trained quantum deep learning model and generating a corresponding control instruction.
In one embodiment, the voice command recognition device further comprises:
And the second recognition module is used for, after the real voice command is collected and when it is detected that the real voice command meets the pronunciation standard, inputting the real voice command into the trained quantum deep learning model to generate a corresponding control instruction.
In one embodiment, the voice command recognition device further comprises:
the voice feature recognition module is used for performing voice feature recognition on the real voice command, before voice synthesis is performed on the predicted text using the first pronunciation mode to obtain the first predicted voice command, so as to obtain target voice features;
The target user determining module is used for matching corresponding target users in a preset sound feature library according to the target sound features, wherein the sound feature library stores the mapping relation between the users and the sound features of the users;
and the target voice determining module is used for determining a pronunciation mode matched with the target user in a preset standard voice library as a target pronunciation mode, and the standard voice library stores a mapping relation between the user and the pronunciation mode.
In one embodiment, the voice command recognition device further comprises:
The second voice synthesis module is used for performing voice synthesis on the keywords by using a second pronunciation mode when M is 1 after M keywords are obtained, so as to obtain a second predicted voice command;
And the second instruction generation module is used for inputting the second predicted voice command into the trained quantum deep learning model and generating a corresponding control instruction.
In an embodiment, the trained quantum deep learning model comprises N one-dimensional convolution pooling layers and a full-connection layer, the N one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises a one-dimensional convolution layer and a one-dimensional pooling layer, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
the first instruction generation module includes:
the convolution pooling unit is used for inputting the first predicted voice command into a first one-dimensional convolution pooling layer and outputting a pooling result to a next one-dimensional convolution pooling layer until reaching a last one-dimensional convolution pooling layer;
the instruction generating unit is used for inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction.
In one embodiment, the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variational circuit and a measurer;
the convolution pooling unit includes:
The quantum coding subunit is used for inputting the first predicted voice command into the quantum coder to carry out quantum coding so as to obtain quantum coding characteristics;
The variational measurement subunit is used for classifying the quantum coding characteristics through the quantum variational circuit, and sending the classification result to the measurer for measurement of its authenticity to obtain a measurement result;
The pooling subunit is used for inputting the measurement result into the one-dimensional pooling layer and outputting the pooling result.
In one embodiment, the text prediction module includes:
The word sense analysis unit is used for carrying out similarity analysis on the keyword and the parts of speech of the previous keyword and/or the next keyword of the corresponding keyword aiming at any keyword, and determining the word sense of the corresponding keyword according to the similarity analysis result;
And the text prediction unit is used for sequentially arranging all the keywords according to the front-to-back sequence relation in the real voice command, screening the auxiliary words according to the word senses of any two adjacent keywords, filling the screened auxiliary words between the two corresponding adjacent keywords, and traversing all the keywords to obtain a predicted text.
In a third aspect, an embodiment of the present application provides a computer device, the computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor implementing the voice command recognition method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the voice command recognition method according to the first aspect.
Compared with the prior art, the embodiments of the application have the following beneficial effects. A real voice command is collected. When it is detected that the real voice command does not meet the standard, a preset non-standard voice recognition model is used to recognize the real voice command, obtaining at least one keyword with a front-to-back sequence relation. The word senses of all keywords are extracted and, in combination with the front-to-back sequence relation of the keywords, an auxiliary word is filled between any two adjacent keywords to obtain a predicted text. A text-to-speech tool, combined with any existing standard voice, converts the predicted text into a first predicted voice command, which is input into a trained quantum deep learning model to generate a corresponding control instruction, realizing voice command recognition. By judging whether a voice command meets the standard and predictively recognizing the command when it does not, the accuracy of command recognition can be improved; the control instruction can be obtained rapidly through quantum deep learning recognition, so that accuracy and timeliness of recognition can be guaranteed in a complex voice command environment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application environment of a voice command recognition method according to a first embodiment of the present application;
fig. 2 is a flow chart of a voice command recognition method according to a second embodiment of the present application;
fig. 3 is a flowchart of a voice command recognition method according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice command recognition device according to a fourth embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination", "in response to determination", "upon detection of [a described condition or event]" or "in response to detection of [a described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present application.
In order to illustrate the technical scheme of the application, the following description is made by specific examples.
The voice command recognition method provided by the embodiment of the application can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server. The client includes, but is not limited to, a palm computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud computer device, a Personal Digital Assistant (PDA), and other computer devices. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
Referring to fig. 2, a flow chart of a voice command recognition method according to a second embodiment of the present application is shown, where the voice command recognition method is applied to a server in fig. 1, and a computer device corresponding to the server is connected to a corresponding database to obtain corresponding data in the database. The computer equipment can be connected with a corresponding client, and the client sends the voice command to the server, so that the function of collecting the voice command by the server is realized. As shown in fig. 2, the voice command recognition method may include the steps of:
step S201, collecting a real voice command, and when the real voice command is detected to be not in accordance with the pronunciation standard, using a preset non-standard voice recognition model to recognize the real voice command, so as to obtain M keywords.
In the application, the server is connected with the corresponding client, and the client is used for collecting the real voice command and sending the real voice command to the server, so that the step of collecting the real voice command by the server is realized. The client may refer to a device with a voice acquisition device, such as a voice robot, a vehicle-mounted terminal, and the like.
Before the client acquires the real voice command, the client also needs to acquire the acquisition permission of surrounding environment sounds, and can acquire the subsequent real voice command and upload the subsequent real voice command to the server when acquiring the acquisition permission. For example, the user wakes up the acquisition function of the client through the local wake-up instruction, uploads the acquired real voice command to the server, or the user acquires the real voice command through the client and then uses the sending function to send the real voice command.
The real voice command may refer to all sound data in the environment; further, noise reduction and enhancement processing are performed on the real voice command before it is sent to the server, realizing a preliminary screening of the real voice command and ensuring the accuracy of subsequent recognition.
The pronunciation standard may refer to a standard pronunciation specified for a language; for example, Mandarin pronunciation serves as the standard pronunciation of the Chinese language. Because the real voice command may be speech formed in a dialect or another language, in order to ensure accuracy it is necessary to identify whether the real voice command meets the pronunciation standard, and then process standard-pronunciation speech and non-standard-pronunciation speech differently, so as to improve processing efficiency.
If the real voice command is a voice with nonstandard pronunciation, the real voice command needs to be corrected or standardized so as to ensure the subsequent requirement on voice recognition, and if the real voice command is a voice with standard pronunciation, the real voice command does not need to be corrected.
Further, the pronunciation standard may be set with reference to different languages. For example, for the Chinese language, Mandarin pronunciation is the standard: the word "putonghua" consists of three syllables that can be decomposed into eight phonemes, "p, u, t, o, ng, h, u, a". Detecting whether the real voice command meets the pronunciation standard may include the following steps:
Extracting sound characteristics of a real voice command by adopting a hidden Markov model, and acquiring all phonemes from the sound characteristics;
the proportion of the eight Mandarin phonemes among all extracted phonemes is calculated; when the proportion is greater than a threshold, the real voice command is judged to meet the pronunciation standard, and when the proportion is not greater than the threshold, the real voice command is judged not to meet the pronunciation standard.
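As an illustration of this check, the following minimal Python sketch assumes the phoneme sequence has already been extracted by the upstream hidden Markov model; the 0.6 threshold is an assumption, since the application does not fix a value:

```python
# Minimal sketch of the pronunciation-standard check described above.
MANDARIN_PHONEMES = {"p", "u", "t", "o", "ng", "h", "a"}  # the 8 phonemes of "putonghua" as a set ("u" repeats)

def meets_pronunciation_standard(phonemes, threshold=0.6):
    """True when the proportion of Mandarin phonemes exceeds the threshold (assumed value)."""
    if not phonemes:
        return False
    hits = sum(1 for p in phonemes if p in MANDARIN_PHONEMES)
    return hits / len(phonemes) > threshold

# Example: a mostly-Mandarin phoneme sequence passes the check.
print(meets_pronunciation_standard(["p", "u", "t", "o", "ng", "h", "u", "a", "x"]))  # True
```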
The preset non-standard speech recognition model may refer to a dialect recognizer for extracting keywords from dialect speech, and the like; a real speech command can be recognized through this model to extract the keyword features in it. The recognition process can include recognition of tone, sentence breaks, and the like; important words are extracted through this process to serve as keywords, so that the corresponding auxiliary words, such as modal words and connectives, are excluded.
The real voice command is recognized using the preset non-standard voice recognition model, and the corresponding M keywords can be obtained, where M is an integer greater than zero. When M is 1, only one usable keyword can be identified in the real voice command; when M is greater than 1, several usable keywords can be identified. If several keywords are identified, the keywords stand in a front-to-back sequence relation within the real voice command.
When M is 1, voice synthesis is performed on the keyword using a second pronunciation mode to obtain a second predicted voice command, and the second predicted voice command is input into the trained quantum deep learning model to generate a corresponding control instruction.
The keywords may each consist of one or more words, and the front-to-back sequence relation is the order in which the keywords appear in the real voice command, that is, their time order.
In the recognition result, each keyword is independent, but the keywords retain their original front-to-back sequence relation, so the key features from which the semantics can be analyzed are preserved.
Optionally, after the capturing of the real voice command, the method further includes:
when the real voice command is detected to meet the pronunciation standard, inputting the real voice command into a trained quantum deep learning model, and generating a corresponding control instruction.
When the real voice command meets the pronunciation standard, it does not need to be processed and can be sent directly to the trained quantum deep learning model to obtain a control instruction, so the processing efficiency of real voice commands is improved by this preliminary detection.
Step S202, extracting word senses of all keywords when M is greater than 1, and filling an auxiliary word between any two adjacent keywords by combining the front-to-back sequence relation of all keywords in the real voice command, to obtain a predicted text.
In the present application, word senses may refer to the parts of speech, meanings, and the like of a keyword; in essence they may be part-of-speech feature vectors, meaning feature vectors, and the like. All keywords are arranged in their front-to-back sequence relation, and the auxiliary word to be filled between two adjacent keywords is determined; the filled auxiliary word correlates with the word senses of the two adjacent keywords. After the auxiliary word is determined, it is filled between the two corresponding adjacent keywords, finally forming a predicted text containing the keywords and auxiliary words.
The predicted text is a textual form of the real voice command, but instead of converting the entire voice directly with a speech-to-text technique, part of the voice is converted into text and the remainder is filled in with predicted auxiliary words, finally forming the predicted text.
Compared with text converted directly from the real voice command, the predicted text has a standard expression, can be better applied in the subsequent voice command recognition, and allows an accurate control instruction to be obtained.
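A minimal sketch of this filling step follows, assuming the keywords arrive already ordered; choose_auxiliary_word is a hypothetical helper standing in for the word-sense-based screening described in step S303:

```python
def build_predicted_text(keywords, choose_auxiliary_word):
    """Join ordered keywords, inserting one auxiliary word between each
    adjacent pair. choose_auxiliary_word is a hypothetical helper that
    screens auxiliary words by the two neighbours' word senses."""
    if not keywords:
        return ""
    parts = [keywords[0]]
    for prev_kw, next_kw in zip(keywords, keywords[1:]):
        aux = choose_auxiliary_word(prev_kw, next_kw)  # may return "" when no word fits
        if aux:
            parts.append(aux)
        parts.append(next_kw)
    return "".join(parts)  # Chinese text needs no spaces between tokens
```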
In step S203, the first pronunciation mode is used to perform voice synthesis on the predicted text, so as to obtain a first predicted voice command.
In the application, a pronunciation mode may refer to a language with a preset manner of pronunciation, which may be a dialect pronunciation, a standard pronunciation, and the like. Either a dialect or a standard pronunciation may use machine-fitted pronunciation or manually recorded pronunciation: machine-fitted pronunciation pronounces phonemes with set timbre, volume, frequency, and the like, while manually recorded pronunciation uses a manual recording of the pronunciation of each word.
For example, for the Mandarin pronunciation mode of the Chinese language, machine-fitted pronunciation uses the set conditions to pronounce the eight phonemes "p, u, t, o, ng, h, u, a"; for a dialect pronunciation mode of the Chinese language, each word is pronounced in the dialect manually and recorded.
Speech synthesis may pronounce each word according to the first pronunciation mode and join all pronunciations together in word order to obtain the predicted speech, thereby converting the predicted text into the first predicted voice command.
Compared with a real voice command, the first predicted voice command has standard expression specifications, standard tone colors, standard volume and the like, and can be better applied to subsequent voice command recognition, so that an accurate control instruction is obtained.
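The concatenation described above can be sketched as follows; the per-word waveform table is a hypothetical structure standing in for the recorded or machine-fitted pronunciations of the chosen pronunciation mode:

```python
import numpy as np

def synthesize_predicted_voice(words, pronunciation_table):
    """Look up each word's waveform in the chosen pronunciation mode and
    join the waveforms in word order. pronunciation_table (word -> 1-D
    numpy array) is an assumed representation of the pronunciation mode."""
    return np.concatenate([pronunciation_table[w] for w in words])
```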
Optionally, before performing voice synthesis on the predicted text using the first pronunciation mode to obtain the first predicted voice command, the method further includes:
Performing voice feature recognition on the real voice command to obtain target voice features;
according to the target sound characteristics, matching corresponding target users in a preset sound characteristic library, wherein the sound characteristic library stores the mapping relation between the users and the sound characteristics;
Determining a pronunciation mode matched with the target user in a preset standard voice library as a target pronunciation mode, wherein the standard voice library stores a mapping relation between the user and the pronunciation mode.
A user can record a corresponding pronunciation to form a pronunciation mode, and the first pronunciation mode can be a pronunciation mode recorded by a user. To better fit the voice of the target user who initiated the real voice command, when the first predicted voice command is formed, the target user's pronunciation mode may be used if it is stored.
Specifically, voice feature recognition is performed on the real voice command so that the target user is matched from the sound feature library by the sound features; before this, the mapping relation between the target user and the sound features must have been stored in the sound feature library, where the user identity (Identity Document, ID) of the target user can be stored.
Then, the pronunciation mode corresponding to the target user is looked up in the standard voice library; before this, the target user and his or her pronunciation mode must have been stored in the standard voice library. Likewise, for the same user, the user ID stored in the standard voice library is consistent with the user ID in the sound feature library.
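A sketch of the two-library lookup, with both libraries reduced to plain dictionaries keyed by user ID; match_voiceprint stands in for the actual sound-feature comparison and is an assumption of this sketch:

```python
def target_pronunciation_mode(target_features, sound_feature_library,
                              standard_voice_library, match_voiceprint):
    """sound_feature_library: user ID -> stored sound features;
    standard_voice_library: user ID -> recorded pronunciation mode.
    User IDs are kept consistent across the two libraries."""
    for user_id, features in sound_feature_library.items():
        if match_voiceprint(target_features, features):  # same speaker?
            return standard_voice_library.get(user_id)
    return None  # no match: fall back to a default pronunciation mode
```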
Step S204, inputting the first predicted voice command into the trained quantum deep learning model to generate a corresponding control command.
According to the application, the trained quantum deep learning model can recognize a voice command as a control instruction, so that the voice command is used to control the corresponding device to act; the corresponding device is the device to be controlled in the real voice command collected by the client.
The trained quantum deep learning model is a machine learning model based on quantum superposition states; owing to the natural parallelism that comes with quantum superposition, the training speed can be effectively improved.
The trained quantum deep learning model may be built on combinations of models of different structures, e.g., PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch.
Optionally, the trained quantum deep learning model comprises N layers of one-dimensional convolution pooling layers and full-connection layers, the N layers of one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises a one-dimensional convolution layer and a one-dimensional pooling layer, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
Inputting the first predicted voice command into a trained quantum deep learning model, and generating a corresponding control instruction comprises:
Inputting a first predicted voice command into a first one-dimensional convolution pooling layer, and outputting a pooling result to a next one-dimensional convolution pooling layer until the last one-dimensional convolution pooling layer is reached;
And inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction.
The trained deep learning model is composed of convolution layers, pooling layers, and a full-connection layer; specifically, the numbers of convolution and pooling layers can be set as required. For example, the quantum deep learning model can be composed of 4 one-dimensional quantum convolution layers (QConv1d), 4 one-dimensional pooling layers (MaxPool1d), and a full-connection layer (FC), where the one-dimensional quantum convolution layers carry out the convolution operation according to the characteristics of quantum superposition states.
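The topology can be sketched in PyTorch as below; QConv1d here is a classical placeholder for the quantum convolution (in the model described, each sliding window would be evaluated by the VQC discussed later), and the channel sizes and kernel widths are assumptions for illustration:

```python
import torch
import torch.nn as nn

class QConv1d(nn.Module):
    """Placeholder for the one-dimensional quantum convolution layer; a
    classical Conv1d stands in so the topology runs end to end."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):
        return torch.relu(self.conv(x))

class QuantumSpeechCommandNet(nn.Module):
    """Four quantum-convolution + pooling blocks followed by a
    full-connection layer, matching the 4 + 4 + FC example above."""
    def __init__(self, n_classes=35):
        super().__init__()
        channels = [1, 8, 16, 32, 64]
        blocks = []
        for i in range(4):                          # N = 4 conv-pool blocks
            blocks += [QConv1d(channels[i], channels[i + 1], kernel_size=5),
                       nn.MaxPool1d(4)]             # one-dimensional pooling
        self.features = nn.Sequential(*blocks)
        self.fc = nn.LazyLinear(n_classes)          # full-connection layer

    def forward(self, x):                           # x: (batch, 1, samples)
        return self.fc(self.features(x).flatten(1))
```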
During training, a sample set is input into the one-dimensional convolution pooling layers and the full-connection layer of the quantum deep learning model, with contrastive loss as the loss function; the training target is convergence of the contrastive loss function.
The data set for model training may be the voice command recognition data set published by Google, containing 35 types of commands in total, such as "left", "go", "yes", "down", "up", "on", "right", "no", "off" and "stop", with 84843 training samples and 11005 test samples in total.
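The 35-command data set mentioned above matches Google's public Speech Commands corpus, for which torchaudio ships a loader; using that loader is our assumption of how the samples could be obtained:

```python
import torchaudio

# "data" must be an existing directory; the corpus is downloaded into it.
train_set = torchaudio.datasets.SPEECHCOMMANDS("data", download=True, subset="training")
test_set = torchaudio.datasets.SPEECHCOMMANDS("data", download=True, subset="testing")

waveform, sample_rate, label, speaker_id, utterance_number = train_set[0]
print(label, waveform.shape)  # e.g. a command word and torch.Size([1, 16000])
```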
Optionally, the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variational circuit and a measurer;
Inputting the first predicted voice command into the first layer one-dimensional convolution pooling layer comprises:
inputting the first predicted voice command into a quantum encoder for quantum encoding to obtain quantum encoding characteristics;
Classifying the quantum coding characteristics through the quantum variational circuit, and sending the classification result to the measurer for measurement of its authenticity to obtain a measurement result;
inputting the measurement result into a one-dimensional pooling layer, and outputting the pooling result.
As in an ordinary convolution operation, the one-dimensional convolution layer feeds the data into a parameterized quantum circuit (Variational Quantum Circuit, VQC) and outputs a convolution result through the VQC.
The VQC above is formulated as follows:

q1 = f_e(y1)
q2 = f_u(θ1, …, θm; c(q1))
q3 = f_d(q2)

where y1 = {a1, a2, …, an} is the input data; f_e, f_u and f_d correspond to E, V and M in the VQC, with E denoting the quantum encoder, V the quantum variational circuit, and M the measurer; q1, q2 and q3 are the outputs of f_e, f_u and f_d, respectively; θ1, …, θm are the learnable parameters in f_u. Here m is set to 3, representing 3 learnable parameters.
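A runnable sketch of one VQC window written with PennyLane; the angle encoding, gate choice, and single Pauli-Z readout are assumptions, since the text fixes only the E, V, M structure and the m = 3 learnable parameters:

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4                                     # one qubit per sample in the window
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc_window(y, theta):
    # f_e (E): encode the classical window into qubit rotation angles
    qml.AngleEmbedding(y, wires=range(n_qubits))
    # f_u (V): variational circuit carrying the m = 3 learnable parameters
    for w in range(n_qubits):
        qml.RX(theta[0], wires=w)
        qml.RY(theta[1], wires=w)
        qml.RZ(theta[2], wires=w)
    for w in range(n_qubits - 1):                # entangle neighbouring qubits
        qml.CNOT(wires=[w, w + 1])
    # f_d (M): measurement, yielding one convolution output value
    return qml.expval(qml.PauliZ(0))

theta = np.array([0.1, 0.2, 0.3], requires_grad=True)
print(vqc_window(np.array([0.5, -0.3, 0.8, 0.0]), theta))
```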
Trained with the same optimization algorithm and the same amount of data, the quantum deep learning model reaches an accuracy of 78%, while under the same conditions the accuracy of the classical deep learning model is 75%, so the quantum deep learning model is more accurate. In addition, the training loss curves show that, under the same conditions, the quantum deep learning model converges faster than the classical deep learning model, and the more qubits the model uses, the faster the corresponding quantum deep learning model converges during training.
Optionally, after obtaining the M keywords, the method further includes:
When M is 1, performing voice synthesis on the keyword by using a second pronunciation mode to obtain a second predicted voice command;
And inputting the second predicted voice command into the trained quantum deep learning model to generate a corresponding control instruction.
When only one usable keyword can be identified in the real voice command, operations such as ordering the keywords and inserting auxiliary words are not needed. As before, a pronunciation mode may refer to a language with a preset manner of pronunciation, which may be a dialect pronunciation, a standard pronunciation, and the like, using machine-fitted or manually recorded pronunciation; machine-fitted pronunciation pronounces phonemes with set timbre, volume, frequency, and the like, while manually recorded pronunciation records the pronunciation of each word manually. The second pronunciation mode and the first pronunciation mode may be the same pronunciation mode.
The embodiment of the application collects a real voice command. When it is detected that the real voice command does not meet the standard, a preset non-standard voice recognition model is used to recognize the real voice command, obtaining at least one keyword with a front-to-back sequence relation. The word senses of all keywords are extracted and, in combination with the front-to-back sequence relation of the keywords, an auxiliary word is filled between any two adjacent keywords to obtain a predicted text. A text-to-speech tool, combined with any existing standard voice, converts the predicted text into a first predicted voice command, which is input into a trained quantum deep learning model to generate a corresponding control instruction, realizing voice command recognition. By judging whether a voice command meets the standard and predictively recognizing the command when it does not, the accuracy of command recognition can be improved; the control instruction can be obtained rapidly through quantum deep learning recognition, so that accuracy and timeliness of recognition can be guaranteed in a complex voice command environment.
Referring to fig. 3, a flowchart of a voice command recognition method according to a third embodiment of the present application is shown in fig. 3, where the voice command recognition method may include the following steps:
step S301, a real voice command is collected, and when the real voice command is detected to be not in accordance with the pronunciation standard, a preset non-standard voice recognition model is used for recognizing the real voice command, so that M keywords are obtained.
The content of step S301 is the same as that of step S201, and reference may be made to the description of step S201, which is not repeated here.
Step S302, when M is greater than 1, for any keyword, similarity analysis is carried out between the part of speech of the keyword and that of the previous keyword and/or the next keyword, and the word sense of the corresponding keyword is determined according to the similarity analysis result.
In the application, the parts of speech of the keywords before and after a given keyword are analyzed, and the part of speech of the keyword itself is also analyzed, so that the word sense of the keyword can be judged through part-of-speech similarity.
Step S303, arranging all the keywords in sequence according to the front-to-back sequence relation, screening the auxiliary words according to the word senses of any two adjacent keywords, filling the screened auxiliary words between the two corresponding adjacent keywords, and traversing all the keywords to obtain a predicted text.
According to the word senses of two adjacent keywords, the auxiliary word needed to connect them can be deduced. The auxiliary word is selected according to the part-of-speech similarity of the two keywords and the language specification.
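One way to realize this screening is a cosine-similarity vote over part-of-speech feature vectors; the candidate table mapping auxiliary words to the part-of-speech contexts they typically join is hypothetical, and cosine similarity is our assumed similarity measure:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_auxiliary_word(pos_prev, pos_next, candidates):
    """candidates: auxiliary word -> part-of-speech context vector it
    typically joins (a hypothetical table). The two neighbours' vectors
    are averaged and the best-matching auxiliary word is returned."""
    context = (np.asarray(pos_prev) + np.asarray(pos_next)) / 2
    return max(candidates, key=lambda w: cosine(candidates[w], context))
```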
Step S304, the first pronunciation mode is used to perform voice synthesis on the predicted text, and a first predicted voice command is obtained.
Step S305, inputting the first predicted voice command into the trained quantum deep learning model to generate a corresponding control command.
The contents of steps S304 to S305 are the same as those of steps S203 to S204, and reference may be made to the descriptions of steps S203 to S204, which are not repeated here.
The embodiment of the application collects a real voice command. When it is detected that the real voice command does not meet the standard, a preset non-standard voice recognition model is used to recognize the real voice command, obtaining at least one keyword with a front-to-back sequence relation. For any keyword, similarity analysis is carried out between the part of speech of the keyword and that of the previous and/or next keyword, and the word sense of the corresponding keyword is determined according to the similarity analysis result. All keywords are arranged in their front-to-back sequence relation, auxiliary words are screened according to the word senses of any two adjacent keywords, the screened auxiliary words are filled between the corresponding adjacent keywords, and all keywords are traversed to obtain a predicted text. A text-to-speech tool, combined with any existing standard voice, converts the predicted text into a first predicted voice command, which is input into a trained quantum deep learning model to generate a corresponding control instruction, realizing voice command recognition. By judging whether a voice command meets the standard and predictively recognizing the command when it does not, the accuracy of command recognition can be improved, and accuracy and timeliness of recognition can be guaranteed in a complex voice command environment.
Fig. 4 shows a block diagram of a voice command recognition device according to a fourth embodiment of the present application, where the voice command recognition device is applied to a server in fig. 1, and a computer device corresponding to the server is connected to a corresponding database to obtain corresponding data in the database. The computer equipment can be connected with a corresponding client, and the client sends the voice command to the server, so that the function of collecting the voice command by the server is realized. For convenience of explanation, only portions relevant to the embodiments of the present application are shown.
Referring to fig. 4, the voice command recognition apparatus includes:
a command acquisition module 41 for acquiring a real voice command;
A first recognition module 42, configured to recognize the real voice command using a preset non-standard voice recognition model to obtain M keywords when it is detected that the real voice command does not meet the pronunciation standard, where M is an integer greater than zero;
the text prediction module 43 is configured to extract word senses of all keywords when M is greater than 1, and fill an auxiliary word between any two adjacent keywords by combining the front-to-back sequence relation of all keywords in the real voice command, so as to obtain a predicted text;
a first speech synthesis module 44, configured to perform speech synthesis on the predicted text using the first pronunciation mode to obtain a first predicted voice command;
the first instruction generating module 45 is configured to input a first predicted voice command into the trained quantum deep learning model, and generate a corresponding control instruction.
Optionally, the voice command recognition device further includes:
and the second recognition module is used for, after the real voice command is collected and when it is detected that the real voice command meets the pronunciation standard, inputting the real voice command into the trained quantum deep learning model to generate a corresponding control instruction.
Optionally, the voice command recognition device further includes:
the voice feature recognition module is used for performing voice feature recognition on the real voice command, before voice synthesis is performed on the predicted text using the first pronunciation mode to obtain the first predicted voice command, so as to obtain target voice features;
the target user determining module is used for matching corresponding target users in a preset sound feature library according to the target sound features, and the sound feature library stores the mapping relation between the users and the sound features of the users;
And the target voice determining module is used for determining a pronunciation mode matched with a target user in a preset standard voice library as a target pronunciation mode, and the standard voice library stores a mapping relation between the user and the pronunciation mode.
Optionally, the voice command recognition device further includes:
The second voice synthesis module is used for performing voice synthesis on the keywords by using a second pronunciation mode when M is 1 after M keywords are obtained, so as to obtain a second predicted voice command;
And the second instruction generation module is used for inputting a second predicted voice command into the trained quantum deep learning model to generate a corresponding control instruction.
Optionally, the trained quantum deep learning model comprises N layers of one-dimensional convolution pooling layers and full-connection layers, the N layers of one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises a one-dimensional convolution layer and a one-dimensional pooling layer, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
accordingly, the first instruction generating module 45 includes:
the convolution pooling unit is used for inputting a first predicted voice command into a first one-dimensional convolution pooling layer and outputting a pooling result to a next one-dimensional convolution pooling layer until reaching a last one-dimensional convolution pooling layer;
the instruction generating unit is used for inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction.
Optionally, the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variational circuit and a measurer;
Accordingly, the convolution pooling unit includes:
The quantum coding subunit is used for inputting the first predicted voice command into the quantum coder to carry out quantum coding, so as to obtain quantum coding characteristics;
The variational measurement subunit is used for classifying the quantum coding characteristics through the quantum variational circuit, and sending the classification result to the measurer for measurement of its authenticity to obtain a measurement result;
the pooling subunit is used for inputting the measurement result into the one-dimensional pooling layer and outputting the pooling result.
Optionally, the text prediction module 43 includes:
The word sense analysis unit is used for carrying out similarity analysis on the keyword and the parts of speech of the previous keyword and/or the next keyword of the corresponding keyword aiming at any keyword, and determining the word sense of the corresponding keyword according to the similarity analysis result;
And the text prediction unit is used for sequentially arranging all the keywords according to the front-to-back sequence relation in the real voice command, screening the auxiliary words according to the word senses of any two adjacent keywords, filling the screened auxiliary words between the corresponding two adjacent keywords, and traversing all the keywords to obtain a predicted text.
It should be noted that, because the content of information interaction and execution process between the modules and the embodiment of the method of the present application are based on the same concept, specific functions and technical effects thereof may be referred to in the method embodiment section, and details thereof are not repeated herein.
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present application. As shown in fig. 5, the computer device of this embodiment includes: at least one processor (only one is shown in fig. 5), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the voice command recognition method embodiments described above when executing the computer program.
The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that fig. 5 is merely an example of a computer device and is not intended to be limiting; a computer device may include more or fewer components than shown, may combine certain components, or may have different components; for example, it may also include a network interface, a display screen, an input device, and the like.
The processor may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory includes a readable storage medium, an internal memory, etc. The internal memory may be the memory of the computer device and provides an environment for the execution of the operating system and the computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of the computer device; in other embodiments it may be an external storage device of the computer device, for example a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the memory may include both an internal storage unit and an external storage device of the computer device. The memory is used to store the operating system, application programs, a boot loader (BootLoader), data, and other programs such as the program code of a computer program, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated; in practical applications, the above functions may be allocated to different functional units and modules as needed, i.e. the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc.

The computer-readable medium may include at least: any entity or device capable of carrying the computer program code, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
The present application may also be implemented as a computer program product for implementing all or part of the steps of the method embodiments described above, when the computer program product is run on a computer device, causing the computer device to execute the steps of the method embodiments described above.
Each of the foregoing embodiments is described with its own emphasis; for parts that are not detailed or described in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (7)
1. A method of voice command recognition, the method comprising:
Collecting a real voice command;
When the real voice command is detected to be not in accordance with the pronunciation standard, a preset non-standard voice recognition model is used for recognizing the real voice command to obtain M keywords, wherein M is an integer larger than zero;
extracting word senses of all keywords when M is greater than 1, and filling an auxiliary word between any two adjacent keywords by combining the front-back sequence relation of all keywords in the real voice command to obtain a predicted text;
Performing voice synthesis on the predicted text by using a first pronunciation mode to obtain a first predicted voice command;
Inputting the first predicted voice command into a trained quantum deep learning model to generate a corresponding control command;
The trained quantum deep learning model comprises N layers of one-dimensional convolution pooling layers and full-connection layers, the N layers of one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises one-dimensional convolution layers and one-dimensional pooling layers, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
Inputting the first predicted voice command into the trained quantum deep learning model and generating a corresponding control instruction comprises:
inputting the first predicted voice command into a first one-dimensional convolution pooling layer, and outputting pooling results to a next one-dimensional convolution pooling layer until reaching a last one-dimensional convolution pooling layer;
Inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction;
the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variation circuit and a measurer;
Inputting the first predicted voice command into a first one-dimensional convolution pooling layer comprises:
inputting the first predicted voice command into the quantum encoder to perform quantum encoding to obtain quantum encoding characteristics;
Classifying the quantum coding characteristics through the quantum variation circuit, and sending the classification result to the measurer for measuring the authenticity of the classification result to obtain a measurement result;
Inputting the measurement result into a one-dimensional pooling layer, and outputting the pooling result;
Extracting word senses of all keywords and filling auxiliary words between any two adjacent keywords by combining the front-back sequence relation of all keywords in the real voice command to obtain the predicted text comprises:
For any keyword, performing similarity analysis on the part of speech of the keyword and that of the previous keyword and/or the next keyword, and determining the word sense of the corresponding keyword according to the similarity analysis result;
And sequentially arranging all the keywords according to the front-to-back sequence relation in the real voice command, screening the auxiliary words according to the word senses of any two adjacent keywords, and filling the screened auxiliary words between the two corresponding adjacent keywords to obtain the predicted text.
2. The voice command recognition method according to claim 1, further comprising, after collecting the real voice command:
When the real voice command is detected to be in accordance with the pronunciation standard, inputting the real voice command into the trained quantum deep learning model to generate a corresponding control instruction.
3. The method of claim 1, further comprising, prior to performing voice synthesis on the predicted text using the first pronunciation mode to obtain the first predicted voice command:
Performing voice feature recognition on the real voice command to obtain target voice features;
according to the target sound characteristics, matching corresponding target users in a preset sound characteristic library, wherein the sound characteristic library stores the mapping relation between the users and the sound characteristics;
Determining a pronunciation mode matched with the target user in a preset standard voice library as a target pronunciation mode, wherein the standard voice library stores a mapping relation between the user and the pronunciation mode.
4. The voice command recognition method according to claim 1, further comprising, after obtaining the M keywords:
When M is 1, performing voice synthesis on the keyword by using a second pronunciation mode to obtain a second predicted voice command;
and inputting the second predicted voice command into a trained quantum deep learning model to generate a corresponding control instruction.
5. A voice command recognition device, the device comprising:
the command acquisition module is used for acquiring real voice commands;
The first recognition module is used for recognizing the real voice command by using a preset non-standard voice recognition model to obtain M keywords when the real voice command is detected to be not in accordance with the pronunciation standard, wherein M is an integer larger than zero;
The text prediction module is used for extracting word senses of all keywords when M is greater than 1, and filling auxiliary words between any two adjacent keywords by combining the front-back sequence relation of all keywords in the real voice command to obtain a predicted text;
the first voice synthesis module is used for performing voice synthesis on the predicted text by using a first pronunciation mode to obtain a first predicted voice command;
the first instruction generation module is used for inputting the first predicted voice command into a trained quantum deep learning model to generate a corresponding control instruction;
The trained quantum deep learning model comprises N layers of one-dimensional convolution pooling layers and full-connection layers, the N layers of one-dimensional convolution pooling layers are sequentially connected, the output of the last one-dimensional convolution pooling layer is connected with the full-connection layer, the one-dimensional convolution pooling layer comprises one-dimensional convolution layers and one-dimensional pooling layers, the output of the one-dimensional convolution layer is connected with the one-dimensional pooling layer, a voice command is input into the one-dimensional convolution layer of the first one-dimensional convolution pooling layer, and N is an integer greater than zero;
the first instruction generation module includes:
the convolution pooling unit is used for inputting the first predicted voice command into a first one-dimensional convolution pooling layer and outputting a pooling result to a next one-dimensional convolution pooling layer until reaching a last one-dimensional convolution pooling layer;
the instruction generating unit is used for inputting the pooling result output by the last one-dimensional convolution pooling layer into the full-connection layer for connection to obtain a corresponding control instruction;
the one-dimensional convolution layer is a quantum convolution layer, and the quantum convolution layer comprises a quantum encoder, a quantum variation circuit and a measurer;
the convolution pooling unit includes:
The quantum coding subunit is used for inputting the first predicted voice command into the quantum coder to carry out quantum coding so as to obtain quantum coding characteristics;
The variation measurement subunit is used for classifying the quantum coding characteristics through the quantum variation circuit, and sending the classification result to the measurer for measuring the authenticity of the classification result to obtain a measurement result;
the pooling subunit is used for inputting the measurement result into the one-dimensional pooling layer and outputting the pooling result;
the text prediction module comprises:
The word sense analysis unit is used for, for any keyword, performing similarity analysis on the part of speech of the keyword and that of the previous keyword and/or the next keyword, and determining the word sense of the corresponding keyword according to the similarity analysis result;
And the text prediction unit is used for sequentially arranging all the keywords according to the front-to-back sequence relation in the real voice command, screening the auxiliary words according to the word senses of any two adjacent keywords, and filling the screened auxiliary words between the two corresponding adjacent keywords to obtain a predicted text.
6. A computer device, characterized in that it comprises a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice command recognition method according to any one of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the voice command recognition method according to any one of claims 1 to 4.
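As an illustration only of the sound-feature matching recited in claim 3, the following sketch matches target sound features against a preset sound feature library (a user-to-feature mapping) and then looks up the matched user's pronunciation mode in a standard voice library (a user-to-mode mapping). The cosine similarity measure and the 0.8 threshold are assumptions made for the example, not values from this application.

```python
# Hedged sketch of claim 3's speaker matching; the libraries, similarity
# measure, and threshold are illustrative assumptions.
import numpy as np

sound_feature_library = {"alice": np.array([0.9, 0.1, 0.3])}  # user -> sound features
standard_voice_library = {"alice": "pronunciation_mode_a"}    # user -> pronunciation mode

def match_pronunciation_mode(target_features, threshold=0.8):
    best_user, best_score = None, threshold
    for user, feats in sound_feature_library.items():
        # Cosine similarity between the target sound features and each entry.
        score = float(np.dot(target_features, feats) /
                      (np.linalg.norm(target_features) * np.linalg.norm(feats)))
        if score > best_score:
            best_user, best_score = user, score
    return standard_voice_library.get(best_user)

target_mode = match_pronunciation_mode(np.array([0.88, 0.12, 0.29]))
```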
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210505217.2A CN114708860B (en) | 2022-05-10 | 2022-05-10 | Voice command recognition method, device, computer equipment and computer readable medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114708860A CN114708860A (en) | 2022-07-05 |
| CN114708860B true CN114708860B (en) | 2024-10-11 |
Family
ID=82176201
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210505217.2A Active CN114708860B (en) | 2022-05-10 | 2022-05-10 | Voice command recognition method, device, computer equipment and computer readable medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114708860B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115457956A (en) * | 2022-08-15 | 2022-12-09 | 脉格信息技术(北京)有限公司 | An intelligent large-screen control method based on voice recognition and command matching |
| CN116561584B (en) * | 2023-05-31 | 2025-11-04 | 平安科技(深圳)有限公司 | A method, apparatus, and storage medium for voice privacy inference based on variable quantum circuits. |
| CN119967082B (en) * | 2025-04-03 | 2025-08-29 | 荣耀终端股份有限公司 | Speech recognition method, electronic device, storage medium and chip system |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103277874A (en) * | 2013-06-19 | 2013-09-04 | 江苏华音信息科技有限公司 | Device for nonspecific persons to remotely control intelligent air conditioner by Chinese speech |
| CN105488032A (en) * | 2015-12-31 | 2016-04-13 | 杭州智蚁科技有限公司 | Speech recognition input control method and system |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5893133A (en) * | 1995-08-16 | 1999-04-06 | International Business Machines Corporation | Keyboard for a system and method for processing Chinese language text |
| US7181396B2 (en) * | 2003-03-24 | 2007-02-20 | Sony Corporation | System and method for speech recognition utilizing a merged dictionary |
| JP2015011170A (en) * | 2013-06-28 | 2015-01-19 | 株式会社ATR−Trek | Voice recognition client device performing local voice recognition |
| CN108682420B (en) * | 2018-05-14 | 2023-07-07 | 平安科技(深圳)有限公司 | Audio and video call dialect recognition method and terminal equipment |
| CN113470662B (en) * | 2020-03-31 | 2024-08-27 | 微软技术许可有限责任公司 | Generating and using text-to-speech data for keyword detection system and speaker adaptation in speech recognition system |
| CN112530433B (en) * | 2020-12-01 | 2023-11-07 | 杭州灵伴科技有限公司 | General voice instruction generation method and device and augmented reality display device |
| CN113314110B (en) * | 2021-04-25 | 2022-12-02 | 天津大学 | Language model based on quantum measurement and unitary transformation technology and construction method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||