Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
In the related art, the underlying layer of a smart speaker uses speech recognition technology to recognize and understand a user's voice instruction, converting it into a corresponding text or command to be executed. Because of variations in the user's tone and pronunciation, the distance between the user and the speaker, or interference from other sounds in the environment, the smart speaker often recognizes the user's voice instruction with insufficient accuracy.
To improve the speech recognition of smart speakers, the following related technologies exist:
first, a noise-cancellation module is embedded in the speaker, so that when a user inputs a voice instruction, the user's echo and the background noise in the space are filtered out, improving the recognizability of the user's voice instruction;
second, sound collection or sound enhancement equipment, independent of the smart speaker, is arranged in the room close to the user; this equipment collects the user's voice and transmits it to the smart speaker, so that sound collection is no longer limited to the area where the smart speaker is located, which expands the speaker's sound collection range and further improves instruction recognition accuracy;
and third, additional smart devices such as Bluetooth headsets are used to realize the speech recognition function, so as to address the limited recognition distance and low recognition accuracy of speech recognition performed by the smart speaker alone.
Although the related art can partially solve the low recognition rate of smart speaker voice commands caused by noisy environments, excessive user distance, and the like, it provides no effective solution for the low recognition rate caused by the user's nonstandard pronunciation, unclear articulation, tone variation, and so on. For example, almost every evening before sleep, a user controls the smart speaker with the voice command "play Kiss Baby", and the speaker plays the lullaby "Kiss Baby" for the user's baby; the gentle, relaxed melody and loving lyrics quickly soothe the baby, so that play time naturally transitions into sleep. However, sometimes, because of a slight inaccuracy in the user's pronunciation or the limited technical level of the smart speaker's speech recognition module, the voice command may be recognized incorrectly; for example, "Kiss Baby" may be recognized as a similar keyword such as "Kiss Me Baby", so that music unrelated to "Kiss Baby" is played. The user then often needs to repeat "play Kiss Baby" several times until the smart speaker recognizes it correctly, which not only consumes much of the user's time and energy but also causes inconvenience and disturbance when the wrong music is played.
In view of this, in various embodiments of the present invention, an input voice instruction is recognized to obtain a primary recognition result, where the primary recognition result includes more than two initial recognition results. The primary recognition result is matched against a historical instruction data set; if more than two of the initial recognition results match the historical instruction data set, a final recognition result is determined according to the number of times the historical voice instructions corresponding to the matched initial recognition results have been executed. In this way, the primary recognition result of speech recognition can be corrected and the accuracy of speech recognition improved, so that the determined final recognition result conforms to the user's behavioral habits and psychological needs.
An embodiment of the present invention provides a speech recognition method, as shown in fig. 1, the method includes:
step 101, recognizing an input voice command to obtain a primary recognition result, wherein the primary recognition result comprises more than two initial recognition results;
Here, the input voice instruction is recognized by speech recognition technology to obtain a plurality of initial recognition results whose recognition probabilities meet a set requirement; these initial recognition results form the primary recognition result. The primary recognition result includes at least two initial recognition results, so that the initial recognition results can be corrected subsequently.
In practical applications, the coverage of the recognition results can be enlarged, and from the enlarged set of recognition results, either the recognition results whose recognition probability reaches a set threshold are selected as the primary recognition result, or the set number of recognition results with the highest recognition probabilities are selected as the primary recognition result.
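As an illustration of the two selection strategies described above, the following minimal sketch (function and variable names are hypothetical, not part of the invention) forms the primary recognition result either by a probability threshold or by a fixed candidate count:

```python
def primary_results(candidates, threshold=None, top_k=None):
    """Form the primary recognition result from ASR candidates.

    candidates: list of (text, probability) pairs from the recognizer.
    Keep every candidate whose probability reaches `threshold`, or,
    if no threshold is given, keep the `top_k` most probable ones.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if threshold is not None:
        return [c for c in ranked if c[1] >= threshold]
    return ranked[:top_k]

candidates = [("Kiss Me Baby", 0.99), ("Kiss Baby", 0.98),
              ("Cheers Baby", 0.97), ("Miss Baby", 0.60)]
print(primary_results(candidates, threshold=0.95))   # the first three candidates
print(primary_results(candidates, top_k=3))          # the same three here
```

Either strategy yields at least two initial recognition results in typical cases, which is what the subsequent correction steps require.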
Step 102, matching the primary recognition result with a historical instruction data set;
Here, the historical instruction data set includes the historical voice instructions executed by the speech recognition apparatus and the number of times each historical voice instruction has been executed.
In actual application, the historical instruction data set can be obtained locally or through a network.
Step 103, if more than two initial recognition results in the primary recognition result match the historical instruction data set, determining a final recognition result according to the number of times the historical voice instructions corresponding to the matched initial recognition results have been executed.
In practical applications, if only one initial recognition result in the primary recognition result matches a historical voice instruction in the historical instruction data set, the historical voice instruction corresponding to that initial recognition result is determined to be the final recognition result.
If more than two initial recognition results in the primary recognition result match the historical instruction data set, the final recognition result is determined according to the number of times the historical voice instructions corresponding to the matched initial recognition results have been executed. Specifically, the historical voice instruction executed the most times is selected as the final recognition result. The final recognition result is therefore more accurate and better matches the user's preferences and usage habits.
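The selection logic of steps 102 and 103 can be sketched as follows (a minimal illustration; names are hypothetical, and the history set is modeled as a mapping from instruction to execution count):

```python
def final_result(primary, history):
    """Steps 102-103: match candidates against the history and pick one.

    primary: list of initial recognition result texts.
    history: dict mapping historical voice instruction -> times executed.
    Returns the matched instruction executed most often, or None when no
    candidate matches (the attribute-based fallback handles that case).
    """
    matched = [p for p in primary if p in history]
    if not matched:
        return None
    if len(matched) == 1:
        return matched[0]
    return max(matched, key=lambda p: history[p])

history = {"Kiss Baby": 132, "Kiss Me Baby": 20}
print(final_result(["Kiss Me Baby", "Kiss Baby", "Cheers Baby"], history))
# "Kiss Baby": both candidates match, but it was executed 132 times vs 20
```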
In one embodiment, each of the initial recognition results has a corresponding attribute, the method further comprising:
and if the primary recognition result does not have an initial recognition result matched with the historical instruction data set, matching the attribute of each initial recognition result in the primary recognition result with a historical instruction attribute set, and determining a final recognition result according to the matching result, wherein the historical instruction attribute set comprises the attribute corresponding to the historical voice instruction and the executed times of the attribute corresponding to the historical voice instruction.
Here, the attribute may be a music style attribute of music. Illustratively, the music style attributes may include: children's songs, popular music, folk music, rock, etc.
In an embodiment, the matching the primary recognition result with the historical instruction attribute set, and determining a final recognition result according to the matching result includes:
if only one initial recognition result matched with the historical instruction attribute set exists in the primary recognition result, determining the initial recognition result as a final recognition result;
if more than two initial recognition results matched with the historical instruction attribute set exist in the primary recognition result, determining a final recognition result according to the executed times of the attributes corresponding to the more than two matched initial recognition results;
and if the primary recognition result does not have the initial recognition result matched with the historical instruction attribute set, determining a final recognition result according to the voice recognition matching degree of each initial recognition result in the primary recognition result.
Therefore, when the primary recognition result fails to match the historical instruction data set, the final recognition result can still be determined effectively according to the match between the primary recognition result and the historical instruction attribute set. If the primary recognition result also fails to match the historical instruction attribute set, the final recognition result can be determined according to the speech matching degree (i.e., the recognition probability) of each initial recognition result, that is, the initial recognition result with the highest recognition probability is selected as the final recognition result.
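The attribute-level fallback described above can be sketched as follows (hypothetical names; `primary` is assumed sorted by recognition probability, best first, so the final branch implements the highest-probability rule):

```python
def fallback_result(primary, attrs, attr_history):
    """Attribute fallback used when no candidate matches the history set.

    primary: candidate texts sorted by recognition probability, best first.
    attrs: dict mapping candidate text -> attribute (e.g. a music style).
    attr_history: dict mapping attribute -> times that attribute was executed.
    """
    matched = [p for p in primary if attrs.get(p) in attr_history]
    if len(matched) == 1:
        return matched[0]
    if matched:
        return max(matched, key=lambda p: attr_history[attrs[p]])
    return primary[0]   # no attribute match: highest recognition probability

attrs = {"Kiss Me Baby": "rock music", "Kiss Baby": "children's songs",
         "Cheers Baby": "folk music"}
attr_history = {"children's songs": 1256, "rock music": 326}
print(fallback_result(["Kiss Me Baby", "Kiss Baby", "Cheers Baby"],
                      attrs, attr_history))
# "Kiss Baby": children's songs were executed far more often than rock
```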
So that the historical instruction data set and the historical instruction attribute set can be updated according to the user's usage habits, in one embodiment, the method further comprises:
and outputting the final identification result to an execution module, and updating the historical instruction data set and the historical instruction attribute set according to the information fed back by the execution module.
In an embodiment, the updating the historical instruction data set and the historical instruction attribute set according to the information fed back by the execution module includes:
receiving, from the execution module, a feedback result indicating that the final recognition result has been executed successfully;
if the historical instruction data set has the historical voice instruction corresponding to the feedback result, updating the executed times of the historical voice instruction, and if the historical instruction data set does not have the historical voice instruction corresponding to the feedback result, adding the final recognition result into the historical instruction data set;
and if the historical instruction attribute set has the attribute corresponding to the feedback result, updating the executed times of the attribute; if the historical instruction attribute set does not have the attribute corresponding to the feedback result, adding the attribute of the final recognition result to the historical instruction attribute set.
In practical applications, if the execution module successfully executes the final recognition result, the historical instruction data set and the historical instruction attribute set are updated according to the information (i.e., the feedback result) returned by the execution module. Specifically, 1 is added to the execution count of the corresponding historical voice instruction in the historical instruction data set, and 1 is added to the execution count of the corresponding attribute in the historical instruction attribute set. If the historical instruction data set does not contain the historical voice instruction corresponding to the feedback result, the final recognition result is added to the historical instruction data set with an execution count of 1; if the historical instruction attribute set does not contain the attribute corresponding to the feedback result, the attribute of the final recognition result is added to the historical instruction attribute set with an execution count of 1.
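These update rules amount to a small increment-or-insert step, which can be sketched as follows (names hypothetical):

```python
def on_success(result, attr, history, attr_history):
    """Update both sets after the execution module reports success.

    Increment the execution count of an existing entry, or insert the
    new instruction/attribute with an initial count of 1.
    """
    history[result] = history.get(result, 0) + 1
    attr_history[attr] = attr_history.get(attr, 0) + 1

history = {"Kiss Baby": 131}
attr_history = {}
on_success("Kiss Baby", "children's songs", history, attr_history)
print(history, attr_history)
# {'Kiss Baby': 132} {"children's songs": 1}
```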
In an embodiment, the historical instruction data set further includes the time at which each historical voice instruction was executed, the method further comprising:
determining, based on the times at which the historical voice instructions were executed, the number of times the historical voice instructions were executed within different intervals, and correcting the execution counts of the historical voice instructions according to those per-interval counts.
Here, a historically valid period T may be divided into n equal time intervals, where n is a natural number greater than 2. Each historical voice instruction may be assigned to the corresponding interval according to the time at which it was executed, and each interval may be given a corresponding influence weight. The execution count of a historical voice instruction can then be corrected based on its per-interval execution counts and the corresponding influence weights.
In one embodiment, the historical instruction attribute set further includes the time at which each attribute was executed, the method further comprising:
determining, based on the times at which the attributes were executed, the number of times the attributes were executed within different intervals, and correcting the execution counts of the attributes according to those per-interval counts.
Similarly, each attribute execution may be assigned to the corresponding interval according to its time, each interval may be given a corresponding influence weight, and the execution counts of the attributes may be corrected based on the per-interval counts and the corresponding influence weights.
By correcting the executed times of the historical voice instructions in the historical instruction data set, the final recognition result can be determined according to the executed times of the historical voice instructions, the historical behavior habits of the user can be better met, and the accuracy of the recognition result is improved.
By correcting the executed times of the attributes in the historical instruction attribute set, the final recognition result can be determined according to the executed times of the attributes in the historical instruction attribute set, the historical behavior habit of the user can be better met, and the accuracy of the recognition result is improved.
In an embodiment, the influence weight may be determined based on a time factor, or based on different user populations. Taking the time factor as an example, the influence weight may be set to decrease for intervals farther from the current time and to increase for intervals closer to it.
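As a minimal illustration of the interval-weighted correction (the specific weight values below are arbitrary examples, not prescribed above):

```python
def corrected_count(per_interval_counts, weights):
    """Correct an execution count using per-interval influence weights.

    per_interval_counts[i] and weights[i] refer to interval i, with
    interval 0 the most recent; weights decrease toward the past.
    """
    return sum(w * f for w, f in zip(weights, per_interval_counts))

# e.g. three monthly intervals, newest first, with decaying weights
counts = [10, 4, 2]          # executions per interval
weights = [1.0, 0.5, 0.25]   # influence decays away from the present
print(corrected_count(counts, weights))   # 10*1.0 + 4*0.5 + 2*0.25 = 12.5
```

A recent burst of plays therefore outweighs the same number of plays recorded months ago, which is exactly the intent of the time factor.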
To implement the method of the embodiments of the present invention, an embodiment of the present invention further provides a speech recognition apparatus, which may be disposed in the smart speaker, or disposed separately and communicatively connected to the smart speaker.
As shown in fig. 2, the apparatus includes:
the voice recognition module 201 is configured to recognize an input voice instruction to obtain a primary recognition result, where the primary recognition result includes more than two initial recognition results;
a matching module 202, configured to match the primary recognition result with a historical instruction data set, where the historical instruction data set includes a historical voice instruction executed by a voice recognition device and the number of times the historical voice instruction is executed; and if more than two initial recognition results matched with the historical instruction data set exist in the primary recognition result, determining a final recognition result according to the executed times of the historical voice instructions corresponding to the more than two matched initial recognition results.
In an embodiment, each of the initial recognition results has a corresponding attribute, and the matching module 202 is further configured to: and if the primary recognition result does not have an initial recognition result matched with the historical instruction data set, matching the attribute of each initial recognition result in the primary recognition result with a historical instruction attribute set, and determining a final recognition result according to the matching result, wherein the historical instruction attribute set comprises the attribute corresponding to the historical voice instruction and the executed times of the attribute corresponding to the historical voice instruction.
In an embodiment, the matching module 202 is specifically configured to:
if only one initial recognition result matched with the historical instruction attribute set exists in the primary recognition result, determining the initial recognition result as a final recognition result;
if more than two initial recognition results matched with the historical instruction attribute set exist in the primary recognition result, determining a final recognition result according to the executed times of the attributes corresponding to the more than two matched initial recognition results;
and if the primary recognition result does not have the initial recognition result matched with the historical instruction attribute set, determining a final recognition result according to the voice recognition matching degree of each initial recognition result in the primary recognition result.
In an embodiment, the apparatus further includes an updating module 203, configured to output the final recognition result to an execution module, and update the historical instruction data set and the historical instruction attribute set according to information fed back by the execution module.
In an embodiment, the updating module 203 is specifically configured to:
receiving, from the execution module, a feedback result indicating that the final recognition result has been executed successfully;
if the historical instruction data set has the historical voice instruction corresponding to the feedback result, updating the executed times of the historical voice instruction, and if the historical instruction data set does not have the historical voice instruction corresponding to the feedback result, adding the final recognition result into the historical instruction data set;
if the historical instruction attribute set has the attribute corresponding to the feedback result, updating the executed times of the attribute, and if the historical instruction attribute set does not have the attribute corresponding to the feedback result, adding the attribute of the final recognition result to the historical instruction attribute set.
In an embodiment, the historical instruction data set further includes executed time of a historical voice instruction, and the updating module 203 is specifically configured to: and determining the executed times corresponding to the historical voice instructions in different intervals based on the executed times of the historical voice instructions, and correcting the executed times of the historical voice instructions according to the executed times corresponding to the historical voice instructions in different intervals.
In an embodiment, the historical instruction attribute set further includes executed time of the attribute, and the updating module 203 is specifically configured to: and determining the executed times corresponding to the attributes in different intervals based on the executed times of the attributes, and correcting the executed times of the attributes according to the executed times corresponding to the attributes in the different intervals.
In practical applications, the speech recognition module 201, the matching module 202, and the updating module 203 may be implemented by a processor in the speech recognition apparatus. Of course, the processor needs to run a computer program in memory to implement its functions.
It should be noted that the speech recognition apparatus provided in the above embodiment is exemplified only by the above division of program modules when performing speech recognition; in practical applications, the processing may be allocated to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiments and is not repeated here.
The present invention will be described in further detail with reference to the following application examples.
As shown in fig. 3, the speech recognition apparatus according to this embodiment includes: a user input speech recognition module (equivalent to the speech recognition module), an instruction matching module (equivalent to the matching module), a result output and feedback module (having the function of the updating module), a user historical instruction data storage module, and a user historical instruction attribute storage module. The functions of each module are described as follows:
1. User input speech recognition module
The user input speech recognition module recognizes an instruction I input by the user through voice and preliminarily maps it to a corresponding executable instruction. For example, if the user says "I want to listen to Kiss Baby", the module parses "listen to" as a play action and "Kiss Baby" as the related content to be played, such as music or a story. Note that, unlike the speech recognition of a conventional smart speaker, in order to enlarge the coverage of the recognition results and further match them against historical instruction data, the module needs to generate, for each voice instruction of the user, a plurality of recognition results with higher probability and record their corresponding attributes, that is, the primary recognition result set R = {r1, r2, ..., rn} and its attributes {c1, c2, ..., cn}, where ri is each possible recognition result, ci is the attribute corresponding to the instruction, and n is the number of primary recognition results. n may be set to a fixed empirical value, for example 3, or defined by the recognition matching degree, for example by taking all recognition results whose matching degree is greater than 95%. For example, the speech recognition module may recognize "Kiss Baby" spoken by the user as the following results: "Kiss Me Baby" with a matching degree of 99%, "Kiss Baby" with a matching degree of 98%, and "Cheers Baby" with a matching degree of 97%.
2. User historical instruction data storage module
The user historical instruction data storage module stores all historical instruction data D (i.e., the historical instruction data set) executed by the user on the smart speaker, which represents the user's preferences in using the smart speaker. The data includes, but is not limited to, instructions executed by the user through voice input and instructions executed by operating the smart speaker through an APP on a smart terminal, so that the number of times each instruction has been executed can be counted, that is, D = {(d1, f_d1), (d2, f_d2), ..., (dn, f_dn)}.
3. User history instruction attribute storage module
The user historical instruction attribute storage module stores the attribute data C = {c1, c2, ..., cn} corresponding to all historical instructions executed by the user on the smart speaker. For example, for music-playing instructions, the attributes of the played music, such as children's songs, popular music, rock, etc., are stored, so as to count the number of times instructions of each attribute type are executed, that is, {f_c1, f_c2, ..., f_cn}.
In practical applications, it can be considered that the accuracy with which this data represents the user's preferences decays continuously the farther back in time it was recorded; for example, historical instruction data executed by the user on the smart speaker in February 2019 necessarily reflects the user's current preferences more accurately than data from March 2018. A time decay model of the influence of historical instruction data may therefore be established. In this case, the user historical instruction data storage module needs to record the time at which each instruction was executed and to count the number of times each instruction was executed in each time period (for example, each month in the past), the count increasing by 1 for each successful play. The user historical instruction attribute storage module likewise counts the number of times instructions of each attribute type were executed in each time period, the count increasing by 1 for each successful play.
If the user's historical instruction data over a period T (for example, the 12 months from March 2018 to February 2019) is considered, T may be divided into n stages (for example, one stage per month from March 2018 to February 2019). The n stages are sorted in time order from most recent to most distant, and the sequence is denoted as the set {1, 2, ..., n}. Then, for a given instruction di, the number of times the instruction was executed in each stage is stored separately as f_di = {f_di,1, f_di,2, ..., f_di,n}. For instructions of a given attribute type ci, the execution counts of that attribute's instructions in each stage are stored separately as f_ci = {f_ci,1, f_ci,2, ..., f_ci,n}.
Meanwhile, considering that instructions from different stages influence the user's preference over the whole period to different degrees, an influence weight is introduced, that is, the weight of each time stage i on the whole period is λi, where 0 < λi ≤ 1.
Then, after taking the time decay factor into account, the corrected play count of an instruction di over the period T may be defined as:
F(di) = λ1·f_di,1 + λ2·f_di,2 + ... + λn·f_di,n
and the corrected play count of instructions of an attribute type ci over the period T may be defined as:
F(ci) = λ1·f_ci,1 + λ2·f_ci,2 + ... + λn·f_ci,n
In practice, the weight λi can be flexibly adjusted according to how strongly the influence of each instruction decays over time, that is, by introducing a weight adjustment factor ηi, where
η1 > η2 > ... > ηn, and -1/n < ηi < (n-1)/n.
When adjusting, the differences in the instruction influence decay models of user populations with different characteristics can be considered. For a population with fixed preferences, such as the elderly, the degree to which historical instruction data reflects user preference is only weakly affected by time decay, so the λi values of the stages can be close to one another, that is, the adjustment factors have a narrow range; for example, the constraint may be imposed that the standard deviation of ηi is less than 0.01. For a population whose preferences change quickly, such as young people, the degree to which historical instruction data reflects user preference is strongly affected by time decay, so the λi values may differ more from one another, that is, the adjustment factors have a wider range; for example, the constraint may be imposed that the standard deviation of ηi is greater than 0.1.
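One way to instantiate these constraints is sketched below. The linear η schedule and the relation λi = 1/n + ηi are assumptions introduced for illustration; the text above only fixes the ordering and range conditions on ηi.

```python
import statistics

def stage_weights(n, spread):
    """Build stage weights lambda_i = 1/n + eta_i (stage 1 = most recent).

    The adjustment factors eta decrease linearly from +spread to -spread,
    so eta_1 > eta_2 > ... > eta_n and the etas sum to zero; keeping
    spread < 1/n keeps every eta in (-1/n, (n-1)/n) and every lambda
    in (0, 1).
    """
    etas = [spread - 2 * spread * i / (n - 1) for i in range(n)]
    lams = [1 / n + e for e in etas]
    return lams, statistics.stdev(etas)

# fast-changing preferences (e.g. young users): wide eta spread
_, sd_young = stage_weights(4, 0.2)
print(sd_young > 0.1)     # wide-spread standard deviation constraint holds
# stable preferences (e.g. elderly users): narrow eta spread
_, sd_old = stage_weights(4, 0.01)
print(sd_old < 0.01)      # narrow-spread constraint holds
```

Varying `spread` per user population then yields exactly the two regimes described above: a small spread for users with stable preferences, a large spread for users whose preferences change quickly.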
4. Instruction matching module
The instruction matching module matches the primary recognition result set R = {(r1, c1), (r2, c2), ..., (rn, cn)} output by the user input speech recognition module against the data in the user historical instruction data storage module and the user historical instruction attribute storage module, and finally outputs a unique recognition result (i.e., the final recognition result).
The specific flow of instruction matching is shown in fig. 4. For example, the speech recognition module may recognize "Kiss Baby" spoken by the user as the following results: "Kiss Me Baby" with a matching degree of 99%, "Kiss Baby" with a matching degree of 98%, and "Cheers Baby" with a matching degree of 97%, where the instruction attribute of "Kiss Baby" is "children's songs", the instruction attribute of "Kiss Me Baby" is "rock music", and the attribute of "Cheers Baby" is "folk music". The specific matching method is as follows:
for an instruction I input by a user voice, firstly, the instruction I is preliminarily recognized by a user input voice recognition module to obtain a primary recognition result R ═ { R ═ R1,r2,......rnAnd the attribute c corresponding to each result1,c2,......cn}. Then, the first-level identification result is matched with the data in the user historical instruction data storage module, and the matching method is to take the first-level identification result R and the historical indexLet intersection R ═ D of data D be { D ═ D1,d2,......dmI.e. which of the current possible results were played by the user within the period T:
1. If the intersection of the primary recognition result R and the historical instruction data D contains only one instruction, that is, only one of the current possible results was played by the user within the period T, that instruction is the final matching result Rfinal and is output to the result output and feedback module. For example, if the user historical instruction data storage module contains a play record for only one of the instructions, namely "Kiss Baby", the final matching result is "Kiss Baby".
2. If the intersection of the primary recognition result R and the historical instruction data D contains multiple instructions, that is, several of the current possible results were played by the user within the period T, the instruction that best represents the user's preference may be taken as the best matching result. That is, based on the corrected play count of each instruction over the period T, the instruction with the highest count is taken as the matching result: Rfinal = di, where di ∈ R ∩ D and di has the largest corrected play count among the instructions in R ∩ D. For example, if the user historical instruction data storage module contains play records for the two instructions "Kiss Baby" and "Kiss Me Baby", with a corrected play count over the past year of 132 for "Kiss Baby" and 20 for "Kiss Me Baby", the final matching result is "Kiss Baby".
If multiple instructions tie for the highest count, i.e., they represent the user's preference to the same degree, denote this tied set R_middle = {d_i, d_j, ..., d_α}, where {d_i, d_j, ..., d_α} ⊆ R∩D. The instruction whose attribute best represents the user's preference is then taken as the best matching result. The matching method takes the attributes {c_i, c_j, ..., c_α} of R_middle = {d_i, d_j, ..., d_α} and matches them against the user historical instruction attribute storage module to obtain the corrected play count of each attribute class within the period T. The attribute with the highest count is taken, and the instruction corresponding to that attribute is output as the final matching result: R_final = d_η, where d_η ∈ R_middle and the attribute of d_η has the highest corrected play count within the period T.
For example, the user historical instruction data storage module has two instruction play records, "Kiss Baby" and "Kiss Me Baby", both with a corrected play count of 86 over the past year. The instruction attribute of "Kiss Baby" is children's song and that of "Kiss Me Baby" is rock music; the corrected play count of instructions with the children's song attribute over the past year is 1256 and that of instructions with the rock music attribute is 326, so the final matching result is the instruction "Kiss Baby" corresponding to the children's song attribute.
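The attribute-based tie-break can be sketched like this; again, every name and count is a hypothetical illustration of the scheme described above:

```python
# Sketch of the tie-break: candidates with equal corrected play counts are
# disambiguated by the user's preference for their attribute class.
# Hypothetical data throughout.

corrected_plays = {"Kiss Baby": 86, "Kiss Me Baby": 86}        # tied counts
attribute_of = {"Kiss Baby": "children's song",
                "Kiss Me Baby": "rock music"}
attr_plays = {"children's song": 1256, "rock music": 326}      # per-attribute counts

top = max(corrected_plays.values())
R_middle = [d for d, n in corrected_plays.items() if n == top]  # tied set
R_final = max(R_middle, key=lambda d: attr_plays[attribute_of[d]])
print(R_final)  # -> Kiss Baby
```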
3. If the intersection of the primary recognition result R and the historical instruction data D is empty, that is, none of the current possible results was played by the user in the period T, the primary recognition result cannot be further matched against the user's historical instruction data. Matching can instead be performed against the user's historical instruction attribute data: the instruction in the primary recognition result whose attribute best represents the user's preference is taken as the best matching result. The matching method takes the attributes {c_1, c_2, ..., c_n} of the primary recognition result R = {r_1, r_2, ..., r_n} and intersects them with the data C = {c_1, c_2, ..., c_n} stored in the user historical instruction attribute storage module, obtaining the attribute matching result and the corrected play count of the instructions of each attribute class within the period T, i.e., determining which attributes of the instructions in the current possible result were played by the user during the period T:
1) If the intersection of the primary recognition result's attributes and the historical instruction attribute data C contains only one instruction, that is, the attribute of only one instruction in the current possible result was played by the user in the period T, the final matching result R_final is that instruction, which is output to the result output and feedback module. For example, none of the three instructions "Kiss Baby", "Kiss Me Baby" and "Chess Baby" is in the user historical instruction data storage module, and the user historical instruction attribute storage module has instruction play records only for the "children's song" attribute class, with none for the "rock music" or "folk music" attribute classes; the final matching result is then the instruction "Kiss Baby" corresponding to the "children's song" attribute class.
2) If there are multiple instructions in the intersection of the primary recognition result's attributes and the historical instruction attribute data C, that is, the attributes of multiple instructions in the current possible result were played by the user in the period T, the instruction corresponding to the attribute that best represents the user's preference is taken as the best matching result. That is, from the attribute matching result, the attribute with the highest corrected play count within the period T is taken, and the instruction corresponding to that attribute is output as the final matching result: R_final = d_η, where d_η ∈ R and the attribute of d_η has the highest corrected play count within the period T.
For example, none of the three instructions "Kiss Baby", "Kiss Me Baby" and "Chess Baby" is in the user historical instruction data storage module; the corrected play count of instructions with the children's song attribute over the past year is 1256, that of instructions with the rock music attribute is 326, and there is no instruction play record for the "folk music" attribute class, so the final matching result is the instruction "Kiss Baby" corresponding to the children's song attribute.
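The attribute-level fallback can be sketched as below; the mapping of instructions to attribute classes and the counts are hypothetical:

```python
# Sketch of case 3: the instruction-level intersection is empty, so match
# by attribute class instead. Hypothetical data throughout.

R = ["Kiss Baby", "Kiss Me Baby", "Chess Baby"]
attribute_of = {"Kiss Baby": "children's song",
                "Kiss Me Baby": "rock music",
                "Chess Baby": "folk music"}
attr_plays = {"children's song": 1256, "rock music": 326}  # period-T counts

# Candidates whose attribute class has at least one play record.
matched = [r for r in R if attribute_of[r] in attr_plays]
R_final = max(matched, key=lambda r: attr_plays[attribute_of[r]])
print(R_final)  # -> Kiss Baby
```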
3) If the intersection of the primary recognition result R's attributes and the historical instruction attribute data C is also empty, that is, the attribute of no instruction in the current possible result was played by the user in the period T, the recognition result can be further matched neither against the user's historical instruction data nor against the user's historical instruction attribute data. In this case it can be considered that no instruction in the primary recognition result R reflects the user's preference, and the instruction with the highest recognition confidence in the primary recognition result R is selected for output. For example, if "Kiss Me Baby" has the highest confidence at 99%, the final output result is "Kiss Me Baby".
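The final fallback to the recognizer's own confidence can be sketched in one line; the confidence scores here are hypothetical:

```python
# Sketch of case 3.3: no history matches at either the instruction or the
# attribute level, so fall back to the recognizer's confidence scores.
# Hypothetical scores.

confidence = {"Kiss Baby": 0.95, "Kiss Me Baby": 0.99, "Chess Baby": 0.40}
R_final = max(confidence, key=confidence.get)
print(R_final)  # -> Kiss Me Baby
```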
5. Result output and feedback module
This module receives the matching result output by the instruction matching module and outputs it to the relevant instruction execution module of the smart speaker so that the instruction is executed. After the instruction has been executed (for example, a piece of music has been played once), the module feeds the instruction and its play time back to the user historical instruction data storage module for storage, and feeds the attribute of the instruction and its play time back to the user historical instruction attribute storage module for storage. At this point the two storage modules also update the stored instruction and instruction attribute data according to the feedback, i.e., the corresponding play counts are incremented.
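The feedback update described above can be sketched as follows; the storage structures and the `feed_back` helper are hypothetical stand-ins for the two storage modules:

```python
# Sketch of the result output and feedback module: after an instruction is
# executed, its play count and its attribute class's play count are both
# incremented. The dictionaries below are hypothetical storage structures.

instruction_plays = {"Kiss Baby": 132}          # user historical instruction data
attribute_plays = {"children's song": 1256}     # user historical attribute data
attribute_of = {"Kiss Baby": "children's song"}

def feed_back(instruction):
    """Record one completed playback of `instruction` in both stores."""
    instruction_plays[instruction] = instruction_plays.get(instruction, 0) + 1
    attr = attribute_of[instruction]
    attribute_plays[attr] = attribute_plays.get(attr, 0) + 1

feed_back("Kiss Baby")
print(instruction_plays["Kiss Baby"], attribute_plays["children's song"])  # -> 133 1257
```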
Based on the hardware implementation of the foregoing program modules, and in order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides a speech recognition device. Fig. 5 shows only an exemplary structure of the speech recognition device, not its entire structure; part or all of the structure shown in fig. 5 may be implemented as needed.
As shown in fig. 5, a speech recognition apparatus 500 provided by an embodiment of the present invention includes: at least one processor 501, memory 502, and at least one network interface 503. The various components in the speech recognition device 500 are coupled together by a bus system 504. It will be appreciated that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 504 in fig. 5.
The memory 502 in embodiments of the present invention is used to store various types of data to support the operation of the speech recognition device. Examples of such data include: any computer program for operating on a speech recognition device.
The speech recognition method disclosed by the embodiment of the invention can be applied to the processor 501, or can be implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the speech recognition method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The processor 501 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 502, and the processor 501 reads the information in the memory 502 and, in combination with its hardware, completes the steps of the speech recognition method provided by the embodiment of the present invention.
In an exemplary embodiment, the speech recognition device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components for performing the aforementioned methods.
It will be appreciated that the memory 502 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
In an exemplary embodiment, the embodiment of the present invention further provides a storage medium, that is, a computer storage medium, which may be specifically a computer readable storage medium, for example, including a memory 502 storing a computer program, which is executable by a processor 501 of a speech recognition device to perform the steps described in the method of the embodiment of the present invention. The computer readable storage medium may be a ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM, among others.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.