
CN108172219B - Method and device for recognizing voice - Google Patents


Info

Publication number
CN108172219B
Authority
CN
China
Prior art keywords
voice
sound
determining
segment
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711133270.XA
Other languages
Chinese (zh)
Other versions
CN108172219A (en)
Inventor
徐夏伶
毛跃辉
梁博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201711133270.XA priority Critical patent/CN108172219B/en
Publication of CN108172219A publication Critical patent/CN108172219A/en
Application granted granted Critical
Publication of CN108172219B publication Critical patent/CN108172219B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0272: Voice signal separating
    • G10L 21/028: Voice signal separating using properties of sound source
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Analysis technique using neural networks
    • G10L 2015/088: Word spotting
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method and a device for recognizing voice. The method comprises the following steps: determining a voice to be recognized; determining an effective sound segment from a plurality of sound segments contained in the voice to be recognized; and acquiring a recognition result of the voice to be recognized according to command words extracted from the effective sound segment. The invention solves the technical problem of low voice recognition efficiency in the prior art.

Description

Method and device for recognizing voice
Technical Field
The invention relates to the field of intelligent control, in particular to a method and a device for recognizing voice.
Background
With the rapid development of science and technology, controlling intelligent equipment through voice has been widely adopted across industries. However, in a noisy environment the rate at which an intelligent device correctly receives voice is low, and the accuracy with which it recognizes that voice is correspondingly low. In addition, when a user's speech happens to contain a command word for controlling an intelligent device, the device completes the corresponding operation according to that command word even though the user did not intend to control it. For example, if the user says "the programs on CCTV1 are good", the television extracts the keyword "CCTV1" from the speech and switches the current program to CCTV1. It follows that the existing methods for controlling intelligent devices through voice suffer from misrecognition.
In addition, to address these problems, the prior art mainly classifies and screens mixed sound sources (i.e., sound sources in which voice and non-voice are mixed together) by voiceprint recognition, and then selectively processes each screened voice as a single sound source. The equipment implementing this method must perform a large amount of computation, which degrades device performance and thus reduces the efficiency of voice recognition.
No effective solution has yet been proposed for the problem of low voice recognition efficiency in the prior art.
Disclosure of Invention
The embodiment of the invention provides a method and a device for recognizing voice, which at least solve the technical problem of low voice recognition efficiency in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a method of recognizing speech, including: determining a voice to be recognized; determining an effective sound segment from a plurality of sound segments contained in the voice to be recognized; and acquiring a recognition result of the voice to be recognized according to the command words extracted from the effective sound segment.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for recognizing speech, including: the first determining module is used for determining the voice to be recognized; the second determining module is used for determining effective sound segments from a plurality of sound segments contained in the voice to be recognized; and the acquisition module is used for acquiring the recognition result of the voice to be recognized according to the command words extracted from the effective sound segment.
According to another aspect of embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program performs a method of recognizing a voice.
According to another aspect of the embodiments of the present invention, there is also provided a processor for executing a program, wherein the program executes a method of recognizing speech.
In the embodiment of the invention, voice is recognized with the aid of voiceprint recognition: the voice to be recognized is determined, effective sound segments are determined from the plurality of sound segments contained in the voice to be recognized, and the recognition result of the voice to be recognized is acquired according to the command words extracted from the effective sound segments. This achieves the aim of accurately recognizing voice, realizes the technical effect of improving the accuracy of voice recognition, and thereby solves the technical problem of low voice recognition efficiency in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method of recognizing speech according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative classification of mixed sounds according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an alternative division of individual sound segments according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative method of determining active segments according to embodiments of the invention; and
fig. 5 is a schematic structural diagram of a device for recognizing speech according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a method of recognizing speech, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than that presented herein.
Fig. 1 is a flowchart of a method for recognizing speech according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:
step S102, determining the voice to be recognized.
It should be noted that in a relatively noisy environment, the ambient sound includes both non-speech and speech, where non-speech is any sound other than human speech, for example the sound of a passing vehicle. The voice to be recognized is human speech. In order to recognize it accurately, the speech must first be extracted from the noisy ambient sound; this can be done with a voiceprint recognition method.
The process of applying voiceprint recognition to sound mainly comprises feature extraction and speech recognition. The task of feature extraction is to extract and select acoustic or voice features of the speaker's voiceprint that are strongly separable and highly stable. Good acoustic or voice features not only distinguish different speakers effectively but also remain relatively stable against noise when the same speaker's voice varies. After the acoustic or speech features of the sound are extracted, speech recognition is performed on them. Speech recognition technology, which is widely applied in the internet and smart-home industries, mainly converts a voice signal into a corresponding control instruction through recognition and analysis, and chiefly comprises feature extraction, pattern matching, and model training techniques.
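For illustration only (the patent does not disclose a concrete feature set), the following is a minimal Python sketch of the feature-extraction step, assuming MFCC features computed with the librosa library; the sampling rate, coefficient count, and file path are assumptions, not part of the embodiment:

    import librosa
    import numpy as np

    def extract_voiceprint_features(wav_path, n_mfcc=13):
        # Extract MFCCs as a stand-in for the "acoustic or voice features"
        # described above (the feature choice is an assumption).
        y, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.mean(mfcc, axis=1)                          # one vector per utterance

Averaging over frames is the simplest way to obtain a fixed-length voiceprint vector; a real system would more likely use a trained embedding.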
In addition, it should be noted that the accuracy of speech recognition in a noisy environment can be improved by extracting speech and non-speech from a sound in a noisy environment by a voiceprint recognition technique and then processing only the speech in the subsequent speech processing.
Step S104, determining effective sound segments from a plurality of sound segments contained in the voice to be recognized.
It should be noted that the command words contained in a valid sound segment are command words that control the smart device to complete corresponding operations, whereas command words contained in a non-valid sound segment cannot control the smart device. For example, suppose the voice content of a non-valid segment is "my dog gets irritated whenever I turn off the air conditioner" and the voice content of a valid segment is "quickly turn on the air conditioner". The processor of the smart device may start the air conditioner according to the command word "turn on the air conditioner" in the valid segment, but it will not turn off the air conditioner on account of the phrase "turn off the air conditioner" occurring in the non-valid segment.
In addition, it should be noted that the valid sound segment is determined from the plurality of sound segments included in the speech to be recognized; only the command words in the valid sound segment are used to control the intelligent device, while command words in non-valid sound segments cannot, so misrecognition of speech is effectively avoided.
And step S106, acquiring a recognition result of the voice to be recognized according to the command words extracted from the effective sound segment.
It should be noted that the command words are words for controlling the smart device, such as "turn on the air conditioner" and "turn off the air conditioner".
Specifically, after the command word is extracted from the valid sound segment, the processor of the smart device needs to further process the command word, for example, convert the command word into a control instruction corresponding to the command word, and send the control instruction to another unit of the smart device to execute an operation corresponding to the control instruction.
Based on the scheme defined in steps S102 to S106, the speech to be recognized is determined, an effective segment is determined from the plurality of segments it contains, and the recognition result of the speech to be recognized is obtained according to the command words extracted from the effective segment.
It is easy to note that the speech to be recognized contains a plurality of sound segments, several of which may contain command words; extracting command words from every segment could therefore misrecognize the speech. By first determining the valid sound segments among the plurality of segments and performing speech recognition only on the command words in those valid segments, misrecognition caused by command words embedded in longer speech is reduced, further improving the accuracy of speech recognition.
The above contents indicate that the present embodiment can achieve the purpose of accurately recognizing the speech, thereby achieving the technical effect of improving the accuracy of speech recognition, and further solving the technical problem of low speech recognition efficiency in the prior art.
In an alternative embodiment, determining the speech to be recognized comprises:
step S1020, acquiring mixed sound, wherein the mixed sound at least comprises voice and non-voice;
step S1022, extracting at least one voice from the mixed sound by using a voiceprint recognition algorithm;
step S1024, classifying at least one voice based on a voiceprint recognition algorithm to obtain a classification result;
step S1026, determining the speech to be recognized from the classification result.
Specifically, fig. 2 is a schematic diagram illustrating an alternative classification of mixed sounds. In fig. 2, on the left side of the arrow is the human voice extracted from the mixed sound, which contains two sound sources that may come from different users; the black rectangles represent segments of sound source 1 and the white rectangles represent segments of sound source 2. Since different users' voices have different speech characteristics, the voices can be distinguished according to each user's speech characteristics, that is, the at least one voice is classified. The classification result is shown on the right side of the arrow in fig. 2: sound source 1 is composed of sound segments 1, 4, 5 and 6, and sound source 2 is composed of sound segments 2, 3 and 7.
In addition, not every user may control the intelligent device by voice; only users with the required authority may do so. Therefore, after the at least one voice is distinguished, the speech characteristics of the distinguished voices can be matched against the speech characteristics pre-stored in the intelligent device. If the matching succeeds, the successfully matched voice is taken as the voice to be recognized; if it fails, the unmatched voice is discarded.
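A sketch of this permission check, assuming voiceprints are stored as feature vectors and compared by cosine similarity (the threshold value and storage format are assumptions):

    import numpy as np

    def match_authorized_speaker(features, enrolled, threshold=0.8):
        # Compare an utterance's feature vector against the voiceprints
        # pre-stored on the device; return the matched user id, or None
        # so the caller can discard the voice (as described above).
        best_id, best_score = None, -1.0
        for user_id, ref in enrolled.items():
            score = float(np.dot(features, ref) /
                          (np.linalg.norm(features) * np.linalg.norm(ref)))
            if score > best_score:
                best_id, best_score = user_id, score
        return best_id if best_score >= threshold else None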
It should be noted that at least one speech may be extracted from the mixed sound by a machine learning method, and the specific steps are as follows:
step S1022a, analyzing the mixed sound by using a preset model, and determining at least one voice in the mixed sound, where the preset model is obtained by machine learning training using multiple sets of sound data, and each set of data in the multiple sets of sound data includes: mixing the sounds and a tag identifying a voice in the mixed sound;
in step S1022b, the determined at least one speech is extracted from the mixed sound.
Specifically, the at least one voice in the mixed sound may come from different users, whose voices have different speech characteristics. Therefore, after the mixed sound in the noisy environment is obtained, it may be analyzed with a preset model obtained in advance through machine learning training, to determine the speech features present in the mixed sound, for example the number of distinct frequencies and the amplitude types of the sound. To extract voice from the mixed sound, a neural network model can be established: the speech features of multiple groups of mixed sound are obtained in advance, corresponding tags are set for the voices in each group of mixed sound by manual labeling, and the mixed sounds are then trained with the set tags to obtain the preset model. Once the preset model is constructed, the mixed sound serves as its input, and its output is the at least one voice extracted from the mixed sound.
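The patent leaves the network architecture open; one possible realization of the preset model, assuming per-frame feature vectors and manual speech/non-speech labels, using scikit-learn's MLPClassifier (the library choice and hyperparameters are assumptions):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_preset_model(frame_features, labels):
        # frame_features: (n_frames, n_features) array built from groups of
        # mixed sound; labels: 1 = speech frame, 0 = non-speech (manual tags).
        model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
        model.fit(frame_features, labels)
        return model

    def extract_speech(model, frame_features):
        # Keep only the frames the preset model marks as speech.
        return frame_features[model.predict(frame_features) == 1]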
It should be noted that machine learning can likewise be used to classify the at least one voice by constructing a classification model; the specific steps are as follows:
step S1024a, acquiring at least one voice;
step S1024b, analyzing the at least one voice by using a classification model, and determining a voice type of each voice in the at least one voice, where the classification model is obtained by machine learning training using multiple sets of voice data, and each set of voice data in the multiple sets of voice data includes: a tag of at least one voice and a voice type in the at least one voice;
step S1024c, classifying the at least one voice according to the voice type of each voice in the at least one voice, and obtaining a classification result.
Specifically, after the at least one voice is acquired, it can be analyzed with a classification model obtained in advance through machine learning training to determine the voice type of each voice, where the type of a voice can be determined from its speech characteristics; for example, a child's voice has characteristics different from an adult's, so whether a voice belongs to an adult or a child can be judged from those characteristics. To determine the voice type from the speech characteristics, a neural network model can be established: the speech characteristics of multiple groups of voices are obtained in advance, corresponding tags are set for the voice types in each group by manual labeling, and the voices are then trained with the set tags to obtain the classification model. Once the classification model is constructed, the voices serve as its input, and its output is the voice type corresponding to each voice.
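The classification model of steps S1024a to S1024c admits the same kind of sketch; assuming the MLPClassifier pattern above, with voice-type labels instead of speech flags:

    from sklearn.neural_network import MLPClassifier

    def train_classification_model(voice_features, voice_types):
        # voice_types are the manually set tags, e.g. 0 = adult, 1 = child
        # (the label scheme is an assumption, not part of the disclosure).
        model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
        model.fit(voice_features, voice_types)
        return model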
It should be noted that after the voice to be recognized is determined, an effective sound segment needs to be extracted from a plurality of voice segments included in the voice to be recognized, and the smart device needs to be controlled according to a command word in the effective sound segment. Determining an effective sound segment from sound segments contained in the speech to be recognized, specifically comprising the following steps:
step S1040, determining a plurality of sound segments in the speech to be recognized;
step S1042, dividing a plurality of sound segments into at least one independent sound segment according to the time length parameters of the sound segments contained in the speech to be recognized;
and step S1044, determining the effective sound segment according to the duration of at least one independent sound segment.
In an alternative embodiment, a user's speech naturally contains pauses between sentences, so the speech to be recognized can be divided into a plurality of sound segments according to those pauses. For example, suppose the user's speech lasts 10 seconds: the user speaks during seconds 1-2, pauses at second 3, speaks during seconds 4-7, pauses at second 8, and speaks during seconds 9-10. The speech to be recognized is then divided into three sound segments, corresponding to seconds 1-2, 4-7 and 9-10 respectively.
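A sketch of this pause-based division, assuming a simple short-time-energy silence detector (the frame length and energy threshold are assumptions):

    import numpy as np

    def split_on_pauses(y, sr, frame_ms=30, energy_thresh=1e-4):
        # Return (start, end) times, in seconds, of maximal runs of frames
        # whose short-time energy exceeds energy_thresh.
        frame_len = int(sr * frame_ms / 1000)
        n = len(y) // frame_len
        voiced = [float(np.mean(y[i*frame_len:(i+1)*frame_len] ** 2)) > energy_thresh
                  for i in range(n)]
        segments, start = [], None
        for i, v in enumerate(voiced + [False]):   # sentinel closes the last run
            t = i * frame_ms / 1000.0
            if v and start is None:
                start = t
            elif not v and start is not None:
                segments.append((start, t))
                start = None
        # For the 10-second example above this yields roughly
        # [(0, 2), (3, 7), (8, 10)] in seconds.
        return segments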
In an optional embodiment, after determining the multiple segments in the speech to be recognized, the multiple segments are divided into at least one independent segment according to the duration parameters of the multiple segments included in the speech to be recognized, where the duration parameters include at least one of: the start time of the segment and the end time of the segment. The specific method for dividing the independent sound segments is as follows:
step S1042a, determining an end time of a first sound segment and a start time of a second sound segment, where the first sound segment is the sound segment immediately preceding the second sound segment, so the end time of the first sound segment is earlier than the start time of the second sound segment;
step S1042b, calculating a first time difference between the ending time of the first sound segment and the starting time of the second sound segment;
step S1042c, determining that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference value is smaller than a first preset threshold;
in step S1042d, it is determined that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to a first preset threshold.
Reference is now made to fig. 3 as an example, where fig. 3 shows a schematic diagram of an alternative division into independent sound segments. Fig. 3 contains 6 sound segments, namely segments 1 through 6, where the start and end times of segment 1 are t1 and t2, those of segment 2 are t3 and t4, those of segment 3 are t5 and t6, those of segment 4 are t7 and t8, those of segment 5 are t9 and t10, and those of segment 6 are t11 and t12. Assume the first preset threshold is δ. Since t3-t2 < δ and t5-t4 < δ, segments 1, 2 and 3 are grouped into independent segment 1; since t7-t6 > δ and t9-t8 > δ, segment 4 alone forms independent segment 2; and since t9-t8 > δ and t11-t10 < δ, segments 5 and 6 are grouped into independent segment 3.
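The merge rule of steps S1042a to S1042d can be written down directly; a sketch in which the first preset threshold δ is passed as delta (the times in the usage comment are illustrative, not the actual values of fig. 3):

    def group_independent_segments(segments, delta):
        # segments: time-sorted (start, end) pairs. A segment joins the
        # current independent segment when the gap to its predecessor is
        # < delta (steps S1042a-S1042d); otherwise it opens a new one.
        groups = []
        for seg in sorted(segments):
            if groups and seg[0] - groups[-1][-1][1] < delta:
                groups[-1].append(seg)       # same independent segment
            else:
                groups.append([seg])         # new independent segment
        return groups

    # Illustrative times mimicking fig. 3, with delta = 0.5:
    # group_independent_segments([(0, 1), (1.2, 2), (2.3, 3), (5, 6), (8, 9), (9.2, 10)], 0.5)
    # -> [[(0, 1), (1.2, 2), (2.3, 3)], [(5, 6)], [(8, 9), (9.2, 10)]]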
After the independent sound segments are determined, effective sound segments are determined according to the divided independent sound segments, and the step of determining the effective sound segments according to the duration of at least one independent sound segment specifically comprises the following steps:
step S1044a, determining a time duration of each of the at least one independent vocal segment;
step S1044b, determining the independent vocal range as the valid vocal range when the duration of the independent vocal range is less than the second preset threshold.
Reference is now made to fig. 4 as an example, where fig. 4 shows an alternative schematic for determining the effective sound segment. Fig. 4 contains 3 independent sound segments, namely independent segments 1, 2 and 3. The start and end times of independent segment 1 are T1 and T2, so its duration is λ1 = T2-T1; those of independent segment 2 are T3 and T4, so its duration is λ2 = T4-T3; and those of independent segment 3 are T5 and T6, so its duration is λ3 = T6-T5. Assume the second preset threshold is λ. Since λ1 > λ, λ3 > λ and λ2 < λ, independent segment 2 is taken as the effective sound segment, and independent segments 1 and 3 are non-effective segments.
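Step S1044 then reduces to a duration filter over the grouped segments; a sketch in which the second preset threshold λ is passed as lam (continuing the function above):

    def select_valid_segments(groups, lam):
        # An independent segment is valid when its total duration (end of
        # its last segment minus start of its first) is below lam (steps
        # S1044a-S1044b): short utterances are treated as commands, long
        # ones as ordinary conversation.
        return [g for g in groups if g[-1][1] - g[0][0] < lam]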
It should be noted that after the valid sound segment in the speech to be recognized is determined, the command word included in the valid sound segment is extracted, and the intelligent device is controlled to complete the corresponding operation according to the command word. The method for acquiring the recognition result of the speech to be recognized according to the command words extracted from the effective sound segment specifically comprises the following steps:
step S1060, obtaining command words in the effective sound segment;
step S1062, determining a control instruction corresponding to the command word in a preset voice library;
in step S1064, controlling the device to complete the operation corresponding to the control instruction.
Specifically, after obtaining the effective segments, the processor of the smart device analyzes them with an extraction model to determine the command words they contain, where the extraction model is obtained through machine learning training on multiple groups of effective segments, each group including: an effective segment and a tag identifying the command word in that segment. After a command word is extracted from an effective segment, it is matched against the keywords in a preset voice library. If a keyword matching the command word exists in the preset voice library, a control instruction is determined from that keyword and the intelligent device is controlled to complete the corresponding operation; if no matching keyword is found in the preset voice library, the intelligent device performs no operation.
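Finally, a sketch of the lookup against the preset voice library, assuming a plain keyword-to-instruction mapping (the entries and instruction names are illustrative, not disclosed values):

    PRESET_VOICE_LIBRARY = {
        "turn on the air conditioner": "AC_POWER_ON",     # illustrative entries
        "turn off the air conditioner": "AC_POWER_OFF",
    }

    def dispatch_command(command_word, send_instruction):
        # Match the extracted command word against the preset voice library;
        # perform no operation when no keyword matches (as described above).
        instruction = PRESET_VOICE_LIBRARY.get(command_word)
        if instruction is not None:
            send_instruction(instruction)    # device completes the operation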
Example 2
According to an embodiment of the present invention, there is further provided an embodiment of a device for recognizing speech, where fig. 5 is a schematic structural diagram of the device for recognizing speech according to the embodiment of the present invention, and as shown in fig. 5, the device includes: a first determining module 501, a second determining module 503, and an obtaining module 505.
The first determining module 501 is configured to determine a speech to be recognized; a second determining module 503, configured to determine an effective sound segment from a plurality of sound segments included in the speech to be recognized; and an obtaining module 505, configured to obtain a recognition result of the speech to be recognized according to the command word extracted from the effective sound segment.
It should be noted that the first determining module 501, the second determining module 503, and the obtaining module 505 correspond to steps S102 to S106 in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the first determining module comprises: the device comprises a first acquisition module, an extraction module, a first processing module and a third determination module. The first acquisition module is used for acquiring mixed sound, wherein the mixed sound at least comprises voice and non-voice; the extraction module is used for extracting at least one voice from the mixed sound by adopting a voiceprint recognition algorithm; the first processing module is used for classifying at least one voice based on a voiceprint recognition algorithm to obtain a classification result; and the third determining module is used for determining the voice to be recognized from the classification result.
It should be noted that the first obtaining module, the extracting module, the first processing module, and the third determining module correspond to steps S1020 to S1026 in embodiment 1, and the four modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the extraction module comprises: a fourth determining module and a fifth determining module. The fourth determining module is configured to analyze the mixed sound using a preset model and determine at least one voice in the mixed sound, where the preset model is obtained through machine learning training using multiple sets of sound data, each set of which includes: the mixed sound and a tag identifying a voice in the mixed sound. The fifth determining module is configured to extract the determined at least one voice from the mixed sound.
It should be noted that the fourth determining module and the fifth determining module correspond to steps S1022a to S1022b in embodiment 1, and the two modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the first processing module comprises: a second acquisition module, a sixth determination module and a second processing module. The second acquisition module is used for acquiring at least one voice; the sixth determining module is configured to analyze the at least one voice by using a classification model and determine a voice type of each voice in the at least one voice, where the classification model is obtained by using multiple sets of voice data through machine learning training, and each set of voice data in the multiple sets includes: a tag of at least one voice and a voice type in the at least one voice; and the second processing module is used for classifying the at least one voice according to the voice type of each voice in the at least one voice to obtain a classification result.
It should be noted that the second acquiring module, the sixth determining module and the second processing module correspond to steps S1024a to S1024c in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the second determining module comprises: the device comprises a seventh determining module, a dividing module and an eighth determining module. The seventh determining module is used for determining a plurality of sound segments in the speech to be recognized; the dividing module is used for dividing the plurality of sound segments into at least one independent sound segment according to the time length parameters of the plurality of sound segments contained in the voice to be recognized; and the eighth determining module is used for determining the effective sound segment according to the duration of at least one independent sound segment.
It should be noted that the seventh determining module, the dividing module, and the eighth determining module correspond to steps S1040 to S1044 in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the duration parameter includes at least one of: the starting time of the sound segment and the ending time of the sound segment, wherein the dividing module comprises: the device comprises a ninth determination module, a calculation module, a tenth determination module and an eleventh determination module. The ninth determining module is configured to determine an end time of the first sound segment and a start time of the second sound segment, where the first sound segment is a previous sound segment adjacent to the second sound segment, and the end time of the first sound segment is smaller than the start time of the second sound segment; the calculating module is used for calculating a first time difference value between the ending time of the first sound segment and the starting time of the second sound segment; a tenth determining module, configured to determine that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is smaller than a first preset threshold; and the eleventh determining module is used for determining that the first sound segment and the second sound segment belong to different independent sound segments under the condition that the first time difference value is greater than or equal to a first preset threshold value.
It should be noted that the ninth determining module, the calculating module, the tenth determining module and the eleventh determining module correspond to steps S1042a to S1042d in embodiment 1, and the four modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the eighth determining module includes: a twelfth determination module and a thirteenth determination module. Wherein the twelfth determining module is configured to determine a time duration of each of the at least one independent segment; and the thirteenth determining module is used for determining the independent vocal segments as the effective vocal segments under the condition that the duration of the independent vocal segments is less than the second preset threshold.
It should be noted that the twelfth determining module and the thirteenth determining module correspond to steps S1044a to S1044b in embodiment 1, and the two modules are the same as the corresponding steps in the example and application scenarios, but are not limited to the disclosure in embodiment 1.
In an alternative embodiment, the obtaining module includes: the device comprises a third acquisition module, a fourteenth determination module and a control module. The third acquisition module is used for acquiring command words in the effective sound segment; a fourteenth determining module, configured to determine a control instruction corresponding to the command word in the preset voice library; and the control module is used for controlling the equipment to complete the operation corresponding to the control instruction.
It should be noted that the third obtaining module, the fourteenth determining module and the control module correspond to steps S1060 to S1064 in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and application scenarios, but are not limited to the disclosure in embodiment 1.
Example 3
According to another aspect of embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program executes the method of recognizing a speech in embodiment 1 described above.
Example 4
According to another aspect of the embodiments of the present invention, there is also provided a processor for executing a program, wherein the program executes the method for recognizing speech in embodiment 1.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A method of recognizing speech, comprising:
determining a voice to be recognized;
determining an effective sound segment from a plurality of sound segments contained in the voice to be recognized;
obtaining the recognition result of the speech to be recognized according to the command words extracted from the effective sound segment,
determining the speech to be recognized includes:
acquiring mixed sound, wherein the mixed sound at least comprises voice and non-voice;
extracting at least one voice from the mixed sound by adopting a voiceprint recognition algorithm;
classifying the at least one voice based on the voiceprint recognition algorithm to obtain a classification result;
determining the speech to be recognized from the classification result,
determining effective sound segments from the sound segments contained in the speech to be recognized, including:
determining a plurality of sound segments in the voice to be recognized;
dividing a plurality of sound segments into at least one independent sound segment according to time length parameters of the sound segments contained in the voice to be recognized;
and determining the effective sound segment according to the duration of the at least one independent sound segment.
2. The method of claim 1, wherein extracting at least one speech from the mixed sound using a voiceprint recognition algorithm comprises:
analyzing the mixed sound by using a preset model, and determining at least one voice in the mixed sound, wherein the preset model is obtained by using a plurality of groups of sound data through machine learning training, and each group of data in the plurality of groups of sound data comprises: the mixed sound and a tag identifying a voice in the mixed sound;
extracting the determined at least one voice from the mixed sound.
3. The method of claim 1, wherein classifying the at least one voice based on the voiceprint recognition algorithm comprises:
acquiring the at least one voice;
analyzing the at least one voice by using a classification model, and determining a voice type of each voice in the at least one voice, wherein the classification model is obtained by using a plurality of groups of voice data through machine learning training, and each group of voice data in the plurality of groups of voice data comprises: a tag of the at least one voice and a voice type in the at least one voice;
and classifying the at least one voice according to the voice type of each voice in the at least one voice to obtain the classification result.
4. The method of claim 1, wherein the duration parameter comprises at least one of: a start time of a sound segment and an end time of a sound segment, and wherein dividing the plurality of sound segments into at least one independent sound segment according to the duration parameters of the plurality of sound segments contained in the speech to be recognized comprises:
determining an end time of a first sound segment and a start time of a second sound segment, wherein the first sound segment is the sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment;
calculating a first time difference value between the ending time of the first sound segment and the starting time of the second sound segment;
determining that the first sound segment and the second sound segment belong to the same independent sound segment under the condition that the first time difference value is smaller than a first preset threshold value;
and under the condition that the first time difference value is greater than or equal to the first preset threshold value, determining that the first sound segment and the second sound segment belong to different independent sound segments.
5. The method of claim 1, wherein determining the valid segments based on the duration of the at least one independent segment comprises:
determining a time duration for each of the at least one independent segment;
and under the condition that the duration of the independent sound segment is less than a second preset threshold, determining the independent sound segment as the effective sound segment.
6. The method according to claim 1, wherein obtaining the recognition result of the speech to be recognized according to the command word extracted from the valid sound segment comprises:
obtaining command words in the effective sound segment;
determining a control instruction corresponding to the command word in a preset voice library;
and the control equipment completes the operation corresponding to the control instruction.
7. An apparatus for recognizing speech, comprising:
the first determining module is used for determining the voice to be recognized;
the second determining module is used for determining effective sound segments from a plurality of sound segments contained in the voice to be recognized;
an obtaining module, configured to obtain a recognition result of the speech to be recognized according to the command word extracted from the effective vocal segment,
the first determining module includes: the device comprises a first acquisition module, an extraction module, a first processing module and a third determination module, wherein the first acquisition module is used for acquiring mixed sound, and the mixed sound at least comprises voice and non-voice; the extraction module is used for extracting at least one voice from the mixed sound by adopting a voiceprint recognition algorithm; the first processing module is used for classifying at least one voice based on a voiceprint recognition algorithm to obtain a classification result; a third determining module for determining the speech to be recognized from the classification result,
the second determining module includes: the voice recognition device comprises a seventh determining module, a dividing module and an eighth determining module, wherein the seventh determining module is used for determining a plurality of sound segments in the voice to be recognized; the dividing module is used for dividing the plurality of sound segments into at least one independent sound segment according to the time length parameters of the plurality of sound segments contained in the voice to be recognized; and the eighth determining module is used for determining the effective sound segment according to the duration of at least one independent sound segment.
8. A storage medium characterized by comprising a stored program, wherein the program executes the method of recognizing a voice according to any one of claims 1 to 6.
9. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of recognizing speech according to any one of claims 1 to 6.
CN201711133270.XA 2017-11-14 2017-11-14 Method and device for recognizing voice Active CN108172219B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711133270.XA CN108172219B (en) 2017-11-14 2017-11-14 Method and device for recognizing voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711133270.XA CN108172219B (en) 2017-11-14 2017-11-14 Method and device for recognizing voice

Publications (2)

Publication Number Publication Date
CN108172219A CN108172219A (en) 2018-06-15
CN108172219B true CN108172219B (en) 2021-02-26

Family

ID=62527360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711133270.XA Active CN108172219B (en) 2017-11-14 2017-11-14 Method and device for recognizing voice

Country Status (1)

Country Link
CN (1) CN108172219B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145124B (en) * 2018-08-16 2022-02-25 格力电器(武汉)有限公司 Information storage method and device, storage medium and electronic device
CN109308897B (en) * 2018-08-27 2022-04-26 广东美的制冷设备有限公司 Voice control method, module, household appliance, system and computer storage medium
CN111951786A (en) * 2019-05-16 2020-11-17 武汉Tcl集团工业研究院有限公司 Training method, device, terminal equipment and medium for voice recognition model
CN110310657B (en) * 2019-07-10 2022-02-08 北京猎户星空科技有限公司 Audio data processing method and device
CN110619873A (en) 2019-08-16 2019-12-27 北京小米移动软件有限公司 Audio processing method, device and storage medium
CN112420063B (en) * 2019-08-21 2024-10-18 华为技术有限公司 Voice enhancement method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1017041B1 (en) * 1995-11-15 2005-05-11 Hitachi, Ltd. Voice recognizing and translating system
US8412455B2 (en) * 2010-08-13 2013-04-02 Ambit Microsystems (Shanghai) Ltd. Voice-controlled navigation device and method
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN104464723A (en) * 2014-12-16 2015-03-25 科大讯飞股份有限公司 Voice interaction method and system
CN105280183A (en) * 2015-09-10 2016-01-27 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN106601241A (en) * 2016-12-26 2017-04-26 河南思维信息技术有限公司 Automatic time correcting method for recording file
CN106601233A (en) * 2016-12-22 2017-04-26 北京元心科技有限公司 Voice command recognition method and device and electronic equipment
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102655002B (en) * 2011-03-01 2013-11-27 株式会社理光 Audio processing method and audio processing equipment
US9324319B2 (en) * 2013-05-21 2016-04-26 Speech Morphing Systems, Inc. Method and apparatus for exemplary segment classification
CN104142915B (en) * 2013-05-24 2016-02-24 腾讯科技(深圳)有限公司 A kind of method and system adding punctuate
CN103489454B (en) * 2013-09-22 2016-01-20 浙江大学 Based on the sound end detecting method of wave configuration feature cluster
CN104536978A (en) * 2014-12-05 2015-04-22 奇瑞汽车股份有限公司 Voice data identifying method and device
CN106782506A (en) * 2016-11-23 2017-05-31 语联网(武汉)信息技术有限公司 A kind of method that recorded audio is divided into section


Also Published As

Publication number Publication date
CN108172219A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108172219B (en) Method and device for recognizing voice
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN111063341B (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN109584876B (en) Voice data processing method and device and voice air conditioner
US7769588B2 (en) Spoken man-machine interface with speaker identification
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
US7373301B2 (en) Method for detecting emotions from speech using speaker identification
US20170140750A1 (en) Method and device for speech recognition
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
US20130054236A1 (en) Method for the detection of speech segments
CN111627423B (en) VAD tail point detection method, device, server and computer readable medium
KR20190109661A (en) Method for generating data for learning emotion in video, method for determining emotion in video, and apparatus using the methods
Weninger et al. Recognition of nonprototypical emotions in reverberated and noisy speech by nonnegative matrix factorization
CN113808612B (en) Voice processing method, device and storage medium
CN103943111A (en) Method and device for identity recognition
CN110364178B (en) Voice processing method and device, storage medium and electronic equipment
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN111816216A (en) Voice activity detection method and device
CN111081223A (en) Voice recognition method, device, equipment and storage medium
CN104462912A (en) Biometric password security
CN111128174A (en) Voice information processing method, device, equipment and medium
Meyer et al. Improving convolutional recurrent neural networks for speech emotion recognition
CN110827853A (en) Voice feature information extraction method, terminal and readable storage medium
CN109065026B (en) Recording control method and device
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant