CN106920558B - Keyword recognition method and device
- Publication number
- CN106920558B CN106920558B CN201510993729.8A CN201510993729A CN106920558B CN 106920558 B CN106920558 B CN 106920558B CN 201510993729 A CN201510993729 A CN 201510993729A CN 106920558 B CN106920558 B CN 106920558B
- Authority
- CN
- China
- Prior art keywords
- voice data
- median
- recognized
- sound
- distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G — PHYSICS
  - G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
      - G10L17/00 — Speaker identification or verification techniques
        - G10L17/04 — Training, enrolment or model building
        - G10L17/16 — Hidden Markov models [HMM]
Abstract
A keyword recognition method and device are provided. The method comprises the following steps: dividing acquired voice data to be recognized into a plurality of overlapping voice frames; performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy; converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters; calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates; and, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result. The scheme improves the accuracy of keyword recognition and saves computing resources.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a keyword recognition method and device.
Background
Speech recognition is a technique by which a machine converts human speech into corresponding text or instructions through recognition and understanding. As an important branch of the speech recognition field, keyword recognition (isolated word recognition, IWR) is widely used in communications, consumer electronics, self-service, office automation and other fields.
In the prior art, keyword recognition is generally performed with Hidden Markov Models (HMMs) and their corresponding parameters, or with a keyword spotting system (KWS).
However, the keyword recognition methods in the prior art need to establish a corresponding model and perform corresponding translation operations to train the model parameters, and therefore suffer from a large computation load and a low recognition accuracy rate.
Disclosure of Invention
The embodiments of the invention address the problem of how to improve the accuracy of keyword recognition while saving computing resources.
In order to solve the above problem, an embodiment of the present invention provides a keyword recognition method, where the keyword recognition method includes:
dividing the acquired voice data to be recognized into a plurality of overlapping voice frames;
performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters;
calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates;
and, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result.
Optionally, the operation of converting the spectral energy of each voice frame to spectral energy at the Mel frequency and calculating the corresponding MFCC parameters is performed only when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized is one of a low noise level, a medium noise level and a high noise level, wherein:
when p ≥ p1, the voice data to be recognized is determined to have a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, the voice data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the voice data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
The embodiment of the invention also provides a keyword recognition device, which comprises:
a framing processing unit, adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
a frequency domain conversion unit, adapted to perform a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
a first calculation unit, adapted to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
a second calculation unit, adapted to calculate, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold;
and a keyword recognition unit, adapted to take the keyword in the current reference template as the recognition result when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than the preset threshold.
Optionally, the keyword recognition device further includes a triggering unit, adapted to trigger the first calculation unit to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized is one of a low noise level, a medium noise level and a high noise level, wherein:
when p ≥ p1, the voice data to be recognized is determined to have a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, the voice data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the voice data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
Compared with the prior art, the technical scheme of the invention has the following advantages:
In the above scheme, whether the voice data includes a keyword is determined by comparing, against a preset threshold, the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and a reference template, all computed from the corresponding MFCC parameters. No corresponding mathematical recognition model needs to be established and no corresponding translation of the keywords needs to be performed, so the computing resources spent on keyword recognition are reduced and the accuracy of keyword recognition is improved.
Further, keyword recognition is performed on the voice data to be recognized only when its spectral energy is greater than a preset energy threshold; otherwise, keyword recognition is skipped. This further saves computing resources and speeds up keyword recognition.
Further, when a reference template is recorded, it is checked for transient noise, static noise and rich voice content of the specific person, so that the voice of the specific person and the environment it belongs to are captured accurately in the template, which further improves the accuracy of keyword recognition.
Drawings
FIG. 1 is a flow chart of a keyword recognition method in an embodiment of the present invention;
FIG. 2 is a flow chart of another keyword recognition method in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a keyword recognition apparatus in an embodiment of the present invention.
Detailed Description
To solve the above problems in the prior art, the technical scheme adopted by the embodiments of the invention determines whether the voice data includes a keyword by comparing, against a preset threshold, the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and a reference template. This saves the computing resources needed for keyword recognition and improves its accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 shows a flowchart of a keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in fig. 1 may include the following steps:
step S101: and dividing the acquired voice data to be identified into a plurality of overlapped voice frames.
In a specific implementation, the size of the overlap between voice frames can be set according to actual needs. For example, when each voice frame is 32 ms long, the overlap between adjacent voice frames may be 16 ms.
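As an illustration only (not part of the patent text), a minimal framing sketch in Python follows, assuming a 16 kHz sample rate and the 32 ms frame / 16 ms overlap figures from the example above; the function name and defaults are illustrative.

```python
import numpy as np

def split_into_frames(samples, sample_rate=16000, frame_ms=32, overlap_ms=16):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(sample_rate * frame_ms / 1000)            # 512 samples
    hop = frame_len - int(sample_rate * overlap_ms / 1000)    # 256 samples
    if len(samples) < frame_len:                              # pad short input
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```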
Step S102: performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy.
In a specific implementation, the sound signals of the divided voice frames are time-domain signals, which can be converted into frequency-domain signals by a Fast Fourier Transform (FFT).
Step S103: converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters.
In a specific implementation, the spectral energy (power spectrum) of the sound signal obtained through the FFT can be converted into spectral energy at the Mel frequency according to a preset correspondence, and the Mel Frequency Cepstral Coefficient (MFCC) parameters of each voice frame can be calculated from the Mel-frequency spectral energy.
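For illustration, the sketch below shows one common way to realize this conversion, assuming triangular Mel filters and a DCT over the log filterbank energies (the usual MFCC recipe); the filter count and cepstrum count are illustrative defaults, not values taken from the patent.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power_spec, sample_rate=16000,
                             n_filters=26, n_ceps=13):
    """MFCCs for one frame's FFT power spectrum (length nfft//2 + 1)."""
    nfft = 2 * (len(power_spec) - 1)
    # Filter edge frequencies, equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power_spec)))
    for i in range(n_filters):                     # triangular Mel filters
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    energies = np.maximum(fbank @ power_spec, np.finfo(float).eps)
    return dct(np.log(energies), norm='ortho')[:n_ceps]
```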
Step S104: calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates.
In a specific implementation, the preset reference templates each contain the voice content of a corresponding keyword. The number of preset reference templates can be set according to actual needs; the invention is not limited in this respect.
Step S105: when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result.
In a specific implementation, the preset reference templates are traversed. For each template, the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data to be recognized and the template are calculated, and their mean is compared with the preset threshold. When the mean is smaller than the threshold, the keyword in the current reference template can be taken as the recognition result; otherwise, it is determined that the current voice data does not include the speech of that template's keyword.
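Put together, the matching loop might look like the sketch below; `dtw_median`, `euclid_median` and `xcorr_median` are assumed helper functions standing in for the three median-distance computations described above, not functions defined by the patent.

```python
def recognize_keyword(query_mfcc, templates, threshold):
    """templates: list of (keyword, template_mfcc) pairs. Returns the
    recognized keyword, or None if no template beats the threshold."""
    for keyword, template_mfcc in templates:
        medians = (dtw_median(query_mfcc, template_mfcc),
                   euclid_median(query_mfcc, template_mfcc),
                   xcorr_median(query_mfcc, template_mfcc))
        # Compare the mean of the three distance medians with the threshold.
        if sum(medians) / 3.0 < threshold:
            return keyword
    return None   # the voice data includes no keyword from any template
```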
The keyword recognition method in the embodiment of the present invention will be described in further detail with reference to fig. 2.
Fig. 2 is a flowchart illustrating another keyword recognition method according to an embodiment of the present invention. The keyword recognition method as shown in fig. 2 may include the following steps:
step S201: and overlapping and framing the acquired sound data to obtain a plurality of corresponding sound frames.
In a specific implementation, analog-to-digital conversion may first be performed on the collected sound signal to obtain the corresponding sound data. The sound data can then be framed with overlap to obtain a plurality of voice frames. Framing the collected sound data is, in essence, short-time analysis: the sound signal is divided into short segments of fixed length, each of which is a relatively stationary segment of sound. Adjacent voice frames partially overlap, and the overlap can be chosen according to the actual situation.
Step S202: applying windowing to the obtained voice frames.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window or a rectangular window, can be selected; the frame length is chosen in the range of 10-40 ms, with 20 ms a typical value. Framing destroys the naturalness of the speech signal; windowing and overlapping the voice frames mitigates this problem.
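A minimal windowing sketch, assuming the frame matrix produced by the framing step (one frame per row); the set of window choices simply mirrors the options listed above.

```python
import numpy as np

def apply_window(frames, kind="hamming"):
    """Multiply each frame (row) by the chosen window function."""
    windows = {"hamming": np.hamming, "hanning": np.hanning, "rect": np.ones}
    return frames * windows[kind](frames.shape[1])
```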
Step S203: performing a fast Fourier transform on the windowed voice frames to obtain the spectral energy of each voice frame.
In a specific implementation, sound data varies with time and is, in theory, a non-stationary process that cannot be converted to the frequency domain directly. However, because the sound data has been framed (short-time analysis), the sound data within each frame can be considered relatively stationary, so frequency-domain conversion can be applied to it.
In a specific implementation, a Short-Time Fourier Transform (STFT) may be used to convert the sound data of each frame to the frequency domain, yielding the spectrum of each voice frame. The resulting spectrum captures the relation between the frequency and the energy of the corresponding sound signal.
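A sketch of this per-frame frequency-domain conversion, assuming the FFT size equals the frame length; the scaling is one common normalization choice, not one prescribed by the patent.

```python
import numpy as np

def power_spectrum(windowed_frames):
    """Real FFT of each frame (row); returns per-frame power spectra."""
    spectrum = np.fft.rfft(windowed_frames, axis=1)
    return (np.abs(spectrum) ** 2) / windowed_frames.shape[1]
```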
Step S204: converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters.
In an embodiment of the present invention, after the spectral energies of the voice frames of the current voice data to be recognized are obtained, it may first be determined whether the spectral energy of the current voice data is greater than a preset energy threshold. Step S204 is performed only when it is; otherwise, the current voice data is determined not to include keyword speech, and subsequent processing of it can be stopped, further saving computing resources.
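A minimal sketch of this energy gate; the threshold value is an assumed tuning constant, since the patent does not specify one.

```python
import numpy as np

def passes_energy_gate(power_spectra, energy_threshold=1e-3):
    """True when total spectral energy warrants MFCC extraction."""
    return float(np.sum(power_spectra)) > energy_threshold
```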
In a specific implementation, the spectral energy obtained through the FFT may be converted into spectral energy at the Mel frequency according to a preset correspondence, and the MFCC parameters of each voice frame are calculated as that frame's feature vector.
Step S205: calculating, from the MFCC parameters of each voice frame, the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data and the current reference template among the preset reference templates.
In an embodiment of the present invention, when calculating the DTW distance between the current voice data to be recognized and a reference template, both are divided into I frames. The inventors of the present application have observed empirically that, while recording a reference template, a speaker tends to pronounce with more emphasis and at a slower rate than usual. The reference template is therefore divided into I frames, and the hop size for calculating the DTW distance is set to 0.1·I frames; after the I DTW distances between the I frames of the current voice data to be recognized and the I frames of the reference template are calculated, the median of these I DTW distances is taken as the DTW distance median between the current voice data and the corresponding reference template. The Euclidean distance (ED) median and the cross-correlation distance (CC) median between the current voice data and the corresponding reference template are obtained in the same way.
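The patent does not spell out the exact alignment, so the sketch below is one plausible reading: the template has I frames, the query is scanned with a hop of 0.1·I frames, and the median of the per-window distances is returned. `dist_fn` is an assumed callable (a DTW, Euclidean or cross-correlation distance between two MFCC sequences).

```python
import numpy as np

def sliding_median_distance(query_mfcc, template_mfcc, dist_fn):
    """Median of dist_fn over query windows aligned against the template."""
    I = len(template_mfcc)                 # template length in frames
    hop = max(1, int(round(0.1 * I)))      # hop size of 0.1*I frames
    dists = [dist_fn(query_mfcc[s:s + I], template_mfcc)
             for s in range(0, max(1, len(query_mfcc) - I + 1), hop)]
    return float(np.median(dists))
```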
Step S206: judging whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than the preset threshold; if so, step S207 may be executed; otherwise, execution resumes from step S205 with the next of the preset reference templates.
In a specific implementation, after the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data to be recognized and a reference template are calculated, the mean of the three is compared with the preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current voice data to be recognized; that is, different noise levels correspond to different preset thresholds. When the absolute amplitude p of the current voice data to be recognized satisfies p ≥ p1, the voice data is determined to have a low noise level, where p1 is a preset first threshold; when p1 > p ≥ p2, the voice data is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; when p < p2, the voice data is determined to have a high noise level. In one embodiment of the present invention, p1 is 0.8 and p2 is 0.45.
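This classification is direct to transcribe; the sketch below uses the thresholds given above (p1 = 0.8, p2 = 0.45) on the absolute-amplitude statistic p.

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify noise level from the absolute-amplitude statistic p."""
    if p >= p1:
        return "low"
    if p >= p2:          # i.e. p1 > p >= p2
        return "medium"
    return "high"        # p < p2
```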
Step S207: taking the keyword in the current reference template as the recognition result and outputting it.
In a specific implementation, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between one of the preset reference templates and the current voice data to be recognized is determined to be smaller than the preset threshold, it can be determined that the current voice data includes the speech of the keyword in that reference template. The keyword in that template can therefore be output as the keyword recognition result for the current voice data.
In a specific implementation, when the above keyword recognition method is applied to an alarm system, the alarm system may perform an alarm operation when a corresponding keyword is recognized.
It should be noted that, in emergency and other keyword applications, personalized keywords may be recorded by an ordinary (e.g., untrained) user. To guarantee good recognition performance, the reference template therefore becomes very important, and a simple check procedure is used to ensure its recording quality.
To this end, the inventors of the present application propose three checks: detecting transient noise sources (such as a door slam), detecting static noise sources (such as fan or traffic noise), and verifying that the pronunciation content of the keyword is rich. All three checks must pass; otherwise, the keyword needs to be recorded again. For transient noise detection, the difference in absolute amplitude of the sound signal energy may be computed over consecutive 25 ms frames with a 5 ms hop, with the absolute amplitude averaged over every 5 voice frames. For static noise detection, the keyword is recorded within a preset 5 s window in a quiet environment; within that window, the signal energy at the beginning and end of the reference template, which contain no keyword, differs markedly from that of the sound data containing the keyword. For the rich-content check, keywords consisting of a single vowel without any consonant, such as 'o', are rejected; this check can be based on a modified zero-crossing rate associated with the pronunciation content of the keyword.
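As a sketch of the rich-content check only, a modified zero-crossing-rate test might look as follows; the threshold values are illustrative assumptions, since the patent gives no concrete numbers.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def has_rich_content(frames, zcr_threshold=0.1, min_fraction=0.2):
    """Reject vowel-only keywords such as 'o': require a minimum share
    of high-ZCR (consonant-like) frames across the recording."""
    zcrs = np.array([zero_crossing_rate(f) for f in frames])
    return bool(np.mean(zcrs > zcr_threshold) >= min_fraction)
```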
The following describes a device corresponding to the keyword recognition method in the embodiment of the present invention in further detail.
Referring to FIG. 3, the keyword recognition apparatus 300 in an embodiment of the present invention may include a framing processing unit 301, a frequency domain conversion unit 302, a first calculating unit 303, a second calculating unit 304, a determining unit 305 and a keyword recognition unit 306, wherein:
the framing processing unit 301 is adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
the frequency domain conversion unit 302 is adapted to traverse the divided voice frames and perform a fast Fourier transform on the sound signal of the traversed current frame to obtain the corresponding spectral energy;
the first calculating unit 303 is adapted to convert the obtained spectral energy into spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
in a specific implementation, the keyword recognition apparatus 300 may further include a triggering unit (not shown in the figure), which is adapted to trigger the first calculating unit 303 to perform the operation of converting the obtained spectrum energy into the spectrum energy at the Mel frequency and calculating the corresponding MFCC parameter when the spectrum energy of the traversed current sound frame is greater than a preset energy threshold;
the second calculating unit 304 is adapted to calculate, from the MFCC parameters corresponding to the current voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the current voice frame and each of a plurality of preset reference templates, respectively;
the determining unit 305 is adapted to determine whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice frame and the reference template is smaller than a preset threshold;
in a specific implementation, the preset threshold is associated with the noise level of the current sound frame, wherein when p ≧ p1, the current sound frame is determined to have a low noise level, p represents the corresponding absolute amplitude of the current sound frame, and p1 is a preset first threshold; when p2 is more than or equal to p & gtp 1, determining that the current sound frame has a medium noise level, p2 is a preset second threshold value, and p1 & gtp 2; when p < p2, it is determined that the current sound frame has a high noise level. In one embodiment of the present invention, p1 equals 0.8, and p2 equals 0.45.
In a specific implementation, the reference template includes information of transient noise, static noise and rich speech content of a particular person.
The keyword recognition unit 306 is adapted to, when it is determined that the mean of the DTW distance median, the euclidean distance median, and the cross-correlation distance median between the current sound frame and the reference template is smaller than a preset threshold, take the keyword in the current reference template as a recognition result and output the recognition result.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may include a ROM, a RAM, a magnetic disk or an optical disk.
The method and system of the embodiments of the present invention have been described in detail, but the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (12)
1. A keyword recognition method, comprising:
dividing the acquired voice data to be recognized into a plurality of overlapping voice frames;
performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters;
calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates, the reference templates each containing the voice content of a corresponding keyword; wherein, when calculating the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the preset reference templates, the voice data to be recognized and each reference template are divided into I frames, and the hop size for calculating the DTW distance, the Euclidean distance and the cross-correlation distance is 0.1·I frames; after the I DTW distances, Euclidean distances and cross-correlation distances between the I frames of the current voice data to be recognized and the I frames of a reference template are calculated, the median of the I DTW distances is taken as the DTW distance median between the voice data to be recognized and the corresponding reference template, the median of the I Euclidean distances as the Euclidean distance median, and the median of the I cross-correlation distances as the cross-correlation distance median;
and when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold value, taking the keywords in the current reference template as the keyword recognition result of the voice data to be recognized.
2. The keyword recognition method according to claim 1, wherein the operation of converting the spectral energy corresponding to each sound frame into the spectral energy at mel frequency and calculating the corresponding MFCC parameter is performed when the spectral energy of the sound data to be recognized is greater than a preset energy threshold.
3. The method according to claim 1, wherein the predetermined threshold is associated with a noise level of the voice data to be recognized.
4. The keyword recognition method according to claim 3, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be recognized, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, wherein p2 is a preset second threshold, and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
5. The keyword recognition method according to claim 4, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
6. The keyword recognition method according to claim 1, wherein the reference template includes information of transient noise, static noise and rich voice content of a specific person.
7. A keyword recognition apparatus, comprising:
a framing processing unit, adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
a frequency domain conversion unit, adapted to perform a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
a first calculation unit, adapted to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
a second calculation unit, adapted to calculate, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates, the reference templates each containing the voice content of a corresponding keyword; wherein, when calculating the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the preset reference templates, the voice data to be recognized and each reference template are divided into I frames, and the hop size for calculating the DTW distance, the Euclidean distance and the cross-correlation distance is 0.1·I frames; after the I DTW distances, Euclidean distances and cross-correlation distances between the I frames of the current voice data to be recognized and the I frames of a reference template are calculated, the median of the I DTW distances is taken as the DTW distance median between the voice data to be recognized and the corresponding reference template, the median of the I Euclidean distances as the Euclidean distance median, and the median of the I cross-correlation distances as the cross-correlation distance median;
a judging unit, adapted to judge whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold;
and a keyword recognition unit, adapted to take the keyword in the current reference template as the keyword recognition result of the voice data to be recognized when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold.
8. The keyword recognition apparatus according to claim 7, further comprising a triggering unit, wherein the triggering unit is adapted to trigger the first computing unit to perform the operation of converting the spectral energy corresponding to each sound frame into the spectral energy at the mel frequency and computing the corresponding MFCC parameters when the spectral energy of the sound data to be recognized is greater than a preset energy threshold.
9. The keyword recognition apparatus according to claim 7, wherein the preset threshold is associated with a noise level of the voice data to be recognized.
10. The keyword recognition apparatus according to claim 9, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be recognized, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, wherein p2 is a preset second threshold, and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
11. The keyword recognition apparatus of claim 10, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
12. The keyword recognition apparatus according to claim 7, wherein the reference template includes information of transient noise, static noise and rich voice content of a specific person.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106920558A CN106920558A (en) | 2017-07-04 |
CN106920558B true CN106920558B (en) | 2021-04-13 |
Family
ID=59454658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510993729.8A Active CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106920558B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065043B (en) * | 2018-08-21 | 2022-07-05 | 广州市保伦电子有限公司 | A kind of command word recognition method and computer storage medium |
CN112765335B (en) * | 2021-01-27 | 2024-03-08 | 上海三菱电梯有限公司 | Voice call system |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005114576A1 (en) * | 2004-05-21 | 2005-12-01 | Asahi Kasei Kabushiki Kaisha | Operation content judgment device |
CN101222703A (en) * | 2007-01-12 | 2008-07-16 | 杭州波导软件有限公司 | Identity verification method for mobile terminal based on voice identification |
CN101599269B (en) * | 2009-07-02 | 2011-07-20 | 中国农业大学 | Phonetic end point detection method and device therefor |
US8432368B2 (en) * | 2010-01-06 | 2013-04-30 | Qualcomm Incorporated | User interface methods and systems for providing force-sensitive input |
CN102509547B (en) * | 2011-12-29 | 2013-06-19 | 辽宁工业大学 | Voiceprint recognition method and system based on vector quantization |
CN103021409B (en) * | 2012-11-13 | 2016-02-24 | 安徽科大讯飞信息科技股份有限公司 | A kind of vice activation camera system |
CN103065627B (en) * | 2012-12-17 | 2015-07-29 | 中南大学 | Special purpose vehicle based on DTW and HMM evidence fusion is blown a whistle sound recognition methods |
CN103971678B (en) * | 2013-01-29 | 2015-08-12 | 腾讯科技(深圳)有限公司 | Keyword spotting method and apparatus |
CN103854645B (en) * | 2014-03-05 | 2016-08-24 | 东南大学 | A kind of based on speaker's punishment independent of speaker's speech-emotion recognition method |
CN104978507B (en) * | 2014-04-14 | 2019-02-01 | 中国石油化工集团公司 | An identity authentication method for intelligent logging evaluation expert system based on voiceprint recognition |
CN104103280B (en) * | 2014-07-15 | 2017-06-06 | 无锡中感微电子股份有限公司 | The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm |
CN104103272B (en) * | 2014-07-15 | 2017-10-10 | 无锡中感微电子股份有限公司 | Audio recognition method, device and bluetooth earphone |
CN104778951A (en) * | 2015-04-07 | 2015-07-15 | 华为技术有限公司 | Speech enhancement method and device |
Non-Patent Citations (5)
Title |
---|
"Voice Command Recognition system based on MFCC and DTW";Abhijeet Kumar;《International Journal or engineering Science and Technology》;20101231;全文 * |
"Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques";Lindasalwa;《Journal of Computing》;20100331;第2卷(第3期);全文 * |
"一种结合端点检测可检错的DTW乐谱跟随算法";吴康妍;《计算机应用与软件》;20150315;全文 * |
"加权DTW距离的自动步态识别";刘志镜;《中国图像图形学报》;20101231;全文 * |
"时间序列动态模糊聚类的研究";赵晓慧;《中国优秀硕士学位论文全文数据库 信息科技辑》;20141231;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN106920558A (en) | 2017-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||