CN106920558B - Keyword recognition method and device
- Publication number
- CN106920558B CN106920558B CN201510993729.8A CN201510993729A CN106920558B CN 106920558 B CN106920558 B CN 106920558B CN 201510993729 A CN201510993729 A CN 201510993729A CN 106920558 B CN106920558 B CN 106920558B
- Authority
- CN
- China
- Prior art keywords
- voice data
- median
- recognized
- sound
- distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G — PHYSICS
  - G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
      - G10L17/00 — Speaker identification or verification techniques
        - G10L17/04 — Training, enrolment or model building
        - G10L17/16 — Hidden Markov models [HMM]
Abstract
A keyword recognition method and device are provided. The method comprises the following steps: dividing acquired voice data to be recognized into a plurality of overlapping voice frames; performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy; converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters; calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates; and, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result. The scheme improves the accuracy of keyword recognition and saves computing resources.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a keyword recognition method and device.
Background
Speech recognition is a technique by which a machine converts human speech into corresponding text or instructions through recognition and understanding. As an important branch of the speech recognition field, keyword recognition (isolated word recognition, IWR) is widely used in communications, consumer electronics, self-service, office automation and other fields.
In the prior art, keyword recognition is generally performed with Hidden Markov Models (HMMs) and their corresponding parameters, or with a keyword spotting system (KWS).
However, the keyword recognition methods in the prior art need to establish a corresponding model and perform corresponding translation operations to train the model parameters, and therefore suffer from a large computation load and a low recognition accuracy rate.
Disclosure of Invention
The embodiments of the invention address the problem of how to improve the accuracy of keyword recognition while saving computing resources.
In order to solve the above problem, an embodiment of the present invention provides a keyword recognition method, where the keyword recognition method includes:
dividing the acquired voice data to be recognized into a plurality of overlapping voice frames;
performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters;
calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates;
and, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result.
Optionally, the operation of converting the spectral energy of each voice frame to spectral energy at the Mel frequency and calculating the corresponding MFCC parameters is performed only when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized is one of a low noise level, a medium noise level and a high noise level, wherein:
when p ≥ p1, the voice data to be recognized is determined to have a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, the voice data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the voice data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
The embodiment of the invention also provides a keyword recognition device, which comprises:
a framing processing unit, adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
a frequency domain conversion unit, adapted to perform a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
a first calculation unit, adapted to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
a second calculation unit, adapted to calculate, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold;
and a keyword recognition unit, adapted to take the keyword in the current reference template as the recognition result when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than the preset threshold.
Optionally, the keyword recognition device further includes a triggering unit, adapted to trigger the first calculation unit to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized is one of a low noise level, a medium noise level and a high noise level, wherein:
when p ≥ p1, the voice data to be recognized is determined to have a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, the voice data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the voice data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
Compared with the prior art, the technical scheme of the invention has the following advantages:
In the above scheme, whether the voice data includes a keyword is determined by comparing, against a preset threshold, the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and a reference template, all computed from the corresponding MFCC parameters. No corresponding mathematical recognition model needs to be established and no corresponding translation of the keywords needs to be performed, so the computing resources spent on keyword recognition are reduced and the accuracy of keyword recognition is improved.
Further, keyword recognition is performed on the voice data to be recognized only when its spectral energy is greater than a preset energy threshold; otherwise, keyword recognition is skipped. This further saves computing resources and speeds up keyword recognition.
Further, when a reference template is recorded, it is checked for transient noise, static noise and rich voice content of the specific person, so that the voice of the specific person and the environment it belongs to are captured accurately in the template, which further improves the accuracy of keyword recognition.
Drawings
FIG. 1 is a flow chart of a keyword recognition method in an embodiment of the present invention;
FIG. 2 is a flow chart of another keyword recognition method in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a keyword recognition apparatus in an embodiment of the present invention.
Detailed Description
To solve the above problems in the prior art, the technical scheme adopted by the embodiments of the invention determines whether the voice data includes a keyword by comparing, against a preset threshold, the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and a reference template. This saves the computing resources needed for keyword recognition and improves its accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 shows a flowchart of a keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in fig. 1 may include the following steps:
step S101: and dividing the acquired voice data to be identified into a plurality of overlapped voice frames.
In a specific implementation, the size of the overlap between voice frames can be set according to actual needs. For example, when each voice frame is 32 ms long, the overlap between adjacent voice frames may be 16 ms.
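As an illustration only (not part of the patent text), a minimal framing sketch in Python follows, assuming a 16 kHz sample rate and the 32 ms frame / 16 ms overlap figures from the example above; the function name and defaults are illustrative.

```python
import numpy as np

def split_into_frames(samples, sample_rate=16000, frame_ms=32, overlap_ms=16):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(sample_rate * frame_ms / 1000)            # 512 samples
    hop = frame_len - int(sample_rate * overlap_ms / 1000)    # 256 samples
    if len(samples) < frame_len:                              # pad short input
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```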
Step S102: performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy.
In a specific implementation, the sound signals of the divided voice frames are time-domain signals, which can be converted into frequency-domain signals by a Fast Fourier Transform (FFT).
Step S103: converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters.
In a specific implementation, the spectral energy (power spectrum) of the sound signal obtained through the FFT can be converted into spectral energy at the Mel frequency according to a preset correspondence, and the Mel Frequency Cepstral Coefficient (MFCC) parameters of each voice frame can be calculated from the Mel-frequency spectral energy.
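For illustration, the sketch below shows one common way to realize this conversion, assuming triangular Mel filters and a DCT over the log filterbank energies (the usual MFCC recipe); the filter count and cepstrum count are illustrative defaults, not values taken from the patent.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power_spec, sample_rate=16000,
                             n_filters=26, n_ceps=13):
    """MFCCs for one frame's FFT power spectrum (length nfft//2 + 1)."""
    nfft = 2 * (len(power_spec) - 1)
    # Filter edge frequencies, equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power_spec)))
    for i in range(n_filters):                     # triangular Mel filters
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    energies = np.maximum(fbank @ power_spec, np.finfo(float).eps)
    return dct(np.log(energies), norm='ortho')[:n_ceps]
```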
Step S104: calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates.
In a specific implementation, the preset reference templates each contain the voice content of a corresponding keyword. The number of preset reference templates can be set according to actual needs; the invention is not limited in this respect.
Step S105: when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result.
In a specific implementation, the preset reference templates are traversed. For each template, the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data to be recognized and the template are calculated, and their mean is compared with the preset threshold. When the mean is smaller than the threshold, the keyword in the current reference template can be taken as the recognition result; otherwise, it is determined that the current voice data does not include the speech of that template's keyword.
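Put together, the matching loop might look like the sketch below; `dtw_median`, `euclid_median` and `xcorr_median` are assumed helper functions standing in for the three median-distance computations described above, not functions defined by the patent.

```python
def recognize_keyword(query_mfcc, templates, threshold):
    """templates: list of (keyword, template_mfcc) pairs. Returns the
    recognized keyword, or None if no template beats the threshold."""
    for keyword, template_mfcc in templates:
        medians = (dtw_median(query_mfcc, template_mfcc),
                   euclid_median(query_mfcc, template_mfcc),
                   xcorr_median(query_mfcc, template_mfcc))
        # Compare the mean of the three distance medians with the threshold.
        if sum(medians) / 3.0 < threshold:
            return keyword
    return None   # the voice data includes no keyword from any template
```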
The keyword recognition method in the embodiment of the present invention will be described in further detail with reference to fig. 2.
Fig. 2 is a flowchart illustrating another keyword recognition method according to an embodiment of the present invention. The keyword recognition method as shown in fig. 2 may include the following steps:
step S201: and overlapping and framing the acquired sound data to obtain a plurality of corresponding sound frames.
In a specific implementation, analog-to-digital conversion may first be performed on the collected sound signal to obtain the corresponding sound data. The sound data can then be framed with overlap to obtain a plurality of voice frames. Framing the collected sound data is, in essence, short-time analysis: the sound signal is divided into short segments of fixed length, each of which is a relatively stationary segment of sound. Adjacent voice frames partially overlap, and the overlap can be chosen according to the actual situation.
Step S202: applying windowing to the obtained voice frames.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window or a rectangular window, can be selected; the frame length is chosen in the range of 10-40 ms, with 20 ms a typical value. Framing destroys the naturalness of the speech signal; windowing and overlapping the voice frames mitigates this problem.
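A minimal windowing sketch, assuming the frame matrix produced by the framing step (one frame per row); the set of window choices simply mirrors the options listed above.

```python
import numpy as np

def apply_window(frames, kind="hamming"):
    """Multiply each frame (row) by the chosen window function."""
    windows = {"hamming": np.hamming, "hanning": np.hanning, "rect": np.ones}
    return frames * windows[kind](frames.shape[1])
```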
Step S203: performing a fast Fourier transform on the windowed voice frames to obtain the spectral energy of each voice frame.
In a specific implementation, sound data varies with time and is, in theory, a non-stationary process that cannot be converted to the frequency domain directly. However, because the sound data has been framed (short-time analysis), the sound data within each frame can be considered relatively stationary, so frequency-domain conversion can be applied to it.
In a specific implementation, a Short-Time Fourier Transform (STFT) may be used to convert the sound data of each frame to the frequency domain, yielding the spectrum of each voice frame. The resulting spectrum captures the relation between the frequency and the energy of the corresponding sound signal.
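A sketch of this per-frame frequency-domain conversion, assuming the FFT size equals the frame length; the scaling is one common normalization choice, not one prescribed by the patent.

```python
import numpy as np

def power_spectrum(windowed_frames):
    """Real FFT of each frame (row); returns per-frame power spectra."""
    spectrum = np.fft.rfft(windowed_frames, axis=1)
    return (np.abs(spectrum) ** 2) / windowed_frames.shape[1]
```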
Step S204: converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters.
In an embodiment of the present invention, after the spectral energies of the voice frames of the current voice data to be recognized are obtained, it may first be determined whether the spectral energy of the current voice data is greater than a preset energy threshold. Step S204 is performed only when it is; otherwise, the current voice data is determined not to include keyword speech, and subsequent processing of it can be stopped, further saving computing resources.
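A minimal sketch of this energy gate; the threshold value is an assumed tuning constant, since the patent does not specify one.

```python
import numpy as np

def passes_energy_gate(power_spectra, energy_threshold=1e-3):
    """True when total spectral energy warrants MFCC extraction."""
    return float(np.sum(power_spectra)) > energy_threshold
```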
In a specific implementation, the spectral energy obtained through the FFT may be converted into spectral energy at the Mel frequency according to a preset correspondence, and the MFCC parameters of each voice frame are calculated as that frame's feature vector.
Step S205: calculating, from the MFCC parameters of each voice frame, the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data and the current reference template among the preset reference templates.
In an embodiment of the present invention, when calculating the DTW distance between the current voice data to be recognized and a reference template, both are divided into I frames. The inventors of the present application have observed empirically that, while recording a reference template, a speaker tends to pronounce with more emphasis and at a slower rate than usual. The reference template is therefore divided into I frames, and the hop size for calculating the DTW distance is set to 0.1·I frames; after the I DTW distances between the I frames of the current voice data to be recognized and the I frames of the reference template are calculated, the median of these I DTW distances is taken as the DTW distance median between the current voice data and the corresponding reference template. The Euclidean distance (ED) median and the cross-correlation distance (CC) median between the current voice data and the corresponding reference template are obtained in the same way.
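The patent does not spell out the exact alignment, so the sketch below is one plausible reading: the template has I frames, the query is scanned with a hop of 0.1·I frames, and the median of the per-window distances is returned. `dist_fn` is an assumed callable (a DTW, Euclidean or cross-correlation distance between two MFCC sequences).

```python
import numpy as np

def sliding_median_distance(query_mfcc, template_mfcc, dist_fn):
    """Median of dist_fn over query windows aligned against the template."""
    I = len(template_mfcc)                 # template length in frames
    hop = max(1, int(round(0.1 * I)))      # hop size of 0.1*I frames
    dists = [dist_fn(query_mfcc[s:s + I], template_mfcc)
             for s in range(0, max(1, len(query_mfcc) - I + 1), hop)]
    return float(np.median(dists))
```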
Step S206: judging whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than the preset threshold; if so, step S207 may be executed; otherwise, execution resumes from step S205 with the next of the preset reference templates.
In a specific implementation, after the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice data to be recognized and a reference template are calculated, the mean of the three is compared with the preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current voice data to be recognized; that is, different noise levels correspond to different preset thresholds. When the absolute amplitude p of the current voice data to be recognized satisfies p ≥ p1, the voice data is determined to have a low noise level, where p1 is a preset first threshold; when p1 > p ≥ p2, the voice data is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; when p < p2, the voice data is determined to have a high noise level. In one embodiment of the present invention, p1 is 0.8 and p2 is 0.45.
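This classification is direct to transcribe; the sketch below uses the thresholds given above (p1 = 0.8, p2 = 0.45) on the absolute-amplitude statistic p.

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify noise level from the absolute-amplitude statistic p."""
    if p >= p1:
        return "low"
    if p >= p2:          # i.e. p1 > p >= p2
        return "medium"
    return "high"        # p < p2
```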
Step S207: taking the keyword in the current reference template as the recognition result and outputting it.
In a specific implementation, when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between one of the preset reference templates and the current voice data to be recognized is determined to be smaller than the preset threshold, it can be determined that the current voice data includes the speech of the keyword in that reference template. The keyword in that template can therefore be output as the keyword recognition result for the current voice data.
In a specific implementation, when the above keyword recognition method is applied to an alarm system, the alarm system may perform an alarm operation when a corresponding keyword is recognized.
It should be noted that, in emergency and other keyword applications, personalized keywords may be recorded by an ordinary (e.g., untrained) user. To guarantee good recognition performance, the reference template therefore becomes very important, and a simple check procedure is used to ensure its recording quality.
To this end, the inventors of the present application propose three checks: detecting transient noise sources (such as a door slam), detecting static noise sources (such as fan or traffic noise), and verifying that the pronunciation content of the keyword is rich. All three checks must pass; otherwise, the keyword needs to be recorded again. For transient noise detection, the difference in absolute amplitude of the sound signal energy may be computed over consecutive 25 ms frames with a 5 ms hop, with the absolute amplitude averaged over every 5 voice frames. For static noise detection, the keyword is recorded within a preset 5 s window in a quiet environment; within that window, the signal energy at the beginning and end of the reference template, which contain no keyword, differs markedly from that of the sound data containing the keyword. For the rich-content check, keywords consisting of a single vowel without any consonant, such as 'o', are rejected; this check can be based on a modified zero-crossing rate associated with the pronunciation content of the keyword.
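As a sketch of the rich-content check only, a modified zero-crossing-rate test might look as follows; the threshold values are illustrative assumptions, since the patent gives no concrete numbers.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def has_rich_content(frames, zcr_threshold=0.1, min_fraction=0.2):
    """Reject vowel-only keywords such as 'o': require a minimum share
    of high-ZCR (consonant-like) frames across the recording."""
    zcrs = np.array([zero_crossing_rate(f) for f in frames])
    return bool(np.mean(zcrs > zcr_threshold) >= min_fraction)
```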
The following describes a device corresponding to the keyword recognition method in the embodiment of the present invention in further detail.
Referring to FIG. 3, the keyword recognition apparatus 300 in an embodiment of the present invention may include a framing processing unit 301, a frequency domain conversion unit 302, a first calculating unit 303, a second calculating unit 304, a determining unit 305 and a keyword recognition unit 306, wherein:
the framing processing unit 301 is adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
the frequency domain conversion unit 302 is adapted to traverse the divided voice frames and perform a fast Fourier transform on the sound signal of the traversed current frame to obtain the corresponding spectral energy;
the first calculating unit 303 is adapted to convert the obtained spectral energy into spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
in a specific implementation, the keyword recognition apparatus 300 may further include a triggering unit (not shown in the figure), which is adapted to trigger the first calculating unit 303 to perform the operation of converting the obtained spectrum energy into the spectrum energy at the Mel frequency and calculating the corresponding MFCC parameter when the spectrum energy of the traversed current sound frame is greater than a preset energy threshold;
the second calculating unit 304 is adapted to calculate, from the MFCC parameters corresponding to the current voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the current voice frame and each of a plurality of preset reference templates, respectively;
the determining unit 305 is adapted to determine whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current voice frame and the reference template is smaller than a preset threshold;
in a specific implementation, the preset threshold is associated with the noise level of the current sound frame, wherein when p ≧ p1, the current sound frame is determined to have a low noise level, p represents the corresponding absolute amplitude of the current sound frame, and p1 is a preset first threshold; when p2 is more than or equal to p & gtp 1, determining that the current sound frame has a medium noise level, p2 is a preset second threshold value, and p1 & gtp 2; when p < p2, it is determined that the current sound frame has a high noise level. In one embodiment of the present invention, p1 equals 0.8, and p2 equals 0.45.
In a specific implementation, the reference template includes information of transient noise, static noise and rich speech content of a particular person.
The keyword recognition unit 306 is adapted to, when it is determined that the mean of the DTW distance median, the euclidean distance median, and the cross-correlation distance median between the current sound frame and the reference template is smaller than a preset threshold, take the keyword in the current reference template as a recognition result and output the recognition result.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may include a ROM, a RAM, a magnetic disk or an optical disk.
The method and system of the embodiments of the present invention have been described in detail, but the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (12)
1. A keyword recognition method, comprising:
dividing the acquired voice data to be recognized into a plurality of overlapping voice frames;
performing a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
converting the spectral energy of each voice frame to spectral energy at the Mel frequency, and calculating the corresponding MFCC parameters;
calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates, the reference templates each containing the voice content of a corresponding keyword; wherein, when calculating the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the preset reference templates, the voice data to be recognized and each reference template are divided into I frames, and the hop size for calculating the DTW distance, the Euclidean distance and the cross-correlation distance is 0.1·I frames; after the I DTW distances, Euclidean distances and cross-correlation distances between the I frames of the current voice data to be recognized and the I frames of a reference template are calculated, the median of the I DTW distances is taken as the DTW distance median between the voice data to be recognized and the corresponding reference template, the median of the I Euclidean distances as the Euclidean distance median, and the median of the I cross-correlation distances as the cross-correlation distance median;
and when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold value, taking the keywords in the current reference template as the keyword recognition result of the voice data to be recognized.
2. The keyword recognition method according to claim 1, wherein the operation of converting the spectral energy corresponding to each sound frame into the spectral energy at mel frequency and calculating the corresponding MFCC parameter is performed when the spectral energy of the sound data to be recognized is greater than a preset energy threshold.
3. The method according to claim 1, wherein the predetermined threshold is associated with a noise level of the voice data to be recognized.
4. The keyword recognition method according to claim 3, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be recognized, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, wherein p2 is a preset second threshold, and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
5. The keyword recognition method according to claim 4, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
6. The keyword recognition method according to claim 1, wherein the reference template includes information of transient noise, static noise and rich voice content of a specific person.
7. A keyword recognition apparatus, comprising:
a framing processing unit, adapted to divide the acquired voice data to be recognized into a plurality of overlapping voice frames;
a frequency domain conversion unit, adapted to perform a fast Fourier transform on the sound signals of the divided voice frames, respectively, to obtain the corresponding spectral energy;
a first calculation unit, adapted to convert the spectral energy of each voice frame to spectral energy at the Mel frequency and calculate the corresponding MFCC parameters;
a second calculation unit, adapted to calculate, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates, the reference templates each containing the voice content of a corresponding keyword; wherein, when calculating the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the preset reference templates, the voice data to be recognized and each reference template are divided into I frames, and the hop size for calculating the DTW distance, the Euclidean distance and the cross-correlation distance is 0.1·I frames; after the I DTW distances, Euclidean distances and cross-correlation distances between the I frames of the current voice data to be recognized and the I frames of a reference template are calculated, the median of the I DTW distances is taken as the DTW distance median between the voice data to be recognized and the corresponding reference template, the median of the I Euclidean distances as the Euclidean distance median, and the median of the I cross-correlation distances as the cross-correlation distance median;
a judging unit, adapted to judge whether the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold;
and a keyword recognition unit, adapted to take the keyword in the current reference template as the keyword recognition result of the voice data to be recognized when the mean of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold.
8. The keyword recognition apparatus according to claim 7, further comprising a triggering unit, wherein the triggering unit is adapted to trigger the first computing unit to perform the operation of converting the spectral energy corresponding to each sound frame into the spectral energy at the mel frequency and computing the corresponding MFCC parameters when the spectral energy of the sound data to be recognized is greater than a preset energy threshold.
9. The keyword recognition apparatus according to claim 7, wherein the preset threshold is associated with a noise level of the voice data to be recognized.
10. The keyword recognition apparatus according to claim 9, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be recognized, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, wherein p2 is a preset second threshold, and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
11. The keyword recognition apparatus of claim 10, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
12. The keyword recognition apparatus according to claim 7, wherein the reference template includes information of transient noise, static noise and rich voice content of a specific person.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106920558A CN106920558A (en) | 2017-07-04 |
CN106920558B true CN106920558B (en) | 2021-04-13 |
Family
ID=59454658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510993729.8A Active CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106920558B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065043B (en) * | 2018-08-21 | 2022-07-05 | 广州市保伦电子有限公司 | A kind of command word recognition method and computer storage medium |
CN112765335B (en) * | 2021-01-27 | 2024-03-08 | 上海三菱电梯有限公司 | Voice call system |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005114576A1 (en) * | 2004-05-21 | 2005-12-01 | Asahi Kasei Kabushiki Kaisha | Operation content judgment device |
CN101222703A (en) * | 2007-01-12 | 2008-07-16 | 杭州波导软件有限公司 | Identity verification method for mobile terminal based on voice identification |
CN101599269B (en) * | 2009-07-02 | 2011-07-20 | 中国农业大学 | Phonetic end point detection method and device therefor |
US8432368B2 (en) * | 2010-01-06 | 2013-04-30 | Qualcomm Incorporated | User interface methods and systems for providing force-sensitive input |
CN102509547B (en) * | 2011-12-29 | 2013-06-19 | 辽宁工业大学 | Voiceprint recognition method and system based on vector quantization |
CN103021409B (en) * | 2012-11-13 | 2016-02-24 | 安徽科大讯飞信息科技股份有限公司 | A kind of vice activation camera system |
CN103065627B (en) * | 2012-12-17 | 2015-07-29 | 中南大学 | Special purpose vehicle based on DTW and HMM evidence fusion is blown a whistle sound recognition methods |
CN103971678B (en) * | 2013-01-29 | 2015-08-12 | 腾讯科技(深圳)有限公司 | Keyword spotting method and apparatus |
CN103854645B (en) * | 2014-03-05 | 2016-08-24 | 东南大学 | A kind of based on speaker's punishment independent of speaker's speech-emotion recognition method |
CN104978507B (en) * | 2014-04-14 | 2019-02-01 | 中国石油化工集团公司 | An identity authentication method for intelligent logging evaluation expert system based on voiceprint recognition |
CN104103280B (en) * | 2014-07-15 | 2017-06-06 | 无锡中感微电子股份有限公司 | The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm |
CN104103272B (en) * | 2014-07-15 | 2017-10-10 | 无锡中感微电子股份有限公司 | Audio recognition method, device and bluetooth earphone |
CN104778951A (en) * | 2015-04-07 | 2015-07-15 | 华为技术有限公司 | Speech enhancement method and device |
Non-Patent Citations (5)
Title |
---|
"Voice Command Recognition system based on MFCC and DTW";Abhijeet Kumar;《International Journal or engineering Science and Technology》;20101231;全文 * |
"Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques";Lindasalwa;《Journal of Computing》;20100331;第2卷(第3期);全文 * |
"一种结合端点检测可检错的DTW乐谱跟随算法";吴康妍;《计算机应用与软件》;20150315;全文 * |
"加权DTW距离的自动步态识别";刘志镜;《中国图像图形学报》;20101231;全文 * |
"时间序列动态模糊聚类的研究";赵晓慧;《中国优秀硕士学位论文全文数据库 信息科技辑》;20141231;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN106920558A (en) | 2017-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||