US8326612B2 - Non-speech section detecting method and non-speech section detecting device - Google Patents
Non-speech section detecting method and non-speech section detecting device
- Publication number
- US8326612B2 (application US12/754,156)
- Authority
- US
- United States
- Prior art keywords
- frame
- speech section
- equal
- speech
- section
- Prior art date
- Legal status: Expired - Fee Related
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Definitions
- the present invention relates to a non-speech section detecting method and a non-speech section detecting device of: generating frames having a given time length on the basis of sound data obtained by sampling sound; and then detecting a non-speech section.
- a speech recognition device used in a vehicle-mounted device detects a speech section, and then recognizes a word sequence on the basis of the feature of the speech calculated for the detected speech section.
- when a section containing no speech is erroneously detected as a speech section, the rate of speech recognition in the section is degraded.
- the speech recognition device is therefore intended to detect a speech section accurately. Further, the speech recognition device detects a non-speech section and then excludes it from the target of speech recognition.
- a section in which the power of speech input exceeds a criterion value obtained by adding a threshold value to the estimated present background noise level is treated as a speech section.
- a section containing noise having strong non-stationarity (e.g., noise having large power fluctuation, such as a buzzer sound, the sliding sound of wipers, or the echo of a speech prompt) is liable to be erroneously detected as a speech section under such a power-based criterion.
- Japanese Patent Application Laid-Open No. H7-92989 discloses a technique in which a correction coefficient is calculated from the maximum speech power of the latest utterance and the speech recognition result at that time, and is then used together with the estimated background noise level to correct the future criterion value.
- a non-speech section detecting device generating a plurality of frames having a given time length on the basis of sound data obtained by sampling sound, and then detecting a non-speech section having a frame not containing voice data based on speech uttered by a person, the device including:
- a calculating part calculating a bias of a spectrum obtained by converting sound data of each frame into components on a frequency axis;
- a judging part judging, when the calculated bias of the spectrum has a positive value or a negative value, whether the bias is greater than or equal to a given threshold or alternatively smaller than or equal to a given threshold;
- a counting part counting the number of consecutive frames judged as having a bias greater than or equal to the threshold or alternatively smaller than or equal to the threshold;
- a count judging part judging whether the obtained number of consecutive frames is greater than or equal to a given value; and
- a detecting part detecting, when the obtained number of consecutive frames is judged as greater than or equal to the given value, the section with the consecutive frames as a non-speech section.
- FIG. 1 is a block diagram illustrating a speech recognition device serving as an implementation example of a non-speech section detecting device.
- FIG. 2 is a block diagram illustrating an example of processing concerning speech recognition performed by a control part.
- FIG. 3 is a flow chart illustrating an example of speech recognition processing performed by a control part.
- FIG. 4 is a flow chart illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection.
- FIG. 5 is a diagram illustrating data such as the power and the high-frequency/low-frequency intensity of sound of sniffling.
- FIG. 6 is a diagram illustrating data such as the power and the high-frequency/low-frequency intensity of sound of the alarm of a railroad crossing.
- FIG. 7 is a diagram illustrating data such as the power and the high-frequency/low-frequency intensity of utterance sound (“eh, tesutochu desu” (Japanese sentence; the meaning is “uh, testing”)).
- FIG. 8 is a diagram illustrating data such as the power and the high-frequency/low-frequency intensity of utterance sound (“keiei” (Japanese word; the meaning is “operation (of a company)”)).
- FIG. 9 is a block diagram illustrating an example of processing concerning speech recognition performed by a control part of a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 2.
- FIG. 10 is a block diagram illustrating an example of processing concerning speech recognition performed by a control part of a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 3.
- FIG. 11 is a flow chart illustrating an example of speech recognition processing performed by a control part.
- FIG. 12 is a flow chart illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection.
- FIGS. 13A and 13B are flow charts illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection exclusion.
- FIGS. 14A and 14B are flow charts illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection confirmation.
- FIGS. 15A and 15B are flow charts illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection in a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 4.
- FIG. 16 is a flow chart illustrating an example of speech recognition processing performed by a control part of a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 5.
- FIGS. 17A and 17B are flow charts illustrating a processing procedure performed by a control part in association with a subroutine of non-speech section detection in a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 6.
- FIG. 18 is a flow chart illustrating an example of speech recognition processing performed by a control part of a speech recognition device serving as an implementation example of a non-speech section detecting device according to Embodiment 7.
- FIG. 1 is a block diagram illustrating a speech recognition device serving as an implementation example of a non-speech section detecting device.
- Numeral 1 in the figure indicates a speech recognition device employing a computer, such as a navigation device mounted on a vehicle.
- the speech recognition device 1 includes: a control part 2 such as a CPU (Central Processing Unit) and a DSP (Digital Signal Processor) controlling the entirety of the device; a recording part 3 such as a hard disk and a ROM recording various kinds of information such as programs and data; a storage part 4 such as a RAM recording data that is generated temporarily; a sound acquiring part 5 such as a microphone and the like acquiring sound from the outside; a sound output part 6 such as a speaker and the like outputting sound; a display part 7 such as a liquid crystal display monitor and the like; and a navigation part 8 executing processing concerning navigation like instruction of a route to a destination.
- the recording part 3 stores a computer program 30 used for executing a non-speech section detecting method.
- the computer reads the various procedures contained in the computer program 30 recorded in the recording part 3, and executes them under the control of the control part 2, so that the device operates as the non-speech section detecting device.
- a part of the recording area of the recording part 3 is used as various kinds of databases such as an acoustic model database (DB) 31 recording an acoustic model for speech recognition and a recognition dictionary 32 recording syntax and recognition vocabulary written by phoneme or syllable definition corresponding to the acoustic model.
- a part of the storage area of the storage part 4 is used as a sound data buffer 41 recording sound data digitized by sampling, with a given period, sound which is an analog signal acquired by the sound acquiring part 5 .
- Another part of the storage area of the storage part 4 is used as a frame buffer 42 storing data such as a feature (feature quantity) extracted from each frame obtained by partitioning the sound data into a given time length.
- Yet another part of the storage area of the storage part 4 is used as a work memory 43 storing information generated temporarily.
- the navigation part 8 has a position detecting mechanism such as a GPS (Global Positioning System) and a recording medium such as a DVD (Digital Versatile Disk) and a hard disk recording map information.
- the navigation part 8 executes navigation processing such as route search and route instruction for a route from a present location to a destination.
- the navigation part 8 displays a map and a route onto the display part 7 , and outputs guidance by speech through the sound output part 6 .
- the function of speech recognition may be implemented by one or a plurality of VLSI chips, and then may be incorporated into the navigation device.
- a dedicated device for speech recognition may externally be attached to the navigation device.
- the control part 2 may be shared by the processing of speech recognition and the processing of navigation, or alternatively, separate dedicated circuits may be employed.
- a co-processor executing the processing of particular arithmetic operations concerning speech recognition such as FFT (Fast Fourier Transform), DCT (Discrete Cosine Transform), and IDCT (Inverse Discrete Cosine Transform), which will be described later, may be built into the control part 2 .
- the sound data buffer 41 may be implemented as an attached circuit to the sound acquiring part 5
- the frame buffer 42 and the work memory 43 may be implemented on a memory provided in the control part 2 .
- the speech recognition device 1 is not limited to a vehicle-mounted device such as the navigation device, and may be applied to a device of various kinds of applications performing speech recognition.
- FIG. 2 is a block diagram illustrating an example of processing concerning speech recognition performed by the control part 2 .
- FIG. 3 is a flow chart illustrating an example of speech recognition processing performed by the control part 2 .
- the control part 2 includes: a frame generating part 20 generating a frame from sound data; a spectrum bias calculating part 21 calculating the bias of the spectrum of the generated frame; a non-speech section detecting part 22 detecting a non-speech section on the basis of a judgment criterion based on the calculated bias of the spectrum; a speech section judging part 23 confirming the start/end of a speech section on the basis of the detected non-speech section; and a speech recognition part 24 recognizing the speech of the judged speech section.
- the control part 2 acquires external sound as an analog signal through the sound acquiring part 5 (step S11).
- the control part 2 records sound data digitized by sampling the acquired sound with a given period, in the sound data buffer 41 (step S 12 ).
- the external sound acquired at step S 11 is sound in which various kinds of sound such as speech uttered by a person, stationary noise, and non-stationary noise are superposed on one another.
- the speech uttered by a person is speech serving as a target of recognition by the speech recognition device 1 .
- the stationary noise is noise such as road noise and engine sound, and is removed by various kinds of removing methods already proposed and established. Examples of the non-stationary noise are: relay sound like those from hazard lamps and blinkers arranged on the vehicle; and mechanical noise like the sliding sound of wipers.
- from the sound data stored in the sound data buffer 41, the frame generating part 20 of the control part 2 generates frames, each having a frame length of 10 msec and overlapping one another by 5 msec (step S13).
- the control part 2 stores the generated frames in the frame buffer 42 (step S 14 ).
- the frame generating part 20 performs high-frequency emphasis filtering processing on the data before frame division. Then, the frame generating part 20 divides the data into frames. The following processing is performed on each frame generated as described here.
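- As a concrete illustration of this framing stage, the sketch below assumes a 16 kHz sampling rate and a pre-emphasis coefficient of 0.97; neither value appears in the text, while the 10 msec frame length and 5 msec overlap do.

```python
# Minimal framing sketch: high-frequency emphasis, then overlapping frames.
# Assumptions (not from the text): fs = 16000 Hz, pre-emphasis coefficient 0.97.
import numpy as np

def make_frames(samples: np.ndarray, fs: int = 16000,
                frame_ms: float = 10.0, shift_ms: float = 5.0) -> np.ndarray:
    # High-frequency emphasis before frame division: y[n] = x[n] - 0.97 x[n-1]
    emphasized = np.append(samples[0], samples[1:] - 0.97 * samples[:-1])
    frame_len = int(fs * frame_ms / 1000)   # 160 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # 80 samples -> frames overlap by half
    n_frames = 1 + (len(emphasized) - frame_len) // shift
    return np.stack([emphasized[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])
```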
- the spectrum bias calculating part 21 calculates the bias of the spectrum described later (step S 15 ).
- the bias calculating part 21 writes the calculated bias of the spectrum into the frame buffer 42 .
- a pointer (address) to the frame buffer 42 to be used for referring to the frame and the bias of the spectrum having been written is provided on the work memory 43 . That is, the pointer allows the bias calculating part 21 to access the bias of the spectrum stored in the frame buffer 42 .
- noise cancellation processing and spectrum subtraction processing may be performed so that the influence of noise may be eliminated.
- the non-speech section detecting part 22 calls a subroutine of detecting a non-speech section by using a judgment criterion based on the bias of the spectrum (step S 16 ).
- the frames in the non-speech section detected by the non-speech section detecting part 22 by using the judgment criterion are sequentially provided to the speech section judging part 23 via the frame buffer 42 .
- Not-yet-judged frames, that is, frames that may belong to a non-speech section depending on the subsequent frames, are suspended by the non-speech section detecting part 22 until all judgment criteria are used up.
- the speech section judging part 23 recognizes as a speech section the section not detected as a non-speech section by the non-speech section detecting part 22 .
- the speech section judging part 23 judges that a speech section is started and confirms the speech section start frame. Then, the frame where the speech section is terminated is recognized as a candidate for a speech section end point. After that, if the next speech section starts before a given maximum pause length L2 elapses, the above-mentioned speech section end point candidate is rejected and the next termination of the speech section is awaited.
- the speech section judging part 23 confirms the speech section end point candidate as the speech section end frame. When the start/end frames of the speech section have been confirmed, the speech section judging part 23 terminates the judgment of one speech section (step S 17 ).
- the speech section detected as described here is provided to the speech recognition part 24 via the frame buffer 42 .
- a speech section obtained by expanding the speech section judged by the speech section judging part 23 , forward and backward by 100 msec each, may be adopted as a confirmed speech section.
- the speech recognition part 24 extracts a feature vector from the digital signal of each frame in the speech section. On the basis of the extracted feature vector, the speech recognition part 24 refers to the acoustic model recorded in the acoustic model database 31 and the recognition vocabulary and the syntax stored in the recognition dictionary 32. The speech recognition part 24 executes speech recognition processing to the end of the input data in the frame buffer 42 (to the end of the speech section) (step S18).
- speech recognition processing is executed and then the procedure is terminated.
- speech recognition processing may be started at any frame where calculation is applicable, so that the response time may be reduced. Further, when a speech section is not detected within a given time, the processing may be terminated.
- the bias of a spectrum mentioned at step S15 is described below in further detail with reference to FIGS. 5 to 8.
- a high-frequency/low-frequency intensity is defined as a measure indicating the inclination of the spectrum in each frame of the sound data, that is, a deviation in the high-frequency range/low frequency range of the spectrum.
- the high-frequency/low-frequency intensity is used as the bias of the spectrum.
- the bias of a spectrum is expressed as the absolute value of the high-frequency/low-frequency intensity.
- the high-frequency/low-frequency intensity serves as an index approximating the spectral envelope.
- the high-frequency/low-frequency intensity is expressed by the ratio of the first order autocorrelation function based on a one-sample delay to the zero-th order autocorrelation function expressing the power of the sound data.
- the autocorrelation function c(τ) may be obtained by performing inverse discrete Fourier transform (IDFT) on a short-time spectrum S(ω).
- the short-time spectrum S(ω) is obtained by applying a Hamming window to each frame and then performing DFT (Discrete Fourier Transform) on the data of the frame to which the window has been applied.
- IDCT/DCT may be employed in place of the IDFT/DFT.
- the high-frequency/low-frequency intensity A is defined by the following Formula 3 and Formula 4, using the ratio of the first order autocorrelation to the zero-th order autocorrelation.
- A = c(1)/c(0) (c(0) ≠ 0) (Formula 3)
- A = 0 (c(0) = 0) (Formula 4)
- A takes a value within the range −1 ≤ A ≤ 1.
- a value closer to 1 (or −1) indicates a higher intensity in the low-frequency range (or high-frequency range) of the spectrum.
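- A minimal sketch of the computation of A, applying the Hamming window mentioned above and evaluating c(0) and c(1) directly in the time domain (equivalent, by the Wiener-Khinchin relation, to the IDFT-of-the-spectrum route); the zero fallback for c(0) = 0 follows Formula 4 as reconstructed here.

```python
# Sketch: high-frequency/low-frequency intensity A = c(1)/c(0) of one frame.
import numpy as np

def hf_lf_intensity(frame: np.ndarray) -> float:
    windowed = frame * np.hamming(len(frame))
    c0 = float(np.dot(windowed, windowed))           # c(0): power of the frame
    c1 = float(np.dot(windowed[:-1], windowed[1:]))  # c(1): one-sample delay
    return c1 / c0 if c0 != 0.0 else 0.0             # Formula 3; A = 0 if c(0) = 0
```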
- the high-frequency/low-frequency intensity may be defined as: the ratio between autocorrelation functions of orders other than the zero-th order and the first order; the ratio of the power of a given frequency band to the power of a different given frequency band; MFCC; or a cepstrum obtained by performing inverse Fourier transform on a logarithmic spectrum.
- the high-frequency/low-frequency intensity may be defined as at least one of the ratio of the frequencies and the ratio of the powers of mutually different formants among the estimated formants.
- FIGS. 5 to 8 are diagrams each illustrating data such as the power and the high-frequency/low-frequency intensity of the sound of sniffling, the alarm tone of a railroad crossing, and two kinds of utterance sound (“eh, tesutochu desu” (Japanese sentence; the meaning is “uh, testing”) and “keiei” (Japanese word; the meaning is “operation (of a company)”)).
- the horizontal axes indicate time.
- the vertical axes indicate from the top part to the bottom part: the waveform of the sound data; the power (a dashed line, the left axis) and the high-frequency/low-frequency intensity A (a solid line, the right axis) of the sound data; and the spectrogram (the left axis).
- the dark region is deviated to the upper part corresponding to the high-frequency range.
- the value A is close to −1 in this section.
- the value A greatly fluctuates in the range of approximately −0.7 ≤ A ≤ 0.7. That is, in the section during utterance, the value A does not stay at a particular value for a long time, and fluctuates within a certain range.
- a situation in which the value A stays stable even during utterance occurs when the same phoneme continues, like the “su” at the end of the utterance illustrated in FIG. 7. In this case, the “su” is devoiced, and hence a fricative /s/ having a high intensity in the high-frequency range continues.
- the value A stays stable near −0.7, which is close to −1, for approximately 0.3 seconds. Further, even in a section in which one phoneme similarly continues, the value A varies depending on the uttered phoneme. For example, in FIG. 7, although a vowel /u/ continues near the “u” at the end of “tesutochu”, the value A is deviated in the positive direction and takes a value of approximately 0.6.
- “keiei” is uttered as “keh-eh”, in which /e/ continues for an approximately 4-mora length except for the first /k/. This is probably the case in which the same phoneme continues for the longest time in Japanese. The duration is approximately 1.2 seconds at the longest even when the word is uttered slowly.
- the threshold in terms of the duration time of frames may be replaced by a threshold in terms of the number of frames within the duration.
- during an utterance, the balance between the high-frequency and the low-frequency ranges fluctuates, and hence the bias of the spectrum does not stay at a large value for a long time.
- the threshold in the judgment described above is adjusted in accordance with the transfer characteristics of the input system.
- FIG. 4 is a flow chart illustrating a processing procedure performed by the control part 2 in association with a subroutine of non-speech section detection.
- the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to a given threshold (e.g., 0.7 described above) (step S 21 ).
- the control part 2 updates the pointer indicating the frame buffer 42 stored on the work memory 43 , backward by one frame (step S 22 ), and then returns the procedure.
- That is, the control part 2 returns the procedure without detecting a non-speech section.
- When it is judged as being greater than or equal to the given threshold (step S21: YES), the control part 2 stores the frame number of the frame indicated by the present pointer as a “start frame number” in the work memory 43 (step S23). Then, the control part 2 initializes the stored value of “frame count” provided on the work memory 43 to “1” (step S24).
- the “frame count” is used for counting the number of frames where comparison judgment between the bias of the spectrum and the given threshold has been performed.
- the control part 2 judges whether the memory contents value of “frame count” is greater than or equal to a given value (e.g., 10 which is the number of frames contained within 0.1 seconds described above) (step S 25 ).
- the control part 2 adds “1” to the memory contents of “frame count” (step S 26 ).
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 27 ). Then, the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to the given threshold (step S 28 ).
- When the bias of the spectrum is judged as greater than or equal to the given threshold (step S28: YES), the control part 2 returns the procedure to step S25.
- When the bias of the spectrum is judged as smaller than the given threshold (step S28: NO), the control part 2 deletes the contents of “start frame number” (step S29), and then returns the procedure.
- That is, the control part 2 returns the procedure without detecting a non-speech section.
- When, at step S25, the memory contents value of “frame count” is judged as greater than or equal to the given value (step S25: YES), the control part 2 goes to the processing of detecting the end frame of the non-speech section.
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 30 ). Then, the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to the given threshold (step S 31 ).
- When the bias of the spectrum is judged as greater than or equal to the given threshold (step S31: YES), the control part 2 returns the procedure to step S30.
- When the bias of the spectrum is judged as smaller than the given threshold (step S31: NO), the control part 2 stores the frame number of the frame preceding the frame indicated by the present pointer as an “end frame number” in the work memory 43 (step S32), and then returns the procedure.
- the section partitioned by the “start frame number” and the “end frame number” is recognized as a detected non-speech section.
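- The whole subroutine of FIG. 4 (steps S21 to S32) may be condensed as in the sketch below. It assumes that each entry of bias is the absolute value |A| described earlier and that “updating the pointer backward” advances it to the next (newer) frame, and it uses the example values 0.7 and 10 frames from the text.

```python
THRESHOLD = 0.7   # example threshold on the bias of the spectrum
MIN_FRAMES = 10   # frames contained within 0.1 seconds, per the text

def detect_non_speech(bias, p):
    """Return (start, end) frame indices of a non-speech section, or None."""
    if p >= len(bias) or bias[p] < THRESHOLD:       # step S21: below threshold
        return None                                 # step S22: just advance pointer
    start, count = p, 1                             # steps S23-S24
    while count < MIN_FRAMES:                       # step S25
        count += 1                                  # step S26
        p += 1                                      # step S27: next frame
        if p >= len(bias) or bias[p] < THRESHOLD:   # step S28: run broken too early
            return None                             # step S29: discard start frame
    while True:                                     # steps S30-S31: find the end
        p += 1
        if p >= len(bias) or bias[p] < THRESHOLD:
            return start, p - 1                     # step S32: end frame number
```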
- In Embodiment 1, when frames where the bias of the spectrum is greater than or equal to a given threshold continue for a number of frames greater than or equal to a given value, the section with those consecutive frames is detected as a non-speech section.
- a section in which frames having a high bias of the spectrum and having a feature of non-speech continue to an extent of being unlike speech is detected as a non-speech section. Accordingly, correction of the criterion value based on an utterance of a person is not necessary. Thus, even under an environment where noise of a large power or noise of strong non-stationarity is generated, a non-speech section may accurately be detected regardless of the timing before or after the utterance.
- Embodiment 2 is a mode in which a speech section detecting device based on the estimated background noise power is employed together with the non-speech section detecting device according to Embodiment 1.
- FIG. 9 is a block diagram illustrating an example of processing concerning speech recognition performed by a control part 2 of a speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 2.
- the control part 2 includes: a frame generating part 20 ; a spectrum bias calculating part 21 ; a non-speech section detecting part 22 a detecting a non-speech section by using a judgment criterion based on the calculated bias of the spectrum; a speech section judging part 23 a confirming the start/end of a speech section on the basis of the detected non-speech section; a feature calculating part 28 calculating the feature used for collation in speech recognition of the confirmed speech section; and a collating part 29 performing collation processing of speech recognition by using the calculated feature.
- the control part 2 further includes: a power calculating part 26 calculating the power of the sound data of the frame generated by the frame generating part 20 ; a background noise power estimating part 27 estimating the background noise power on the basis of the calculated power value; and a speech section correcting section 25 notifying the speech section judging part 23 a of the frame number for a frame to be corrected.
- the non-speech section detecting part 22 a provides the frame number of a detected non-speech section to the speech section judging part 23 a and the speech section correcting section 25 .
- the speech section correcting section 25 provides the speech section judging part 23 a with a given correcting signal and the frame number of a frame to be corrected.
- the power calculating part 26 calculates the power of the sound data of each frame provided by the frame generating part 20 , and then provides the calculated power value to the background noise power estimating part 27 .
- noise cancellation processing and spectrum subtraction processing may be performed so that the influence of noise may be eliminated.
- the background noise power estimating part 27 unconditionally recognizes the head frame of the sound data as noise, and then adopts the power of the sound data of the frame as the initial value for the estimated background noise power. After that, the background noise power estimating part 27 excludes the frames within the speech section notified by the speech section judging part 23 a . As for the second and subsequent frames of the sound data, the background noise power estimating part 27 calculates the simple moving average of the power of the latest two frames. On the basis of the calculated moving average, the background noise power estimating part 27 updates the estimated background noise power of each frame.
- the update value of the estimated background noise power may be calculated by using an IIR (Infinite Impulse Response) filter.
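- A minimal sketch of this update rule: the head frame is treated unconditionally as noise, later frames outside judged speech sections update the estimate from the two-frame simple moving average, and the IIR alternative is noted in a comment. The class and method names are illustrative, not from the patent.

```python
# Hedged sketch of the background noise power estimation (Embodiment 2).
class NoiseEstimator:
    def __init__(self, first_frame_power: float):
        self.noise_power = first_frame_power  # head frame adopted as initial noise
        self.prev_power = first_frame_power

    def update(self, frame_power: float, in_speech: bool) -> float:
        # Frames within a judged speech section are excluded from the update.
        if not in_speech:
            # Simple moving average of the power of the latest two frames.
            self.noise_power = 0.5 * (self.prev_power + frame_power)
            # IIR alternative: a * noise_power + (1 - a) * frame_power, 0 < a < 1.
        self.prev_power = frame_power
        return self.noise_power

    def correct(self, latest_non_speech_power: float) -> None:
        # Overwrite-correction when frames are re-judged as a non-speech section.
        self.noise_power = latest_non_speech_power
```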
- When correction of the estimated background noise power is notified by the speech section judging part 23 a, the background noise power estimating part 27 overwrites and corrects the estimated background noise power by using the power calculated from the sound data of the presently newest frame among the frames corrected into a non-speech section.
- the background noise power estimating part 27 may calculate the estimated background noise power for the sound data of the frame corrected into a non-speech section.
- alternatively, the estimated background noise power may be overwritten only when the N-th correction (N being a given natural number greater than or equal to 2) is notified, by using the power calculated from the sound data of the presently newest frame. This avoids a situation in which a speech section is not detected owing to an excessive increase in the estimated background noise level when the background noise level fluctuates up and down.
- when the power of a frame exceeds a criterion value based on the estimated background noise power, the speech section judging part 23 a judges the frame as a speech section. Further, when the given correcting signal described above is provided by the speech section correcting section 25, the speech section judging part 23 a corrects the judgment result of the speech section on the basis of the frame number to be corrected. Then, when the judged speech section continues for a duration greater than or equal to the shortest input time length and shorter than or equal to the longest input time length, the speech section judging part 23 a confirms the present speech section. The speech section judging part 23 a notifies the feature calculating part 28, the collating part 29 and the background noise power estimating part 27 of the judged speech section.
- the speech section judging part 23 a notifies the background noise power estimating part 27 of an instruction for correcting the estimated background noise power on the basis of the sound data of the frame corrected into a non-speech section.
- the feature calculating part 28 calculates the feature used for collation of speech recognition for the section finally confirmed as a speech section by the speech section judging part 23 a .
- the feature described here indicates, for example, a feature vector whose similarity to the acoustic model recorded in the acoustic model database 31 is allowed to be calculated.
- the feature is calculated by converting a digital signal having undergone frame processing.
- the feature in the present embodiment is an MFCC (Mel Frequency Cepstrum Coefficient).
- the feature may be an LPC (Linear Predictive Coding) cepstrum or an LPC coefficient.
- the digital signal having undergone frame processing is processed by FFT so that an amplitude spectrum is obtained.
- in the MFCC calculation, the amplitude spectrum is processed by a mel filter bank whose center frequencies are located at regular intervals in the mel frequency domain. Then, the logarithm of the processing result is transformed by DCT.
- coefficients of low orders such as the first order to the fourteenth order are used as a feature vector called an MFCC.
- the orders are determined by various kinds of factors such as the sampling frequency and the application, and their numerical values are not limited to particular ones.
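- The pipeline just described (amplitude spectrum by FFT, mel filter bank at regular mel intervals, logarithm, DCT, low-order coefficients) may be sketched as follows; the 24-filter bank, 256-point FFT, and 16 kHz rate are illustrative assumptions.

```python
# Hedged MFCC sketch; parameter values are illustrative, not from the patent.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=16000, n_filters=24, n_ceps=14, n_fft=256):
    # Amplitude spectrum of the windowed frame (FFT).
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    # Triangular filters with center frequencies at regular mel intervals.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    # Logarithm of the filter-bank output, then DCT; keep orders 1 to 14.
    log_energies = np.log(fbank @ spec + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[1:n_ceps + 1]
```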
- for the speech section judged and confirmed by the speech section judging part 23 a, the collating part 29 refers, on the basis of the feature vector calculated by the feature calculating part 28, to the acoustic model recorded in the acoustic model database 31 and the recognition vocabulary and the syntax recorded in the recognition dictionary 32, and executes speech recognition processing. Further, on the basis of the recognition result, the collating part 29 controls the output of other input and output parts such as the sound output part 6 and the display part 7.
- the detection result by the speech section detecting device based on the power of sound data is corrected by the non-speech section detecting device. This improves the overall accuracy in speech section detection.
- FIG. 10 is a block diagram illustrating an example of processing concerning speech recognition performed by a control part 2 of a speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 3. Further, FIG. 11 is a flow chart illustrating an example of speech recognition processing performed by the control part 2 .
- the control part 2 includes: a frame generating part 20 generating frames from sound data; a spectrum bias/power/pitch calculating part 21 a calculating the spectrum bias/power/pitch of the sound data of each generated frame; a variation amount calculating part 21 b calculating the amount of variation relative to the preceding frame with respect to the calculated spectrum bias/power/pitch; a non-speech section detecting part 22 b detecting a non-speech section on the basis of a judgment criterion based on the calculated variation amount; a speech section judging part 23 b confirming the start/end of a speech section on the basis of the detected non-speech sections; and a speech recognition part 24 recognizing speech in the judged speech section.
- the processing at steps S 41 to S 44 is similar to that at steps S 11 to S 14 in FIG. 3 . Thus, description is not repeated.
- the following processing is performed on each frame generated in the processing at steps S 41 to S 44 .
- the spectrum bias/power/pitch calculating part 21 a calculates at least one of the bias of the spectrum of the sound data, the power of the sound data, and the pitch of the sound data (step S 45 ).
- the spectrum bias/power/pitch calculating part 21 a writes at least one of the calculated bias of the spectrum, power, and pitch into the frame buffer 42.
- the quantity to be calculated here is not limited to the spectrum bias/power/pitch which is a scalar quantity.
- the power spectrum, the amplitude spectrum, the MFCC, the LPC cepstrum, the LPC coefficient, the PLP coefficient or the LSP parameter may be employed, which are vectors expressing acoustical characteristics.
- the variation amount calculating part 21 b calculates the amount of variation relative to the preceding frame, and then writes the obtained result into the frame buffer 42 (step S 46 ).
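- The excerpt does not reproduce the formula for the variation amount C(t); a minimal, hedged reading is the distance between the quantity of frame t and that of frame t−1, as sketched below (absolute difference for a scalar such as the spectrum bias, Euclidean distance for a vector such as an MFCC).

```python
# Hedged sketch of C(t): distance between consecutive per-frame quantities.
import numpy as np

def variation_amount(feat_t, feat_prev) -> float:
    diff = np.atleast_1d(feat_t) - np.atleast_1d(feat_prev)
    return float(np.linalg.norm(diff))
```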
- a pointer (address) to the frame buffer 42 to be used for referring to the frame and the variation amount having been written is provided and initialized on the work memory 43 .
- the non-speech section detecting part 22 b calls a subroutine of detecting a non-speech section by using a judgment criterion based on the variation amount (step S 47 ).
- the frames in the non-speech section detected by the non-speech section detecting part 22 b by using the judgment criterion are sequentially provided to the speech section judging part 23 b via the frame buffer 42 .
- the speech section judging part 23 b confirms the start/end frames of the speech section so as to judge the speech section (step S 48 ).
- the speech recognition part 24 executes speech recognition processing to the end of the input data in the frame buffer 42 (to the end of the speech section) (step S 49 ).
- the judgment based on C(t) is not limited to the above-mentioned (d) and (e). That is, different conditions may be set up by differently combining a threshold concerning the variation amount and a threshold concerning the duration time. Further, since the frame length is constant, the threshold in terms of the duration time of frames may be replaced by a threshold in terms of the number of frames within the duration.
- the subroutine call at step S47 in FIG. 11 may be executed for each variation amount so that a non-speech section may be detected independently.
- FIG. 12 is a flow chart illustrating a processing procedure performed by the control part 2 in association with a subroutine of non-speech section detection.
- the control part 2 judges whether the variation amount of the frame indicated by the present pointer is smaller than or equal to a given threshold (e.g., 0.05 described above) (step S 51 ).
- the control part 2 calls the subroutine of non-speech section detection confirmation (step S 52 ), and then returns the procedure.
- When the variation amount is judged as greater than the given threshold (step S51: NO), the control part 2 judges whether the variation amount exceeds a second threshold (e.g., 0.5 described above) (step S53). When it is judged as not exceeding the second threshold (step S53: NO), the control part 2 returns the procedure intact.
- When the variation amount is judged as exceeding the second threshold (step S53: YES), the control part 2 calls a subroutine of non-speech section detection exclusion (step S54), and then returns the procedure.
- FIGS. 13A and 13B are flow charts illustrating a processing procedure performed by the control part 2 in association with the subroutine of non-speech section detection exclusion.
- FIGS. 14A and 14B are flow charts illustrating a processing procedure performed by the control part 2 in association with the subroutine of non-speech section detection confirmation.
- when the subroutine of non-speech section detection exclusion is called, the control part 2 stores the frame number of the frame indicated by the present pointer as a “start frame number” in the work memory 43 (step S61). Then, the control part 2 initializes the stored value of “frame count” provided on the work memory 43 to “1” (step S62).
- the “frame count” is used for counting the number of frames where comparison judgment between the variation amount and the second threshold has been performed.
- the control part 2 judges whether the memory contents value of “frame count” is smaller than or equal to a given value (e.g., 3 which is the number of frames contained within 30 msec) (step S 63 ).
- the control part 2 adds “1” to the memory contents of “frame count” (step S 64 ).
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 65 ). Then, the control part 2 judges whether the variation amount of the frame indicated by the present pointer exceeds a second threshold greater than the given threshold described above (step S 66 ).
- When the variation amount is judged as exceeding the second threshold (step S66: YES), the control part 2 returns the procedure to step S63.
- When the variation amount is judged as smaller than or equal to the second threshold (step S66: NO), the procedure goes to step S67.
- the control part 2 judges whether the frame located a “second given number” of frames ago (the above-mentioned w frames ago, in this example) relative to the frame whose frame number is stored in the “start frame number” belongs to a non-speech section (step S 67 ).
- When the frame located the “second given number” of frames ago is judged as belonging to a non-speech section (step S67: YES), on the assumption that the section where the variation amount has increased accidentally has a possibility of being judged later as a non-speech section, the control part 2 imparts a mark “non-speech candidate section” to the section (step S68).
- When, at step S63, the memory contents value of “frame count” is judged as exceeding the given value (step S63: NO), that is, when the section having a large variation amount continues to an extent of being unlike an accidental situation, the control part 2 goes to the processing of detecting the end frame of the section.
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 69 ). Then, the control part 2 judges whether the variation amount of the frame indicated by the present pointer exceeds the second threshold (step S 70 ). When the variation amount is judged as exceeding the second threshold (step S 70 : YES), the control part 2 returns the procedure to step S 69 .
- When the variation amount is judged as smaller than or equal to the second threshold (step S70: NO), that is, when the section where the variation amount exceeds the second threshold has ended, or alternatively, when at step S67 the frame located the “second given number” of frames ago is judged as not belonging to a non-speech section (step S67: NO), the control part 2 goes to step S71. In order to exclude from the target of non-speech section detection the section where the variation amount exceeds the second threshold, the control part 2 imparts a mark “non-speech exclusion section” to the section (step S71).
- When the processing at step S71 is completed, or alternatively when the processing at step S68 is completed, the control part 2 subtracts “the second given number (in this example, w described above) − 1” from the contents of “start frame number” (step S72). Further, the control part 2 generates a number by adding “the second given number − 1” to the frame number of the frame preceding the frame indicated by the present pointer, stores the generated number as the “end frame number” in the work memory 43 (step S73), and then returns the procedure.
- a section obtained by extending the section where the variation amount exceeds the second threshold, by “w − 1” frames forward and backward, is recognized as a “non-speech candidate section” or a “non-speech exclusion section”.
- the control part 2 stores the frame number of the frame indicated by the present pointer, as a “start frame number” in the work memory 43 (step S 81 ). Then, the control part 2 initializes the stored value of “frame count” provided on the work memory 43 to “1” (step S 82 ).
- the “frame count” is used for counting the number of frames where comparison judgment between the variation amount and a given threshold has been performed.
- the control part 2 judges whether the memory contents value of “frame count” is greater than or equal to a given value (e.g., the number of frames contained within the above-mentioned 0.5 seconds) which is different from the given value employed at step S 63 (step S 83 ).
- the control part 2 adds “1” to the memory contents of “frame count” (step S 84 ).
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 85 ).
- the control part 2 judges whether the variation amount of the frame indicated by the present pointer is smaller than or equal to a given threshold (step S 86 ).
- When the variation amount is judged as smaller than or equal to the given threshold (step S86: YES), the control part 2 returns the procedure to step S83.
- When the variation amount is judged as exceeding the given threshold (step S86: NO), the control part 2 recognizes that a non-speech section is not found.
- the control part 2 judges whether the frame preceding to the frame whose frame number is stored in the “start frame number” is contained within a non-speech candidate section (step S 87 ).
- When the preceding frame is judged as contained within a non-speech candidate section (step S87: YES), the control part 2 changes the non-speech candidate section into a non-speech exclusion section (step S88).
- When the preceding frame is judged as not contained within a non-speech candidate section (step S87: NO), or when the processing at step S88 is completed, the control part 2 deletes the memory contents of “start frame number” (step S89), and then returns the procedure.
- When, at step S83, the memory contents value of “frame count” is judged as greater than or equal to the given value (step S83: YES), the control part 2 goes to the processing of detecting the end frame of the non-speech section.
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 90 ).
- the control part 2 judges whether the variation amount of the frame indicated by the present pointer is smaller than or equal to a given threshold (step S 91 ).
- When the variation amount is judged as smaller than or equal to the given threshold (step S91: YES), the control part 2 returns the procedure to step S90.
- When the variation amount is judged as exceeding the given threshold (step S91: NO), that is, when the detected non-speech section has ended, the control part 2 judges whether the frame preceding the frame whose frame number is stored in the “start frame number” is contained within a non-speech candidate section (step S92). When the preceding frame is judged as contained within a non-speech candidate section (step S92: YES), the control part 2 deletes the mark of the non-speech candidate section so as to confirm the section as a non-speech section (step S93).
- The control part 2 stores the frame number of the frame preceding the frame indicated by the present pointer as an “end frame number” in the work memory 43 (step S94), and then returns the procedure.
- the section partitioned by the “start frame number” and the “end frame number” is recognized as a newly detected non-speech section.
- judgment is performed concerning at least one of the spectrum bias, the power, and the pitch calculated from the sound data of each frame.
- when frames where the variation amount C(t) relative to the preceding frame is smaller than or equal to, for example, 0.05 continue for a number of frames greater than or equal to the number corresponding to a duration of 0.5 seconds, the section between the first frame where the variation amount becomes smaller than or equal to 0.05 and the last frame having a variation amount smaller than or equal to 0.05 is detected as a non-speech section.
- a section having an accidentally large variation amount is excluded from the target of non-speech section detection.
- when such a section is immediately preceded and followed by a non-speech section, however, the exclusion judgment is nullified and the section is also detected as a non-speech section.
- a section in which frames having a small variation amount and a feature of non-speech continue to an extent of being unlike speech is detected as a non-speech section. Accordingly, correction of the criterion value based on an utterance of a person is not necessary. Thus, even under an environment where noise having large power fluctuation is generated, a non-speech section is accurately detected regardless of the timing before or after the utterance. Further, non-speech section detection may appropriately be achieved even for a section having an accidentally large variation amount (e.g., an instance when the amount of air flow from an air-conditioner has fluctuated so that a quantitative noise has varied).
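- As an illustration of this rule, the sketch below detects runs of small variation; the threshold 0.05 and the 0.5-second duration come from the text, the 5 msec frame shift (hence 100 frames) is carried over from Embodiment 1 as an assumption, and the candidate/exclusion bookkeeping of FIGS. 13A to 14B is omitted for brevity.

```python
SMALL = 0.05    # given threshold on the variation amount
MIN_RUN = 100   # frames within 0.5 seconds, assuming the 5 msec frame shift

def detect_by_variation(C):
    """Yield (start, end) runs where C(t) <= SMALL lasts at least MIN_RUN frames."""
    t = 0
    while t < len(C):
        if C[t] <= SMALL:
            start = t
            while t + 1 < len(C) and C[t + 1] <= SMALL:
                t += 1
            if t - start + 1 >= MIN_RUN:
                yield start, t   # first and last frames of the small-variation run
        t += 1
```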
- as a modification, the variation amount may be replaced by the maximum value of the variation amount in the frames near frame t.
- a non-speech section becomes hardly detected, and hence erroneous detection of a non-speech section is suppressed.
- forward and backward z frames relative to the frame t, that is, in the section between the frame t−z and the frame t+z.
- each calculated value may be recognized as the bias of the spectrum of the frame t.
- a non-speech section may be detected independently for each of the newly calculated quantities of the bias of the spectrum.
- In Embodiment 1, a section in which frames where the bias of the spectrum is greater than or equal to a given threshold continue for a number of frames greater than or equal to a given value has been detected as a non-speech section.
- In Embodiment 4, when a section in which the fraction of frames where the bias of the spectrum is greater than or equal to a given threshold exceeds a given value continues for a number of frames greater than or equal to a given value, the section is detected as a non-speech section.
- FIGS. 15A and 15B are flow charts illustrating a processing procedure performed by a control part 2 in association with a subroutine of non-speech section detection in a speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 4.
- the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to a given threshold (step S 111 ). When it is judged as being smaller than the given threshold (step S 111 : NO), the control part 2 updates the pointer indicating the frame buffer 42 stored on the work memory 43 , backward by one frame (step S 112 ), and then returns the procedure.
- That is, the control part 2 returns the procedure without detecting a non-speech section.
- When it is judged as being greater than or equal to the given threshold (step S111: YES), the control part 2 stores the frame number of the frame indicated by the present pointer as a “start frame number” in the work memory 43 (step S113). Then, the control part 2 initializes the stored value of “frame count 1” provided on the work memory 43 to “1” (step S114). The control part 2 further initializes the stored value of “frame count 2” to “1” (step S115).
- the “frame count 1 ” is used for counting the number of frames where comparison judgment between the bias of the spectrum and the given threshold has been performed. Further, the “frame count 2 ” is used for counting the number of frames where the bias of the spectrum is greater than or equal to the given threshold.
- The control part 2 judges whether the memory contents value of “frame count 1” is greater than or equal to a given value (step S116). When it is judged as being smaller than the given value (step S116: NO), the control part 2 adds “1” to the memory contents of “frame count 1” (step S117). The control part 2 updates the pointer indicating the frame buffer backward by one frame (step S118). Then, the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to the given threshold (step S119).
- When the bias of the spectrum is judged as greater than or equal to the given threshold (step S119: YES), the control part 2 adds “1” to the memory contents of “frame count 2” (step S120), and then returns the procedure to step S116.
- When the bias of the spectrum is judged as smaller than the given threshold (step S119: NO), the procedure goes to step S121.
- the control part 2 judges whether the ratio of the memory contents value of “frame count 2” to the memory contents value of “frame count 1”, that is, the ratio of the number of frames where the bias of the spectrum is greater than or equal to the given threshold to the number of all frames where judgment of the bias of the spectrum has been performed, is greater than or equal to a given value (e.g., 0.8) (step S121).
- When it is judged as being greater than or equal to the given ratio (step S121: YES), the control part 2 returns the procedure to step S116.
- When it is judged as being smaller than the given ratio (step S121: NO), the control part 2 deletes the contents of “start frame number” (step S122), and then returns the procedure.
- That is, the control part 2 returns the procedure without detecting a non-speech section.
- When, at step S116, the memory contents value of “frame count 1” is judged as greater than or equal to the given value (step S116: YES), the control part 2 goes to the processing of detecting the end frame of the non-speech section, and then adds “1” to the memory contents of “frame count 1” (step S123).
- the control part 2 updates the pointer indicating the frame buffer, backward by one frame (step S 124 ). Then, the control part 2 judges whether the bias of the spectrum of the frame indicated by the present pointer is greater than or equal to the given threshold (step S 125 ).
- When the bias of the spectrum is judged as greater than or equal to the given threshold (step S125: YES), the control part 2 adds “1” to the memory contents of “frame count 2” (step S126). When the processing at step S126 is completed, or alternatively when the bias of the spectrum is judged as smaller than the given threshold (step S125: NO), the control part 2 goes to step S127. The control part 2 judges whether the ratio of the memory contents value of “frame count 2” to the memory contents value of “frame count 1” is greater than or equal to the given ratio (step S127).
- When it is judged as being greater than or equal to the given ratio (step S127: YES), the control part 2 returns the procedure to step S123. Further, when it is judged as being smaller than the given ratio (step S127: NO), the control part 2 stores the frame number of the frame preceding the frame indicated by the present pointer as the “end frame number” in the work memory 43 (step S128), and then returns the procedure.
- the section partitioned by the “start frame number” and the “end frame number” is recognized as a detected non-speech section.
- In Embodiment 4, when a section in which the fraction of frames where the bias of the spectrum calculated from the sound data of each frame is greater than or equal to a given threshold exceeds a given value continues for a number of frames greater than or equal to a given value, the section between the first frame where the bias of the spectrum becomes greater than or equal to the given threshold and the position immediately before the fraction of such frames falls below the given value is detected as a non-speech section.
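- A condensed sketch of steps S111 to S128; the ratio 0.8 is the example value given at step S121, while the threshold 0.7 and the 10-frame minimum length are carried over from Embodiment 1 as assumptions.

```python
THRESHOLD = 0.7   # example threshold on the bias of the spectrum (assumed)
RATIO = 0.8       # example fraction from step S121
MIN_FRAMES = 10   # assumed minimum section length in frames

def detect_by_fraction(bias, p):
    """Grow a section while the fraction of frames with bias >= THRESHOLD
    stays >= RATIO; bias[i] is |A| for frame i."""
    if p >= len(bias) or bias[p] < THRESHOLD:          # step S111
        return None
    start = p
    count1, count2 = 1, 1                              # steps S114-S115
    while True:
        p += 1
        if p >= len(bias):
            return (start, p - 1) if count1 >= MIN_FRAMES else None
        count1 += 1                                    # steps S117 / S123
        if bias[p] >= THRESHOLD:                       # steps S119 / S125
            count2 += 1
        if count2 / count1 < RATIO:                    # steps S121 / S127
            return None if count1 < MIN_FRAMES else (start, p - 1)
```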
- the to-be-detected head frame of a non-speech section is not limited to the first frame where the value becomes greater than or equal to the given threshold.
- a frame located forward relative to the above-mentioned first frame may be adopted as the head frame.
- Embodiment 5 is a mode in which, in Embodiment 1, a signal-to-noise ratio is calculated, and the given threshold concerning the bias of the spectrum is then changed in accordance with the calculated signal-to-noise ratio.
- FIG. 16 is a flow chart illustrating an example of speech recognition processing performed by a control part 2 of the speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 5.
- the processing at steps S 131 to S 135 is similar to that at steps S 11 to S 15 in FIG. 3 . Thus, description is not repeated.
- the following processing is performed on the bias of the spectrum generated in the processing at steps S 131 to S 135 and then written into the frame buffer 42 .
- the non-speech section detecting part 22 calls the subroutine of detecting a non-speech section (step S 136 ). After that, on the basis of the sound data of the frames detected as a non-speech section and the sound data of the frames other than the non-speech section, the control part 2 calculates the signal-to-noise ratio (step S 137 ). Then, in accordance with the high/low of the calculated signal-to-noise ratio, the control part 2 decreases/increases the given threshold (step S 138 ).
- the speech section judging part 23 recognizes as a speech section the section not detected as a non-speech section by the non-speech section detecting part 22 .
- the speech section judging part 23 confirms the speech section start frame and the speech section end frame, and then terminates the judgment of one speech section (step S 139 ).
- the speech section detected as described here is provided to the speech recognition part 24 via the frame buffer.
- the speech recognition part 24 executes speech recognition processing up to the end of the input data in the frame buffer 42 (step S 140 ).
- In Embodiment 5, the signal-to-noise ratio is calculated on the basis of the sound data of the frames detected as a non-speech section and the sound data of the frames other than the non-speech section. Then, in accordance with whether the calculated signal-to-noise ratio is high or low, the given threshold concerning the bias of the spectrum is decreased or increased.
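- A minimal sketch of this adaptation; the decibel breakpoints and the adjusted threshold values are illustrative assumptions, since the text only states that the threshold is decreased when the signal-to-noise ratio is high and increased when it is low.

```python
import numpy as np

def adapt_threshold(speech_power: float, noise_power: float,
                    base: float = 0.7, low: float = 0.6, high: float = 0.8) -> float:
    # SNR from the power of non-speech frames vs. the remaining frames.
    snr_db = 10.0 * np.log10(speech_power / max(noise_power, 1e-12))
    if snr_db >= 20.0:   # high SNR: decrease the bias threshold (assumed 20 dB)
        return low
    if snr_db <= 5.0:    # low SNR: increase the bias threshold (assumed 5 dB)
        return high
    return base
```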
- Embodiment 6 is a mode in which, in addition to the processing of Embodiment 1, the maximum of the intensity values of the frequency components of the pitch (hereinafter referred to as the pitch intensity) is calculated for each frame, and the given threshold concerning the bias of the spectrum is changed in accordance with the calculated pitch intensity.
- FIGS. 17A and 17B are flow charts illustrating the processing procedure performed by the control part 2 in association with the subroutine of non-speech section detection in the speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 6.
- The control part 2 calculates the pitch intensity of the frame indicated by the present pointer (step S151), and decreases or increases the given threshold according to whether the calculated pitch intensity is high or low (step S152). After that, the control part 2 judges whether the bias of the spectrum of the frame is greater than or equal to the given threshold (step S153). When it is judged as being smaller than the given threshold (step S153: NO), the control part 2 updates the pointer indicating the frame buffer 42 stored in the work memory 43 backward by one frame (step S154), and then returns the procedure.
- That is, the control part 2 returns the procedure without detecting a non-speech section.
- When it is judged at step S153 as being greater than or equal to the given threshold (step S153: YES), the control part 2 stores the frame number of the frame indicated by the present pointer as a “start frame number” in the work memory 43 (step S155). Then, the control part 2 initializes the stored value of the “frame count” provided in the work memory 43 to “1” (step S156).
- The “frame count” is used to count the number of frames for which the comparison between the bias of the spectrum and the given threshold has been performed.
- The control part 2 judges whether the stored value of the “frame count” is greater than or equal to a given value (step S157). When it is judged as being smaller than the given value (step S157: NO), the control part 2 adds “1” to the “frame count” (step S158) and updates the pointer indicating the frame buffer 42 backward by one frame (step S159). After that, the control part 2 calculates the pitch intensity of the frame indicated by the present pointer (step S160), and changes the given threshold on the basis of the calculated pitch intensity (step S161).
- The control part 2 then judges whether the bias of the spectrum is greater than or equal to the given threshold (step S162). When it is judged as being smaller than the given threshold (step S162: NO), the control part 2 returns the procedure without detecting a non-speech section.
- When the stored value of the “frame count” is judged at step S157 as being greater than or equal to the given value (step S157: YES), the control part 2 goes to the processing of detecting the end frame of the non-speech section, and updates the pointer indicating the frame buffer 42 backward by one frame (step S164).
- The control part 2 calculates the pitch intensity of the frame indicated by the present pointer (step S165), and changes the given threshold on the basis of the calculated pitch intensity (step S166). The control part 2 then judges whether the bias of the spectrum of the frame is greater than or equal to the given threshold (step S167). When it is judged as being greater than or equal to the given threshold (step S167: YES), the control part 2 returns the procedure to step S164. When it is judged as being smaller than the given threshold (step S167: NO), the control part 2 stores the frame number of the frame preceding the frame indicated by the present pointer as the “end frame number” in the work memory 43 (step S168), and then returns the procedure.
- The section partitioned by the “start frame number” and the “end frame number” is recognized as a detected non-speech section.
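- The subroutine of FIGS. 17A and 17B can be condensed into the sketch below. Here `bias_fn` and `pitch_fn` stand for the document's bias-of-spectrum and pitch-intensity calculations (a candidate `pitch_fn` is sketched after Formula 9 below); the linear adaptation rule and the constants `base_thr`, `k`, and `min_frames` are assumptions.

```python
def detect_non_speech_pitch_adaptive(frames, bias_fn, pitch_fn,
                                     base_thr=0.5, k=0.2, min_frames=30):
    """Illustrative sketch of the Embodiment 6 subroutine: before every
    comparison the threshold is re-derived from the frame's pitch intensity
    (decreased when the intensity is high, increased when it is low, as the
    text describes). Returns (start_frame, end_frame) or None."""
    start = None
    for i, frame in enumerate(frames):
        thr = base_thr - k * pitch_fn(frame)     # steps S151-S152, S160-S161, S165-S166
        if bias_fn(frame) >= thr:                # steps S153, S162, S167
            if start is None:
                start = i                        # "start frame number" (step S155)
        else:
            if start is not None and i - start >= min_frames:
                return start, i - 1              # "end frame number" (step S168)
            start = None                         # count too small: no detection
    return None
```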
- The pitch intensity mentioned at steps S151, S160, and S165 with reference to FIGS. 17A and 17B is described below in further detail.
- The pitch intensity B is calculated from the autocorrelation function Y(τ) of the short-time spectrum S(ω) in accordance with the following Formula 9.
- B = argmax Y(τ) for 1 ≤ τ ≤ τmax (Formula 9)
- Here, τmax is a value corresponding to the expected maximum pitch frequency.
- The short-time spectrum of 0 to 4000 Hz is expressed by a 129-dimensional vector.
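- As a concrete reading of Formula 9, the following sketch computes the autocorrelation of a frame's short-time spectrum along the frequency axis and takes its peak in the lag range 1 ≤ τ ≤ τmax; the value of `tau_max`, the mean removal, and the normalization by Y(0) are assumptions.

```python
import numpy as np

def pitch_intensity(spectrum, tau_max=40):
    """Illustrative sketch of the pitch intensity B of Formula 9. `spectrum`
    is one frame's short-time spectrum S(omega), e.g. a 129-dimensional vector
    covering 0 to 4000 Hz; Y(tau) is its autocorrelation along the frequency
    axis. A clear harmonic (pitch) structure makes Y peak strongly for some
    lag in 1 <= tau <= tau_max."""
    s = np.asarray(spectrum, dtype=float)
    s = s - s.mean()                                   # remove the mean so Y measures structure
    y = np.correlate(s, s, mode="full")[len(s) - 1:]   # Y(0), Y(1), Y(2), ...
    y = y / max(y[0], 1e-12)                           # normalize by Y(0)
    # Formula 9 as printed takes the argmax (the peak lag); this sketch returns
    # the peak value itself, which is what is compared as "high/low" intensity.
    return float(y[1:tau_max + 1].max())
```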
- The pitch intensity is calculated for the sound data of each frame, and the given threshold concerning the bias of the spectrum is decreased when the calculated pitch intensity is high and increased when it is low.
- When the pitch intensity is high, that is, when the pitch is clear, the sound data is expected to be a vowel or a semivowel of speech.
- In that case, the range of values that the bias of the spectrum can take is limited.
- Accordingly, the following judgment (h) may be added.
- Embodiment 7 is a mode in which, in Embodiment 1, the given threshold concerning the bias of the spectrum is determined on the basis of learning performed in advance.
- FIG. 18 is a flow chart illustrating an example of speech recognition processing performed by the control part 2 of the speech recognition device 1 serving as an implementation example of a non-speech section detecting device according to Embodiment 7.
- The processing at steps S171 to S174 is similar to that at steps S11 to S14 in FIG. 3, and hence its description is not repeated. The following processing is performed on each frame generated in the processing at steps S171 to S174.
- For the frames provided via the frame buffer 42, the control part 2 marks an utterance section in the sound data (step S175). At that time, the marking of an utterance section is achieved easily because phoneme labeling has already been performed on the voice data for learning. Further, the control part 2 sets up N threshold values within the range [−1, 1] taken by the value of the bias of the spectrum.
- The control part 2 judges whether the aggregation has been completed for all N threshold values (step S178). When the aggregation is judged as not yet completed (step S178: NO), the control part 2 returns the procedure to step S177. When the aggregation is judged as completed for all N threshold values (step S178: YES), the control part 2 determines the given threshold concerning the bias of the spectrum on the basis of the result of the aggregation (step S179).
- At that time, the given threshold is set somewhat larger (or smaller) so that erroneous detection of a non-speech section is suppressed.
- In Embodiment 7, a plurality of threshold candidates are prepared in advance for utterance sections that have been marked in existing voice data.
- An optimum value for the given threshold concerning the bias of the spectrum is then determined from among the plurality of threshold candidates.
- In this way, a non-speech section can be detected accurately.
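- A sketch of this offline selection is given below. The error count (speech frames wrongly flagged plus non-speech frames missed), the candidate grid of N values over [−1, 1], and the safety `margin` are assumptions made for illustration.

```python
import numpy as np

def learn_threshold(bias_values, is_speech, n_candidates=64, margin=0.02):
    """Illustrative sketch of Embodiment 7: evaluate N candidate thresholds
    over the range [-1, 1] against phoneme-labeled learning data and keep the
    one with the fewest aggregated errors, shifted slightly so that erroneous
    detection of non-speech inside utterances is suppressed."""
    bias_values = np.asarray(bias_values, dtype=float)
    is_speech = np.asarray(is_speech, dtype=bool)
    candidates = np.linspace(-1.0, 1.0, n_candidates)
    errors = [np.count_nonzero(bias_values[is_speech] >= thr)    # speech flagged as non-speech
              + np.count_nonzero(bias_values[~is_speech] < thr)  # non-speech missed
              for thr in candidates]
    return float(candidates[int(np.argmin(errors))]) + margin
```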
- Embodiments 1 to 7 have been described for the case where the absolute value of the high-frequency/low-frequency intensity A is adopted as the bias of the spectrum. Alternatively, the high-frequency/low-frequency intensity A itself may be adopted as the bias of the spectrum. In that case, when the bias of the spectrum is positive (or negative), it may be judged whether the bias is greater than or equal to a given positive threshold (or smaller than or equal to a given negative threshold).
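- For concreteness, a sketch of the signed variant follows. It takes A = c(1)/c(0), with A = 0 when c(0) = 0, matching the formula excerpts reproduced under Description below; c(·) is assumed here to be the frame's short-time autocorrelation, and the threshold values are illustrative.

```python
import numpy as np

def hf_lf_intensity(frame):
    """High-frequency/low-frequency intensity A = c(1)/c(0), with A = 0 when
    c(0) = 0. The sign of A indicates whether low or high frequencies
    dominate; its value lies in [-1, 1]."""
    x = np.asarray(frame, dtype=float)
    c0 = float(np.dot(x, x))                  # c(0): zero-lag autocorrelation
    if c0 == 0.0:
        return 0.0
    return float(np.dot(x[:-1], x[1:])) / c0  # c(1)/c(0)

def bias_exceeds_signed_threshold(frame, pos_thr=0.6, neg_thr=-0.6):
    """Signed judgment described above: a positive bias is compared against a
    given positive threshold, a negative bias against a given negative one
    (threshold values are assumptions)."""
    a = hf_lf_intensity(frame)
    return a >= pos_thr if a >= 0.0 else a <= neg_thr
```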
Description (formula excerpts)
- A = c(1)/c(0) (c(0) ≠ 0); A = 0 (c(0) = 0)
- C(t) = |A(t) − A(t−1)| for t > 1; C(t) = 0 for t = 1
- B = argmax Y(τ) for 1 ≤ τ ≤ τmax (Formula 9)
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/675,317 US8798991B2 (en) | 2007-12-18 | 2012-11-13 | Non-speech section detecting method and non-speech section detecting device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2007/074274 WO2009078093A1 (en) | 2007-12-18 | 2007-12-18 | Non-speech section detecting method and non-speech section detecting device |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/074274 Continuation WO2009078093A1 (en) | 2007-12-18 | 2007-12-18 | Non-speech section detecting method and non-speech section detecting device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/675,317 Division US8798991B2 (en) | 2007-12-18 | 2012-11-13 | Non-speech section detecting method and non-speech section detecting device |
Publications (2)
Publication Number | Publication Date |
---|---|
US20100191524A1 US20100191524A1 (en) | 2010-07-29 |
US8326612B2 true US8326612B2 (en) | 2012-12-04 |
Family
ID=40795219
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/754,156 Expired - Fee Related US8326612B2 (en) | 2007-12-18 | 2010-04-05 | Non-speech section detecting method and non-speech section detecting device |
US13/675,317 Active US8798991B2 (en) | 2007-12-18 | 2012-11-13 | Non-speech section detecting method and non-speech section detecting device |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/675,317 Active US8798991B2 (en) | 2007-12-18 | 2012-11-13 | Non-speech section detecting method and non-speech section detecting device |
Country Status (3)
Country | Link |
---|---|
US (2) | US8326612B2 (en) |
JP (1) | JP5229234B2 (en) |
WO (1) | WO2009078093A1 (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
PT2491559E (en) | 2009-10-19 | 2015-05-07 | Ericsson Telefon Ab L M | Method and background estimator for voice activity detection |
US8990074B2 (en) * | 2011-05-24 | 2015-03-24 | Qualcomm Incorporated | Noise-robust speech coding mode classification |
JP5810912B2 (en) | 2011-12-28 | 2015-11-11 | 富士通株式会社 | Speech recognition apparatus, speech recognition method, and speech recognition program |
WO2013164029A1 (en) * | 2012-05-03 | 2013-11-07 | Telefonaktiebolaget L M Ericsson (Publ) | Detecting wind noise in an audio signal |
US9269355B1 (en) * | 2013-03-14 | 2016-02-23 | Amazon Technologies, Inc. | Load balancing for automatic speech recognition |
US9275136B1 (en) * | 2013-12-03 | 2016-03-01 | Google Inc. | Method for siren detection based on audio samples |
US10229686B2 (en) * | 2014-08-18 | 2019-03-12 | Nuance Communications, Inc. | Methods and apparatus for speech segmentation using multiple metadata |
CN107004405A (en) * | 2014-12-18 | 2017-08-01 | 三菱电机株式会社 | Speech recognition equipment and audio recognition method |
US10134425B1 (en) * | 2015-06-29 | 2018-11-20 | Amazon Technologies, Inc. | Direction-based speech endpointing |
US10121471B2 (en) * | 2015-06-29 | 2018-11-06 | Amazon Technologies, Inc. | Language model speech endpointing |
US10854192B1 (en) * | 2016-03-30 | 2020-12-01 | Amazon Technologies, Inc. | Domain specific endpointing |
CN107305774B (en) | 2016-04-22 | 2020-11-03 | 腾讯科技(深圳)有限公司 | Voice detection method and device |
US10878814B2 (en) * | 2016-07-22 | 2020-12-29 | Sony Corporation | Information processing apparatus, information processing method, and program |
US10431236B2 (en) * | 2016-11-15 | 2019-10-01 | Sphero, Inc. | Dynamic pitch adjustment of inbound audio to improve speech recognition |
CN109935240A (en) * | 2017-12-18 | 2019-06-25 | 上海智臻智能网络科技股份有限公司 | Pass through the method for speech recognition mood |
CN109935241A (en) * | 2017-12-18 | 2019-06-25 | 上海智臻智能网络科技股份有限公司 | Voice information processing method |
CN109961803A (en) * | 2017-12-18 | 2019-07-02 | 上海智臻智能网络科技股份有限公司 | Voice mood identifying system |
US11276390B2 (en) * | 2018-03-22 | 2022-03-15 | Casio Computer Co., Ltd. | Audio interval detection apparatus, method, and recording medium to eliminate a specified interval that does not represent speech based on a divided phoneme |
JP7222265B2 (en) * | 2018-03-22 | 2023-02-15 | カシオ計算機株式会社 | VOICE SECTION DETECTION DEVICE, VOICE SECTION DETECTION METHOD AND PROGRAM |
CN109087632B (en) * | 2018-08-17 | 2023-06-06 | 平安科技(深圳)有限公司 | Speech processing method, device, computer equipment and storage medium |
TR201917042A2 (en) * | 2019-11-04 | 2021-05-21 | Cankaya Ueniversitesi | Signal energy calculation with a new method and speech signal encoder obtained by this method. |
EP4060662A4 (en) * | 2019-12-13 | 2023-03-08 | Mitsubishi Electric Corporation | Information processing device, detection method, and detection program |
CN112420079B (en) * | 2020-11-18 | 2022-12-06 | 青岛海尔科技有限公司 | Voice endpoint detection method and device, storage medium and electronic equipment |
FI20225762A1 (en) * | 2022-08-31 | 2024-03-01 | Elisa Oyj | Computer-implemented method for detecting activity in an audio stream |
Family Cites Families (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4074069A (en) * | 1975-06-18 | 1978-02-14 | Nippon Telegraph & Telephone Public Corporation | Method and apparatus for judging voiced and unvoiced conditions of speech signal |
US4008375A (en) * | 1975-08-21 | 1977-02-15 | Communications Satellite Corporation (Comsat) | Digital voice switch for single or multiple channel applications |
FR2466825A1 (en) * | 1979-09-28 | 1981-04-10 | Thomson Csf | DEVICE FOR DETECTING VOICE SIGNALS AND ALTERNAT SYSTEM COMPRISING SUCH A DEVICE |
US4375083A (en) * | 1980-01-31 | 1983-02-22 | Bell Telephone Laboratories, Incorporated | Signal sequence editing method and apparatus with automatic time fitting of edited segments |
US4624008A (en) * | 1983-03-09 | 1986-11-18 | International Telephone And Telegraph Corporation | Apparatus for automatic speech recognition |
US4696039A (en) * | 1983-10-13 | 1987-09-22 | Texas Instruments Incorporated | Speech analysis/synthesis system with silence suppression |
US4879748A (en) * | 1985-08-28 | 1989-11-07 | American Telephone And Telegraph Company | Parallel processing pitch detector |
US4797929A (en) * | 1986-01-03 | 1989-01-10 | Motorola, Inc. | Word recognition in a speech recognition system using data reduced word templates |
US4802221A (en) * | 1986-07-21 | 1989-01-31 | Ncr Corporation | Digital system and method for compressing speech signals for storage and transmission |
US4771465A (en) * | 1986-09-11 | 1988-09-13 | American Telephone And Telegraph Company, At&T Bell Laboratories | Digital speech sinusoidal vocoder with transmission of only subset of harmonics |
US5365592A (en) * | 1990-07-19 | 1994-11-15 | Hughes Aircraft Company | Digital voice detection apparatus and method using transform domain processing |
US5216747A (en) * | 1990-09-20 | 1993-06-01 | Digital Voice Systems, Inc. | Voiced/unvoiced estimation of an acoustic signal |
US5226108A (en) * | 1990-09-20 | 1993-07-06 | Digital Voice Systems, Inc. | Processing a speech signal with estimated pitch |
JP3343965B2 (en) * | 1992-10-31 | 2002-11-11 | ソニー株式会社 | Voice encoding method and decoding method |
US5450484A (en) * | 1993-03-01 | 1995-09-12 | Dialogic Corporation | Voice detection |
IT1270438B (en) * | 1993-06-10 | 1997-05-05 | Sip | PROCEDURE AND DEVICE FOR THE DETERMINATION OF THE FUNDAMENTAL TONE PERIOD AND THE CLASSIFICATION OF THE VOICE SIGNAL IN NUMERICAL CODERS OF THE VOICE |
JPH07212296A (en) * | 1994-01-17 | 1995-08-11 | Japan Radio Co Ltd | VOX control communication device |
US5682463A (en) * | 1995-02-06 | 1997-10-28 | Lucent Technologies Inc. | Perceptual audio compression based on loudness uncertainty |
US6006175A (en) * | 1996-02-06 | 1999-12-21 | The Regents Of The University Of California | Methods and apparatus for non-acoustic speech characterization and recognition |
JP4307557B2 (en) * | 1996-07-03 | 2009-08-05 | ブリティッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー | Voice activity detector |
US6202046B1 (en) * | 1997-01-23 | 2001-03-13 | Kabushiki Kaisha Toshiba | Background noise/speech classification method |
JPH10257583A (en) * | 1997-03-06 | 1998-09-25 | Asahi Chem Ind Co Ltd | Voice processing unit and its voice processing method |
WO1999010719A1 (en) * | 1997-08-29 | 1999-03-04 | The Regents Of The University Of California | Method and apparatus for hybrid coding of speech at 4kbps |
US6556967B1 (en) * | 1999-03-12 | 2003-04-29 | The United States Of America As Represented By The National Security Agency | Voice activity detector |
US6246978B1 (en) * | 1999-05-18 | 2001-06-12 | Mci Worldcom, Inc. | Method and system for measurement of speech distortion from samples of telephonic voice signals |
US6959274B1 (en) * | 1999-09-22 | 2005-10-25 | Mindspeed Technologies, Inc. | Fixed rate speech compression system and method |
US6757301B1 (en) * | 2000-03-14 | 2004-06-29 | Cisco Technology, Inc. | Detection of ending of fax/modem communication between a telephone line and a network for switching router to compressed mode |
WO2001078062A1 (en) * | 2000-04-06 | 2001-10-18 | Telefonaktiebolaget Lm Ericsson (Publ) | Pitch estimation in speech signal |
US6587816B1 (en) * | 2000-07-14 | 2003-07-01 | International Business Machines Corporation | Fast frequency-domain pitch estimation |
US6694293B2 (en) * | 2001-02-13 | 2004-02-17 | Mindspeed Technologies, Inc. | Speech coding system with a music classifier |
US20030028386A1 (en) * | 2001-04-02 | 2003-02-06 | Zinser Richard L. | Compressed domain universal transcoder |
US6721699B2 (en) * | 2001-11-12 | 2004-04-13 | Intel Corporation | Method and system of Chinese speech pitch extraction |
CA2365203A1 (en) * | 2001-12-14 | 2003-06-14 | Voiceage Corporation | A signal modification method for efficient coding of speech signals |
US7613606B2 (en) * | 2003-10-02 | 2009-11-03 | Nokia Corporation | Speech codecs |
US7643993B2 (en) * | 2006-01-05 | 2010-01-05 | Broadcom Corporation | Method and system for decoding WCDMA AMR speech data using redundancy |
US20060262851A1 (en) * | 2005-05-19 | 2006-11-23 | Celtro Ltd. | Method and system for efficient transmission of communication traffic |
US8019615B2 (en) * | 2005-07-26 | 2011-09-13 | Broadcom Corporation | Method and system for decoding GSM speech data using redundancy |
US8135047B2 (en) * | 2006-07-31 | 2012-03-13 | Qualcomm Incorporated | Systems and methods for including an identifier with a packet associated with a speech signal |
US8015000B2 (en) * | 2006-08-03 | 2011-09-06 | Broadcom Corporation | Classification-based frame loss concealment for audio signals |
US8275611B2 (en) * | 2007-01-18 | 2012-09-25 | Stmicroelectronics Asia Pacific Pte., Ltd. | Adaptive noise suppression for digital speech signals |
- 2007-12-18: WO PCT/JP2007/074274 patent/WO2009078093A1/en active Application Filing
- 2007-12-18: JP JP2009546107A patent/JP5229234B2/en not_active Expired - Fee Related
- 2010-04-05: US US12/754,156 patent/US8326612B2/en not_active Expired - Fee Related
- 2012-11-13: US US13/675,317 patent/US8798991B2/en active Active
Patent Citations (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63291096A (en) | 1987-05-23 | 1988-11-28 | 日本電気株式会社 | Voice section detecting system |
US5778338A (en) * | 1991-06-11 | 1998-07-07 | Qualcomm Incorporated | Variable rate vocoder |
US5414796A (en) * | 1991-06-11 | 1995-05-09 | Qualcomm Incorporated | Variable rate vocoder |
JPH0683391A (en) | 1992-09-04 | 1994-03-25 | Matsushita Electric Ind Co Ltd | Vocalized speech detecting device for television conference |
JPH0713584A (en) | 1992-10-05 | 1995-01-17 | Matsushita Electric Ind Co Ltd | Voice detector |
US5617508A (en) | 1992-10-05 | 1997-04-01 | Panasonic Technologies Inc. | Speech detection device for the detection of speech end points based on variance of frequency band limited energy |
US5664059A (en) * | 1993-04-29 | 1997-09-02 | Panasonic Technologies, Inc. | Self-learning speaker adaptation based on spectral variation source decomposition |
US5794192A (en) * | 1993-04-29 | 1998-08-11 | Panasonic Technologies, Inc. | Self-learning speaker adaptation based on spectral bias source decomposition, using very short calibration speech |
JPH0792989A (en) | 1993-09-22 | 1995-04-07 | Oki Electric Ind Co Ltd | Speech recognizing method |
US5515432A (en) * | 1993-11-24 | 1996-05-07 | Ericsson Inc. | Method and apparatus for volume and intelligibility control for a loudspeaker |
JPH07191696A (en) | 1993-12-27 | 1995-07-28 | Ricoh Co Ltd | Speech recognition device |
US5590242A (en) * | 1994-03-24 | 1996-12-31 | Lucent Technologies Inc. | Signal bias removal for robust telephone speech recognition |
US6014620A (en) * | 1995-06-21 | 2000-01-11 | Telefonaktiebolaget Lm Ericsson | Power spectral density estimation method and apparatus using LPC analysis |
US5937375A (en) | 1995-11-30 | 1999-08-10 | Denso Corporation | Voice-presence/absence discriminator having highly reliable lead portion detection |
JPH09152894A (en) | 1995-11-30 | 1997-06-10 | Denso Corp | Sound and silence discriminator |
US5765124A (en) * | 1995-12-29 | 1998-06-09 | Lucent Technologies Inc. | Time-varying feature space preprocessing procedure for telephone based speech recognition |
JPH1097269A (en) | 1996-09-20 | 1998-04-14 | Nippon Telegr & Teleph Corp <Ntt> | Voice detection device and method |
US6073092A (en) * | 1997-06-26 | 2000-06-06 | Telogy Networks, Inc. | Method for speech coding based on a code excited linear prediction (CELP) model |
US6456697B1 (en) * | 1998-09-23 | 2002-09-24 | Industrial Technology Research Institute | Device and method of channel effect compensation for telephone speech recognition |
US7106839B2 (en) * | 2000-01-12 | 2006-09-12 | Multi-Tech Systems, Inc. | System for providing analog and digital telephone functions using a single telephone line |
JP2001236085A (en) | 2000-02-25 | 2001-08-31 | Matsushita Electric Ind Co Ltd | Voice section detector, stationary noise section detector, non-stationary noise section detector, and noise section detector |
US20020007270A1 (en) | 2000-06-02 | 2002-01-17 | Nec Corporation | Voice detecting method and apparatus, and medium thereof |
JP2001350488A (en) | 2000-06-02 | 2001-12-21 | Nec Corp | Voice detection method and apparatus and recording medium thereof |
EP1160763A2 (en) | 2000-06-02 | 2001-12-05 | Nec Corporation | Voice detecting method and apparatus |
US20060271363A1 (en) | 2000-06-02 | 2006-11-30 | Nec Corporation | Voice detecting method and apparatus using a long-time average of the time variation of speech features, and medium thereof |
US7062433B2 (en) * | 2001-03-14 | 2006-06-13 | Texas Instruments Incorporated | Method of speech recognition with compensation for both channel distortion and background noise |
US20030115055A1 (en) * | 2001-12-12 | 2003-06-19 | Yifan Gong | Method of speech recognition resistant to convolutive distortion and additive distortion |
JP2005156887A (en) | 2003-11-25 | 2005-06-16 | Matsushita Electric Works Ltd | Voice interval detector |
JP2006209069A (en) | 2004-12-28 | 2006-08-10 | Advanced Telecommunication Research Institute International | Voice segment detection device and voice segment detection program |
US20070168189A1 (en) * | 2006-01-19 | 2007-07-19 | Kabushiki Kaisha Toshiba | Apparatus and method of processing speech |
JP2007233267A (en) | 2006-03-03 | 2007-09-13 | National Institute Of Advanced Industrial & Technology | Discrimination apparatus and method for audio signal and non-audio signal |
US20080201137A1 (en) * | 2007-02-20 | 2008-08-21 | Koen Vos | Method of estimating noise levels in a communication system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10546576B2 (en) * | 2014-04-23 | 2020-01-28 | Google Llc | Speech endpointing based on word comparisons |
US11004441B2 (en) | 2014-04-23 | 2021-05-11 | Google Llc | Speech endpointing based on word comparisons |
US11636846B2 (en) | 2014-04-23 | 2023-04-25 | Google Llc | Speech endpointing based on word comparisons |
US12051402B2 (en) | 2014-04-23 | 2024-07-30 | Google Llc | Speech endpointing based on word comparisons |
Also Published As
Publication number | Publication date |
---|---|
JPWO2009078093A1 (en) | 2011-04-28 |
US8798991B2 (en) | 2014-08-05 |
US20130073281A1 (en) | 2013-03-21 |
JP5229234B2 (en) | 2013-07-03 |
US20100191524A1 (en) | 2010-07-29 |
WO2009078093A1 (en) | 2009-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8326612B2 (en) | Non-speech section detecting method and non-speech section detecting device | |
US8768692B2 (en) | Speech recognition method, speech recognition apparatus and computer program | |
US9536525B2 (en) | Speaker indexing device and speaker indexing method | |
KR100870889B1 (en) | Sound signal processing method, sound signal processing apparatus and recording medium | |
US8311813B2 (en) | Voice activity detection system and method | |
JP4497834B2 (en) | Speech recognition apparatus, speech recognition method, speech recognition program, and information recording medium | |
US7647224B2 (en) | Apparatus, method, and computer program product for speech recognition | |
US20110077943A1 (en) | System for generating language model, method of generating language model, and program for language model generation | |
US10755731B2 (en) | Apparatus, method, and non-transitory computer-readable storage medium for storing program for utterance section detection | |
KR100724736B1 (en) | Pitch detection method and pitch detection apparatus using spectral auto-correlation value | |
US8942977B2 (en) | System and method for speech recognition using pitch-synchronous spectral parameters | |
KR101236539B1 (en) | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization | |
JP5867199B2 (en) | Noise estimation device, noise estimation method, and computer program for noise estimation | |
JP4325044B2 (en) | Speech recognition system | |
JP4571871B2 (en) | Speech signal analysis method and apparatus for performing the analysis method, speech recognition apparatus using the speech signal analysis apparatus, program for executing the analysis method, and storage medium thereof | |
Wang et al. | Improved Mandarin speech recognition by lattice rescoring with enhanced tone models | |
JP4655184B2 (en) | Voice recognition apparatus and method, recording medium, and program | |
JPH11327593A (en) | Voice recognition system | |
JP2009025388A (en) | Voice recognition device | |
Anguita et al. | Jacobian adaptation with improved noise reference for speaker verification. |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: FUJITSU LIMITED, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WASHIO, NOBUYUKI;HAYAKAWA, SHOJI;REEL/FRAME:024227/0618. Effective date: 20100324 |
| ZAAA | Notice of allowance and fees due | Free format text: ORIGINAL CODE: NOA |
| ZAAB | Notice of allowance mailed | Free format text: ORIGINAL CODE: MN/=. |
| FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| FPAY | Fee payment | Year of fee payment: 4 |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Lapsed due to failure to pay maintenance fee | Effective date: 20241204 |