
CN120148484B - Speech recognition method and device based on microcomputer - Google Patents


Info

Publication number
CN120148484B
CN120148484B (application CN202510635399.9A)
Authority
CN
China
Prior art keywords
phoneme sequence
frequency
processed
voice
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510635399.9A
Other languages
Chinese (zh)
Other versions
CN120148484A (en)
Inventor
张平
陈家盛
李欢
段成钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Madigi Electronic Technology Co ltd
Original Assignee
Shenzhen Madigi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Madigi Electronic Technology Co ltd filed Critical Shenzhen Madigi Electronic Technology Co ltd
Priority to CN202510635399.9A priority Critical patent/CN120148484B/en
Publication of CN120148484A publication Critical patent/CN120148484A/en
Application granted granted Critical
Publication of CN120148484B publication Critical patent/CN120148484B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of voice processing and provides a voice recognition method and device based on a microcomputer. The method comprises: obtaining a voice signal to be processed and extracting layered time domain features from it; extracting layered frequency domain features from the voice signal to be processed; when a first phoneme sequence is identical to a second phoneme sequence, inputting either sequence into a language model to obtain the voice recognition result output by the language model; and when the first phoneme sequence differs from the second phoneme sequence, processing the voice signal to be processed with a multi-order cascade filter to obtain spectral coefficients and recognizing the voice recognition result from those coefficients. By combining layered feature extraction with a multi-order cascade filter, the invention markedly improves the performance of the voice recognition system under low signal-to-noise ratio and heavy background noise, and effectively overcomes the limitations of traditional voice recognition technology in complex environments.

Description

Speech recognition method and device based on microcomputer
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a voice recognition method and device based on a microcomputer.
Background
The microcomputer is widely used in embedded systems as a compact, low-power computing platform with considerable computing capability. Compared with a traditional computer, a microcomputer is small and inexpensive, making it suitable for scenarios requiring portable, real-time processing, especially in portable equipment. Therefore, how to exploit these advantages of the microcomputer to improve the processing capability and accuracy of speech recognition technology has become an important research direction.
Most traditional voice recognition methods rely on fixed acoustic models and language models. These methods neglect, to some extent, the hierarchical characteristics of the signal and therefore suffer from low recognition accuracy in complex environments. In particular, when background noise is heavy or signal quality is poor, the performance of speech recognition systems tends to degrade significantly.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method and apparatus for voice recognition based on a microcomputer, so as to solve the technical problem that the performance of a voice recognition system tends to be significantly reduced when the background noise is large or the signal quality is low.
A first aspect of an embodiment of the present invention provides a microcomputer-based voice recognition method, including:
Acquiring a voice signal to be processed, and extracting layered time domain characteristics in the voice signal to be processed;
extracting layered frequency domain characteristics in the voice signal to be processed;
Inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
when the first phoneme sequence and the second phoneme sequence are the same, inputting the first phoneme sequence or the second phoneme sequence into a language model to obtain a voice recognition result output by the language model, wherein the voice recognition result is a word sequence;
when the first phoneme sequence is different from the second phoneme sequence, a multi-order cascade filter is adopted to process the voice signal to be processed to obtain a frequency spectrum coefficient, and the multi-order cascade filter is used for processing different frequency ranges of the signal, wherein the response of each cascade filter is the product of time domain impulse responses;
And recognizing a voice recognition result according to the frequency spectrum coefficient.
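The decision flow of the steps above can be sketched in Python. All model and filter components here are hypothetical stand-in callables supplied by the caller, not implementations disclosed by the patent:

```python
# Sketch of the claimed recognition flow; every component is injected,
# since the patent does not fix particular model implementations.
from typing import Callable, List, Sequence


def recognize(signal,
              time_features: Callable, freq_features: Callable,
              time_model: Callable[..., List[str]],
              freq_model: Callable[..., List[str]],
              language_model: Callable[[Sequence[str]], str],
              cascade_coeffs: Callable,
              coeff_recognizer: Callable) -> str:
    seq1 = time_model(time_features(signal))   # first phoneme sequence
    seq2 = freq_model(freq_features(signal))   # second phoneme sequence
    if seq1 == seq2:
        # Sequences agree: pass either one to the language model.
        return language_model(seq1)
    # Sequences disagree: derive spectral coefficients with the
    # multi-order cascade filter and recognize from those instead.
    coeffs = cascade_coeffs(signal)
    return coeff_recognizer(coeffs)
```

The point of the sketch is the branch structure: the language model is only consulted directly when the two acoustic models agree; otherwise the cascade-filter path produces the result.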
Further, the step of obtaining the voice signal to be processed and extracting the layered time domain features in the voice signal to be processed includes:
Collecting an original voice signal;
Carrying out denoising, pre-emphasis, framing and windowing on the original voice signals to obtain a plurality of voice signals to be processed;
Calculating short-time energy, short-time zero-crossing rate and pitch period of the voice signal to be processed;
Performing multi-resolution time domain analysis on the voice signal to be processed through wavelet transformation, and extracting time domain features under different time scales;
and taking the short-time energy, the short-time zero-crossing rate, the pitch period and the time domain characteristics under different time scales as the layered time domain characteristics.
Further, the step of extracting the layered frequency domain features in the voice signal to be processed includes:
Performing short-time Fourier transform on the voice signal to be processed to obtain frequency spectrum information;
Calculating a mel frequency cepstrum coefficient of the voice signal to be processed;
Carrying out multi-resolution frequency domain analysis on the frequency spectrum information through wavelet transformation, and extracting frequency domain features under different frequencies;
and taking the Mel frequency cepstrum coefficient and the frequency domain characteristics under different frequencies as the layered frequency domain characteristics.
Further, when the first phoneme sequence and the second phoneme sequence are different, the step of processing the voice signal to be processed by adopting a multi-order cascade filter to obtain a frequency spectrum coefficient includes:
Extracting difference phonemes in the first phoneme sequence and the second phoneme sequence, and matching to-be-processed voice signals corresponding to the difference phonemes, wherein the difference phonemes are different phonemes existing in the same sequence position in the first phoneme sequence and the second phoneme sequence;
and inputting the voice signal to be processed corresponding to the difference phonemes into a multi-order cascade filter to obtain the frequency spectrum coefficient.
Further, the step of inputting the to-be-processed voice signal corresponding to the difference phoneme into a multi-order cascade filter to obtain the spectrum coefficient includes:
inputting the voice signals to be processed corresponding to the difference phonemes into a multi-order cascade filter to obtain composite responses corresponding to a plurality of frequency bands;
Respectively calculating energy values of composite responses corresponding to the frequency bands;
carrying out logarithmic compression processing on energy values corresponding to the plurality of frequency bands to obtain logarithmic energy corresponding to the plurality of frequency bands;
And performing discrete cosine transform on the logarithmic energy to obtain a frequency spectrum coefficient.
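The energy, log-compression, and discrete-cosine-transform steps above can be illustrated with a minimal NumPy sketch. The cascade filter itself is abstracted away: the per-band composite responses are assumed to be given, and the coefficient count is an illustrative choice, not a value from the patent:

```python
import numpy as np


def spectral_coeffs_from_bands(band_responses, n_coeffs=12, eps=1e-10):
    """Given the composite response of each cascade-filter band,
    compute per-band energies, log-compress them, and apply a
    DCT-II to obtain spectral coefficients (illustrative sketch)."""
    # Energy value of each band's composite response.
    energies = np.array([np.sum(np.square(r)) for r in band_responses])
    # Logarithmic compression of the band energies.
    log_energy = np.log(energies + eps)
    # DCT-II over the log energies yields the spectral coefficients.
    n = len(log_energy)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct_basis @ log_energy
```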
Further, the step of recognizing the speech recognition result according to the spectral coefficient includes:
Inputting the frequency spectrum coefficient into a frequency domain acoustic model by combining the layered frequency domain characteristics to obtain a third phoneme sequence output by the frequency domain acoustic model;
If the third phoneme sequence is the same as the first phoneme sequence, inputting the first phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model;
and if the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model.
Further, if the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model, and obtaining a speech recognition result output by the language model includes:
If the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain an initial recognition result output by the language model;
and carrying out spelling correction and grammar checking on the initial recognition result to obtain the voice recognition result.
A second aspect of an embodiment of the present invention provides a microcomputer-based voice recognition apparatus, including:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal to be processed and extracting layered time domain characteristics in the voice signal to be processed;
the extraction unit is used for extracting the layered frequency domain characteristics in the voice signal to be processed;
The computing unit is used for inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
the first judging unit is used for inputting the first phoneme sequence or the second phoneme sequence into a language model when the first phoneme sequence and the second phoneme sequence are the same, so as to obtain a voice recognition result output by the language model;
The second judging unit is used for processing the voice signal to be processed by adopting a multi-order cascade filter to obtain a frequency spectrum coefficient when the first phoneme sequence and the second phoneme sequence are different, wherein the multi-order cascade filter is used for processing different frequency ranges of the signal, and the response of each cascade filter is the product of time domain impulse responses;
and the recognition unit is used for recognizing the voice recognition result according to the frequency spectrum coefficient.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the microcomputer-based speech recognition method of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the microcomputer-based speech recognition method of the first aspect described above.
Compared with the prior art, the invention has the following beneficial effects. By extracting layered time domain and frequency domain features from the voice signal to be processed, key speech information can be captured more reliably in a variety of noise environments. The time domain features preserve the timing characteristics of the speech signal, while the frequency domain features reveal its spectral characteristics. This dual feature extraction effectively reduces the interference of background noise on the voice signal and improves recognition performance in complex environments.
A time domain acoustic model and a frequency domain acoustic model process the time domain and frequency domain features separately and generate two phoneme sequences. Comparing the two sequences effectively reduces recognition deviation caused by low signal quality. When the phoneme sequences agree, the language model further refines the recognition result, improving both the accuracy and the reliability of voice recognition.
When the first phoneme sequence is inconsistent with the second phoneme sequence, the invention applies a multi-order cascade filter to further process the voice signal to be processed. Because the cascade filter processes signals in different frequency ranges separately, the spectral characteristics of the signal are represented more faithfully, improving recognition accuracy. The response of each cascade stage, formed as the product of time domain impulse responses, allows the system to suppress noise in specific frequency ranges in a targeted manner and achieve better recognition in complex noise environments.
Through the extraction of layered time domain and frequency domain features and the processing of the multi-order cascade filter, the method performs especially well in low signal-to-noise ratio environments. Whereas traditional voice recognition systems are prone to recognition errors or performance degradation when facing noise and signal distortion, the multi-level, comprehensive feature extraction and processing strategy of the invention markedly enhances the anti-interference capability of the system and improves recognition precision in harsh environments. In summary, by combining layered feature extraction with a multi-order cascade filter, the invention significantly improves the performance of the voice recognition system under low signal-to-noise ratio and heavy background noise, effectively overcoming the limitations of traditional voice recognition technology in complex environments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 shows a schematic flow chart of a microcomputer-based speech recognition method provided by the present invention;
FIG. 2 is a schematic diagram of a microcomputer-based speech recognition device according to an embodiment of the present invention;
fig. 3 shows a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The embodiment of the invention provides a voice recognition method and device based on a microcomputer, which are used for solving the technical problem that the performance of a voice recognition system is often obviously reduced under the condition of high background noise or low signal quality.
First, the present invention provides a microcomputer-based speech recognition method. Referring to fig. 1, fig. 1 is a schematic flow chart of a voice recognition method based on a microcomputer according to the present invention. As shown in fig. 1, the microcomputer-based voice recognition method may include the steps of:
Step 101, obtaining a voice signal to be processed, and extracting layered time domain features from the voice signal to be processed;
Time domain features are extracted directly from the time domain waveform of the audio signal. They reflect the temporal course of the speech signal and are particularly important for short-time speech recognition; when signal noise is high, preserving time domain information helps reduce noise interference. The specific extraction logic of the layered time domain features is as follows:
Specifically, step 101 specifically includes steps 1011 to 1015:
step 1011, collecting original voice signals;
The original speech signal to be processed is acquired by a microphone or other audio acquisition device. The original speech signal is typically a continuous analog audio signal containing the speech content of the speaker and may also include interfering components such as ambient noise.
Step 1012, denoising, pre-emphasis, framing and windowing are carried out on the original voice signals to obtain a plurality of voice signals to be processed;
Denoising: to remove background noise from the speech signal, a variety of denoising techniques (e.g., spectral subtraction, Wiener filtering) may be employed. The purpose of denoising is to improve signal quality so that subsequent feature extraction is more accurate.
Pre-emphasis: pre-emphasis compensates for the spectral tilt of the speech signal. The high frequency part of a speech signal is generally weaker, so high-pass filtering (i.e., pre-emphasis) boosts the high frequency components, balancing the spectrum and enhancing the clarity of the signal.
Framing: since the speech signal is non-stationary (i.e., it varies over time), it is divided into short time windows (typically 20 ms to 40 ms frames). The signal within each frame can be regarded as stationary, which facilitates further feature analysis.
Windowing: windowing reduces the boundary effects introduced by framing (the beginning and end of a frame may be discontinuous, causing spectral leakage). A window function (e.g., Hamming or Hanning window) is applied to each frame to smooth the boundaries and improve the accuracy of spectral analysis.
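The pre-emphasis, framing, and windowing steps can be sketched in a few lines of NumPy. The denoising stage is omitted, and the pre-emphasis coefficient, frame length, and hop length are common textbook defaults, not values specified by the patent:

```python
import numpy as np


def preprocess(x, fs=16000, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, framing, and Hamming windowing of a raw signal.
    Parameter values are typical choices, not taken from the patent."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: first-order high-pass y[n] = x[n] - alpha * x[n-1].
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing: split into overlapping short-time frames.
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(y) - flen) // hop)
    frames = np.stack([y[i * hop: i * hop + flen] for i in range(n_frames)])
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(flen)
```

For a 0.1 s signal at 16 kHz this yields 8 frames of 400 samples each, ready for the short-time analyses that follow.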
Step 1013, calculating short-time energy, short-time zero-crossing rate and pitch period of the voice signal to be processed;
Short-time energy is an index describing the intensity of the signal within a frame, computed as the sum of squares of the samples in that frame. Short-time energy reflects changes in signal strength and helps distinguish speech from silence and noise.
The short-time zero-crossing rate refers to the number of times the signal waveform crosses zero in a certain frame. It can reflect the periodicity and frequency characteristics of the signal, especially when processing noisy or reverberated speech signals, the short-time zero-crossing rate is a distinguishing feature of speech and noise. Signals with higher zero-crossing rates are typically associated with noise, while speech signals have lower and regular zero-crossing rates.
Pitch period refers to the duration of the periodic pitch portion of the speech signal, typically used to identify the vocal cord vibration period of the speech. In a speech signal, the fundamental frequency (F0) of speech is reflected in the pitch period, and is very important for characteristics such as pitch and intonation.
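The three frame-level measures above can be computed directly. The pitch estimate below uses autocorrelation, which is one common method; the patent does not fix a specific pitch algorithm, and the search range (50–400 Hz) is an illustrative assumption:

```python
import numpy as np


def short_time_energy(frame):
    # Sum of squared samples within the frame.
    return np.sum(frame ** 2)


def zero_crossing_rate(frame):
    # Fraction of sample-to-sample sign changes within the frame.
    return np.sum(np.abs(np.diff(np.sign(frame)))) / 2 / len(frame)


def pitch_period(frame, fs, fmin=50, fmax=400):
    """Autocorrelation-based pitch period estimate, in samples.
    Search range [fmin, fmax] Hz is an illustrative assumption."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + np.argmax(ac[lo:hi])
```

A 200 Hz tone sampled at 16 kHz, for example, should yield a pitch period of about 80 samples.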
Step 1014, performing multi-resolution time domain analysis on the voice signal to be processed through wavelet transformation, and extracting time domain features under different time scales;
Wavelet transformation is a time-frequency analysis method that is capable of simultaneously analyzing the high-frequency and low-frequency components of a signal on different time scales. Unlike conventional fourier transforms, wavelet transforms have good time-frequency localization capability, and can finely capture changes in signals over different frequency bands and time scales.
By multi-resolution analysis of the signal, features of the signal can be extracted from different time scales. For example, the wavelet transform may provide higher temporal resolution in the low frequency portion and higher frequency resolution in the high frequency portion. Thus, voice information of different levels can be captured, which is particularly important for voice recognition under a complex background.
The time domain features extracted by wavelet transformation include signal energy, instantaneous frequency, etc. at different frequency bands, which are very useful in capturing time variations of signals, variations of speech phonemes, etc. For voice signals with stronger noise or poorer signal quality, wavelet transformation can help a system to extract more detailed and multi-level characteristics, thereby improving the recognition accuracy.
Wavelet transformation provides signal information at different time scales by decomposing a signal into sub-signals spanning multiple frequency bands. Unlike the global analysis of the Fourier transform, the wavelet transform can analyze changes in a signal locally, which is particularly useful for processing non-stationary signals such as speech. The wavelet transform is calculated as:
W(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt
where ψ denotes the wavelet function, a the scale factor, b the translation parameter, x(t) the speech signal to be processed at time t, and W(a, b) the wavelet transform result.
In time domain feature extraction, multi-resolution analysis using wavelet transforms can look at the changes in the signal on different time scales. For example, on a longer time scale we can capture long time variations of syllables, while on a shorter time scale we can recognize rapid variations in speech, such as transient phoneme variations.
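A discrete multi-resolution decomposition of this kind can be written out in plain NumPy. The sketch below uses the Haar wavelet for transparency (in practice a library such as PyWavelets and a smoother wavelet would typically be used; neither choice is mandated by the patent):

```python
import numpy as np


def haar_dwt(x, levels=3):
    """Multi-level Haar wavelet decomposition (illustrative).
    Returns per-level detail coefficients (fast changes at
    successively longer time scales) plus the final approximation."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        if len(approx) % 2:
            approx = approx[:-1]                    # drop odd trailing sample
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # high-frequency detail
        approx = (even + odd) / np.sqrt(2)          # low-frequency approximation
    return details, approx
```

Early detail levels capture rapid variations such as transient phoneme changes, while the final approximation tracks slow, syllable-scale variation, matching the multi-scale view described above.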
Step 1015, using the short-time energy, the short-time zero-crossing rate, the pitch period, and the time domain features at the different time scales as the layered time domain features.
The extracted features (short-time energy, short-time zero-crossing rate, pitch period, time domain features extracted by wavelet transformation) together form a layered time domain feature. These features capture information of the speech signal in terms of energy variation, frequency characteristics, periodicity, time scale, etc. of the signal, respectively. By combining these different source features, the system is able to describe the speech signal in multiple dimensions, enhancing robustness to noise interference and providing more rich and diversified information for subsequent speech recognition. This layered feature extraction approach helps to improve system performance in different speech recognition tasks.
In the embodiments of steps 1011 to 1015, the objective is to construct an information representation with multiple levels and dimensions through a series of preprocessing and feature extraction methods, so as to effectively process speech signals in a complex environment. The system can improve the signal quality through operations such as denoising, pre-emphasis, framing, windowing and the like, and can extract the time domain characteristics of the voice from multiple angles through the extraction of characteristics such as short-time energy, zero-crossing rate, pitch period and the like and the multi-resolution analysis of wavelet transformation, thereby providing a solid foundation for subsequent voice recognition. The layered time domain feature extraction resolves speech signals from multiple layers and on different time scales by combining short-time energy, zero crossing rate, autocorrelation analysis, wavelet transform, and other techniques. The multi-level and all-directional feature extraction strategy can fully capture diversified information in voice signals, has strong robustness, and is particularly suitable for complex background noise or voice signals of different speakers.
Step 102, extracting layered frequency domain features from the voice signal to be processed;
unlike time domain features, frequency domain features are obtained by fourier transforming an audio signal (e.g., short-time fourier transform). The frequency domain features can reflect the energy distribution of different frequency components, and are generally particularly effective for feature identification of pitch, timbre and the like of sound. Layered frequency domain features mean extracting features in different frequency ranges, enabling finer capture of variations in the speech signal. The specific extraction logic of the hierarchical frequency domain features is as follows:
specifically, step 102 specifically includes steps 1021 through 1024:
Step 1021, performing short-time Fourier transform on the voice signal to be processed to obtain frequency spectrum information;
Short-time fourier transforms are a common method of converting a signal from the time domain to the frequency domain. The short-time fourier transform obtains a representation of each frame in the frequency domain by framing the signal and then fourier transforming each frame. The short-time fourier transform can exhibit a change in the signal in both the time and frequency dimensions. The frequency spectrum information obtained by short-time Fourier transform describes the frequency distribution of the voice signal in each frame. The frequency spectrum information of each frame contains each frequency component and the corresponding energy distribution in the period, and the frequency characteristic of the voice signal can be effectively reflected. The spectral information is the basis for subsequent feature extraction (e.g., mel-frequency cepstral coefficients) and speech recognition, and the frequency variation of the speech signal can be understood in depth through the spectrum, thereby helping to distinguish between different speech phonemes or background noise.
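The short-time Fourier transform amounts to framing, windowing, and a per-frame FFT. A minimal NumPy version (frame and hop lengths are typical values, not figures from the patent):

```python
import numpy as np


def stft(x, frame_len=400, hop=160):
    """Short-time Fourier transform via windowed per-frame rFFT
    (illustrative; parameter values are common defaults)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Each row is the spectrum of one frame: shape (n_frames, frame_len//2 + 1).
    return np.stack([
        np.fft.rfft(window * x[i * hop: i * hop + frame_len])
        for i in range(n_frames)
    ])
```

With a 400-sample frame at 16 kHz, the frequency bins are 40 Hz apart, so a 1 kHz tone peaks in bin 25 of each frame's spectrum.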
Step 1022, calculating the mel frequency cepstrum coefficient of the voice signal to be processed;
Mel-frequency cepstrum coefficients (MFCCs) are generated as follows:
First, the spectrum of the speech signal is calculated using the STFT.
The spectrum is then filtered by a bank of mel filters whose frequency response mimics the human ear's perception of different frequencies. The mel filter bank divides the spectrum into multiple sub-bands, with narrower bandwidths in the low frequency part and wider bandwidths in the high frequency part.
The energy output by each filter is logarithmized to simulate the nonlinear response of the human ear to sound intensity.
Finally, the log energy spectrum is compressed by a discrete cosine transform to obtain the MFCC coefficients; typically the first 12 coefficients are used to characterize the speech signal.
The mel-frequency cepstrum coefficient effectively compresses the frequency information of the voice signal by simulating the auditory mechanism of human ears, and simultaneously retains the characteristics useful for recognition in voice.
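The filterbank, log, and DCT stages can be sketched as follows. This is the standard textbook construction of MFCCs from a single frame's power spectrum; the filter count and coefficient count are common defaults, not values fixed by the patent:

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular mel filters over the rFFT bins (standard construction)."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb


def mfcc(power_spectrum, fs, n_filters=26, n_coeffs=12):
    """MFCCs of one frame: mel filtering, log compression, DCT-II."""
    n_fft = 2 * (len(power_spectrum) - 1)
    fb = mel_filterbank(n_filters, n_fft, fs)
    log_e = np.log(fb @ power_spectrum + 1e-10)      # log mel energies
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    return dct @ log_e
```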
Step 1023, performing multi-resolution frequency domain analysis on the spectrum information through wavelet transformation, and extracting frequency domain features at different frequencies;
Wavelet transformation is a frequency domain analysis tool that enables multi-resolution analysis. Unlike fourier transforms, wavelet transforms have localized capabilities in both time and frequency, and are capable of capturing both high frequency and low frequency information of a signal at different frequency scales. By wavelet transformation, the signal can be analyzed for different frequency ranges. In the low frequency part the wavelet transform provides a higher time resolution and in the high frequency part a higher frequency resolution. This enables the wavelet transform to capture both long-time scale stationary features and short-time scale rapidly changing features in the signal. By wavelet transforming the spectral information, the system can extract frequency domain features at different frequency ranges. These features can help the system better identify detail changes in the speech signal, and wavelet transforms can provide finer frequency domain features, especially in noisy environments.
The wavelet transform is conventional and can be implemented with reference to step 1014, which is not described in detail herein.
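As a minimal illustration of multi-resolution analysis (using the simple Haar wavelet for concreteness; the method does not specify a wavelet family), one transform level splits the signal into low-frequency approximation and high-frequency detail coefficients, and the per-scale energies can serve as frequency domain features:

```python
import numpy as np

def haar_dwt_level(x):
    # One level of the Haar wavelet transform: approximation
    # (low-frequency) and detail (high-frequency) coefficients.
    x = x[: len(x) // 2 * 2]
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def multires_features(signal, levels=3):
    # Detail-coefficient energy at each scale: fine scales describe
    # high-frequency behaviour, coarse scales low-frequency behaviour.
    feats, approx = [], np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        feats.append(float(np.sum(detail ** 2)))
    feats.append(float(np.sum(approx ** 2)))  # residual low-frequency energy
    return feats
```

Because the Haar transform is orthonormal, the per-scale energies sum to the total signal energy when the length divides evenly, which makes them stable descriptors across scales.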
Step 1024, using the mel frequency cepstrum coefficient and the frequency domain characteristics under different frequencies as the layered frequency domain characteristics.
The Mel Frequency Cepstrum Coefficient (MFCC) and the frequency domain features in different frequency ranges extracted by wavelet transform are combined together to form layered frequency domain features. This layered frequency domain feature combines the global frequency characteristics of the MFCC for speech with the analysis of the local frequency details by wavelet transformation.
By combining these two features, the system is able to process and understand the details of the speech signal simultaneously at multiple frequency levels. MFCC provides an overall description of the spectrum of a speech signal, while wavelet transformation captures finer frequency variations in the signal, especially in complex environments, and such layered frequency domain features can improve the robustness and accuracy of the speech recognition system.
In the embodiment of steps 1021 through 1024, the layered frequency domain features of the speech signal are extracted in order to obtain the frequency information of the speech on multiple frequency scales. The short-time Fourier transform provides basic frequency spectrum information, and a Mel Frequency Cepstrum Coefficient (MFCC) simulates the auditory characteristics of human ears, so that the frequency domain characteristics of voice can be effectively extracted. Wavelet transforms provide finer frequency domain analysis that captures local and global frequency characteristics in the signal on a multi-resolution basis. By combining the frequency domain features of different sources, the system can analyze and understand the voice signals more comprehensively and accurately, and particularly improves the voice recognition performance in a noisy environment.
Step 103, inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
This step inputs the features extracted from the time domain into an acoustic model to generate a phoneme sequence. The phoneme sequence represents the order of the basic phonetic units (phonemes) in the speech and is an intermediate output of speech recognition.
The features extracted from the frequency domain are input into another acoustic model (possibly also a deep neural network) to generate a phoneme sequence. Since the frequency domain features are sensitive to the frequency content of the speech, this model can better handle frequency-dependent speech information.
The time domain acoustic model and the frequency domain acoustic model are conventional models, and specific processing logic is not described herein.
Step 104, inputting the first phoneme sequence or the second phoneme sequence into a language model when the first phoneme sequence and the second phoneme sequence are the same, to obtain a speech recognition result output by the language model, wherein the speech recognition result is a word sequence;
The system compares the phoneme sequence generated in the time domain with that generated in the frequency domain. If the two sequences are identical, it means that the features extracted from the two different perspectives yield the same result in speech recognition, further confirming the reliability of the recognition process. If the first and second phoneme sequences are identical, either phoneme sequence is input into the language model for subsequent language understanding and reasoning. The language model can output the final speech recognition result (a word sequence) based on the context information. The language model adopts a traditional model, and its processing procedure is not described herein.
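The control flow of steps 103 through 105 can be sketched as follows; the model callables here are placeholders standing in for the actual acoustic models, language model and cascade-filter fallback, not the claimed implementations:

```python
def recognize(signal, time_model, freq_model, language_model, fallback):
    # Steps 103-105: run both acoustic models; agreement gates the
    # language model, disagreement triggers the cascade-filter fallback.
    seq1 = time_model(signal)   # first phoneme sequence (time domain)
    seq2 = freq_model(signal)   # second phoneme sequence (frequency domain)
    if seq1 == seq2:
        return language_model(seq1)       # step 104: either sequence works
    return fallback(signal, seq1, seq2)   # step 105: refine with filters
```

The agreement check acts as a confidence gate: only when two independent feature views concur is the cheap path taken.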
Step 105, when the first phoneme sequence is different from the second phoneme sequence, processing the voice signal to be processed by adopting a multi-order cascade filter to obtain a frequency spectrum coefficient, wherein the multi-order cascade filter is used for processing different frequency ranges of the signal, and the response of each cascade filter is the product of time domain impulse responses;
If the phoneme sequences extracted in the time and frequency domains are not identical, this may mean that there is a strong noise in the signal or some feature that is difficult to identify by conventional methods. Thus, the system chooses to introduce a multi-order cascaded filter to further process the speech signal. The filter is typically used to separate signals of different frequency ranges in order to perform finer processing on signals of different frequency bands. The response of each order filter is the product of the time domain impulse responses, which helps filter specific frequency bands in the signal, reduces noise and enhances the effective speech information. Each cascaded filter is dedicated to processing a different frequency segment in the speech signal. By cascading a plurality of filters, high-frequency noise or low-frequency interference can be filtered step by step, and frequency components which are vital to voice recognition are reserved.
The product of the time domain impulse responses means that the impulse response functions of the filters combine the responses of different frequency bands through a cascade operation to form the final frequency response. Such a design can effectively handle complex speech signals, especially in noisy environments.
Specifically, step 105 specifically includes steps 1051 to 1052:
Step 1051, extracting differential phonemes in the first phoneme sequence and the second phoneme sequence, and matching to-be-processed voice signals corresponding to the differential phonemes, wherein the differential phonemes refer to different phonemes existing in the same sequence position in the first phoneme sequence and the second phoneme sequence;
The phoneme sequence is a basic unit of speech, and by comparing the phonemes in the "first phoneme sequence" and the "second phoneme sequence", it can be found that there is a difference between the two at some positions.
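A minimal sketch of difference-phoneme extraction: positions where the two sequences disagree are collected and, under a hypothetical fixed frame alignment (in a real system the alignment would come from the acoustic models), mapped back to signal sample spans:

```python
def difference_phonemes(seq1, seq2):
    # Positions where the two phoneme sequences disagree
    # (same sequence position, different phoneme).
    return [(i, p1, p2)
            for i, (p1, p2) in enumerate(zip(seq1, seq2))
            if p1 != p2]

def segments_for(diffs, frames_per_phoneme, frame_len):
    # Hypothetical fixed alignment: phoneme i is assumed to cover
    # frames_per_phoneme frames of frame_len samples each.
    spans = []
    for i, _, _ in diffs:
        start = i * frames_per_phoneme * frame_len
        spans.append((start, start + frames_per_phoneme * frame_len))
    return spans
```

Only the disagreeing spans are re-processed by the cascade filter, which keeps the fallback path cheap.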
Step 1052, inputting the voice signal to be processed corresponding to the difference phonemes into a multi-order cascade filter to obtain the frequency spectrum coefficient.
By cascading a plurality of filters, the signal can be processed stepwise to extract different frequency components of the signal. The cascaded filters enable a fine analysis of the signal at different frequency bins so that features in different frequency ranges are more effectively enhanced or suppressed.
Through the processing of the multi-order cascade filter, the system will obtain the spectral coefficients corresponding to these difference phonemes. The spectral coefficients reflect the energy distribution of the signal in the various frequency bins. Spectral coefficients play a central role in the processing and feature extraction of speech signals, and are very important fundamental features, especially when comparing different phonemes or identifying subtle differences.
In the embodiment of steps 1051 to 1052, corresponding speech signal segments are extracted for different phonemes (difference phonemes), and the frequency characteristics of these signal segments are further analyzed by a multi-stage cascade filter, resulting in spectral coefficients. The process helps the system to concentrate on processing the part with the difference on the phoneme level, is beneficial to improving the recognition accuracy of the voice signal, and especially when processing the phonemes with nuances or the voice under the noise environment, the important characteristics in the voice can be captured finely through the high-efficiency frequency domain analysis of the multi-order cascade filter, so that the robustness of the system is improved.
Specifically, step 1052 specifically includes steps A1 to A4:
A1, inputting the voice signals to be processed corresponding to the difference phonemes into a multi-order cascade filter to obtain composite responses corresponding to a plurality of frequency bands;
As already mentioned above, the difference phonemes are derived by comparing the phonemes at the same positions in the two phoneme sequences. The speech signal segments corresponding to the difference phonemes are then input into the multi-order cascade filter.
The cascade filter is typically formed by cascading a plurality of filters, each of which processes signals of a different frequency band, and is capable of decomposing the speech signal to extract details in different frequency ranges. Through multi-order cascade filter processing, corresponding composite responses on a plurality of frequency bands can be obtained, namely signals processed by the filters on each frequency band.
The composite response generally refers to the output of a signal across the filter over the various frequency bands, reflecting the frequency content of the signal at each frequency band. The composite responses of the multiple frequency bands together constitute a complete description of the frequency domain characteristics of the speech signal.
The calculation process of the multi-order cascade filter is as follows:
h(t) = ∏_{i=1}^{N} P_i(t) · e^(−β_i·t) · cos(2π·f_i·t)
Wherein, h(t) is a function of time t and the center frequencies f_i; P_i(t) represents the polynomial factor of the time-domain part, and β_i represents the i-th bandwidth.
e^(−β_i·t) represents the exponential decay term of the filter and controls the bandwidth of the filter. This term determines the rate at which the filter attenuates the signal, which in turn affects the bandwidth (frequency selectivity) of the filter. A larger β_i corresponds to a wider bandwidth, and a smaller β_i corresponds to a narrower bandwidth. Here, β_i is the bandwidth of the i-th filter and is adjusted for different frequency ranges. For example, a filter in the low-frequency region may have a smaller bandwidth, while a filter in the high-frequency region may have a larger bandwidth.
cos(2π·f_i·t) is the cosine modulation term of the filter, which controls the center frequency of the filter. This term indicates the selectivity of the filter for a particular frequency: f_i is the center frequency of the i-th filter and determines the "position" of that filter's frequency response. Due to the periodicity of the cosine function, the response of the filter is strongest near the center frequency f_i and gradually fades with distance from it.
β_i represents the bandwidth parameter of the i-th filter and determines the frequency selectivity of the filter. The larger the bandwidth, the wider the response area of the filter, and vice versa. In practical applications, an appropriate bandwidth may be selected according to the frequency interval handled by the filter. In general, the bandwidth in the low-frequency region may be small and the bandwidth in the high-frequency region may be large, to accommodate the frequency selectivity of the cochlea.
f_i represents the center frequency of the i-th filter. Each filter has a center frequency that determines the strongest response frequency of the filter. The selection of the center frequency is typically adjusted according to the spectral characteristics of the signal. In such a cascaded filter structure, the center frequency f_i of each filter may be different, so the filters process signals in different frequency bands.
i represents the index of the filter and the position of the current filter in the cascade. For example, i = 1 is the first filter, i = 2 is the second filter, and so on. The value of i determines the parameters of each filter (e.g., bandwidth, center frequency) and controls the range and manner in which each filter acts on the signal.
N is the total number of cascaded filters, i.e., how many filters are included in the cascade. Each filter has a different bandwidth and center frequency, and the purpose of the cascade is to have each filter process a different frequency band of the signal separately and then combine their outputs to form the final filtering result.
The formula describes a cascade of filters; the time-domain response of each filter consists of a polynomial factor, an exponential decay factor and a cosine modulation factor. By adjusting the bandwidth β_i and center frequency f_i of each filter, careful signal processing can be performed for different frequency ranges, and the frequency response characteristics of the human ear in different frequency ranges can be simulated. By cascading multiple filters, more complex frequency selectivity can be provided, thereby better analyzing and processing the signal.
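A numerical sketch of this product of per-stage impulse responses (polynomial factor, exponential decay term, cosine modulation term). The polynomial factors P_i(t) default to 1 here, an assumption made for illustration:

```python
import numpy as np

def cascade_response(t, betas, freqs, poly_coeffs=None):
    # Product of per-stage time-domain impulse responses:
    #   h(t) = prod_i P_i(t) * exp(-beta_i * t) * cos(2*pi*f_i*t)
    # betas: bandwidth parameters, freqs: center frequencies (Hz),
    # poly_coeffs: optional per-stage polynomial coefficients for P_i.
    t = np.asarray(t, dtype=float)
    h = np.ones_like(t)
    for i, (beta, f) in enumerate(zip(betas, freqs)):
        p = np.polyval(poly_coeffs[i], t) if poly_coeffs else 1.0
        h *= p * np.exp(-beta * t) * np.cos(2 * np.pi * f * t)
    return h
```

With P_i(0) = 1 the response starts at 1 and decays at a rate set by the sum of the β_i, so larger bandwidth parameters yield a shorter, broader-band impulse response.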
Cascading filters means that the outputs of multiple filters are passed progressively to the next filter. Each filter processes a different portion of the signal (different frequency ranges) to produce a composite response in the final output. Here, each filter of the cascade is analyzed for a particular frequency band or characteristic, and the final output is the combined response of all these bands. By cascading a plurality of filters, the processing capacity of the human ear for complex frequency distribution can be simulated, especially in the case of filtering responses of different frequency bands to different characteristics.
The overall response is the product of the individual time-domain impulse responses. Each individual filter is typically a band pass filter whose bandwidth and center frequency determine its selectivity for different frequency bands. By cascading multiple filters, the response becomes more refined and can better simulate the complex auditory processing of the human ear.
The gain function of each filter (determined by the exponential decay term and the cosine term) controls the frequency response of the signal in that filter's band. Because the responses of the multiple filters differ (e.g., different bandwidths β_i and center frequencies f_i), they can analyze the signal carefully over different frequency ranges, similar to the response of different parts of the cochlea (different basilar-membrane regions) to different frequencies.
Cascaded filters can provide more complex frequency selectivity-filters in the low frequency region may have higher frequency resolution while filters in the high frequency region may have looser filter response. By applying different bandwidths and center frequencies in different frequency bands, different sensing characteristics of the human ear to different frequency intervals can be simulated.
The method can enhance the frequency resolution of the filter in different frequency bands, carry out more detailed analysis on the signal in a low frequency region, and process the smooth response of the signal through a wide bandwidth filter in a high frequency region.
The cochlea does not simply perform a level of frequency analysis on the input signal. In fact, the basilar membrane of the cochlea has different inductive properties for different frequencies at different locations, each location corresponding to the role of a frequency selective filter. By cascading a plurality of filters, the processing of signals of different frequency bands at different positions by the cochlea can be simulated.
Cascading filters simulates the process of such multi-stage filtering, where each filter acts on a different frequency range and creates a complex frequency response in the final output.
By cascading filters, we can better model the frequency selectivity and multi-stage processing mechanism of the cochlea. The plurality of filters provides finer frequency analysis and enables higher accuracy feature extraction over a wider frequency band. In addition, the cascade filter can utilize different bandwidths and center frequency parameters when processing signals in different frequency intervals, so that the robustness of the whole system in a noise environment is enhanced.
The cascade filter not only can refine the frequency selectivity of signals, but also can simulate the multi-stage frequency processing capability of the cochlea in different frequency bands. The method has the advantages that finer and flexible frequency response can be provided through the cooperative work of a plurality of filters, and further, the performance and the robustness in the fields of voice recognition, audio processing and the like are improved.
A2, respectively calculating energy values of composite responses corresponding to the frequency bands;
For the composite response of each band, its energy value needs to be calculated. The energy is typically represented by the sum of squares of the signals or calculated by the square of its spectral amplitude. The energy value can reflect the signal strength or importance over the frequency band.
The purpose of calculating the energy value is to capture the energy distribution of the signal over the various frequency bands, the frequency bands with high energy often contain more useful information, while the frequency bands with low energy may contain noise or insignificant components.
A3, carrying out logarithmic compression processing on energy values corresponding to the plurality of frequency bands to obtain logarithmic energy corresponding to the plurality of frequency bands;
The logarithmic compression processing is mainly used for simulating the perception mode of human ears on sound. The human ear has a logarithmic characteristic of perception of sound, i.e. is more sensitive to large amplitude variations, while the response to small amplitude variations is weaker. By carrying out logarithmic compression on the energy values, the energy difference between different frequency bands can be more in accordance with the perception rule of human ears.
By log-compressing the energy, the difference in energy between the bands can be better balanced and the energy value converted into a form more suitable for subsequent analysis. This helps to compress the information and reduce unwanted fluctuations, making the features more stable, especially in low energy parts, avoiding excessive amplification of weaker signals.
A4, performing discrete cosine transform on the logarithmic energy to obtain a frequency spectrum coefficient.
The discrete cosine transform can effectively extract the principal component of the signal and remove redundant information. After the discrete cosine transform process, the resulting spectral coefficients are a low-dimensional representation of the speech signal, typically comprising a principal component of logarithmic energy over each frequency band. Spectral coefficients are important features in speech signals and can effectively describe the spectral structure of speech.
The discrete cosine transform process is capable of extracting the most representative portion of the signal, removing most of the redundant information while preserving features associated with speech recognition. The obtained frequency spectrum coefficient not only reduces the data quantity, but also maintains the key frequency domain characteristics in the voice signal.
In the embodiment of step A1 to step A4, the input speech signal is subjected to a multiband filtering process, resulting in a "composite response" over a plurality of frequency bands. The energy value of the composite response for each band is calculated, capturing the intensity distribution of the signal at different bands. The energy value is subjected to logarithmic compression to simulate the perception characteristic of human ears, so that the energy distribution accords with the auditory perception rule. The logarithmic energy is converted into spectral coefficients by DCT, resulting in a compact and efficient representation of the features. These steps work together to extract spectral coefficients that can be used as inputs for subsequent speech recognition or analysis tasks, which can effectively describe the frequency domain characteristics of the speech signal. By these operations, key information of the speech signal is preserved and redundant parts are compressed, thereby improving efficiency and accuracy of system processing and recognition.
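Steps A2 through A4 (per-band energy, log compression, DCT) can be sketched as a single pipeline; this is a minimal illustration, with the coefficient count of 12 an assumed value:

```python
import numpy as np
from scipy.fft import dct

def spectral_coefficients(band_responses, n_coeffs=12):
    # A2: energy of each band's composite response (sum of squares).
    energies = np.array([np.sum(np.square(r)) for r in band_responses])
    # A3: logarithmic compression to mimic loudness perception
    # (small offset guards against log of zero).
    log_energy = np.log(energies + 1e-10)
    # A4: DCT decorrelates the log energies; keep leading coefficients.
    return dct(log_energy, type=2, norm='ortho')[:n_coeffs]
```

The leading DCT coefficients carry the smooth shape of the log-energy envelope, which is why truncation compresses the representation with little loss.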
Step 106, recognizing a voice recognition result according to the frequency spectrum coefficient.
The spectral coefficients processed by the multi-order cascaded filter reflect the intensity and distribution of the signal at different frequencies. These spectral coefficients are further used as input to a speech recognition model for final phoneme recognition and output corresponding speech recognition results.
Specifically, step 106 specifically includes steps 1061 to 1063:
Step 1061, inputting the spectral coefficients into a frequency domain acoustic model in combination with the layered frequency domain features to obtain a third phoneme sequence output by the frequency domain acoustic model;
The spectral coefficients are obtained in the previous steps by means of a multi-order cascade of filters, energy calculations, logarithmic compression and DCT transformation. The hierarchical frequency domain feature refers to feature information on a plurality of frequency bands and different frequency bands, and relates to different expressions of high-frequency, low-frequency or intermediate-frequency components. Combining spectral coefficients with layered frequency domain features is actually a more detailed characterization of the frequency domain information of the speech signal. The combined frequency domain features are input into a frequency domain acoustic model, which analyzes the signal and outputs a phoneme sequence, which is referred to as a third phoneme sequence.
Step 1062, if the third phoneme sequence is the same as the first phoneme sequence, inputting the first phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model;
According to the target of speech recognition, it is first determined whether the third phoneme sequence output by the frequency domain acoustic model matches the first phoneme sequence. If the third phoneme sequence is identical to the first phoneme sequence, it means that the phoneme sequence extracted from the speech signal is identical to the original phoneme sequence, which indicates that the model processes the signal more accurately. After obtaining the phoneme sequences, the system needs to generate corresponding words or sentences from the phoneme sequences. At this time, the system inputs the first phoneme sequence or the third phoneme sequence into the language model. The language model can further optimize the result of the voice recognition by considering factors such as grammar, context and the like, and generate reasonable voice recognition output. The output of the language model is the final speech recognition result, which not only depends on the phoneme sequence recognized by the acoustic model, but also needs to consider the grammar structure and the context meaning of the sentence, so that the final recognition result is more natural and accurate.
Step 1063, if the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model.
If the third phoneme sequence matches the second phoneme sequence, the two are consistent in speech characteristics, which may reflect a different speech signal or a change in the language content. At this point, the system performs recognition based on this phoneme sequence. Similar to the previous step, the second phoneme sequence or the third phoneme sequence is input into the language model for processing. The language model determines the most probable word or sentence from the input phoneme sequence and outputs the final speech recognition result. The recognition result obtained here may differ from the previous result because the phoneme sequence input into the language model is different. The language model combines the second phoneme sequence or the third phoneme sequence with factors such as context and grammar rules to generate the final recognition result.
In the embodiment of steps 1061 to 1063, the speech signal is first subjected to phoneme recognition by using the frequency-domain acoustic model to obtain a third phoneme sequence. If the third phoneme sequence is consistent with the first phoneme sequence, the sequence or the original phoneme sequence is input into a language model to obtain a final speech recognition result. If the third phoneme sequence is consistent with the second phoneme sequence, the second phoneme sequence or the third phoneme sequence is input into a language model to generate a final speech recognition result. The design of the steps enables the system to be flexibly adjusted according to different conditions in the phoneme recognition process, combines the advantages of an acoustic model and a language model, and improves the accuracy and the robustness of speech recognition.
Specifically, step 1063 specifically includes steps B1 to B2:
B1, if the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain an initial recognition result output by the language model;
After language model processing, the output obtained by the system is an initial recognition result. This result represents the most probable vocabulary or sentence predicted based on the current phoneme sequence and language model.
And step B2, performing spelling correction and grammar checking on the initial recognition result to obtain the voice recognition result.
Spelling correction, namely, the initial recognition result is a recognition output obtained based on a language model, but in the process of voice recognition, the recognition result may contain misspellings due to noise, accent, unclear pronunciation and the like. The spelling correction phase examines and corrects the vocabulary in the initial recognition result.
Grammar checking-in addition to spelling errors, the initial result of speech recognition may also have grammar errors. Although some basic context information is considered by the language model, inappropriate grammars or expressions may still occur in more complex sentence structures. Grammar checking examines the syntactic structure of the recognition result to ensure that sentences conform to grammar rules. For example, problems such as inconsistent main meaning, misuse tense or word order may be corrected. Grammar checking is typically performed using rule-based grammatical analysis or statistical models.
After spelling correction and grammar checking, the recognition result is more accurate and natural. The final output is the voice recognition result obtained after two steps of correction, and the accuracy and fluency of voice recognition are obviously improved.
In the embodiments of steps B1 to B2, through spelling correction and grammar checking, the system can better process common errors in speech recognition (such as misspelling, improper word selection, unsmooth grammar, etc.), thereby improving naturalness, accuracy and legibility of recognition results. The multi-level processing mode not only depends on the phoneme sequence output of the acoustic model, but also combines the semantic inference of the language model and the correction of the post-processing stage, so that the final result is closer to the language understanding of human beings.
In the embodiments corresponding to steps 101 to 106, by extracting the layered time domain features and the frequency domain features of the speech signal to be processed, the key information of the speech can be better captured in various noise environments. The time domain features preserve the timing characteristics in the speech signal, while the frequency domain features help reveal the spectral characteristics of the speech. By double extraction of the two features, the method can effectively reduce interference of background noise on voice signals and improve voice recognition performance in a complex environment. And processing the time domain characteristics and the frequency domain characteristics respectively by adopting a time domain acoustic model and a frequency domain acoustic model, and generating two phoneme sequences. By comparing the similarity of the two sequences, the recognition deviation caused by low signal quality can be effectively reduced. Under the condition that the phoneme sequences are consistent, the input language model further optimizes the recognition result, so that the accuracy and the reliability of voice recognition are improved. When the first phoneme sequence is inconsistent with the second phoneme sequence, the invention adopts the multi-order cascade filter to further process the voice signal to be processed. The multi-order cascade filter can process signals in different frequency ranges respectively, so that the frequency spectrum characteristics of the signals are improved effectively, and the accuracy of voice recognition is improved. The response of each cascade filter is optimized through the product of the time domain impulse responses, so that the system can inhibit noise in different frequency ranges in a targeted manner, and a better recognition effect is achieved in a complex noise environment. 
The method is excellent in performance especially in a low signal-to-noise ratio environment through extraction of layered time domain and frequency domain features and processing of a multi-order cascading filter. While the traditional voice recognition system is easy to generate recognition errors or performance degradation when facing noise and signal distortion, the invention obviously enhances the anti-interference capability of the system and improves the voice recognition precision under severe environment through a multi-level and all-directional feature extraction and processing strategy. In summary, the invention combines the layered feature extraction and the multi-order cascading filter, thereby remarkably improving the performance of the voice recognition system under the low signal-to-noise ratio and high noise background and effectively solving the limitation of the traditional voice recognition technology under the complex environment.
Referring to fig. 2, fig. 2 shows a schematic diagram of a voice recognition device based on a microcomputer according to the present invention; the voice recognition device based on a microcomputer shown in fig. 2 includes:
An obtaining unit 21, configured to obtain a to-be-processed voice signal, and extract a layered time domain feature in the to-be-processed voice signal;
an extracting unit 22, configured to extract a layered frequency domain feature in the speech signal to be processed;
A calculating unit 23, configured to input the layered time domain feature into a time domain acoustic model to obtain a first phoneme sequence, and input the layered frequency domain feature into a frequency domain acoustic model to obtain a second phoneme sequence;
A first judging unit 24, configured to input the first phoneme sequence or the second phoneme sequence into a language model when the first phoneme sequence and the second phoneme sequence are the same, so as to obtain a speech recognition result output by the language model;
A second judging unit 25, configured to process the speech signal to be processed by using a multi-order cascade filter when the first phoneme sequence and the second phoneme sequence are different, so as to obtain a spectral coefficient;
and a recognition unit 26, configured to determine a speech recognition result based on the spectral coefficients.
According to the microcomputer-based speech recognition device, extracting layered time-domain and frequency-domain features from the speech signal to be processed allows key information of the speech to be captured more reliably in a variety of noise environments. The time-domain features preserve the timing characteristics of the speech signal, while the frequency-domain features reveal its spectral characteristics. This double extraction effectively reduces the interference of background noise on the speech signal and improves recognition performance in complex environments. The time-domain and frequency-domain features are processed by a time-domain acoustic model and a frequency-domain acoustic model respectively, generating two phoneme sequences. Comparing the two sequences effectively reduces recognition deviations caused by low signal quality. When the phoneme sequences are consistent, the language model further optimizes the recognition result, improving the accuracy and reliability of speech recognition. When the first phoneme sequence is inconsistent with the second phoneme sequence, the invention processes the speech signal to be processed with a multi-order cascade filter. The multi-order cascade filter processes signals in different frequency ranges separately, improving the spectral characteristics of the signal and thus the accuracy of speech recognition. Because the response of each cascade filter is the product of time-domain impulse responses, the system can suppress noise in different frequency ranges in a targeted manner and achieve a better recognition effect in complex noise environments.
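The agree/disagree routing described above can be sketched as a small dispatch function. The phoneme labels, the callable language model, and the fallback interface below are illustrative assumptions for the sketch, not the patented implementation.

```python
def recognize(ts_phonemes, fd_phonemes, language_model, cascade_fallback):
    """Route recognition according to whether the two acoustic models agree.

    ts_phonemes / fd_phonemes: phoneme sequences from the time-domain and
    frequency-domain acoustic models (lists of phoneme labels).
    language_model: callable mapping a phoneme sequence to a word sequence.
    cascade_fallback: callable implementing the cascade-filter fallback path.
    """
    if ts_phonemes == fd_phonemes:
        # Sequences agree: either sequence can feed the language model.
        return language_model(ts_phonemes)
    # Sequences differ: locate the positions where the phonemes disagree and
    # hand the corresponding signal segments to the cascade-filter path.
    diff_positions = [i for i, (a, b) in enumerate(zip(ts_phonemes, fd_phonemes))
                      if a != b]
    return cascade_fallback(diff_positions)

# Toy usage with stand-in models:
lm = lambda ph: " ".join(ph).upper()
fallback = lambda pos: f"fallback at positions {pos}"
print(recognize(["n", "i", "h", "ao"], ["n", "i", "h", "ao"], lm, fallback))  # N I H AO
print(recognize(["n", "i"], ["n", "a"], lm, fallback))  # fallback at positions [1]
```

The fallback receives only the disagreeing positions here; in the patent the corresponding signal segments are re-filtered, which this sketch abstracts behind the callable.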
The microcomputer-based speech recognition device enjoys the same advantages as the method described above: by combining layered feature extraction with the multi-order cascade filter, it performs particularly well in low signal-to-noise-ratio environments and effectively overcomes the limitations of traditional speech recognition technology in complex environments.
Fig. 3 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 3, the terminal device 3 of this embodiment comprises a processor 30, a memory 31, and a computer program 32, for example a microcomputer-based speech recognition program, stored in the memory 31 and executable on the processor 30. When executing the computer program 32, the processor 30 implements the steps of each of the microcomputer-based speech recognition method embodiments described above, such as steps 101 through 106 shown in fig. 1. Alternatively, when executing the computer program 32, the processor 30 performs the functions of the units in the above-described device embodiments, for example the functions of the units shown in fig. 2.
By way of example, the computer program 32 may be divided into one or more units, which are stored in the memory 31 and executed by the processor 30 to implement the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 32 in the terminal device 3. For example, the computer program 32 may be partitioned into units having the following specific functions:
an acquisition unit, configured to acquire a voice signal to be processed and extract layered time domain features from the voice signal to be processed;
the extraction unit is used for extracting the layered frequency domain characteristics in the voice signal to be processed;
The computing unit is used for inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
the first judging unit is used for inputting the first phoneme sequence or the second phoneme sequence into a language model when the first phoneme sequence and the second phoneme sequence are the same, so as to obtain a voice recognition result output by the language model;
The second judging unit is used for processing the voice signal to be processed by adopting a multi-order cascade filter to obtain a frequency spectrum coefficient when the first phoneme sequence and the second phoneme sequence are different, wherein the multi-order cascade filter is used for processing different frequency ranges of the signal, and the response of each cascade filter is the product of time domain impulse responses;
and the recognition unit is used for recognizing the voice recognition result according to the frequency spectrum coefficient.
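The multi-order cascade filter assigned to the second judging unit can be sketched as stages applied in series. The claim characterizes each cascade filter's response via a product of time-domain impulse responses; the conventional series realization below (successive convolution with illustrative stage kernels) is offered as one reading of that structure, not as the patented filter design.

```python
import numpy as np

def cascade_filter(signal, stage_impulse_responses):
    """Apply a multi-order cascade by passing the signal through each stage's
    impulse response in series. The overall effect is realized here by
    successive convolution; the stage kernels are illustrative placeholders.
    """
    out = np.asarray(signal, dtype=float)
    for h in stage_impulse_responses:
        out = np.convolve(out, h, mode="same")  # keep the original length
    return out

# Two toy stages: a 5-tap moving average (low-pass) and a first difference
# (high-pass), so the cascade behaves as a crude band-pass.
stages = [np.ones(5) / 5.0, np.array([1.0, -1.0])]
x = np.sin(2 * np.pi * 0.05 * np.arange(200))
y = cascade_filter(x, stages)
print(y.shape)  # (200,)
```

In a real deployment each stage would target a distinct frequency range, with the per-band outputs feeding the energy/log/DCT chain described in the claims.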
The terminal device 3 may include, but is not limited to, a processor 30 and a memory 31. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the terminal device 3 and is not meant to be limiting; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include input and output devices, network access devices, buses, etc.
The processor 30 may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may be an internal storage unit of the terminal device 3, such as a hard disk or a memory of the terminal device 3. The memory 31 may also be an external storage device of the terminal device 3, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 3. Further, the memory 31 may include both an internal storage unit and an external storage device of the terminal device 3. The memory 31 is used for storing the computer program and other programs and data required by the terminal device. The memory 31 may also be used for temporarily storing data that has been output or is to be output.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation process of the embodiments of the present invention.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps for implementing the various method embodiments described above.
Embodiments of the present invention provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform steps that enable the implementation of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunications signal, or a software distribution medium, such as a U-disk, removable hard disk, magnetic disk, or optical disk.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when", "upon", "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is monitored" may be interpreted, depending on the context, as "upon determination", "in response to a determination", "upon monitoring a [described condition or event]" or "in response to monitoring a [described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The foregoing embodiments are merely illustrative of the technical solutions of the present invention, and not restrictive, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that modifications may still be made to the technical solutions described in the foregoing embodiments or equivalent substitutions of some technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A microcomputer-based speech recognition method, characterized in that the microcomputer-based speech recognition method comprises:
Acquiring a voice signal to be processed, and extracting layered time domain characteristics in the voice signal to be processed;
extracting layered frequency domain characteristics in the voice signal to be processed;
Inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
when the first phoneme sequence and the second phoneme sequence are the same, inputting the first phoneme sequence or the second phoneme sequence into a language model to obtain a voice recognition result output by the language model, wherein the voice recognition result is a word sequence;
Extracting difference phonemes in the first phoneme sequence and the second phoneme sequence when the first phoneme sequence and the second phoneme sequence are different, and matching to-be-processed voice signals corresponding to the difference phonemes, wherein the difference phonemes are different phonemes existing in the same sequence position in the first phoneme sequence and the second phoneme sequence;
inputting the voice signals to be processed corresponding to the difference phonemes into a multi-order cascade filter to obtain composite responses corresponding to a plurality of frequency bands;
Respectively calculating energy values of composite responses corresponding to the frequency bands;
carrying out logarithmic compression processing on energy values corresponding to the plurality of frequency bands to obtain logarithmic energy corresponding to the plurality of frequency bands;
Performing discrete cosine transform on the logarithmic energy to obtain a frequency spectrum coefficient, wherein the multi-order cascade filter is used for processing different frequency ranges of signals, and the response of each cascade filter is the product of time domain impulse responses;
And recognizing a voice recognition result according to the frequency spectrum coefficient.
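The energy → logarithmic-compression → discrete-cosine-transform chain at the end of claim 1 can be sketched as follows. The band decomposition is assumed to have been produced upstream by the multi-order cascade filter; the coefficient count, the epsilon guard, and the explicit DCT-II basis are illustrative choices, not values fixed by the claim.

```python
import numpy as np

def spectral_coefficients(band_responses, num_coeffs=12, eps=1e-10):
    """Turn per-band composite responses into spectral coefficients, following
    the energy -> log -> DCT chain of claim 1.

    band_responses: list of 1-D arrays, one filtered signal per frequency band.
    """
    # 1) Energy of the composite response in each frequency band.
    energies = np.array([np.sum(np.square(r)) for r in band_responses])
    # 2) Logarithmic compression (eps guards against log(0) on silent bands).
    log_energies = np.log(energies + eps)
    # 3) DCT-II over the band axis, keeping the first num_coeffs coefficients.
    n = len(log_energies)
    k = np.arange(num_coeffs)[:, None]
    m = np.arange(n)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct_basis @ log_energies

rng = np.random.default_rng(0)
bands = [rng.standard_normal(256) for _ in range(26)]  # 26 stand-in band outputs
coeffs = spectral_coefficients(bands)
print(coeffs.shape)  # (12,)
```

The DCT decorrelates the log band energies, which is why the same chain appears in classical cepstral front ends.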
2. The microcomputer-based speech recognition method of claim 1, wherein the step of acquiring a speech signal to be processed and extracting layered time domain features in the speech signal to be processed comprises:
Collecting an original voice signal;
Carrying out denoising, pre-emphasis, framing and windowing on the original voice signals to obtain a plurality of voice signals to be processed;
Calculating short-time energy, short-time zero-crossing rate and pitch period of the voice signal to be processed;
Performing multi-resolution time domain analysis on the voice signal to be processed through wavelet transformation, and extracting time domain features under different time scales;
and taking the short-time energy, the short-time zero-crossing rate, the pitch period and the time domain characteristics under different time scales as the layered time domain characteristics.
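Two of the layered time-domain features named in claim 2 — short-time energy and short-time zero-crossing rate — can be sketched per frame as follows. The frame length and hop are illustrative, and the pitch-period and wavelet features are omitted from this sketch.

```python
import numpy as np

def short_time_features(signal, frame_len=256, hop=128):
    """Per-frame short-time energy and zero-crossing rate, two of the layered
    time-domain features of claim 2.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Short-time energy: sum of squared samples within the frame.
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    # Zero-crossing rate: fraction of adjacent samples with opposite sign.
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    return energy, zcr

t = np.arange(2048)
sig = np.sin(2 * np.pi * 0.01 * t)             # low-frequency tone -> low ZCR
noise = np.sign(np.sin(2 * np.pi * 0.45 * t))  # near-Nyquist square wave -> high ZCR
e1, z1 = short_time_features(sig)
e2, z2 = short_time_features(noise)
print(z1.mean() < z2.mean())  # True: ZCR separates the two signals
```

ZCR is a cheap voiced/unvoiced cue, which is why it sits alongside energy and pitch period in the layered feature set.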
3. The microcomputer-based speech recognition method of claim 1 wherein the step of extracting layered frequency domain features in the speech signal to be processed comprises:
Performing short-time Fourier transform on the voice signal to be processed to obtain frequency spectrum information;
Calculating a mel frequency cepstrum coefficient of the voice signal to be processed;
Carrying out multi-resolution frequency domain analysis on the frequency spectrum information through wavelet transformation, and extracting frequency domain features under different frequencies;
and taking the Mel frequency cepstrum coefficient and the frequency domain characteristics under different frequencies as the layered frequency domain characteristics.
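The mel-frequency cepstral coefficient computation referenced in claim 3 can be sketched as STFT → mel filterbank energies → log → DCT. The sample rate, FFT size, filter count, and Hann windowing below are conventional illustrative parameters, not values fixed by the claim.

```python
import numpy as np

def mel_filterbank(num_filters, n_fft, sr):
    """Triangular mel filterbank (standard construction; parameters are
    illustrative)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), num_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def mfcc_like(signal, sr=16000, n_fft=512, hop=256, num_filters=26, num_ceps=13):
    """MFCC-style features: STFT power -> mel energies -> log -> DCT-II."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(num_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    k = np.arange(num_ceps)[:, None]
    m = np.arange(num_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * num_filters))
    return log_mel @ dct.T  # shape: (frames, num_ceps)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = mfcc_like(sig)
print(feats.shape)  # (61, 13)
```

The wavelet-based multi-resolution frequency analysis of claim 3 would be layered on top of this; it is not shown here.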
4. The microcomputer-based speech recognition method of claim 1, wherein the step of recognizing a speech recognition result based on the spectral coefficients comprises:
Inputting the frequency spectrum coefficient, in combination with the layered frequency domain features, into a frequency domain acoustic model to obtain a third phoneme sequence output by the frequency domain acoustic model;
If the third phoneme sequence is the same as the first phoneme sequence, inputting the first phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model;
and if the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model.
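The tie-break logic of claim 4 can be sketched directly. The behavior when the third sequence matches neither the first nor the second is not specified by the claim, so returning None below is an illustrative assumption.

```python
def resolve_with_third(first, second, third, language_model):
    """Tie-break using the third phoneme sequence produced from the
    re-filtered signal (claim 4).

    first/second/third: phoneme sequences; language_model: callable mapping a
    phoneme sequence to a recognition result.
    """
    if third == first:
        # Third sequence sides with the time-domain model.
        return language_model(first)
    if third == second:
        # Third sequence sides with the frequency-domain model.
        return language_model(second)
    # No match: the claim leaves this case open (assumption: report failure).
    return None

lm = lambda ph: "-".join(ph)
print(resolve_with_third(["a", "b"], ["a", "c"], ["a", "c"], lm))  # a-c
```

In effect the third sequence acts as a majority vote between the two acoustic models.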
5. The method of claim 4, wherein if the third phoneme sequence is identical to the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain a speech recognition result output by the language model comprises:
If the third phoneme sequence is the same as the second phoneme sequence, inputting the second phoneme sequence or the third phoneme sequence into a language model to obtain an initial recognition result output by the language model;
and carrying out spelling correction and grammar checking on the initial recognition result to obtain the voice recognition result.
6. A microcomputer-based speech recognition apparatus, characterized in that the microcomputer-based speech recognition apparatus comprises:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal to be processed and extracting layered time domain characteristics in the voice signal to be processed;
the extraction unit is used for extracting the layered frequency domain characteristics in the voice signal to be processed;
The computing unit is used for inputting the layered time domain features into a time domain acoustic model to obtain a first phoneme sequence, and inputting the layered frequency domain features into a frequency domain acoustic model to obtain a second phoneme sequence;
the first judging unit is used for inputting the first phoneme sequence or the second phoneme sequence into a language model when the first phoneme sequence and the second phoneme sequence are the same, so as to obtain a voice recognition result output by the language model;
The second judging unit is used for, when the first phoneme sequence and the second phoneme sequence are different, extracting difference phonemes in the first phoneme sequence and the second phoneme sequence and matching the to-be-processed voice signals corresponding to the difference phonemes, wherein the difference phonemes are different phonemes existing at the same sequence position in the first phoneme sequence and the second phoneme sequence; inputting the to-be-processed voice signals corresponding to the difference phonemes into a multi-order cascade filter to obtain composite responses corresponding to a plurality of frequency bands; respectively calculating energy values of the composite responses corresponding to the plurality of frequency bands; carrying out logarithmic compression processing on the energy values corresponding to the plurality of frequency bands to obtain logarithmic energies corresponding to the plurality of frequency bands; and performing discrete cosine transform on the logarithmic energies to obtain a frequency spectrum coefficient, wherein the multi-order cascade filter is used for processing different frequency ranges of the signal, and the response of each cascade filter is the product of time domain impulse responses.
And the recognition unit is used for recognizing the voice recognition result according to the frequency spectrum coefficient.
7. A terminal device, characterized in that it comprises a memory, a processor and a microcomputer-based speech recognition program stored on the memory and executable on the processor, the microcomputer-based speech recognition program being configured to implement the steps of the microcomputer-based speech recognition method according to any of claims 1 to 5.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps in the microcomputer based speech recognition method of any of claims 1 to 5.
CN202510635399.9A 2025-05-16 2025-05-16 Speech recognition method and device based on microcomputer Active CN120148484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510635399.9A CN120148484B (en) 2025-05-16 2025-05-16 Speech recognition method and device based on microcomputer


Publications (2)

Publication Number Publication Date
CN120148484A CN120148484A (en) 2025-06-13
CN120148484B true CN120148484B (en) 2025-07-11

Family

ID=95951247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510635399.9A Active CN120148484B (en) 2025-05-16 2025-05-16 Speech recognition method and device based on microcomputer

Country Status (1)

Country Link
CN (1) CN120148484B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120319223A (en) * 2025-06-18 2025-07-15 心有灵犀科技股份有限公司 A method and system for voice recognition of smart helmets for food delivery riders

Citations (2)

Publication number Priority date Publication date Assignee Title
CN116013256A (en) * 2022-12-19 2023-04-25 镁佳(北京)科技有限公司 Speech recognition model construction and speech recognition method, device and storage medium
CN119541475A (en) * 2025-01-21 2025-02-28 东莞市华泽电子科技有限公司 Audio processing method and system for speech spectrum reconstruction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11978433B2 (en) * 2021-06-22 2024-05-07 Microsoft Technology Licensing, Llc. Multi-encoder end-to-end automatic speech recognition (ASR) for joint modeling of multiple input devices


Also Published As

Publication number Publication date
CN120148484A (en) 2025-06-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant