CN103065629A

CN103065629A - Speech recognition system of humanoid robot

Info

Publication number: CN103065629A
Application number: CN 201210475180
Authority: CN
Inventors: 刘治; 林俊潜; 徐淑琼; 章云
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2012-11-20
Filing date: 2012-11-20
Publication date: 2013-04-24

Abstract

The invention discloses a speech recognition system which comprises a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output end of the speech input module is connected with the input end of the preprocessing module, the output end of the preprocessing module is connected with the input end of the feather extraction module, the output end of the feather extraction module is connected with the input ends of the training module and the recognition module, and the training module is connected with the recognition module. The output end of the recognition module is connected with the input end of the recognition decision module, and the output end of the recognition decision module is connected with the input end of the threshold comparison module. The speech recognition system adopts a hidden Markov model (HMM) and a wavelet transform and a neural network technique, makes a decision by adopting threshold comparison and improves a recognition rate.

Description

A Speech Recognition System for Humanoid Robot

技术领域technical field

本发明是一种基于仿人机器人的语音识别系统，用于智能机器人，也可用于智能系统或智能装备，人机交互装置等等。The present invention is a voice recognition system based on a humanoid robot, which is used for intelligent robots, and can also be used for intelligent systems or intelligent equipment, human-computer interaction devices and the like.

背景技术Background technique

能够用语音与机器进行语音交流，让机器明白你说什么，这是人们长期以来梦寐以求的事情，也是人类一直以来的一个理想，就是让各种机器能听懂人类的语言并能按人的口头命令来行动，从而实现人机的语言交流。随着科学技术的不断发展，出现了语音识别技术，使人类的这个理想逐渐得以实现。但是要完全实现这个理想还需人类的不懈努力。语音识别技术就是让机器通过识别和理解语音信号，并将其转换成为相应的文本或命令的技术。Being able to use voice to communicate with machines so that machines can understand what you say is something that people have long dreamed of. Command to act, so as to realize the language communication between man and machine. With the continuous development of science and technology, speech recognition technology has emerged, which gradually realizes this ideal of human beings. But to fully realize this ideal requires the unremitting efforts of human beings. Speech recognition technology is a technology that allows machines to recognize and understand voice signals and convert them into corresponding text or commands.

语音识别是近年来十分活跃的一个研究领域。其应用领域非常广泛，常见的有：语音输入系统、语音控制系统、语音拨号系统、智能家电等等。在不远的将来语音识别技术有可能作为一种重要的人机交互手段，辅助甚至取代传统的键盘、鼠标等输入设备，在个人计算机上进行文字录入和操作控制。而在手持式PDA、智能家电、工业现场控制等应用场合，语音识别技术则有更为广阔的发展前景。尤其是在包括PDA、手机等的掌上型嵌入式系统中，键盘的存在已经大大妨碍了系统的小型化，然而这些系统越来越趋向于智能化、信息化，不仅可以显示大量的文字和图形，还需要提供方便的文字输入能力，传统的键盘输入方式已经不能胜任，而语音识别技术就是一种极富潜力的替代手段。并且，语音技术的应用已经成为一个具有竞争性的新兴技术产业。因此研究语音识别技术有着广泛的应用价值和发展前景。Speech recognition is a very active research field in recent years. Its application fields are very wide, and the common ones are: voice input system, voice control system, voice dialing system, smart home appliances and so on. In the near future, speech recognition technology may be used as an important means of human-computer interaction, assisting or even replacing traditional keyboards, mice and other input devices for text entry and operation control on personal computers. In applications such as handheld PDAs, smart home appliances, and industrial field control, speech recognition technology has a broader development prospect. Especially in handheld embedded systems including PDA, mobile phone, etc., the existence of the keyboard has greatly hindered the miniaturization of the system. , It is also necessary to provide convenient text input capabilities. The traditional keyboard input method is no longer competent, and speech recognition technology is a very promising alternative. Moreover, the application of voice technology has become a competitive emerging technology industry. Therefore, the study of speech recognition technology has a wide range of application value and development prospects.

语音识别技术主要包括特征提取技术、模式匹配准则及模型训练技术三个方面。语音识别技术车联网也得到了充分的引用，例如在翼卡车联网中，只需按一键通客服人员口述即可设置目的地直接导航，安全、便捷。但语音识别主要还面临着有以下五个问题：Speech recognition technology mainly includes three aspects: feature extraction technology, pattern matching criteria and model training technology. Voice recognition technology has also been fully used in the Internet of Vehicles. For example, in the Internet of Wing trucks, you only need to press the key to talk to the customer service personnel to dictate the destination and directly navigate, which is safe and convenient. However, speech recognition still faces the following five problems:

(1)、对自然语言的识别和理解。首先必须将连续的讲话分解为词、音素等单位，其次要建立一个理解语义的规则；(1) Recognition and understanding of natural language. First, continuous speech must be decomposed into units such as words and phonemes, and secondly, a rule for understanding semantics must be established;

(2)、语音信息量大。语音模式不仅对不同的说话人不同，对同一说话人也是不同的，例如，一个说话人在随意说话和认真说话时的语音信息时不同的。一个人的说话方式随着时间变化；(2), the amount of voice information is large. Speech patterns are not only different for different speakers, but also for the same speaker. For example, the voice information of a speaker is different when he speaks casually and when he speaks seriously. the way a person speaks changes over time;

(3)、语音的模糊性。说话者在讲话时，不同的词可能听起来是相似的。这在英语和汉语中常见；(3), the ambiguity of speech. Different words may sound similar when the speaker is speaking. This is common in English and Chinese;

(4)、单个字母或词、字的语音特性受上下文的影响，以致改变了重音、音调、音量和发音速度等；(4) The phonetic characteristics of a single letter or word or character are affected by the context, resulting in changes in accent, pitch, volume and pronunciation speed, etc.;

(5)、环境噪声和干扰对语音识别有严重影响，致使识别率低。(5) Environmental noise and interference have a serious impact on speech recognition, resulting in a low recognition rate.

近几十年来，很多专家、学者带着这些问题，不断地研究与探索，使得语音识别技术得到发展。并基于语音识别技术构造了各种各样地语音识别系统。目前语音识别系统的应用领域有：电话通信的语音拨号、汽车的语音控制、工业控制及医疗领域、个人数字助理(Personal Digital Assistant，PDA)、智能玩具、家电遥控等等。人们不断的研究语音识别技术，是希望有一天能够达到像人和人之间交流一样，人和机器也能实现自由的对话，从而实现工业生产的自动化、智能化。随着科技的发展和人们对语音识别理论的逐渐深入化的研究，理论体系的日趋成熟，随着数字信号处理技术的发展，在未来20年，语音识别技术将逐渐的进入工业、家电、通信、汽车电子、医疗以及各种电子设备中。可以肯定地说，语音识别技术必将成为未来信息产业中的一项关键的技术。但是也不可否认，它还有很长的一段路需要走，要真正的商业化，还需要在多方面取得突破性的进展，还需要借助于其它相关学科的发展。In recent decades, many experts and scholars have continued to research and explore these problems, which has led to the development of speech recognition technology. And construct various speech recognition systems based on speech recognition technology. At present, the application fields of speech recognition system include: voice dialing of telephone communication, voice control of automobiles, industrial control and medical field, personal digital assistant (Personal Digital Assistant, PDA), smart toys, remote control of home appliances, etc. People continue to study speech recognition technology in the hope that one day it will be possible to achieve a free dialogue between humans and machines, just like the communication between humans, so as to realize the automation and intelligence of industrial production. With the development of science and technology and people's gradually in-depth research on speech recognition theory, the theoretical system is becoming more and more mature. With the development of digital signal processing technology, in the next 20 years, speech recognition technology will gradually enter the industry, home appliances, communications , automotive electronics, medical and various electronic equipment. It can be said with certainty that speech recognition technology will become a key technology in the future information industry. But it is also undeniable that it still has a long way to go. To be truly commercialized, it needs to make breakthroughs in many aspects and rely on the development of other related disciplines.

发明内容Contents of the invention

本发明是一种语音识别系统，主要目的是提供一种高效的、稳定的、实用性强的、高识别率的语音识别系统。The present invention is a speech recognition system, the main purpose of which is to provide a speech recognition system with high efficiency, stability, strong practicability and high recognition rate.

为实现上述目的，本发明以MATLAB为实现工具，结合迎宾仿人机器人平台。搭建好完整的语音识别系统，用户利用平台通过麦克风语音命令，输入语音信号经处理、识别，得出结果作用于迎宾机器人的行动动作。测评该系统能否能达到期望指标，识别能力强，正确率高，鲁棒性好的语音识别系统。In order to achieve the above object, the present invention uses MATLAB as a realization tool, combined with a welcoming humanoid robot platform. After building a complete voice recognition system, the user uses the platform to command voice through the microphone, input the voice signal, process and recognize it, and get the result to act on the action of the welcome robot. Evaluate whether the system can achieve the expected indicators, strong recognition ability, high accuracy rate, and robust speech recognition system.

本发明是通过以下技术方案实现的，一种仿人机器人的语音识别系统，包括语音输入模块、预处理模块、特征提取模块、训练模块、识别模块、识别决策模块、阈值比较模块，语音输入模块的输出端与预处理模块的输入端连接，预处理模块的输出端与特征提取模块的输入端连接，特征提取模块的输出端分别与训练模块、识别模块的输入端连接，训练模块与识别模块连接；识别模块的输出端与识别决策模块的输入端连接，识别决策模块的输出端与阈值比较模块的输入端连接。The present invention is achieved through the following technical solutions, a speech recognition system of a humanoid robot, comprising a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module, a threshold comparison module, and a speech input module The output end of the preprocessing module is connected with the input end of the preprocessing module, the output end of the preprocessing module is connected with the input end of the feature extraction module, the output end of the feature extraction module is respectively connected with the input end of the training module and the recognition module, and the training module and the recognition module connection; the output end of the identification module is connected with the input end of the identification decision module, and the output end of the identification decision module is connected with the input end of the threshold value comparison module.

所述语音输入模块用于输入原始语音信号。The voice input module is used for inputting original voice signals.

所述预处理模块包括顺次连接的预滤波单元、采样与量化单元、预加重单元、加窗单元、端点检测单元；The preprocessing module includes a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit, and an endpoint detection unit connected in sequence;

所述预滤波单元用于去除原始语音信号的高频噪声；The pre-filter unit is used to remove high-frequency noise of the original speech signal;

所述采样与量化单元采样奈奎斯特采样定理采样和量化去噪的模拟信号，获得数字信号；The sampling and quantization unit samples the Nyquist sampling theorem to sample and quantize the denoised analog signal to obtain a digital signal;

所述预加重单元用于提升高频部分，让信号的频谱变得平坦，以便参数分析；The pre-emphasis unit is used to enhance the high-frequency part, so that the frequency spectrum of the signal becomes flat for parameter analysis;

所述加窗单元用于将信号有限化；The windowing unit is used to limit the signal;

所述端点检测单元用于检测语音段的起点、终点，去除不需要的静音段，提取试剂的语音信号段。The endpoint detection unit is used to detect the start point and end point of the speech segment, remove unnecessary silent segments, and extract the speech signal segment of the reagent.

所述端点检测单元采用双门限能量法与人工神经网络相结合的方法。The endpoint detection unit adopts the method of combining double-threshold energy method and artificial neural network.

所述特征提取模块采用基于小波变换的混合特征参数提取算法，所述提取算法为基于小波变换的线性预测倒谱参数和基于小波变换的Mel频率倒谱系数。The feature extraction module adopts a wavelet transform-based hybrid feature parameter extraction algorithm, and the extraction algorithm is wavelet transform-based linear prediction cepstral parameters and wavelet transform-based Mel frequency cepstral coefficients.

所述训练模块是通过Baum-Welch(期望值修正)算法作为隐马尔科夫模型的训练学习方法。The training module uses the Baum-Welch (expectation correction) algorithm as a training and learning method for the hidden Markov model.

所述识别决策模块是通过Viterbi(维特比)算法得到输出概率。The identification and decision-making module obtains the output probability through a Viterbi (Viterbi) algorithm.

所述阈值比较模块用于将获得的输出概率值与设定的阈值比较，如果高于阈值则输出识别结果，否则丢弃该识别结果。The threshold comparison module is used to compare the obtained output probability value with the set threshold, if it is higher than the threshold, then output the recognition result, otherwise discard the recognition result.

本发明的工作过程：语音信号从麦克风即语音输入模块输入信号，经预处理模块预处理，预处理包括预滤波、采样与量化、预加重、加窗及端点检测；预处理后对信号进行特征参数提取，将所提取的参数序列，建立保存成语音参数模板库即训练模板模块；语音识别过程是语音从麦克风输入，经过预处理、特征参数提取，将提取的特征参数与所建立的语音参数模板库进行概率计算与匹配，匹配得出结果通过阈值比较模块进行阈值比较，最终得到识别结果。The working process of the present invention: the voice signal is input from the microphone, that is, the voice input module, and is preprocessed by the preprocessing module. The preprocessing includes prefiltering, sampling and quantization, pre-emphasis, windowing and endpoint detection; after preprocessing, the signal is characterized Parameter extraction, the extracted parameter sequence is established and saved as a speech parameter template library, that is, the training template module; the speech recognition process is that the speech is input from the microphone, and after preprocessing and feature parameter extraction, the extracted feature parameters are combined with the established speech parameters. The template library performs probability calculation and matching, and the matching result is compared with the threshold through the threshold comparison module, and finally the recognition result is obtained.

本发明中，计算概率后再进行一次阈值比较，若高于阈值者认为是正确识别结果，否则，丢弃该识别结果，并提示“请再说一遍”语音后重新输入语音命令。阈值是一个经验值，在特定的实验室环境下，经过多次实验而得出的值。In the present invention, a threshold value comparison is performed after calculating the probability. If it is higher than the threshold value, it is considered as the correct recognition result, otherwise, the recognition result is discarded, and the voice command of "please say it again" is prompted to re-input the voice command. The threshold is an empirical value, which is obtained after many experiments in a specific laboratory environment.

附图说明Description of drawings

图1语音识别系统框图；Fig. 1 speech recognition system block diagram;

图2语音信号预处理框图；Fig. 2 voice signal preprocessing block diagram;

图3DWTM计算过程框图。Figure 3. Block diagram of DWTM calculation process.

具体实施方式Detailed ways

为了更好的理解本发明，下面结合附图对本发明的实施例作详细说明：本实施例在以本发明技术方案为前提下进行实施，给出了详细的实施方式和具体操作过程，但本发明的保护范围不限于下述的实施例。In order to better understand the present invention, the embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings: this embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementation methods and specific operation processes are provided. The scope of protection of the invention is not limited to the following examples.

如图1所示，为本发明的系统框图，一种语音识别系统，包括语音输入模块、预处理模块、特征提取模块、训练模块、识别模块、识别决策模块、阈值比较模块，语音输入模块的输出端与预处理模块的输入端连接，预处理模块的输出端与特征提取模块的输入端连接，特征提取模块的输出端分别与训练模块、识别模块的输入端连接，训练模块与识别模块连接；识别模块的输出端与识别决策模块的输入端连接，识别决策模块的输出端与阈值比较模块的输入端连接。As shown in Figure 1, it is a system block diagram of the present invention, a kind of speech recognition system, comprises speech input module, preprocessing module, feature extraction module, training module, identification module, recognition decision-making module, threshold comparison module, the speech input module The output end is connected to the input end of the preprocessing module, the output end of the preprocessing module is connected to the input end of the feature extraction module, the output end of the feature extraction module is respectively connected to the input end of the training module and the identification module, and the training module is connected to the identification module ; The output end of the identification module is connected to the input end of the identification decision module, and the output end of the identification decision module is connected to the input end of the threshold comparison module.

如图2所示所述预处理模块包括顺次连接的预滤波单元、采样与量化单元、预加重单元、加窗单元、端点检测单元。As shown in FIG. 2 , the preprocessing module includes a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit, and an endpoint detection unit connected in sequence.

预处理中，先对语音信号预滤波，其目的是防止混叠干扰，预滤波实际上是一个带通滤波器，其上下截止频率分别为f_H和f_L；再对信号进行A/D转换，模拟语音信号是连续信号，无法被计算机处理，所以语音信号处理的第一步，就是要将模拟信号转换成数字信号。因此，必须经过取样和量化两个步骤，从而语音信号从麦克风录入后，经A/D转换将模拟信号转换成数字信号，再对其采样与量化，采样与量化是通过电脑或者其它数字录音设备采集的语音信号都已经经过数字化了，一般不需要用户再进行数字化处理。根据奈奎斯特采样定理：f_s＞2*f_max，以8000Hz的频率采样，分成200采样的帧，相邻帧有50％重叠；再对量化后信号进行预加重，由于语音信号的平均功率谱受声门激励和口鼻辐射的影响，高频端大约在800Hz以上按6dB/倍频程跌落，为此要对语音信号进行预加重。预加重的目的是提升高频部分，使信号的频谱变得平坦，以便于进行频谱分析或声道参数分析，预加重数字滤波器H(z)＝1-uz^-1，u为0.97；最后对信号进行加窗，由于人自身的发音器官的运动，语音信号是一种典型的非平稳信号，其特性是随时间变化的。但是，这种物理运动要比声波振动速度缓慢得多，因此，语音信号常常可假定在10～20ms这样的时间段内，语音信号是平稳信号，其频谱特性和某些物理特征参量可近似地看作是不变的；本发明中采用的是汉明窗。In the preprocessing, the voice signal is pre-filtered first, and its purpose is to prevent aliasing interference. The pre-filter is actually a band-pass filter, and its upper and lower cut-off frequencies are f _H and f _L respectively; then A/D conversion is performed on the signal , the analog voice signal is a continuous signal and cannot be processed by a computer, so the first step in voice signal processing is to convert the analog signal into a digital signal. Therefore, it must go through two steps of sampling and quantization, so that after the voice signal is recorded from the microphone, the analog signal is converted into a digital signal by A/D conversion, and then it is sampled and quantized. The sampling and quantization are performed by a computer or other digital recording equipment. The collected voice signals have already been digitized, and generally do not need to be digitized by the user. According to the Nyquist sampling theorem: f _s > 2*f _max , sampling at a frequency of 8000Hz, divided into 200 sampling frames, and adjacent frames overlap by 50%; then pre-emphasize the quantized signal, due to the average of the speech signal The power spectrum is affected by glottal excitation and mouth and nose radiation, and the high-frequency end drops by 6dB/octave above 800 Hz. For this reason, the voice signal must be pre-emphasized. The purpose of pre-emphasis is to enhance the high-frequency part and make the spectrum of the signal flat, so as to facilitate spectrum analysis or channel parameter analysis. The pre-emphasis digital filter H(z)=1-uz ^-1 , u is 0.97; finally Windowing is performed on the signal. Due to the movement of the human's own vocal organs, the speech signal is a typical non-stationary signal, and its characteristics change with time. However, this physical movement is much slower than the speed of sound wave vibration. Therefore, the speech signal can often be assumed to be a stationary signal within a time period of 10-20 ms, and its spectral characteristics and some physical characteristic parameters can be approximated. regarded as invariant; what is adopted in the present invention is the Hamming window.

所述预滤波单元用于去除原始语音信号的高频噪声；去除不必要的成分，为后面信号处理做准备，保证信号的质量与速度。The pre-filter unit is used to remove high-frequency noise of the original speech signal; remove unnecessary components, prepare for subsequent signal processing, and ensure signal quality and speed.

所述采样与量化单元采样奈奎斯特采样定理采样和量化去噪的模拟信号，获得数字信号；由于原始语音信号是混有高频噪声的模拟信号，因为对原始语音信号进行去高频噪声和数字化处理，根据Nyquist采样定理，用8000Hz的频率采样和量化，得到数字化的语音信号。The sampling and quantization unit samples the Nyquist sampling theorem to sample and quantize the denoised analog signal to obtain a digital signal; since the original voice signal is an analog signal mixed with high-frequency noise, the original voice signal is de-high-frequency noise And digital processing, according to the Nyquist sampling theorem, use 8000Hz frequency sampling and quantization to obtain digital voice signals.

所述预加重单元用于提升高频部分，让数字信号的频谱变得平坦，以便参数分析；语音信号的平均功率受声门激励和口鼻辐射的影响，高频端会出现跌落，为此为语音信号进行预加重处理，提升高频部分，让信号的频谱变得平坦，以便参数分析。预加重数字滤波器为：The pre-emphasis unit is used to enhance the high-frequency part to flatten the spectrum of the digital signal for parameter analysis; the average power of the voice signal is affected by glottal excitation and mouth and nose radiation, and the high-frequency end will drop. Perform pre-emphasis processing for the speech signal, enhance the high-frequency part, and flatten the spectrum of the signal for parameter analysis. The pre-emphasis digital filter is:

H(z)＝1-μz^-1 H(z)＝1-μz ^-1

其中，u值接近于1，在此取0.97。Among them, the value of u is close to 1, and 0.97 is taken here.

所述加窗单元用于将数字信号有限化；由于语音信号是一种非平衡信号，其特性随时间而变化。但是语音信号常常可假定在10ms～20ms时间段内，可看作平稳信号，其频谱特性也近似不变。因此对语音信号进行加窗处理，将其分成苦干个短段，每个短段称为一个分析帧。将数字化的语音信号分成200采样的帧，相邻帧有50％重叠。用汉明窗给语音信号加窗操作，汉明窗函数为：The windowing unit is used to limit the digital signal; since the voice signal is an unbalanced signal, its characteristics change with time. However, the voice signal can often be assumed to be a stationary signal within the time period of 10 ms to 20 ms, and its spectral characteristics are also approximately unchanged. Therefore, the speech signal is windowed and divided into short segments, each of which is called an analysis frame. The digitized speech signal is divided into 200-sample frames, and adjacent frames have 50% overlap. Use the Hamming window to add a window to the speech signal, and the Hamming window function is:

其中，N是每帧的采样点数，在本系统N＝200。Among them, N is the number of sampling points per frame, and N=200 in this system.

语音端点检测，目的是从包含语音的一段信号中确定出语音的起点和终点，是语音识别系统中极其关键的一步，只有准确地判定语音信号的端点，才能正确的进行语音处理。有效的端点检测不仅能使处理时间减到最少，而且能排除无声段的噪声干扰，从而使处理质量得到保证。一个好的端点检测方法能改变语音识别软件存在的检测效果不理想、识别率低等问题，能为语音识别提供可靠的基础，应具有很好的鲁棒性，能很好地区别背景噪音、非语音声音和非对话人的声音与正常对话音，减少这些声音引起的端点错误和由此引起的误打断。端点检测的高精度能保证输入识别器的信号是有效完整的语音信号，使识别效果更准确快速。常用的方法有能量法，但是在实际应用中，无法达到很高的信噪比，所以造成端点检测不精确，检测出来的语音就会不完整，影响识别的效果。在本发明中，加窗后的信号先计算其短时能量和短时平均过零率，初步检测其清音、浊音和各语音段；再经过一个多层感知器神经网络，进一步的检测，再用中值滤波来进行平滑，平滑各频谱。经实验证明，这种混合方法达到很好的端点检测效果。Speech endpoint detection, the purpose is to determine the start and end of the speech from a signal containing speech, which is an extremely critical step in the speech recognition system. Only by accurately determining the endpoint of the speech signal can the correct speech processing be performed. Effective endpoint detection can not only minimize the processing time, but also eliminate the noise interference of the silent segment, so that the processing quality is guaranteed. A good endpoint detection method can change the problems of speech recognition software such as unsatisfactory detection effect and low recognition rate, and can provide a reliable basis for speech recognition. It should have good robustness and can well distinguish background noise, Non-speech sounds and the voices of non-interlocutors and normal conversational speech, reducing endpoint errors and resulting false interruptions caused by these sounds. The high precision of endpoint detection can ensure that the signal input to the recognizer is an effective and complete speech signal, making the recognition effect more accurate and faster. The commonly used method is the energy method, but in practical applications, it cannot achieve a high signal-to-noise ratio, so the endpoint detection is not accurate, and the detected speech will be incomplete, affecting the recognition effect. In the present invention, the windowed signal first calculates its short-term energy and short-term average zero-crossing rate, and initially detects its unvoiced sound, voiced sound and each speech segment; then through a multi-layer perceptron neural network, further detection, and then Smoothing is performed using a median filter, smoothing each spectrum. Experiments have proved that this hybrid method achieves very good endpoint detection results.

所述端点检测单元用于检测语音段的起点、终点，去除不需要的静音段，提取试剂的语音信号段；是语音识别系统中极其关键的一步，只有准确地判定语音信号的端点，才能正确的进行语音处理。The endpoint detection unit is used to detect the starting point and end point of the speech segment, remove unnecessary silent segments, and extract the speech signal segment of the reagent; it is an extremely critical step in the speech recognition system. Only by accurately determining the endpoint of the speech signal can it be correct. for voice processing.

所述端点检测单元采用双门限能量法与人工神经网络相结合的方法。能量法，即短时能量法与短时平均过零率相结合的方法，只能在低信噪比(SNR)的情况下，才能达到很高的准确性。The endpoint detection unit adopts the method of combining double-threshold energy method and artificial neural network. The energy method, that is, the combination of the short-term energy method and the short-term average zero-crossing rate, can only achieve high accuracy in the case of low signal-to-noise ratio (SNR).

短时能量检测算法适合检测浊音，其计算公式：The short-term energy detection algorithm is suitable for detecting voiced sounds, and its calculation formula is:

${E E.}_{n no} = = {Σ Σ}_{m m = = - - \infty \infty}^{\infty \infty} {[[x x ((m m)) w w ((n no - - m m))]]}^{22} = = {Σ Σ}_{m m = = n no - - N N - - 11}^{n no} {[[x x ((m m)) w w ((n no - - m m))]]}^{22}$

短时平均过零率适合检测清音，其计算公式：The short-term average zero-crossing rate is suitable for detecting unvoiced sounds, and its calculation formula is:

${Z Z}_{n no} = = {Σ Σ}_{m m = = - - \infty \infty}^{\infty \infty} | | sgn sgn [[x x ((m m))]] - - sgn sgn [[x x ((m m - - 11))]] | | w w ((n no - - m m))$

$= = | | sgn sgn [[x x ((n no))]] - - sgn sgn [[x x ((n no - - 11))]] | | w w ((n no))$

其中， $sgn [x (n)] = \{\begin{matrix} 1, x (n) &GreaterEqual; 0 \\ - 1, x (n) < 0 \end{matrix} .$ in, $sgn [x (no)] = \{\begin{matrix} 1, x (no) &Greater Equal; 0 \\ - 1, x (no) < 0 \end{matrix} .$

双门限能量法是在能量法的基础上设置两个门限值，一个比较小，用于检测静音段与语音信号段的分界，另一个比较大，用于检测语音信号的强度。在此对两个门限值设置得宽松一下，以保证语音信号完整。于此，再将信号输入到人工神经网络检测器。The double-threshold energy method is based on the energy method to set two thresholds, one is relatively small, used to detect the boundary between the silent segment and the speech signal segment, and the other is relatively large, used to detect the strength of the speech signal. Here, the two thresholds are set looser to ensure the integrity of the voice signal. Here, the signal is then input to the artificial neural network detector.

人工神经网络检测器是采用多层感知人工神经网络，将其用于端点检测的优点是考虑到了语音帧之间的相关性及误差的概率最小。但主要缺点是采用这种算法很难找到能显著区分浊音和清音的语音特征，这正好与双门限能量法形成优点互补。人工神经网络是由许多处理单元(神经元)组成的互连模式反映了神经网络的结构，它决定着这个网络的能力。每一个神经元(如神经元j)接受其他神经元(如神经元i)的信息传递，总输入的关系式为：The artificial neural network detector uses a multi-layer perceptual artificial neural network. The advantage of using it for endpoint detection is that the correlation between speech frames and the probability of error are minimized. But the main disadvantage is that it is difficult to find speech features that can significantly distinguish voiced and unvoiced sounds by using this algorithm, which just complements the advantages of the dual-threshold energy method. The artificial neural network is composed of many processing units (neurons), and the interconnection pattern reflects the structure of the neural network, which determines the ability of the network. Each neuron (such as neuron j) accepts information transmission from other neurons (such as neuron i), and the relationship between the total input is:

${I I}_{j j} = = {Σ Σ}_{i i = = 11}^{n no} {w w}_{ij ij} {x x}_{i i} - - {θ θ}_{j j}$

上式中，w_ij为从神经元i到神经元j的连接权值；x_i为神经元i的输出；θ_j表示神经元j的阈值。神经元j的输出关系式为：o_j＝f(I_j)，这里函数，f(·)称为激励函数，选取S型函数为：In the above formula, w _ij is the connection weight from neuron i to neuron j; x _i is the output of neuron i; θ _j is the threshold of neuron j. The output relationship of neuron j is: o _j = f(I _j ), where the function, f( ) is called the activation function, and the S-type function is selected as:

$f f ((x x)) = = \frac{11}{11 + + {e e}^{- - x x}}$

多层感知器(multilayer perception networks，MLPN)神经网络，它是一种前向神经网络，是一具有多层神经元、前馈、误差反传结构的神经网络。通常由三部分组成：一组感知单元组成输入层；一层或多层计算节点的隐藏层；一层计算节点的输出层。输入层用于感知外界信息，输出层给出分类结果，隐藏层处理输入信息。每一个隐藏层或输出层的神经元用来进行两种计算：计算一个神经元的输出处出现的函数信号和梯度向量的估计计算，它需要反向通过网络，因此，采用反向传播法(BP算法)作为MLPN网络的标准训练算法，BP算法以一种有教师示教的方式进行学习，学习过程由正向传播过程和反向传播过程组成。首先由教师对每一种输入模式设定一个期望输出值，然后对网张输入实际的学习记忆模式(即训练样本)，并由输入层经中间层向输出层传播，实际输出与期望值的差即是误差，再根据误差由输出层往中间层逐层修正连接权值，通过这样的过程，实际输出逐渐向各自所对应的期望值输出逼近，最后确定各层间的连接权值ω_ij。Multilayer perception networks (MLPN) neural network, which is a forward neural network, is a neural network with a multi-layer neuron, feedforward, and error backpropagation structure. It usually consists of three parts: a set of perceptual units form the input layer; one or more hidden layers of computing nodes; and an output layer of one layer of computing nodes. The input layer is used to perceive external information, the output layer gives classification results, and the hidden layer processes the input information. Each neuron in the hidden layer or output layer is used to perform two calculations: calculate the function signal and gradient vector estimation calculation that appear at the output of a neuron, it needs to pass through the network in reverse, so the backpropagation method ( BP algorithm) as the standard training algorithm of MLPN network, the BP algorithm learns in a way with a teacher teaching, and the learning process consists of a forward propagation process and a back propagation process. First, the teacher sets an expected output value for each input mode, and then inputs the actual learning and memory mode (that is, training samples) to the net sheet, and propagates from the input layer to the output layer through the intermediate layer. The difference between the actual output and the expected value is It is the error, and then the connection weight is corrected layer by layer from the output layer to the middle layer according to the error. Through such a process, the actual output gradually approaches the corresponding expected value output, and finally the connection weight ω _ij between each layer is determined.

语音信号的幅度相对于背景噪声其动态范围大，可认为在语音段内随机事件多，平均信息量大，即熵值大；再者，背景噪声段的能量在各频率处的分布比较平稳，反映到信息量上认为其所含的平均信息量即谱熵比较大。因此，将经过双门限能量法处理后的语音信号，计算其幅度熵及谱熵并作为MLPN神经网络的输入，经过神经网络各层传播，最后得出检测结果。Compared with the background noise, the amplitude of the speech signal has a large dynamic range. It can be considered that there are many random events in the speech segment, and the average amount of information is large, that is, the entropy value is large; moreover, the energy distribution of the background noise segment is relatively stable at each frequency. Reflected on the amount of information, it is considered that the average amount of information contained in it, that is, the spectral entropy, is relatively large. Therefore, the speech signal processed by the double-threshold energy method is used to calculate its amplitude entropy and spectral entropy and take it as the input of the MLPN neural network, propagate through each layer of the neural network, and finally obtain the detection result.

语音识别是一个匹配的过程，首先，根据语音的特点建立模型，对输入的语音信号进行分析，抽取所需的特征，在此基础上建立起语音识别所需的模板。在识别过程中要根据语音识别的整体模型，将计算机中存放的语音模板与输入的语音信号的特征进行比较，根据一定的搜索和匹配策略，找出一系列最优的与输入的语音匹配的模板，求出识别结果，这是语音识别的原理。对语音信号的特征参数提取常用的方法有：LPCC(线性预测倒谱系数)参数、MFCC(Mel频率倒谱系数)参数、基于LPCC或MFCC一阶差分的参数等等。LPCC参数的缺点是没有利用入耳的听觉，MFCC参数虽然利用了入耳的听觉特性，但是这两个参数只反映了语音的静态特性，无法反映其动态特性，因此，提出了用一阶差分参数来描述语音的动态特性。语音信号是非平稳信号，傅立叶变换是一种对信号全局性的分析方法，无法反映语音信号的局部性，本发明引入了离散小波变换，在提取LPCC参数和MFCC参数时，用离散小波变换代替傅立叶变换，发挥小波变换的优点，能够提高语音识别系统的识别性能。为了能够提高系统的识别率，再计算其一阶差分参数，结合两者，以ΔDWTL+DWTL，ΔDWTM+DWTM作为语音信号的特征参数。其中DWTL和DWTM分别是用小波变换的LPCC和MFCC参数，ΔDWTL和ΔDWTM分别是LPCC和MFCC参数的差分参数。Speech recognition is a matching process. First, a model is established according to the characteristics of the speech, the input speech signal is analyzed, the required features are extracted, and the template required for speech recognition is established on this basis. In the recognition process, according to the overall model of speech recognition, the speech template stored in the computer is compared with the characteristics of the input speech signal, and a series of optimal matching with the input speech is found according to a certain search and matching strategy. Template to obtain the recognition result, which is the principle of speech recognition. Commonly used methods for extracting feature parameters of speech signals include: LPCC (Linear Prediction Cepstral Coefficient) parameters, MFCC (Mel Frequency Cepstral Coefficient) parameters, parameters based on LPCC or MFCC first-order difference, and so on. The disadvantage of the LPCC parameter is that it does not use the hearing of the ear. Although the MFCC parameter uses the auditory characteristics of the ear, these two parameters only reflect the static characteristics of the speech and cannot reflect its dynamic characteristics. Therefore, a first-order difference parameter is proposed. Describe the dynamics of speech. Speech signal is non-stationary signal, and Fourier transform is a kind of analysis method to the globality of signal, can't reflect the locality of speech signal, the present invention has introduced discrete wavelet transform, when extracting LPCC parameter and MFCC parameter, replace Fourier transform with discrete wavelet transform Transform, taking advantage of wavelet transform, can improve the recognition performance of the speech recognition system. In order to improve the recognition rate of the system, calculate its first-order difference parameters, combine the two, and use ΔDWTL+DWTL, ΔDWTM+DWTM as the characteristic parameters of the speech signal. Among them, DWTL and DWTM are LPCC and MFCC parameters using wavelet transform respectively, and ΔDWTL and ΔDWTM are difference parameters of LPCC and MFCC parameters respectively.

特征参数提取是指从语音信号中获得一组能够描述语音信号特征参数的过程。所提出的算法是基于线性预测倒谱参数(LPCC)和Mel频率倒谱系数(MFCC)，用离散小波变换代替离散傅立叶变换，小波变换具有时域局部性和频域局部性，并且其时频窗口可以根据不同频率自适应地调节，从而能精确地反映非平稳信号的瞬间变化；小波分析在时频平面不同位置具有不同的分辨率，是一种多分辨率分析方法。即在低频部分具有较高的频率分辨率和较低的时间分辨率，在高频部分具有较高的时间分辨率和较低的频率分辨率，因此，在时频两域都具有表征信号局部特征的能力。Feature parameter extraction refers to the process of obtaining a set of characteristic parameters that can describe the speech signal from the speech signal. The proposed algorithm is based on linear predictive cepstral parameters (LPCC) and Mel-frequency cepstral coefficients (MFCC), replacing discrete Fourier transform with discrete wavelet transform. The window can be adaptively adjusted according to different frequencies, so that it can accurately reflect the instantaneous changes of non-stationary signals; wavelet analysis has different resolutions at different positions on the time-frequency plane, and is a multi-resolution analysis method. That is, it has higher frequency resolution and lower time resolution in the low frequency part, and has higher time resolution and lower frequency resolution in the high frequency part. characteristic capabilities.

线性预测倒谱参数(LPCC)是表征说话人个性的特征参数，其模型系统函数：The linear predictive cepstral parameter (LPCC) is a characteristic parameter that characterizes the personality of the speaker, and its model system function:

$lgH lm w ((z z)) = = \overset{^^}{H h} ((z z)) = = {Σ Σ}_{n no = = 11}^{\infty \infty} \overset{^^}{h h ((n no))} {z z}^{- - n no}$

将上两式组合，再对其两边关于z求导，即有Combining the above two formulas, and then deriving both sides with respect to z, we have

$((11 - - {Σ Σ}_{k k = = 11}^{p p} {a a}_{k k} {z z}^{- - k k})) {Σ Σ}_{n no = = 11}^{\infty \infty} n no \overset{^^}{h h} ((n no)) {z z}^{- - n no + + 11} = = {Σ Σ}_{k k = = 11}^{p p} k k {a a}_{k k} {z z}^{- - k k + + 11}$

令上式左右的常数项和z^-1各次幂的系数分别相等，即得到h(n)和a_k之间的递推关系，再从预测系数a_k得出倒谱

也即得出LPCC参数。Let the constant terms around the above formula and the coefficients of each power of z ^-1 be equal to obtain the recursive relationship between h(n) and a _k , and then get the cepstrum from the prediction coefficient a _k

That is to say, the LPCC parameters are obtained.

基于小波变换的线性预测倒谱参数(简称为DWTL参数)的计算步骤：Calculation steps of linear predictive cepstrum parameters (referred to as DWTL parameters) based on wavelet transform:

(1)、原始的语音信号先要经过预处理；(1), the original voice signal must first be preprocessed;

(2)、对每帧信号进行小波包分解，计算小波包系数；(2), carry out wavelet packet decomposition to each frame signal, calculate wavelet packet coefficient;

(3)、对上一步得到的小波包系数求倒谱D_n；(3), obtain the cepstrum D _n to the wavelet packet coefficient obtained in the previous step;

(4)、将D_n合并成新的向量[D₁D₂…D_n]作为DWTL参数。(4) Merge D _n into a new vector [D ₁ D _{2 .} . . D _n ] as a DWTL parameter.

Mel频率倒谱系数(MFCC)最重要的特点就是利用了人耳的听觉原理和倒谱的解相关的特性，Mel频率与Hz频率成非线性对应关系，是利用它们之间的这种关系，将语音信号的频谱变换到感知频域。其关系式为：The most important feature of the Mel frequency cepstrum coefficient (MFCC) is the use of the auditory principle of the human ear and the decorrelation characteristics of the cepstrum. The Mel frequency and the Hz frequency have a nonlinear correspondence, which is based on the relationship between them. Transform the frequency spectrum of a speech signal into the perceptual frequency domain. Its relationship is:

${f f}_{m m} = = 25952595 lg lg ((11 + + \frac{{f f}_{Hz Hz}}{700700}))$

其中，f_m是以Mel为单位的感知频域，f_Hz是以Hz为单位的实际频域。Among them, f _m is the perceived frequency domain in Mel, and f _Hz is the actual frequency domain in Hz.

基于小波变换的Mel频率倒谱参数(简称为DWTM参数)的计算过程如图3。The calculation process of Mel frequency cepstrum parameters (abbreviated as DWTM parameters) based on wavelet transform is shown in Figure 3.

DWTM参数计算步骤：DWTM parameter calculation steps:

(2)、将预处理后的每帧信号都经过离散小波变换的处理，处理后的信号进行小波包分解，计算小波包系数；(2), each frame signal after preprocessing is all processed through discrete wavelet transform, and the signal after processing is carried out wavelet packet decomposition, calculates wavelet packet coefficient;

(3)、将每个频带上分解出的小波系数，再进行滤波；(3), filter the wavelet coefficients decomposed on each frequency band;

(4)、将上一步得到的再求取系数的能量值S(m)；(4), obtain the energy value S(m) of the coefficient again with the obtained in the previous step;

(5)、将对数频谱S(m)经过离散余弦变换DCT变换到倒谱频域。(5) Transform the logarithmic spectrum S(m) into the cepstrum frequency domain through discrete cosine transform DCT.

DWTM和DWTL参数采用的阶数为13，去掉直流分量c(0)后取12阶，滤波组为24组。以ΔDWTL+DWTL，ΔDWTM+DWTM作为语音信号的特征参数。使得动态特征和静态特征有效的结合，能够提高系统的识别率。其中DWTL和DWTM分别是用小波变换的LPCC和MFCC参数，ΔDWTL和ΔDWTM分别是DWTL和DWTM参数的差分参数。混合参数是一个24*24的矩阵。矩阵如下：The order of DWTM and DWTL parameters is 13, the order of 12 is taken after removing the DC component c(0), and the filter group is 24 groups. Take ΔDWTL+DWTL, ΔDWTM+DWTM as the characteristic parameters of the speech signal. The effective combination of dynamic features and static features can improve the recognition rate of the system. Among them, DWTL and DWTM are the LPCC and MFCC parameters using wavelet transform respectively, and ΔDWTL and ΔDWTM are the difference parameters of DWTL and DWTM parameters respectively. The mixing parameters are a 24*24 matrix. The matrix is as follows:

$ML ML = = [\begin{matrix} ΔDWTL ΔDWTL & ΔDWTM ΔDWTM \\ DWTL DWTL & DWTM DWTM \end{matrix}]$

常用的识别决策有动态时间弯折算法、Viterbi算法等，在概率统计方法中最具有代表性的就是HMM的方法，与HMM结合最为紧密的算法当属Viterbi算法。Viterbi算法是一种动态规划算法，用来寻找由观测信息产生(0bserved Event)的最可能隐状态序列(Viterbi路径)，向前算法是一个类似的算法，用来计算一串观测事件发生的概率。在识别时，对每个观测事件计算概率，概率最高的就是识别结果，有时，概率最高的并不一定是正确的识别结果，因此，本发明中，计算概率后再进行一次阈值比较，若高于阈值者认为是正确识别结果，否则，丢弃该识别结果，并提示“请再说一遍”语音后重新输入语音命令。阈值是一个经验值，在特定的实验室环境下，经过多次实验而得出的值。Commonly used identification decisions include dynamic time bending algorithm, Viterbi algorithm, etc. Among the probability statistics methods, the most representative method is the HMM method, and the algorithm most closely combined with HMM is the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm used to find the most likely hidden state sequence (Viterbi path) generated by the observed information (Observed Event). The forward algorithm is a similar algorithm used to calculate the probability of a series of observed events. . During identification, the probability is calculated for each observed event, and the highest probability is the identification result. Sometimes, the highest probability is not necessarily the correct identification result. Therefore, in the present invention, a threshold comparison is performed after calculating the probability. If it exceeds the threshold, it is considered to be the correct recognition result; otherwise, the recognition result is discarded, and the voice command "Please say it again" is prompted and the voice command is re-entered. The threshold is an empirical value, which is obtained after many experiments in a specific laboratory environment.

所述阈值比较模块用于将获得的输出概率值与设定的阈值比较，如果高于阈值则输出识别结果，否则丢弃该识别结果。The threshold comparison module is used to compare the obtained output probability value with a set threshold, and if it is higher than the threshold, the recognition result is output, otherwise, the recognition result is discarded.

语音识别系统包括预处理、端点检测、特征参数提取、识别决策及阈值比较等。建立好一个完整的语音识别系统时，接着要建立一个模板库，模板库根据实际应用的需要来选择建立，在本实施例中，建立了一个由6个人、每人对一个命令重复10次录音的模板库。模板库的语音命令包括有：方向命令，如“向左”、“向右”、“前进”、“停止”等；问好命令，如“您好”、“谢谢”；问答命令，如“你叫什么名字？”、“你来自哪里？”、“你有哪些功能？”等等。对于每个命令都包括以下过程：录入模板语音，经过预处理，检测语音端点，提取语音和特征参数，识别决策，阈值比较。The speech recognition system includes preprocessing, endpoint detection, feature parameter extraction, recognition decision-making and threshold comparison, etc. When a complete speech recognition system is set up, then a template library will be set up. The template library will be selected and established according to the needs of the actual application. template library. The voice commands of the template library include: direction commands, such as "left", "right", "forward", "stop" and so on; greeting commands, such as "hello", "thank you"; question and answer commands, such as "you What's your name?", "Where are you from?", "What functions do you have?", etc. Each command includes the following processes: input template speech, preprocessing, detection of speech endpoints, extraction of speech and feature parameters, recognition decision, threshold comparison.

1)、语音录入：将模拟语音信号数字化，再用8000HZ的采样频率对语音进行采样、量化；1) Voice recording: Digitize the analog voice signal, and then sample and quantify the voice with a sampling frequency of 8000HZ;

2)、预处理包括了：预滤波，预滤波实际上是一个带通滤波器，其目的是防止混叠干扰；预加重，预加重的目的是提升高频部分，使信号的频谱变得平坦，以便于进行频谱分析或声道参数分析，预加重数字滤波器H(z)＝1-uz^-1，u为0.97；加窗，语音信号是非平稳信号，语音信号常常可假定在10～20ms这样的时间段内，语音信号是平稳信号，其频谱特性和某些物理特征参量可近似地看作是不变的；2), pre-processing includes: pre-filtering, pre-filtering is actually a band-pass filter, its purpose is to prevent aliasing interference; pre-emphasis, the purpose of pre-emphasis is to enhance the high frequency part, so that the spectrum of the signal becomes flat , in order to carry out spectrum analysis or channel parameter analysis, the pre-emphasis digital filter H(z)=1-uz ^-1 , u is 0.97; add window, the speech signal is a non-stationary signal, and the speech signal can usually be assumed to be in 10～20ms In such a time period, the speech signal is a stationary signal, and its spectral characteristics and some physical characteristic parameters can be approximately regarded as constant;

3)、端点检测，目的是从包含语音的一段信号中确定出语音的起点和终点，在本实施例，先用能量法初步检测其端点，再经过多层感知器神经网络做进一步检测；3), endpoint detection, purpose is to determine the starting point and the end point of speech from a section signal that comprises speech, in the present embodiment, first use energy method to initially detect its endpoint, then do further detection through multi-layer perceptron neural network;

4)、提取特征参数，分别计算出DWTL参数和DWTM参数，再分别计算一阶差分参数ΔDWTL和ΔDWTM；再将它们组合成一个24*24的矩阵作为语音特征参数；4), extract characteristic parameters, calculate DWTL parameter and DWTM parameter respectively, then calculate first-order difference parameter ΔDWTL and ΔDWTM respectively; Then they are combined into a matrix of 24*24 as speech characteristic parameter;

5)、建立模板库，将提取出来的特征参数建立一个参数序列。5) Establish a template library, and establish a parameter sequence with the extracted feature parameters.

对于每一个语音命令，都经过1)～5)步骤，反复循环，最终建立一个完整的语音模板库。建立完模板库后，将进行测试，其过程如下：For each voice command, go through steps 1) to 5) and repeat the cycle to finally establish a complete voice template library. After the template library is built, it will be tested, and the process is as follows:

步骤1)、录入测试语音，对着麦克风输入语音命令，如“向右”，经过A/D将模拟语音信号转化成数字语音信号，即数字化；再对其进行采样、量化。Step 1), input test voice, input voice command to microphone, as " to the right ", through A/D, analog voice signal is converted into digital voice signal, namely digitization; Then it is sampled, quantized.

步骤2)、预处理，过程如上面步骤2)。Step 2), preprocessing, the process is as above step 2).

步骤3)、端点检测，过程如上面步骤3)。Step 3), endpoint detection, the process is as above step 3).

步骤4)、提取特征参数，过程如上面步骤4)。Step 4), extracting characteristic parameters, the process is as above step 4).

步骤5)、识别决策，对每个提取出来的参数计算概率，最大概率的为待识别结果。Step 5), identify the decision, calculate the probability for each extracted parameter, and the one with the highest probability is the result to be identified.

步骤6)、将步骤5)计算的最大概率与阈值比较，大于等于阈值的最大概率作为识别结果，小于阈值的，放弃待识别结果，提示用户再次输入语音命令。Step 6), the maximum probability calculated in step 5) is compared with the threshold value, the maximum probability greater than or equal to the threshold value is used as the recognition result, and if it is less than the threshold value, the result to be recognized is discarded, and the user is prompted to input the voice command again.

对其他测试语音命令也做同样的过程，直至全部测试语音完成测试。并发现系统的不足，然后调整参数，再测试。经过实验证明，本发明能够很好的识别语音命令，例如，对麦克风说“向左”，经过处理、识别后，机器人向左转。其他命令或问答，都能有效地识别。Do the same process for other test voice commands until all test voices are tested. And discover the deficiencies of the system, then adjust the parameters, and then test. Experiments have proved that the present invention can recognize voice commands very well, for example, say "turn left" to the microphone, after processing and recognition, the robot turns left. Other commands or questions and answers can be effectively recognized.

Claims

1. A voice recognition system comprises a voice input module (1), a preprocessing module (2), a feature extraction module (3), a training module (4) and a recognition module (5), wherein the output end of the voice input module (1) is connected with the input end of the preprocessing module (2), the output end of the preprocessing module (2) is connected with the input end of the feature extraction module (3), the output end of the feature extraction module (3) is respectively connected with the input ends of the training module (4) and the recognition module (5), and the training module (4) is connected with the recognition module (5); the device is characterized by further comprising an identification decision module (6) and a threshold comparison module (7), wherein the output end of the identification module (5) is connected with the input end of the identification decision module (6), and the output end of the identification decision module (6) is connected with the input end of the threshold comparison module (7).

2. Speech recognition system according to claim 1, characterized in that the speech input module (1) is adapted to input an original speech signal.

3. The speech recognition system according to claim 1, characterized in that the pre-processing module (2) comprises a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit, an end-point detection unit, connected in series;

the pre-filtering unit is used for removing high-frequency noise of an original voice signal;

the sampling and quantizing unit samples Nyquist sampling theorem sampling and quantizing de-noised analog signals to obtain digital signals;

the pre-emphasis unit is used for improving a high-frequency part and flattening the frequency spectrum of the signal so as to analyze parameters;

the windowing unit is used for limiting the signal;

the end point detection unit is used for detecting the starting point and the end point of the voice section, removing the unnecessary mute section and extracting the voice signal section of the reagent.

4. The speech recognition system of claim 3 wherein the endpoint detection unit employs a combination of a dual threshold energy method and an artificial neural network.

5. The speech recognition system according to claim 1, wherein the feature extraction module (3) employs a wavelet transform-based hybrid feature parameter extraction algorithm, which is wavelet transform-based linear prediction cepstrum parameters and wavelet transform-based Mel-frequency cepstrum coefficients.

6. The speech recognition system according to claim 1, characterized in that the training module (4) is a training learning method as a hidden markov model by means of the Baum-Welch algorithm.

7. The speech recognition system of claim 1, wherein the recognition decision block (6) derives the output probability by a Viterbi algorithm.

8. The speech recognition system according to claim 1, characterized in that the threshold comparison module (7) is adapted to compare the obtained output probability value with a set threshold, and to output the recognition result if the obtained output probability value is higher than the set threshold, and to discard the recognition result if the obtained output probability value is not higher than the set threshold.