
TW201015538A - Intelligent speech recognition control device - Google Patents

Intelligent speech recognition control device

Info

Publication number
TW201015538A
TW201015538A TW97139455A
Authority
TW
Taiwan
Prior art keywords
speech
component
voice
signal
sound
Prior art date
Application number
TW97139455A
Other languages
Chinese (zh)
Inventor
Mao-Lin Chen
Original Assignee
Mao-Lin Chen
Priority date
Filing date
Publication date
Application filed by Mao-Lin Chen filed Critical Mao-Lin Chen
Priority to TW97139455A priority Critical patent/TW201015538A/en
Publication of TW201015538A publication Critical patent/TW201015538A/en

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

An intelligent speech recognition control device is disclosed, which comprises a speech acquisition component, a host, a signal transmitting component and a control component. The host comprises an operator and a digital information storage unit, and the digital information storage unit is pre-stored with at least one speech command and speech recognition processing software. The speech acquisition component receives a speech message. The operator of the host executes the speech recognition processing software to process the speech message by means of a normalized fuzzy-logic Kalman filter process, an end-point detection and framing process, a cepstrum analysis, a Mel-scale process, a filter-bank process, a Mel-frequency cepstral coefficient process and a dynamic time warping algorithm, and to compare whether the speech message matches the speech command. After the speech message is confirmed to match the speech command, the signal transmitting component transmits an activation signal to the control component to activate it.

Description

201015538 IX. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a speech recognition control device, and more particularly to an expandable, intelligent speech recognition control device that uses a host.

[Prior Art]

A controller is the start-up interface for operating devices such as home appliances, machines, and toys. For the user's convenience it may take many interface forms, such as buttons, remote controls, and joysticks, letting the user press or flip a button, remote control, or joystick to operate the home appliance, machine, toy, or other device.

Such contact-type controllers rely on physical contact to trigger the controller: the user must touch them directly to operate the device. This is inconvenient in use, and non-contact controllers have therefore been developed.

Existing non-contact controllers include sound-, light-, and magnetically-triggered types, which are activated by sound, light, or a magnetic force field respectively. Depending on the operating principle, the controller is triggered when it senses sound, light, or a magnetic field, thereby forming a non-contact controller.

A conventional voice-control circuit can be triggered by sound, but it decides whether to activate merely by sensing the presence of sound. If the ambient noise is loud, the controller is easily triggered by mistake; yet if the activation threshold is raised, operation becomes difficult, which causes trouble in use. Moreover, a conventional voice-control circuit is a hardware design with almost no room for expansion, so its applications are quite limited and cannot satisfy users' needs.

[Summary of the Invention]

In view of the above shortcomings, the main purpose of the present invention is to disclose an intelligent speech recognition control device in which received sound is recognized by speech recognition processing software stored in the host, so that the controller is activated correctly.
The present invention provides an intelligent speech recognition control device comprising a sound pickup element, a host, a signal transmission element, and a control element. The host is connected to the sound pickup element and is connected to the control element through the signal transmission element. The host comprises an arithmetic unit and a digital information storage unit, and the digital information storage unit prestores at least one voice command and speech recognition processing software. The sound pickup element captures a voice message, and the arithmetic unit of the host executes the speech recognition processing software to apply to the voice message normalized fuzzy-logic Kalman filtering, endpoint detection and framing, cepstrum analysis, Mel-scale processing, filter-bank processing, Mel-frequency cepstral coefficient processing, and dynamic time warping. After this processing, the voice message is compared against the voice command; when the voice message matches the voice command, an activation signal is generated and transmitted through the signal transmission element to the control element to activate it.

Accordingly, after capturing a voice message, the present invention processes the signal and compares it with the prestored voice command, and generates the activation signal only after the comparison succeeds. The false triggering caused by ambient noise is thus eliminated without raising the sound-activation intensity threshold. Furthermore, because the invention runs in software, it has ample room for expansion and can satisfy users' needs.
[Embodiment]

The details and technical description of the present invention are explained with reference to the drawings as follows.

As shown in Fig. 2 and Fig. 3, the present invention comprises a sound pickup element 10, a host 20, a signal transmission element 30, and a control element 40. The sound pickup element 10 captures a voice message and is connected to the host 20. The host 20 comprises an arithmetic unit 21 and a digital information storage unit 22, and the digital information storage unit 22 prestores at least one voice command. The arithmetic unit 21 of the host 20 executes the speech recognition processing software to apply to the voice message normalized fuzzy-logic Kalman filtering, endpoint detection and framing, cepstrum analysis, Mel-scale processing, filter-bank processing, Mel-frequency cepstral coefficient processing, and dynamic time warping.

The Kalman filter uses a linear prediction coefficient algorithm, whose prediction formula is

y[n] = Σ_k a_k · y[n−k] + e[n]

The speech signal passes through the Kalman filter for noise removal and speech enhancement, yielding the first filtered output signal.

The Kalman filter output is then processed with a normalized least-mean-square algorithm to obtain an adaptive constant σ. Meanwhile, the mean square of the difference between the Kalman filter output and the original speech signal is taken as the feedback error, which, together with the error variation, forms the input of the fuzzy-logic operation. The parameters of the Kalman filter are tuned through fuzzy inference to obtain a better output speech signal.

The output of the normalized fuzzy-logic Kalman filter can further be fed to the linear prediction coefficient algorithm to estimate a spectral model, and the algorithm is repeated; its purpose is to operate on the sample values within one frame and find the set of linear prediction coefficients that minimizes the total error energy:

E = Σ_{n=1}^{N} e²[n]

Endpoint detection and framing comprise three steps: step 1, define the speech length; step 2, cut the signal into frames; step 3, use the zero-crossing rate to locate the speech start point and end point.

For the discrete speech signal obtained by sampling: the spectrum of ordinary speech is concentrated below 4 kHz, so, by the sampling theorem, the sampling frequency must be set to more than twice the signal bandwidth to avoid distortion. To reduce the influence of the speech signal's amplitude on the system, the signal is normalized as follows:

S_max = max |S(n)|, n = 1, 2, 3, …, N
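The error-energy minimization that yields the linear prediction coefficients is a standard least-squares problem. The following is a minimal Python sketch, not the patent's implementation: the model order, frame length, and all names are this example's assumptions.

```python
import numpy as np

def lpc_coefficients(y, p):
    """Least-squares linear prediction: find a_1..a_p that minimize
    E = sum_n (y[n] - sum_k a_k * y[n-k])^2 over one frame."""
    # Each row holds the p past samples [y[n-1], ..., y[n-p]]
    A = np.array([y[n - p:n][::-1] for n in range(p, len(y))])
    b = y[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# Synthetic AR(2) "frame": y[n] = 0.6*y[n-1] - 0.2*y[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(2000)
y = np.zeros(2000)
for n in range(2, len(y)):
    y[n] = 0.6 * y[n - 1] - 0.2 * y[n - 2] + e[n]
a = lpc_coefficients(y, 2)  # should be close to [0.6, -0.2]
```

Because the residual of the least-squares fit is exactly the prediction error e[n], minimizing its energy recovers the generating coefficients when the model order matches.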

其中k為最大振幅,而在每個樣本點除以·^後,整 個語音訊號振幅值會被設定於-1到1之間。 為減少不必要的資訊,與提升端點偵測的反應時間, 將所取得之語音訊號,定出語音起點與終點。因為語音是 一個時變訊號,可由觀察知道語音訊號在短時間變化是相 當緩慢的,故取得語音端點資訊後,系統以固定取樣點來 形成一個音框(Frame),作後續處理與分析,其圖形如圖 ❹4-2所示。不管辨識為連續語句或單字,端點偵測皆依賴 短時距能量(short time energy)與越零率(zer〇 cr〇ssing rate)兩項作為偵測標準,其說明如下: (1)短時距能量:是切割出一小段一小段的語音音 框,每一音框的短時距能量將定義為時間λ内的#個訊^二 本之能量取絕對值後再相加,是為五(幻,如下式所示;若邱 超過預先設定之能量臨界值,將視為此音框含有語音資訊 的,而真正說話的段落,能量將比其他靜音部份高Υ 、° 9 201015538 N-l E(k) = Y^(n + m)\ OT=0 其中#:表示一個音框長度 灸:是每個音框編號 «:取樣點時間 :音框中的取樣點 ··第Μ固音框的短時距能量 (2)越零率:是訊號通過原點的次數,亦即相鄰訊號 樣本的振幅,在一正一負變化下,其越零率累計值加一, 如下式所示。而越零率的偵測是為輔助短時距能量偵測於 判斷上的不足處,如說話時的摩擦音、鼻音、子音等,在 能量表現上並不足以超越短時距能量偵測的臨界值,因此 會有端點誤判現象。Where k is the maximum amplitude, and after each sample point is divided by ^^, the amplitude of the entire speech signal is set between -1 and 1. In order to reduce the unnecessary information, and to improve the reaction time of the endpoint detection, the obtained voice signal is determined as the beginning and end of the voice. Because the voice is a time-varying signal, it can be observed that the voice signal changes slowly in a short time. Therefore, after obtaining the voice endpoint information, the system forms a frame with fixed sampling points for subsequent processing and analysis. The figure is shown in Figure 4-2. Regardless of whether it is recognized as a continuous sentence or a single word, endpoint detection relies on short time energy and ZER〇cr〇ssing rate as detection criteria. The description is as follows: (1) Short Time-distance energy: It is a small sound segment that cuts out a short period of time. 
The short-term energy of each frame will be defined as the time of the # 讯 ^ 二 能量 能量 能量 取 取 取 取 取 取Five (illusion, as shown in the following formula; if Qiu exceeds the preset energy threshold, it will be considered that the sound box contains voice information, and the real speaking paragraph will be higher than other silent parts, ° 9 201015538 Nl E(k) = Y^(n + m)\ OT=0 where #: indicates a length of the frame moxibustion: is the number of each frame «: sampling point time: sampling point in the sound box · · Μ Μ The short-time energy of the frame (2) is zero: the number of times the signal passes through the origin, that is, the amplitude of the adjacent signal sample. Under a positive-negative change, the cumulative value of the zero-crossing rate is increased by one, as shown in the following equation. The zero-rate detection is used to assist the short-term energy detection in the judgment of the deficiency, such as the friction sound when talking, nose Sounds, consonants, etc., are not enough to exceed the critical value of short-term energy detection in terms of energy performance, so there will be false positives.

Z(k) = (1/2) Σ_{m=0}^{N−1} | sgn[S(n+m)] − sgn[S(n+m−1)] |

sgn[S(n)] = 1 if S(n) ≥ 0; −1 if S(n) < 0

where N is the frame length, k the frame index, n the sample time, m the sample index within the frame, and Z(k) the zero-crossing rate of the k-th frame.

In principle, the zero-crossing rate distinguishes voiced from unvoiced speech. Unvoiced sounds such as fricatives concentrate most of their energy above 3 kHz, so their zero-crossing rate is high; conversely, voiced speech has a low zero-crossing rate. Endpoint detection therefore first uses short-time energy to find the rough beginning and end of voiced speech, and then uses the zero-crossing rate to locate the true start and end. The rules are as follows. If E(k) is below the low energy threshold, the frame is considered non-speech. If E(k) exceeds the low energy threshold and also the high energy threshold, the frame is taken as the start of a speech segment. If E(k) is below the high energy threshold, the zero-crossing-rate threshold is used as an auxiliary criterion: only when Z(k) also exceeds its threshold can the frame be judged the start of speech. The end point is found by searching backward from the tail of the signal; the first frame whose energy exceeds the low energy threshold is taken as the end of the speech segment.

Cepstrum analysis is used because speech recognition needs effective coefficients for comparison; in speech signal analysis, the spectrum in the frequency domain is therefore observed to determine the cepstral coefficients.
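The short-time-energy and zero-crossing-rate rules above can be sketched as follows. This is a simplified illustration: the thresholds, frame length, and function names are hypothetical choices of this example, and a real system would tune them empirically.

```python
import numpy as np

def short_time_energy(frame):
    # E(k): sum of absolute sample values over the frame
    return np.sum(np.abs(frame))

def zero_crossing_rate(frame):
    # Z(k): half the count of sign changes between adjacent samples
    s = np.where(frame >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(s)))

def detect_endpoints(signal, frame_len, e_low, e_high, z_thresh):
    """Frame-wise start/end detection following the rules in the text:
    energy above the high threshold marks speech; frames with lower
    energy but a high zero-crossing rate (e.g. fricatives) are kept."""
    n_frames = len(signal) // frame_len
    speech = []
    for k in range(n_frames):
        f = signal[k * frame_len:(k + 1) * frame_len]
        e, z = short_time_energy(f), zero_crossing_rate(f)
        speech.append(e > e_high or (e > e_low and z > z_thresh))
    idx = [k for k, s in enumerate(speech) if s]
    return (idx[0], idx[-1]) if idx else None

# Silence - tone burst - silence: only the burst frames should be kept
t = np.arange(8000) / 8000.0
sig = np.concatenate([np.zeros(2000),
                      np.sin(2 * np.pi * 440 * t[:4000]),
                      np.zeros(2000)])
start, end = detect_endpoints(sig, 400, e_low=5.0, e_high=50.0, z_thresh=100)
# The 440 Hz burst occupies frames 5..14 of the 20 frames
```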

The spectrum is computed and an inverse transform then returns the signal to the time domain, yielding a new set of parameters, the cepstral coefficients. For a speech sequence x(n), n = 0, 1, 2, …, N−1, the discrete Fourier transform is

X(k) = Σ_{n=0}^{N−1} x(n) · e^{−j2πkn/N}, 0 ≤ k ≤ N−1

and the coefficients derived from it form the feature vector of a speech frame.

The Mel is a unit of perceived pitch. Experimentally, below 1 kHz human perception of pitch is linear in frequency, while above 1 kHz it becomes logarithmic. The relationship between the two is

f_mel = 2595 · log10(1 + f / 700)

where f_mel is the Mel-scale frequency value, the scale used for pitch measurement, and f is the actual frequency, corresponding to the pitch perceived by the listener.

To represent the characteristics of the speech signal for comparison, the input speech signal is processed by a filter bank whose outputs form a set of parameters representing the speech features. The speech signal is fed into a bank of 24 band-pass filters that simulate auditory perception. The filters are non-uniformly spaced in frequency: the lower the frequency region, the more filters are used and the narrower their bandwidths, since ordinary listeners' pitch perception is concentrated there. The filter bank is designed as follows:

(a) Frequency range below 1 kHz: linear, equally spaced; a triangular-filter center frequency is set every 100 Hz.
(b) Frequency range above 1 kHz: logarithmically spaced; the triangular-filter center frequency grows by a factor of about 1.2.
(c) The peak value of each triangular filter is 1.
(d) The highest triangular-filter center frequency is below the signal bandwidth, i.e., below half the sampling frequency; currently the highest center frequency is 3.9 kHz.
(e) Each triangular-filter center frequency is the center frequency of a critical band, and its two cut-off frequencies are the center frequencies of the two adjacent critical bands.

Mathematically, with input X_t(k), the filter bank output is

Y_t(m) = Σ_{k=B_m−Δ_m}^{B_m+Δ_m} φ_m(k) · X_t(k)

where B_m is the center frequency of each band-pass filter; each value starts where the previous filter ends, expressed as

B_m = B_{m−1} + Δ_{m−1}

and Δ_m denotes the half-bandwidth of each band-pass filter measured from its center frequency, so that each filter has bandwidth 2Δ_m. Each filter's bandwidth is 1.2 times that of the previous one:

Δ_m = 1.2 · Δ_{m−1}

The frequency response φ_m(k) of a single band-pass filter then follows: each filter begins at the center point of the previous filter, each bandwidth increases by a factor of 1.2, and the frequency response has a triangular shape.

After filter-bank processing, M output values are produced on the logarithmic spectrum, where M is the number of filters in the bank: each frequency energy |X_t(k)|² is multiplied by the filter weight, the products are accumulated to give the energy passing through the filter, and the logarithm is taken:

S(m) = log( Σ_k φ_m(k) · |X_t(k)|² )
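A rough illustration of a Mel-spaced triangular filter bank with unit peaks, per rule (c). One stated assumption: this sketch spaces the filters uniformly on the Mel axis, a common simplification, whereas the text grows each bandwidth by a fixed factor of 1.2. All names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale from the text: f_mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def triangular_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centers are equally spaced on the Mel
    axis; each filter rises from the previous center and falls to the
    next, with peak value 1."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # peak of 1, then falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = triangular_filterbank(24, 512, 8000)
# Log filter-bank energies of one frame's power spectrum |X(k)|^2
frame = np.random.default_rng(1).standard_normal(512)
spectrum = np.abs(np.fft.rfft(frame)) ** 2
log_energy = np.log(fb @ spectrum + 1e-10)
```

Applying the 24-filter bank to a 512-point power spectrum yields the 24 log energies S(m) that feed the cepstral step described next.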

Because the values of the logarithmic intensity spectrum are real and symmetric, applying a discrete cosine transform to the logarithmic energies of all M filter outputs yields the Mel cepstrum. The Mel-frequency cepstral coefficient formula is

c(i) = Σ_{m=1}^{M} S(m) · cos( i · (m − 1/2) · π / M )

where M is the total number of frequency bands, m the band index, and c(i) the Mel-frequency cepstral coefficients (MFCC) of the speech signal.

Dynamic time warping serves to improve recognition. The present invention adopts dynamic time warping, an effective nonlinear time-alignment template-matching method. Dynamic time warping is a similarity computation: after feature extraction and feature compression, one or several templates are produced for each pattern, and recognition computes the similarity between the feature vectors of the pattern under test and each template to decide which class the pattern belongs to.
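The discrete-cosine-transform step that turns the M log filter-bank energies into Mel cepstral coefficients can be sketched directly from the formula above (a minimal illustration; names and the random stand-in energies are hypothetical):

```python
import numpy as np

def mfcc_from_log_energies(log_e, n_coeffs):
    """Discrete cosine transform of the M log filter-bank energies:
      c(i) = sum_{m=1..M} S(m) * cos(i * (m - 1/2) * pi / M)"""
    M = len(log_e)
    m = np.arange(1, M + 1)
    return np.array([np.sum(log_e * np.cos(i * (m - 0.5) * np.pi / M))
                     for i in range(1, n_coeffs + 1)])

# 24 filter-bank log energies (random stand-ins here), 12 coefficients
log_e = np.random.default_rng(2).standard_normal(24)
c = mfcc_from_log_energies(log_e, 12)
```

Keeping only the first dozen or so coefficients is the usual practice, since the lower-order cepstral coefficients carry most of the spectral-envelope information used for comparison.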
The method sets a reference template feature vector sequence A = {a_1, a_2, …, a_I} and an input speech feature vector sequence B = {b_1, b_2, …, b_J}. The aim of dynamic time warping is to find an optimal time-normalization function that nonlinearly maps the time axis of the speech template under test onto the time axis of the reference template so that the accumulated distortion is minimized. The warping path is expressed as

C = {c(1), c(2), …, c(K)}

where K is the path length and c(n) = (i(n), j(n)) denotes the n-th matching point, formed by the i(n)-th feature vector of the reference template and the j(n)-th feature vector of the template under test. The distance (or distortion) between the two is d(c(n)), the local matching distance.

The dynamic time warping algorithm minimizes the weighted distance sum through local optimization:

D_min = min_C [ Σ_{n=1}^{K} d(c(n)) · w_n / Σ_{n=1}^{K} w_n ]

where w_n is a weighting function, chosen according to the following two factors:

(1) It is selected according to the direction of the local path one step before the n-th matching point, and local paths are restricted to the 45° direction to accommodate the slope constraint.

(2) Different parts of the speech are given different weights to emphasize certain distinguishing features.

After the voice message is processed by the speech recognition processing software, it can be compared against the voice command to determine whether it matches. The host 20 may also be connected to a sound output element 50 which, after the sound pickup element 10 captures the voice message, replays the voice message for the user to confirm; and when the voice message matches the voice command, the speech recognition processing software generates an activation signal.
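As a concrete illustration of the accumulated-distance minimization, the following is a minimal unweighted DTW sketch (not the patent's implementation: the Euclidean local distance, the absence of the w_n weighting, and all names are this example's choices):

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic time warping between feature sequences A (reference)
    and B (test): minimal accumulated local distance d(c(n)) along a
    monotonic warping path, filled in by dynamic programming."""
    I, J = len(A), len(B)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])  # local distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J]

# A time-stretched copy of a template should match it more closely
# than an unrelated sequence does.
template = np.array([[0.0], [1.0], [2.0], [3.0]])
stretched = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]])
other = np.array([[5.0], [5.0], [5.0], [5.0]])
d_same = dtw_distance(template, stretched)
d_diff = dtw_distance(template, other)
```

In a recognizer, the voice message's MFCC sequence would be compared this way against each stored voice-command template, and the command with the smallest accumulated distance taken as the match.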
The signal transmission element 30 is a signal transmission line connected to the host 20 to receive the activation signal from the host 20; the control element 40 is connected to the signal transmission element 30 and is activated upon receiving the activation signal.

As shown in Fig. 4, the disclosed signal transmission element 31 may instead comprise a wireless transmitter 311 and a wireless receiver 312, the wireless transmitter 311 being connected to the host 20 and the wireless receiver 312 to the control element 40. As shown in Fig. 5 and Fig. 6, the wireless transmitter 311 and the wireless receiver 312 may be an RF transmitter and an RF receiver. The device thus forms a wireless voice control, which increases convenience of use.

As described above, after the sound pickup element 10 captures a voice message, the arithmetic unit 21 of the host 20 executes the speech recognition processing software to process the signal and compare it with the prestored voice command; the activation signal is generated only after the comparison succeeds, activating the control element 40. The voice-activation threshold therefore need not be raised: under high ambient noise the device is not falsely triggered, interference from ambient noise is avoided, and voice activation still works correctly. Moreover, because the device runs in software mode, it has room for future expansion and can satisfy users' needs.

The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. All equivalent changes and modifications made within the scope of the patent application of the present invention are covered by the scope of the invention.

17 201015538 * 【圖式簡單說明】 圖1,習知聲控控制器電路。 圖2,本發明系統架構圖。 • 圖3,本發明取音元件的實施電路圖。 圖4,本發明另一實施方式的系統架構圖。 圖5,本發明無線發射元件電路圖。 圖6,本發明無線接收元件電路圖。 【主要元件符號說明】 ❹習知 1 :聲控控制電路 本發明 10 :取音元件 20 :主機 21 :運算器 22 :數位資訊儲存單元 ^ 30、31 :訊號傳輸元件 311 :無線發射器 312 :無線接收器 40 :控制元件 50 :聲音輸出元件 1817 201015538 * [Simple diagram of the diagram] Figure 1, a conventional voice controller circuit. Figure 2 is a diagram showing the system architecture of the present invention. • Fig. 3 is a circuit diagram showing the implementation of the sound pickup element of the present invention. 4 is a system architecture diagram of another embodiment of the present invention. Figure 5 is a circuit diagram of a wireless transmitting device of the present invention. Figure 6 is a circuit diagram of a wireless receiving device of the present invention. [Main component symbol description] ❹ 知知1: Voice control circuit 10: Sound pickup component 20: Host 21: Operator 22: Digital information storage unit ^ 30, 31: Signal transmission component 311: Wireless transmitter 312: Wireless Receiver 40: Control Element 50: Sound Output Element 18

Claims (1)

201015538 X. Claims:

1. An intelligent speech recognition control device, comprising:
a sound pickup element for capturing a voice message;
a host comprising an arithmetic unit and a digital information storage unit, the digital information storage unit prestoring at least one voice command and speech recognition processing software, wherein the arithmetic unit of the host executes the speech recognition processing software to apply to the voice message normalized fuzzy-logic Kalman filtering, endpoint detection and framing, cepstrum analysis, Mel-scale processing, filter-bank processing, Mel-frequency cepstral coefficient processing, and dynamic time warping, the processed voice message being compared against the voice command, and the speech recognition processing software generating an activation signal when the voice message matches the voice command;
a signal transmission element connected to the host to receive the activation signal from the host; and
a control element connected to the signal transmission element and activated upon receiving the activation signal.

2. The intelligent speech recognition control device of claim 1, wherein the signal transmission element is a signal transmission line.

3. The intelligent speech recognition control device of claim 1, wherein the signal transmission element comprises a wireless transmitter and a wireless receiver, the wireless transmitter being connected to the speech recognition processing software and the wireless receiver to the control element.

4. The intelligent speech recognition control device of claim 3, wherein the wireless transmitter and the wireless receiver are an RF transmitter and an RF receiver.

5. The intelligent speech recognition control device of claim 1, wherein the host is connected to a sound output element which, after the sound pickup element captures the voice message, replays the voice message.
TW97139455A 2008-10-15 2008-10-15 Intelligent speech recognition control device TW201015538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97139455A TW201015538A (en) 2008-10-15 2008-10-15 Intelligent speech recognition control device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW97139455A TW201015538A (en) 2008-10-15 2008-10-15 Intelligent speech recognition control device

Publications (1)

Publication Number Publication Date
TW201015538A true TW201015538A (en) 2010-04-16

Family

ID=44830090

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97139455A TW201015538A (en) 2008-10-15 2008-10-15 Intelligent speech recognition control device

Country Status (1)

Country Link
TW (1) TW201015538A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI557728B (en) * 2015-01-26 2016-11-11 宏碁股份有限公司 Speech recognition apparatus and speech recognition method
TWI566242B (en) * 2015-01-26 2017-01-11 宏碁股份有限公司 Speech recognition apparatus and speech recognition method
CN106205637A (en) * 2015-03-02 2016-12-07 智原科技股份有限公司 Noise detection method and device for audio signal
TWI576834B (en) * 2015-03-02 2017-04-01 聯詠科技股份有限公司 Method and apparatus for detecting noise of audio signals
CN106205637B (en) * 2015-03-02 2019-12-10 联咏科技股份有限公司 Noise detection method and device for audio signal

Similar Documents

Publication Publication Date Title
Sahidullah et al. Introduction to voice presentation attack detection and recent advances
US10877727B2 (en) Combining results from first and second speaker recognition processes
US10504539B2 (en) Voice activity detection systems and methods
US11735191B2 (en) Speaker recognition with assessment of audio frame contribution
Strope et al. A model of dynamic auditory perception and its application to robust word recognition
JP5015939B2 (en) Method and apparatus for acoustic outer ear characterization
US20080071540A1 (en) Speech recognition method for robot under motor noise thereof
WO2019233228A1 (en) Electronic device and device control method
WO2014153800A1 (en) Voice recognition system
JP5041934B2 (en) robot
CN113643707B (en) Authentication method, device and electronic device
JP2005244968A (en) Method and apparatus for multi-sensor speech improvement on mobile devices
Wang et al. Spectral-temporal receptive fields and MFCC balanced feature extraction for robust speaker recognition
Sun et al. A supervised speech enhancement method for smartphone-based binaural hearing aids
JP2004199053A (en) Method for processing speech signal by using absolute loudness
CN103400578B (en) Anti-noise voiceprint recognition device with joint treatment of spectral subtraction and dynamic time warping algorithm
Mahesha et al. LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies
Al-Karawi et al. Model selection toward robustness speaker verification in reverberant conditions
CN110390953A (en) It utters long and high-pitched sounds detection method, device, terminal and the storage medium of voice signal
Alonso-Martin et al. Multidomain voice activity detection during human-robot interaction
TW201015538A (en) Intelligent speech recognition control device
Wong Authentication through sensing of tongue and lip motion via smartphone
Ravindran et al. Audio classification and scene recognition and for hearing aids
Yang et al. Coarse-to-fine target speaker extraction based on contextual information exploitation
Vestman et al. Time-varying autoregressions for speaker verification in reverberant conditions