
CN107342074A - Method for recognizing speech and sound - Google Patents

Method for recognizing speech and sound

Info

Publication number
CN107342074A
CN107342074A (application CN201610273827.9A)
Authority
CN
China
Prior art keywords
sound
array
voice
identified
loudness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610273827.9A
Other languages
Chinese (zh)
Other versions
CN107342074B (en)
Inventor
Wang Rong (王荣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201610273827.9A priority Critical patent/CN107342074B/en
Publication of CN107342074A publication Critical patent/CN107342074A/en
Application granted granted Critical
Publication of CN107342074B publication Critical patent/CN107342074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The present invention proposes a method for realizing speech recognition. The method is characterized in that sounds of low loudness are ignored, and that when the distance between a sound to be identified and pure speech is calculated, the result is capped at the loudness of the pure speech. The method therefore gives good recognition results in noisy environments and for words or characters of short duration.

Description

Method for recognizing speech and sound
Technical field
The invention belongs to the fields of speech recognition and sound recognition, and in particular relates to a method for recognizing speech and sound.
Background technology
Speech recognition is an important component of artificial intelligence and has a wide range of uses, but the recognition ability of current speech recognition systems in noisy environments is poor. The article "An Objective Measure for Predicting Subjective Quality of Speech Coders" (IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 10, NO. 5, JUNE 1992; hereinafter called document 1) describes a method for comparing the difference between two speech signals, but when used for speech recognition the method performs very poorly. In addition, the method requires the two signals to be fully aligned, whereas in practice speech can begin and end at any time, so aligning them in advance is almost impossible. The present invention therefore proposes a solution intended to address these problems.
Summary of the invention
A method for realizing speech recognition, in which pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, and a sound G to be identified is converted into a two-dimensional array H representing the loudness of the sound G on the Bark scale, characterized in that:
In the array F and the array H, the elements of the array F with low loudness, and the elements of the array H corresponding to those low-loudness elements of the array F, are ignored.
A method for realizing speech recognition, in which pure speech A2 is converted into a two-dimensional array F2 representing the loudness of the pure speech A2 on the Bark scale, and a sound G2 to be identified is converted into a two-dimensional array H2 representing the loudness of the sound G2 on the Bark scale, characterized in that:
When the distance between an element F2[x][y] of the array F2 and the corresponding element H2[x][y] of the array H2 is calculated, the result of the calculation is capped so that it does not exceed the value of the element F2[x][y].
Preferably, if the sound G3 to be identified differs in length from the pure speech A3, then in order to calculate whether the sound G3 contains the pure speech A3:
A segment of sound G4 with the same length as the pure speech A3 is extracted frame by frame from the sound G3, and the sound G4 is then compared with the pure speech A3.
Preferably, the pure speech A and the pure speech A2 are multiplied by a scale factor before being compared with the sound G to be identified and the sound G2 to be identified.
Compared with the prior art, the advantage of the invention is that it gives good recognition results in noisy environments and for words or characters of short duration.
Embodiment
Embodiment 1:
In speech, and in broadband sounds generally, power is not distributed equally across frequencies, and the distribution of power over frequency changes over time. It is exactly this frequency distribution, and its changes, that allow people to tell different sounds apart. Suppose a 200 Hz sinusoid and a 2000 Hz sinusoid of constant intensity sound at the same time, and the loudness of the 200 Hz sinusoid is twice that of the 2000 Hz sinusoid. In this case a human can easily recognize that the sound contains a 2000 Hz component. But if the method and formulas of document 1 are applied directly to recognition and the distance between the two sounds is calculated, the mixture will be judged far away from the 2000 Hz tone, so the 2000 Hz sound cannot be identified. If, however, a person first hears the 2000 Hz pure sine tone, he knows that its loudness at 200 Hz and at other frequencies is zero; he can therefore ignore the 200 Hz component, consider only the 2000 Hz component, and still recognize the 2000 Hz sound.
In addition, in a noisy environment, sounds whose loudness is too small are easily disturbed. Therefore, when performing speech recognition in a noisy environment, the sounds in the pure speech whose loudness is too small need to be ignored.
Now suppose there is a recorded piece of speech, for example the word "north" (北) in "Beijing" (hereinafter called A). A is 0.5 seconds long at a sample rate of 8000 Hz, so it contains 4000 samples. First, A is divided into multiple overlapping or non-overlapping frames, and each frame is windowed using a window function (for example a Hamming, Hanning or sin window). This application recommends overlap sampling of 8x or more, and windowing with a sin window function. For example, suppose each frame is 50 milliseconds long with 8x overlap; then the 1st frame consists of samples 1 to 400 of A, the 2nd frame of samples 51 to 450, the 3rd frame of samples 101 to 500, and so on. Each frame is then windowed with the sin window function. A is thereby converted into a two-dimensional array E whose elements are E[n][m], where n runs from 1 to the total number of frames of A and m runs from 1 to 400, 400 being the number of samples per frame. Below, E[x] denotes a row of E, that is, the elements E[x][1] to E[x][400].
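The framing and windowing described above can be sketched as follows. This is a minimal illustration: the application does not give the exact formula of the sin window, so the form sin(π(k + 0.5)/N) used here is an assumption, as is the helper name.

```python
import numpy as np

def frame_signal(samples, frame_len=400, step=50):
    """Split a 1-D signal into overlapping frames and apply a sin window.

    frame_len=400 corresponds to 50 ms at 8000 Hz, and step=50 gives the
    8x overlap recommended above (frame 1 = samples 1..400, frame 2 =
    samples 51..450, frame 3 = samples 101..500, ...).
    """
    window = np.sin(np.pi * (np.arange(frame_len) + 0.5) / frame_len)
    n_frames = (len(samples) - frame_len) // step + 1
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = samples[i * step : i * step + frame_len] * window
    return frames

# 0.5 s at 8000 Hz -> 4000 samples -> (4000 - 400)//50 + 1 = 73 frames
E = frame_signal(np.random.randn(4000))
```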
Each row of array E is then converted, by the method of document 1, into the loudness produced at each Bark (critical band) of the human ear, so that array E becomes an array F with elements F[n][m], where n runs from 1 to the total number of frames of A and m runs from 1 to 24, 24 being the number of Barks of the human ear. A row of F thus represents the loudness, computed by the method of document 1, that one frame of A produces at the 24 Barks of the ear. Other divisions are also feasible; for example, splitting each Bark in two, giving 48 bands, can yield better recognition. Now suppose that at another moment the speech A is played again, and that owing to noise A becomes G. G is likewise converted by the method of document 1 into an array H with elements H[n][m], where n runs from 1 to the total number of frames and m from 1 to 24; a row of H represents the loudness, computed by the method of document 1, produced at the 24 Barks of the ear. To identify whether H contains the speech A, let P = abs(H - F), where abs is the absolute-value function: that is, each element of array P equals the corresponding element of H minus the corresponding element of F, taken in absolute value.
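The conversion of E into the Bark array F can be sketched roughly as follows. This is a simplified stand-in for the loudness computation of document 1, not a reproduction of it: it only sums FFT power inside the 24 classic critical bands, and the band edges, function name, and use of raw power instead of a loudness model are assumptions made for illustration.

```python
import numpy as np

# Classic Bark critical-band edges in Hz (Zwicker); bands lying above
# the 4000 Hz Nyquist frequency of 8 kHz audio simply stay empty.
BARK_EDGES = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                       1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                       4400, 5300, 6400, 7700, 9500, 12000, 15500])

def frames_to_bark(E, sample_rate=8000):
    """Map each windowed frame to per-Bark-band power: E -> F."""
    n_frames, frame_len = E.shape
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    power = np.abs(np.fft.rfft(E, axis=1)) ** 2
    F = np.zeros((n_frames, 24))
    for b in range(24):
        band = (freqs >= BARK_EDGES[b]) & (freqs < BARK_EDGES[b + 1])
        F[:, b] = power[:, band].sum(axis=1)
    return F
```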
In order to recognize speech in a noisy environment, the elements of F whose loudness is too small must be ignored, because in a noisy environment these elements are easily disturbed and become almost unusable. As the threshold for "too small", this application recommends 1/4 to 1/2 of the maximum loudness value on any Bark in the pure speech. For the human ear, 1/4 of the loudness corresponds to only about 1/100 of the acoustic power, and even 1/2 of the loudness corresponds to only about 1/10 of the power. So although the loudness of these elements in the pure speech is not small, their actual acoustic power is very small, and they are therefore highly susceptible to interference. In a quiet environment these sounds still contribute to recognition, but in a noisy environment they become unusable. Specifically, suppose the maximum element value in array F is mf. Each element of F is checked, and if F[x][y] < mf/4, then P[x][y] = 0 and F[x][y] = 0 are set, so that these elements no longer have any influence on the result in the subsequent calculation; in other words, they are ignored.
Secondly, when calculating whether the sound to be identified contains a given piece of speech, the calculated distance should never exceed the loudness on the corresponding Bark of the pure speech. That is, each element P[x][y] of array P is checked, and if P[x][y] > F[x][y], then P[x][y] = F[x][y] is set. For example, if the calculated P[2][5] equals 0.8 and F[2][5] equals 0.5, then P[2][5] is set to 0.5.
Afterwards, the sum of all elements of array F is computed, giving sf, and the sum of all elements of array P is computed, giving sp. Let d = sp/sf. If d is less than or equal to some small value, for example 0.2, the speech A is considered to have been found in the sound G. Note that finding the speech A in the sound G does not exclude the possibility that G also contains other sounds or speech, for example other voices speaking at the same time or background music.
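The whole comparison of embodiment 1 (zeroing the quiet elements of F and the matching elements of H, capping |H - F| element-wise at F, and forming d = sp/sf) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def match_score(F, H, ignore_ratio=0.25):
    """Distance d between pure-speech array F and candidate array H.

    ignore_ratio=0.25 is the mf/4 threshold recommended above: elements
    of F below it, and the matching elements of H, are zeroed. Then
    P = |H - F| is capped element-wise at F and d = sum(P)/sum(F).
    A small d (the text suggests d <= 0.2) counts as a match.
    """
    F, H = F.copy(), H.copy()
    quiet = F < ignore_ratio * F.max()
    F[quiet] = 0.0
    H[quiet] = 0.0
    P = np.minimum(np.abs(H - F), F)
    return P.sum() / F.sum()
```

For identical arrays d = 0, while for a candidate with no energy at all the cap makes d = 1, the worst possible score.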
Embodiment 2:
Embodiment 1 already gives a good judgement, but some problems remain to be solved. For example, suppose the pure speech is 0.5 seconds long while the sound to be identified is 10 seconds long, and the speech within it may start at any moment of those 10 seconds. Embodiment 1 assumes that, before comparison, the pure speech and the sound to be identified have the same length, and that the speech occupies exactly the same position in both. The solution is to compare frame by frame. For example, suppose the sample rate of both the sound to be identified and the pure speech is 8000 Hz, with a 50 ms frame length and 8x overlap sampling, so that the frame step is 8000/(1000/50)/8 = 50 samples. If the pure speech A is 0.5 seconds long, it has 4000 samples. First, samples 1 to 4000 of the sound to be identified are taken and the method of embodiment 1 is used to judge whether they contain A. Then the 2nd frame, that is, one step further, compares samples 51 to 4050 of the sound to be identified with the pure speech; then the 3rd frame, the 4th frame, and so on. However, the same speech may then be detected repeatedly; for example, the 4000 samples starting at the 2nd frame and at the 3rd frame may both be identified as the speech A. Therefore, if the same pure speech is identified at positions that are too close together, for example only 1 or 2 frames apart, these repeated detections need to be deleted.
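The frame-by-frame search, including the deletion of repeated detections at nearly identical positions, can be sketched as follows. The per-window distance repeats the embodiment-1 calculation; min_gap and the function names are illustrative choices, not values from the application.

```python
import numpy as np

def match_score(F, H, ignore_ratio=0.25):
    # Per-window distance of embodiment 1: ignore quiet elements of F,
    # cap |H - F| element-wise at F, return sum(P)/sum(F).
    F, H = F.copy(), H.copy()
    quiet = F < ignore_ratio * F.max()
    F[quiet] = 0.0
    H[quiet] = 0.0
    return np.minimum(np.abs(H - F), F).sum() / F.sum()

def find_occurrences(F_pure, H_long, threshold=0.2, min_gap=3):
    """Slide the pure-speech array over a longer candidate array one
    frame at a time; report frame offsets where d <= threshold, and
    drop hits within min_gap frames of the previous one (the repeated
    detections the text warns about)."""
    n = F_pure.shape[0]
    hits = []
    for start in range(H_long.shape[0] - n + 1):
        d = match_score(F_pure, H_long[start:start + n])
        if d <= threshold and (not hits or start - hits[-1] >= min_gap):
            hits.append(start)
    return hits
```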
Furthermore, owing to recording conditions and similar causes, the pure speech may appear softer or louder in the sound to be identified. The loudness of the pure speech therefore also needs to be repeatedly multiplied or divided by a small factor, such as 1.05, and compared again with the sound to be identified, until the loudness of the pure speech differs too much from that of the sound to be identified, for example by more than a factor of 10, at which point the sound to be identified is unlikely to contain this pure speech.
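The loudness rescaling can be sketched as follows, with the 1.05 step and the 10x cutoff taken from the text above; the helper name, the loop structure, and the reuse of the embodiment-1 distance are illustrative.

```python
import numpy as np

def match_any_gain(F_pure, H, factor=1.05, max_ratio=10.0, threshold=0.2):
    """Retry the comparison with the pure speech made louder and softer.

    The gain is stepped by factor (1.05) until pure speech and candidate
    differ by more than max_ratio (10x); if no scaled copy matches, the
    candidate is taken not to contain the pure speech.
    """
    def score(F, H):
        # Embodiment-1 distance: ignore quiet elements of F, cap |H - F|.
        F, H = F.copy(), H.copy()
        quiet = F < 0.25 * F.max()
        F[quiet] = 0.0
        H[quiet] = 0.0
        return np.minimum(np.abs(H - F), F).sum() / F.sum()

    gain = 1.0
    while gain <= max_ratio:
        if (score(F_pure * gain, H) <= threshold
                or score(F_pure / gain, H) <= threshold):
            return True
        gain *= factor
    return False
```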
In this application, "speech" and "sound" can almost always be used interchangeably. The embodiments described above are merely preferred embodiments of the invention; the usual variations and substitutions made by those skilled in the art within the technical scheme of the present invention should all fall within the scope of the present invention.

Claims (4)

1. A method for realizing speech recognition, in which pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, and a sound G to be identified is converted into a two-dimensional array H representing the loudness of the sound G on the Bark scale, characterized in that:
In the array F and the array H, the elements of the array F with low loudness, and the elements of the array H corresponding to those low-loudness elements of the array F, are ignored.
2. A method for realizing speech recognition, in which pure speech A2 is converted into a two-dimensional array F2 representing the loudness of the pure speech A2 on the Bark scale, and a sound G2 to be identified is converted into a two-dimensional array H2 representing the loudness of the sound G2 on the Bark scale, characterized in that:
When the distance between an element F2[x][y] of the array F2 and the corresponding element H2[x][y] of the array H2 is calculated, the result of the calculation is capped so that it does not exceed the value of the element F2[x][y].
3. The method for realizing speech recognition according to claim 1 and/or claim 2, in which, if the sound G3 to be identified differs in length from the pure speech A3, then in order to calculate whether the sound G3 contains the pure speech A3, characterized in that:
A segment of sound G4 with the same length as the pure speech A3 is extracted frame by frame from the sound G3, and the sound G4 is then compared with the pure speech A3.
4. The method for realizing speech recognition according to claim 1 and/or claim 2, characterized in that:
The pure speech A and the pure speech A2 are multiplied by a scale factor before being compared with the sound G to be identified and the sound G2 to be identified.
CN201610273827.9A 2016-04-29 2016-04-29 Speech and sound recognition method Active CN107342074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610273827.9A CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method


Publications (2)

Publication Number Publication Date
CN107342074A (en) 2017-11-10
CN107342074B CN107342074B (en) 2024-03-15

Family

ID=60221815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610273827.9A Active CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method

Country Status (1)

Country Link
CN (1) CN107342074B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5864794A (en) * 1994-03-18 1999-01-26 Mitsubishi Denki Kabushiki Kaisha Signal encoding and decoding system using auditory parameters and bark spectrum
US20020062211A1 (en) * 2000-10-13 2002-05-23 Li Qi P. Easily tunable auditory-based speech signal feature extraction method and apparatus for use in automatic speech recognition
WO2003036621A1 (en) * 2001-10-22 2003-05-01 Motorola, Inc., A Corporation Of The State Of Delaware Method and apparatus for enhancing loudness of an audio signal
JP2004029215A (en) * 2002-06-24 2004-01-29 Auto Network Gijutsu Kenkyusho:Kk Speech recognition accuracy evaluation method for speech recognition device
CN1655230A (en) * 2005-01-18 2005-08-17 中国电子科技集团公司第三十研究所 Noise masking threshold algorithm based Barker spectrum distortion measuring method in objective assessment of sound quality
CN102376306A (en) * 2010-08-04 2012-03-14 华为技术有限公司 Method and device for acquiring level of speech frame
US20120233164A1 (en) * 2008-09-05 2012-09-13 Sourcetone, Llc Music classification system and method
CN103903612A (en) * 2014-03-26 2014-07-02 浙江工业大学 Method for performing real-time digital speech recognition


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
K.K. CHU et al.: "Perceptually non-uniform spectral compression for noisy speech recognition", 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 10 April 2003 (2003-04-10), pages 404-407 *
SONG Fangfang et al.: "Research on the scoring mechanism of a spoken-English self-study system based on speech recognition technology", Computer Knowledge and Technology (《电脑知识与技术》), vol. 5, no. 07, 5 March 2009 (2009-03-05), page 1728 *
YUAN Xiugan et al.: "Ergonomics" (《人机工程》), 31 August 2002, Beihang University Press, page 131 *


Similar Documents

Publication Publication Date Title
KR102118411B1 (en) Systems and methods for source signal separation
US7620546B2 (en) Isolating speech signals utilizing neural networks
Nakatani et al. Robust and accurate fundamental frequency estimation based on dominant harmonic components
Zhang et al. Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison–female voices
CN108604452A (en) Voice signal intensifier
CA2264773A1 (en) Speech processing system
US20190180758A1 (en) Voice processing apparatus, voice processing method, and non-transitory computer-readable storage medium for storing program
CN108597505A (en) Audio recognition method, device and terminal device
Enzinger et al. Empirical test of the performance of an acoustic-phonetic approach to forensic voice comparison under conditions similar to those of a real case
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
US20080120100A1 (en) Method For Detecting Target Sound, Method For Detecting Delay Time In Signal Input, And Sound Signal Processor
JP5443547B2 (en) Signal processing device
Rahaman et al. Performance analysis of isolated speech recognition technique using MFCC and cross-correlation
CN113674723B (en) Audio processing method, computer equipment and readable storage medium
Tchorz et al. Estimation of the signal-to-noise ratio with amplitude modulation spectrograms
Rao et al. Robust speaker recognition on mobile devices
CN107342074A (en) The recognition methods invention of voice and sound
JP3916834B2 (en) Extraction method of fundamental period or fundamental frequency of periodic waveform with added noise
CN109272996A (en) A kind of noise-reduction method and system
Vestman et al. Time-varying autoregressions for speaker verification in reverberant conditions
Dai et al. 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition
Lee et al. Speech Enhancement Using Phase‐Dependent A Priori SNR Estimator in Log‐Mel Spectral Domain
JP2014202777A (en) Generation device and generation method and program for masker sound signal
Moritz et al. Amplitude modulation filters as feature sets for robust ASR: constant absolute or relative bandwidth?
Kuo et al. Auditory-based robust speech recognition system for ambient assisted living in smart home

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Wang Rong

Document name: Notification of Patent Invention Entering into Substantive Examination Stage

GR01 Patent grant