CN109036381A - Speech processing method and apparatus, computer device, and readable storage medium - Google Patents
Speech processing method and apparatus, computer device, and readable storage medium
- Publication number
- CN109036381A (Application CN201810897646.2A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- voice signal
- characteristic parameter
- unit
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 72
- 238000009434 installation Methods 0.000 title claims abstract description 28
- 238000001228 spectrum Methods 0.000 claims description 42
- 238000000605 extraction Methods 0.000 claims description 37
- 230000009466 transformation Effects 0.000 claims description 24
- 238000004590 computer program Methods 0.000 claims description 22
- 238000004422 calculation algorithm Methods 0.000 claims description 14
- 238000009432 framing Methods 0.000 claims description 12
- 230000009467 reduction Effects 0.000 claims description 8
- 230000005236 sound signal Effects 0.000 claims description 8
- 238000013507 mapping Methods 0.000 claims description 6
- 230000006870 function Effects 0.000 description 24
- 230000008569 process Effects 0.000 description 16
- 238000005452 bending Methods 0.000 description 6
- 230000008859 change Effects 0.000 description 6
- 238000007781 pre-processing Methods 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 5
- 238000010606 normalization Methods 0.000 description 5
- 238000006243 chemical reaction Methods 0.000 description 4
- 230000003068 static effect Effects 0.000 description 4
- 230000001755 vocal effect Effects 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000007476 Maximum Likelihood Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 2
- 230000007812 deficiency Effects 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 230000009977 dual effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000005284 excitation Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 230000005855 radiation Effects 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000000630 rising effect Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 238000010183 spectrum analysis Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
A speech processing method, the method comprising: preprocessing a speech signal; extracting characteristic parameters from the preprocessed speech signal; decoding the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracting abstract sentences from the sentence-level text via a hidden Markov model (HMM). The present invention also provides a speech processing apparatus, a computer device, and a computer-readable storage medium. The present invention can recognize speech and remove useless information from the speech recognition result.
Description
Technical field
The present invention relates to the field of computer audio technology, and in particular to a speech processing method and apparatus, a computer device, and a computer-readable storage medium.
Background art
In intelligent conference systems, speech recognition is a key technology: it converts a person's speech signal into text information that a computer can recognize and output.
However, existing intelligent conference systems only convert speech into text and cannot further process the recognized text information. The text information converted directly from speech may contain useless information, such as sentences unrelated to the conference content.
Summary of the invention
In view of the foregoing, it is necessary to provide a speech processing method and apparatus, a computer device, and a computer-readable storage medium that can recognize speech and remove useless information from the speech recognition result.
A first aspect of the present application provides a speech processing method, the method comprising:
preprocessing a speech signal;
extracting characteristic parameters from the preprocessed speech signal;
decoding the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and
extracting abstract sentences from the sentence-level text via a hidden Markov model (HMM).
In a possible implementation, extracting abstract sentences from the sentence-level text via the hidden Markov model HMM specifically comprises:
obtaining the observation state sequence O = {O_1, O_2, ..., O_n} of the sentence-level text;
determining the hidden states of the HMM;
performing HMM parameter estimation to obtain a trained HMM;
labeling the sentences via the Viterbi algorithm according to the trained HMM to obtain each sentence's degree of conformity with an abstract sentence; and
extracting the sentences meeting a preset degree of conformity from the sentence-level text to obtain the abstract sentences of the sentence-level text.
In another possible implementation, preprocessing the speech signal comprises detecting the effective speech in the speech signal, which specifically comprises:
windowing and framing the speech signal to obtain the speech frames of the speech signal;
performing a discrete Fourier transform on each speech frame to obtain the spectrum of the speech frame;
calculating the accumulated energy of each frequency band according to the spectrum of the speech frame;
taking the logarithm of the accumulated energy of each frequency band to obtain the log accumulated energy of each frequency band; and
comparing the log accumulated energy of each frequency band with a preset threshold to obtain the effective speech.
In another possible implementation, the characteristic parameters include initial mel-frequency cepstral coefficient (MFCC) characteristic parameters, first-order difference MFCC characteristic parameters, and second-order difference MFCC characteristic parameters.
In another possible implementation, the method further comprises:
performing dimensionality reduction on the characteristic parameters to obtain reduced-dimension characteristic parameters.
In another possible implementation, extracting characteristic parameters from the preprocessed speech signal comprises extracting mel-frequency cepstral coefficient (MFCC) characteristic parameters from the preprocessed speech signal, which specifically comprises:
calculating, using the cutoff-frequency mapping equation of a bilinear-transform low-pass filter, the frequency warping factor that aligns the third formant of each speaker;
adjusting, via the bilinear transform according to the frequency warping factor, the positions and widths of the triangular filter bank used in MFCC characteristic parameter extraction; and
calculating the vocal-tract-normalized MFCC characteristic parameters according to the adjusted triangular filter bank.
In another possible implementation, extracting characteristic parameters from the preprocessed speech signal comprises extracting mel-frequency cepstral coefficient (MFCC) characteristic parameters from the preprocessed speech signal, which specifically comprises:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the magnitude spectrum of the speech frame to obtain the power spectrum of the speech frame;
passing the power spectrum of the speech frame through a triangular filter bank uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter;
taking the logarithm of the outputs of all the triangular filters to obtain the log power spectrum of the speech frame; and
applying a discrete cosine transform to the log power spectrum to obtain the initial MFCC characteristic parameters of the speech frame.
A second aspect of the present application provides a speech processing apparatus, the apparatus comprising:
a preprocessing unit for preprocessing a speech signal;
a feature extraction unit for extracting characteristic parameters from the preprocessed speech signal;
a decoding unit for decoding the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and
an abstract extraction unit for extracting abstract sentences from the sentence-level text via a hidden Markov model (HMM).
A third aspect of the present application provides a computer device comprising a processor, the processor being configured to implement the speech processing method when executing a computer program stored in a memory.
A fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the speech processing method when executed by a processor.
The present invention preprocesses a speech signal; extracts characteristic parameters from the preprocessed speech signal; decodes the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts abstract sentences from the sentence-level text via a hidden Markov model (HMM). The present invention not only converts the speech information into text but also extracts and outputs the abstract sentences of the text, removing the useless information from the speech recognition result and thereby obtaining a better speech processing result.
Detailed description of the invention
Fig. 1 is a flowchart of the speech processing method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of the speech processing apparatus provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the computer device provided by an embodiment of the present invention.
Specific embodiment
To better understand the objects, features, and advantages of the present invention, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with each other.
In the following description, numerous specific details are set forth to facilitate a full understanding of the present invention; the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification of the present invention are intended only to describe specific embodiments and are not intended to limit the present invention.
Preferably, the speech processing method of the present invention is applied in one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device, or the like.
Embodiment one
Fig. 1 is a flowchart of the speech processing method provided by Embodiment one of the present invention. The speech processing method is applied to a computer device. The speech processing method recognizes text in units of sentences from a speech signal and extracts abstract sentences from the sentence-level text.
As shown in Fig. 1, the speech processing method specifically comprises the following steps:
Step 101: preprocess a speech signal.
The speech signal may be an analog speech signal or a digital speech signal. If the speech signal is an analog speech signal, the analog speech signal is converted into a digital speech signal by analog-to-digital conversion.
The present invention is used for continuous speech recognition, i.e., it processes a continuous audio stream. In one embodiment of the present invention, the speech processing method is applied in an intelligent conference system, and the speech signal is a speaker's speech signal input to the intelligent conference system through a speech input device (such as a microphone or a mobile-phone microphone).
Preprocessing the speech signal may include pre-emphasizing the speech signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the spectrum of the signal. Owing to the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end: in general, the higher the frequency, the smaller the amplitude, with the power spectrum amplitude falling at about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on a speech signal, the high-frequency part of the signal needs to be boosted, i.e., the speech signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably taking a value between 0.94 and 0.97.
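For illustration only (this sketch is not part of the patent text), the filter H(z) = 1 − κz^(−1) reduces in the time domain to the difference y[n] = x[n] − κ·x[n−1]; a minimal Python/NumPy sketch, assuming κ = 0.97 from the stated 0.94 to 0.97 range:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, kappa: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - kappa * z^-1 in the time domain:
    y[n] = x[n] - kappa * x[n-1] (the first sample is kept unchanged)."""
    return np.append(signal[0], signal[1:] - kappa * signal[:-1])
```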
Preprocessing the speech signal may also include windowing and framing the speech signal.
A speech signal is a non-stationary, time-varying signal, broadly divided into two classes: voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, and the vocal-tract parameters all vary slowly over time, so within an interval of about 10 ms to 30 ms a speech signal can be considered short-term stationary. In speech signal processing, the speech signal can be divided into short segments (i.e., short-term stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by windowing the speech signal. To avoid excessively large variation between two adjacent frames, adjacent frames need to overlap by a portion. In one embodiment of the present invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames; that is, a speech frame is taken every 10 milliseconds.
Common window functions include the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1, 0 ≤ n ≤ N − 1;
the Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
the Hanning window function is:
w(n) = 0.5·[1 − cos(2πn/(N − 1))], 0 ≤ n ≤ N − 1;
where N is the number of sampling points contained in a speech frame.
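The framing and windowing described above can be sketched as follows (an illustrative sketch, not part of the patent; the 25 ms frame and 10 ms shift are taken from the embodiment above, and NumPy's np.hamming implements the Hamming window just given):

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split the signal into overlapping frames (25 ms frames taken every
    10 ms, i.e. 15 ms overlap) and apply a Hamming window to each frame."""
    frame_len = int(fs * frame_ms / 1000)    # N sampling points per frame
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift:i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```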
Preprocessing the speech signal may also include detecting the effective speech in the speech signal.
The purpose of detecting effective speech is to reject non-effective speech (i.e., non-speech segments) from the speech signal and obtain the effective speech (i.e., speech segments), thereby reducing the computational load of feature extraction, shortening the speech recognition time, and improving the recognition rate. Effective speech detection can be performed according to the short-time energy, short-time zero-crossing rate, and the like of the speech signal.
In one embodiment, suppose the n-th speech frame of the speech signal is x_n(m). The short-time energy is:
E_n = Σ_{m=0}^{N−1} x_n(m)^2;
the short-time zero-crossing rate is:
Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|;
where sgn[·] is the sign function:
sgn[x] = 1 for x ≥ 0, and sgn[x] = −1 for x < 0.
The beginning and end of the effective speech in the speech signal can be detected using the double-threshold method, which is well known in the art and is not described in detail here.
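For illustration, the two short-time measures defined above can be computed per frame as in the following sketch (the double-threshold endpoint decision itself is omitted):

```python
import numpy as np

def short_time_energy(frame: np.ndarray) -> float:
    """E_n = sum over m of x_n(m)^2."""
    return float(np.sum(frame ** 2))

def short_time_zcr(frame: np.ndarray) -> float:
    """Z_n = 1/2 * sum over m of |sgn[x_n(m)] - sgn[x_n(m-1)]|,
    with sgn[x] = 1 for x >= 0 and -1 otherwise."""
    signs = np.where(frame >= 0, 1, -1)
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))
```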
In another embodiment, the effective speech in the speech signal can be detected by the following method:
(1) Window and frame the speech signal to obtain the speech frames x(n) of the speech signal. In a specific embodiment, a Hamming window may be applied to the speech signal with a frame length of 20 ms and a frame shift of 10 ms. If the speech signal has already been windowed and framed during preprocessing, this step is omitted.
(2) Perform a discrete Fourier transform (DFT) on each speech frame x(n) to obtain the spectrum of the speech frame:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k ≤ N − 1.
(3) Calculate the accumulated energy of each frequency band according to the spectrum of the speech frame x(n):
E(m) = Σ_{k=m_1}^{m_2} |X(k)|^2,
where E(m) denotes the accumulated energy of the m-th frequency band and (m_1, m_2) denote the start and end frequency bins of the m-th frequency band.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain the log accumulated energy of each frequency band.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the effective speech. If the log accumulated energy of a frequency band is higher than the preset threshold, the speech corresponding to that band is effective speech.
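Steps (1) to (5) can be sketched as follows; this is an illustrative sketch rather than the patented implementation, and the number of bands and the threshold value are assumptions that a real system would calibrate:

```python
import numpy as np

def effective_frame_mask(frames: np.ndarray, n_fft: int = 512,
                         n_bands: int = 8, threshold: float = -30.0) -> np.ndarray:
    """Return a boolean mask over frames: True where at least one frequency
    band's log accumulated energy exceeds the preset threshold."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2           # step (2): |X(k)|^2
    edges = np.linspace(0, power.shape[1], n_bands + 1, dtype=int)
    band_energy = np.stack([power[:, edges[m]:edges[m + 1]].sum(axis=1)
                            for m in range(n_bands)], axis=1)   # step (3): E(m)
    log_energy = np.log(band_energy + 1e-12)                    # step (4)
    return (log_energy > threshold).any(axis=1)                 # step (5)
```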
Step 102: extract characteristic parameters from the preprocessed speech signal.
Characteristic parameter extraction analyzes the speech signal and extracts a sequence of acoustic parameters reflecting the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) coefficients.
In one embodiment of the present invention, the MFCC characteristic parameters of the speech signal may be extracted. The steps of extracting the MFCC characteristic parameters are as follows:
(1) Perform a discrete Fourier transform (DFT; a fast Fourier transform may be used) on each speech frame to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain the power spectrum of the speech frame.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale (i.e., a triangular filter bank) and obtain the output of each triangular filter. The center frequencies of this filter bank are evenly spaced on the Mel frequency scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent triangular filters. The center frequency of the m-th triangular filter is:
f(m) = (N/F_s)·B^(−1)( B(f_l) + m·(B(f_h) − B(f_l))/(M + 1) ), m = 1, 2, ..., M;
the frequency response of the m-th triangular filter rises linearly from 0 at f(m−1) to 1 at f(m), falls linearly back to 0 at f(m+1), and is 0 elsewhere;
where f_h and f_l are the highest and lowest frequencies covered by the filter bank; N is the number of points of the Fourier transform; F_s is the sampling frequency; M is the number of triangular filters; and B^(−1)(b) = 700·(e^(b/1125) − 1) is the inverse of the Mel mapping f_mel = B(f) = 1125·ln(1 + f/700).
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (DCT) to S(m) to obtain the initial MFCC characteristic parameters of the speech frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M−1} S(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L,
where L is the number of MFCC coefficients retained.
MFCC extraction introduces a triangular filter bank whose filters are densely distributed at low frequencies and sparsely distributed at high frequencies, matching the characteristics of human hearing, so MFCC features retain good recognition performance even in noisy environments.
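Putting steps (1) to (5) together, the following NumPy sketch illustrates the computation under the Mel mapping B(f) = 1125·ln(1 + f/700) quoted above; the FFT size, filter count, and coefficient count are assumed values, not prescribed by the patent:

```python
import numpy as np

def mel(f):      # B(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):  # B^-1(b) = 700 * (exp(b / 1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mfcc(frames, fs, n_fft=512, n_filters=26, n_ceps=13):
    """Initial MFCCs: DFT -> power spectrum -> Mel triangular filter bank ->
    logarithm -> DCT, keeping the first n_ceps coefficients per frame."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # steps (1)-(2)
    # Step (3): centers evenly spaced on the Mel scale; each filter's base
    # points sit at the centers of its two neighbours.
    hz = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    log_s = np.log(power @ fbank.T + 1e-12)                    # step (4): S(m)
    n = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * n * (np.arange(n_filters) + 0.5) / n_filters)
    return log_s @ dct.T                                       # step (5): DCT
```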
The steps of extracting the MFCC characteristic parameters may further include:
(6) Extract the dynamic difference MFCC characteristic parameters of the speech frame from its initial MFCC characteristic parameters. The initial MFCC characteristic parameters reflect only the static characteristics of the speech; the dynamic characteristics of the speech can be described by the difference spectrum of the static features, and combining static and dynamic features can effectively improve the recognition performance of the system. First-order and/or second-order difference MFCC characteristic parameters are usually used.
In one embodiment, the extracted MFCC characteristic parameters form a 39-dimensional feature vector, including 13 initial MFCC characteristic parameters, 13 first-order difference MFCC characteristic parameters, and 13 second-order difference MFCC characteristic parameters.
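A sketch of step (6) follows; using np.gradient as the difference operator is an assumption made here for brevity (regression-window deltas are an equally common choice):

```python
import numpy as np

def add_deltas(ceps: np.ndarray) -> np.ndarray:
    """Append first- and second-order difference MFCCs along the time axis,
    turning 13-dimensional frames into the 39-dimensional vectors above."""
    delta = np.gradient(ceps, axis=0)     # first-order difference
    delta2 = np.gradient(delta, axis=0)   # second-order difference
    return np.concatenate([ceps, delta, delta2], axis=1)
```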
In one implementation of the present invention, after the characteristic parameters are extracted from the preprocessed speech signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean dimensionality reduction algorithm is applied to the characteristic parameters (e.g., the MFCC characteristic parameters) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
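The segment-mean dimensionality reduction algorithm is named but not detailed in the text; the following sketch shows one plausible reading, offered purely as an assumption, in which frame-level features are averaged within a fixed number of segments:

```python
import numpy as np

def segment_mean_reduce(features: np.ndarray, n_segments: int = 10) -> np.ndarray:
    """Reduce an (n_frames, dim) feature sequence to (n_segments, dim) by
    averaging within equal-length segments; assumes n_frames >= n_segments."""
    edges = np.linspace(0, len(features), n_segments + 1, dtype=int)
    return np.stack([features[edges[i]:edges[i + 1]].mean(axis=0)
                     for i in range(n_segments)])
```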
Step 103: decode the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a hidden Markov model, an artificial neural network model, a support vector machine classification model, or the like. The speech recognition model may also be a combination of two or more models.
In one embodiment of the present invention, the speech recognition model is a hidden Markov model (HMM). The HMM includes an acoustic model and a language model.
Acoustic model: phonemes are modeled with hidden Markov models. In the speech field, the recognition unit is not the word but the sub-word; sub-words are the basic acoustic units of the acoustic model. In English, the sub-words are phonemes: for a specific word, the corresponding acoustic model is formed by splicing together multiple phonemes according to the pronunciation rules found in a pronunciation dictionary. In Chinese, the sub-words are initials and finals. Each sub-word can be modeled with an HMM containing multiple states. For example, each phoneme can be modeled with an HMM of up to six states, and each state can use a Gaussian mixture model (GMM) to fit the corresponding observation frames, which are combined chronologically into an observation sequence. Each acoustic model can generate observation sequences of varying lengths, i.e., a one-to-many mapping.
Language model: during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because it is difficult to determine word boundaries accurately and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore usually use a language model P(w) to select the most likely word sequence from the many candidate results, supplementing the deficiency of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical rules and even semantic rules and uses these rules to exclude acoustically recognized results that do not conform to the grammatical or semantic rules. A statistical language model, by contrast, describes the dependencies between words via statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches for an optimal path in the state network; the speech corresponds to this path with maximum probability. In this embodiment, a dynamic programming algorithm (the Viterbi algorithm) is used to find the globally optimal path.
Suppose the characteristic parameters extracted from the speech signal form a feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y.
The decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y), that is:
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this is converted to:
w_best = argmax_w { P(Y|w)·P(w)/P(Y) }.
Since the observation probability P(Y) is constant for a given observation sequence, this can be further simplified to:
w_best = argmax_w { P(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood P(Y|w) is determined by the acoustic model. The w corresponding to the maximum posterior probability P(w|Y) is obtained from the above calculation.
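As a toy illustration of the final formula (the candidate word sequences and all scores below are invented, not taken from the patent), the acoustic model slightly prefers the second hypothesis, but the language-model prior P(w) flips the decision, which is precisely the supplementing role described above:

```python
# (log P(Y|w), log P(w)) per candidate word sequence w
candidates = {
    "the meeting starts at nine": (-120.3, -8.1),
    "the meeting starts at night": (-119.8, -11.4),
    "them eating starts at nine": (-121.0, -14.9),
}
# w_best = argmax_w { P(Y|w) * P(w) }, computed in the log domain
w_best = max(candidates, key=lambda w: sum(candidates[w]))
print(w_best)  # -> "the meeting starts at nine"
```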
Step 104: extract abstract sentences from the sentence-level text via a hidden Markov model (HMM).
Through step 103, the speech signal is decoded into text in units of sentences; in a conventional speech recognition system, the speech recognition work would now be complete. The present method further extracts abstract sentences from the recognized sentence-level text.
The purpose of extracting abstract sentences is to extract the important information from the speech and reject the useless information.
The method extracts abstract sentences via an HMM. Here, the dual stochastic relationship of the HMM can be described as follows: one stochastic relationship is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as an abstract sentence, which is not observable. The process of extracting abstract sentences with the HMM can thus be described as: given the sentence sequence O = {O_1, O_2, ..., O_n}, determine the maximum likelihood of whether each sentence is an abstract sentence. The main steps are as follows:
(1) Obtain the observation state sequence O = {O_1, O_2, ..., O_n} of the sentence-level text.
(2) Determine the HMM hidden states. Five hidden states may be set: "1" = conforming, "2" = relatively conforming, "3" = average, "4" = relatively non-conforming, and "5" = non-conforming, successively indicating the degree to which a sentence conforms to an abstract sentence.
(3) Perform HMM parameter estimation. Initial probability parameters are generated randomly; through repeated iteration, the calculation stops when a set threshold is reached, yielding suitable HMM parameters.
(4) According to the trained HMM, label the sentences via the Viterbi algorithm to obtain each sentence's degree of conformity with an abstract sentence.
(5) Extract the sentences meeting a preset degree of conformity (for example, at least "relatively conforming") from the sentence-level text to obtain the abstract sentences of the sentence-level text.
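A minimal sketch of steps (4) and (5) follows, under clearly flagged assumptions: the HMM parameters are random placeholders standing in for those estimated in step (3), and the integer observations stand for quantized per-sentence features, an encoding the patent does not specify:

```python
import numpy as np

def viterbi_labels(log_pi, log_A, log_B, obs):
    """Label each sentence with one of the five conformity states
    (index 0 = "conforming" ... index 4 = "non-conforming")."""
    T, S = len(obs), log_A.shape[0]
    delta = np.empty((T, S))                  # best log-score per state
    psi = np.zeros((T, S), dtype=int)         # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
log_pi = np.log(np.full(5, 0.2))                      # placeholder start probs
log_A = np.log(np.full((5, 5), 0.2))                  # placeholder transitions
log_B = np.log(rng.dirichlet(np.ones(20), size=5))    # placeholder emissions
labels = viterbi_labels(log_pi, log_A, log_B, obs=[3, 7, 3, 12, 9])  # step (4)
abstract_idx = [i for i, s in enumerate(labels) if s <= 1]           # step (5)
```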
The speech processing method of Embodiment one preprocesses a speech signal; extracts characteristic parameters from the preprocessed speech signal; decodes the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts abstract sentences from the sentence-level text via a hidden Markov model (HMM). Embodiment one not only converts the speech information into text but also extracts and outputs the abstract sentences of the text, removing the useless information from the speech recognition result and thereby obtaining a better speech processing result.
In another embodiment, when extracting the MFCC characteristic parameters, vocal tract length normalization (VTLN) may be performed to obtain vocal-tract-length-normalized MFCC characteristic parameters.
The vocal tract can be represented as a cascade of acoustic tubes, each of which can be regarded as a resonant cavity whose resonance frequency depends on the tube's length and shape. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length. For example, vocal tract length generally ranges from about 13 cm (adult female) to 18 cm (adult male); consequently, the formant frequencies of the same vowel spoken by different speakers differ greatly. VTLN eliminates the difference between male and female vocal tract lengths so that the recognition result is not disturbed by gender.
VTLN matches the formant frequencies of different speakers by warping and shifting the frequency axis. In this embodiment, a VTLN method based on the bilinear transform may be used. This method does not warp the spectrum of the speech signal directly; instead, it uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate the frequency warping factor that aligns the third formant of each speaker; adjusts the positions (e.g., the start, middle, and end points of each triangular filter) and widths of the triangular filter bank via the bilinear transform according to the warping factor; and calculates the vocal-tract-normalized MFCC characteristic parameters with the adjusted triangular filter bank. For example, to compress the spectrum of the speech signal, the scale of the triangular filters is stretched, extending and shifting the filter bank to the left; to stretch the spectrum of the speech signal, the scale of the triangular filters is compressed, compressing and shifting the filter bank to the right. When the bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or a specific person, only a linear transformation of the triangular filter bank coefficients is needed, without warping the signal spectrum every time characteristic parameters are extracted, which greatly reduces the computation. Moreover, the method avoids a linear search over the warping factor, reducing the computational complexity. Meanwhile, the bilinear transform keeps the warped frequency continuous without changing the bandwidth.
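As an illustrative sketch (not the patent's exact procedure), the standard first-order all-pass (bilinear) warping ω' = ω + 2·arctan(α·sin ω / (1 − α·cos ω)) can be applied to the edge frequencies of the triangular filters, so that only the filter bank moves while the signal spectrum is left untouched; estimating the warping factor α from the third formant is not shown, and the example values are assumptions:

```python
import numpy as np

def bilinear_warp_hz(freq_hz: np.ndarray, alpha: float, fs: float) -> np.ndarray:
    """First-order all-pass (bilinear) frequency warping
    w' = w + 2*arctan(alpha*sin(w) / (1 - alpha*cos(w))),
    applied in normalized frequency; alpha = 0 leaves frequencies unchanged."""
    w = 2.0 * np.pi * freq_hz / fs
    w_warped = w + 2.0 * np.arctan2(alpha * np.sin(w), 1.0 - alpha * np.cos(w))
    return w_warped * fs / (2.0 * np.pi)

# Only the filter bank moves: warp each triangular filter's start, middle,
# and end frequencies once per speaker instead of re-warping every spectrum.
edges_hz = np.array([200.0, 400.0, 700.0, 1100.0])    # example filter edges (Hz)
warped = bilinear_warp_hz(edges_hz, alpha=0.05, fs=16000.0)
```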
Embodiment two
Fig. 2 is a structural diagram of the speech processing apparatus provided by Embodiment two of the present invention. As shown in Fig. 2, the speech processing apparatus 10 may include a preprocessing unit 201, a feature extraction unit 202, a decoding unit 203, and an abstract extraction unit 204.
The preprocessing unit 201 is used to preprocess a speech signal.
The speech signal may be an analog speech signal or a digital speech signal. If the speech signal is an analog speech signal, the analog speech signal is converted into a digital speech signal by analog-to-digital conversion.
The present invention is used for continuous speech recognition, i.e., it processes a continuous audio stream. In one embodiment of the present invention, the speech processing method is applied in an intelligent conference system, and the speech signal is a speaker's speech signal input to the intelligent conference system through a speech input device (such as a microphone or a mobile-phone microphone).
Preprocessing the speech signal may include pre-emphasizing the speech signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the spectrum of the signal. Owing to the influence of glottal excitation and mouth-nose radiation, the energy of a speech signal drops noticeably at the high-frequency end: in general, the higher the frequency, the smaller the amplitude, with the power spectrum amplitude falling at about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on a speech signal, the high-frequency part of the signal needs to be boosted, i.e., the speech signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably taking a value between 0.94 and 0.97.
Preprocessing the speech signal may also include windowing and framing the speech signal.
A speech signal is a non-stationary, time-varying signal, broadly divided into two classes: voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, and the vocal-tract parameters all vary slowly over time, so within an interval of about 10 ms to 30 ms a speech signal can be considered short-term stationary. In speech signal processing, the speech signal can be divided into short segments (i.e., short-term stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by windowing the speech signal. To avoid excessively large variation between two adjacent frames, adjacent frames need to overlap by a portion. In one embodiment of the present invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames; that is, a speech frame is taken every 10 milliseconds.
Common window functions include the rectangular window, the Hamming window, and the Hanning window: the rectangular window is w(n) = 1; the Hamming window is w(n) = 0.54 − 0.46·cos(2πn/(N − 1)); the Hanning window is w(n) = 0.5·[1 − cos(2πn/(N − 1))]; each for 0 ≤ n ≤ N − 1, where N is the number of sampling points contained in a speech frame.
Preprocessing the speech signal may also include detecting the effective speech in the speech signal.
The purpose of detecting effective speech is to reject non-effective speech (i.e., non-speech segments) from the speech signal and obtain the effective speech (i.e., speech segments), thereby reducing the computational load of feature extraction, shortening the speech recognition time, and improving the recognition rate. Effective speech detection can be performed according to the short-time energy, short-time zero-crossing rate, and the like of the speech signal.
In one embodiment, suppose the n-th speech frame of the speech signal is x_n(m). The short-time energy is E_n = Σ_{m=0}^{N−1} x_n(m)^2, and the short-time zero-crossing rate is Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|, where sgn[·] is the sign function taking the value 1 for x ≥ 0 and −1 for x < 0.
The beginning and end of the effective speech in the speech signal can be detected using the double-threshold method, which is well known in the art and is not described again here.
In another embodiment, the effective speech in the speech signal can be detected by the following method:
(1) Window and frame the speech signal to obtain the speech frames x(n) of the speech signal. In a specific embodiment, a Hamming window may be applied to the speech signal with a frame length of 20 ms and a frame shift of 10 ms. If the speech signal has already been windowed and framed during preprocessing, this step is omitted.
(2) Perform a discrete Fourier transform (DFT) on each speech frame x(n) to obtain its spectrum: X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k ≤ N − 1.
(3) Calculate the accumulated energy of each frequency band according to the spectrum: E(m) = Σ_{k=m_1}^{m_2} |X(k)|^2, where E(m) denotes the accumulated energy of the m-th frequency band and (m_1, m_2) denote its start and end frequency bins.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain the log accumulated energy of each frequency band.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the effective speech. If the log accumulated energy of a frequency band is higher than the preset threshold, the speech corresponding to that band is effective speech.
The feature extraction unit 202 is used to extract characteristic parameters from the preprocessed speech signal.
Characteristic parameter extraction analyzes the speech signal and extracts a sequence of acoustic parameters reflecting the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) coefficients.
In one embodiment of the present invention, the MFCC characteristic parameters of the speech signal may be extracted. The steps of extracting the MFCC characteristic parameters are as follows:
(1) Perform a discrete Fourier transform (DFT; a fast Fourier transform may be used) on each speech frame obtained by the preprocessing unit 201 to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain the power spectrum of the speech frame.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale (i.e., a triangular filter bank) and obtain the output of each triangular filter. The center frequencies of this filter bank are evenly spaced on the Mel frequency scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent triangular filters. The center frequency and frequency response of the triangular filters are as given in Embodiment one, where f_h and f_l are the highest and lowest frequencies covered by the filter bank, N is the number of points of the Fourier transform, F_s is the sampling frequency, M is the number of triangular filters, and B^(−1)(b) = 700·(e^(b/1125) − 1) is the inverse of the Mel mapping f_mel.
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (DCT) to S(m), as given in Embodiment one, to obtain the initial MFCC characteristic parameters of the speech frame.
MFCC extraction introduces a triangular filter bank whose filters are densely distributed at low frequencies and sparsely distributed at high frequencies, matching the characteristics of human hearing, so MFCC features retain good recognition performance even in noisy environments.
The steps of extracting the MFCC characteristic parameters may further include:
(6) Extract the dynamic difference MFCC characteristic parameters of the speech frame. The initial MFCC characteristic parameters reflect only the static characteristics of the speech; the dynamic characteristics of the speech can be described by the difference spectrum of the static features, and combining static and dynamic features can effectively improve the recognition performance of the system. First-order and/or second-order difference MFCC characteristic parameters are usually used.
In one embodiment, the extracted MFCC characteristic parameters form a 39-dimensional feature vector, including 13 initial MFCC characteristic parameters, 13 first-order difference MFCC characteristic parameters, and 13 second-order difference MFCC characteristic parameters.
In one implementation of the present invention, after the characteristic parameters are extracted from the preprocessed speech signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean dimensionality reduction algorithm is applied to the characteristic parameters (e.g., the MFCC characteristic parameters) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
The decoding unit 203 is used to decode the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a hidden Markov model, an artificial neural network model, a support vector machine classification model, or the like. The speech recognition model may also be a combination of two or more models.
In one embodiment of the present invention, the speech recognition model is a hidden Markov model (HMM). The HMM includes an acoustic model and a language model.
Acoustic model: phonemes are modeled with hidden Markov models. In the speech field, the recognition unit is not the word but the sub-word; sub-words are the basic acoustic units of the acoustic model. In English, the sub-words are phonemes: for a specific word, the corresponding acoustic model is formed by splicing together multiple phonemes according to the pronunciation rules found in a pronunciation dictionary. In Chinese, the sub-words are initials and finals. Each sub-word can be modeled with an HMM containing multiple states. For example, each phoneme can be modeled with an HMM of up to six states, and each state can use a Gaussian mixture model (GMM) to fit the corresponding observation frames, which are combined chronologically into an observation sequence. Each acoustic model can generate observation sequences of varying lengths, i.e., a one-to-many mapping.
Language model: during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because it is difficult to determine word boundaries accurately and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore usually use a language model P(w) to select the most likely word sequence from the many candidate results, supplementing the deficiency of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical rules and even semantic rules and uses these rules to exclude acoustically recognized results that do not conform to the grammatical or semantic rules. A statistical language model, by contrast, describes the dependencies between words via statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches for an optimal path in the state network; the speech corresponds to this path with maximum probability. In this embodiment, a dynamic programming algorithm (the Viterbi algorithm) is used to find the globally optimal path.
Suppose the characteristic parameters extracted by the feature extraction unit 202 form a feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y; that is, it solves for the parameter w that maximizes the posterior probability P(w|Y):
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this is converted to w_best = argmax_w { P(Y|w)·P(w)/P(Y) }; since the observation probability P(Y) is constant for a given observation sequence, this can be further simplified to:
w_best = argmax_w { P(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood P(Y|w) is determined by the acoustic model. The w corresponding to the maximum posterior probability P(w|Y) is obtained from the above calculation.
The abstract extraction unit 204 is used to extract abstract sentences from the sentence-level text.
The decoding unit 203 decodes the speech signal into text in units of sentences; in a conventional speech recognition system, the speech recognition work would now be complete. In the present invention, the abstract extraction unit 204 further extracts abstract sentences from the recognized sentence-level text.
The purpose of extracting abstract sentences is to extract the important information from the speech and reject the useless information.
The abstract extraction unit 204 extracts abstract sentences via an HMM. Here, the dual stochastic relationship of the HMM can be described as follows: one stochastic relationship is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as an abstract sentence, which is not observable. The process of extracting abstract sentences with the HMM can thus be described as: given the sentence sequence O = {O_1, O_2, ..., O_n}, determine the maximum likelihood of whether each sentence is an abstract sentence. The main steps are as follows:
(1) Obtain the observation state sequence O = {O_1, O_2, ..., O_n} of the sentence-level text.
(2) Determine the HMM hidden states. Five hidden states may be set: "1" = conforming, "2" = relatively conforming, "3" = average, "4" = relatively non-conforming, and "5" = non-conforming, successively indicating the degree to which a sentence conforms to an abstract sentence.
(3) Perform HMM parameter estimation. Initial probability parameters are generated randomly; through repeated iteration, the calculation stops when a set threshold is reached, yielding suitable HMM parameters.
(4) According to the trained HMM, label the sentences via the Viterbi algorithm to obtain each sentence's degree of conformity with an abstract sentence.
(5) Extract the sentences meeting a preset degree of conformity (for example, at least "relatively conforming") from the sentence-level text to obtain the abstract sentences of the sentence-level text.
The speech processing apparatus 10 of Embodiment two preprocesses a speech signal; extracts characteristic parameters from the preprocessed speech signal; decodes the speech signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts abstract sentences from the sentence-level text via a hidden Markov model (HMM). Embodiment two not only converts the speech information into text but also extracts and outputs the abstract sentences of the text, removing the useless information from the speech recognition result and thereby obtaining a better speech processing result.
In another embodiment, feature extraction unit 202 can carry out sound channel length and return when extracting MFCC characteristic parameter
One changes (Vocal Tract Length Normalization, VTLN), obtains the normalized MFCC characteristic parameter of sound channel length.
Sound channel can be expressed as cascade vocal tube model, and each sound pipe can regard a resonant cavity, their resonance as
Frequency depends on the length and shape of sound pipe.Therefore, the part acoustic difference between speaker is since the sound channel of speaker is long
Degree is different.For example, the variation range of sound channel length generally changes to 18cm (adult male) from 13cm (adult female), therefore,
Dissimilarity others say that the same formant frequency of vowel differs greatly.VTLN is exactly to eliminate male, female's sound channel length
Difference makes the result of accents recognition not by the interference of gender.
VTLN matches the formant frequencies of different speakers by warping and shifting the frequency axis. In this embodiment, a VTLN method based on the bilinear transform may be used. This method does not warp the spectrum of the speech signal directly; instead, it uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate, for each speaker, a frequency warping factor that aligns the third formants of different speakers. According to the frequency warping factor, the positions (for example the start point, middle point and end point of each triangular filter) and the widths of the triangular filter bank are adjusted by the bilinear transform, and the vocal-tract-normalized MFCC characteristic parameters are calculated with the adjusted triangular filter bank. For example, to compress the spectrum of the speech signal, the scale of the triangular filters is stretched, so the filter bank is widened and shifted to the left; to stretch the spectrum, the scale of the triangular filters is compressed, so the filter bank is narrowed and shifted to the right. When the bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or a specific person, only a linear transformation of the triangular filter bank coefficients is required; the signal spectrum does not have to be warped every time characteristic parameters are extracted, which greatly reduces the amount of computation. Moreover, the method avoids a linear search over frequency factors, reducing computational complexity, and because it uses the bilinear transform, the warped frequency axis is continuous and the bandwidth is unchanged.
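As a sketch of this filter-bank adjustment, the Python fragment below warps the edge frequencies of a mel triangular filter bank with the standard bilinear (all-pass) frequency map, instead of warping the signal spectrum itself; the per-speaker estimation of the warping factor alpha from the third formant is omitted, and the function names and default values are illustrative assumptions.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    # All-pass (bilinear-transform) frequency warping on [0, pi], |alpha| < 1;
    # the sign and size of alpha set the direction and amount of warping,
    # and the endpoints 0 and pi stay fixed, so bandwidth is preserved.
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_mel_filterbank(n_filters, n_fft, sr, alpha=0.0):
    # Mel-spaced edge frequencies (n_filters + 2 of them); warping each edge
    # adjusts the start, middle and end points -- and hence the widths -- of
    # every triangular filter, as described above.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    omega = np.pi * edges_hz / (sr / 2.0)
    bins = np.floor((n_fft // 2) * bilinear_warp(omega, alpha) / np.pi).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb
```

Applying this filter bank to each frame's power spectrum (for example `fb @ power_spectrum`), then taking logarithms and a discrete cosine transform, yields MFCC parameters normalized for whatever alpha was estimated for the speaker; changing speakers only changes the filter-bank coefficients, not the spectrum computation.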
Embodiment Three
This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps in the above speech processing method embodiments are realized, for example steps 101-104 shown in Fig. 1:
Step 101: pre-process the speech signal;
Step 102: extract characteristic parameters from the pre-processed speech signal;
Step 103: according to the characteristic parameters, decode the speech signal with a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
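The labeling in step 104, spelled out in claim 2 (observation sequence, hidden states, parameter estimation, Viterbi labeling, conformity threshold), can be sketched as follows; this is a minimal hand-rolled Viterbi decoder over a two-state HMM in which the state names, the toy observation symbols and the parameter values are assumptions for illustration, and HMM training is not shown.

```python
import numpy as np

ABSTRACT, NON_ABSTRACT = 0, 1     # hypothetical hidden states of the HMM

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete-observation HMM.
    obs: one observation symbol per sentence; pi: initial state probabilities;
    A: state transition matrix; B: emission matrix (states x symbols)."""
    T, n = len(obs), len(pi)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, n))                 # best log-probability so far
    psi = np.zeros((T, n), dtype=int)        # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j]: state i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy usage: five sentences, each quantized to a discrete feature symbol.
sentences = ["s1", "s2", "s3", "s4", "s5"]
obs = [2, 0, 1, 0, 2]                               # assumed quantized features
pi = np.array([0.3, 0.7])                           # assumed trained parameters
A = np.array([[0.4, 0.6], [0.2, 0.8]])
B = np.array([[0.1, 0.2, 0.7], [0.5, 0.4, 0.1]])
labels = viterbi(obs, pi, A, B)
summary = [s for s, st in zip(sentences, labels) if st == ABSTRACT]
```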
Alternatively, when the computer program is executed by a processor, the functions of the modules/units in the above apparatus embodiment are realized, for example units 201-204 in Fig. 2:
the pre-processing unit 201, configured to pre-process the speech signal;
the feature extraction unit 202, configured to extract characteristic parameters from the pre-processed speech signal;
the decoding unit 203, configured to decode the speech signal with a pre-trained speech recognition model according to the characteristic parameters to obtain text in units of sentences;
the abstract extraction unit 204, configured to extract abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
Embodiment Four
Fig. 3 is a schematic diagram of the computer device provided by this embodiment of the present invention. The computer device 1 includes a memory 20, a processor 30, and a computer program 40, such as a speech processing program, stored in the memory 20 and executable on the processor 30. When executing the computer program 40, the processor 30 realizes the steps in the above speech processing method embodiments, for example steps 101-104 shown in Fig. 1:
Step 101: pre-process the speech signal;
Step 102: extract characteristic parameters from the pre-processed speech signal;
Step 103: according to the characteristic parameters, decode the speech signal with a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
Alternatively, when the processor 30 executes the computer program 40, the functions of the modules/units in the above apparatus embodiment are realized, for example units 201-204 in Fig. 2:
the pre-processing unit 201, configured to pre-process the speech signal;
the feature extraction unit 202, configured to extract characteristic parameters from the pre-processed speech signal;
the decoding unit 203, configured to decode the speech signal with a pre-trained speech recognition model according to the characteristic parameters to obtain text in units of sentences;
the abstract extraction unit 204, configured to extract abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
Illustratively, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the pre-processing unit 201, the feature extraction unit 202, the decoding unit 203 and the abstract extraction unit 204 shown in Fig. 2; for the specific functions of the units, see Embodiment Two.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that the schematic diagram of Fig. 3 is only an example of the computer device 1 and does not constitute a limitation of the computer device 1: it may include more or fewer components than shown, combine certain components, or have different components; for example, the computer device 1 may also include input/output devices, network access devices and buses.
The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the entire computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 realizes the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by invoking data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function (for example a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the computer device 1 (for example audio data, a phone book, etc.). In addition, the memory 20 may include a high-speed random access memory and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are realized in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may realize all or part of the processes in the above method embodiments through a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can realize the steps of each of the above method embodiments. The computer program includes computer program code, and the computer program code may be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be realized in other ways. For example, the computer device embodiment described above is only schematic; the division of the units is only a logical functional division, and there may be other division manners in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit. The above integrated unit may be realized in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from every point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the invention is defined by the appended claims rather than by the above description; all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by the same unit or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.
Claims (10)
1. A speech processing method, characterized in that the method comprises:
pre-processing a speech signal;
extracting characteristic parameters from the pre-processed speech signal;
decoding the speech signal with a pre-trained speech recognition model according to the characteristic parameters to obtain text in units of sentences;
extracting abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
2. The method according to claim 1, characterized in that extracting abstract sentences from the text in units of sentences by the hidden Markov model (HMM) specifically comprises:
obtaining the observation state sequence O = {O1, O2, ..., On} of the text in units of sentences;
determining the hidden states of the HMM;
performing HMM parameter estimation to obtain a trained HMM;
labeling the sentences by the Viterbi algorithm according to the trained HMM, obtaining the degree to which each sentence conforms to an abstract sentence;
extracting the sentences that meet a preset degree of conformity from the text in units of sentences, obtaining the abstract sentences in the text in units of sentences.
3. The method according to claim 1, characterized in that pre-processing the speech signal includes detecting the effective speech in the speech signal, specifically comprising:
performing windowed framing on the speech signal to obtain the speech frames of the speech signal;
performing a discrete Fourier transform on each speech frame to obtain the spectrum of the speech frame;
calculating the accumulated energy of each frequency band according to the spectrum of the speech frame;
taking the logarithm of the accumulated energy of each frequency band to obtain the accumulated energy logarithm of each frequency band;
comparing the accumulated energy logarithm of each frequency band with a preset threshold to obtain the effective speech.
4. The method according to claim 1, characterized in that the characteristic parameters include initial mel-frequency cepstral coefficient (MFCC) characteristic parameters, first-order difference MFCC characteristic parameters and second-order difference MFCC characteristic parameters.
5. The method according to claim 1, characterized in that the method further comprises:
performing dimension reduction on the characteristic parameters to obtain the characteristic parameters after dimension reduction.
6. The method according to claim 1, characterized in that extracting characteristic parameters from the pre-processed speech signal includes extracting mel-frequency cepstral coefficient (MFCC) characteristic parameters from the pre-processed speech signal, specifically comprising:
using the cutoff-frequency mapping equation of a bilinear-transform low-pass filter, calculating a frequency warping factor that aligns the third formants of different speakers;
according to the frequency warping factor, adjusting the positions and widths of the triangular filter bank used in MFCC characteristic parameter extraction by the bilinear transform;
calculating the vocal-tract-normalized MFCC characteristic parameters according to the adjusted triangular filter bank.
7. The method according to claim 1, characterized in that extracting characteristic parameters from the pre-processed speech signal includes extracting mel-frequency cepstral coefficient (MFCC) characteristic parameters from the pre-processed speech signal, specifically comprising:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the spectral amplitude of the speech frame to obtain the power spectrum of the speech frame;
passing the power spectrum of the speech frame through a triangular filter bank evenly distributed on the mel frequency scale to obtain the output of each triangular filter;
taking the logarithm of the outputs of all the triangular filters to obtain the log power spectrum of the speech frame;
performing a discrete cosine transform on the log power spectrum to obtain the initial MFCC characteristic parameters of the speech frame.
8. A speech processing apparatus, characterized in that the apparatus comprises:
a pre-processing unit, configured to pre-process a speech signal;
a feature extraction unit, configured to extract characteristic parameters from the pre-processed speech signal;
a decoding unit, configured to decode the speech signal with a pre-trained speech recognition model according to the characteristic parameters to obtain text in units of sentences;
an abstract extraction unit, configured to extract abstract sentences from the text in units of sentences by a hidden Markov model (HMM).
9. A computer device, characterized in that the computer device comprises a processor, and the processor is configured to execute a computer program stored in a memory to realize the speech processing method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, realizes the speech processing method according to any one of claims 1-7.
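To make the front-end computations recited in claims 3 and 7 concrete, the following Python sketch implements a band-energy detection in the style of claim 3 and the DFT, power spectrum, mel triangular filter bank, logarithm and DCT chain of claim 7; the frame length, hop, band count, threshold value and decision rule are illustrative assumptions not given by the claims.

```python
import numpy as np
from scipy.fftpack import dct

def frame_signal(x, frame_len=400, hop=160):
    """Windowed framing of a 1-D signal with a Hamming window."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame_len)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly distributed on the mel frequency scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(0.0, mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft // 2) * edges / (sr / 2.0)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def band_energy_vad(frames, n_fft=512, n_bands=4, threshold=-6.0):
    """Claim-3-style detection: accumulated log-energy per frequency band
    against a preset threshold (the decision rule here is an assumption)."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    bands = np.array_split(spec, n_bands, axis=1)
    log_e = np.stack([np.log(b.sum(axis=1) + 1e-12) for b in bands], axis=1)
    return (log_e > threshold).any(axis=1)

def mfcc(frames, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """Claim-7 chain: DFT -> power spectrum -> mel filter bank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-12)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Toy usage on one second of random audio.
x = np.random.randn(16000)
frames = frame_signal(x)
feats = mfcc(frames[band_energy_vad(frames)])       # initial MFCC parameters
```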
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810897646.2A CN109036381A (en) | 2018-08-08 | 2018-08-08 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
PCT/CN2018/108190 WO2020029404A1 (en) | 2018-08-08 | 2018-09-28 | Speech processing method and device, computer device and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109036381A (en) | 2018-12-18 |
Family
ID=64632382
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109036381A (en) |
WO (1) | WO2020029404A1 (en) |
Also Published As
Publication number | Publication date |
---|---|
WO2020029404A1 (en) | 2020-02-13 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218