CN108198546B

CN108198546B - Voice signal preprocessing method based on cochlear nonlinear dynamics mechanism

Info

Publication number: CN108198546B
Application number: CN201711469953.2A
Authority: CN
Inventors: 龙长才; 闫冰岩; 沈涛; 张�杰
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2020-05-19
Anticipated expiration: 2037-12-29
Also published as: CN108198546A

Abstract

The invention discloses a speech signal preprocessing method based on the nonlinear dynamics mechanism of the cochlea, comprising: (1) establishing a nonlinear dynamical model of the cochlea; (2) constructing a nonlinear cochlear array according to the model, the array being a group of n active simulation modules with different natural frequencies, each of which operates on the received input speech signal according to the cochlear nonlinear dynamical model to produce its real-time response output signal; and (3) processing the real-time response output signals of the active simulation modules to obtain a speech preprocessing signal. By preprocessing the speech signal with a nonlinear cochlear array built on the cochlear nonlinear dynamical model instead of a traditional passive filter bank, the invention amplifies periodic or quasi-periodic speech signals and brings out combination tones related to pitch, thereby improving the noise robustness and feature analysis capability of speech processing.

Description

Voice signal preprocessing method based on cochlear nonlinear dynamics mechanism

Technical Field

The invention belongs to the technical field of signal processing, and particularly relates to a voice signal preprocessing method for voice recognition.

Background

Speech signal processing is central to modern information processing: it is what allows speech recognition and human-machine speech communication to be carried out by computer. With the development of artificial intelligence, computer speech recognition has reached a high level, but a gap remains compared with humans. The main problem is that machine speech recognition in real scenes is susceptible to interference from ambient noise and other sound sources.

Machine speech signal processing mainly comprises three stages: speech signal preprocessing, speech feature extraction, and speech recognition based on the extracted features. Artificial intelligence techniques such as neural networks and deep learning are used at the back end, where speech is recognized from the extracted features. The front-end preprocessing and feature extraction have traditionally been implemented with mathematical signal processing methods such as Fourier analysis and the wavelet transform. To further improve machine speech recognition, auditory signal processing mechanisms are increasingly being used. Existing hearing-based methods use banks of band-pass filters with different center frequencies to simulate the frequency analysis function of the cochlea, most commonly with a gammatone function as the filter impulse response. These methods improve speech signal processing to some extent, but the existing cochlear filter model is linear and therefore differs greatly from the real cochlea.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a speech signal preprocessing method based on a cochlear nonlinear dynamics mechanism, with the aim of further improving the feature analysis and noise robustness of existing machine speech signal processing.

The invention provides a voice signal preprocessing method based on a cochlear nonlinear dynamics mechanism, which comprises the following steps:

(1) establishing a cochlear nonlinear dynamical model;

(2) constructing a nonlinear cochlear array according to a cochlear nonlinear dynamical model;

the nonlinear cochlear array is a group of n active simulation modules with different natural frequencies, and each active simulation module operates on the received input speech signal according to the cochlear nonlinear dynamical model to obtain its real-time response output signal;

(3) processing the real-time response output signals of the active simulation modules to obtain voice preprocessing signals;

wherein n is the number of the active simulation modules, and n is an integer greater than or equal to 1.

Further, the nonlinear dynamical model of cochlea is:

wherein x is the displacement of the basilar membrane from its equilibrium position, t is time, γ is the damping coefficient, γ_α is the adaptive force coefficient, B is the electrostrictive coefficient of the outer hair cells, x₀ is the original length of the outer hair cells, ω_i is the natural circular frequency of the cochlea, S(t) is the input speech signal, x_i(t) is the real-time response output signal of the i-th active simulation module, and i = 1, 2, 3, ..., n is the serial number of the active simulation module.

Further, the adaptive force coefficient γ_α in the cochlear nonlinear dynamical model should satisfy 0 < γ_α ≤ γ; within this range, the larger γ_α is, the greater the amplification that the active simulation module applies to speech signals near its natural frequency.

Further, the natural frequencies of the n active simulation modules can be set as follows: for a nonlinear cochlear array with a natural frequency range from a Hz to a·e^(ε(n-1)) Hz (ε < 1), the natural frequency of the i-th active simulation module is f_i = a·e^(ε(i-1)) Hz, where i = 1, 2, 3, ..., n is the serial number of the active simulation module and 20 Hz ≤ a ≤ 200 Hz.
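The frequency assignment f_i = a·e^(ε(i-1)) Hz can be sketched in a few lines of Python. This is only an illustration: the function name `natural_frequencies` and the example values a = 100 Hz, ε = 0.05 and n = 64 are assumptions chosen for the sketch and do not come from the patent.

```python
import math


def natural_frequencies(a, eps, n):
    """Return [f_1, ..., f_n] in Hz, with f_i = a * exp(eps * (i - 1))."""
    if n < 1:
        raise ValueError("n must be an integer >= 1")
    if not 20.0 <= a <= 200.0:
        raise ValueError("a must satisfy 20 Hz <= a <= 200 Hz")
    if not eps < 1.0:
        raise ValueError("eps must be < 1")
    return [a * math.exp(eps * (i - 1)) for i in range(1, n + 1)]


# Illustrative values only (not taken from the patent).
freqs = natural_frequencies(a=100.0, eps=0.05, n=64)
ratio = freqs[1] / freqs[0]  # constant ratio e^eps between neighbouring modules
```

Because neighbouring modules differ by the constant factor e^ε, the natural frequencies are spaced evenly on a logarithmic frequency axis.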

Further, in step (3), the real-time response output signal x_i(t) is averaged by energy to obtain a speech frame signal of duration T:

wherein x_i(t) is the real-time response output signal of the i-th active simulation module, y_i(T) is the preprocessed signal, t is time, and T is the duration of a speech frame.
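The averaging formula itself is not reproduced in this text, so the following Python sketch rests on an assumption: that "averaging by energy" means taking the mean-square value of x_i(t) over each frame of duration T, applied here to a sampled signal with non-overlapping frames. The function name `frame_energy`, the sampling rate and the frame length are likewise illustrative.

```python
import math


def frame_energy(x, frame_len):
    """Mean-square value of x over consecutive non-overlapping frames.

    Assumption: one output value per frame, y = (1/N) * sum(x[k]**2),
    with N = frame_len samples standing in for the frame duration T.
    """
    if frame_len < 1:
        raise ValueError("frame_len must be >= 1")
    n_frames = len(x) // frame_len
    out = []
    for k in range(n_frames):
        frame = x[k * frame_len:(k + 1) * frame_len]
        out.append(sum(v * v for v in frame) / frame_len)
    return out


fs = 8000  # sampling rate in Hz (illustrative)
x = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(fs)]  # 1 s, 100 Hz
y = frame_energy(x, frame_len=160)  # 20 ms frames, two full periods each
```

For a unit-amplitude sine whose period divides the frame length, each frame value is 0.5, the familiar mean-square value of a sinusoid.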

Compared with the prior art, the technical scheme of the invention preprocesses the speech signal with a nonlinear cochlear array built on the cochlear nonlinear dynamical model instead of a traditional passive filter bank. As a result, periodic or quasi-periodic speech signals are amplified and combination tones related to pitch appear in the preprocessed signal, which improves the noise robustness and feature analysis capability of speech processing.

Drawings

FIG. 1 is a technical block diagram of the new speech signal processing implemented with the cochlear nonlinear dynamical model.

FIG. 2(a) is the response spectrum of the nonlinear cochlear model; FIG. 2(b) is the response spectrum from a cochlear physiological experiment.

FIG. 3(a) is the spectrogram of a speech segment after noise is added; FIG. 3(b) is the spectrogram after processing by the active simulation module.

FIG. 4 is a graph comparing frequency response characteristics of an active simulation module and a passive system. In the figure, the solid line is the frequency response characteristic line of the active simulation module, and the dotted line is the frequency response characteristic line of the passive system.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The cochlea is a nonlinear signal processing system with characteristics such as two-tone suppression and the generation of combination tones, and these characteristics play an important role in signal processing. A combination tone arises when the cochlea is excited by sound at two frequencies f₁ and f₂: frequency components pf₁ + mf₂ (p and m integers) that are not present in the excitation signal appear in the response. One of these is the difference-frequency component f₁ - f₂. Because of it, a complex tone that lacks its fundamental but contains higher harmonics produces the corresponding fundamental frequency on passing through the cochlea, so the pitch of the absent fundamental is still perceived. Through the same nonlinearity, the cochlea also allows the pitch of quasi-periodic signals to be perceived. The nonlinear characteristics of the cochlea can be described by a nonlinear dynamical equation. The invention designs an active simulation module that carries out the mathematical operation defined by this equation and, by assigning different values to the parameter ω_i in the equation (the natural circular frequency of each active simulation module), builds a nonlinear cochlear array of n active simulation modules with different natural frequencies for preprocessing speech. Compared with the conventional sound processing strategy based on a band-pass filter bank, a speech signal processed by the nonlinear cochlear array reflects several nonlinear effects of the cochlear sound processing mechanism that resemble auditory processing results, such as nonlinear tuning, multi-tone distortion and two-tone suppression. In particular, it exhibits combination tones related to pitch and an active amplification of periodic or quasi-periodic speech signals close to the natural frequency of an active simulation module, so that speech processing gains better feature analysis and noise robustness than with the traditional method.
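As a small worked example of the combination frequencies pf₁ + mf₂, the Python sketch below lists the low-order combinations for two harmonics of a missing 200 Hz fundamental. It only enumerates frequencies and says nothing about their amplitudes; the function name `combination_tones`, the `max_order` cut-off and the choice to exclude pure harmonics (p = 0 or m = 0) are assumptions made for illustration.

```python
def combination_tones(f1, f2, max_order=3):
    """Sorted positive frequencies |p*f1 + m*f2| for non-zero integers p, m
    with |p| + |m| <= max_order (pure harmonics of f1 or f2 are excluded)."""
    tones = set()
    for p in range(-max_order, max_order + 1):
        for m in range(-max_order, max_order + 1):
            if p == 0 or m == 0 or abs(p) + abs(m) > max_order:
                continue
            f = abs(p * f1 + m * f2)
            if f > 0:
                tones.add(f)
    return sorted(tones)


# Third and second harmonics of a 200 Hz fundamental that is itself absent.
tones = combination_tones(600.0, 400.0, max_order=3)
# tones == [200.0, 800.0, 1000.0, 1400.0, 1600.0]
```

The difference frequency 600 - 400 = 200 Hz is the missing fundamental, which is the component the text associates with the perceived pitch.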

To improve the speech recognition and noise robustness of existing machine speech signal processing in real environments, the invention provides a speech signal preprocessing method based on a cochlear nonlinear dynamical model. The method preprocesses speech signals with a nonlinear cochlear array built on the cochlear nonlinear dynamical model instead of a traditional passive filter bank. Analysis shows that the simulation results of the nonlinear cochlear array agree closely with physiological measurements on the cochlear basilar membrane and with psychoacoustic experiments; in particular, several nonlinear effects of the cochlear sound processing mechanism, such as nonlinear tuning, multi-tone distortion and two-tone suppression, are simulated well. Processing a speech signal with this method brings out combination tones related to pitch, enhances the periodic signal characteristics of speech, and makes the speech signal stand out from noise, thereby improving speech recognizability.

The invention adopts the following specific technical scheme:

(1) establishing a cochlea nonlinear dynamics model:

Taking cochlear basilar membrane dynamics as the basis, a force analysis is performed on a local section of the cochlea. During sound conduction, a local section of the basilar membrane is subjected to the external force F_s(t) caused by the acoustic stimulus, the elastic restoring force F_T of the basilar membrane itself, the resistance from the lymph fluid and from the membrane itself, and the nonlinear adaptive force F_a regulated by outer hair cell electrostriction and ciliary movement, whose simplified expression is as follows:

The cochlear nonlinear dynamical model established according to Newton's laws of mechanics is as follows:

wherein x is the displacement of the basilar membrane from its equilibrium position, γ is the damping coefficient, γ_α is the adaptive force coefficient, B is the electrostrictive coefficient of the outer hair cells, x₀ is the original length of the outer hair cells, ω_i is the natural circular frequency of the cochlea in this region, and S(t) is the input signal. Solving the nonlinear equation gives the real-time response output x_i(t) of the cochlear basilar membrane.

(2) A speech signal preprocessing method incorporating the cochlear nonlinear dynamical model:

As shown in FIG. 1, the new speech signal preprocessing method constructs a nonlinear cochlear array according to the cochlear nonlinear dynamical model to simulate how the cochlea processes sound. The nonlinear cochlear array is a nonlinear simulation array formed by a group of n active simulation modules, each implementing the cochlear nonlinear dynamical model with a different natural frequency. With the input speech signal S(t), solving the equation gives the real-time response output x_i(t) of each processing channel. Each channel output x_i(t) is then averaged by energy to obtain the speech preprocessing signal y_i(T).

It should be noted that the active simulation module should be designed so that the adaptive force coefficient γ_α lies in the range 0 < γ_α ≤ γ. When γ_α = 0, the adaptive force is 0 and the system becomes a passive system; when γ_α = γ, the system eventually enters self-sustained oscillation. Within the above range, the larger γ_α is, the larger the maximum adaptive force, the smaller the effective damping of the active simulation module, and the greater the response amplitude to a speech signal near the natural frequency of the module. The frequency response curve of the active simulation module and that of a passive system are compared in FIG. 4, which shows that the active simulation module amplifies speech signals near its natural frequency more strongly.
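The link between effective damping and response amplitude can be illustrated with the textbook steady-state response of a linear driven oscillator, |X| = A / sqrt((ω₀² - ω²)² + (gω)²). The Python sketch below is not the nonlinear model of the invention, whose equation is not reproduced in this text. It assumes, for small signals only, that the adaptive force reduces the damping from γ to g = γ - γ_α. The numerical values of `gamma` and `gamma_a` are illustrative; the 140 Hz natural frequency is the one used in FIG. 4.

```python
import math


def response_amplitude(f, f0, damping, drive=1.0):
    """Steady-state amplitude of x'' + damping*x' + w0^2 * x = drive*cos(w*t)."""
    if damping <= 0.0:
        raise ValueError("damping must be positive (gamma_a < gamma in this sketch)")
    w, w0 = 2.0 * math.pi * f, 2.0 * math.pi * f0
    return drive / math.sqrt((w0 ** 2 - w ** 2) ** 2 + (damping * w) ** 2)


f0 = 140.0       # natural frequency in Hz, as in FIG. 4
gamma = 200.0    # damping coefficient in 1/s (illustrative assumption)
gamma_a = 150.0  # adaptive force coefficient, 0 < gamma_a < gamma (assumption)

# Passive system: gamma_a = 0. Active module: assumed effective damping gamma - gamma_a.
gain_at_f0 = (response_amplitude(f0, f0, gamma - gamma_a)
              / response_amplitude(f0, f0, gamma))      # about gamma / (gamma - gamma_a) = 4
gain_far = (response_amplitude(300.0, f0, gamma - gamma_a)
            / response_amplitude(300.0, f0, gamma))     # close to 1 away from f0
```

Under this simplification the extra gain is concentrated near the natural frequency and nearly vanishes far from it, which matches the qualitative description of FIG. 4. The linear formula breaks down as γ_α approaches γ, where the text states that the system enters self-sustained oscillation.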

The present invention will be described in further detail with reference to the accompanying drawings and specific examples.

FIG. 1 is a block diagram of the new speech signal processing technique implemented with the cochlear nonlinear dynamical model. The specific strategy is as follows: n frequency band channels are designed by frequency, each containing an active simulation module with a different natural frequency, and together these form the nonlinear cochlear array. The sound signal S(t) recorded by the microphone is processed by the different simulation modules to give the outputs x_i(t), which are then averaged by energy to give y_i(T), the preprocessed signal. The preprocessed signal obtained in this way reflects combination tone information that is related to auditory pitch and consistent with cochlear processing results, and it enhances the periodic signal characteristics of speech, so that speech features stand out from noise and speech recognizability improves.

FIG. 2 shows the multi-tone distortion effect. As the response spectrum of the active simulation module in FIG. 2(a) shows, the response spectrum of the nonlinear cochlear array contains distortion products that do not occur in a conventional passive filter system, namely combination tones. FIG. 2(b) is the response measured on the cochlear basilar membrane in an actual physiological experiment. The comparison shows that the active simulation module simulates the multi-tone distortion effect of the cochlea well: combination tone information related to speech pitch appears in the speech signal processed by the module, and the fundamental-frequency component of the speech is enhanced, which is the basis on which this strategy improves speech features.

FIG. 3(a) shows the Fourier spectrum analysis of a speech segment after noise is added; the speech features are almost drowned out by the noise. FIG. 3(b) shows the Fourier analysis of the same speech after processing with the model constructed here; the speech signal features now stand out clearly from the noise.

FIG. 4 is a graph comparing frequency response characteristics of an active simulation module and a passive system. The horizontal axis of the image is the sound frequency, and the vertical axis is the response amplitude of the system to the sounds with different frequencies. In the figure, the natural frequencies of the active simulation module and the passive system are both 140Hz, the solid line is the frequency response characteristic line of the active simulation module, and the dotted line is the frequency response characteristic line of the passive system. Compared with a passive system, the active simulation module has larger response amplitude to the voice signals near the natural frequency, and the active amplification effect of the active simulation module on the voice signals is embodied.

(1) establishing a cochlear nonlinear dynamics model

(2) constructing a nonlinear cochlear array according to the cochlear nonlinear dynamical model, wherein the nonlinear cochlear array is a group of n active simulation modules with different natural frequencies that perform the corresponding mathematical operation according to the nonlinear dynamical model. The natural frequencies of the active simulation modules can be set as follows: for example, for a nonlinear cochlear array with a natural frequency range from a Hz to a·e^(ε(n-1)) Hz (generally 20 Hz ≤ a ≤ 200 Hz), the natural frequency of the i-th active simulation module is f_i = a·e^(ε(i-1)) Hz (i = 1, 2, 3, ..., n). The active simulation module should be designed so that the adaptive force coefficient γ_α lies in the range 0 < γ_α ≤ γ; within this range, the larger γ_α is, the greater the amplification that the active simulation module applies to speech signals near its natural frequency;

with the input speech signal S(t), solving the equation gives the real-time response output x_i(t) of each active simulation module;

(3) processing the output signal x_i(t) of each channel to obtain the speech preprocessing signal,

wherein the channel output signals x_i(t) can be processed by energy averaging to obtain speech frame signals of duration T for the subsequent speech processing.
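The data flow of steps (2) and (3) can be sketched end to end in Python under stated assumptions. The nonlinear equation of the invention is not reproduced in this text, so each active simulation module is replaced here by a linear damped oscillator with reduced effective damping γ - γ_α. This stand-in shows the array structure and the amplification near the natural frequency, but it cannot produce combination tones. Mean-square averaging per frame is also an assumption, and all names and numerical values are illustrative.

```python
import math


def simulate_module(signal, fs, f0, damping):
    """Integrate x'' + damping*x' + (2*pi*f0)^2 * x = S(t) by semi-implicit Euler.

    Linear stand-in for one active simulation module (assumption).
    """
    dt = 1.0 / fs
    w0_sq = (2.0 * math.pi * f0) ** 2
    x, v = 0.0, 0.0
    out = []
    for s in signal:
        v += dt * (s - damping * v - w0_sq * x)  # update velocity first
        x += dt * v                              # then position with the new velocity
        out.append(x)
    return out


def frame_energy(x, frame_len):
    """Mean-square value per non-overlapping frame (assumed energy average)."""
    n_frames = len(x) // frame_len
    return [sum(u * u for u in x[k * frame_len:(k + 1) * frame_len]) / frame_len
            for k in range(n_frames)]


def preprocess(signal, fs, a, eps, n, gamma, gamma_a, frame_len):
    """Return (natural frequencies, per-channel frame energies y_i)."""
    freqs = [a * math.exp(eps * (i - 1)) for i in range(1, n + 1)]
    g_eff = gamma - gamma_a  # assumed small-signal effective damping
    y = [frame_energy(simulate_module(signal, fs, f, g_eff), frame_len)
         for f in freqs]
    return freqs, y


fs = 16000
speech = [math.sin(2.0 * math.pi * 140.0 * k / fs) for k in range(fs // 2)]  # 0.5 s tone
freqs, y = preprocess(speech, fs, a=100.0, eps=math.log(1.4), n=3,
                      gamma=200.0, gamma_a=150.0, frame_len=320)
last = [channel[-1] for channel in y]  # energy of the final 20 ms frame per channel
best = last.index(max(last))           # channel with the strongest response
```

With these values the three channels sit at about 100 Hz, 140 Hz and 196 Hz, and the 140 Hz test tone gives the largest frame energy in the channel tuned to it.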

The nonlinear cochlear array actively amplifies periodic or quasi-periodic speech signals, so that the desired speech signal stands out from noise. At the same time, the nonlinear cochlear array simulates the multi-tone distortion effect of the cochlea well, so that the preprocessed signal exhibits combination tones related to pitch, speech features are emphasized, and speech recognizability improves.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A speech signal preprocessing method based on a cochlear nonlinear dynamics mechanism is characterized by comprising the following steps:

(1) establishing a cochlear nonlinear dynamical model, wherein the cochlear nonlinear dynamical model is:

(2) constructing a nonlinear cochlear array according to the cochlear nonlinear dynamical model, wherein the nonlinear cochlear array is a group of n active simulation modules with different natural frequencies, and each active simulation module operates on the received input speech signal according to the cochlear nonlinear dynamical model to obtain its real-time response output signal;

in step (3), the real-time response output signal x_i(t) is averaged by energy to obtain a speech frame signal of duration T:

wherein x is the displacement of the basilar membrane from its equilibrium position; t is time; γ is the damping coefficient; γ_α is the adaptive force coefficient, which in the cochlear nonlinear dynamical model should satisfy 0 < γ_α ≤ γ, and within this range the larger γ_α is, the greater the amplification that the active simulation module applies to speech signals near its natural frequency; B is the electrostrictive coefficient of the outer hair cells; x₀ is the original length of the outer hair cells; ω_i is the natural circular frequency of a given part of the cochlea; S(t) is the input speech signal; x_i(t) is the real-time response output signal of the i-th active simulation module; i = 1, 2, 3, ..., n is the serial number of the active simulation module; y_i(T) is the preprocessed signal; T is the duration of a speech frame; and n is the number of active simulation modules, an integer greater than or equal to 1.

2. The speech signal preprocessing method of claim 1 wherein the natural frequencies of the n active simulation modules are set as follows:

for a nonlinear cochlear array with a natural frequency range from a Hz to a·e^(ε(n-1)) Hz, the natural frequency of the i-th active simulation module is f_i = a·e^(ε(i-1)) Hz, where ε < 1.

3. The speech signal preprocessing method of claim 2 wherein 20 Hz ≤ a ≤ 200 Hz.